lexedata.edit package

Submodules

lexedata.edit.add_central_concepts module

lexedata.edit.add_central_concepts.add_central_concepts_to_cognateset_table(dataset: ~pycldf.dataset.Dataset, add_column: bool = True, overwrite_existing: bool = True, logger: ~logging.Logger = <Logger lexedata (INFO)>, status_update: ~typing.Optional = None) Dataset
lexedata.edit.add_central_concepts.central_concept(concepts: Counter[str], concepts_to_concepticon: Mapping[str, int], clics: Optional[Graph])

Find the most central concept among a weighted set.

If the concepts are not linked to CLICS through Concepticon references, we can only take the simple majority among the given concepts.

>>> concepts = {"woman": 3, "mother": 4, "aunt": 3}
>>> central_concept(concepts, {}, None)
'mother'

However, if the concepts can be linked to the CLICS graph, centrality actually can be defined using that graph.

>>> concepticon_mapping = {"arm": 1673, "five": 493, "hand": 1277, "leaf": 628}
>>> central_concept(
...   collections.Counter(["arm", "hand", "five", "leaf"]),
...   concepticon_mapping,
...   load_clics()
... )
'hand'

When counts and concepticon references are both given, the value with the maximum product of CLICS centrality and count is returned. If the concepts do not form a connected subgraph in CLICS (eg. ‘five’, ‘hand’, ‘lower arm’, ‘palm of hand’ – with no attested form meaning ‘arm’ to link them), only centrality within the disjoint subgraphs is considered, so in this example, ‘hand’ would be considered the most central concept.

lexedata.edit.add_central_concepts.concepts_to_concepticon(dataset: Wordlist) Mapping[str, int]
lexedata.edit.add_central_concepts.connected_concepts(dataset: Wordlist) Mapping[str, Counter[str]]

For each cognate set in the dataset, check which concepts it is connected to.

lexedata.edit.add_central_concepts.load_concepts_by_form(dataset: Dataset) Dict[str, Sequence[str]]

Look up all concepts for each form, and return them as a dictionary.

lexedata.edit.add_central_concepts.reshape_dataset(dataset: Wordlist, add_column: bool = True) Dataset

lexedata.edit.add_cognate_table module

Add a CognateTable to the dataset.

If the dataset has a CognateTable, do nothing. If the dataset has no cognatesetReference column anywhere, add an empty CognateTable. If the dataset has a cognatesetReference in the FormTable, extract that to a separate cognateTable, also transferring alignments if they exist. If the dataset has a cognatesetReference anywhere else, admit you don’t know what is going on and die.

lexedata.edit.add_cognate_table.add_cognate_table(dataset: ~pycldf.dataset.Wordlist, split: bool = True, logger: ~logging.Logger = <Logger lexedata (INFO)>) int

Add a cognate (judgment) table.

Parameters:
  • split (bool) – Make sure that the same raw cognate code in different concepts gives rise to different cognate set ids, because raw cognate codes are not globally unique, only within each concept.

Returns:

The number of partial cognate judgements in the new cognate table
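
A minimal usage sketch (the metadata file name is only an example; pycldf.Wordlist.from_metadata is the standard pycldf loader):

    import pycldf
    from lexedata.edit.add_cognate_table import add_cognate_table

    # Load a CLDF Wordlist and extract its cognatesetReference column
    # into a proper CognateTable.
    dataset = pycldf.Wordlist.from_metadata("Wordlist-metadata.json")
    n_judgements = add_cognate_table(dataset, split=True)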

lexedata.edit.add_cognate_table.morphemes(segments: Sequence[S], markers: Container[S] = {'+', '_'}) Tuple[Sequence[range], Sequence[Sequence[S]]]

Split a list of segments at the markers.

Return the segments without markers, and the sequence of morphemes (groups of segments between markers) in terms of indices into the segments result.

>>> morphemes("test")
(['t', 'e', 's', 't'], [range(0, 4)])
>>> morphemes("test+ing")
(['t', 'e', 's', 't', 'i', 'n', 'g'], [range(0, 4), range(4, 7)])
>>> s, r = morphemes(["th"]+list("is is a test+ing example"), {" ", "+"})
>>> s[:3]
['th', 'i', 's']
>>> r
[range(0, 3), range(3, 5), range(5, 6), range(6, 10), range(10, 13), range(13, 20)]
lexedata.edit.add_cognate_table.split_at_markers(segments: Sequence[S], markers: Container[S] = {'+', '_'}) Tuple[Sequence[range], Sequence[Sequence[S]]]

Split a list of segments at the markers.

Return the sequence of morphemes (groups of segments between markers), with the markers removed.

>>> split_at_markers("test")
[['t', 'e', 's', 't']]
>>> split_at_markers("test+ing")
[['t', 'e', 's', 't'], ['i', 'n', 'g']]
>>> split_at_markers(["th"]+list("is is a test+i")+["ng"]+list(" example"), {" ", "+"})
[['th', 'i', 's'], ['i', 's'], ['a'], ['t', 'e', 's', 't'], ['i', 'ng'], ['e', 'x', 'a', 'm', 'p', 'l', 'e']]

lexedata.edit.add_concepticon module

Guess which Concepticon concepts the entries in the ParameterTable refer to.

The full list of available gloss languages uses the ISO 639-1 two-letter codes and can be found on https://github.com/concepticon/concepticon-data/tree/master/mappings (or in the mappings/ folder of your local Concepticon catalog installation).

Fill the Concepticon_ID (or generally, #concepticonReference) column of the dataset’s ParameterTable with best guesses for Concepticon IDs, based on gloss columns in potentially different languages.

lexedata.edit.add_concepticon.add_concepticon_definitions(dataset: ~pycldf.dataset.Dataset, column_name: str = 'Concepticon_Definition', logger: ~logging.Logger = <Logger lexedata (INFO)>) None
lexedata.edit.add_concepticon.add_concepticon_names(dataset: Wordlist, column_name: str = 'Concepticon_Gloss')
lexedata.edit.add_concepticon.add_concepticon_references(dataset: Wordlist, gloss_languages: Mapping[str, str], status_update: Optional[str], overwrite: bool = False) None

Guess Concepticon links for a multilingual Concept table.

Fill the concepticonReference column of the dataset’s ParameterTable with best guesses for Concepticon IDs, based on gloss columns in different languages.

Parameters:
  • dataset – A pycldf.Wordlist with a concepticonReference column in its ParameterTable

  • gloss_languages – A mapping from ParameterTable column names to ISO-639-1 language codes that Concepticon has concept lists for (eg. en, fr, de, es, zh, pt)

  • status_update – String written to the Status_Column of the #parameterTable, if provided

  • overwrite – Overwrite existing Concepticon references
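
A hedged usage sketch; the gloss column names (“English”, “Spanish”) are hypothetical and need to match columns of your ParameterTable:

    from lexedata.edit.add_concepticon import add_concepticon_references

    # Hypothetical gloss columns mapped to ISO-639-1 codes known to Concepticon
    add_concepticon_references(
        dataset,
        gloss_languages={"English": "en", "Spanish": "es"},
        status_update=None,
        overwrite=False,
    )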

lexedata.edit.add_concepticon.create_concepticon_for_concepts(dataset: Dataset, language: Sequence[Tuple[str, str]], concepticon_glosses: bool, concepticon_definition: bool, overwrite: bool, status_update: Optional[str])
lexedata.edit.add_concepticon.equal_separated(option: str) Tuple[str, str]

lexedata.edit.add_metadata module

Add a metadata.json file, automatically derived from a forms.csv file. Lexedata can guess metadata for a number of columns, including, but not limited to, all default CLDF properties (e.g. Language, Form) and CLDF reference properties (e.g. parameterReference). We recommend that you check the generated metadata file and adjust it if necessary.
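
Lexedata scripts are run as Python modules; an invocation sketch (from the directory containing forms.csv; see the script’s --help for available options):

    python -m lexedata.edit.add_metadata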

lexedata.edit.add_segments module

Segment the form.

Take a form, in a phonemic transcription compatible with IPA, and split it into phonemic segments, which are written back to the Segments column of the FormTable. Segmentation essentially uses CLTS, including diphthongs and affricates.

For details on the segmentation procedure, see the manual.

class lexedata.edit.add_segments.ReportEntry(count: int = 0, comment: str = '')

Bases: object

comment: str
count: int
class lexedata.edit.add_segments.SegmentReport(sounds: MutableMapping[str, ReportEntry] = _Nothing.NOTHING)

Bases: object

sounds: MutableMapping[str, ReportEntry]
lexedata.edit.add_segments.add_segments_to_dataset(dataset: ~pycldf.dataset.Dataset, transcription: str, overwrite_existing: bool, replace_form: bool, logger: ~logging.Logger = <Logger lexedata (INFO)>)
lexedata.edit.add_segments.cleanup(form: str) str
>>> cleanup("dummy;form")
'dummy'
>>> cleanup("dummy,form")
'dummy'
>>> cleanup("(dummy)")
'dummy'
>>> cleanup("dummy-form")
'dummy+form'
lexedata.edit.add_segments.segment_form(formstring: str, report: ~lexedata.edit.add_segments.SegmentReport, system=<pyclts.transcriptionsystem.TranscriptionSystem object>, split_diphthongs: bool = True, context_for_warnings: str = '', logger: ~logging.Logger = <Logger lexedata (INFO)>) Iterable[Symbol]

Segment the form.

First, apply some pre-processing replacements. Forms supplied contain all sorts of noise and lookalike symbols. This function comes with reasonable defaults, but if you encounter other problems, or you actually want to be strict about IPA transcriptions, pass a dictionary of your choice as pre_replace.

Then, naïvely segment the form using the IPA tokenizer from the segments package. Check each returned segment to see whether it is valid according to CLTS’s BIPA, and if not, try to fix some issues (in particular pre-aspirated or pre-nasalized consonants showing up as post-aspirated resp. post-nasalized vowels, which BIPA does not accept).

>>> [str(x) for x in segment_form("iɾũndɨ", report=SegmentReport())]
['i', 'ɾ', 'ũ', 'n', 'd', 'ɨ']
>>> [str(x) for x in segment_form("mokõi", report=SegmentReport())]
['m', 'o', 'k', 'õ', 'i']
>>> segment_form("pan̥onoót͡síkoːʔú", report=SegmentReport())  
[<pyclts.models.Consonant: voiceless bilabial stop consonant>, <pyclts.models.Vowel: unrounded open front vowel>, <pyclts.models.Consonant: devoiced voiced alveolar nasal consonant>, <pyclts.models.Vowel: rounded close-mid back vowel>, <pyclts.models.Consonant: voiced alveolar nasal consonant>, <pyclts.models.Vowel: rounded close-mid back vowel>, <pyclts.models.Vowel: rounded close-mid back ... vowel>, <pyclts.models.Consonant: voiceless alveolar sibilant affricate consonant>, <pyclts.models.Vowel: unrounded close front ... vowel>, <pyclts.models.Consonant: voiceless velar stop consonant>, <pyclts.models.Vowel: long rounded close-mid back vowel>, <pyclts.models.Consonant: voiceless glottal stop consonant>, <pyclts.models.Vowel: rounded close back ... vowel>]

lexedata.edit.add_singleton_cognatesets module

Add trivial cognatesets

Make sure that every segment of every form is in at least one cognateset (there can be more than one, eg. for nasalization), by creating singleton cognatesets for streaks of segments not in cognatesets.

lexedata.edit.add_singleton_cognatesets.create_singletons(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], status: ~typing.Optional[str] = None, by_segment: bool = False, logger: ~logging.Logger = <Logger lexedata (INFO)>) Tuple[Sequence[CogSet], Sequence[Judgement]]

Create singleton cognate judgements for forms that don’t have cognate judgements.

Depending on by_segment, singletons are created for every range of segments that is not in any cognate set yet (True) or just for every form where no segment is in any cognate sets (False).
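
A usage sketch (not from the module’s doctests), assuming dataset is an already-loaded lexedata Wordlist:

    from lexedata.edit.add_singleton_cognatesets import create_singletons

    # Create a singleton cognate set for every uncoded stretch of segments;
    # the new cognate sets and judgements are returned for further processing.
    cognatesets, judgements = create_singletons(dataset, by_segment=True)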

lexedata.edit.add_singleton_cognatesets.uncoded_forms(forms: Iterable[Form], judged: Container[Form_ID]) Iterator[Tuple[Form_ID, range]]

Find the uncoded forms, and represent them as segment slices.

>>> list(uncoded_forms([
...   {"id": "f1", "form": "ex", "segments": list("ex")},
...   {"id": "f2", "form": "test", "segments": list("test")},
... ], {"f1"}))
[('f2', range(0, 4))]
lexedata.edit.add_singleton_cognatesets.uncoded_segments(segment_to_cognateset: ~typing.Mapping[~lexedata.types.Form_ID, ~typing.List[~typing.Set[~lexedata.types.Cognateset_ID]]], logger: ~logging.Logger = <Logger lexedata (INFO)>) Iterator[Tuple[Form_ID, range]]

Find the slices of uncoded segments.

>>> list(uncoded_segments({"f1": [{}, {}, {"s1"}, {}]}))
[('f1', range(0, 2)), ('f1', range(3, 4))]

lexedata.edit.add_status_column module

lexedata.edit.add_status_column.add_status_column_to_table(dataset: Dataset, table_name: str) None
lexedata.edit.add_status_column.status_column_to_table_list(dataset: Dataset, tables: List[str]) Dataset
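
A hedged example call; the table name "FormTable" is just one possible target and is assumed to be accepted in this component-name form:

    from lexedata.edit.add_status_column import add_status_column_to_table

    # Add a status column to the FormTable of an already-loaded dataset
    add_status_column_to_table(dataset, "FormTable")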

lexedata.edit.add_table module

Creates an empty table with all the references to that table found in the dataset.

This script can be used to add LanguageTable, CognatesetTable, and ParameterTable (i.e. the table with the concepts). For a CognateTable, see the help of lexedata.edit.add_cognate_table.
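
As with the other scripts, this module is intended to be run directly; a sketch of an invocation adding a LanguageTable (check the script’s --help for the exact argument form):

    python -m lexedata.edit.add_table LanguageTable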

lexedata.edit.align module

Automatically align morphemes within each cognateset.

If possible, align using existing lexstat scorer.

lexedata.edit.align.align(forms)

‘Align’ forms by adding gap characters to the end.

TODO: This is DUMB. Write a function that does this more sensibly, using LexStat scorers where available.

lexedata.edit.align.aligne_cognate_table(dataset: Dataset, status_update: Optional[str] = None)

lexedata.edit.change_id_column module

lexedata.edit.change_id_column.rename(ds, old_values_to_new_values, logger: Logger, status_update: Optional[str])
lexedata.edit.change_id_column.replace_column(dataset: ~pycldf.dataset.Dataset, original: str, replacement: str, column_replace: bool, smush: bool, status_update: ~typing.Optional[str], logger: ~logging.Logger = <Logger lexedata (INFO)>) None
lexedata.edit.change_id_column.substitute_many(row, columns, old_values_to_new_values, status_update: Optional[str])

lexedata.edit.clean_forms module

Clean forms.

Move comma-separated alternative forms to the variants column. Move elements in brackets to the comments if they are separated from the forms by whitespace; strip brackets and move the form with brackets to the variants if there is no whitespace separating it.

This is a rough heuristic, but hopefully it helps with the majority of cases.

exception lexedata.edit.clean_forms.Skip(message)

Bases: Exception

Mark this form to be skipped.

lexedata.edit.clean_forms.clean_forms(table: ~typing.Iterable[~lexedata.edit.clean_forms.R], form_column_name='form', variants_column_name='variants', split_at=[',', ';'], split_at_and_keep=['~'], logger: ~logging.Logger = <Logger lexedata (INFO)>) Iterator[R]

Split all forms that contain separators into form+variants.

>>> for row in clean_forms([
...   {'F': 'a ~ æ', 'V': []},
...   {'F': 'bə-, be-', 'V': ['b-']}],
...   "F", "V"):
...   print(row)
{'F': 'a', 'V': ['~æ']}
{'F': 'bə-', 'V': ['b-', 'be-']}
lexedata.edit.clean_forms.treat_brackets(table: ~typing.Iterable[~lexedata.edit.clean_forms.R], form_column_name='form', variants_column_name='variants', comment_column_name='comment', bracket_pairs=[('(', ')')], logger: ~logging.Logger = <Logger lexedata (INFO)>) Iterator[R]

Make sure forms contain no brackets.

>>> for row in treat_brackets([
...   {'F': 'a(m)ba', 'V': [], 'C': ''},
...   {'F': 'da (dialectal)', 'V': [], 'C': ''},
...   {'F': 'tu(m) (informal)', 'V': [], 'C': '2p'}],
...   "F", "V", "C"):
...   print(row)
{'F': 'amba', 'V': ['aba'], 'C': ''}
{'F': 'da', 'V': [], 'C': '(dialectal)'}
{'F': 'tum', 'V': ['tu'], 'C': '2p; (informal)'}

Skipping works even when it is noticed only late in the process.

>>> for row in treat_brackets([
...   {'F': 'a[m]ba (unbalanced', 'V': [], 'C': ''},
...   {'F': 'tu(m) (informal', 'V': [], 'C': ''}],
...   "F", "V", "C", [("[", "]"), ("(", ")")]):
...   print(row)
{'F': 'a[m]ba (unbalanced', 'V': [], 'C': ''}
{'F': 'tu(m) (informal', 'V': [], 'C': ''}
lexedata.edit.clean_forms.unbracket_single_form(form, opening_bracket, closing_bracket)

Remove a type of brackets from a form.

Return the modified form, the variants (i.e. a list containing the form with brackets), and all comments (bracket parts that were separated from the form by whitespace)

>>> unbracket_single_form("not in here anyway", "(", ")")
('not in here anyway', [], [])
>>> unbracket_single_form("da (dialectal)", "(", ")")
('da', [], ['(dialectal)'])
>>> unbracket_single_form("da(n)", "(", ")")
('dan', ['da'], [])
>>> unbracket_single_form("(n)da(s) (dialectal)", "(", ")")
('ndas', ['da', 'das', 'nda'], ['(dialectal)'])

lexedata.edit.detect_cognates module

Similarity-code tentative cognates in a word list and align them.

class lexedata.edit.detect_cognates.SimpleScoreDict

Bases: dict

lexedata.edit.detect_cognates.clean_segments(segment_string: List[str]) Iterable[Symbol]

Reduce the row’s segments to not contain empty morphemes.

This function removes all unknown sound segments (/0/) from the segments string it is passed, and removes empty morphemes by collapsing subsequent morpheme boundary markers (_#◦+→←) into one, normalizing the phonetics in the process.

>>> segments = "t w o _ m o r ph e m e s".split()
>>> c = clean_segments(segments)
>>> [str(s) for s in c]
['t', 'w', 'o', '_', 'm', 'o', 'r', 'pʰ', 'e', 'm', 'e', 's']
>>> segments = "+ _ t a + 0 + a t".split()
>>> c = clean_segments(segments)
>>> [str(s) for s in c]
['t', 'a', '+', 'a', 't']
lexedata.edit.detect_cognates.cognate_code_to_file(dataset: Wordlist, ratio: float, soundclass: str, cluster_method: str, threshold: float, initial_threshold: float, gop: float, mode: str, output_file: Path) None
lexedata.edit.detect_cognates.compute_one_matrix(tokens_by_index: Mapping[Form_ID, List[str]], align_function: Callable[[Form_ID, Form_ID, slice, slice], float]) Tuple[Mapping[Form_ID, List[Tuple[slice, int]]], List[List[float]]]

Compute the distance matrix for pairwise alignment of related morphemes.

Align all pairs of morphemes (presumably each of them part of a form associated to one common concept), while assuming there is no reduplication, so one morpheme in one form can be cognate (and thus have small relative edit distance) to at most one morpheme in another form.

Return the identifiers of the morphemes, and the corresponding distance matrix.

The align_function is a function that calculates the alignment score for two slices of token strings, such as

>>> identity_scorer = SimpleScoreDict()
>>> def align(f1, f2): return alignment_functions["global"](
...   f1, f2,
...   [-1 for _ in f1], [-1 for _ in f2],
...   ['X' for _ in f1], ['X' for _ in f2],
...   len(f1), len(f2),
...   scale = 0.5,
...   factor = 0.0,
...   scorer = identity_scorer
...   )[2] * 2 / (len(f2) + len(f1))
>>> def slice_align(f1, f2, s1, s2): return align(data[f1][s1], data[f2][s2])

In the simplest case, this is just a pairwise alignment, with optimum at 0.

>>> data = {"f1": "form", "f2": "form"}
>>> compute_one_matrix(data, slice_align)
({'f1': [(slice(0, 4, None), 0)], 'f2': [(slice(0, 4, None), 1)]}, [[0.0, 0.0], [0.0, 0.0]])

This goes up with partial matches.

>>> data = {"f1": "form", "f3": "fowl"}
>>> compute_one_matrix(data, slice_align)
({'f1': [(slice(0, 4, None), 0)], 'f3': [(slice(0, 4, None), 1)]}, [[0.0, -0.5], [-0.5, 0.0]])

If the forms are completely different, the matrix entries are negative values.

>>> data = {"f1": "form", "f4": "diff't"}
>>> compute_one_matrix(data, slice_align)
({'f1': [(slice(0, 4, None), 0)], 'f4': [(slice(0, 6, None), 1)]}, [[0.0, -0.8], [-0.8, 0.0]])

If there are partial similarities, the matrix picks those up.

>>> data = {"f1": "form", "f4": "diff't", "f5": "diff't+folm"}
>>> compute_one_matrix(data, slice_align)
({'f1': [(slice(0, 4, None), 0)], 'f4': [(slice(0, 6, None), 1)], 'f5': [(slice(0, 6, None), 2), (slice(7, 11, None), 3)]}, [[0.0, -0.8, -0.8, 1.0], [-0.8, 0.0, 1.0, -0.8], [-0.8, 1.0, 0.0, 1.0], [1.0, -0.8, 1.0, 0.0]])
lexedata.edit.detect_cognates.filter_function_factory(dataset: Wordlist) Callable[[Dict[str, Any]], bool]
lexedata.edit.detect_cognates.get_partial_matrices(self, concepts: Iterable[Parameter_ID], method='lexstat', scale=0.5, factor=0.3, mode='global') Iterator[Tuple[Parameter_ID, Mapping[Hashable, List[Tuple[slice, int]]], List[List[float]]]]

Create matrices for the purpose of partial cognate detection.

lexedata.edit.detect_cognates.get_slices(tokens: List[str], include_empty=False) Iterator[slice]

Return slices for all morphemes in the token string

This function computes the morpheme slices in an annotated token set. Empty morphemes are not yielded, unless include_empty is set to True.

>>> list(get_slices("t w o _ m o r ph e m e s".split()))
[slice(0, 3, None), slice(4, 12, None)]
>>> list(get_slices("+ _ t a + 0 + a t".split()))
[slice(2, 4, None), slice(5, 6, None), slice(7, 9, None)]
lexedata.edit.detect_cognates.import_back(dataset, output_file)
lexedata.edit.detect_cognates.partial_cluster(self, method='sca', threshold=0.45, scale=0.5, factor=0.3, mode='overlap', cluster_function=<function infomap_clustering>) Iterable[Tuple[Hashable, slice, int]]
lexedata.edit.detect_cognates.sha1(path)

lexedata.edit.merge_cognate_sets module

Read a homophones report (an edited one, most likely) and merge all pairs of forms in it.

Different treatment for separated fields, and un-separated fields
Form variants into variants?
Make sure concepts have a separator
What other columns give warnings, what other columns give errors?

Optionally, merge cognate sets that get merged by this procedure.

lexedata.edit.merge_cognate_sets.merge_cogsets(data: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], mergers: ~typing.Mapping[str, ~typing.Callable[[~typing.Sequence[~lexedata.edit.merge_homophones.C], ~typing.Optional[~lexedata.types.Form]], ~typing.Optional[~lexedata.edit.merge_homophones.C]]], cogset_groups: ~typing.MutableMapping[~lexedata.types.Cognateset_ID, ~typing.Sequence[~lexedata.types.Cognateset_ID]], logger: ~logging.Logger = <Logger lexedata (INFO)>) Iterable[CogSet]

Merge cognate sets in a dataset.

TODO: Construct an example that shows that the order given in cogset_groups is maintained.

Side Effects

Changes cogset_groups:

Groups that are skipped are removed

lexedata.edit.merge_cognate_sets.merge_group(cogsets: ~typing.Sequence[~lexedata.types.CogSet], target: ~lexedata.types.CogSet, mergers: ~typing.Mapping[str, ~typing.Callable[[~typing.Sequence[~lexedata.edit.merge_homophones.C], ~typing.Optional[~lexedata.types.Form]], ~typing.Optional[~lexedata.edit.merge_homophones.C]]], dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], logger: ~logging.Logger = <Logger lexedata (INFO)>) CogSet

Merge one group of cognate sets.

The target is assumed to be already included in the forms.

lexedata.edit.merge_cognate_sets.parse_cognatesets_report(report: ~typing.TextIO, logger: ~logging.Logger = <Logger lexedata (INFO)>) List[List[Cognateset_ID]]

Parse cognateset merge instructions

The format of the input file is the same as the output of the homophones report.

>>> from io import StringIO
>>> file = StringIO("Cluster of overlapping cognate sets:\n"
...                 "    bark-22\n"
...                 "    skin-27")
>>> parse_cognatesets_report(file)
[['bark-22', 'skin-27']]

lexedata.edit.merge_homophones module

Read a homophones report (an edited one, most likely) and merge all pairs of forms in it.

Different treatment for separated fields, and un-separated fields
Form variants into variants?
Make sure concepts have a separator
What other columns give warnings, what other columns give errors?

Optionally, merge cognate sets that get merged by this procedure.

exception lexedata.edit.merge_homophones.Skip

Bases: Exception

Skip this merge, leaving all forms unchanged.

This is not an Error! It is more akin to StopIteration.
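
A hypothetical custom merger showing how Skip is intended to be raised (the helper below is an illustration, not part of the module):

    from lexedata.edit.merge_homophones import Skip

    def skip_if_values_present(sequence, target=None):
        # Illustration: refuse to merge the whole group as soon as any
        # entry carries a value in this column.
        if any(sequence):
            raise Skip
        return None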

lexedata.edit.merge_homophones.cancel_and_skip(sequence: Sequence[C], target: Optional[Form] = None) Optional[C]

If entries differ, do not merge this set of forms.

>>> cancel_and_skip([])
>>> cancel_and_skip([1, 2]) 
Traceback (most recent call last):
...
    raise Skip
lexedata.edit.merge_homophones.Skip
>>> cancel_and_skip([1, 1])
1
lexedata.edit.merge_homophones.concatenate(sequence: Sequence[C], target: Optional[Form] = None) Optional[C]

Concatenate all entries, even if they are identical, in the given order.

Strings are concatenated using ‘; ‘ as a separator. Other iterables are flattened.

>>> concatenate([[1, 2], [2, 4]])
[1, 2, 2, 4]
>>> concatenate([["a", "b"], ["c", "a"]])
['a', 'b', 'c', 'a']
>>> concatenate(["a", "b"])
'a; b'
>>> concatenate([]) is None
True
>>> concatenate([[1, 2], [2, 4]])
[1, 2, 2, 4]
>>> concatenate([None, [1], [3]])
[1, 3]
>>> concatenate([[1, 1], [2]])
[1, 1, 2]
>>> concatenate([["a", "b"], ["c", "a"]])
['a', 'b', 'c', 'a']
>>> concatenate([None, "a", "b"])
'; a; b'
>>> concatenate(["a", "b", "a", ""])
'a; b; a; '
>>> concatenate(["a", "b", None, "a"])
'a; b; ; a'
>>> concatenate(["a", "b", "a; c", None])
'a; b; a; c; '
lexedata.edit.merge_homophones.constant_factory(c: C) Callable[[Sequence[C], Optional[Form]], Optional[C]]

Create a merger that always returns c.

This is useful eg. for the status column, which needs to be updated when forms are merged, to a value that does not depend on the earlier status.

>>> constant = constant_factory("a")
>>> constant([None, 'b'])
'a'
>>> constant([])
'a'
lexedata.edit.merge_homophones.default(sequence: Sequence[C], target: Optional[Form] = None) Optional[C]

Merge with sensible defaults.

Union for sequence-shaped entries (strings, and lists with a separator in the metadata), must_be_equal otherwise

>>> default([1, 2]) 
Traceback (most recent call last):
AssertionError: ...
>>> default([[1, 2], [3, 4]])
[1, 2, 3, 4]
>>> default(["a; b", "a", "c; b"])
'a; b; c'
lexedata.edit.merge_homophones.first(sequence: Sequence[C], target: Optional[Form] = None) Optional[C]

Take the first nonzero entry, no matter whether the others match or not.

>>> first([1, 2])
1
>>> first([])
>>> first([None, 1, 2])
1
lexedata.edit.merge_homophones.format_mergers(mergers: Mapping[str, Callable[[Sequence[C], Optional[Form]], Optional[C]]]) str
lexedata.edit.merge_homophones.isiterable(obj: object) bool

Test whether object is iterable, BUT NOT A STRING.

For merging purposes, we consider strings ATOMIC and thus NOT iterable.
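
A short illustration of the behaviour described above (these calls are not part of the original doctests):

>>> isiterable(["a", "b"])
True
>>> isiterable("ab")
False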

lexedata.edit.merge_homophones.merge_forms(data: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], mergers: ~typing.Mapping[str, ~typing.Callable[[~typing.Sequence[~lexedata.edit.merge_homophones.C], ~typing.Optional[~lexedata.types.Form]], ~typing.Optional[~lexedata.edit.merge_homophones.C]]], homophone_groups: ~typing.MutableMapping[~lexedata.types.Form_ID, ~typing.Sequence[~lexedata.types.Form_ID]], logger: ~logging.Logger = <Logger lexedata (INFO)>) Iterable[Form]

Merge forms from a dataset.

TODO: Construct an example that shows that the order given in homophone_groups is maintained.

Side Effects

Changes homophone_groups:

Groups that are skipped are removed

lexedata.edit.merge_homophones.merge_group(forms: ~typing.Sequence[~lexedata.types.Form], target: ~lexedata.types.Form, mergers: ~typing.Mapping[str, ~typing.Callable[[~typing.Sequence[~lexedata.edit.merge_homophones.C], ~typing.Optional[~lexedata.types.Form]], ~typing.Optional[~lexedata.edit.merge_homophones.C]]], dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], logger: ~logging.Logger = <Logger lexedata (INFO)>) Form

Merge one group of homophones.

>>> merge_group(
...   [{"Parameter_ID": [1, 1]}, {"Parameter_ID": [2]}],
...   {"Parameter_ID": [1, 1]}, {"Parameter_ID": union}, util.fs.new_wordlist())
{'Parameter_ID': [1, 2]}

The target is assumed to be already included in the forms.

>>> merge_group(
...   [{"Parameter_ID": [1, 1]}, {"Parameter_ID": [2]}],
...   {"Parameter_ID": [1, 1]}, {"Parameter_ID": concatenate}, util.fs.new_wordlist())
{'Parameter_ID': [1, 1, 2]}
lexedata.edit.merge_homophones.must_be_equal(sequence: Sequence[C], target: Optional[Form] = None) Optional[C]

End with an error if entries are not equal.

>>> must_be_equal([1, 2]) 
Traceback (most recent call last):
AssertionError: assert 2 <= 1
>>> must_be_equal([1, 1])
1
>>> must_be_equal([])
lexedata.edit.merge_homophones.must_be_equal_or_null(sequence: Sequence[C], target: Optional[Form] = None) Optional[C]

End with an error if those entries which are present are not equal.

>>> must_be_equal_or_null([1, 2]) 
Traceback (most recent call last):
AssertionError: assert 2 <= 1
...
>>> must_be_equal_or_null([1, 1])
1
>>> must_be_equal_or_null([1, 1, None])
1
lexedata.edit.merge_homophones.parse_homophones_old_format(report: TextIO) Mapping[Form_ID, Sequence[Form_ID]]

Parse legacy homophones merge instructions.

>>> from io import StringIO
>>> file = StringIO("Unconnected: Matsigenka kis {('ANGRY', '19148'), ('FIGHT (v. Or n.)', '19499'), ('CRITICIZE, SCOLD', '19819')}")
>>> parse_homophones_old_format(file)
defaultdict(<class 'list'>, {'19148': ['19499', '19819']})

lexedata.edit.merge_homophones.parse_homophones_report(report: TextIO) Mapping[Form_ID, Sequence[Form_ID]]

Parse legacy homophones merge instructions

The format of the input file is the same as the output of the homophones report.

>>> from io import StringIO
>>> file = StringIO("ache, e.ta.'kɾã: Unknown (but at least one concept not found)\n"
...                 "    ache_one (one)\n"
...                 "    ache_single_3 (single)\n")
>>> parse_homophones_report(file)
defaultdict(<class 'list'>, {'ache_one': ['ache_one', 'ache_single_3']})

lexedata.edit.merge_homophones.parse_merge_override(string: str) Tuple[str, Callable[[Sequence[C], Optional[Form]], Optional[C]]]
lexedata.edit.merge_homophones.transcription(wrapper: str = '{}')

Make a closure that adds variants to a variants column.

>>> row = {"variants": None}
>>> orthographic = transcription("<{}>")
>>> orthographic(["a", "a", "an"], row)
'a'
>>> row
{'variants': ['<an>']}
lexedata.edit.merge_homophones.union(sequence: Sequence[C], target: Optional[Form] = None) Optional[C]

Concatenate all entries, without duplicates.

Iterables are flattened. Strings are considered sequences of ‘; ‘-separated strings and flattened accordingly. Empty values are ignored.

>>> union([]) is None
True
>>> union([[1, 2], [2, 4]])
[1, 2, 4]
>>> union([None, [1], [3]])
[1, 3]
>>> union([[1, 1], [2]])
[1, 2]
>>> union([["a", "b"], ["c", "a"]])
['a', 'b', 'c']
>>> union([None, "a", "b"])
'a; b'
>>> union(["a", "b", "a", ""])
'a; b'
>>> union(["a", "b", "a", None])
'a; b'
>>> union(["a", "b", "a; c", None])
'a; b; c'
>>> union([['one', 'one'], ['one1', 'one1'], ['two1', None], ['one'], ['one']])
['one', 'one1', 'two1']
lexedata.edit.merge_homophones.warn(sequence: Sequence[C], target: Optional[Form] = None) Optional[C]

Print a warning if entries are not equal, but proceed taking the first one.

>>> warn([1, 2])
1
>>> warn([1, 1])
1

lexedata.edit.normalize_unicode module

Normalize a dataset or file to NFC unicode normalization.

Make sure every string entry in every table of the dataset uses NFC unicode normalization, or take a list of files, each of which gets normalized.
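
For reference, NFC normalization composes combining sequences into precomposed characters where possible; a quick standard-library illustration:

>>> import unicodedata
>>> unicodedata.normalize("NFC", "a\u0301")
'á'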

lexedata.edit.normalize_unicode.n(s: str) str
lexedata.edit.normalize_unicode.normalize(file, original_encoding='utf-8')

lexedata.edit.replace_id module

lexedata.edit.replace_id_column module

lexedata.edit.simplify_ids module

Clean up all ID columns in the dataset.

Take every ID column and convert it to either an integer-valued or a restricted-string-valued (only containing a-z, 0-9, or _) column, maintaining uniqueness of IDs, and keeping IDs as they are where they fit the format.

Optionally, create ‘transparent’ IDs, that is, alphanumerical IDs derived from the characteristic columns of the corresponding table. For example, the ID of a FormTable row would be derived from its language and concept; the ID of a CognatesetTable row from the central concept, if there is one.
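
A rough sketch of the restricted-string format described above (a hypothetical helper, not the module’s actual implementation):

>>> import re
>>> def restrict(string):
...     return re.sub(r"[^a-z0-9_]", "_", string.lower())
>>> restrict("Duck (Anas)")
'duck__anas_'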

lexedata.edit.simplify_ids.parser()

Construct the CLI argument parser for this script.

Module contents