lexedata.exporter package¶
Submodules¶
lexedata.exporter.cognates module¶
- class lexedata.exporter.cognates.BaseExcelWriter(dataset: ~pycldf.dataset.Dataset, database_url: ~typing.Optional[str] = None, logger: ~logging.Logger = <Logger lexedata (INFO)>)¶
Bases:
object
Class logic for matrix-shaped Excel export.
- collect_forms_by_row(judgements: Iterable[Judgement], rows: Iterable[Union[Cognateset_ID, Parameter_ID]]) Mapping[Cognateset_ID, Mapping[Form_ID, Sequence[Judgement]]] ¶
Collect forms by row object (i.e. concept or cognate set)
- create_excel(rows: Iterable[RowObject], languages, judgements: Iterable[Judgement], forms, size_sort: bool = False) None ¶
Convert the initial CLDF into an Excel cognate view
The Excel file has columns “CogSet”, one column each mirroring the other cognateset metadata, and then one column per language.
The rows contain cognate data. If a language has multiple reflexes in the same cognateset, these appear in different cells, one below the other.
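For orientation, a minimal usage sketch (hedged: the metadata file name is a placeholder, passing plain pycldf rows to create_excel is an assumption, and saving the resulting workbook is not shown):
import pycldf

ds = pycldf.Dataset.from_metadata("Wordlist-metadata.json")  # placeholder path
writer = ExcelWriter(dataset=ds)  # concrete subclass defined below
writer.create_excel(
    rows=list(ds.iter_rows("CognatesetTable")),    # one row per cognate set
    languages=list(ds.iter_rows("LanguageTable")),  # one column per language
    judgements=list(ds.iter_rows("CognateTable")),  # links forms to cognate sets
    forms=list(ds.iter_rows("FormTable")),
)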
- create_formcell(form: Form, column: int, row: int) None ¶
Fill the given cell with the form’s data.
In the cell described by ws, column, and row, dump the data for the form: write the form data into the cell, and supply a comment from the judgement if there is one.
- create_formcells(row_forms: Iterable[Form], row_index: int) int ¶
Write all forms for a given cognate set to Excel.
Take all forms for a given cognate set as given by the database, create a hyperlink cell for each form, and write those into rows starting at row_index.
Return the row number of the first empty row after this cognate set, which can then be filled by the following cognate set.
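A rough sketch of how this return value chains consecutive cognate sets (writer, the starting row, and forms_by_cognateset are hypothetical names for illustration):
row = 2  # assumed first data row, below the header
for cognateset_forms in forms_by_cognateset.values():  # hypothetical {cognateset: [forms]} grouping
    row = writer.create_formcells(cognateset_forms, row)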
- header: List[Tuple[str, str]]¶
- row_table: str¶
- class lexedata.exporter.cognates.ExcelWriter(dataset: ~pycldf.dataset.Dataset, database_url: ~typing.Optional[str] = None, singleton_cognate: bool = False, singleton_status: ~typing.Optional[str] = None, logger: ~logging.Logger = <Logger lexedata (INFO)>)¶
Bases:
BaseExcelWriter
Class logic for cognateset Excel export.
- form_to_cell_value(form: Form) str ¶
Build a string describing the form itself
Provide the best transcription and all translations of the form strung together.
>>> ds = util.fs.new_wordlist(FormTable=[], CognatesetTable=[], CognateTable=[])
>>> E = ExcelWriter(dataset=ds)
>>> E.form_to_cell_value({"form": "f", "parameterReference": "c"})
'f ‘c’'
>>> E.form_to_cell_value(
...     {"form": "f", "parameterReference": "c", "formComment": "Not empty"})
'f ‘c’ ⚠'
>>> E.form_to_cell_value(
...     {"form": "fo", "parameterReference": "c", "segments": ["f", "o"]})
'{ f o } ‘c’'
>>> E.form_to_cell_value(
...     {"form": "fo",
...      "parameterReference": "c",
...      "segments": ["f", "o"],
...      "segmentSlice": ["1:1"]})
'{ f }o ‘c’'
TODO: This function should at some point support alignments, so that the following call will return ‘{ - f - }o ‘c’’ instead.
>>> E.form_to_cell_value(
...     {"form": "fo",
...      "parameterReference": "c",
...      "segments": ["f", "o"],
...      "segmentSlice": ["1:1"],
...      "alignment": ["", "f", ""]})
'{ f }o ‘c’'
- header: List[Tuple[str, str]]¶
- row_table: str = 'CognatesetTable'¶
- set_header(dataset: Wordlist[Language_ID, Form_ID, Parameter_ID, Cognate_ID, Cognateset_ID])¶
Define the header for the first few columns
- write_row_header(cogset, row_number: int)¶
Write a row header
Write into the first few columns of the row row_index of self.ws the metadata of a row, e.g. concept ID and gloss, or cognateset ID, cognateset name, and status.
- ws: Worksheet¶
- lexedata.exporter.cognates.cogsets_and_judgements(dataset, status: ~typing.Optional[str], by_segment=True, logger: ~logging.Logger = <Logger lexedata (INFO)>)¶
- lexedata.exporter.cognates.parser()¶
- lexedata.exporter.cognates.properties_as_key(data, columns)¶
lexedata.exporter.edictor module¶
Export a dataset to Edictor/Lingpy.
Input for Edictor is a .tsv file containing the forms. The first column needs to be ‘ID’, containing 1-based integers. Cognateset IDs need to be 1-based integers.
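For illustration, a minimal input file could look like the following (only the ID column is prescribed above; the other column names are hypothetical placeholders):
ID  DOCULECT  CONCEPT  FORM  COGID
1   lang1     hand     mano  1
2   lang2     hand     main  1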
- lexedata.exporter.edictor.add_edictor_settings(file, dataset)¶
Write a block of Edictor setting comments to a file.
Edictor takes some comments in its TSV files as directives for how it should behave. The important settings here are to set the morphology mode to ‘partial’ and to pass the order of languages and concepts through. Everything else is pretty much Edictor standard.
- lexedata.exporter.edictor.forms_to_tsv(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], languages: ~typing.Iterable[str], concepts: ~typing.Set[str], cognatesets: ~typing.Iterable[str], logger: ~logging.Logger = <Logger lexedata (INFO)>)¶
- lexedata.exporter.edictor.glue_in_alignment(global_alignment, cogsets, new_alignment, new_cogset, segments: slice)¶
Add a partial alignment to a global alignment with gaps
NOTE: This function does not check for overlapping alignments; it just assumes that alignments do not overlap!
>>> alm = "(t) (e) (s) (t)".split()
>>> cogsets = [None]
>>> glue_in_alignment(alm, cogsets, list("es-"), 1, slice(1, 3))
>>> alm
['(t)', '+', 'e', 's', '-', '+', '(t)']
>>> cogsets
[None, 1, None]
>>> glue_in_alignment(alm, cogsets, list("-t-"), 2, slice(3, 4))
>>> alm
['(t)', '+', 'e', 's', '-', '+', '-', 't', '-']
>>> cogsets
[None, 1, 2]
This is independent of the order in which alignments are glued in.
>>> alm = "(t) (e) (s) (t)".split()
>>> cogsets = [None]
>>> glue_in_alignment(alm, cogsets, list("-t-"), 2, slice(3, 4))
>>> alm
['(t)', '(e)', '(s)', '+', '-', 't', '-']
>>> cogsets
[None, 2]
>>> glue_in_alignment(alm, cogsets, list("es-"), 1, slice(1, 3))
>>> alm
['(t)', '+', 'e', 's', '-', '+', '-', 't', '-']
>>> cogsets
[None, 1, 2]
Of course, it also works for complete forms, not just for partial cognate judgements.
>>> alm = "(t) (e) (s) (t)".split()
>>> cogsets = [None]
>>> glue_in_alignment(alm, cogsets, list("t-es-t-"), 3, slice(0, 4))
>>> alm
['t', '-', 'e', 's', '-', 't', '-']
>>> cogsets
[3]
- lexedata.exporter.edictor.rename(form_column)¶
lexedata.exporter.matrix module¶
- class lexedata.exporter.matrix.MatrixExcelWriter(dataset: ~pycldf.dataset.Dataset, database_url: ~typing.Optional[str] = None, logger: ~logging.Logger = <Logger lexedata (INFO)>)¶
Bases:
BaseExcelWriter
Class logic for Excel matrix export.
- create_formcell(form: Form, column: int, row: int) None ¶
Fill the given cell with the form’s data.
In the cell described by ws, column, and row, dump the data for the form: write the form data into the cell, and supply a comment from the judgement if there is one.
- header: List[Tuple[str, str]]¶
- row_table: str = 'ParameterTable'¶
- set_header(dataset)¶
Define the header for the first few columns
- write_row_header(cogset, row)¶
Write a row header
Write into the first few columns of the row row_index of self.ws the metadata of a row, e.g. concept ID and gloss, or cognateset ID, cognateset name, and status.
- ws: Worksheet¶
lexedata.exporter.phylogenetics module¶
- class lexedata.exporter.phylogenetics.AbsenceHeuristic(value)¶
Bases:
Enum
Heuristics for determining which concepts of a cognateset are considered when deciding whether a root is absent from a language.
- CENTRALCONCEPT = 0¶
- HALFPRIMARYCONCEPTS = 1¶
- class lexedata.exporter.phylogenetics.CodingProcedure(value)¶
Bases:
Enum
The available coding procedures for the phylogenetic export: root presence, root meaning, and multistate root-meaning coding.
- MULTISTATE = 2¶
- ROOTMEANING = 1¶
- ROOTPRESENCE = 0¶
- class lexedata.exporter.phylogenetics.Cognateset_ID¶
Bases:
str
- class lexedata.exporter.phylogenetics.Language_ID¶
Bases:
str
- class lexedata.exporter.phylogenetics.Parameter_ID¶
Bases:
str
- lexedata.exporter.phylogenetics.add_partitions(data_object: Element, partitions: Dict[str, Iterable[int]])¶
Add partitions after the <data> object
>>> xml = ET.fromstring("<beast><data id='alignment'/></beast>")
>>> data = xml.find(".//data")
>>> partitions = {"a": [1, 2, 3, 5], "b": [4, 6, 7]}
>>> add_partitions(data, partitions)
>>> print(ET.tostring(xml).decode("utf-8"))
<beast><data id="alignment"/><data id="concept:a" spec="FilteredAlignment" filter="1,2-4,6" data="@alignment" ascertained="true" excludefrom="0" excludeto="1"/><data id="concept:b" spec="FilteredAlignment" filter="1,5,7-8" data="@alignment" ascertained="true" excludefrom="0" excludeto="1"/></beast>
- lexedata.exporter.phylogenetics.apply_heuristics(dataset: ~lexedata.types.Wordlist, heuristic: ~typing.Optional[~lexedata.exporter.phylogenetics.AbsenceHeuristic] = None, primary_concepts: ~typing.Union[~lexedata.types.WorldSet[~lexedata.types.Parameter_ID], ~typing.AbstractSet[~lexedata.types.Parameter_ID]] = <lexedata.types.WorldSet object>, logger: ~logging.Logger = <Logger lexedata (INFO)>) Mapping[Cognateset_ID, Set[Parameter_ID]] ¶
Compute the relevant concepts for cognatesets, depending on the heuristic.
These concepts will be considered when deciding whether a root is deemed absent in a language.
For the CentralConcept heuristic, the relevant concepts are the central concept of a cognateset, as given by the #parameterReference column of the CognatesetTable. A central concept not included in the primary_concepts is ignored with a warning.
>>> ds = util.fs.new_wordlist()
>>> cst = ds.add_component("CognatesetTable")
>>> ds["CognatesetTable"].tableSchema.columns.append(
...     pycldf.dataset.Column(
...         name="Central_Concept",
...         propertyUrl="http://cldf.clld.org/v1.0/terms.rdf#parameterReference"))
>>> ds.auto_constraints(cst)
>>> _ = ds.write(CognatesetTable=[
...     {"ID": "cognateset1", "Central_Concept": "concept1"}
... ])
>>> apply_heuristics(ds, heuristic=AbsenceHeuristic.CENTRALCONCEPT) == {'cognateset1': {'concept1'}}
True
This extends to the case where a cognateset may have more than one central concept.
>>> ds = util.fs.new_wordlist()
>>> cst = ds.add_component("CognatesetTable")
>>> ds["CognatesetTable"].tableSchema.columns.append(
...     pycldf.dataset.Column(
...         name="Central_Concepts",
...         propertyUrl="http://cldf.clld.org/v1.0/terms.rdf#parameterReference",
...         separator=","))
>>> ds.auto_constraints(cst)
>>> _ = ds.write(CognatesetTable=[
...     {"ID": "cognateset1", "Central_Concepts": ["concept1", "concept2"]}
... ])
>>> apply_heuristics(ds, heuristic=AbsenceHeuristic.CENTRALCONCEPT) == {
...     'cognateset1': {'concept1', 'concept2'}}
True
For the HalfPrimaryConcepts heuristic, the relevant concepts are all primary concepts connected to a cognateset.
>>> ds = util.fs.new_wordlist(
...     FormTable=[
...         {"ID": "f1", "Parameter_ID": "c1", "Language_ID": "l1", "Form": "x"},
...         {"ID": "f2", "Parameter_ID": "c2", "Language_ID": "l1", "Form": "x"}],
...     CognateTable=[
...         {"ID": "1", "Form_ID": "f1", "Cognateset_ID": "s1"},
...         {"ID": "2", "Form_ID": "f2", "Cognateset_ID": "s1"}])
>>> apply_heuristics(ds, heuristic=AbsenceHeuristic.HALFPRIMARYCONCEPTS) == {
...     's1': {'c1', 'c2'}}
True
NOTE: This function cannot guarantee that every cognateset has at least one relevant concept; there may be cognatesets without any. A cognateset with 0 relevant concepts will always be included, because 0 is at least half of 0.
- lexedata.exporter.phylogenetics.compress_indices(indices: Set[int]) Iterator[slice] ¶
Turn groups of largely contiguous indices into slices.
>>> list(compress_indices(set(range(10))))
[slice(0, 10, None)]
>>> list(compress_indices([1, 2, 5, 6, 7]))
[slice(1, 3, None), slice(5, 8, None)]
- lexedata.exporter.phylogenetics.fill_beast(data_object: Element, languages, sequences) None ¶
Add sequences to BEAST as Alignment object.
>>> xml = ET.fromstring("<beast><data /></beast>")
>>> fill_beast(xml.find(".//data"), ["L1", "L2"], ["0110", "0011"])
>>> print(ET.tostring(xml).decode("utf-8"))
<beast><data id="vocabulary" dataType="integer" spec="Alignment">
<sequence id="language_data_vocabulary:L1" taxon="L1" value="0110"/>
<sequence id="language_data_vocabulary:L2" taxon="L2" value="0011"/>
<taxonset id="taxa" spec="TaxonSet"><plate var="language" range="{languages}"><taxon id="$(language)" spec="Taxon"/></plate></taxonset></data></beast>
- lexedata.exporter.phylogenetics.format_nexus(languages: Iterable[str], sequences: Iterable[str], n_symbols: int, n_characters: int, datatype: str, partitions: Optional[Mapping[str, Iterable[int]]] = None)¶
Format a Nexus output with the sequences.
This function only formats and performs no further validity checks!
>>> print(format_nexus(
...     ["l1", "l2"],
...     ["0010", "0111"],
...     2, 3,
...     "binary",
...     {"one": [1], "two": [2,3]}
... ))
#NEXUS
Begin Taxa;
  Dimensions ntax=2;
  TaxLabels l1 l2;
End;
Begin Characters;
  Dimensions NChar=3;
  Format Datatype=Restriction Missing=? Gap=- Symbols="0 1" ;
  Matrix
    [The first column is constant zero, for programs with ascertainment correction]
    l1 0010
    l2 0111
  ;
End;
Begin Sets;
  CharSet one=1;
  CharSet two=2 3;
End;
- lexedata.exporter.phylogenetics.multistate_code(dataset: Mapping[Language_ID, Mapping[Parameter_ID, Set[Cognateset_ID]]]) Tuple[Mapping[Language_ID, Sequence[Set[int]]], Sequence[int]] ¶
Create a multistate root-meaning coding from cognate codes in a dataset
Take the cognate code information from a wordlist, i.e. a mapping of the form {Language ID: {Concept ID: {Cognateset ID}}}, and generate a multistate alignment from it that lists for every meaning which roots are used to represent that meaning in each language.
Also return the number of roots for each concept.
Examples
>>> alignment, lengths = multistate_code({"Language": {"Meaning": {"Cognateset 1"}}})
>>> alignment == {'Language': [{0}]}
True
>>> lengths == [1]
True
>>> alignment, statecounts = multistate_code(
...     {"l1": {"m1": {"c1"}},
...      "l2": {"m1": {"c2"}, "m2": {"c1", "c3"}}})
>>> alignment["l1"][1]
set()
>>> alignment["l2"][1] == {0, 1}
True
>>> statecounts
[2, 2]
- lexedata.exporter.phylogenetics.parser()¶
Construct the CLI argument parser for this script.
- lexedata.exporter.phylogenetics.raw_binary_alignment(alignment)¶
- lexedata.exporter.phylogenetics.raw_multistate_alignment(alignment, long_sep: str = ',')¶
- lexedata.exporter.phylogenetics.read_cldf_dataset(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], code_column: ~typing.Optional[str] = None, logger: ~logging.Logger = <Logger lexedata (INFO)>) Mapping[Language_ID, Mapping[Parameter_ID, Set[Cognateset_ID]]] ¶
Load a CLDF dataset.
Load the file as a JSON CLDF metadata description file, or as a metadata-free dataset contained in a single CSV file.
The distinction is made depending on the file extension: .json files are loaded as metadata descriptions, all other files are matched against the CLDF module specifications. Directories are checked for the presence of any CLDF datasets in undefined order of the dataset types.
If use_ids == False, the reader is free to choose language names or language glottocodes for the output if they are unique.
Examples
>>> import tempfile
>>> dirname = Path(tempfile.mkdtemp(prefix="lexedata-test"))
>>> target = dirname / "forms.csv"
>>> _size = open(target, "w", encoding="utf-8").write('''
... ID,Language_ID,Parameter_ID,Form,Cognateset_ID
... '''.strip())
>>> ds = pycldf.Wordlist.from_data(target)
{'autaa': defaultdict(<class 'set'>, {'Woman': {'WOMAN1'}, 'Person': {'PERSON1'}})}
TODO: FIXME THIS EXAMPLE IS INCOMPLETE
- Parameters:
fname (str or Path) – Path to a CLDF dataset
- Return type:
Data
- lexedata.exporter.phylogenetics.read_structure_dataset(dataset: ~pycldf.dataset.StructureDataset, logger: ~logging.Logger = <Logger lexedata (INFO)>) MutableMapping[Language_ID, MutableMapping[Parameter_ID, Set]] ¶
- lexedata.exporter.phylogenetics.read_wordlist(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], code_column: ~typing.Optional[str], logger: ~logging.Logger = <Logger lexedata (INFO)>) MutableMapping[Language_ID, MutableMapping[Parameter_ID, Set]] ¶
- lexedata.exporter.phylogenetics.root_meaning_code(dataset: ~typing.Mapping[~lexedata.types.Language_ID, ~typing.Mapping[~lexedata.types.Parameter_ID, ~typing.Set[~lexedata.types.Cognateset_ID]]], core_concepts: ~typing.Set[~lexedata.types.Parameter_ID] = <lexedata.types.WorldSet object>, ascertainment: ~typing.Sequence[~typing.Literal['0', '1', '?']] = ['0']) Tuple[Mapping[Language_ID, List[Literal['0', '1', '?']]], Mapping[Parameter_ID, Mapping[Cognateset_ID, int]]] ¶
Create a root-meaning coding from cognate codes in a dataset
Take the cognate code information from a wordlist, i.e. a mapping of the form {Language ID: {Concept ID: {Cognateset ID}}}, and generate a binary alignment from it that lists for every meaning which roots are used to represent that meaning in each language.
Return the alignment, and the list of slices belonging to each meaning.
The default ascertainment is a single absence (‘0’): the configuration where a form is absent from all languages is never observed, but always possible, so we add this entry for the purposes of ascertainment correction.
Examples
>>> alignment, concepts = root_meaning_code({"Language": {"Meaning": {"Cognateset 1"}}})
>>> alignment
{'Language': ['0', '1']}
>>> alignment, concepts = root_meaning_code(
...     {"l1": {"m1": {"c1"}},
...      "l2": {"m1": {"c2"}, "m2": {"c1", "c3"}}})
>>> sorted(concepts)
['m1', 'm2']
>>> sorted(concepts["m1"])
['c1', 'c2']
>>> {language: sequence[concepts["m1"]["c1"]] for language, sequence in alignment.items()}
{'l1': '1', 'l2': '0'}
>>> {language: sequence[concepts["m2"]["c3"]] for language, sequence in alignment.items()}
{'l1': '?', 'l2': '1'}
>>> list(zip(*sorted(zip(*alignment.values()))))
[('0', '0', '1', '?', '?'), ('0', '1', '0', '1', '1')]
- lexedata.exporter.phylogenetics.root_presence_code(dataset: ~typing.Mapping[~lexedata.types.Language_ID, ~typing.Mapping[~lexedata.types.Parameter_ID, ~typing.Set[~lexedata.types.Cognateset_ID]]], relevant_concepts: ~typing.Mapping[~lexedata.types.Cognateset_ID, ~typing.Iterable[~lexedata.types.Parameter_ID]], ascertainment: ~typing.Sequence[~typing.Literal['0', '1', '?']] = ['0'], logger: ~logging.Logger = <Logger lexedata (INFO)>) Tuple[Mapping[Language_ID, List[Literal['0', '1', '?']]], Mapping[Cognateset_ID, int]] ¶
Create a root-presence/absence coding from cognate codes in a dataset
Take the cognate code information from a wordlist, i.e. a mapping of the form {Language ID: {Concept ID: {Cognateset ID}}}, and generate a binary alignment from it that lists for every root whether it is present in that language or not. Return that, and the association between cognatesets and characters.
>>> alignment, roots = root_presence_code(
...     {"Language": {"Meaning": {"Cognateset 1"}}},
...     relevant_concepts={"Cognateset 1": ["Meaning"]})
>>> alignment
{'Language': ['0', '1']}
>>> roots
{'Cognateset 1': 1}
The first entry in each sequence is always ‘0’: The configuration where a form is absent from all languages is never observed, but always possible, so we add this entry for the purposes of ascertainment correction.
If a root is attested at all, in any concept, it is considered present. Because the word list is never a complete description of the language’s lexicon, the function employs a heuristic to generate ‘absent’ states.
If a root is unattested, and at least half of the relevant concepts associated with this root are attested, but each expressed by another root, the root is assumed to be absent in the target language. (If there is exactly one central concept, then that central concept being attested or unknown is a special case of this general rule.) Otherwise the presence/absence of the root is considered unknown.
>>> alignment, roots = root_presence_code(
...     {"l1": {"m1": {"c1"}},
...      "l2": {"m1": {"c2"}, "m2": {"c1", "c3"}}},
...     relevant_concepts={"c1": ["m1"], "c2": ["m1"], "c3": ["m2"]})
>>> sorted(roots)
['c1', 'c2', 'c3']
>>> sorted_roots = sorted(roots.items())
>>> {language: [sequence[k[1]] for k in sorted_roots] for language, sequence in alignment.items()}
{'l1': ['1', '0', '?'], 'l2': ['1', '1', '1']}
>>> list(zip(*sorted(zip(*alignment.values()))))
[('0', '0', '1', '?'), ('0', '1', '1', '1')]
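The absence heuristic described above can be sketched in a few lines of Python (a simplification for illustration; root_state and its arguments are hypothetical names, not the library's actual implementation):
def root_state(root, language, relevant_concepts):
    # language: {concept: {cognateset, ...}} for one language
    # relevant_concepts: the concepts relevant for this root
    if any(root in cognatesets for cognatesets in language.values()):
        return "1"  # the root is attested somewhere in the language: present
    attested = [c for c in relevant_concepts if language.get(c)]
    if 2 * len(attested) >= len(relevant_concepts):
        return "0"  # at least half of the relevant concepts are expressed, all by other roots
    return "?"  # not enough evidence either way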