lexedata.exporter package

Submodules

lexedata.exporter.cognates module

class lexedata.exporter.cognates.BaseExcelWriter(dataset: ~pycldf.dataset.Dataset, database_url: ~typing.Optional[str] = None, logger: ~logging.Logger = <Logger lexedata (INFO)>)

Bases: object

Class logic for matrix-shaped Excel export.

collect_forms_by_row(judgements: Iterable[Judgement], rows: Iterable[Union[Cognateset_ID, Parameter_ID]]) Mapping[Cognateset_ID, Mapping[Form_ID, Sequence[Judgement]]]

Collect forms by row object (i.e. concept or cognate set)
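
The return value groups, for each row object, the forms belonging to it, together with the judgements connecting each form to the row. A shape sketch with hypothetical IDs (not a doctest):

# Hypothetical data: each cognateset (or concept) ID maps to its forms,
# and each form to the judgements linking it to that row.
{
    "cognateset1": {"form1": [{"ID": "j1"}], "form2": [{"ID": "j2"}, {"ID": "j3"}]},
    "cognateset2": {"form1": [{"ID": "j4"}]},
}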

create_excel(rows: Iterable[RowObject], languages, judgements: Iterable[Judgement], forms, size_sort: bool = False) None

Convert the initial CLDF into an Excel cognate view

The Excel file has a “CogSet” column, one column for each further piece of cognateset metadata, and then one column per language.

The rows contain cognate data. If a language has multiple reflexes in the same cognateset, these appear in different cells, one below the other.
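
Schematically, with hypothetical data and assuming a cognateset table that carries a Status column, a resulting sheet might look like this:

CogSet  Status   l1             l2
set1    checked  form1 ‘hand’   form2 ‘hand’
                 form3 ‘arm’
set2    new      form4 ‘foot’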

create_formcell(form: Form, column: int, row: int) None

Fill the given cell with the form’s data.

In the cell of self.ws at the given column and row, write the form data, and attach a comment from the judgement if there is one.

create_formcells(row_forms: Iterable[Form], row_index: int) int

Write all forms for a given cognate set to Excel.

Take all forms for a given cognate set as given by the database, create a hyperlink cell for each form, and write those into rows starting at row_index.

Return the row number of the first empty row after this cognate set, which can then be filled by the following cognate set.

abstract form_to_cell_value(form: Form)

Format a form into text for an Excel cell value

header: List[Tuple[str, str]]
row_table: str
abstract set_header(dataset: Wordlist[Language_ID, Form_ID, Parameter_ID, Cognate_ID, Cognateset_ID])

Define the header for the first few columns

abstract write_row_header(row_object: RowObject, row_index: int)

Write a row header

Write into the first few columns of row row_index of self.ws the metadata of a row, e.g. concept ID and gloss, or cognateset ID, name, and status.

class lexedata.exporter.cognates.ExcelWriter(dataset: ~pycldf.dataset.Dataset, database_url: ~typing.Optional[str] = None, singleton_cognate: bool = False, singleton_status: ~typing.Optional[str] = None, logger: ~logging.Logger = <Logger lexedata (INFO)>)

Bases: BaseExcelWriter

Class logic for cognateset Excel export.
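
A hedged sketch of driving the export from Python. The supported entry point is the command-line script built by parser(); the return shape of cogsets_and_judgements, the workbook attribute wb, and the default CLDF column names are assumptions here:

import pycldf
from lexedata.exporter.cognates import ExcelWriter, cogsets_and_judgements

ds = pycldf.Dataset.from_metadata("Wordlist-metadata.json")
writer = ExcelWriter(dataset=ds)
# Assumed return shape: a mapping of cognatesets and an iterable of judgements.
cogsets, judgements = cogsets_and_judgements(ds, status=None)
writer.create_excel(
    rows=cogsets.values(),
    languages=list(ds.iter_rows("LanguageTable")),
    judgements=judgements,
    forms={form["ID"]: form for form in ds.iter_rows("FormTable")},
)
writer.wb.save("cognates.xlsx")  # attribute name `wb` is an assumption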

form_to_cell_value(form: Form) str

Build a string describing the form itself

Provide the best transcription and all translations of the form strung together.

>>> ds = util.fs.new_wordlist(FormTable=[], CognatesetTable=[], CognateTable=[])
>>> E = ExcelWriter(dataset=ds)
>>> E.form_to_cell_value({"form": "f", "parameterReference": "c"})
'f ‘c’'
>>> E.form_to_cell_value(
...   {"form": "f", "parameterReference": "c", "formComment": "Not empty"})
'f ‘c’ ⚠'
>>> E.form_to_cell_value(
...   {"form": "fo", "parameterReference": "c", "segments": ["f", "o"]})
'{ f o } ‘c’'
>>> E.form_to_cell_value(
...   {"form": "fo",
...    "parameterReference": "c",
...    "segments": ["f", "o"],
...    "segmentSlice": ["1:1"]})
'{ f }o ‘c’'

TODO: This function should at some point support alignments, so that the following call will return ‘{ - f - }o ‘c’’ instead.

>>> E.form_to_cell_value(
...   {"form": "fo",
...    "parameterReference": "c",
...    "segments": ["f", "o"],
...    "segmentSlice": ["1:1"],
...    "alignment": ["", "f", ""]})
'{ f }o ‘c’'
header: List[Tuple[str, str]]
row_table: str = 'CognatesetTable'
set_header(dataset: Wordlist[Language_ID, Form_ID, Parameter_ID, Cognate_ID, Cognateset_ID])

Define the header for the first few columns

write_row_header(cogset, row_number: int)

Write a row header

Write into the first few columns of row row_index of self.ws the metadata of a row, e.g. concept ID and gloss, or cognateset ID, name, and status.

ws: Worksheet
lexedata.exporter.cognates.cogsets_and_judgements(dataset, status: ~typing.Optional[str], by_segment=True, logger: ~logging.Logger = <Logger lexedata (INFO)>)
lexedata.exporter.cognates.parser()
lexedata.exporter.cognates.properties_as_key(data, columns)
lexedata.exporter.cognates.sort_cognatesets(cogsets: List[CogSet], judgements: Sequence[Judgement] = [], sort_column: Optional[str] = None, size: bool = True) None

Sort cognatesets by a given column, and optionally by size.
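
A minimal sketch (not a doctest), using dict-like rows as in the other examples of this module; with size=True, cognatesets with more judgements presumably sort first:

cogsets = [{"ID": "small"}, {"ID": "big"}]
judgements = [
    {"ID": "j1", "Form_ID": "f1", "Cognateset_ID": "big"},
    {"ID": "j2", "Form_ID": "f2", "Cognateset_ID": "big"},
    {"ID": "j3", "Form_ID": "f3", "Cognateset_ID": "small"},
]
sort_cognatesets(cogsets, judgements, size=True)  # sorts in place, returns None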

lexedata.exporter.edictor module

Export a dataset to Edictor/Lingpy.

Input for Edictor is a .tsv file containing the forms. The first column needs to be ‘ID’ and contain 1-based integers; cognateset IDs also need to be 1-based integers.
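
For illustration, a hedged sketch of such a tab-separated file; all column names besides ID follow common Edictor/Lingpy conventions and are assumptions here:

ID	DOCULECT	CONCEPT	FORM	TOKENS	COGID
1	l1	hand	mano	m a n o	1
2	l2	hand	main	m ɛ̃	1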

lexedata.exporter.edictor.add_edictor_settings(file, dataset)

Write a block of Edictor setting comments to a file.

Edictor interprets some comments in its TSV files as directives for how it should behave. The important settings here set the morphology mode to ‘partial’ and pass the order of languages and concepts through; everything else follows the Edictor defaults.
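
A minimal usage sketch; that the settings block belongs at the end of the file is an assumption:

with open("cognates.tsv", "a", encoding="utf-8") as tsv:
    add_edictor_settings(tsv, dataset)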

lexedata.exporter.edictor.forms_to_tsv(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], languages: ~typing.Iterable[str], concepts: ~typing.Set[str], cognatesets: ~typing.Iterable[str], logger: ~logging.Logger = <Logger lexedata (INFO)>)
lexedata.exporter.edictor.glue_in_alignment(global_alignment, cogsets, new_alignment, new_cogset, segments: slice)

Add a partial alignment to a global alignment with gaps

NOTE: This function does not check for overlapping alignments; it just assumes alignments do not overlap!

>>> alm = "(t) (e) (s) (t)".split()
>>> cogsets = [None]
>>> glue_in_alignment(alm, cogsets, list("es-"), 1, slice(1, 3))
>>> alm
['(t)', '+', 'e', 's', '-', '+', '(t)']
>>> cogsets
[None, 1, None]
>>> glue_in_alignment(alm, cogsets, list("-t-"), 2, slice(3, 4))
>>> alm
['(t)', '+', 'e', 's', '-', '+', '-', 't', '-']
>>> cogsets
[None, 1, 2]

This is independent of the order in which alignments are glued in.

>>> alm = "(t) (e) (s) (t)".split()
>>> cogsets = [None]
>>> glue_in_alignment(alm, cogsets, list("-t-"), 2, slice(3, 4))
>>> alm
['(t)', '(e)', '(s)', '+', '-', 't', '-']
>>> cogsets
[None, 2]
>>> glue_in_alignment(alm, cogsets, list("es-"), 1, slice(1, 3))
>>> alm
['(t)', '+', 'e', 's', '-', '+', '-', 't', '-']
>>> cogsets
[None, 1, 2]

Of course, it also works for complete forms, not just for partial cognate judgements.

>>> alm = "(t) (e) (s) (t)".split()
>>> cogsets = [None]
>>> glue_in_alignment(alm, cogsets, list("t-es-t-"), 3, slice(0, 4))
>>> alm
['t', '-', 'e', 's', '-', 't', '-']
>>> cogsets
[3]
lexedata.exporter.edictor.rename(form_column)
lexedata.exporter.edictor.write_edictor_file(dataset: Wordlist[Language_ID, Form_ID, Parameter_ID, Cognate_ID, Cognateset_ID], file: TextIO, forms: Mapping[Form_ID, Mapping[str, Any]], judgements_about_form, cognateset_numbers)

Write the judgements of a dataset to file, in Edictor format.
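
The functions in this module compose into a small export pipeline. A hedged sketch, where the three-part return value of forms_to_tsv is an assumption inferred from the parameters of write_edictor_file:

from lexedata.types import WorldSet  # the catch-all set used in defaults above

forms, judgements_about_form, cognateset_numbers = forms_to_tsv(
    dataset, languages=WorldSet(), concepts=WorldSet(), cognatesets=WorldSet()
)
with open("cognates.tsv", "w", encoding="utf-8") as file:
    write_edictor_file(dataset, file, forms, judgements_about_form, cognateset_numbers)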

lexedata.exporter.matrix module

class lexedata.exporter.matrix.MatrixExcelWriter(dataset: ~pycldf.dataset.Dataset, database_url: ~typing.Optional[str] = None, logger: ~logging.Logger = <Logger lexedata (INFO)>)

Bases: BaseExcelWriter

Class logic for Excel matrix export.

create_formcell(form: Form, column: int, row: int) None

Fill the given cell with the form’s data.

In the cell of self.ws at the given column and row, write the form data, and attach a comment from the judgement if there is one.

form_to_cell_value(form: Form) str

Format a form into text for an Excel cell value

header: List[Tuple[str, str]]
row_table: str = 'ParameterTable'
set_header(dataset)

Define the header for the first few columns

write_row_header(cogset, row)

Write a row header

Write into the first few columns of row row_index of self.ws the metadata of a row, e.g. concept ID and gloss, or cognateset ID, name, and status.

ws: Worksheet

lexedata.exporter.phylogenetics module

class lexedata.exporter.phylogenetics.AbsenceHeuristic(value)

Bases: Enum

Heuristics for choosing the concepts relevant to judging a root absent (see apply_heuristics below).

CENTRALCONCEPT = 0
HALFPRIMARYCONCEPTS = 1
class lexedata.exporter.phylogenetics.CodingProcedure(value)

Bases: Enum

Coding procedures for turning cognate judgements into phylogenetic characters.

MULTISTATE = 2
ROOTMEANING = 1
ROOTPRESENCE = 0
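
A hedged sketch of how these enumerations select between the coding functions documented below:

coding = read_wordlist(dataset, code_column=None)

# CodingProcedure.ROOTPRESENCE: absence is judged via an AbsenceHeuristic.
relevant = apply_heuristics(dataset, heuristic=AbsenceHeuristic.CENTRALCONCEPT)
alignment, roots = root_presence_code(coding, relevant_concepts=relevant)

# CodingProcedure.ROOTMEANING needs no absence heuristic.
alignment, concepts = root_meaning_code(coding)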
class lexedata.exporter.phylogenetics.Cognateset_ID

Bases: str

class lexedata.exporter.phylogenetics.Language_ID

Bases: str

class lexedata.exporter.phylogenetics.Parameter_ID

Bases: str

lexedata.exporter.phylogenetics.add_partitions(data_object: Element, partitions: Dict[str, Iterable[int]])

Add partitions after the <data> object

>>> xml = ET.fromstring("<beast><data id='alignment'/></beast>")
>>> data = xml.find(".//data")
>>> partitions = {"a": [1, 2, 3, 5], "b": [4, 6, 7]}
>>> add_partitions(data, partitions)
>>> print(ET.tostring(xml).decode("utf-8"))
<beast><data id="alignment"/><data id="concept:a" spec="FilteredAlignment" filter="1,2-4,6" data="@alignment" ascertained="true" excludefrom="0" excludeto="1"/><data id="concept:b" spec="FilteredAlignment" filter="1,5,7-8" data="@alignment" ascertained="true" excludefrom="0" excludeto="1"/></beast>
lexedata.exporter.phylogenetics.apply_heuristics(dataset: ~lexedata.types.Wordlist, heuristic: ~typing.Optional[~lexedata.exporter.phylogenetics.AbsenceHeuristic] = None, primary_concepts: ~typing.Union[~lexedata.types.WorldSet[~lexedata.types.Parameter_ID], ~typing.AbstractSet[~lexedata.types.Parameter_ID]] = <lexedata.types.WorldSet object>, logger: ~logging.Logger = <Logger lexedata (INFO)>) Mapping[Cognateset_ID, Set[Parameter_ID]]

Compute the relevant concepts for cognatesets, depending on the heuristic.

These concepts will be considered when deciding whether a root is deemed absent in a language.

For the CentralConcept heuristic, the relevant concepts are the central concept of a cognateset, as given by the #parameterReference column of the CognatesetTable. A central concept not included in the primary_concepts is ignored with a warning.

>>> ds = util.fs.new_wordlist()
>>> cst = ds.add_component("CognatesetTable")
>>> ds["CognatesetTable"].tableSchema.columns.append(
...     pycldf.dataset.Column(
...         name="Central_Concept",
...         propertyUrl="http://cldf.clld.org/v1.0/terms.rdf#parameterReference"))
>>> ds.auto_constraints(cst)
>>> _= ds.write(CognatesetTable=[
...     {"ID": "cognateset1", "Central_Concept": "concept1"}
... ])
>>> apply_heuristics(ds, heuristic=AbsenceHeuristic.CENTRALCONCEPT) == {'cognateset1': {'concept1'}}
True

This extends to the case where a cognateset may have more than one central concept.

>>> ds = util.fs.new_wordlist()
>>> cst = ds.add_component("CognatesetTable")
>>> ds["CognatesetTable"].tableSchema.columns.append(
...     pycldf.dataset.Column(
...         name="Central_Concepts",
...         propertyUrl="http://cldf.clld.org/v1.0/terms.rdf#parameterReference",
...         separator=","))
>>> ds.auto_constraints(cst)
>>> _ = ds.write(CognatesetTable=[
...     {"ID": "cognateset1", "Central_Concepts": ["concept1", "concept2"]}
... ])
>>> apply_heuristics(ds, heuristic=AbsenceHeuristic.CENTRALCONCEPT) == {
...     'cognateset1': {'concept1', 'concept2'}}
True

For the HalfPrimaryConcepts heuristic, the relevant concepts are all primary concepts connected to a cognateset.

>>> ds = util.fs.new_wordlist(
...     FormTable=[
...         {"ID": "f1", "Parameter_ID": "c1", "Language_ID": "l1", "Form": "x"},
...         {"ID": "f2", "Parameter_ID": "c2", "Language_ID": "l1", "Form": "x"}],
...     CognateTable=[
...         {"ID": "1", "Form_ID": "f1", "Cognateset_ID": "s1"},
...         {"ID": "2", "Form_ID": "f2", "Cognateset_ID": "s1"}])
>>> apply_heuristics(ds, heuristic=AbsenceHeuristic.HALFPRIMARYCONCEPTS) == {
...     's1': {'c1', 'c2'}}
True

NOTE: This function cannot guarantee that every cognateset has at least one relevant concept; there may be cognatesets without any! A cognateset with 0 relevant concepts will always be included, because 0 is at least half of 0.

lexedata.exporter.phylogenetics.compress_indices(indices: Set[int]) Iterator[slice]

Turn groups of largely contiguous indices into slices.

>>> list(compress_indices(set(range(10))))
[slice(0, 10, None)]
>>> list(compress_indices([1, 2, 5, 6, 7]))
[slice(1, 3, None), slice(5, 8, None)]
lexedata.exporter.phylogenetics.fill_beast(data_object: Element, languages, sequences) None

Add sequences to BEAST as Alignment object.

>>> xml = ET.fromstring("<beast><data /></beast>")
>>> fill_beast(xml.find(".//data"), ["L1", "L2"], ["0110", "0011"])
>>> print(ET.tostring(xml).decode("utf-8"))
<beast><data id="vocabulary" dataType="integer" spec="Alignment">
<sequence id="language_data_vocabulary:L1" taxon="L1" value="0110"/>
<sequence id="language_data_vocabulary:L2" taxon="L2" value="0011"/>
<taxonset id="taxa" spec="TaxonSet"><plate var="language" range="{languages}"><taxon id="$(language)" spec="Taxon"/></plate></taxonset></data></beast>
lexedata.exporter.phylogenetics.format_nexus(languages: Iterable[str], sequences: Iterable[str], n_symbols: int, n_characters: int, datatype: str, partitions: Optional[Mapping[str, Iterable[int]]] = None)

Format a Nexus output with the sequences.

This function only formats and performs no further validity checks!

>>> print(format_nexus(
...   ["l1", "l2"],
...   ["0010", "0111"],
...   2, 3,
...   "binary",
...   {"one": [1], "two": [2,3]}
... )) 
#NEXUS
Begin Taxa;
  Dimensions ntax=2;
  TaxLabels l1 l2;
End;
Begin Characters;
  Dimensions NChar=3;
  Format Datatype=Restriction Missing=? Gap=- Symbols="0 1" ;
  Matrix
    [The first column is constant zero, for programs with ascertainment correction]
    l1  0010
    l2  0111
  ;
End;
Begin Sets;
  CharSet one=1;
  CharSet two=2 3;
End;
lexedata.exporter.phylogenetics.multistate_code(dataset: Mapping[Language_ID, Mapping[Parameter_ID, Set[Cognateset_ID]]]) Tuple[Mapping[Language_ID, Sequence[Set[int]]], Sequence[int]]

Create a multistate root-meaning coding from cognate codes in a dataset

Take the cognate code information from a wordlist, i.e. a mapping of the form {Language ID: {Concept ID: {Cognateset ID}}}, and generate a multistate alignment from it that lists for every meaning which roots are used to represent that meaning in each language.

Also return the number of roots for each concept.

Examples

>>> alignment, lengths = multistate_code({"Language": {"Meaning": {"Cognateset 1"}}})
>>> alignment =={'Language': [{0}]}
True
>>> lengths == [1]
True
>>> alignment, statecounts = multistate_code(
...     {"l1": {"m1": {"c1"}},
...      "l2": {"m1": {"c2"}, "m2": {"c1", "c3"}}})
>>> alignment["l1"][1]
set()
>>> alignment["l2"][1] == {0, 1}
True
>>> statecounts
[2, 2]
lexedata.exporter.phylogenetics.parser()

Construct the CLI argument parser for this script.

lexedata.exporter.phylogenetics.raw_binary_alignment(alignment)
lexedata.exporter.phylogenetics.raw_multistate_alignment(alignment, long_sep: str = ',')
lexedata.exporter.phylogenetics.read_cldf_dataset(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], code_column: ~typing.Optional[str] = None, logger: ~logging.Logger = <Logger lexedata (INFO)>) Mapping[Language_ID, Mapping[Parameter_ID, Set[Cognateset_ID]]]

Load a CLDF dataset.

Load the file as a JSON CLDF metadata description file, or as a metadata-free dataset contained in a single CSV file.

The distinction is made depending on the file extension: .json files are loaded as metadata descriptions, all other files are matched against the CLDF module specifications. Directories are checked for the presence of any CLDF datasets, in an undefined order of the dataset types.

If use_ids == False, the reader is free to choose language names or language glottocodes for the output if they are unique.

Examples

>>> import tempfile
>>> dirname = Path(tempfile.mkdtemp(prefix="lexedata-test"))
>>> target = dirname / "forms.csv"
>>> _size = open(target, "w", encoding="utf-8").write('''
... ID,Language_ID,Parameter_ID,Form,Cognateset_ID
... '''.strip())
>>> ds = pycldf.Wordlist.from_data(target)

{‘autaa’: defaultdict(<class ‘set’>, {‘Woman’: {‘WOMAN1’}, ‘Person’: {‘PERSON1’}})} TODO: FIXME THIS EXAMPLE IS INCOMPLETE

Parameters:

fname (str or Path) – Path to a CLDF dataset

Return type:

Data
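
Continuing the example above, a minimal (non-doctest) call:

coding = read_cldf_dataset(ds)
# coding maps each language ID to {concept ID: {cognateset IDs}},
# in the shape of the (incomplete) example output above.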

lexedata.exporter.phylogenetics.read_structure_dataset(dataset: ~pycldf.dataset.StructureDataset, logger: ~logging.Logger = <Logger lexedata (INFO)>) MutableMapping[Language_ID, MutableMapping[Parameter_ID, Set]]
lexedata.exporter.phylogenetics.read_wordlist(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], code_column: ~typing.Optional[str], logger: ~logging.Logger = <Logger lexedata (INFO)>) MutableMapping[Language_ID, MutableMapping[Parameter_ID, Set]]
lexedata.exporter.phylogenetics.root_meaning_code(dataset: ~typing.Mapping[~lexedata.types.Language_ID, ~typing.Mapping[~lexedata.types.Parameter_ID, ~typing.Set[~lexedata.types.Cognateset_ID]]], core_concepts: ~typing.Set[~lexedata.types.Parameter_ID] = <lexedata.types.WorldSet object>, ascertainment: ~typing.Sequence[~typing.Literal['0', '1', '?']] = ['0']) Tuple[Mapping[Language_ID, List[Literal['0', '1', '?']]], Mapping[Parameter_ID, Mapping[Cognateset_ID, int]]]

Create a root-meaning coding from cognate codes in a dataset

Take the cognate code information from a wordlist, i.e. a mapping of the form {Language ID: {Concept ID: {Cognateset ID}}}, and generate a binary alignment from it that lists for every meaning which roots are used to represent that meaning in each language.

Return the alignment, and the list of slices belonging to each meaning.

The default ascertainment is a single absence (‘0’): the configuration where a form is absent from all languages is never observed, but always possible, so we add this entry for the purposes of ascertainment correction.

Examples

>>> alignment, concepts = root_meaning_code({"Language": {"Meaning": {"Cognateset 1"}}})
>>> alignment
{'Language': ['0', '1']}
>>> alignment, concepts = root_meaning_code(
...   {"l1": {"m1": {"c1"}},
...    "l2": {"m1": {"c2"}, "m2": {"c1", "c3"}}})
>>> sorted(concepts)
['m1', 'm2']
>>> sorted(concepts["m1"])
['c1', 'c2']
>>> {language: sequence[concepts["m1"]["c1"]] for language, sequence in alignment.items()}
{'l1': '1', 'l2': '0'}
>>> {language: sequence[concepts["m2"]["c3"]] for language, sequence in alignment.items()}
{'l1': '?', 'l2': '1'}
>>> list(zip(*sorted(zip(*alignment.values()))))
[('0', '0', '1', '?', '?'), ('0', '1', '0', '1', '1')]
lexedata.exporter.phylogenetics.root_presence_code(dataset: ~typing.Mapping[~lexedata.types.Language_ID, ~typing.Mapping[~lexedata.types.Parameter_ID, ~typing.Set[~lexedata.types.Cognateset_ID]]], relevant_concepts: ~typing.Mapping[~lexedata.types.Cognateset_ID, ~typing.Iterable[~lexedata.types.Parameter_ID]], ascertainment: ~typing.Sequence[~typing.Literal['0', '1', '?']] = ['0'], logger: ~logging.Logger = <Logger lexedata (INFO)>) Tuple[Mapping[Language_ID, List[Literal['0', '1', '?']]], Mapping[Cognateset_ID, int]]

Create a root-presence/absence coding from cognate codes in a dataset

Take the cognate code information from a wordlist, i.e. a mapping of the form {Language ID: {Concept ID: {Cognateset ID}}}, and generate a binary alignment from it that lists for every root whether it is present in that language or not. Return that, and the association between cognatesets and characters.

>>> alignment, roots = root_presence_code(
...     {"Language": {"Meaning": {"Cognateset 1"}}},
...     relevant_concepts={"Cognateset 1": ["Meaning"]})
>>> alignment
{'Language': ['0', '1']}
>>> roots
{'Cognateset 1': 1}

The first entry in each sequence is always ‘0’: The configuration where a form is absent from all languages is never observed, but always possible, so we add this entry for the purposes of ascertainment correction.

If a root is attested at all, in any concept, it is considered present. Because the word list is never a complete description of the language’s lexicon, the function employs a heuristic to generate ‘absent’ states.

If a root is unattested, and at least half of the relevant concepts associated with this root are attested, but each expressed by another root, the root is assumed to be absent in the target language. (If there is exactly one central concept, then that central concept being attested or unknown is a special case of this general rule.) Otherwise the presence/absence of the root is considered unknown.

>>> alignment, roots = root_presence_code(
...     {"l1": {"m1": {"c1"}},
...      "l2": {"m1": {"c2"}, "m2": {"c1", "c3"}}},
...     relevant_concepts={"c1": ["m1"], "c2": ["m1"], "c3": ["m2"]})
>>> sorted(roots)
['c1', 'c2', 'c3']
>>> sorted_roots = sorted(roots.items())
>>> {language: [sequence[k[1]] for k in sorted_roots] for language, sequence in alignment.items()}
{'l1': ['1', '0', '?'], 'l2': ['1', '1', '1']}
>>> list(zip(*sorted(zip(*alignment.values()))))
[('0', '0', '1', '?'), ('0', '1', '1', '1')]

Module contents