lexedata.report package


lexedata.report.coverage module

class lexedata.report.coverage.Missing(value)

Bases: Enum

An enumeration.

lexedata.report.coverage.coverage_report(dataset: Wordlist[Language_ID, Form_ID, Parameter_ID, Cognate_ID, Cognateset_ID], min_percentage: float = 0.0, with_concept: Iterable[Parameter_ID] = {}, missing: Missing = Missing.KNOWN, only_coded: bool = True) → List[List[str]]
lexedata.report.coverage.coverage_report_concepts(dataset: Dataset)

lexedata.report.extended_cldf_validate module

Validate a CLDF wordlist.

This script runs some more validators specific to CLDF Wordlist datasets in addition to the validation implemented in the pycldf core. Some of those tests are not yet mandated by the CLDF standard, but are assumptions which some tools (including lexedata) tacitly make, so this validator makes them explicit.

TODO: There may be programmatic ways to fix the issues that this script reports. Those automatic fixes should be made more obvious.

lexedata.report.extended_cldf_validate.check_foreign_keys(dataset: ~pycldf.dataset.Dataset, logger: ~logging.Logger = <Logger lexedata (INFO)>)
lexedata.report.extended_cldf_validate.check_id_format(dataset: ~pycldf.dataset.Dataset, logger: ~logging.Logger = <Logger lexedata (INFO)>)
lexedata.report.extended_cldf_validate.check_na_form_has_no_alternative(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], logger: ~logging.Logger = <Logger lexedata (INFO)>)
lexedata.report.extended_cldf_validate.check_no_separator_in_ids(dataset: ~pycldf.dataset.Dataset, logger: <Logger lexedata (INFO)> = <Logger lexedata (INFO)>) → bool
lexedata.report.extended_cldf_validate.check_segmentslice_separator(dataset, logger=None) → bool
lexedata.report.extended_cldf_validate.check_unicode_data(dataset: ~pycldf.dataset.Dataset, unicode_form: str = 'NFC', logger: ~logging.Logger = <Logger lexedata (INFO)>) → bool
lexedata.report.extended_cldf_validate.log_or_raise(message, log: ~logging.Logger = <Logger lexedata (INFO)>)

lexedata.report.filter module

Filter some table by some column.

Print the partial table to STDOUT or a file, so it can be used as a subset filter for some other script, and output statistics (how many rows were included, how many excluded, what proportion, and possibly sub-statistics for xxxReference columns, i.e. by language or by concept) to STDERR.

For example, assume you want to filter your FormTable down to those forms that start with a ‘b’, except for those forms from Fantastean varieties, which all have a name containing ‘Fantastean’. You can do this using two calls to this program like this:

python -m lexedata.report.filter Form '^b' FormTable -c ID -c Language_ID |
python -m lexedata.report.filter -V Language_ID 'Fantastean' -c ID

If you are familiar with standard Unix tools, think of this script as a column-aware, but otherwise vastly reduced, implementation of grep.

lexedata.report.filter.filter(table: ~typing.Iterable[~lexedata.report.filter.R], column: str, filter: ~re.Pattern, invert: bool = False, logger: ~logging.Logger = <Logger lexedata (INFO)>) → Iterator[R]

Return all rows matching a filter.

Match the filter regular expression and return all rows in the table where the filter matches the column. (Or all where it does not, if invert==True.)

>>> list(filter([
...   {"C": "A"},
...   {"C": "An"},
...   {"C": "T"},
...   {"C": "E"},
... ], "C", re.compile("A"), invert=True))
[{'C': 'T'}, {'C': 'E'}]
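The two-step command line example above can be imitated in plain Python. The following is only a minimal sketch and not the lexedata implementation: the function name filter_rows, the sample rows, and the use of re.search (matching anywhere in the cell) are assumptions made for illustration.

```python
import re
import sys
from typing import Dict, Iterable, Iterator


def filter_rows(
    table: Iterable[Dict[str, str]],
    column: str,
    pattern: re.Pattern,
    invert: bool = False,
) -> Iterator[Dict[str, str]]:
    """Yield rows whose `column` matches `pattern` (or does not, if inverted)."""
    n_rows = 0
    n_included = 0
    for n_rows, row in enumerate(table, 1):
        # Assumption: the pattern may match anywhere in the cell (re.search).
        matched = pattern.search(row[column]) is not None
        if matched != invert:
            n_included += 1
            yield row
    # Statistics go to STDERR, so STDOUT could stay a clean table.
    print(f"Included {n_included} of {n_rows} rows.", file=sys.stderr)


# Hypothetical sample data standing in for a FormTable.
rows = [
    {"ID": "f1", "Language_ID": "Fantastean_North", "Form": "ba"},
    {"ID": "f2", "Language_ID": "Other", "Form": "bo"},
    {"ID": "f3", "Language_ID": "Other", "Form": "ka"},
]

# Step 1: keep forms starting with 'b'. Step 2: drop Fantastean varieties.
starts_with_b = filter_rows(rows, "Form", re.compile("^b"))
kept = list(
    filter_rows(starts_with_b, "Language_ID", re.compile("Fantastean"), invert=True)
)
```

Because both steps are generators, the second filter consumes the first lazily, much like the shell pipe in the command line example.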

lexedata.report.homophones module

Generate a report of homophones.

List all groups of homophones in the dataset, together with (if available) the minimal spanning tree according to CLICS, in order to distinguish polysemies from accidental homophones.
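The grouping step can be pictured as follows. This is a minimal sketch under stated assumptions, not the code of list_homophones: it treats two forms as homophones when they share a language and an identical Form string, and the function name group_homophones and the sample rows are hypothetical. The CLICS-based spanning tree is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_homophones(
    forms: Iterable[Dict[str, str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Map (language, form) to its concepts, keeping only multi-concept groups."""
    by_shape: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for form in forms:
        by_shape[form["Language_ID"], form["Form"]].append(form["Parameter_ID"])
    return {
        shape: concepts
        for shape, concepts in by_shape.items()
        if len(set(concepts)) > 1
    }


# Hypothetical sample rows standing in for a FormTable.
forms = [
    {"Language_ID": "l1", "Parameter_ID": "arm", "Form": "mano"},
    {"Language_ID": "l1", "Parameter_ID": "hand", "Form": "mano"},
    {"Language_ID": "l1", "Parameter_ID": "foot", "Form": "pe"},
    {"Language_ID": "l2", "Parameter_ID": "foot", "Form": "mano"},
]

homophones = group_homophones(forms)
```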

lexedata.report.homophones.list_homophones(dataset: ~pycldf.dataset.Dataset, out: ~io.TextIOBase, logger: ~logging.Logger = <Logger lexedata (INFO)>) → None

lexedata.report.judgements module

Check that the judgements make sense.

lexedata.report.judgements.check_cognate_table(dataset: ~pycldf.dataset.Wordlist, logger=<Logger lexedata (INFO)>, strict_concatenative=False) → bool

Check that the CognateTable makes sense.

The cognate table MUST have an indication of forms, in a #formReference column, and cognate sets, in a #cognatesetReference column. It SHOULD have segment slices (#segmentSlice) and alignments (#alignment).

  • The segment slice must be a valid (1-based, inclusive) slice into the segments of the form

  • The alignment must match the segment slice applied to the segments of the form

  • The length of the alignment must match the lengths of other alignments of that cognate set

  • NA forms (including “” for “source reports form as unknown”) must not be in cognate sets
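The first three checks in this list can be illustrated with a small sketch. It is not the code of check_cognate_table, and it rests on assumptions: a single slice is written as "start:end", the gap symbol in alignments is "-", and all function names are hypothetical.

```python
from typing import Iterable, List


def apply_segment_slice(segments: List[str], segment_slice: str) -> List[str]:
    """Apply a 1-based, inclusive slice such as "2:4" to a list of segments."""
    start, end = (int(i) for i in segment_slice.split(":"))
    if not 1 <= start <= end <= len(segments):
        raise ValueError(
            f"Slice {segment_slice} is invalid for {len(segments)} segments"
        )
    # Convert to Python's 0-based, end-exclusive convention.
    return segments[start - 1 : end]


def alignment_matches(
    segments: List[str], segment_slice: str, alignment: List[str]
) -> bool:
    """Check that the alignment, with gaps removed, equals the sliced segments."""
    ungapped = [s for s in alignment if s != "-"]  # assumed gap symbol
    return ungapped == apply_segment_slice(segments, segment_slice)


def alignments_have_equal_length(alignments: Iterable[List[str]]) -> bool:
    """Check that all alignments of one cognate set have the same length."""
    return len({len(a) for a in alignments}) <= 1


segments = ["t", "a", "k", "i"]
sliced = apply_segment_slice(segments, "2:4")
good = alignment_matches(segments, "2:4", ["a", "-", "k", "i"])
bad = alignment_matches(segments, "2:4", ["a", "k"])
```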

If checking for strictly concatenative morphology, also check that the segment slice is a contiguous, non-overlapping section of the form.
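One way to picture the stricter check is shown below. This is a sketch under assumptions and not the lexedata implementation: each judgement of a form is represented by the set of 1-based segment indices it covers, and the helper names and judgement IDs are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def is_contiguous(indices: Set[int]) -> bool:
    """True if the indices form one unbroken run, e.g. {2, 3, 4}."""
    return bool(indices) and indices == set(range(min(indices), max(indices) + 1))


def overlapping_pairs(judgements: Dict[str, Set[int]]) -> List[Tuple[str, str]]:
    """List pairs of judgements of one form that share at least one segment."""
    return [
        (a, b)
        for a, b in combinations(sorted(judgements), 2)
        if judgements[a] & judgements[b]
    ]


# Hypothetical judgements on a single form with four segments.
judgements = {"j1": {1, 2}, "j2": {3, 4}, "j3": {2, 4}}

non_contiguous = [j for j, idx in judgements.items() if not is_contiguous(idx)]
overlaps = overlapping_pairs(judgements)
```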

Having no cognates is a valid choice for a dataset, so this function returns True if no CognateTable was found.

lexedata.report.judgements.log_or_raise(message, logger=<Logger lexedata (INFO)>)

lexedata.report.nonconcatenative_morphemes module

lexedata.report.nonconcatenative_morphemes.cluster_overlaps(overlapping_cognatesets: ~typing.Iterable[~typing.Tuple[~lexedata.types.Cognateset_ID, ~lexedata.types.Cognateset_ID]], out=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) → None
lexedata.report.nonconcatenative_morphemes.network_of_overlaps(which_segment_belongs_to_which_cognateset: Mapping[Form_ID, List[Set[Cognateset_ID]]], forms_cache: Optional[Mapping[Form_ID, Form]] = None) → Set[Tuple[Cognateset_ID, Cognateset_ID]]
lexedata.report.nonconcatenative_morphemes.segment_to_cognateset(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], cognatesets: ~typing.Container[~lexedata.types.Cognateset_ID], logger: ~logging.Logger = <Logger lexedata (INFO)>) → Mapping[Form_ID, List[Set[Cognateset_ID]]]

lexedata.report.segment_inventories module

Report the segment inventory of each language.

Report the phonemes (or whatever is represented by the #segments column) for each language, each with frequencies and whether the segments are valid CLTS.
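The counting part of such a report can be sketched with the standard library alone. This is an illustration under assumptions, not the code of count_segments: the function name count_segments_by_language, the column names, and the sample rows are hypothetical, and the CLTS validity check is omitted.

```python
from collections import Counter, defaultdict
from typing import Container, Dict, Iterable, List, Mapping, Union

Row = Mapping[str, Union[str, List[str]]]


def count_segments_by_language(
    forms: Iterable[Row], languages: Container[str]
) -> Dict[str, Counter]:
    """Count how often each segment occurs in each of the selected languages."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for form in forms:
        if form["Language_ID"] in languages:
            counts[form["Language_ID"]].update(form["Segments"])
    return dict(counts)


# Hypothetical sample rows standing in for a FormTable with a #segments column.
forms = [
    {"Language_ID": "l1", "Segments": ["t", "a", "k", "a"]},
    {"Language_ID": "l1", "Segments": ["k", "i"]},
    {"Language_ID": "l2", "Segments": ["p", "a"]},
]

inventory = count_segments_by_language(forms, {"l1"})
```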

lexedata.report.segment_inventories.comment_on_sound(sound: str) → str

Return a comment on the sound, if necessary.

>>> comment_on_sound("a")
>>> comment_on_sound("_")
>>> comment_on_sound("(")
'Invalid BIPA'
lexedata.report.segment_inventories.count_segments(dataset: Wordlist[Language_ID, Form_ID, Parameter_ID, Cognate_ID, Cognateset_ID], languages: Container[Language_ID])

Module contents