lexedata.report package

Submodules

lexedata.report.coverage module

class lexedata.report.coverage.Missing(value)

Bases: Enum

An enumeration of ways to treat missing forms when computing coverage.

IGNORE = 0
COUNT_NORMALLY = 1
KNOWN = 2
lexedata.report.coverage.coverage_report(dataset: Wordlist[Language_ID, Form_ID, Parameter_ID, Cognate_ID, Cognateset_ID], min_percentage: float = 0.0, with_concept: Iterable[Parameter_ID] = {}, missing: Missing = Missing.KNOWN, only_coded: bool = True) → List[List[str]]
lexedata.report.coverage.coverage_report_concepts(dataset: Dataset)
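
A minimal usage sketch, assuming a wordlist with a metadata file Wordlist-metadata.json (a placeholder name) and assuming that min_percentage is a coverage threshold in percent:

# Hedged sketch, not part of the module: report coverage per language,
# treating explicit gaps as known (the default, Missing.KNOWN).
import pycldf
from lexedata.report.coverage import Missing, coverage_report

dataset = pycldf.Dataset.from_metadata("Wordlist-metadata.json")
for row in coverage_report(dataset, min_percentage=50.0, missing=Missing.KNOWN):
    print(*row, sep="\t")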

lexedata.report.extended_cldf_validate module

Validate a CLDF wordlist.

This script runs some more validators specific to CLDF Wordlist datasets in addition to the validation implemented in the pycldf core. Some of those tests are not yet mandated by the CLDF standard, but are assumptions which some tools (including lexedata) tacitly make, so this validator makes them explicit.

TODO: There may be programmatic ways to fix the issues that this script reports. Those automatic fixes should be made more obvious.

lexedata.report.extended_cldf_validate.check_foreign_keys(dataset: ~pycldf.dataset.Dataset, logger: ~logging.Logger = <Logger lexedata (INFO)>)
lexedata.report.extended_cldf_validate.check_id_format(dataset: ~pycldf.dataset.Dataset, logger: ~logging.Logger = <Logger lexedata (INFO)>)
lexedata.report.extended_cldf_validate.check_na_form_has_no_alternative(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], logger: ~logging.Logger = <Logger lexedata (INFO)>)
lexedata.report.extended_cldf_validate.check_no_separator_in_ids(dataset: ~pycldf.dataset.Dataset, logger: ~logging.Logger = <Logger lexedata (INFO)>) → bool
lexedata.report.extended_cldf_validate.check_segmentslice_separator(dataset, logger=None) → bool
lexedata.report.extended_cldf_validate.check_unicode_data(dataset: ~pycldf.dataset.Dataset, unicode_form: str = 'NFC', logger: ~logging.Logger = <Logger lexedata (INFO)>) → bool
lexedata.report.extended_cldf_validate.log_or_raise(message, log: ~logging.Logger = <Logger lexedata (INFO)>)
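
The bool-returning checks can also be run individually from Python. A hedged sketch, assuming a placeholder metadata path and assuming that False signals a problem:

# Hedged sketch: run selected validators against a CLDF wordlist.
import pycldf
from lexedata.report import extended_cldf_validate as validate

dataset = pycldf.Dataset.from_metadata("Wordlist-metadata.json")  # placeholder path
# Build the list first so that every check runs, even after a failure.
results = [
    validate.check_unicode_data(dataset, unicode_form="NFC"),
    validate.check_no_separator_in_ids(dataset),
    validate.check_segmentslice_separator(dataset),
]
print("no issues found" if all(results) else "issues found")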

lexedata.report.filter module

Filter some table by some column.

Print the partial table to STDOUT or a file, so it can be used as a subset filter for some other script, and output statistics (how many rows were included, how many were excluded, what proportion that is, and maybe sub-statistics for xxxReference columns, i.e. by language or by concept) to STDERR.

For example, assume you want to filter your FormTable down to those forms that start with a ‘b’, except for forms from Fantastean varieties, which all have a name containing ‘Fantastean’. You can do this by chaining two calls to this program:

python -m lexedata.report.filter Form '^b' FormTable -c ID -c Language_ID |
python -m lexedata.report.filter -V Language_ID 'Fantastean' -c ID

For those familiar with standard Unix tools: this script is a column-aware, but otherwise vastly reduced, implementation of grep.

lexedata.report.filter.filter(table: ~typing.Iterable[~lexedata.report.filter.R], column: str, filter: ~re.Pattern, invert: bool = False, logger: ~logging.Logger = <Logger lexedata (INFO)>) → Iterator[R]

Return all rows matching a filter

Match the filter regular expression against the given column and return all rows of the table where it matches (or all rows where it does not, if invert=True).

>>> list(filter([
...   {"C": "A"},
...   {"C": "An"},
...   {"C": "T"},
...   {"C": "E"},
... ], "C", re.compile("A"), invert=True))
[{'C': 'T'}, {'C': 'E'}]
lexedata.report.filter.parser()

lexedata.report.homophones module

Generate a report of homophones.

List all groups of homophones in the dataset, together with (if available) the minimal spanning tree according to CLICS, in order to distinguish polysemies from accidental homophones.

lexedata.report.homophones.list_homophones(dataset: ~pycldf.dataset.Dataset, out: ~io.TextIOBase, logger: ~logging.Logger = <Logger lexedata (INFO)>) → None
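
A minimal usage sketch (the metadata path is a placeholder):

# Hedged sketch: write the homophones report to standard output.
import sys
import pycldf
from lexedata.report.homophones import list_homophones

dataset = pycldf.Dataset.from_metadata("Wordlist-metadata.json")
list_homophones(dataset, out=sys.stdout)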

lexedata.report.judgements module

Check that the judgements make sense.

lexedata.report.judgements.check_cognate_table(dataset: ~pycldf.dataset.Wordlist, logger=<Logger lexedata (INFO)>, strict_concatenative=False) → bool

Check that the CognateTable makes sense.

The cognate table MUST have an indication of forms, in a #formReference column, and cognate sets, in a #cognatesetReference column. It SHOULD have segment slices (#segmentSlice) and alignments (#alignment).

  • The segment slice must be a valid (1-based, inclusive) slice into the segments of the form (see the sketch below)

  • The alignment must match the segment slice applied to the segments of the form

  • The length of the alignment must match the lengths of other alignments of that cognate set

  • NA forms (including “” for “source reports form as unknown”) must not be in cognate sets

If checking for strictly concatenative morphology, also check that the segment slice is a contiguous, non-overlapping section of the form.

Having no cognates is a valid choice for a dataset, so this function returns True if no CognateTable was found.
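
A hedged sketch of the 1-based, inclusive slice convention, and of calling the checker (the metadata path is a placeholder):

# Segment slice "1:2" picks the first two segments, 1-based and inclusive;
# in Python's 0-based, end-exclusive terms that is segments[0:2].
segments = ["t", "a", "t", "a"]
start, end = 1, 2
print(segments[start - 1 : end])  # ['t', 'a'] -- what the alignment must match

# Run all checks, including the strictly-concatenative one.
import pycldf
from lexedata.report.judgements import check_cognate_table

dataset = pycldf.Dataset.from_metadata("Wordlist-metadata.json")
print(check_cognate_table(dataset, strict_concatenative=True))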

lexedata.report.judgements.log_or_raise(message, logger=<Logger lexedata (INFO)>)

lexedata.report.nonconcatenative_morphemes module

lexedata.report.nonconcatenative_morphemes.cluster_overlaps(overlapping_cognatesets: ~typing.Iterable[~typing.Tuple[~lexedata.types.Cognateset_ID, ~lexedata.types.Cognateset_ID]], out=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) → None
lexedata.report.nonconcatenative_morphemes.network_of_overlaps(which_segment_belongs_to_which_cognateset: Mapping[Form_ID, List[Set[Cognateset_ID]]], forms_cache: Optional[Mapping[Form_ID, Form]] = None) → Set[Tuple[Cognateset_ID, Cognateset_ID]]
lexedata.report.nonconcatenative_morphemes.segment_to_cognateset(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], cognatesets: ~typing.Container[~lexedata.types.Cognateset_ID], logger: ~logging.Logger = <Logger lexedata (INFO)>) → Mapping[Form_ID, List[Set[Cognateset_ID]]]
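
The central data structure here is the mapping returned by segment_to_cognateset: one entry per form, with one set of cognate set IDs per segment of that form. A hedged illustration, with invented form and cognate set IDs:

# Hedged sketch: hand-built input for network_of_overlaps; all IDs invented.
from lexedata.report.nonconcatenative_morphemes import network_of_overlaps

which_segment_belongs_to_which_cognateset = {
    "form1": [{"cs1"}, {"cs1", "cs2"}, {"cs2"}],  # middle segment claimed by two cognate sets
    "form2": [{"cs1"}, set(), {"cs3"}],           # second segment in no cognate set
}
# Pairs of cognate sets that overlap on at least one segment:
print(network_of_overlaps(which_segment_belongs_to_which_cognateset))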

lexedata.report.segment_inventories module

Report the segment inventory of each language.

Report the phonemes (or whatever the #segments column represents) for each language, with frequencies and an indication of whether each segment is valid CLTS.

lexedata.report.segment_inventories.comment_on_sound(sound: str) → str

Return a comment on the sound, if necessary.

>>> comment_on_sound("a")
''
>>> comment_on_sound("_")
'Marker'
>>> comment_on_sound("(")
'Invalid BIPA'
lexedata.report.segment_inventories.count_segments(dataset: Wordlist[Language_ID, Form_ID, Parameter_ID, Cognate_ID, Cognateset_ID], languages: Container[Language_ID])
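
A hedged sketch of calling count_segments (the metadata path and language IDs are placeholders; the return value is not documented in this listing):

# Hedged sketch: count segment frequencies for two selected languages.
import pycldf
from lexedata.report.segment_inventories import count_segments

dataset = pycldf.Dataset.from_metadata("Wordlist-metadata.json")
counts = count_segments(dataset, {"language1", "language2"})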

Module contents