lexedata.util package

Submodules

lexedata.util.add_metadata module

Starting with a forms.csv, add metadata for all columns we know about.

lexedata.util.add_metadata.add_metadata(fname: ~pathlib.Path, logger: ~logging.Logger = <Logger lexedata (INFO)>)

lexedata.util.excel module

Various helper functions for Excel file parsing

class lexedata.util.excel.CellParser(dataset: ~pycldf.dataset.Dataset, element_semantics: ~typing.Iterable[~typing.Tuple[str, str, str, bool]] = [('<', '>', 'form', True), ('(', ')', 'comment', False), ('{', '}', 'source', False)], separation_pattern: str = '([;,])', variant_separator: ~typing.Optional[~typing.List[str]] = ['~', '%'], add_default_source: ~typing.Optional[str] = '1', logger: ~logging.Logger = <Logger lexedata (INFO)>)

Bases: NaiveCellParser

c: Dict[str, str]
create_cldf_form(properties: Dict[str, Any]) Optional[str]

Return the first transcription from properties as a candidate for cldf_form. The order of transcriptions corresponds to the order of cell_parser_semantics as provided in the metadata.

parse_form(form_string: str, language_id: str, cell_identifier: str = '', logger: ~logging.Logger = <Logger lexedata (INFO)>) Optional[Form]

Create a dictionary of columns from a form description.

Extract each value (transcriptions, comments, sources etc.) from a string describing a single form.

postprocess_form(properties: Dict[str, Any], language_id: str, source_delimiters: Tuple[str, str] = ('{', '}')) None

Modify the form in-place

Fix some properties of the form. This is the place to add default sources, cut off delimiters, split unmarked variants, etc.

separate(values: str, context: str = '', logger: ~logging.Logger = <Logger lexedata (INFO)>) Iterable[str]

Separate different form descriptions in one string.

Split forms separated by comma or semicolon, unless the comma or semicolon occurs within a set of matching component delimiters (e.g. brackets).

If the brackets don’t match, the whole remainder of the string is passed on, so that the form parser can try to recover as much as possible or throw an exception.
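The separation rule described above can be sketched as a simple depth-counting scan. This is a hypothetical, simplified re-implementation for illustration only (single-character delimiters, hard-coded separators); the actual method uses the dataset's configured element semantics and a regex separation pattern, and `separate_sketch` is not part of lexedata.

```python
def separate_sketch(values, bracket_pairs, separators=";,"):
    """Split a cell string at separators that are not inside brackets.

    bracket_pairs maps single-character openers to closers, e.g.
    {"<": ">", "(": ")"}. An unmatched opener keeps the remainder
    of the string together, as documented above.
    """
    pieces = []
    depth = 0      # how many brackets are currently open
    current = ""
    for ch in values:
        if ch in bracket_pairs:
            depth += 1
        elif ch in bracket_pairs.values():
            depth = max(0, depth - 1)
        if depth == 0 and ch in separators:
            pieces.append(current.strip())
            current = ""
        else:
            current += ch
    pieces.append(current.strip())
    return [p for p in pieces if p]
```

With `{"<": ">", "(": ")"}`, a comma inside parentheses does not split, while a top-level comma does.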

source_from_source_string(source_string: str, language_id: ~typing.Optional[str], delimiters: ~typing.Tuple[str, str] = ('{', '}'), logger: ~logging.Logger = <Logger lexedata (INFO)>) str

Parse a string referencing a language-specific source

property transcriptions

Bases: NaiveCellParser

c: Dict[str, str]
parse(cell: ~openpyxl.cell.cell.Cell, language_id: str, cell_identifier: str = '', logger: ~logging.Logger = <Logger lexedata (INFO)>) Iterable[Judgement]

Return form properties for every form in the cell

class lexedata.util.excel.MawetiCellParser(dataset: Dataset, element_semantics: Iterable[Tuple[str, str, str, bool]], separation_pattern: str, variant_separator: list, add_default_source: Optional[str])

Bases: CellParser

c: Dict[str, str]
postprocess_form(properties: Dict[str, Any], language_id: str) None

Post processing specific to the Maweti dataset

class lexedata.util.excel.MawetiCognateCellParser(dataset: Dataset, element_semantics: Iterable[Tuple[str, str, str, bool]], separation_pattern: str, variant_separator: list, add_default_source: Optional[str])

Bases: MawetiCellParser

c: Dict[str, str]
parse_form(values, language, cell_identifier: str = '')

Create a dictionary of columns from a form description.

Extract each value (transcriptions, comments, sources etc.) from a string describing a single form.

class lexedata.util.excel.NaiveCellParser(dataset: Dataset)

Bases: object

c: Dict[str, str]
cc(short, dataset, long=None)

Cache the name of a column, or complain if it doesn’t exist

parse(cell: ~openpyxl.cell.cell.Cell, language_id: str, cell_identifier: str = '', logger: ~logging.Logger = <Logger lexedata (INFO)>) Iterable[Form]

Return form properties for every form in the cell

parse_form(form_string: str, language_id: str, cell_identifier: str = '') Optional[Form]
separate(values: str, context: str = '') Iterable[str]

Separate different form descriptions in one string.

Separate forms separated by comma.

lexedata.util.excel.alignment_from_braces(text, start=0)

Convert a brace-delimited morpheme description into slices and alignments.

The “-” character is used as the alignment gap character, so it does not count towards the segment slices.

If opening or closing brackets are missing, the slice goes until the end of the form.

>>> alignment_from_braces("t{e x t")
([(2, 4)], ['e', 'x', 't'])
>>> alignment_from_braces("t e x}t")
([(1, 3)], ['t', 'e', 'x'])
>>> alignment_from_braces("t e x t")
([(1, 4)], ['t', 'e', 'x', 't'])
lexedata.util.excel.check_brackets(string, bracket_pairs)

Check whether all brackets match.

This function can check the matching of simple bracket pairs, like this:

>>> b = {"(": ")", "[": "]", "{": "}"}
>>> check_brackets("([])", b)
True
>>> check_brackets("([]])", b)
False
>>> check_brackets("([[])", b)
False
>>> check_brackets("This (but [not] this)", b)
True

But it can also deal with multi-character matches:

>>> b = {"(": ")", "begin": "end"}
>>> check_brackets("begin (__ (!) xxx) end", b)
True
>>> check_brackets("begin (__ (!) end) xxx", b)
False

This includes multi-character matches where some pair is a subset of another pair. Here the order of the pairs in the dictionary is important – longer pairs must be defined first.

>>> b = {":::": ":::", ":": ":"}
>>> check_brackets("::: :::", b)
True
>>> check_brackets("::::::", b)
True
>>> check_brackets("::::", b)
False
>>> check_brackets(":: ::", b)
True

In combination, these features allow for natural escape sequences:

>>> b = {"!(": "", "!)": "", "(": ")", "[": "]"}
>>> check_brackets("(text)", b)
True
>>> check_brackets("(text", b)
False
>>> check_brackets("text)", b)
False
>>> check_brackets("(te[xt)]", b)
False
>>> check_brackets("!(text", b)
True
>>> check_brackets("text!)", b)
True
>>> check_brackets("!(te[xt!)]", b)
True
lexedata.util.excel.clean_cell_value(cell: ~openpyxl.cell.cell.Cell, logger=<Logger lexedata (INFO)>)

Return the value of an Excel cell in a useful format and normalized.

lexedata.util.excel.components_in_brackets(form_string, bracket_pairs)

Find all elements delimited by complete pairs of matching brackets.

>>> b = {"!/": "", "(": ")", "[": "]", "{": "}", "/": "/"}
>>> components_in_brackets("/aha/ (exclam. !/ int., also /ah/)", b)
['', '/aha/', ' ', '(exclam. !/ int., also /ah/)', '']

Recovery from mismatched delimiters early in the string is difficult. In the following example, the parser is still waiting for the first ‘/’ to be closed when the end of the string is reached.

>>> components_in_brackets("/aha (exclam. !/ int., also /ah/)", b)
['', '/aha (exclam. !/ int., also /ah/)']
lexedata.util.excel.get_cell_comment(cell: Cell) str

Get the comment of a cell.

Get the normalized comment of a cell: guaranteed to be a string (empty if there is no comment), with line breaks replaced by spaces and all ‘lexedata’ author annotations stripped. We also remove equals signs from the beginning of comments so that they don’t get interpreted as formulas by Excel.

>>> from openpyxl.comments import Comment
>>> import openpyxl as op
>>> wb = op.Workbook()
>>> ws = wb.active
>>> ws["A1"].comment = Comment('''This comment
... contains a linebreak and a signature.
...   -lexedata.exporter''',
... 'lexedata')
>>> get_cell_comment(ws["A1"])
'This comment contains a linebreak and a signature.'
>>> get_cell_comment(ws["A2"])
''
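The normalization steps described above (joining lines with spaces, stripping ‘lexedata’ signatures, removing leading equals signs) can be sketched on a plain string, independently of openpyxl. `normalize_comment_sketch` is a hypothetical helper for illustration, not the library's implementation.

```python
def normalize_comment_sketch(text):
    """Normalize a raw comment string as documented above."""
    lines = [line.strip() for line in text.splitlines()]
    # Drop "-lexedata..." author annotations added by lexedata itself.
    lines = [line for line in lines if not line.startswith("-lexedata")]
    joined = " ".join(line for line in lines if line)
    # A leading "=" would be interpreted as a formula by Excel.
    return joined.lstrip("=").strip()
```

Applied to the doctest comment above, this yields the same single-line string.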
lexedata.util.excel.normalize_header(row: Iterable[Cell]) Iterable[str]

lexedata.util.fs module

lexedata.util.fs.copy_dataset(original: Path, target: Path) Dataset

Return a copy of the dataset at original.

Copy the dataset (metadata and relative table URLs) from original to target, and return the new dataset at target.

lexedata.util.fs.get_dataset(fname: Path) Dataset

Load a CLDF dataset.

Load the file as a JSON CLDF metadata description file, or as a metadata-free dataset contained in a single CSV file.

The distinction is made based on the file extension: .json files are loaded as metadata descriptions; all other files are matched against the CLDF module specifications. Directories are checked for the presence of any CLDF dataset, trying the dataset types in undefined order.

Parameters:

fname (str or Path) – Path to a CLDF dataset

Return type:

Dataset
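The dispatch rule described above can be sketched as a small classification helper. This is an illustration of the documented decision logic only; `dataset_kind_sketch` is hypothetical and does not actually load anything with pycldf.

```python
from pathlib import Path

def dataset_kind_sketch(fname: Path) -> str:
    """Classify a path the way get_dataset's documentation describes."""
    if fname.suffix.lower() == ".json":
        return "metadata"       # JSON metadata description file
    if fname.is_dir():
        return "directory-scan" # search the directory for any CLDF dataset
    return "metadata-free"      # single-table dataset, matched by module spec
```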

lexedata.util.fs.new_wordlist(path: Optional[Path] = None, **data) Wordlist[str, str, str, str, str]

Create a new CLDF wordlist.

By default, the wordlist is created in a new temporary directory, but you can specify a path to create it in.

To immediately fill some tables, provide keyword arguments. The necessary components will be created in default shape, so

>>> ds = new_wordlist()

will only have a FormTable

>>> [table.url.string for table in ds.tables]
['forms.csv']

but it is possible to generate a dataset with more tables from scratch

>>> ds = new_wordlist(
...     FormTable=[],
...     LanguageTable=[],
...     ParameterTable=[],
...     CognatesetTable=[],
...     CognateTable=[])
>>> [table.url.string for table in ds.tables]
['forms.csv', 'languages.csv', 'parameters.csv', 'cognatesets.csv', 'cognates.csv']
>>> sorted(f.name for f in ds.directory.iterdir())
['Wordlist-metadata.json', 'cognates.csv', 'cognatesets.csv', 'forms.csv', 'languages.csv', 'parameters.csv']

lexedata.util.simplify_ids module

lexedata.util.simplify_ids.clean_mapping(rows: ~typing.Mapping[str, ~typing.Mapping[str, str]], additional_normalize: ~typing.Callable[[str], str] = <method 'lower' of 'str' objects>) Mapping[str, str]

Create unique normalized IDs.

>>> clean_mapping({"A": {}, "B": {}})
{'A': 'a', 'B': 'b'}
>>> clean_mapping({"A": {}, "a": {}})
{'A': 'a', 'a': 'a_x2'}
>>> clean_mapping({"A": {}, "a": {}}, str.upper)
{'A': 'A', 'a': 'A_x2'}
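The collision handling visible in the doctests can be sketched as follows. `clean_mapping_sketch` is a hypothetical simplification: the real function takes a mapping of rows rather than a plain list of keys, and also slugifies IDs beyond the normalization callable shown here.

```python
def clean_mapping_sketch(keys, normalize=str.lower):
    """Map each key to a normalized ID, suffixing _x2, _x3, ... on collision."""
    seen = {}     # normalized ID -> how many times it has occurred
    mapping = {}
    for key in keys:
        base = normalize(key)
        n = seen.get(base, 0) + 1
        seen[base] = n
        mapping[key] = base if n == 1 else f"{base}_x{n}"
    return mapping
```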
lexedata.util.simplify_ids.simplify_table_ids_and_references(ds: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], table: ~csvw.metadata.Table, transparent: bool = True, logger: ~logging.Logger = <Logger lexedata (INFO)>, additional_normalize: ~typing.Callable[[str], str] = <method 'lower' of 'str' objects>) bool

Simplify the IDs of the given table.

lexedata.util.simplify_ids.update_ids(ds: ~pycldf.dataset.Dataset, table: ~csvw.metadata.Table, mapping: ~typing.Mapping[str, str], logger: ~logging.Logger = <Logger lexedata (INFO)>)

Update all IDs of the table in the database, also in foreign keys, according to mapping.

lexedata.util.simplify_ids.update_integer_ids(ds: ~pycldf.dataset.Dataset, table: ~csvw.metadata.Table, logger: ~logging.Logger = <Logger lexedata (INFO)>)

Update all IDs of the table in the database, also in foreign keys.

Module contents

class lexedata.util.KeyKeyDict

Bases: Mapping[str, str]
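Judging from the name and the Mapping[str, str] base, KeyKeyDict acts as an identity mapping: looking up any key returns the key itself, which makes it a convenient default when no ID renaming is needed. A minimal sketch of that assumed behaviour (the actual lexedata class may differ in details):

```python
from collections.abc import Mapping

class KeyKeyDictSketch(Mapping):
    """Identity mapping: m[key] == key for every key.

    Hypothetical re-implementation based on the class name and its
    Mapping base; it stores nothing and enumerates no keys.
    """
    def __getitem__(self, key):
        return key          # every key maps to itself
    def __iter__(self):
        return iter(())     # no stored keys to enumerate
    def __len__(self):
        return 0
```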