lexedata.importer package

Submodules

lexedata.importer.cognates module

Load #cognate and #cognatesets from an Excel file into CLDF.

class lexedata.importer.cognates.CognateEditParser(output_dataset: ~pycldf.dataset.Dataset, row_type=<class 'lexedata.types.CogSet'>, top: int = 2, cellparser: ~lexedata.util.excel.NaiveCellParser = <class 'lexedata.util.excel.CellParser'>, row_header=['set', 'Name', None], check_for_match: ~typing.List[str] = ['Form'], check_for_row_match: ~typing.List[str] = ['Name'], check_for_language_match: ~typing.List[str] = ['Name'])

Bases: ExcelCognateParser

language_from_column(column: List[Cell]) Language
properties_from_row(row: List[Cell]) Optional[RowObject]
lexedata.importer.cognates.header_from_cognate_excel(ws: ~openpyxl.worksheet.worksheet.Worksheet, dataset: ~pycldf.dataset.Dataset, logger: ~logging.Logger = <Logger lexedata (INFO)>)
lexedata.importer.cognates.import_cognates_from_excel(ws: ~openpyxl.worksheet.worksheet.Worksheet, dataset: ~pycldf.dataset.Dataset, extractor: ~re.Pattern = re.compile('/(?P<ID>[^/]*)/?$'), logger: ~logging.Logger = <Logger lexedata (INFO)>) None

lexedata.importer.edictor module

lexedata.importer.edictor.edictor_to_cldf(dataset: Wordlist[Language_ID, Form_ID, Parameter_ID, Cognate_ID, Cognateset_ID], new_cogsets: Mapping[Cognateset_ID, List[Tuple[Form_ID, range, Sequence[str]]]], affected_forms: Set[Form_ID], source: List[str] = [])
lexedata.importer.edictor.extract_partial_judgements(segments: ~typing.Sequence[str], cognatesets: ~typing.Sequence[int], global_alignment: ~typing.Sequence[str], logger: ~logging.Logger = <Logger lexedata (INFO)>) Iterator[Tuple[range, int, Sequence[str]]]

Extract the different partial cognate judgements.

segments contains no morpheme boundary markers; these are inferred from global_alignment. The number of cognatesets must match the number of marked segments in global_alignment.

>>> next(extract_partial_judgements("t e s t".split(), [3], "t e s t".split()))
(range(0, 4), 3, ['t', 'e', 's', 't'])
>>> partial = extract_partial_judgements("t e s t".split(), [0, 1, 2], "( t ) + e - s + - t -".split())
>>> next(partial)
(range(1, 3), 1, ['e', '-', 's'])
>>> next(partial)
(range(3, 4), 2, ['-', 't', '-'])
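The behaviour shown in the doctests above can be approximated with a short standalone sketch. It is not lexedata's implementation: the name sketch_partial_judgements is hypothetical, and the rules it encodes ("+" separates morphemes, "-", "(" and ")" do not count as segments, cognateset 0 means "no judgement") are assumptions read off the two examples.

```python
from typing import Iterator, List, Sequence, Tuple

# Alignment tokens assumed not to correspond to a segment of the form:
# gap markers and the brackets around ignored material.
NON_SEGMENTS = {"-", "(", ")"}


def sketch_partial_judgements(
    segments: Sequence[str],
    cognatesets: Sequence[int],
    alignment: Sequence[str],
) -> Iterator[Tuple[range, int, List[str]]]:
    """Hypothetical re-creation of the documented behaviour."""
    # Split the alignment into morphemes at every "+".
    morphemes: List[List[str]] = [[]]
    for token in alignment:
        if token == "+":
            morphemes.append([])
        else:
            morphemes[-1].append(token)
    if len(morphemes) != len(cognatesets):
        raise ValueError("need exactly one cognateset per morpheme")

    start = 0
    for cognateset, morpheme in zip(cognatesets, morphemes):
        # Count only the tokens that are real segments of the form.
        length = sum(1 for token in morpheme if token not in NON_SEGMENTS)
        # Assumption: cognateset 0 marks "no judgement" and is skipped.
        if cognateset:
            yield range(start, start + length), cognateset, morpheme
        start += length

    if start != len(segments):
        raise ValueError("alignment does not cover the segments")
```

Each yielded range indexes into segments, so the slice of the form that belongs to a judgement can be recovered without the boundary markers.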
lexedata.importer.edictor.load_forms_from_tsv(dataset: ~lexedata.types.Wordlist[~lexedata.types.Language_ID, ~lexedata.types.Form_ID, ~lexedata.types.Parameter_ID, ~lexedata.types.Cognate_ID, ~lexedata.types.Cognateset_ID], input_file: ~pathlib.Path, logger: ~logging.Logger = <Logger lexedata (INFO)>) Mapping[int, Sequence[Tuple[Form_ID, range, Sequence[str]]]]

Side effects

This function overwrites the dataset’s FormTable.

lexedata.importer.edictor.match_cognatesets(new_cognatesets: Mapping[int, Sequence[Tuple[Form_ID, range, Sequence[str]]]], reference_cognatesets: Mapping[Cognateset_ID, Sequence[Tuple[Form_ID, range, Sequence[str]]]]) Mapping[int, Optional[Cognateset_ID]]

Match two different cognateset assignments with each other.

Map the new_cognatesets to the reference_cognatesets by trying to maximize the overlap between each new cognateset and the reference cognateset it is mapped to.

So, if two cognatesets got merged, they get mapped to the bigger one:

>>> match_cognatesets(
...   {0: ["a", "b", "c"]},
...   {">": ["a", "b"], "<": ["c"]}
... )
{0: '>'}

If a cognateset got split, it gets mapped to the bigger part and the other part becomes a new, unmapped set:

>>> match_cognatesets(
...   {0: ["a", "b"], 1: ["c"]},
...   {"Σ": ["a", "b", "c"]}
... )
{0: 'Σ', 1: None}

If a single form (or a relatively small number) gets moved between cognatesets, the mapping is maintained:

>>> match_cognatesets(
...   {0: ["a", "b"], 1: ["c", "d", "e"]},
...   {0: ["a", "b", "c"], 1: ["d", "e"]}
... ) == {1: 1, 0: 0}
True

(As you see, the function is a bit more general than the type signature implies.)
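One simple way to obtain mappings like the ones above is a greedy assignment by overlap size. The sketch below is an illustration only: greedy_match is a hypothetical name, the greedy strategy is an assumption (lexedata may use a different, possibly optimal, matching procedure), and set members must be hashable.

```python
from typing import Dict, Hashable, List, Mapping, Optional, Sequence, Tuple


def greedy_match(
    new: Mapping[Hashable, Sequence[Hashable]],
    reference: Mapping[Hashable, Sequence[Hashable]],
) -> Dict[Hashable, Optional[Hashable]]:
    """Hypothetical greedy matcher; not lexedata's algorithm."""
    # Collect the size of every non-empty overlap between a new set
    # and a reference set.
    overlaps: List[Tuple[int, Hashable, Hashable]] = []
    for new_id, new_members in new.items():
        for ref_id, ref_members in reference.items():
            shared = len(set(new_members) & set(ref_members))
            if shared:
                overlaps.append((shared, new_id, ref_id))

    # Largest overlaps first; the sort is stable, so ties keep input order.
    overlaps.sort(key=lambda item: item[0], reverse=True)

    matching: Dict[Hashable, Optional[Hashable]] = {k: None for k in new}
    used = set()
    for _, new_id, ref_id in overlaps:
        # Every new set and every reference set is matched at most once.
        if matching[new_id] is None and ref_id not in used:
            matching[new_id] = ref_id
            used.add(ref_id)
    return matching
```

On the three examples above this sketch gives the same mappings as the doctests. A greedy choice is not guaranteed to maximize the total overlap in general, so treat it as a way to build intuition rather than as a replacement for match_cognatesets.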

lexedata.importer.excel_interleaved module

Import data in the “interleaved” format from an Excel spreadsheet.

Here, every even row contains cells with forms (cells may contain multiple forms), while every odd row contains the associated cognate codes (a one-to-one relationship between forms and codes is expected). Forms and cognate codes are separated by commas (“,”) and semi-colons (“;”). Any other information existing in the cell will be parsed as part of the form or the cognate code.
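The pairing of a form cell with the cognate-code cell below it can be sketched as follows. This is a minimal illustration under assumptions, not the importer's code: pair_forms_with_codes is a hypothetical helper, and raising an error on a length mismatch is one possible way to enforce the expected one-to-one relationship.

```python
import re
from typing import List, Tuple

# Forms and cognate codes are separated by commas and semicolons.
SEPARATOR = re.compile(r"[,;]")


def pair_forms_with_codes(form_cell: str, code_cell: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: pair each form with its cognate code."""
    forms = [part.strip() for part in SEPARATOR.split(form_cell) if part.strip()]
    codes = [part.strip() for part in SEPARATOR.split(code_cell) if part.strip()]
    # A one-to-one relationship between forms and codes is expected.
    if len(forms) != len(codes):
        raise ValueError(
            f"{len(forms)} forms but {len(codes)} cognate codes in the cell pair"
        )
    return list(zip(forms, codes))


pairs = pair_forms_with_codes("mano, main; hand", "1, 1; 2")
```

Anything else in a cell, such as a bracketed comment, stays attached to the form or code it appears with, because only the two separator characters are treated specially.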

lexedata.importer.excel_interleaved.import_interleaved(ws: ~openpyxl.worksheet.worksheet.Worksheet, logger: ~logging.Logger = <Logger lexedata (INFO)>, ids: ~typing.Optional[~typing.Set[~lexedata.types.Cognateset_ID]] = None) Iterable[Tuple[Form_ID, Language_ID, Parameter_ID, Optional[str], Optional[str], Cognateset_ID]]

lexedata.importer.excel_long_format module

class lexedata.importer.excel_long_format.ImportLanguageReport(is_new_language: bool = False, new: int = 0, existing: int = 0, skipped: int = 0, concepts: int = 0)

Bases: object

concepts: int
existing: int
is_new_language: bool
new: int
skipped: int
lexedata.importer.excel_long_format.add_single_languages(dataset: Dataset, sheets: Iterable[Worksheet], match_form: Optional[List[str]], concept_name: Optional[str], language_name: Optional[str], ignore_missing: bool, ignore_superfluous: bool, status_update: Optional[str], logger: Logger, missing_concepts: Set[str] = {}) Mapping[str, ImportLanguageReport]
lexedata.importer.excel_long_format.get_headers_from_excel(sheet: Worksheet) Iterable[str]
lexedata.importer.excel_long_format.import_data_from_sheet(sheet, sheet_header, implicit: Mapping[Literal['languageReference', 'id', 'value', 'Status_Column'], str] = {}, concept_column: Tuple[str, str] = ('Concept_ID', 'Concept_ID'), skip_if_questionmark: Container[str] = {}) Iterable[Form]
lexedata.importer.excel_long_format.parser()
lexedata.importer.excel_long_format.read_single_excel_sheet(dataset: ~pycldf.dataset.Dataset, sheet: ~openpyxl.worksheet.worksheet.Worksheet, logger: ~logging.Logger = <Logger lexedata (INFO)>, match_form: ~typing.Optional[~typing.List[str]] = None, entries_to_concepts: ~typing.Mapping[str, str] = <lexedata.types.KeyKeyDict object>, concept_column: ~typing.Optional[str] = None, language_name_column: ~typing.Optional[str] = None, ignore_missing: bool = False, ignore_superfluous: bool = False, status_update: ~typing.Optional[str] = None, missing_concepts: ~typing.Set[str] = {}) Mapping[str, ImportLanguageReport]

lexedata.importer.excel_matrix module

class lexedata.importer.excel_matrix.DB(output_dataset: Wordlist)

Bases: object

An in-memory cache of a dataset.

The cache_dataset method is called only from load_dataset, but the cognates importer also calls it directly for finer control. As a consequence, a CognateParser that is instantiated directly starts with an empty cache and fails (e.g. with a KeyError when looking for candidates). If you use the CognateParser elsewhere, make sure to cache the dataset explicitly, e.g. by using DB.from_dataset!
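Why an unfilled cache leads to a KeyError can be seen in a small stand-in that mirrors the nested shape of the cache attribute (table name, then row ID, then row). TinyCache, fill and lookup are hypothetical names for illustration; they are not part of lexedata.

```python
from typing import Any, Dict, Hashable, Iterable, Mapping


class TinyCache:
    """Hypothetical stand-in with the same nesting as DB.cache."""

    def __init__(self) -> None:
        # table name -> row ID -> row (column name -> value)
        self.cache: Dict[str, Dict[Hashable, Dict[str, Any]]] = {}

    def fill(self, tables: Mapping[str, Iterable[Dict[str, Any]]]) -> None:
        # Index every row of every table by its "ID" column.
        for table, rows in tables.items():
            self.cache[table] = {row["ID"]: row for row in rows}

    def lookup(self, table: str, row_id: Hashable) -> Dict[str, Any]:
        # Fails with KeyError if the table was never cached.
        return self.cache[table][row_id]


db = TinyCache()
try:
    db.lookup("FormTable", "f1")
    found_before_fill = True
except KeyError:
    found_before_fill = False

db.fill({"FormTable": [{"ID": "f1", "Form": "mano"}]})
row = db.lookup("FormTable", "f1")
```

In lexedata itself the equivalent of fill is the explicit caching step described above, for example constructing the cache through DB.from_dataset.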

associate(form_id: str, row: RowObject, comment: Optional[str] = None) bool
cache: Dict[str, Dict[Hashable, Dict[str, Any]]]
cache_dataset(logger: ~logging.Logger = <Logger lexedata (INFO)>)
commit()
drop_from_cache(table: str)
empty_cache()
find_db_candidates(object: Ob, properties_for_match: Iterable[str], edit_dist_threshold: Optional[int] = None) Iterable[str]
classmethod from_dataset(dataset, logger: ~logging.Logger = <Logger lexedata (INFO)>)

Create a (filled) cache from a dataset.

insert_into_db(object: Object) None
make_id_unique(object: Object) str
retrieve(table_type: str)
write_dataset_from_cache(tables: Optional[Iterable[str]] = None)
class lexedata.importer.excel_matrix.Dialect(logger: ~logging.Logger = <Logger lexedata (INFO)>, **kwargs)

Bases: object

class lexedata.importer.excel_matrix.ExcelCognateParser(output_dataset: ~pycldf.dataset.Dataset, row_type=<class 'lexedata.types.CogSet'>, top: int = 2, cellparser: ~lexedata.util.excel.NaiveCellParser = <class 'lexedata.util.excel.CellParser'>, row_header=['set', 'Name', None], check_for_match: ~typing.List[str] = ['Form'], check_for_row_match: ~typing.List[str] = ['Name'], check_for_language_match: ~typing.List[str] = ['Name'])

Bases: ExcelParser[CogSet]

associate(form_id: str, row: RowObject, comment: Optional[str] = None) bool
handle_form(params, row_object: CogSet, cell_with_forms, this_lan, status_update: Optional[str])
on_form_not_found(form: ~typing.Dict[str, ~typing.Any], cell_identifier: ~typing.Optional[str] = None, language_id: ~typing.Optional[str] = None, logger: ~logging.Logger = <Logger lexedata (INFO)>) bool

Should I add a missing object? No, but inform the user.

Send a warning (ObjectNotFoundWarning) reporting the missing object and cell.

Returns:

False: the object should not be added.

Return type:

bool
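The on_*_not_found methods act as hooks whose boolean return value decides whether a missing object is created. A minimal sketch of that pattern, with hypothetical class names and a stand-in warning class (lexedata's own ObjectNotFoundWarning is not imported here):

```python
import warnings
from typing import Any, Dict, Optional


class MissingObjectWarning(UserWarning):
    """Stand-in for the library's warning class (assumption)."""


class CreatingImporter:
    """Base behaviour: missing forms are created."""

    def __init__(self) -> None:
        self.forms: Dict[str, Dict[str, Any]] = {}

    def on_form_not_found(
        self, form: Dict[str, Any], cell_identifier: Optional[str] = None
    ) -> bool:
        return True  # yes, add the missing form

    def handle_form(
        self, form: Dict[str, Any], cell_identifier: Optional[str] = None
    ) -> None:
        # Ask the hook only when the form is not known yet.
        if form["ID"] not in self.forms and self.on_form_not_found(
            form, cell_identifier
        ):
            self.forms[form["ID"]] = form


class ReportingImporter(CreatingImporter):
    """Override: do not create the form, but inform the user."""

    def on_form_not_found(
        self, form: Dict[str, Any], cell_identifier: Optional[str] = None
    ) -> bool:
        warnings.warn(
            f"Form {form['ID']} in cell {cell_identifier} not found",
            MissingObjectWarning,
        )
        return False
```

Keeping the decision in an overridable method lets the form importer and the cognate importer share the cell-walking logic while differing only in how they react to unknown forms, languages or rows.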

on_language_not_found(language: Dict[str, Any], cell_identifier: Optional[str] = None) bool

Should I add a missing object? No, the object missing is an error.

Raise an exception (ObjectNotFoundWarning) reporting the missing object and cell.

Raises:

ObjectNotFoundWarning

on_row_not_found(row_object: CogSet, cell_identifier: Optional[str] = None) bool

Create row object

properties_from_row(row: List[Cell]) Optional[CogSet]
class lexedata.importer.excel_matrix.ExcelParser(output_dataset: ~pycldf.dataset.Dataset, row_type: ~typing.Type[~lexedata.types.R], top: int = 2, cellparser: ~lexedata.util.excel.NaiveCellParser = <class 'lexedata.util.excel.CellParser'>, row_header: ~typing.List[str] = ['set', 'Name', None], check_for_match: ~typing.List[str] = ['ID'], check_for_row_match: ~typing.List[str] = ['Name'], check_for_language_match: ~typing.List[str] = ['Name'], fuzzy=0)

Bases: Generic[R]

handle_form(params, row_object: R, cell_with_forms, this_lan: str, status_update: Optional[str])
language_from_column(column: List[Cell]) Language
on_form_not_found(form: Dict[str, Any], cell_identifier: Optional[str] = None, language_id: Optional[str] = None) bool

Create form

on_language_not_found(language: Dict[str, Any], cell_identifier: Optional[str] = None) bool

Create language

on_row_not_found(row_object: R, cell_identifier: Optional[str] = None) bool

Create row object

parse_all_languages(sheet: Worksheet) Dict[str, str]

Parse all language descriptions in the focal sheet.

Returns:

languages: a dictionary mapping columns (“B”, “C”, “D”, …) to language IDs

Return type:

Dict[str, str]

parse_cells(sheet: Worksheet, status_update: Optional[str] = None) None
properties_from_row(row: List[Cell]) Optional[R]
lexedata.importer.excel_matrix.cells_are_empty(cells: Iterable[Cell]) bool
lexedata.importer.excel_matrix.excel_parser_from_dialect(output_dataset: Wordlist, dialect: Dialect, cognate: bool) Type[ExcelParser]
lexedata.importer.excel_matrix.load_dataset(metadata: ~pathlib.Path, lexicon: ~typing.Optional[str], cognate_lexicon: ~typing.Optional[str] = None, status_update: ~typing.Optional[str] = None, logger: ~logging.Logger = <Logger lexedata (INFO)>)

Module contents