lexedata.importer package

Submodules

lexedata.importer.cognates module

Load #cognate and #cognatesets from an Excel file into CLDF

class lexedata.importer.cognates.CognateEditParser(output_dataset: pycldf.dataset.Dataset, row_type=<class 'lexedata.types.CogSet'>, top: int = 2, cellparser: lexedata.util.excel.NaiveCellParser = <class 'lexedata.util.excel.CellParser'>, row_header=['set', 'Name', None], check_for_match: typing.List[str] = ['Form'], check_for_row_match: typing.List[str] = ['Name'], check_for_language_match: typing.List[str] = ['Name'])

Bases: lexedata.importer.excel_matrix.ExcelCognateParser

language_from_column(column: List[openpyxl.cell.cell.Cell]) lexedata.types.Language
properties_from_row(row: List[openpyxl.cell.cell.Cell]) Optional[lexedata.types.RowObject]
lexedata.importer.cognates.header_from_cognate_excel(ws: openpyxl.worksheet.worksheet.Worksheet, dataset: pycldf.dataset.Dataset, logger: logging.Logger = <Logger lexedata (INFO)>)
lexedata.importer.cognates.import_cognates_from_excel(ws: openpyxl.worksheet.worksheet.Worksheet, dataset: pycldf.dataset.Dataset, extractor: re.Pattern = re.compile('/(?P<ID>[^/]*)/?$'), logger: logging.Logger = <Logger lexedata (INFO)>) None

lexedata.importer.edictor module

lexedata.importer.edictor.edictor_to_cldf(dataset: lexedata.types.Wordlist[lexedata.types.Language_ID, lexedata.types.Form_ID, lexedata.types.Parameter_ID, lexedata.types.Cognate_ID, lexedata.types.Cognateset_ID], new_cogsets: Mapping[lexedata.types.Cognateset_ID, List[Tuple[lexedata.types.Form_ID, range, Sequence[str]]]], affected_forms: Set[lexedata.types.Form_ID], source: List[str] = [])
lexedata.importer.edictor.extract_partial_judgements(segments: typing.Sequence[str], cognatesets: typing.Sequence[int], global_alignment: typing.Sequence[str], logger: logging.Logger = <Logger lexedata (INFO)>) Iterator[Tuple[range, int, Sequence[str]]]

Extract the different partial cognate judgements.

The segments sequence carries no morpheme boundary markers; they are inferred from global_alignment. The number of cognatesets must match the number of marked morphemes in global_alignment.

>>> next(extract_partial_judgements("t e s t".split(), [3], "t e s t".split()))
(range(0, 4), 3, ['t', 'e', 's', 't'])
>>> partial = extract_partial_judgements("t e s t".split(), [0, 1, 2], "( t ) + e - s + - t -".split())
>>> next(partial)
(range(1, 3), 1, ['e', '-', 's'])
>>> next(partial)
(range(3, 4), 2, ['-', 't', '-'])
lexedata.importer.edictor.load_forms_from_tsv(dataset: lexedata.types.Wordlist[lexedata.types.Language_ID, lexedata.types.Form_ID, lexedata.types.Parameter_ID, lexedata.types.Cognate_ID, lexedata.types.Cognateset_ID], input_file: pathlib.Path, logger: logging.Logger = <Logger lexedata (INFO)>) Mapping[int, Sequence[Tuple[lexedata.types.Form_ID, range, Sequence[str]]]]

This function overwrites the dataset’s FormTable.

lexedata.importer.edictor.match_cognatesets(new_cognatesets: Mapping[int, Sequence[Tuple[lexedata.types.Form_ID, range, Sequence[str]]]], reference_cognatesets: Mapping[lexedata.types.Cognateset_ID, Sequence[Tuple[lexedata.types.Form_ID, range, Sequence[str]]]]) Mapping[int, Optional[lexedata.types.Cognateset_ID]]

Match two different cognateset assignments with each other.

Map the new_cognatesets to the reference_cognatesets by trying to maximize the overlap between each new cognateset and the reference cognateset it is mapped to.

So, if two cognatesets got merged, they get mapped to the bigger one:

>>> match_cognatesets(
...   {0: ["a", "b", "c"]},
...   {">": ["a", "b"], "<": ["c"]}
... )
{0: '>'}

If a cognateset got split, it gets mapped to the bigger part and the other part becomes a new, unmapped set:

>>> match_cognatesets(
...   {0: ["a", "b"], 1: ["c"]},
...   {"Σ": ["a", "b", "c"]}
... )
{0: 'Σ', 1: None}

If a single form (or a relatively small number) gets moved between cognatesets, the mapping is maintained:

>>> match_cognatesets(
...   {0: ["a", "b"], 1: ["c", "d", "e"]},
...   {0: ["a", "b", "c"], 1: ["d", "e"]}
... ) == {1: 1, 0: 0}
True

(As you see, the function is a bit more general than the type signature implies.)

lexedata.importer.excel_interleaved module

Import data in the “interleaved” format from an Excel spreadsheet.

Here, every even row contains cells with forms (a cell may contain multiple forms), while every odd row contains the associated cognate codes (a one-to-one relationship between forms and codes is expected). Forms and cognate codes are separated by commas (“,”) and semicolons (“;”). Any other information in a cell is parsed as part of the form or the cognate code.
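The even/odd row pairing described above can be sketched as follows. This is an illustrative, self-contained sketch, not the actual import_interleaved implementation: rows are assumed to arrive as lists of cell strings rather than openpyxl cells, and the helper name parse_interleaved is hypothetical.

```python
import re
from typing import Iterator, List, Optional, Tuple

# Split a cell on the separators used by the interleaved format: "," and ";".
SEPARATORS = re.compile(r"[,;]")


def parse_interleaved(rows: List[List[Optional[str]]]) -> Iterator[Tuple[str, str]]:
    """Pair every form row with the cognate-code row directly below it.

    Yields (form, cognate_code) pairs, matching forms and codes
    one-to-one within each cell.
    """
    # Iterate over rows two at a time: a forms row, then its codes row.
    for form_row, code_row in zip(rows[::2], rows[1::2]):
        for form_cell, code_cell in zip(form_row, code_row):
            if not form_cell:
                continue
            forms = [f.strip() for f in SEPARATORS.split(form_cell)]
            codes = [c.strip() for c in SEPARATORS.split(code_cell or "")]
            # The format expects a one-to-one relationship between forms
            # and codes; pad missing codes with the empty string.
            codes += [""] * (len(forms) - len(codes))
            yield from zip(forms, codes)
```

For example, a forms cell “mano, hand” above a codes cell “1; 2” yields the pairs (“mano”, “1”) and (“hand”, “2”).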

lexedata.importer.excel_interleaved.import_interleaved(ws: openpyxl.worksheet.worksheet.Worksheet, logger: logging.Logger = <Logger lexedata (INFO)>, ids: typing.Optional[typing.Set[lexedata.types.Cognateset_ID]] = None) Iterable[Tuple[lexedata.types.Form_ID, lexedata.types.Language_ID, lexedata.types.Parameter_ID, Optional[str], Optional[str], lexedata.types.Cognateset_ID]]

lexedata.importer.excel_long_format module

class lexedata.importer.excel_long_format.ImportLanguageReport(is_new_language: bool = False, new: int = 0, existing: int = 0, skipped: int = 0, concepts: int = 0)

Bases: object

concepts: int
existing: int
is_new_language: bool
new: int
skipped: int
lexedata.importer.excel_long_format.add_single_languages(metadata: pathlib.Path, sheets: Iterable[openpyxl.worksheet.worksheet.Worksheet], match_form: Optional[List[str]], concept_name: Optional[str], language_name: Optional[str], ignore_missing: bool, ignore_superfluous: bool, status_update: Optional[str], logger: logging.Logger, missing_concepts: Set[str] = {}) Mapping[str, lexedata.importer.excel_long_format.ImportLanguageReport]
lexedata.importer.excel_long_format.get_headers_from_excel(sheet: openpyxl.worksheet.worksheet.Worksheet) Iterable[str]
lexedata.importer.excel_long_format.import_data_from_sheet(sheet, sheet_header, language_id: str, implicit: Mapping[Literal['languageReference', 'id', 'value'], str] = {}, concept_column: Tuple[str, str] = ('Concept_ID', 'Concept_ID')) Iterable[lexedata.types.Form]
lexedata.importer.excel_long_format.parser()
lexedata.importer.excel_long_format.read_single_excel_sheet(dataset: pycldf.dataset.Dataset, sheet: openpyxl.worksheet.worksheet.Worksheet, logger: logging.Logger = <Logger lexedata (INFO)>, match_form: typing.Optional[typing.List[str]] = None, entries_to_concepts: typing.Mapping[str, str] = <lexedata.types.KeyKeyDict object>, concept_column: typing.Optional[str] = None, language_name_column: typing.Optional[str] = None, ignore_missing: bool = False, ignore_superfluous: bool = False, status_update: typing.Optional[str] = None, missing_concepts: typing.Set[str] = {}) Mapping[str, lexedata.importer.excel_long_format.ImportLanguageReport]

lexedata.importer.excel_matrix module

class lexedata.importer.excel_matrix.DB(output_dataset: pycldf.dataset.Wordlist)

Bases: object

An in-memory cache of a dataset.

The cache_dataset method is called only from the load_dataset method, but the cognates importer also uses it for finer control. This means that if you load the CognateParser directly, it will not work: the cache is empty and the code will throw errors (e.g. a KeyError when trying to look for candidates). If you use the CognateParser elsewhere, make sure to cache the dataset explicitly, e.g. by using DB.from_dataset!
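The cache’s declared shape (table name → row ID → row dict) makes the failure mode above concrete: any lookup before the cache is filled raises a KeyError. A minimal sketch, using a plain dict in place of the real DB class (the table and row names here are purely illustrative):

```python
from typing import Any, Dict, Hashable

# Illustrative stand-in for DB.cache: table name -> row ID -> row dict.
cache: Dict[str, Dict[Hashable, Dict[str, Any]]] = {}

# Looking for candidates before the dataset was cached fails:
try:
    cache["FormTable"]["form-1"]
    found = True
except KeyError:
    found = False

# After an explicit caching step (what DB.from_dataset performs on a
# real pycldf dataset), the same lookup succeeds.
cache["FormTable"] = {"form-1": {"ID": "form-1", "Form": "mano"}}
row = cache["FormTable"]["form-1"]
```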

add_source(source_id)
associate(form_id: str, row: lexedata.types.RowObject, comment: Optional[str] = None) bool
cache: Dict[str, Dict[Hashable, Dict[str, Any]]]
cache_dataset(logger: logging.Logger = <Logger lexedata (INFO)>)
commit()
drop_from_cache(table: str)
empty_cache()
find_db_candidates(object: lexedata.importer.excel_matrix.Ob, properties_for_match: Iterable[str], edit_dist_threshold: Optional[int] = None) Iterable[str]
classmethod from_dataset(dataset, logger: logging.Logger = <Logger lexedata (INFO)>)

Create a (filled) cache from a dataset.

insert_into_db(object: lexedata.types.Object) None
make_id_unique(object: lexedata.types.Object) str
retrieve(table_type: str)
source_ids: Set[str]
write_dataset_from_cache(tables: Optional[Iterable[str]] = None)
class lexedata.importer.excel_matrix.ExcelCognateParser(output_dataset: pycldf.dataset.Dataset, row_type=<class 'lexedata.types.CogSet'>, top: int = 2, cellparser: lexedata.util.excel.NaiveCellParser = <class 'lexedata.util.excel.CellParser'>, row_header=['set', 'Name', None], check_for_match: typing.List[str] = ['Form'], check_for_row_match: typing.List[str] = ['Name'], check_for_language_match: typing.List[str] = ['Name'])

Bases: lexedata.importer.excel_matrix.ExcelParser[lexedata.types.CogSet]

associate(form_id: str, row: lexedata.types.RowObject, comment: Optional[str] = None) bool
handle_form(params, row_object: lexedata.types.CogSet, cell_with_forms, this_lan, status_update: Optional[str])
on_form_not_found(form: typing.Dict[str, typing.Any], cell_identifier: typing.Optional[str] = None, language_id: typing.Optional[str] = None, logger: logging.Logger = <Logger lexedata (INFO)>) bool

Should I add a missing object? No, but inform the user.

Send a warning (ObjectNotFoundWarning) reporting the missing object and cell.

Returns

False – the object should not be added.

Return type

bool

on_language_not_found(language: Dict[str, Any], cell_identifier: Optional[str] = None) bool

Should I add a missing object? No, a missing object is an error.

Raise an exception (ObjectNotFoundWarning) reporting the missing object and cell.

Raises

ObjectNotFoundWarning

on_row_not_found(row_object: lexedata.types.CogSet, cell_identifier: Optional[str] = None) bool

Create row object

properties_from_row(row: List[openpyxl.cell.cell.Cell]) Optional[lexedata.types.CogSet]
class lexedata.importer.excel_matrix.ExcelParser(output_dataset: pycldf.dataset.Dataset, row_type: typing.Type[lexedata.types.R], top: int = 2, cellparser: lexedata.util.excel.NaiveCellParser = <class 'lexedata.util.excel.CellParser'>, row_header: typing.List[str] = ['set', 'Name', None], check_for_match: typing.List[str] = ['ID'], check_for_row_match: typing.List[str] = ['Name'], check_for_language_match: typing.List[str] = ['Name'], fuzzy=0)

Bases: Generic[lexedata.types.R]

handle_form(params, row_object: lexedata.types.R, cell_with_forms, this_lan: str, status_update: Optional[str])
language_from_column(column: List[openpyxl.cell.cell.Cell]) lexedata.types.Language
on_form_not_found(form: Dict[str, Any], cell_identifier: Optional[str] = None, language_id: Optional[str] = None) bool

Create form

on_language_not_found(language: Dict[str, Any], cell_identifier: Optional[str] = None) bool

Create language

on_row_not_found(row_object: lexedata.types.R, cell_identifier: Optional[str] = None) bool

Create row object

parse_all_languages(sheet: openpyxl.worksheet.worksheet.Worksheet) Dict[str, str]

Parse all language descriptions in the focal sheet.

Returns

languages

Return type

A dictionary mapping columns (“B”, “C”, “D”, …) to language IDs

parse_cells(sheet: openpyxl.worksheet.worksheet.Worksheet, status_update: Optional[str] = None) None
properties_from_row(row: List[openpyxl.cell.cell.Cell]) Optional[lexedata.types.R]
lexedata.importer.excel_matrix.cells_are_empty(cells: Iterable[openpyxl.cell.cell.Cell]) bool
lexedata.importer.excel_matrix.excel_parser_from_dialect(output_dataset: pycldf.dataset.Wordlist, dialect: NamedTuple, cognate: bool) Type[lexedata.importer.excel_matrix.ExcelParser]
lexedata.importer.excel_matrix.load_dataset(metadata: pathlib.Path, lexicon: typing.Optional[str], cognate_lexicon: typing.Optional[str] = None, status_update: typing.Optional[str] = None, logger: logging.Logger = <Logger lexedata (INFO)>)

Module contents