lexedata.importer package¶
Submodules¶
lexedata.importer.cognates module¶
Load #cognate and #cognatesets from an Excel file into CLDF.
- class lexedata.importer.cognates.CognateEditParser(output_dataset: pycldf.dataset.Dataset, row_type=<class 'lexedata.types.CogSet'>, top: int = 2, cellparser: lexedata.util.excel.NaiveCellParser = <class 'lexedata.util.excel.CellParser'>, row_header=['set', 'Name', None], check_for_match: typing.List[str] = ['Form'], check_for_row_match: typing.List[str] = ['Name'], check_for_language_match: typing.List[str] = ['Name'])¶
Bases:
lexedata.importer.excel_matrix.ExcelCognateParser
- language_from_column(column: List[openpyxl.cell.cell.Cell]) lexedata.types.Language ¶
- properties_from_row(row: List[openpyxl.cell.cell.Cell]) Optional[lexedata.types.RowObject] ¶
- lexedata.importer.cognates.header_from_cognate_excel(ws: openpyxl.worksheet.worksheet.Worksheet, dataset: pycldf.dataset.Dataset, logger: logging.Logger = <Logger lexedata (INFO)>)¶
- lexedata.importer.cognates.import_cognates_from_excel(ws: openpyxl.worksheet.worksheet.Worksheet, dataset: pycldf.dataset.Dataset, extractor: re.Pattern = re.compile('/(?P<ID>[^/]*)/?$'), logger: logging.Logger = <Logger lexedata (INFO)>) None ¶
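For example, cognate judgements can be imported with a minimal sketch like the following (the file names Wordlist-metadata.json and cognates.xlsx are hypothetical placeholders; adjust them to your project layout):

import openpyxl
import pycldf

from lexedata.importer.cognates import import_cognates_from_excel

# Hypothetical paths; the dataset must be a metadata-described CLDF wordlist.
dataset = pycldf.Dataset.from_metadata("Wordlist-metadata.json")
ws = openpyxl.load_workbook("cognates.xlsx").active

# Read the cognate matrix from the worksheet and add the judgements
# and cognate sets to the CLDF dataset.
import_cognates_from_excel(ws, dataset)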
lexedata.importer.edictor module¶
- lexedata.importer.edictor.edictor_to_cldf(dataset: lexedata.types.Wordlist[lexedata.types.Language_ID, lexedata.types.Form_ID, lexedata.types.Parameter_ID, lexedata.types.Cognate_ID, lexedata.types.Cognateset_ID], new_cogsets: Mapping[lexedata.types.Cognateset_ID, List[Tuple[lexedata.types.Form_ID, range, Sequence[str]]]], affected_forms: Set[lexedata.types.Form_ID], source: List[str] = [])¶
- lexedata.importer.edictor.extract_partial_judgements(segments: typing.Sequence[str], cognatesets: typing.Sequence[int], global_alignment: typing.Sequence[str], logger: logging.Logger = <Logger lexedata (INFO)>) Iterator[Tuple[range, int, Sequence[str]]] ¶
Extract the different partial cognate judgements.
Segments has no morpheme boundary markers; they are inferred from global_alignment. The number of cognatesets and the number of marked segments in global_alignment must match.
>>> next(extract_partial_judgements("t e s t".split(), [3], "t e s t".split()))
(range(0, 4), 3, ['t', 'e', 's', 't'])
>>> partial = extract_partial_judgements("t e s t".split(), [0, 1, 2], "( t ) + e - s + - t -".split())
>>> next(partial)
(range(1, 3), 1, ['e', '-', 's'])
>>> next(partial)
(range(3, 4), 2, ['-', 't', '-'])
- lexedata.importer.edictor.load_forms_from_tsv(dataset: lexedata.types.Wordlist[lexedata.types.Language_ID, lexedata.types.Form_ID, lexedata.types.Parameter_ID, lexedata.types.Cognate_ID, lexedata.types.Cognateset_ID], input_file: pathlib.Path, logger: logging.Logger = <Logger lexedata (INFO)>) Mapping[int, Sequence[Tuple[lexedata.types.Form_ID, range, Sequence[str]]]] ¶
This function overwrites the dataset’s FormTable.
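Call it only with an Edictor TSV export that belongs to this dataset. A minimal sketch (the file names are hypothetical placeholders):

import pathlib

import pycldf

from lexedata.importer.edictor import load_forms_from_tsv

dataset = pycldf.Dataset.from_metadata("Wordlist-metadata.json")

# Replaces the dataset's FormTable with the forms from the Edictor
# export and returns the cognate judgements found there, keyed by
# Edictor's integer cognateset IDs.
new_cogsets = load_forms_from_tsv(dataset, pathlib.Path("edictor_export.tsv"))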
- lexedata.importer.edictor.match_cognatesets(new_cognatesets: Mapping[int, Sequence[Tuple[lexedata.types.Form_ID, range, Sequence[str]]]], reference_cognatesets: Mapping[lexedata.types.Cognateset_ID, Sequence[Tuple[lexedata.types.Form_ID, range, Sequence[str]]]]) Mapping[int, Optional[lexedata.types.Cognateset_ID]] ¶
Match two different cognateset assignments with each other.
Map the new_cognatesets to the reference_cognatesets by trying to maximize the overlap between each new cognateset and the reference cognateset it is mapped to.
So, if two cognatesets got merged, they get mapped to the bigger one:
>>> match_cognatesets(
...     {0: ["a", "b", "c"]},
...     {">": ["a", "b"], "<": ["c"]}
... )
{0: '>'}
If a cognateset got split, it gets mapped to the bigger part and the other part becomes a new, unmapped set:
>>> match_cognatesets(
...     {0: ["a", "b"], 1: ["c"]},
...     {"Σ": ["a", "b", "c"]}
... )
{0: 'Σ', 1: None}
If a single form (or a relatively small number) gets moved between cognatesets, the mapping is maintained:
>>> match_cognatesets(
...     {0: ["a", "b"], 1: ["c", "d", "e"]},
...     {0: ["a", "b", "c"], 1: ["d", "e"]}
... ) == {1: 1, 0: 0}
True
(As you see, the function is a bit more general than the type signature implies.)
lexedata.importer.excel_interleaved module¶
Import data in the “interleaved” format from an Excel spreadsheet.
Here, every even row contains cells with forms (a cell may contain multiple forms), while every odd row contains the associated cognate codes (a one-to-one relationship between forms and codes is expected). Forms and cognate codes are separated by commas (“,”) and semicolons (“;”). Any other information in a cell is parsed as part of the form or the cognate code.
- lexedata.importer.excel_interleaved.import_interleaved(ws: openpyxl.worksheet.worksheet.Worksheet, logger: logging.Logger = <Logger lexedata (INFO)>, ids: typing.Optional[typing.Set[lexedata.types.Cognateset_ID]] = None) Iterable[Tuple[lexedata.types.Form_ID, lexedata.types.Language_ID, lexedata.types.Parameter_ID, Optional[str], Optional[str], lexedata.types.Cognateset_ID]] ¶
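A minimal usage sketch (the workbook name is a hypothetical placeholder; see the format description above):

import csv
import sys

import openpyxl

from lexedata.importer.excel_interleaved import import_interleaved

ws = openpyxl.load_workbook("interleaved.xlsx").active

# Each yielded tuple contains a form ID, a language ID, a concept
# (parameter) ID, two optional string fields, and a cognateset ID.
writer = csv.writer(sys.stdout, dialect="excel-tab")
for row in import_interleaved(ws):
    writer.writerow(row)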
lexedata.importer.excel_long_format module¶
- class lexedata.importer.excel_long_format.ImportLanguageReport(is_new_language: bool = False, new: int = 0, existing: int = 0, skipped: int = 0, concepts: int = 0)¶
Bases:
object
- concepts: int¶
- existing: int¶
- is_new_language: bool¶
- new: int¶
- skipped: int¶
- lexedata.importer.excel_long_format.add_single_languages(metadata: pathlib.Path, sheets: Iterable[openpyxl.worksheet.worksheet.Worksheet], match_form: Optional[List[str]], concept_name: Optional[str], language_name: Optional[str], ignore_missing: bool, ignore_superfluous: bool, status_update: Optional[str], logger: logging.Logger, missing_concepts: Set[str] = {}) Mapping[str, lexedata.importer.excel_long_format.ImportLanguageReport] ¶
- lexedata.importer.excel_long_format.get_headers_from_excel(sheet: openpyxl.worksheet.worksheet.Worksheet) Iterable[str] ¶
- lexedata.importer.excel_long_format.import_data_from_sheet(sheet, sheet_header, language_id: str, implicit: Mapping[Literal['languageReference', 'id', 'value'], str] = {}, concept_column: Tuple[str, str] = ('Concept_ID', 'Concept_ID')) Iterable[lexedata.types.Form] ¶
- lexedata.importer.excel_long_format.parser()¶
- lexedata.importer.excel_long_format.read_single_excel_sheet(dataset: pycldf.dataset.Dataset, sheet: openpyxl.worksheet.worksheet.Worksheet, logger: logging.Logger = <Logger lexedata (INFO)>, match_form: typing.Optional[typing.List[str]] = None, entries_to_concepts: typing.Mapping[str, str] = <lexedata.types.KeyKeyDict object>, concept_column: typing.Optional[str] = None, language_name_column: typing.Optional[str] = None, ignore_missing: bool = False, ignore_superfluous: bool = False, status_update: typing.Optional[str] = None, missing_concepts: typing.Set[str] = {}) Mapping[str, lexedata.importer.excel_long_format.ImportLanguageReport] ¶
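A minimal usage sketch (file, sheet, and column names are hypothetical placeholders; concept_column is assumed to match a header in your sheet):

import openpyxl
import pycldf

from lexedata.importer.excel_long_format import read_single_excel_sheet

dataset = pycldf.Dataset.from_metadata("Wordlist-metadata.json")
sheet = openpyxl.load_workbook("long_format.xlsx")["LanguageA"]

# Import one sheet; the report says, per language, how many forms
# were new, already existing, or skipped.
report = read_single_excel_sheet(dataset, sheet, concept_column="Concept")
for language, stats in report.items():
    print(language, stats)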
lexedata.importer.excel_matrix module¶
- class lexedata.importer.excel_matrix.DB(output_dataset: pycldf.dataset.Wordlist)¶
Bases:
object
An in-memory cache of a dataset.
The cache_dataset method is only called in the load_dataset method, but it is also used for finer control by the cognates importer. This means that if you instantiate the CognateParser directly, it does not work: the cache is empty and the code will throw errors (e.g. a KeyError when looking up candidates). If you use the CognateParser elsewhere, make sure to cache the dataset explicitly, e.g. by using DB.from_dataset!
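For example (a minimal sketch; the metadata file name is a hypothetical placeholder):

import pycldf

from lexedata.importer.excel_matrix import DB

dataset = pycldf.Dataset.from_metadata("Wordlist-metadata.json")

# DB(dataset) alone would start with an empty cache; from_dataset
# creates the DB and fills the cache in one step.
db = DB.from_dataset(dataset)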
- add_source(source_id)¶
- associate(form_id: str, row: lexedata.types.RowObject, comment: Optional[str] = None) bool ¶
- cache: Dict[str, Dict[Hashable, Dict[str, Any]]]¶
- cache_dataset(logger: logging.Logger = <Logger lexedata (INFO)>)¶
- commit()¶
- drop_from_cache(table: str)¶
- empty_cache()¶
- find_db_candidates(object: lexedata.importer.excel_matrix.Ob, properties_for_match: Iterable[str], edit_dist_threshold: Optional[int] = None) Iterable[str] ¶
- classmethod from_dataset(dataset, logger: logging.Logger = <Logger lexedata (INFO)>)¶
Create a (filled) cache from a dataset.
- insert_into_db(object: lexedata.types.Object) None ¶
- make_id_unique(object: lexedata.types.Object) str ¶
- retrieve(table_type: str)¶
- source_ids: Set[str]¶
- write_dataset_from_cache(tables: Optional[Iterable[str]] = None)¶
- class lexedata.importer.excel_matrix.ExcelCognateParser(output_dataset: pycldf.dataset.Dataset, row_type=<class 'lexedata.types.CogSet'>, top: int = 2, cellparser: lexedata.util.excel.NaiveCellParser = <class 'lexedata.util.excel.CellParser'>, row_header=['set', 'Name', None], check_for_match: typing.List[str] = ['Form'], check_for_row_match: typing.List[str] = ['Name'], check_for_language_match: typing.List[str] = ['Name'])¶
Bases:
lexedata.importer.excel_matrix.ExcelParser[lexedata.types.CogSet]
- associate(form_id: str, row: lexedata.types.RowObject, comment: Optional[str] = None) bool ¶
- handle_form(params, row_object: lexedata.types.CogSet, cell_with_forms, this_lan, status_update: Optional[str])¶
- on_form_not_found(form: typing.Dict[str, typing.Any], cell_identifier: typing.Optional[str] = None, language_id: typing.Optional[str] = None, logger: logging.Logger = <Logger lexedata (INFO)>) bool ¶
Should I add a missing object? No, but inform the user.
Send a warning (ObjectNotFoundWarning) reporting the missing object and cell.
- Returns
False: The object should not be added.
- Return type
bool
- on_language_not_found(language: Dict[str, Any], cell_identifier: Optional[str] = None) bool ¶
Should I add a missing object? No, a missing object is an error.
Raise an exception (ObjectNotFoundWarning) reporting the missing object and cell.
- Raises
ObjectNotFoundWarning
- on_row_not_found(row_object: lexedata.types.CogSet, cell_identifier: Optional[str] = None) bool ¶
Create row object
- properties_from_row(row: List[openpyxl.cell.cell.Cell]) Optional[lexedata.types.CogSet] ¶
- class lexedata.importer.excel_matrix.ExcelParser(output_dataset: pycldf.dataset.Dataset, row_type: typing.Type[lexedata.types.R], top: int = 2, cellparser: lexedata.util.excel.NaiveCellParser = <class 'lexedata.util.excel.CellParser'>, row_header: typing.List[str] = ['set', 'Name', None], check_for_match: typing.List[str] = ['ID'], check_for_row_match: typing.List[str] = ['Name'], check_for_language_match: typing.List[str] = ['Name'], fuzzy=0)¶
Bases:
Generic[lexedata.types.R]
- handle_form(params, row_object: lexedata.types.R, cell_with_forms, this_lan: str, status_update: Optional[str])¶
- language_from_column(column: List[openpyxl.cell.cell.Cell]) lexedata.types.Language ¶
- on_form_not_found(form: Dict[str, Any], cell_identifier: Optional[str] = None, language_id: Optional[str] = None) bool ¶
Create form
- on_language_not_found(language: Dict[str, Any], cell_identifier: Optional[str] = None) bool ¶
Create language
- on_row_not_found(row_object: lexedata.types.R, cell_identifier: Optional[str] = None) bool ¶
Create row object
- parse_all_languages(sheet: openpyxl.worksheet.worksheet.Worksheet) Dict[str, str] ¶
Parse all language descriptions in the focal sheet.
- Returns
languages: A dictionary mapping columns (“B”, “C”, “D”, …) to language IDs
- Return type
Dict[str, str]
- parse_cells(sheet: openpyxl.worksheet.worksheet.Worksheet, status_update: Optional[str] = None) None ¶
- properties_from_row(row: List[openpyxl.cell.cell.Cell]) Optional[lexedata.types.R] ¶
- lexedata.importer.excel_matrix.cells_are_empty(cells: Iterable[openpyxl.cell.cell.Cell]) bool ¶
- lexedata.importer.excel_matrix.excel_parser_from_dialect(output_dataset: pycldf.dataset.Wordlist, dialect: NamedTuple, cognate: bool) Type[lexedata.importer.excel_matrix.ExcelParser] ¶
- lexedata.importer.excel_matrix.load_dataset(metadata: pathlib.Path, lexicon: typing.Optional[str], cognate_lexicon: typing.Optional[str] = None, status_update: typing.Optional[str] = None, logger: logging.Logger = <Logger lexedata (INFO)>)¶
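A minimal usage sketch (paths are hypothetical placeholders; lexicon and cognate_lexicon are assumed to name the Excel files containing the lexical and cognate matrices):

import pathlib

from lexedata.importer.excel_matrix import load_dataset

# Import a lexicon matrix, and optionally a separate cognate matrix,
# into the CLDF dataset described by the metadata file.
load_dataset(
    metadata=pathlib.Path("Wordlist-metadata.json"),
    lexicon="lexicon.xlsx",
    cognate_lexicon="cognates.xlsx",
)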