Python script to convert the Ukrainian morphological dictionary from the LanguageTool project to the OpenCorpora format. The script runs well under PyPy and also collects some stats/insights/anomalies in the input dictionary. Use at your own risk.
It solves these tasks:
- Parses the LanguageTool raw dictionary format
- Performs some basic sanity checks (and collects some stats about input dict)
- Converts LanguageTool tags to OpenCorpora tags
- Groups together wordforms and tries to determine a lemma for the group
- Exports the tagset, the tagset restrictions and all lemmas to the OpenCorpora format
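The tag conversion step can be pictured with a small sketch. This is not the script's actual code: the `LT_TO_OC` table and the `convert_tags` helper are hypothetical names, and the four tag pairs are illustrative stand-ins for the real mapping, whose layout is not shown here. It assumes LanguageTool tags arrive as one colon-separated string and collects unmapped tags separately, in the spirit of the sanity checks listed above.

```python
# Hypothetical LanguageTool -> OpenCorpora tag pairs, for illustration only.
# The real project derives its mapping from its own data files.
LT_TO_OC = {
    "noun": "NOUN",
    "v_naz": "nomn",
    "perf": "perf",
    "imperf": "impf",
}


def convert_tags(lt_tags, mapping=LT_TO_OC):
    """Convert a colon-separated LT tag string to a list of OpenCorpora tags.

    Returns (converted, unknown), where unknown holds LT tags with no
    entry in the mapping, so they can be reported instead of silently lost.
    """
    converted, unknown = [], []
    for tag in lt_tags.split(":"):
        if tag in mapping:
            converted.append(mapping[tag])
        else:
            unknown.append(tag)
    return converted, unknown


if __name__ == "__main__":
    print(convert_tags("noun:m:v_naz"))
```

Returning the unmapped tags alongside the converted ones is a design choice of this sketch: it keeps the conversion total while still making gaps in the mapping visible.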
Grouping wordforms under a particular lemma is cumbersome, mostly because of homonymy and the internal format of the LanguageTool dict. In a nutshell:
- An entry in the LanguageTool dictionary looks like this:
  wordform tag1:tag2:tag3 lemma
  where lemma is just a string.
- Because of homonymy, you cannot tell exactly which lemma this entry refers to.
- So, you can only apply a bunch of heuristics: the lemma should have the same POS as the wordform, and the lemma should carry particular tags. For example, for nouns all lemmas should have the :v_naz tag.
- Another problem with heuristics is that a lot of verb lemmas look the same for the :perf and :imperf tags. But those are two different lemmas and they have their own wordforms!
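The heuristics above can be sketched roughly as follows. This is a minimal illustration, not the script's real implementation: the function names are made up, the sample entries are invented placeholders rather than real dictionary data, and it assumes the first tag of an entry is its POS. Wordforms are grouped by lemma string, POS and verb aspect, so that a :perf and an :imperf verb sharing the same lemma string end up in separate groups.

```python
from collections import defaultdict


def parse_entry(line):
    """Split a 'wordform tag1:tag2:tag3 lemma' line into its parts."""
    wordform, tags, lemma = line.split()
    return wordform, tags.split(":"), lemma


def is_lemma_candidate(tags):
    """Heuristic from the text: a noun lemma should carry the v_naz tag."""
    if tags[0] == "noun":  # assumption: the POS is the first tag
        return "v_naz" in tags
    return True


def aspect_of(tags):
    """Return 'perf' or 'imperf' if present, else None."""
    for aspect in ("perf", "imperf"):
        if aspect in tags:
            return aspect
    return None


def group_by_lemma(lines):
    """Group wordforms under a (lemma string, POS, aspect) key."""
    groups = defaultdict(list)
    for line in lines:
        wordform, tags, lemma = parse_entry(line)
        groups[(lemma, tags[0], aspect_of(tags))].append((wordform, tags))
    return dict(groups)


# Invented placeholder entries, not taken from the real dictionary.
SAMPLE = [
    "kit noun:m:v_naz kit",
    "kota noun:m:v_rod kit",
    "verbx verb:perf:inf verbx",
    "verbx verb:imperf:inf verbx",
]

if __name__ == "__main__":
    for key, forms in group_by_lemma(SAMPLE).items():
        print(key, forms)
```

Keying on the aspect is only one possible way to keep the two verb lemmas apart, and it does nothing for homonyms that share both POS and aspect; those would still need extra heuristics.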
pip install -r requirements.txt
- mapping.csv with the general information about the tagset used in the Ukrainian morphological dictionary. Exported from here.
- An excerpt (first 1000 words) from the Ukrainian morphological dictionary.
- Cream nodes are for the tags found only in OpenCorpora
- Blue nodes are for the tags from LanguageTool only
- Green nodes are for the tags that can be found in both
- The LT tag name is above
- The OpenCorpora tag name is below
- Blue links are for OpenCorpora
- Orange links are for LT
python bin/lt_convert.py 1000.txt out.xml --debug