Tokenising, lemmatising, tagging and dependency parsing of Frisian text using UDPipe Frysk
- Status: In Progress
- Type: Generic
- Work Package: WP3
- Research Coordinators: Hans Van de Velde, Gosse Bouma, Wilbert Heeringa
- Coordinators for CLARIAH: Hans Van de Velde
- Participating Institutes: Fryske Akademy, Department of Information Science - University of Groningen
- End-users: Researchers
- Developers: Hans Van de Velde (Fryske Akademy, project manager/financing), Wilbert Heeringa (Fryske Akademy, building training corpus/implementation UDPipe Frysk), Gosse Bouma (University of Groningen, building training corpus/advice), Martha Hofman (Fryske Akademy, building training corpus), Eduard Drenth (building web service), Hindrik Sijens (Fryske Akademy, advice)
- Interest Groups: Ann, TP
- Task IDs: FRYPOS
We develop a tool for tokenising, lemmatising, tagging and dependency parsing of Frisian text. We follow the Universal Dependencies guidelines (UD version 2; see https://universaldependencies.org/guidelines.html).
Researchers often want to enrich texts with linguistic annotations such as part-of-speech tags, lemmas and dependency relations. These annotations often provide useful features for text mining or serve as preprocessing for further analysis.
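To make these annotations concrete, the sketch below reads a sentence in CoNLL-U, the tab-separated format used by Universal Dependencies, and prints each token with its lemma, PoS tag and dependency relation. The Frisian sentence and its annotation are hand-written for illustration and are not output of the tool; the helper names `Token` and `parse_conllu` are hypothetical.

```python
from typing import List, NamedTuple

# Hand-written illustration ("The girl reads a book."), NOT output of the tool.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
CONLLU = (
    "# text = It famke lêst in boek.\n"
    "1\tIt\tit\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tfamke\tfamke\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tlêst\tlêze\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\tin\tin\tDET\t_\t_\t5\tdet\t_\t_\n"
    "5\tboek\tboek\tNOUN\t_\t_\t3\tobj\t_\t_\n"
    "6\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    upos: str
    head: int      # 0 means the token is the root of the sentence
    deprel: str


def parse_conllu(text: str) -> List[Token]:
    """Parse the word lines of a single CoNLL-U sentence."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        tokens.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], int(cols[6]), cols[7])
        )
    return tokens


tokens = parse_conllu(CONLLU)
for t in tokens:
    head_form = "ROOT" if t.head == 0 else tokens[t.head - 1].form
    print(f"{t.form}\t{t.lemma}\t{t.upos}\t{t.deprel} -> {head_form}")
```

Each token carries a lemma, a universal PoS tag, and a pointer to its syntactic head together with the relation label, which is the information that text-mining pipelines typically consume.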
West Lauwers Frisian is a language with about 354,000 native speakers in the province of Friesland in the northwest of the Netherlands. While a wealth of tools has been developed for large (inter)national languages like English, Spanish and French, this is often not the case for smaller regional minority languages. Until recently, no tool was available for tokenising, lemmatising, tagging and dependency parsing of Frisian. Currently a tool is available that can tokenise, lemmatise and tag Frisian texts, but it cannot yet generate morphological and syntactic annotations. Moreover, the tool is trained on a corpus of only 52,833 tokens, 46,930 words and 3,374 sentences.
We will extend the current treebank to 100,000 words. To this end, we will use texts that were used for the 'Oersetter', the online Frisian-Dutch translation program. We will also include material from the Online Nederlânsk-Frysk Wurdboek (ONFW). We aim to build a balanced corpus that includes at least the following genres: magazines, newspapers, novels, Wikipedia and scientific texts.
The newly added texts will be tokenised and PoS-tagged with the tool as it is currently available. The PoS tags will be manually corrected. Next, the Frisian texts will be translated literally into Dutch. On the basis of the Dutch translations, the morphological and syntactic annotations are generated with the Alpino parser, which uses the Lassy Small corpus. The results are projected back onto the Frisian text and are then manually checked and corrected.
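The projection step can be sketched as follows. This is a minimal illustration, not the project's actual procedure: it assumes that the Dutch parse is already available as UD-style head indices and relation labels, and that the literal translation yields a word alignment from Dutch to Frisian token positions. The function name `project_dependencies` and the example sentence pair are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def project_dependencies(
    src_heads: List[int],
    src_deprels: List[str],
    alignment: Dict[int, int],
    n_tgt: int,
) -> List[Optional[Tuple[int, str]]]:
    """Project (head, deprel) pairs from source tokens onto target tokens.

    src_heads   -- 1-based head index per source (Dutch) token, 0 for the root
    src_deprels -- dependency relation per source token
    alignment   -- maps 1-based source positions to 1-based target positions
    n_tgt       -- number of target (Frisian) tokens
    Target tokens that cannot be annotated stay None for manual correction.
    """
    projected: List[Optional[Tuple[int, str]]] = [None] * n_tgt
    for src_pos, tgt_pos in alignment.items():
        src_head = src_heads[src_pos - 1]
        if src_head == 0:
            tgt_head = 0                      # the root stays the root
        elif src_head in alignment:
            tgt_head = alignment[src_head]    # follow the head through the alignment
        else:
            continue                          # head has no counterpart: leave unannotated
        projected[tgt_pos - 1] = (tgt_head, src_deprels[src_pos - 1])
    return projected


# Hypothetical pair: Dutch "Het meisje leest een boek ." / Frisian "It famke lêst in boek ."
dutch_heads = [2, 3, 0, 5, 3, 3]
dutch_deprels = ["det", "nsubj", "root", "det", "obj", "punct"]
word_for_word = {i: i for i in range(1, 7)}   # literal translation: one-to-one alignment

frisian = project_dependencies(dutch_heads, dutch_deprels, word_for_word, n_tgt=6)
print(frisian)
```

With a word-for-word translation the alignment is the identity and every Frisian token inherits the annotation of its Dutch counterpart. Tokens left as `None` are the cases that the manual checking step would have to fill in.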
The tool is made available as a web app (https://frisian.eu/postaggerapp/) and a web service (https://frisian.eu/postaggerservice/).
Software used: Microsoft Excel, R and RStudio. We use UDPipe via the R package udpipe, which has been developed by Jan Wijffels (BNOSAC). See https://frisian.eu/postaggerapp/#about for a list of all R packages that are used.
We follow the guidelines in the section 'Data split' of the UD release checklist (https://universaldependencies.org/release_checklist.html). For a corpus of 100,000 words this means that we set aside 10K words as test data for evaluating the tool and 10K words as dev data for tuning it, and use the rest as training data.
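The 10K/10K/rest split can be sketched as below. This is an illustration under stated assumptions rather than the project's actual script: sentences are kept whole, they are shuffled with a fixed seed before splitting, and the function name `split_by_word_budget` is hypothetical. For a balanced corpus one might additionally want to stratify by genre, which this sketch does not do.

```python
import random
from typing import List, Tuple

Sentence = List[str]


def split_by_word_budget(
    sentences: List[Sentence],
    test_words: int = 10_000,
    dev_words: int = 10_000,
    seed: int = 0,
) -> Tuple[List[Sentence], List[Sentence], List[Sentence]]:
    """Return (train, dev, test), filling test and dev up to their word budgets.

    Whole sentences are assigned, so a set may exceed its budget by at most
    the length of one sentence.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)     # fixed seed keeps the split reproducible

    train: List[Sentence] = []
    dev: List[Sentence] = []
    test: List[Sentence] = []
    test_count = dev_count = 0
    for sent in shuffled:
        if test_count < test_words:
            test.append(sent)
            test_count += len(sent)
        elif dev_count < dev_words:
            dev.append(sent)
            dev_count += len(sent)
        else:
            train.append(sent)
    return train, dev, test


def n_words(split: List[Sentence]) -> int:
    return sum(len(sent) for sent in split)


# Toy corpus: 10,000 placeholder sentences of 10 words each (100,000 words in total).
corpus = [["w"] * 10 for _ in range(10_000)]
train, dev, test = split_by_word_budget(corpus)
print(n_words(train), n_words(dev), n_words(test))  # 80000 10000 10000
```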
- Agić, Ž., Johannsen, A., Plank, B., Alonso, H. M., Schluter, N., & Søgaard, A. (2016). Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics, 4, 301-312.
- De Does, J., & Vandeghinste, V. Exploiting UD treebanks for the extraction of word combination statistics. https://github.com/CLARIAH/usecases/blob/master/cases/treebank-combi.md
- Hupkes, D., & Bod, R. (2016, May). POS-tagging of Historical Dutch. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (pp. 77-82).
- Sang, E. T. K. (2016, May). Improving part-of-speech tagging of historical text by first translating to modern text. In International Workshop on Computational History and Data-Driven Humanities (pp. 54-64). Springer, Cham.
- Straka, M., Hajic, J., & Straková, J. (2016, May). UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, pos tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (pp. 4290-4297).
- Straka, M., & Straková, J. (2017, August). Tokenizing, pos tagging, lemmatizing and parsing ud 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (pp. 88-99).