sarves/thamizhi-pos

ThamizhiPOSt - a neural based POS tagger for Tamil

ThamizhiPOSt is a deep-learning-based POS tagger developed using the Stanza framework and trained on 11K POS-tagged sentences together with Facebook's fastText model. ThamizhiPOSt uses the Universal Dependencies POS (UPOS) tagset for annotation.

ThamizhiPOSt achieves an accuracy of 95.20 (as of 02.09.2020) on the TTB test set (https://github.com/UniversalDependencies/UD_Tamil-TTB/blob/master/ta_ttb-ud-test.conllu). This is the state of the art among the Tamil POS taggers implemented or reported as of that date.

We trained this POS tagger using the AMRITA POS-tagged data. Before doing so, we harmonised the BIS, AMRITA, and UPOS tagsets, which are the primary POS tagsets available as of today. The harmonisation of the Universal Dependency POS (UPOS), BIS, and AMRITA tagsets can be found in this sheet.

However, we found that the AMRITA POS-tagged data are cleaner; therefore, we used them to train the POS tagger. We used Stanza, a neural framework developed by Stanford University as a successor to their CoreNLP framework, to train the POS tagger.

The trained models can be found here in a compressed format. The file is in tgz format; you can extract it using tar.
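If you prefer to unpack the archive from Python rather than with the tar command, the following is a minimal sketch using the standard-library tarfile module. The function name `extract_models` and any archive file name you pass to it are hypothetical; use the name of the file you actually downloaded.

```python
import tarfile
from pathlib import Path


def extract_models(archive_path, target_dir="."):
    """Extract a gzip-compressed tar archive (.tgz) into target_dir.

    Returns the list of member names found in the archive.
    Only use this on archives you trust.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target)
    return names
```

Calling `extract_models("downloaded-models.tgz", "models")` (with a hypothetical file name) would unpack the archive into a models folder and return the extracted paths.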

How to use ThamizhiPOSt

Setting up ThamizhiPOSt:

You need Python 3. In addition, install the following tools and libraries (these commands are for Debian-based distributions; you can find the equivalents for other Linux distributions and Windows on the web):

Install [stanza](https://stanfordnlp.github.io/stanza/installation_usage.html): pip3 install stanza
[Download this compressed file](http://nlp-tools.uom.lk/thamizhi-pos/thamizhi-pos.zip) and uncompress it. You should see a script, thamizhi-post.py, and a folder, models.

Run the following command:

python3 thamizhi-post.py "input-file"

where "input-file" is the text file you want to POS tag. The file must not contain any empty lines. This will generate a file called pos-tagged.txt.

Note: With this version of the tagger, each line/sentence must end with a punctuation symbol (a period, an exclamation mark, or a question mark). Otherwise, the very last token will be treated as punctuation.
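The two input requirements above (no empty lines, and a terminal symbol on every line) can be enforced before running the tagger. The following is a minimal sketch, not part of the ThamizhiPOSt distribution: the names `prepare_lines` and `prepare_file` are hypothetical, and appending the period as a separate, space-delimited token is an assumption based on the example sentence in this README.

```python
# Symbols the README lists as acceptable line endings.
TERMINALS = (".", "!", "?")


def prepare_lines(lines):
    """Drop empty lines and make sure each remaining line ends with a terminal symbol."""
    prepared = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # the tagger does not accept empty lines
        if not line.endswith(TERMINALS):
            # Assumption: a space-separated period, as in the README example.
            line = line + " ."
        prepared.append(line)
    return prepared


def prepare_file(src_path, dst_path):
    """Read src_path, clean it with prepare_lines, and write the result to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        cleaned = prepare_lines(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(cleaned) + "\n")
```

The cleaned file written by `prepare_file` can then be passed to thamizhi-post.py as the input file.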

For the input "தமிழ் எங்கள் உயிருக்கு நேர் .", the output looks like the following:

1	தமிழ்	PROPN
2	எங்கள்	PRON
3	உயிருக்கு	NOUN
4	நேர்	NOUN
5	.	PUNCT
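To post-process pos-tagged.txt, the three columns shown above (token index, token, UPOS tag) can be read back into Python. This is a minimal sketch that assumes the columns are separated by whitespace (for example tabs) and that tokens contain no internal spaces; `parse_tagged` is a hypothetical helper name, not part of the tool.

```python
from collections import Counter


def parse_tagged(text):
    """Parse tagger output into a list of (index, token, tag) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines or lines that do not match the three-column layout.
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        index, token, tag = parts
        rows.append((int(index), token, tag))
    return rows


# The example output from this README, used as sample input.
sample = "1\tதமிழ்\tPROPN\n2\tஎங்கள்\tPRON\n3\tஉயிருக்கு\tNOUN\n4\tநேர்\tNOUN\n5\t.\tPUNCT\n"
rows = parse_tagged(sample)
tag_counts = Counter(tag for _, _, tag in rows)
```

For a real run, the sample string would be replaced by the contents of pos-tagged.txt read with UTF-8 encoding.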

Tagged data

The following datasets, tagged using ThamizhiPOSt, are available for research:

Cite

Please cite the following if you use the Thamizhi-POS tool, models, or tagged data:

@misc{sarveswaran2020thamizhiudp,
  title={ThamizhiUDp: A Dependency Parser for Tamil},
  author={Sarveswaran, Kengatharaiyer and Dias, Gihan},
  year={2020},
  eprint={2012.13436},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Acknowledgment

This research was supported by the Accelerating Higher Education Expansion and Development (AHEAD) Operation of the Ministry of Higher Education, Sri Lanka, funded by the World Bank.