ThamizhiPOSt - a neural based POS tagger for Tamil

Website - http://nlp-tools.uom.lk/thamizhi-pos/

ThamizhiPOSt is a deep-learning-based POS tagger developed with the Stanza framework. It was trained on 11K POS-tagged sentences together with Facebook's fastText model. ThamizhiPOSt annotates text with the Universal Dependencies POS (UPOS) tagset.

ThamizhiPOSt achieves an accuracy of 95.20 (as of 02.09.2020) on the TTB test set (https://github.com/UniversalDependencies/UD_Tamil-TTB/blob/master/ta_ttb-ud-test.conllu). This is the current state of the art among the Tamil POS taggers implemented or reported as of that date.
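The README does not say how the accuracy figure was computed. As a rough illustration only, the sketch below assumes token-level UPOS accuracy: the percentage of tokens whose predicted tag matches the gold tag in a CoNLL-U file such as the TTB test set. The function names `read_upos` and `upos_accuracy` are hypothetical, and the sketch assumes the gold and predicted files share the same tokenisation.

```python
def read_upos(conllu_text):
    """Return (form, upos) pairs for the word lines of a CoNLL-U string."""
    pairs = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # not a token line
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        pairs.append((cols[1], cols[3]))  # FORM is column 2, UPOS is column 4
    return pairs


def upos_accuracy(gold_text, pred_text):
    """Percentage of tokens whose predicted UPOS tag equals the gold tag."""
    gold = read_upos(gold_text)
    pred = read_upos(pred_text)
    if not gold:
        raise ValueError("no tokens found in the gold data")
    if len(gold) != len(pred):
        raise ValueError("gold and predicted data have different token counts")
    correct = sum(1 for (_, g), (_, p) in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)
```

For real use, both arguments would be the contents of CoNLL-U files read with `encoding="utf-8"`.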

We trained this POS tagger on the AMRITA POS-tagged data. Before doing so, we harmonised the BIS, AMRITA and UPOS tagsets, which are the primary POS tagsets available as of today. The harmonisation of the Universal Dependencies POS (UPOS), BIS, and AMRITA tagsets can be found in this sheet.

We found the AMRITA POS-tagged data to be cleaner than the alternatives, so we used it to train the POS tagger. The training was done with Stanza, a neural framework developed by Stanford University as the successor to their CoreNLP framework.

The trained models can be found here in compressed form. The file is in tgz format; you can extract it using tar.
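If tar is not available on the command line, the same extraction can be done from Python with the standard `tarfile` module. This is a minimal sketch; the function name `extract_models` and any archive or directory names passed to it are placeholders, not names taken from this repository.

```python
import tarfile
from pathlib import Path


def extract_models(archive_path, dest_dir):
    """Extract a .tgz archive into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Use the safer "data" extraction filter where this Python provides it.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    # List the extracted files relative to the destination directory.
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )
```

Calling `extract_models("path/to/archive.tgz", "models")` would unpack the archive into a local `models` directory; both arguments here are illustrative.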

How to use ThamizhiPOSt

Setting up ThamizhiPOSt:

You need Python 3 (Stanza requires Python 3.6 or later). In addition, install the following tools and libraries (these commands are for Debian-based distributions; you can find equivalents for other Linux distributions and Windows on the web):

Install [stanza](https://stanfordnlp.github.io/stanza/installation_usage.html): pip3 install stanza
[Download this compressed file](http://nlp-tools.uom.lk/thamizhi-pos/thamizhi-pos.zip) and uncompress it. You should see a script, thamizhi-post.py, and a folder named models.

Run the following command:

python3 thamizhi-post.py "input-file"

where "input-file" is the text file you want to POS tag. The file must not contain any empty lines. This will generate a file called pos-tagged.txt.

Note: With this version of the tagger, every line/sentence must end with a symbol (a period, exclamation mark, or question mark). Otherwise, the very last token will be treated as punctuation.
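The two input requirements above (no empty lines, and a closing symbol on every line) can be enforced before running the tagger. The following is a minimal sketch, not part of the tool: `prepare_lines` and `prepare_file` are hypothetical helpers, and appending a space-separated period to lines that lack a closing symbol is an assumption based on the sample input shown in this README.

```python
SENTENCE_END = (".", "!", "?")


def prepare_lines(lines):
    """Drop empty lines and make sure every remaining line ends with a symbol."""
    prepared = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # the tagger expects no empty lines
        if not line.endswith(SENTENCE_END):
            line += " ."  # assumption: a space-separated period is acceptable
        prepared.append(line)
    return prepared


def prepare_file(src_path, dst_path):
    """Write a cleaned copy of src_path to dst_path, one sentence per line."""
    with open(src_path, encoding="utf-8") as src:
        prepared = prepare_lines(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(prepared) + "\n")
```

The cleaned file written by `prepare_file` can then be passed to thamizhi-post.py in place of the original input.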

The output looks like the following for the input "தமிழ் எங்கள் உயிருக்கு நேர் ."

1	தமிழ்	PROPN
2	எங்கள்	PRON
3	உயிருக்கு	NOUN
4	நேர்	NOUN
5	.	PUNCT
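To work with the tagged output programmatically, the file can be read back into (token, tag) pairs. This sketch assumes the layout shown above: one token per line with three tab-separated columns (index, token, tag). It also assumes that the index restarts at 1 for each new sentence, which the README does not state explicitly. `parse_tagged` is a hypothetical helper.

```python
def parse_tagged(text):
    """Parse tagger output into a list of sentences of (token, tag) pairs."""
    sentences = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between sentences, if any
        index, token, tag = line.split("\t")
        if int(index) == 1 and current:
            # Assumption: an index of 1 marks the start of a new sentence.
            sentences.append(current)
            current = []
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences
```

For a real run, `text` would be the contents of pos-tagged.txt read with `encoding="utf-8"`.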

Tagged data

The following datasets, tagged using ThamizhiPOSt, are available for research:

Official data (consists of annual reports, audit reports, and letters; anonymised) - 8,932 tokens / 1,100 sentences
Sri Lankan Tamil news data - 124,203 tokens / 10,000 sentences

Cite

Please cite this if you use the Thamizhi-POS tool, models, or tagged data:

@misc{sarveswaran2020thamizhiudp,
  title={ThamizhiUDp: A Dependency Parser for Tamil},
  author={Sarveswaran, Kengatharaiyer and Dias, Gihan},
  year={2020},
  eprint={2012.13436},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Acknowledgment

This research was supported by the Accelerating Higher Education Expansion and Development (AHEAD) Operation of the Ministry of Higher Education, Sri Lanka funded by the World Bank.
