POS tagging #13

TviNet · 2019-05-22T08:39:05Z

https://universaldependencies.org/ has labelled data for parts of speech, dependencies and morphology for Hindi, Sanskrit, Marathi, Tamil and Telugu.
I plan to use an LM-LSTM-CRF architecture for sequence tagging. However, the language models in iNLTK use SentencePiece tokens. Could anyone guide me through using the existing LMs with word tokens, or do I need to retrain the embeddings on word tokens?
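
For anyone following along: the Universal Dependencies treebanks are distributed in the CoNLL-U format (ten tab-separated columns per token, with the word form in column 2 and the universal POS tag in column 4). A minimal sketch of pulling (words, tags) pairs out of such a file could look like the following. The function name `read_conllu` and the two-token sample are made up for illustration and are not part of iNLTK or the UD tooling.

```python
from typing import List, Tuple

Sentence = Tuple[List[str], List[str]]  # (words, UPOS tags)


def read_conllu(text: str) -> List[Sentence]:
    """Extract (words, tags) per sentence from CoNLL-U formatted text."""
    sentences: List[Sentence] = []
    words: List[str] = []
    tags: List[str] = []
    for raw in text.splitlines():
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # malformed line, skipped in this sketch
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword-token range or empty node, not a plain word
        words.append(cols[1])  # FORM
        tags.append(cols[3])   # UPOS
    if words:
        sentences.append((words, tags))
    return sentences


# Illustrative sample, built with explicit tabs so the columns are unambiguous.
sample = "\n".join([
    "# text = dogs bark",
    "\t".join(["1", "dogs", "dog", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "_"]),
    "",
])
parsed = read_conllu(sample)
```

Skipping the range lines (IDs such as `3-4`) keeps one tag per syntactic word, which is usually what a word-level tagger expects.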

goru001 · 2019-05-23T03:58:18Z

@TviNet Thanks for reaching out!
I glanced over the LM-LSTM-CRF repo and saw that it treats every space-separated word as a token. I think you can do that for Indic languages as well, but in that case you might not be able to use transfer learning (that is, reuse the pretrained LMs). I might be wrong here and would need to dig deeper into the repo, but that is my impression from a quick look.

The way I was thinking of tackling POS is to keep transfer learning and pre-process the dataset instead: break every word down into its subword tokens (using the tokenizer we already have in iNLTK) and expand the word's tag into a matching sequence, <sometag1, sometag2, sometag3>, with one tag per token the word is split into. I think this will yield a better model and better results, but we should experiment.
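
A minimal sketch of that pre-processing, under some assumptions: the tokenizer is passed in as a plain function (the `toy_tokenize` below is a hypothetical stand-in, not the iNLTK tokenizer), and each subword piece simply inherits the tag of the word it came from. Other expansion schemes, such as position-specific tags per piece, are equally possible; this is one option, not a settled design.

```python
from typing import Callable, List, Tuple


def expand_tags(
    words: List[str],
    tags: List[str],
    subword_tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[str], List[int]]:
    """Split each word into subword pieces and repeat its tag per piece.

    Also returns, for every piece, the index of the word it came from,
    so piece-level predictions can be mapped back to words later.
    """
    pieces: List[str] = []
    piece_tags: List[str] = []
    word_ids: List[int] = []
    for i, (word, tag) in enumerate(zip(words, tags)):
        # Fall back to the whole word if the tokenizer returns nothing.
        subtokens = subword_tokenize(word) or [word]
        for piece in subtokens:
            pieces.append(piece)
            piece_tags.append(tag)
            word_ids.append(i)
    return pieces, piece_tags, word_ids


def toy_tokenize(word: str) -> List[str]:
    # Hypothetical stand-in for a real subword tokenizer:
    # cuts the word into chunks of at most three characters.
    return [word[i:i + 3] for i in range(0, len(word), 3)]


pieces, piece_tags, word_ids = expand_tags(
    ["running", "fast"], ["VERB", "ADV"], toy_tokenize
)
```

At prediction time, `word_ids` lets you collapse piece-level tags back to one tag per word, for example by taking the tag of the first piece or a majority vote.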

Let me know what your thoughts are. Thanks!

TviNet · 2019-05-23T19:00:19Z

I tried averaging the subtoken embeddings of each word and feeding the result to an LSTM+CRF. This gave decent results for Hindi (13k training sentences, 96.3% accuracy) but not for Tamil (400 training sentences, 87% accuracy). The other languages similarly have very few training samples.
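
For readers who want to try the same thing, the averaging step could be sketched roughly as below. This is an illustration, not the code behind the numbers above: the function name is made up, plain Python lists stand in for embedding tensors, and `word_ids` is assumed to give the index of the source word for each subtoken vector.

```python
from typing import List


def average_subtoken_vectors(
    vectors: List[List[float]], word_ids: List[int]
) -> List[List[float]]:
    """Average the subtoken vectors that belong to the same word.

    vectors[k] is the embedding of subtoken k; word_ids[k] is the index
    of the word that subtoken came from. Returns one vector per word,
    ordered by word index.
    """
    if len(vectors) != len(word_ids):
        raise ValueError("vectors and word_ids must have the same length")
    sums: dict = {}
    counts: dict = {}
    for vec, wid in zip(vectors, word_ids):
        if wid not in sums:
            sums[wid] = [0.0] * len(vec)
            counts[wid] = 0
        sums[wid] = [s + v for s, v in zip(sums[wid], vec)]
        counts[wid] += 1
    return [[s / counts[wid] for s in sums[wid]] for wid in sorted(sums)]


# Two subtokens for word 0, one subtoken for word 1.
word_vectors = average_subtoken_vectors(
    [[1.0, 3.0], [3.0, 5.0], [10.0, 0.0]], [0, 0, 1]
)
```

The resulting per-word vectors can then be fed to a word-level tagger such as an LSTM+CRF, so the tag sequence stays aligned with the original words.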

goru001 · 2019-05-25T06:24:46Z

Yes, that's why I think using transfer learning is important here, especially for low resource languages.

sarves · 2021-01-11T14:17:59Z

Hi,

In case you are interested in a BiLSTM-based Tamil POS tagger (developed using the Stanza framework): https://github.com/sarves/thamizhi-pos
You can find the relevant models and tagged data there.

Sarves

goru001 added the enhancement New feature or request label Apr 10, 2020
