codemixed-language-identification

This repository contains a classification model trained to detect and classify Indian code-mixed languages and Indian languages written in Latin script. Currently, it supports Hinglish (Hindi + English), Tanglish (Tamil + English), and Manglish (Malayalam + English).

Dataset creation

  • For creating the dataset, we used the following sources:
    • We used the datasets from the Dravidian code-mixed sentiment analysis shared task (link) to gather the Malayalam and Tamil data.
    • We used the HinglishNorm dataset for the Hindi data.
    • For the English data, we collected random sentences from English Wikipedia.
  • From each of the languages, we selected 5691 random instances to create a dataset of total size 22764.
  • The dataset was divided into training and validation sets with an 80:20 split; the training set contains 18211 samples and the validation set 4553 samples (see the sketch after this list).
  • The label-to-language mapping in the dataset is as follows:
    • 1: 'en' or English
    • 2: 'hi-en' or Hinglish
    • 3: 'ta-en' or Tanglish
    • 4: 'ml-en' or Manglish
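
A minimal sketch of the sampling and split described above, assuming the per-language sentences have already been collected into CSV files. The file names and the "text" column are illustrative assumptions, not the repository's actual layout:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical per-language CSVs, each with a "text" column; label IDs follow
# the mapping above (1: en, 2: hi-en, 3: ta-en, 4: ml-en).
sources = {1: "en.csv", 2: "hi_en.csv", 3: "ta_en.csv", 4: "ml_en.csv"}

frames = []
for label, path in sources.items():
    df = pd.read_csv(path)
    sample = df.sample(n=5691, random_state=42)  # 5691 random instances per language
    sample["label"] = label
    frames.append(sample)

data = pd.concat(frames, ignore_index=True)  # 4 * 5691 = 22764 rows

# 80:20 split, stratified by label -> 18211 train / 4553 validation samples
train_df, valid_df = train_test_split(
    data, test_size=0.2, stratify=data["label"], random_state=42
)
```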

Classification model

  • For building the classification model, we used the pre-trained ai4bharat/indic-bert model and fine-tuned it on this dataset for the classification task. IndicBERT has achieved strong results on a range of tasks involving Indian languages, and unlike xlm-roberta or other multilingual models, it focuses primarily on Indian languages.
  • For training the models, we used the fastai library to maintain coherence with the inltk toolkit. The model was inspired by this Medium article, which describes how to use the transformers library with fastai.
  • In this model, we used gradual unfreezing of layers along with slanted triangular learning rates (see the sketch after this list).
  • The model achieves 95.3% accuracy on the validation set.
  • fastai-transformers-train.ipynb contains the training code; the inference learner is saved to the models folder after training.
  • After training, inference.ipynb can be used to generate predictions.
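
A condensed sketch of the training schedule described above, for fastai v1. The transformer wrapper and DataBunch construction from the notebook are elided here; the learning rates are illustrative assumptions, and `fit_one_cycle` is fastai's one-cycle variant of the slanted triangular learning rate from ULMFiT:

```python
from fastai.basic_train import Learner

def train_with_gradual_unfreezing(learner: Learner) -> None:
    """Gradual unfreezing: train the head first, then progressively
    unfreeze deeper layer groups with discriminative learning rates."""
    learner.freeze()                   # stage 1: only the classification head is trainable
    learner.fit_one_cycle(1, max_lr=2e-3)

    learner.freeze_to(-2)              # stage 2: unfreeze the last two layer groups
    learner.fit_one_cycle(1, max_lr=slice(1e-5, 5e-4))  # lower LRs for deeper layers

    learner.unfreeze()                 # stage 3: fine-tune the whole model
    learner.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))
```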

Notes

  • This PR integrates the trained classifier with the inltk library.
  • You can download the pre-trained classifier model from here; a minimal loading sketch follows below.
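
A minimal inference sketch for fastai v1, assuming the exported learner sits in the models folder; "export.pkl" is an assumed file name, so use whatever the training notebook actually exports:

```python
from fastai.basic_train import load_learner

# Label IDs follow the dataset's mapping (1: en, 2: hi-en, 3: ta-en, 4: ml-en).
learner = load_learner("models", "export.pkl")  # file name is an assumption

pred_class, pred_idx, probs = learner.predict("mujhe yeh gaana bahut pasand hai")
print(pred_class)  # expected: the Hinglish label (hi-en)
```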

Future Work

  • Error analysis
  • Support for more languages
  • Collection of more data
  • Fine-tuning better pre-trained models, if possible
