mojtabasajjadi/NaiveBayesLanguageIdentifier

NaiveBayesLanguageIdentifier

A method for identifying the language of a given text

There are several ways to identify the language of a document, including statistical analysis of character frequencies, dictionary-based lookup, and machine learning models trained on large amounts of text from multiple languages. Popular supervised learning choices include Support Vector Machines (SVMs), Naive Bayes, and deep learning models such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs). In this repository, I build a simple language identification system using a Naive Bayes classifier with character n-grams as features. Naive Bayes is a simple and efficient probabilistic algorithm that works well for text classification tasks.

In this implementation:


I have created a NaiveBayesLanguageIdentifier class with two main methods: train and predict. During training, each document is split into character n-grams, and the frequency of each n-gram is counted per language and stored in dictionaries. To handle n-grams that were not seen during training, I use Laplace (add-one) smoothing. For prediction, I compute the log-likelihood of the text under each language and select the language with the highest posterior probability. Finally, I demonstrate the classifier by training it on a small dataset and predicting the language of a test document.
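The training and prediction steps described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the class name NaiveBayesSketch, the helper char_ngrams, the default n=3, the lowercasing step, and the document-frequency prior are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesSketch:
    """Illustrative character n-gram Naive Bayes classifier (hypothetical names)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.total_ngrams = Counter()             # language -> total n-grams seen
        self.doc_counts = Counter()               # language -> number of documents
        self.vocabulary = set()                   # every n-gram seen in training

    def train(self, documents, labels):
        for text, language in zip(documents, labels):
            grams = char_ngrams(text.lower(), self.n)
            self.ngram_counts[language].update(grams)
            self.total_ngrams[language] += len(grams)
            self.doc_counts[language] += 1
            self.vocabulary.update(grams)

    def log_scores(self, text):
        """Log prior plus add-one-smoothed log-likelihood for each language."""
        grams = char_ngrams(text.lower(), self.n)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for language, counts in self.ngram_counts.items():
            # Assumed prior: fraction of training documents in this language.
            score = math.log(self.doc_counts[language] / total_docs)
            denominator = self.total_ngrams[language] + vocab_size
            for gram in grams:
                # Add-one smoothing: unseen n-grams get a count of 1, not 0.
                score += math.log((counts[gram] + 1) / denominator)
            scores[language] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for this example only.
model = NaiveBayesSketch(n=3)
model.train(
    [
        "the quick brown fox jumps over the lazy dog",
        "this is a simple sentence in english",
        "el rapido zorro marron salta sobre el perro perezoso",
        "esta es una frase sencilla en español",
    ],
    ["english", "english", "spanish", "spanish"],
)
print(model.predict("the dog is lazy"))
```

Summing logarithms rather than multiplying probabilities keeps the score numerically stable for long texts, since a product of many small probabilities would underflow.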

Simple improvement by pre-processing:


To improve the accuracy of the language identification system, we can filter irrelevant or noisy information out of the text so that the classifier focuses on the most informative features. One commonly used noise-filtering technique is stop-word removal, which discards common words that carry little semantic meaning. We define a set of stop words for each language and filter them out during the likelihood calculation in the Naive Bayes classifier.
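As a rough sketch of per-language stop-word filtering, the words can be dropped before n-grams are extracted for a given language's likelihood. The STOP_WORDS lists and the function names remove_stop_words and filtered_ngrams below are hypothetical and deliberately tiny; how the repository wires filtering into its likelihood calculation may differ.

```python
# Hypothetical, deliberately tiny stop-word lists; real lists would be much longer.
STOP_WORDS = {
    "english": {"the", "a", "is", "of", "and"},
    "spanish": {"el", "la", "es", "de", "y"},
}


def remove_stop_words(text, language):
    """Drop the given language's stop words from whitespace-separated text."""
    stop_words = STOP_WORDS.get(language, set())
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


def filtered_ngrams(text, language, n=3):
    """Character n-grams of `text` after removing `language`'s stop words."""
    cleaned = remove_stop_words(text, language)
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


print(remove_stop_words("The cat is on the mat", "english"))
```

Because the filter depends on the candidate language, the same input yields a different n-gram sequence for each language being scored. Splitting on whitespace is also an assumption that does not hold for languages written without spaces.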
Character normalization can also improve the performance of the language identification system by reducing the impact of variations in character encoding, accents, diacritics, and case. In my implementation, I have added a method called "_normalize_text" that uses "unicodedata.normalize". This ensures consistent character representations across different languages and makes the system more robust. Much more advanced language-specific pre-processing is possible, such as handling Joyo Kanji for Japanese, but that is beyond the scope of this repository.
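A normalization step along these lines might look like the sketch below. It is an illustration and not the repository's "_normalize_text" method: the function name normalize_text, the choice of the NFKC form, the use of casefold, and the optional strip_accents flag are all assumptions.

```python
import unicodedata


def normalize_text(text, strip_accents=False):
    """Return a normalized copy of `text` (illustrative; parameters are assumed)."""
    # NFKC composes characters and folds compatibility variants
    # (for example, fullwidth Latin letters) into their standard forms.
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive, Unicode-aware version of lower().
    text = text.casefold()
    if strip_accents:
        # Decompose, then drop combining marks such as acute accents.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text


print(normalize_text("Cafe\u0301"))
```

Unicode normalization by itself only makes equivalent character sequences identical; case folding and accent stripping are separate steps. Accent stripping is left optional here because diacritics are often a strong cue for telling languages apart.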
