GitHub - 30lm32/ml-spam-sms-classification: Naive Bayes, SVM, Random Forest Classifier, and Deep Learning (LSTM) on top of Keras, with Word2Vec and TF-IDF, were used for SMS classification

Which classifier catches all* SPAM SMS?

| Problem | Data | Methods | Libs | Link |
| --- | --- | --- | --- | --- |
| `NLP` | Text | `Naive Bayesian`, `SVM`, `Random Forest Classifier`, `Deep Learning - LSTM`, `Word2Vec` | `Sklearn`, `Keras`, `Gensim`, `Pandas`, `Seaborn` | https://github.com/erdiolmezogullari/ml-spam-sms-classification |

For further ML projects, see my main repo: https://github.com/erdiolmezogullari/ml-projects

In this project, we applied supervised learning (classification) algorithms and deep learning (LSTM).

We used a public SMS Spam dataset, which is not entirely clean. The data consists of two columns (features): context and class. The context column holds the SMS text. The class column is either spam or ham, labeling the corresponding SMS.
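
As a rough illustration of the two-column layout described above, the snippet below builds a tiny stand-in frame with pandas. The column names `context` and `class` follow the description; the sample messages and the extra `label` column are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset would be loaded from a file instead.
df = pd.DataFrame(
    {
        "context": [
            "Are we still meeting for lunch today?",
            "WINNER!! Claim your free prize now, call 0800 000 000",
            "Ok, see you at 5",
        ],
        "class": ["ham", "spam", "ham"],
    }
)

# Encode the target as integers: ham -> 0, spam -> 1 (an assumed convention).
df["label"] = df["class"].map({"ham": 0, "spam": 1})

# Class balance is worth checking early, since spam is often the minority class.
class_counts = df["class"].value_counts()
print(class_counts)
```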

Before applying any supervised learning method, we ran a series of data cleansing operations, since some of the SMS text is broken or messy.
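
The exact cleansing steps are not listed here, so the following is only a minimal sketch of what such a step might look like. The function name `clean_sms` and every rule inside it (unescaping HTML entities, dropping tag-like fragments, collapsing whitespace, lowercasing) are assumptions chosen for illustration.

```python
import html
import re


def clean_sms(text: str) -> str:
    """Normalize one raw SMS string (illustrative rules, not the project's own)."""
    text = html.unescape(text)                    # "&amp;" -> "&", "&lt;" -> "<"
    text = re.sub(r"<[^>]+>", " ", text)          # drop leftover tag-like fragments
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)  # replace control characters
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    return text.strip().lower()


print(clean_sms("  Hello&amp;  WORLD <br> ok\t\n"))
```

Applying a function like this to the `context` column before vectorization keeps the later steps from treating markup debris as vocabulary.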

After obtaining the cleaned dataset, we created tokens and lemmas of the SMS corpus separately using spaCy, and then generated bag-of-words and TF-IDF representations of the corpus, respectively. In addition to these transformations, we also applied SVD and PCA to reduce the dimensionality of the dataset.
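
The chain from raw text to a reduced feature matrix can be sketched with scikit-learn as follows. This is an assumption-laden example: the toy `corpus` is invented, the spaCy tokenization and lemmatization step is left out (the default `CountVectorizer` tokenizer stands in for it), and `TruncatedSVD` with two components is one possible choice of SVD variant and size.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Invented toy corpus standing in for the cleaned SMS texts.
corpus = [
    "free prize call now",
    "see you at lunch",
    "call me after lunch",
    "win a free prize now",
]

# Bag-of-words: one row per SMS, one column per vocabulary term.
bow = CountVectorizer().fit_transform(corpus)

# TF-IDF reweights the raw counts so that very common terms matter less.
tfidf = TfidfTransformer().fit_transform(bow)

# Truncated SVD works directly on the sparse matrix without densifying it.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(bow.shape, tfidf.shape, reduced.shape)
```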

To manage data transformations consistently across the training and testing phases and to avoid data leakage, we used Sklearn's Pipeline class. We added each data transformation step (e.g. bag-of-words, TF-IDF, SVD) and a classifier (e.g. Naive Bayes, SVM, Random Forest Classifier) to an instance of the Pipeline class.
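
A minimal sketch of such a pipeline is shown below, assuming a bag-of-words step, a TF-IDF step, and a multinomial Naive Bayes classifier. The toy `texts` and `labels`, the step names, and the split parameters are invented for illustration. Because the vectorizer is fitted inside `pipe.fit`, its vocabulary and IDF weights come from the training split only, which is how the pipeline guards against leakage.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data; the project uses the cleaned SMS corpus instead.
texts = [
    "free prize call now",
    "win cash claim your prize",
    "urgent free entry text now",
    "cheap loans call today",
    "see you at lunch",
    "call me after the meeting",
    "are you home yet",
    "thanks for dinner last night",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Transformations and the classifier live in one object, so fit() only ever
# sees the training split and predict() reuses the fitted transformations.
pipe = Pipeline(
    [
        ("bow", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", MultinomialNB()),
    ]
)

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
print(list(zip(X_test, predictions)))
```

Swapping the final step for another estimator, such as an SVM or a random forest, leaves the rest of the pipeline unchanged.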

After applying those supervised learning methods, we also applied deep learning, using an LSTM-based architecture. To train the LSTM in Keras (TensorFlow), we needed an embedding matrix for our corpus, so we used Gensim's Word2Vec to obtain it rather than TF-IDF.
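
How word vectors become an embedding matrix can be sketched with NumPy alone. In the snippet below, the `word_vectors` dictionary is a hypothetical stand-in for vectors trained with Gensim's Word2Vec, and `word_index`, `embedding_dim`, and `encode` are illustrative names. A matrix like this would typically serve as the initial weights of the embedding layer in front of the LSTM.

```python
import numpy as np

embedding_dim = 4  # toy size; real Word2Vec vectors are usually much wider

# Stand-in for trained Word2Vec vectors (random values, for illustration only).
rng = np.random.default_rng(0)
word_vectors = {
    word: rng.normal(size=embedding_dim)
    for word in ["free", "prize", "call", "lunch"]
}

# Index 0 is reserved for padding, so real words start at 1.
word_index = {"free": 1, "prize": 2, "call": 3, "lunch": 4, "tomorrow": 5}

# Row i holds the vector of the word with index i; words without a vector stay zero.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, idx in word_index.items():
    vector = word_vectors.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector


def encode(tokens, max_len):
    """Map tokens to indices (unknown -> 0) and right-pad with 0 up to max_len."""
    ids = [word_index.get(token, 0) for token in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


sequence = encode(["free", "prize", "tomorrow"], max_len=5)
print(sequence)
print(embedding_matrix.shape)
```

In this sketch unknown tokens share index 0 with padding, which is a simplification; a dedicated out-of-vocabulary index is a common alternative.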

After running each classifier, we plotted its confusion matrix to compare the classifiers and determine which one is best at filtering SPAM SMS.
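
The comparison can be sketched as follows, with the plotting left out. The `y_true` and `y_pred` lists are invented for illustration and are not results from this project; treating spam as the positive class is also an assumption.

```python
from sklearn.metrics import confusion_matrix

# Invented labels and predictions, not results from the project.
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "ham", "ham", "spam"]

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

# With spam as the positive class, the four cells unpack in this order.
tn, fp, fn, tp = cm.ravel()

spam_recall = tp / (tp + fn)     # share of real spam that was caught
spam_precision = tp / (tp + fp)  # share of flagged messages that were spam

print(cm)
print(spam_recall, spam_precision)
```

For a spam filter, the false-positive cell (ham flagged as spam) is often the costliest error, so it is worth reading alongside recall when ranking classifiers.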

