We provide 4 datasets for Thai text classification in different styles, objectives, and numbers of labels. We also provide preliminary benchmarks using fastText, linear models (LinearSVC and logistic regression), and thai2fit's implementation of ULMFit.
prachathai-67k, truevoice-intent, and all code in this repository are released under the Apache License 2.0 by PyThaiNLP. wisesight-sentiment is released to the public domain under the Creative Commons Zero v1.0 Universal license by Wisesight. wongnai-corpus is released under the GNU Lesser General Public License v3.0 by Wongnai.
Dataset | Style | Objective | Labels | Size |
---|---|---|---|---|
prachathai-67k: body_text | Formal (online newspapers), news | Topic | 12 | 67k |
truevoice-intent: destination | Informal (call center transcription), customer service | Intent | 7 | 16k |
wisesight-sentiment | Informal (social media), conversation/opinion | Sentiment | 4 | 28k |
wongnai-corpus | Informal (review site), restaurant reviews | Sentiment | 5 | 40k |
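For orientation, below is a minimal sketch of inspecting one of the datasets with pandas. The file name `prachathai-67k.csv` and the column layout are assumptions for illustration only; each dataset ships in its own format, so check the respective repositories for download and preprocessing details.

```python
# Minimal sketch: load and inspect a dataset with pandas.
# NOTE: "prachathai-67k.csv" is a hypothetical local path; the actual
# file names and columns are defined by each dataset's repository.
import pandas as pd

df = pd.read_csv("prachathai-67k.csv")
print(df.shape)    # expect on the order of 67k rows
print(df.columns)  # e.g. body_text plus one binary column per topic label
```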
## prachathai-67k: body_text
We benchmark prachathai-67k using body_text as the text feature, framing the task as 12-label multi-label classification. Performance is measured by macro-averaged accuracy and macro-averaged F1 score. The code can be run to reproduce these results in this notebook, which also reports performance metrics by class.
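To make the evaluation concrete, here is a minimal sketch of both metrics on a toy multi-label indicator matrix (3 labels instead of the real 12); the notebook remains the authoritative implementation.

```python
# Toy illustration of macro-averaged accuracy and F1 for multi-label data.
import numpy as np
from sklearn.metrics import f1_score

# Binary indicator matrices: rows are samples, columns are labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Macro-averaged accuracy: accuracy per label, averaged across labels.
macro_acc = (y_true == y_pred).mean(axis=0).mean()

# Macro-averaged F1: F1 per label, averaged across labels.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"macro-accuracy: {macro_acc:.4f}, macro-F1: {macro_f1:.4f}")
```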
Model | macro-accuracy | macro-F1 |
---|---|---|
fastText | 0.9302 | 0.5529 |
LinearSVC | 0.513277 | 0.552801 |
ULMFit | 0.948737 | 0.744875 |
USE | 0.856091 | 0.696172 |
## truevoice-intent: destination
We benchmark truevoice-intent using destination as the target, framing the task as 7-class multi-class classification. Performance is measured by micro-averaged and macro-averaged accuracy and F1 scores. The code can be run to reproduce these results in this notebook, which also reports performance metrics by class.
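As a quick illustration of the difference between the two averaging schemes (on toy class indices, not the actual destination labels):

```python
# Toy illustration of micro- vs macro-averaged F1 for multi-class data.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 3]
y_pred = [0, 1, 2, 1, 1, 0, 2]

# Micro-averaging pools every prediction before scoring, so frequent
# classes dominate the result.
micro_f1 = f1_score(y_true, y_pred, average="micro")

# Macro-averaging scores each class separately and then averages,
# weighting rare classes the same as frequent ones.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"micro-F1: {micro_f1:.4f}, macro-F1: {macro_f1:.4f}")
```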
Model | macro-accuracy | micro-accuracy | macro-F1 | micro-F1 |
---|---|---|---|---|
LinearSVC | 0.957806 | 0.95747712 | 0.869411 | 0.85116993 |
ULMFit | 0.955066 | 0.84273111 | 0.852149 | 0.84273111 |
BERT | 0.8921 | 0.85 | 0.87 | 0.85 |
USE | 0.943559 | 0.94355855 | 0.787686 | 0.802455 |
## wisesight-sentiment
Performance of wisesight-sentiment is measured on the test set of the WISESIGHT Sentiment Analysis competition. The code can be run to reproduce these results in this notebook.
Disclaimer: the labels were annotated manually and are prone to errors, so if you plan to apply the models in this benchmark to real-world applications, be sure to evaluate them on your own dataset first.
Model | Public Accuracy | Private Accuracy |
---|---|---|
Logistic Regression | 0.72781 | 0.7499 |
FastText | 0.63144 | 0.6131 |
ULMFit | 0.71259 | 0.74194 |
ULMFit Semi-supervised | 0.73119 | 0.75859 |
ULMFit Semi-supervised Repeated One Time | 0.73372 | 0.75968 |
USE | 0.63987* | - |

\* Evaluated after the competition, with a test set that was cleaned from 3,946 rows to 2,674 rows.
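For reference, here is a minimal sketch of a linear baseline in the spirit of the logistic regression row above: TF-IDF features over PyThaiNLP word tokens feeding scikit-learn's LogisticRegression. The texts and labels below are toy placeholders, not the actual WISESIGHT data; the benchmark's real preprocessing lives in the notebook.

```python
# Sketch of a TF-IDF + logistic regression sentiment baseline.
# The three training examples are toy placeholders for illustration.
from pythainlp.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["อาหารอร่อยมาก", "บริการแย่มาก", "เฉย ๆ ธรรมดา"]
labels = ["pos", "neg", "neu"]

model = make_pipeline(
    # Tokenize Thai text with pythainlp; token_pattern=None silences the
    # warning about the unused default regex.
    TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None,
                    ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["อร่อยสุด ๆ"]))  # predicted label for a new review
```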
## wongnai-corpus
Performance of wongnai-corpus is measured on the test set of the Wongnai Challenge: Review Rating Prediction competition. The code can be run to reproduce these results in this notebook.
Model | Public Micro-F1 | Private Micro-F1 |
---|---|---|
ULMFit Knight | 0.61109 | 0.62580 |
ULMFit | 0.59313 | 0.60322 |
fastText | 0.5145 | 0.5109 |
LinearSVC | 0.5022 | 0.4976 |
Kaggle Score | 0.59139 | 0.58139 |
BERT | 0.56612 | 0.57057 |
USE | 0.42688 | 0.41031 |
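And a minimal sketch of the fastText supervised baseline, assuming a training file already written in fastText's `__label__` format; `train.txt` is a hypothetical path, and the actual preparation steps are in the notebook.

```python
# Sketch of a fastText supervised classifier for review ratings.
# Each line of train.txt (hypothetical path) looks like:
#   __label__4 <space-separated tokens of a review>
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    epoch=5,
    wordNgrams=2,  # word bigrams often help short-text classification
)

# predict() expects pre-tokenized (space-separated) text.
labels, probs = model.predict("อาหาร อร่อย บริการ ดี")
print(labels, probs)
```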
## Citation

```
@software{cstorm125_2020_3852912,
  author    = {cstorm125 and lukkiddd},
  title     = {PyThaiNLP/classification-benchmarks: v0.1-alpha},
  month     = may,
  year      = 2020,
  publisher = {Zenodo},
  version   = {v0.1-alpha},
  doi       = {10.5281/zenodo.3852912},
  url       = {https://doi.org/10.5281/zenodo.3852912}
}
```
## Acknowledgments

- Ekapol Chuangsuwanich for pioneering wongnai-corpus, wisesight-sentiment, and truevoice-intent in his NLP classes at Chulalongkorn University.
- @lukkiddd for data exploration and linear model code.