All credit goes to fxsjy/jieba.
Find out more at: https://github.com/fxsjy/jieba
- Equipped with emoticon detection
  - Emoticons will not be segmented into sequences of meaningless punctuation.
- Trained on the Sinica Corpus
  - Results are more accurate when dealing with Traditional Chinese (F1-score = 0.91).
- Uses a Brill tagger
  - The tagger is trained on the Sinica Treebank, which raises the accuracy of POS tagging.
Supported Python versions:
- Python 2.7+
- Python 3.3+
Install with pip:

```
pip install -U jseg
```
```python
from jseg import Jieba

j = Jieba()

# Optionally register a guaranteed word list: `lst` is a list of words
# that should survive segmentation intact.
j.add_guaranteed_wordlist(lst)
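```

For example, assuming the word list is simply a Python list of strings (a minimal sketch; the word below is made up for illustration and is not from the jseg docs):

```python
# Hypothetical example word; after registering it, the segmenter should
# keep it as one token instead of splitting it into smaller pieces.
lst = ['喵星人']
j.add_guaranteed_wordlist(lst)
```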
Here's a sample text:

```python
sample = '期末要爆炸啦! ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣'
```
Segmentation with POS (part-of-speech) tags:

```python
j.seg(sample, pos=True)
```
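A minimal end-to-end sketch, assuming `j.seg` also accepts `pos=False` for plain segmentation and that its return value can be printed directly (both are assumptions, not confirmed by this README):

```python
from jseg import Jieba

j = Jieba()
sample = '期末要爆炸啦! ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣'

# Segmentation with POS tags attached to each token.
result_with_pos = j.seg(sample, pos=True)
print(result_with_pos)

# Plain segmentation (assumed: pos=False disables POS tagging).
result_plain = j.seg(sample, pos=False)
print(result_plain)
```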