
Add basic Chinese support #93

Merged · 1 commit merged into miso-belica:dev on Jul 22, 2017
Conversation

astropeak (Contributor)
Hi,

I added basic support for Chinese. This modification is straightforward since the code already supports Japanese.

Word tokenization is provided by jieba, a widely used library for Chinese word segmentation.
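For reference, a minimal sketch of how jieba segments a sentence (the sample sentence comes from jieba's own README):

```python
import jieba

# Default (precise) mode; jieba.cut returns a generator of word strings.
words = list(jieba.cut("我来到北京清华大学"))
print("/".join(words))  # 我/来到/北京/清华大学
```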

I found the stopwords list at
https://raw.githubusercontent.com/6/stopwords-json/master/dist/zh.json

I have only tested it lightly, but the results look quite good.
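For anyone who wants to try this from Python, here is a minimal sketch; the language code `"chinese"` matches this PR, while the input text, summarizer choice, and sentence count are just placeholders:

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.utils import get_stop_words

# Parse raw Chinese text with the new tokenizer (the text is a placeholder).
parser = PlaintextParser.from_string(
    u"我来到北京清华大学。这里的风景很美。", Tokenizer("chinese"))

summarizer = LsaSummarizer()
summarizer.stop_words = get_stop_words("chinese")

# Print a two-sentence summary.
for sentence in summarizer(parser.document, 2):
    print(sentence)
```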

miso-belica merged commit 479ac3b into miso-belica:dev on Jul 22, 2017
miso-belica (Owner)

Thanks a lot :)

ArtificialNotImbecile commented May 11, 2018

This only works with the --url option; when I use the --file_path option, it returns nothing.

miso-belica (Owner) commented May 11, 2018

@ArtificialNotImbecile Hi, that's weird. Can you please open an issue with the URL and command that work, and attach the file and command that do not?

ArtificialNotImbecile
@miso-belica I cannot reproduce my bug, so most likely my environment had a problem. I updated my Ubuntu 16.04 today and it now works perfectly fine for both Chinese and English, with both --url and --file_path :)
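For completeness, the file-based equivalent through the Python API looks like this; the file name is a placeholder, and the file is assumed to be UTF-8 encoded plain text:

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# Read a local file instead of fetching a URL; "article_zh.txt" is a
# placeholder assumed to contain UTF-8 encoded Chinese plain text.
parser = PlaintextParser.from_file("article_zh.txt", Tokenizer("chinese"))
print(parser.document)
```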
