Some Chinese sentences are detected as Japanese #84

Open
kewang opened this issue Apr 7, 2020 · 4 comments · May be fixed by #121

Comments

kewang commented Apr 7, 2020

sentence 1

特別推薦的必訪店家「ヤマシロヤ」,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面

jpn 1
Google Translate correctly identifies it as Chinese.

sentence 2

特別推薦的必訪店家,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面

cmn 1
Google Translate correctly identifies it as Chinese.

Sentence 1 consists almost entirely of Chinese characters and contains only 5 katakana characters, yet it is incorrectly detected as jpn.

Sentence 2 is written entirely in Chinese characters, and it is correctly detected as cmn.
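
The character makeup described above can be checked with a short script. This is only a sketch: `countScripts` is a hypothetical helper written for this illustration (it is not part of the library under discussion), and it relies on the Unicode property escapes built into JavaScript regular expressions.

```javascript
// Hypothetical helper (not a library API): count how many characters of a
// string belong to each of the scripts relevant for CJK detection.
function countScripts(value) {
  const counts = {han: 0, hiragana: 0, katakana: 0, hangul: 0}

  // Iterate by code point so characters outside the BMP are handled correctly.
  for (const char of value) {
    if (/\p{Script=Han}/u.test(char)) counts.han++
    else if (/\p{Script=Hiragana}/u.test(char)) counts.hiragana++
    else if (/\p{Script=Katakana}/u.test(char)) counts.katakana++
    else if (/\p{Script=Hangul}/u.test(char)) counts.hangul++
    // Anything else (Latin letters, punctuation, brackets) is ignored.
  }

  return counts
}

const sentence1 = '特別推薦的必訪店家「ヤマシロヤ」,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面'
const sentence2 = '特別推薦的必訪店家,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面'

console.log(countScripts(sentence1))
console.log(countScripts(sentence2))
```

The two sentences have the same number of Han characters; the only difference is the 5 katakana characters in sentence 1.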

This may be related to #77.

kewang changed the title from "Some Chinese sentence are detected as Japanese" to "Some Chinese sentences are detected as Japanese" on Apr 7, 2020
Owner

wooorm commented Apr 7, 2020

Thanks. I don’t read, write, or speak Japanese or Chinese, so I can’t really help. PRs like GH-77 are welcome!

Author

kewang commented Apr 12, 2020

Hi @wooorm, @the-worldly-monkey

From https://www.unicode.org/faq/han_cjk.html#4 (How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?)

A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.

Based on that FAQ, I will add some extra rules to getTopScript(value, scripts) for detecting CJK sentences.
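
One way to read the "fair amount of kana" advice is as a ratio check. The sketch below is an illustration only and not what any PR for this issue implements: `kanaRatio`, `guessHanLanguage`, and the `KANA_THRESHOLD` value of 0.2 are all assumptions made up for this example, since the FAQ gives no number.

```javascript
// Hypothetical helper: share of kana among all kana + Han characters.
function kanaRatio(value) {
  const kana = (value.match(/[\p{Script=Hiragana}\p{Script=Katakana}]/gu) || []).length
  const han = (value.match(/\p{Script=Han}/gu) || []).length
  const total = kana + han

  // Avoid dividing by zero for text without any kana or Han characters.
  return total === 0 ? 0 : kana / total
}

// Assumed cutoff for "a fair amount of kana"; chosen for illustration only.
const KANA_THRESHOLD = 0.2

// Hypothetical decision rule for text already known to be mostly Han/kana.
function guessHanLanguage(value) {
  return kanaRatio(value) >= KANA_THRESHOLD ? 'jpn' : 'cmn'
}

const chineseWithKatakana = '特別推薦的必訪店家「ヤマシロヤ」,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面'
const japanese = 'これは日本語の文章です'

console.log(guessHanLanguage(chineseWithKatakana))
console.log(guessHanLanguage(japanese))
```

With this cutoff, sentence 1 from the report falls below the threshold (5 kana out of 36 kana and Han characters), while an ordinary Japanese sentence is well above it. A fixed ratio will still misclassify short Japanese text written mostly in kanji, so the threshold would need tuning against real data.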

niftylettuce commented

@kewang PR would be great on this!!

Contributor

lorumic commented Nov 13, 2024

Hello, I have just found this issue again after many years, and since I was involved in the original change that caused it (#77), I decided to try to find a way to avoid these false jpn positives. Please find my attempt here: #121 - any feedback is always welcome!
