Use languages' alphabets to make detection more accurate #83
Comments
That’s a good idea, it’s similar to how Google works! We could do something with a special character list that boosts the scores of certain scripts? I remember there is a Turkish i variant that isn’t used anywhere else as well, but I forgot what it was. |
The dotless i (ı) is used not only in Turkish. Other languages whose alphabets are based on the Turkish alphabet have it too. E.g. Azerbaijani and Crimean Tatar. |
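A minimal sketch of the "special character list" idea, assuming a hand-written table that maps a distinctive character to every language whose alphabet includes it. The table `DISTINCTIVE`, the function `boostByDistinctiveChars`, the boost factor and the language codes are all hypothetical and not part of franc; `scores` is assumed to be a list of `[language, score]` pairs.

```javascript
// Hypothetical table: distinctive characters mapped to the languages that use them.
// The codes are illustrative and may not match the codes franc actually reports.
const DISTINCTIVE = {
  'ı': ['tur', 'aze', 'crh'],
  'İ': ['tur', 'aze', 'crh']
};

// Multiply the score of every language hinted at by a distinctive character.
// Each language is boosted at most once, however often the character occurs.
function boostByDistinctiveChars(text, scores, boost = 1.5, table = DISTINCTIVE) {
  const hinted = new Set();
  for (const ch of text) {
    for (const lang of table[ch] || []) {
      hinted.add(lang);
    }
  }
  return scores
    .map(([lang, score]) => [lang, hinted.has(lang) ? score * boost : score])
    .sort((a, b) => b[1] - a[1]);
}
```

Because ı maps to several languages, it can only narrow the candidates down to that group; it cannot pick Turkish over Azerbaijani on its own.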
Scripts like Latin, Cyrillic, etc.? You meant languages, not scripts then, right? |
It's not only a matter of which characters the alphabet has, it's also about which ones it doesn't. In |
@wooorm Do you happen to know a programmatic way to get the alphabet (the set of used characters) for a given language? |
I think it’s vague what even an alphabet is, but I did find this list on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Language_recognition_chart. Interesting stuff! Franc supports the most languages possible, as it uses the biggest training set (UDHR). It’s designed to not discriminate against languages with few speakers, and I can see how adding a feature such as this would (because there is no data about alphabets for lots of languages). There are projects that focus on fewer languages and do things like what you’re proposing. Have you looked at https://github.com/CLD2Owners/cld2? |
I thought I saw something on the Unicode site where for each character there was information by which languages it is used, but now I can't find it.
Right. Some characters sometimes aren't considered separate letters of the alphabet (e.g. umlauts in German), etc. That's why I wrote "alphabet (the set of used characters)". |
I don’t think there’s an automated way to do it. I think it could be possible to either do it character-based, e.g., like so:
Or based on n-grams/regexes:
But this is an error-prone and “soft” approach, compared to the current “hard” data model. An alternative idea is to look at the
And Macedonian:
These are already mostly ordered from frequent to infrequent. |
Found it! http://cldr.unicode.org/translation/-core-data/exemplars Letter frequency is an important thing too, but on the other hand letters that are unique to some language are often infrequent in it. E.g. |
@thorn0 Is this something you’d be interested to work on? |
It's unlikely I'll have time for this any time soon. |
@wooorm Yes, ı and İ are specific to Turkish. |
Что это за язык?
is a Russian sentence, which is detected as Bulgarian (bul 1, rus 0.938953488372093, mkd 0.9353197674418605). However, neither Bulgarian nor Macedonian has the letters э and ы in its alphabet. Same with
Чекаю цієї хвилини.
, which is Ukrainian, but is detected as Northern Uzbek with probability 1 whereas Ukrainian gets only 0.33999999999999997. However, the letters є and ї are used only in Ukrainian whereas the Uzbek Cyrillic alphabet doesn't include as many as five letters from this sentence, namely: ю, ц, і, є and ї. I know that Franc is not expected to do well with short input strings, but taking alphabets into account seems to be a promising way to improve the accuracy.
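To make the proposal concrete, here is a rough sketch of how detection scores could be re-weighted by alphabet coverage. It is an assumption about how such a step might look, not franc's implementation: `ALPHABETS` and `rescoreByAlphabet` are hypothetical names, the alphabets are typed in by hand for four languages only, and `scores` is assumed to be a list of `[language, score]` pairs like the ones quoted above.

```javascript
// Hypothetical hand-written alphabets (lowercase letters only), keyed by language code.
const ALPHABETS = {
  rus: new Set('абвгдеёжзийклмнопрстуфхцчшщъыьэюя'),
  bul: new Set('абвгдежзийклмнопрстуфхцчшщъьюя'),
  mkd: new Set('абвгдѓежзѕијклљмнњопрстќуфхцчџш'),
  ukr: new Set('абвгґдеєжзиіїйклмнопрстуфхцчшщьюя')
};

// Scale each score by the fraction of the input's letters found in that
// language's alphabet. Languages without alphabet data keep their score.
function rescoreByAlphabet(text, scores, alphabets = ALPHABETS) {
  const letters = Array.from(text.toLowerCase()).filter((ch) => /\p{L}/u.test(ch));
  if (letters.length === 0) {
    return scores.slice();
  }
  return scores
    .map(([lang, score]) => {
      const alphabet = alphabets[lang];
      if (!alphabet) {
        return [lang, score];
      }
      const covered = letters.filter((ch) => alphabet.has(ch)).length;
      return [lang, score * (covered / letters.length)];
    })
    .sort((a, b) => b[1] - a[1]);
}

// The Russian example above: э and ы are missing from the Bulgarian and
// Macedonian alphabets, so both candidates are penalised.
const rescored = rescoreByAlphabet('Что это за язык?', [
  ['bul', 1],
  ['rus', 0.938953488372093],
  ['mkd', 0.9353197674418605]
]);
console.log(rescored[0][0]); // rus
```

Scaling by coverage rather than discarding a language outright keeps the step tolerant of loanwords and names, and languages with no alphabet data are left untouched, so they are not penalised for the missing data.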