Use languages' alphabets to make detection more accurate #83

Open
thorn0 opened this issue Feb 17, 2020 · 15 comments

Comments

@thorn0

thorn0 commented Feb 17, 2020

Что это за язык? is a Russian sentence, which is detected as Bulgarian (bul 1, rus 0.938953488372093, mkd 0.9353197674418605). However, neither Bulgarian nor Macedonian has the letters э and ы in its alphabet.

Same with Чекаю цієї хвилини., which is Ukrainian, but is detected as Northern Uzbek with probability 1 whereas Ukrainian gets only 0.33999999999999997. However, the letters є and ї are used only in Ukrainian whereas the Uzbek Cyrillic alphabet doesn't include as many as five letters from this sentence, namely: ю, ц, і, є and ї.

I know that Franc isn't expected to do well with short input strings, but taking alphabets into account seems to be a promising way to improve the accuracy.
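
A minimal sketch of the idea, not Franc code: it lists the letters of an input string that a candidate language's alphabet lacks. The `ALPHABETS` table and the `lettersOutsideAlphabet` helper are hypothetical names, and the two alphabets are typed by hand for illustration.

```javascript
// Hypothetical alphabet table, typed by hand for illustration; not Franc data.
const ALPHABETS = {
  rus: new Set('абвгдеёжзийклмнопрстуфхцчшщъыьэюя'),
  bul: new Set('абвгдежзийклмнопрстуфхцчшщъьюя'),
};

// Return the distinct letters of `text` that are missing from the alphabet of `lang`.
function lettersOutsideAlphabet(text, lang) {
  const alphabet = ALPHABETS[lang];
  const missing = new Set();
  for (const ch of text.toLowerCase()) {
    // \p{L} matches any Unicode letter, so spaces and punctuation are skipped.
    if (/\p{L}/u.test(ch) && !alphabet.has(ch)) missing.add(ch);
  }
  return [...missing];
}

console.log(lettersOutsideAlphabet('Что это за язык?', 'bul')); // letters Bulgarian lacks
console.log(lettersOutsideAlphabet('Что это за язык?', 'rus')); // nothing missing
```

A non-empty result for a candidate language is a hint that the candidate is wrong, though, as discussed below, probably not proof.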

@wooorm
Owner

wooorm commented Feb 17, 2020

That’s a good idea, it’s similar to how Google works!
However, I don’t think it should be so “black and white”, as “the letter ы is not available in Bulgarian or Macedonian” should still be matched as English.

We could do something with a special character list that enhances scores of certain scripts?

I remember there is a turkish i variant that isn’t used anywhere else as well, forgot what it was tho

@thorn0
Author

thorn0 commented Feb 17, 2020

The dotless i (ı) is used not only in Turkish. Other languages whose alphabets are based on the Turkish alphabet have it too. E.g. Azerbaijani and Crimean Tatar.

@thorn0
Author

thorn0 commented Feb 17, 2020

> We could do something with a special character list that enhances scores of certain scripts?

Scripts like Latin, Cyrillic, etc.? You meant languages, not scripts then, right?

@thorn0
Author

thorn0 commented Feb 17, 2020

It's not only a matter of which characters the alphabet has, it's also about which ones it doesn't. In Чекаю цієї хвилини., there are 5 letters that aren't in the Uzbek alphabet. It's 31% of all the letters in the string. In no way should Uzbek get the highest ranking in such a situation.
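
The 31% figure can be checked with a few lines. This sketch only counts occurrences of the five letters named above (ю, ц, і, є, ї); it does not contain an Uzbek alphabet, and `shareOfLetters` is a hypothetical helper name.

```javascript
// The five letters the comment above names as missing from the Uzbek alphabet.
const namedLetters = new Set(['ю', 'ц', 'і', 'є', 'ї']);

// Share of letter occurrences in `text` that belong to `letters`.
function shareOfLetters(text, letters) {
  // Keep letters only, so spaces and the final period are not counted.
  const all = [...text.toLowerCase()].filter((ch) => /\p{L}/u.test(ch));
  const hits = all.filter((ch) => letters.has(ch));
  return { total: all.length, hits: hits.length, share: hits.length / all.length };
}

console.log(shareOfLetters('Чекаю цієї хвилини.', namedLetters));
// total: 16, hits: 5, share: 0.3125
```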

@thorn0
Author

thorn0 commented Feb 18, 2020

@wooorm Do you happen to know a programmatic way to get the alphabet (the set of used characters) for a given language?

@wooorm
Owner

wooorm commented Feb 18, 2020

I think it’s vague what even an alphabet is, but I did find this list on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Language_recognition_chart. Interesting stuff!

Franc supports the most languages possible, as it uses the biggest training set (UDHR). It’s designed to not discriminate against languages with few speakers, and I can see how adding a feature such as this would (because there is no data about alphabets for lots of languages).

There are projects that focus on fewer languages and do things like what you’re proposing. Have you looked at https://github.com/CLD2Owners/cld2?

@thorn0
Author

thorn0 commented Feb 18, 2020

I thought I saw something on the Unicode site where for each character there was information by which languages it is used, but now I can't find it.

> I think it’s vague what even an alphabet is

Right. Some characters sometimes aren't considered separate letters of the alphabet (e.g. umlauts in German), etc. That's why I wrote "alphabet (the set of used characters)".

@wooorm
Owner

wooorm commented Feb 18, 2020

I don’t think there’s an automated way to do it.


I think it could be possible to either do it character-based, e.g., like so:

  "э": {
     "bul": -3,
     "mkd": -3,
     "rus": 3,
     "bel": 3,
     // ...or so
  }

Or based on n-grams/regexes:

  "tje$": [["nld", 2]]
  "^z": [["nld", 1]]

But this is an error-prone and “soft” approach, compared to the current “hard” data-model
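
One way the “soft” approach above could be applied to a score list, as a sketch under stated assumptions: the `[language, score]` pair shape mirrors the scores quoted at the top of this issue, `adjustScores` and the `step` scale are hypothetical, and the weights are the illustrative values from the sketch, not tuned data. Scores are not renormalized afterwards.

```javascript
// Weights copied from the sketch above; the values are illustrative, not tuned.
const charWeights = {
  'э': { bul: -3, mkd: -3, rus: 3, bel: 3 },
};
// Pattern rules from the sketch above, written as regular expressions.
const patternWeights = [
  [/tje$/, { nld: 2 }],
  [/^z/, { nld: 1 }],
];

// Hypothetical re-ranking: shift each score by `step` per weight point, then sort.
function adjustScores(text, scores, step = 0.05) {
  const bonus = {};
  const add = (weights) => {
    for (const [lang, w] of Object.entries(weights)) bonus[lang] = (bonus[lang] || 0) + w;
  };
  for (const ch of text.toLowerCase()) if (charWeights[ch]) add(charWeights[ch]);
  for (const [re, weights] of patternWeights) if (re.test(text)) add(weights);
  return scores
    .map(([lang, score]) => [lang, score + step * (bonus[lang] || 0)])
    .sort((a, b) => b[1] - a[1]);
}

// Scores as reported at the top of this issue.
const reported = [['bul', 1], ['rus', 0.938953488372093], ['mkd', 0.9353197674418605]];
console.log(adjustScores('Что это за язык?', reported));
```

With these made-up weights, the single э in the sentence is enough to move rus above bul; how large `step` should be in practice is an open question.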


An alternative idea is to look at the TRY field in hunspell dictionaries.
E.g., the Russian dictionary defines:

TRY оаитенрсвйлпкьыяудмзшбчгщюжцёхфэъАВСМКГПТЕИЛФНДОЭРЗЮЯБХЖШЦУЧЬЫЪЩЙЁ

And Macedonian:

TRY аеоинвтрслпкудмзбчгјшцњжфхќџѓљѕѐѝАЕОИНВТРСЛПКУДМЗБЧГЈШЦЊЖФХЌЏЃЉЅЀЍ-’!.

These are mostly already ordered from frequent to infrequent.
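
A minimal sketch of turning such a line into a usable character set, assuming the line is passed in as a plain string; `parseTryLine` is a hypothetical helper, not part of hunspell or Franc.

```javascript
// Turn a hunspell `TRY` line into a set of lowercase letters.
function parseTryLine(line) {
  const chars = line.replace(/^TRY\s+/, '');
  const letters = new Set();
  for (const ch of chars) {
    // Keep letters only; this drops entries such as "-", "!" and ".".
    if (/\p{L}/u.test(ch)) letters.add(ch.toLowerCase());
  }
  return letters;
}

const rus = parseTryLine('TRY оаитенрсвйлпкьыяудмзшбчгщюжцёхфэъАВСМКГПТЕИЛФНДОЭРЗЮЯБХЖШЦУЧЬЫЪЩЙЁ');
const mkd = parseTryLine('TRY аеоинвтрслпкудмзбчгјшцњжфхќџѓљѕѐѝАЕОИНВТРСЛПКУДМЗБЧГЈШЦЊЖФХЌЏЃЉЅЀЍ-’!.');

console.log(rus.has('э'), mkd.has('э')); // true false
console.log(rus.has('ы'), mkd.has('ы')); // true false
```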

@thorn0
Author

thorn0 commented Feb 18, 2020

Found it! http://cldr.unicode.org/translation/-core-data/exemplars

Letter frequency is an important thing too, but on the other hand letters that are unique to some language are often infrequent in it. E.g. ѕ (Cyrillic) in Macedonian and є in Ukrainian.
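
A sketch of how letters unique to one language could be found once character sets are available. It uses only the two hunspell `TRY` lines quoted earlier in this thread, so “unique” here means unique among the languages being compared, nothing more. `exclusiveLetters` is a hypothetical helper.

```javascript
// Lowercase letters taken from the two hunspell TRY lines quoted earlier in this thread.
const letterSets = {
  rus: new Set('оаитенрсвйлпкьыяудмзшбчгщюжцёхфэъ'),
  mkd: new Set('аеоинвтрслпкудмзбчгјшцњжфхќџѓљѕѐѝ'),
};

// For each language, keep the letters that no other language in `sets` uses.
function exclusiveLetters(sets) {
  const result = {};
  for (const [lang, letters] of Object.entries(sets)) {
    result[lang] = [...letters].filter((ch) =>
      Object.entries(sets).every(([other, s]) => other === lang || !s.has(ch))
    );
  }
  return result;
}

const exclusive = exclusiveLetters(letterSets);
console.log(exclusive.rus.join(' ')); // letters only in the Russian set
console.log(exclusive.mkd.join(' ')); // letters only in the Macedonian set
```

Because such letters are often rare, a detector would probably need to treat a single occurrence as a strong signal while not expecting one in every input.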

@wooorm
Owner

wooorm commented Feb 19, 2020

Nice, we can crawl them from cldr: bg, ru, mk
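
CLDR publishes exemplar characters as bracketed sets, so a crawler would need a small parser. The sketch below is an assumption-heavy simplification: `parseExemplars` is hypothetical, it handles only single characters, `x-y` ranges and `{multi}` sequences separated by whitespace, and it ignores escapes and the rest of the set syntax. The sample input is made up in that format, not copied from CLDR.

```javascript
// Parse a simplified exemplar set such as "[a b c-e {ch}]" into a Set of strings.
// Only single characters, "x-y" ranges and "{multi}" sequences are handled.
function parseExemplars(unicodeSet) {
  const body = unicodeSet.trim().replace(/^\[/, '').replace(/\]$/, '');
  const result = new Set();
  for (const token of body.split(/\s+/).filter(Boolean)) {
    const chars = [...token];
    if (token.startsWith('{') && token.endsWith('}')) {
      result.add(token.slice(1, -1)); // multi-character sequence, e.g. a digraph
    } else if (chars.length === 3 && chars[1] === '-') {
      // Expand a range by walking the code points between its two ends.
      const start = chars[0].codePointAt(0);
      const end = chars[2].codePointAt(0);
      for (let cp = start; cp <= end; cp++) result.add(String.fromCodePoint(cp));
    } else {
      for (const ch of chars) result.add(ch);
    }
  }
  return result;
}

// Made-up input in the exemplar format, not copied from CLDR.
const sample = parseExemplars('[а-д е ё {дж}]');
console.log([...sample].join(' '));
```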

@wooorm
Owner

wooorm commented Mar 27, 2020

@thorn0 Is this something you’d be interested to work on?

@thorn0
Author

thorn0 commented Mar 27, 2020

It's unlikely I'll have time for this any time soon.

@niftylettuce

@thorn0 @wooorm I would put a $50 bug bounty on this payable by PayPal if anyone had the time!

@muratcorlu

> I remember there is a turkish i variant that isn’t used anywhere else as well, forgot what it was tho

@wooorm Yes, ı and İ are specific to Turkish.
