Missing "'" sign #8
Comments
I can check. Since I don't know the languages themselves, it is difficult to get this right using English logic. Let me take a look.
Thank you! Let me know if I can help in any way!
Almost a year too late. With #13 I am generating the latest dataset. I have followed some suggestions and added language detection. There is no longer any logic to split words. I haven't gotten to all languages yet; however, Bulgarian has been processed if you would like to have a look and validate. I am generating the rest on a low-spec machine, so it may take a few days.
This should be fixed. All Cyrillic word lists for 2018 have been generated without any word splitting.
In Ukrainian (and Russian, Bulgarian) there are plenty of words with the "'" sign in them. I believe it is a completely different character than the Latin "'". It's not like in English, where you can drop what follows the "'" and the word will still make sense ("he's" becomes "he"). For example, the Ukrainian word for "computer" is "комп'ютер", and "комп" does not mean anything on its own. There are hundreds of words like that.
https://en.wikipedia.org/wiki/Ukrainian_alphabet#Letter_names_and_pronunciation
Can anyone change that and rerun the word calculations for Ukrainian?
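To illustrate what I mean, here is a minimal Python sketch of a tokenizer that keeps the apostrophe when it sits between two letters. This is not the project's actual code: the names `APOSTROPHES`, `tokenize`, and `naive_tokenize` are made up for the example, and the set of apostrophe-like characters (ASCII `'`, U+2019, U+02BC) is my assumption about what shows up in real text.

```python
import re

# Assumed set of characters used as an apostrophe in Ukrainian text:
# ASCII apostrophe, right single quotation mark, modifier letter apostrophe.
APOSTROPHES = "'\u2019\u02bc"

# Map the non-ASCII variants to a plain "'" so one word is not counted
# under several spellings.
_NORMALIZE = str.maketrans({"\u2019": "'", "\u02bc": "'"})

# A token is a run of letters, optionally continued by an apostrophe
# that is immediately followed by more letters.
# [^\W\d_] matches any Unicode letter (word characters minus digits and "_").
TOKEN_RE = re.compile(rf"[^\W\d_]+(?:[{APOSTROPHES}][^\W\d_]+)*")

# For comparison: splitting on every non-letter breaks such words apart.
NAIVE_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Return lowercase tokens, keeping word-internal apostrophes."""
    return [m.group(0).translate(_NORMALIZE).lower()
            for m in TOKEN_RE.finditer(text)]


def naive_tokenize(text):
    """Return lowercase tokens, treating the apostrophe as a separator."""
    return [m.group(0).lower() for m in NAIVE_RE.finditer(text)]


if __name__ == "__main__":
    sample = "Мій комп'ютер і п\u2019ять книг"
    print(naive_tokenize(sample))  # splits "комп'ютер" into two fragments
    print(tokenize(sample))        # keeps "комп'ютер" and "п'ять" whole
```

With the naive version, "комп'ютер" is counted as the two fragments "комп" and "ютер", which is exactly the problem described above. Apostrophes at the start or end of a word (for example, quotation marks) are still dropped by the sketch, since they are not between letters.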