Missing "'" sign #8

Closed
alassak opened this issue Mar 3, 2018 · 4 comments

Comments

@alassak

alassak commented Mar 3, 2018

In Ukrainian (and also Russian and Bulgarian) there are plenty of words containing the "'" sign. I believe it is a completely different character from the Latin "'". It's not like English, where you can drop the "'" and the word still makes sense ("he's" becomes "he"). For example, the Ukrainian word for "computer" is "комп'ютер", and "комп" does not mean anything on its own. There are hundreds of words like that.

https://en.wikipedia.org/wiki/Ukrainian_alphabet#Letter_names_and_pronunciation

Could anyone change that and rerun the word calculations for Ukrainian?
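To illustrate the problem described above, here is a minimal Python sketch. It is not the project's actual code: the function names `naive_tokens` and `apostrophe_aware_tokens` and the choice of apostrophe characters (U+0027 and U+2019) are assumptions made for illustration. It contrasts a tokenizer that treats the apostrophe as a separator with one that keeps it when it sits between letters.

```python
import re

# Assumed set of apostrophe characters: ASCII apostrophe (U+0027) and
# right single quotation mark (U+2019). Real corpora may use others.
APOSTROPHES = "'\u2019"

# A word is a run of letters, optionally joined by single apostrophes.
WORD_RE = re.compile(rf"[^\W\d_]+(?:[{APOSTROPHES}][^\W\d_]+)*")


def naive_tokens(text):
    # Every non-letter, including the apostrophe, acts as a separator.
    return re.findall(r"[^\W\d_]+", text.lower())


def apostrophe_aware_tokens(text):
    # An apostrophe is kept only when it has letters on both sides.
    return WORD_RE.findall(text.lower())


sample = "Мій комп'ютер працює"
print(naive_tokens(sample))             # ['мій', 'комп', 'ютер', 'працює']
print(apostrophe_aware_tokens(sample))  # ['мій', "комп'ютер", 'працює']
```

With the naive version, "комп" and "ютер" would each be counted as separate words in a frequency list, which matches the behaviour reported in this issue.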

@hermitdave
Owner

I can check. Without knowing the languages themselves, it is difficult to get this right using English logic. Let me take a look.

@alassak
Author

alassak commented Mar 18, 2018

Thank you! Let me know if I can help in any way!

@hermitdave
Owner

Almost a year too late. With #13 I am generating the latest dataset. I have followed some suggestions and added language detection. There is no longer any logic to split words. I haven't gotten to all languages yet; however, Bulgarian has been processed if you would like to have a look and validate.

I am generating the rest on a low-spec machine, so it may take a few days.
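The generation script itself is not shown in this thread, so the following is only a rough Python sketch of what counting "without word splitting" could look like. The function name `count_words` and the set of edge punctuation are hypothetical. Tokens are split on whitespace only, and punctuation is stripped from the edges of a token, never from inside it.

```python
from collections import Counter

# Assumed punctuation to trim from token edges; the apostrophe is
# deliberately left out so that it is never removed.
EDGE_PUNCTUATION = ".,!?;:\"()[]«»…-"


def count_words(lines):
    counts = Counter()
    for line in lines:
        # Split on whitespace only; the inside of a token is left untouched.
        for raw in line.lower().split():
            token = raw.strip(EDGE_PUNCTUATION)
            if token:
                counts[token] += 1
    return counts


counts = count_words(["Комп'ютер і комп'ютер.", "комп'ютер!"])
print(counts.most_common())
```

Because nothing is split inside a token, "комп'ютер" is counted as one word and "комп" never appears as an entry.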

@hermitdave
Owner

This should be fixed. All Cyrillic word lists for 2018 have been generated without any word splitting.
