update readme for integerized LSTM models #88

Merged
merged 2 commits into from
Mar 22, 2018

Conversation

Shreeshrii
Contributor

Will upload the traineddata files next.

@zdenop zdenop merged commit 1438f22 into tesseract-ocr:master Mar 22, 2018
@amitdo

amitdo commented Mar 22, 2018

> GitHub upload failed for files > 25mb

Which files?

@Shreeshrii
Contributor Author

Shreeshrii commented Mar 22, 2018 via email

@Shreeshrii
Contributor Author

Shreeshrii commented Mar 22, 2018 via email

@stweil
Contributor

stweil commented Mar 22, 2018

@Shreeshrii, how did you handle deu_frak.traineddata and the other files which have no best traineddata?

@Shreeshrii
Contributor Author

@stweil I did not check whether each file exists in best. For files without a best counterpart, only the version string would have been updated.
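
The missing check could be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function name `split_by_best` and the idea of passing a local checkout directory of tessdata_best are assumptions.

```python
from pathlib import Path

def split_by_best(lang_files, best_dir):
    """Split traineddata file names by whether a same-named file exists in best_dir.

    best_dir is assumed to be a local checkout of tessdata_best (hypothetical path).
    Returns (present, missing), preserving the input order.
    """
    best_dir = Path(best_dir)
    present, missing = [], []
    for name in lang_files:
        # Only files with a counterpart in best should be regenerated.
        if (best_dir / name).is_file():
            present.append(name)
        else:
            missing.append(name)
    return present, missing
```

Files returned in `missing` (for example, user-contributed ones such as deu_frak.traineddata) would then be left untouched.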

@Shreeshrii
Contributor Author

@stweil

Is it just the three user-contributed files: dan_frak, deu_frak and slk_frak?
Is there a way to go back to an older commit for these files only (the functionality is the same, just the version string has changed), or should I re-upload an older version?

@stweil
Contributor

stweil commented Mar 23, 2018

No, I don't think that this is necessary. We can keep them as they are.

@amitdo

amitdo commented Mar 23, 2018

kur has no lstm. Does it have Latin or Arabic letters?
kur_ara is from best.

@Shreeshrii
Contributor Author

I looked at the langdata repo just now. It has both kur and kur_ara. It looks like the langcode was changed but the files were not moved.

I had taken the list of RTL languages from language_specific.sh; it does not have kur, but it has kur_ara.

langdata/kur has training text, wordlists etc. in Arabic script, while langdata/kur_ara only has a list of desired and forbidden characters. Hence the kur_ara traineddata file in tessdata_best is not correct. The same probably applies to tessdata_fast, but I haven't checked.
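
A quick way to spot such directories might look like the sketch below. It assumes that training text files in a langdata checkout end in `.training_text`; that naming and the helper name `has_training_text` are assumptions for illustration.

```python
from pathlib import Path

def has_training_text(langdata_dir, lang):
    """Return True if langdata_dir/lang contains at least one *.training_text file.

    The '.training_text' suffix is an assumed naming convention.
    """
    lang_dir = Path(langdata_dir) / lang
    # glob yields nothing for a missing or empty directory, so any() is False.
    return any(lang_dir.glob("*.training_text"))
```

Under that assumption, a directory like kur_ara that holds only character lists would return False.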

I will file an issue in langdata mentioning this. Hopefully all this will be fixed when Ray/Jeff update langdata for 4.0.0.

@Shreeshrii
Contributor Author

> kur has no lstm. Does it have Latin or Arabic letters?
> kur_ara is from best.

@amitdo Thanks for bringing this to my attention.

https://en.wikipedia.org/wiki/Kurdish_languages says:

> In use:
> - Hawar alphabet (Latin script; used mostly in Turkey and Syria)
> - Sorani alphabet (Perso-Arabic script; used mostly in Iraq and Iran)
>
> Not used:
> - Cyrillic alphabet (former Soviet Union)

So both kur and kur_ara can probably coexist, each with appropriate langdata.

@amitdo

amitdo commented Mar 23, 2018

An older issue about kur: #45
