Language request: Kurdish-Kurmanji #124
Please try the current kur_ara traineddata from tessdata_fast and
tessdata_best. It seems to be incorrectly labelled and might be for Kurdish
written in Latin script.
Please post your feedback here after testing, so that the required changes
can be made.
|
The data in both kur and kur_ara is in Arabic script, neither is in a Latin script. |
I am not referring to the langdata repo which has not been updated for
4.0.0.
There are traineddata files for kur_ara in tessdata_best and tessdata_fast.
You can use combine_tessdata to unpack the file and look at the unicharset
in it.
Use dawg2wordlist to convert the words dawg to the wordlist.
Of course, you could try to recognise any image with those traineddata
files and check output.
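
The inspection steps above can be sketched in Python. This is a minimal sketch, not a definitive procedure: the helper names `unpack_commands` and `script_counts` are hypothetical, and the component file names (`lstm-unicharset`, `lstm-word-dawg`) are an assumption about what `combine_tessdata -u` writes for an LSTM traineddata file. The script tally relies only on Unicode character names, so it gives a rough indication of whether a unicharset is mostly Latin or Arabic.

```python
import unicodedata
from collections import Counter


def unpack_commands(traineddata: str, prefix: str) -> list[list[str]]:
    """Return the two command lines described above, without running them."""
    return [
        # Unpack every component, writing files named prefix + component.
        ["combine_tessdata", "-u", traineddata, prefix],
        # Convert the word dawg back into a plain word list.
        # Assumption: the LSTM components carry these suffixes.
        [
            "dawg2wordlist",
            prefix + "lstm-unicharset",
            prefix + "lstm-word-dawg",
            prefix + "wordlist.txt",
        ],
    ]


def script_counts(unicharset_lines: list[str]) -> Counter:
    """Tally the script of each single-letter entry in a unicharset."""
    counts: Counter = Counter()
    for line in unicharset_lines[1:]:  # the first line holds the entry count
        unichar = line.split(" ", 1)[0]
        # Skip special tokens (e.g. NULL), digits, and punctuation.
        if len(unichar) != 1 or not unichar.isalpha():
            continue
        # The first word of the Unicode name is the script, e.g. LATIN or ARABIC.
        counts[unicodedata.name(unichar, "UNKNOWN").split()[0]] += 1
    return counts
```

To run the commands, pass each list to `subprocess.run(cmd, check=True)`, then feed the lines of the unpacked unicharset file to `script_counts`. A result dominated by `LATIN` would suggest the model is not for Arabic script.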
|
Oh interesting! I will give that a try, thank you. |
Yes, it's definitely trained on Kurmanji. It's doing a great job OCRing Kurmanji text. This is wonderful, thank you! |
Thank you for your feedback. We need to change the traineddata name so that it correctly reflects the language. What would you suggest: kur, or some other official language code for Kurmanji? |
Sure! My recommendations: Don’t use kur for anything. Depends on if you’re more interested in the script or the dialect. There are a few Kurdish languages using each of the scripts, the Latin-based one and the Persian-based one. Historically Cyrillic and Armenian scripts have been used also. I’m guessing the language is itself important though. The ISO 639-3 code for Kurmanji (aka Northern Kurdish) is kmr. It wouldn’t be at all unreasonable to prefix it as kur_kmr, and refer to it as Kurdish Kurmanji. Or just list kmr as Kurdish Kurmanji. Similarly, I’d recommend using ckb or kur_ckb for the one with the Arabic-based script, Sorani. |
Thanks. @jbreiden @theraysmith what is your recommendation? At a minimum kur_ara should be changed as it doesn't have Arabic in it. |
tesseract-ocr/tessdata_fast#14 (comment) OK. I think we should follow the suggestion by @amitdo, since it is in line with the way tesseract names other languages. |
Shree, please send PRs. |
In case it’s not clear, this choice impedes the development of Tesseract support for Zaza, Gorani/Horami, and Southern Kurdish, each of which will require quite different dictionaries. |
@brandones Yes, that will require different dictionaries. As of now I am not sure exactly which language dictionary is being used. I am attaching a zip file with two traineddatas for Kurdish in Arabic script - Sorani. If you are familiar with the script, please review and provide feedback as to accuracy and also whether the word frequency list is appropriate. (or refer to someone who can provide that feedback). Thanks! |
https://en.wikipedia.org/wiki/Kurdish_languages
ckb - Sorani (Arabic/Persian script). |
I want to improve and work on Kurdish Kurmanji and Sorani for both Latin and Arabic scripts. |
Kurdish-Kurmanji is available at https://github.com/tesseract-ocr/tessdata_best/blob/master/kmr.traineddata and https://github.com/tesseract-ocr/langdata_lstm/tree/master/kmr. You can test it to identify the improvements needed. |
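
For anyone testing with scripts that still use the old label, here is a minimal sketch of resolving renamed language codes before calling the `tesseract` command line. The `RENAMED` table and the helper names are hypothetical. The table only reflects renames mentioned in this thread (the old `kur_ara` is now `kmr`, and `tgl` is now `fil` per the commit message quoted further down), so it is neither official nor complete.

```python
# Renames mentioned in this thread; an illustrative table, not an official list.
RENAMED = {"kur_ara": "kmr", "tgl": "fil"}


def resolve(spec: str) -> str:
    """Apply known renames to a '+'-joined language spec such as 'eng+kur_ara'."""
    return "+".join(RENAMED.get(part, part) for part in spec.split("+"))


def tesseract_command(image: str, spec: str) -> list[str]:
    """Build a tesseract command line that prints the recognised text to stdout."""
    return ["tesseract", image, "stdout", "-l", resolve(spec)]
```

For example, `tesseract_command("page.png", "kur_ara")` builds a command that uses `-l kmr`, which can then be run with `subprocess.run(cmd, check=True)` once `kmr.traineddata` is installed in the tessdata directory.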
Sorani in Arabic script is not available in tessdata_best or tessdata_fast. The langdata for it is at https://github.com/tesseract-ocr/langdata_lstm/tree/master/kur. You will need to identify Unicode fonts and training text for it before running any training. |
Thank you so much. For Sorani in Arabic script, where can I train it? Do I need to box some images? |
wonderful |
How can I do it on Windows?
"kur" no longer exists, might be named "kur_ara" (the old "kur_ara" is now "kmr", which is actually Latin) now, but "kur" is not present in tessdata_fast nor in tessdata_best. [1] [2] "tgl" (Tagalo) is now named "fil" (Filipino) [3] [1] tesseract-ocr/langdata#124 [2] tesseract-ocr/tessdata_best#23 [3] tesseract-ocr/langdata#84 "kur" no longer exists, might be named "kur_ara" now, but it is not present in tessdata_fast nor in tessdata_best. "kmr" is the Latin version (Kurmanji) "tgl" (Tagalo) is now named "fil" (Filipino)
"kur" no longer exists, might be named "kur_ara" (the old "kur_ara" is now "kmr", which is actually Latin) now, but "kur" is not present in tessdata_fast nor in tessdata_best. [1] [2] "tgl" (Tagalo) is now named "fil" (Filipino) [3] [1] tesseract-ocr/langdata#124 [2] tesseract-ocr/tessdata_best#23 [3] tesseract-ocr/langdata#84
"kur" no longer exists, might be named "kur_ara" (the old "kur_ara" is now "kmr", which is actually Latin) now, but "kur" is not present in tessdata_fast nor in tessdata_best. [1] [2] "tgl" (Tagalo) is now named "fil" (Filipino) [3] [1] tesseract-ocr/langdata#124 [2] tesseract-ocr/tessdata_best#23 [3] tesseract-ocr/langdata#84
It's the most popular dialect of Kurdish, written in a Latin alphabet with a few diacritics.