Language request: Kurdish-Kurmanji #124

brandones · 2018-04-11T17:52:59Z

It's the most popular dialect of Kurdish, written in a Latin alphabet with a few diacritics.

Shreeshrii · 2018-04-11T18:09:54Z

Please try the current kur_ara traineddata from tessdata_fast and tessdata_best. It seems to be incorrectly labelled and might actually be for Kurdish written in Latin script. Please post your feedback here after testing so that the required changes can be made.


brandones · 2018-04-11T18:11:57Z

The data in both kur and kur_ara is in Arabic script, neither is in a Latin script.

Shreeshrii · 2018-04-11T18:16:28Z

I am not referring to the langdata repo which has not been updated for 4.0.0. There are traineddata files for kur_ara in tessdata_best and tessdata_fast. You can use combine_tessdata to unpack the file and look at the unicharset in it. Use dawg2wordlist to convert the words dawg to the wordlist. Of course, you could try to recognise any image with those traineddata files and check output.
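The inspection steps described above can be sketched in Python. This is only a minimal sketch under stated assumptions: it assumes `combine_tessdata` and `dawg2wordlist` are on the PATH, and that the unpacked LSTM components are named `<prefix>lstm-unicharset` and `<prefix>lstm-word-dawg` (legacy models use `unicharset` and `word-dawg` instead). The helper names `unpack_command`, `wordlist_command`, and `script_counts` are hypothetical, introduced here for illustration.

```python
import unicodedata
from collections import Counter
from pathlib import Path
from typing import List


def unpack_command(traineddata: Path, out_dir: Path) -> List[str]:
    """Build the command that unpacks a traineddata file into its components."""
    # combine_tessdata -u expects an output prefix ending in "." (e.g. out/kur_ara.)
    prefix = str(out_dir / traineddata.stem) + "."
    return ["combine_tessdata", "-u", str(traineddata), prefix]


def wordlist_command(prefix: str, wordlist: Path) -> List[str]:
    """Build the command that converts the word dawg back into a plain word list."""
    # Assumption: LSTM component names; adjust for legacy (non-LSTM) models.
    return [
        "dawg2wordlist",
        prefix + "lstm-unicharset",
        prefix + "lstm-word-dawg",
        str(wordlist),
    ]


def script_counts(text: str) -> Counter:
    """Tally letters by the first word of their Unicode name (LATIN, ARABIC, ...)."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


# Print the commands rather than running them, so the sketch has no side effects.
unpack = unpack_command(Path("kur_ara.traineddata"), Path("out"))
print(" ".join(unpack))
print(" ".join(wordlist_command(unpack[3], Path("out") / "wordlist.txt")))
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. Running `script_counts` over the extracted word list (or over OCR output) gives a rough indication of whether the model covers Latin or Arabic script.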


Shreeshrii · 2018-04-11T18:35:39Z

See tesseract-ocr/tessdata_best#23


brandones · 2018-04-11T18:52:08Z

Oh interesting! I will give that a try, thank you.

brandones · 2018-04-12T08:34:09Z

Yes, it's definitely trained on Kurmanji. It's doing a great job OCRing Kurmanji text. This is wonderful, thank you!

Shreeshrii · 2018-04-12T08:42:18Z

Thank you for your feedback.

We need to change the traineddata name so that it correctly reflects the language.

What would you suggest?

kur
kur_lat

Or is there another official language code for Kurmanji?

brandones · 2018-04-12T11:19:59Z

Sure! My recommendations:

Don’t use kur for anything.

It depends on whether you’re more interested in the script or the dialect. There are a few Kurdish languages using each of the scripts, the Latin-based one and the Persian-based one. Historically, Cyrillic and Armenian scripts have also been used. I’m guessing the language itself is what matters, though.

The ISO 639-3 code for Kurmanji (aka Northern Kurdish) is kmr.

It wouldn’t be at all unreasonable to prefix it as kur_kmr, and refer to it as Kurdish Kurmanji. Or just list kmr as Kurdish Kurmanji.

Similarly, I’d recommend using ckb or kur_ckb for the one with the Arabic-based script, Sorani.

Shreeshrii · 2018-04-12T12:13:04Z

Thanks.

@jbreiden @theraysmith what is your recommendation?

At a minimum kur_ara should be changed as it doesn't have Arabic in it.

Shreeshrii · 2018-04-21T08:20:15Z

tesseract-ocr/tessdata_fast#14 (comment)

OK. I think we should follow the suggestion by @amitdo, since it is in line with the way tesseract names other languages.

amitdo · 2018-04-21T08:25:03Z

Shree, please send PRs.

brandones · 2018-04-21T08:46:06Z

In case it’s not clear, this choice impedes the development of Tesseract support for Zaza, Gorani/Horami, and Southern Kurdish, each of which will require quite different dictionaries.

Shreeshrii · 2018-04-25T04:28:56Z

@brandones Yes, that will require different dictionaries. As of now I am not sure exactly which language dictionary is being used.

I am attaching a zip file with two traineddatas for Kurdish in Arabic script - Sorani.

If you are familiar with the script, please review and provide feedback as to accuracy and also whether the word frequency list is appropriate. (or refer to someone who can provide that feedback). Thanks!

kur_ara.traineddata.zip

Shreeshrii · 2018-04-25T16:11:32Z

https://en.wikipedia.org/wiki/Kurdish_languages

ISO 639-3 kur – inclusive code
Individual codes:
ckb – Central Kurdish
kmr – Northern Kurdish
sdh – Southern Kurdish

ckb - Sorani (Arabic/Persian script).
kmr - Kurmanji (Latin script)
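As a small illustration, the codes listed above can be kept in a lookup table. This is a sketch only: the `KurdishVariety` type and the `describe` helper are hypothetical names, and the script for Southern Kurdish is left unset because it is not stated in this thread.

```python
from typing import NamedTuple, Optional


class KurdishVariety(NamedTuple):
    name: str
    script: Optional[str]  # None where the thread does not state a script


# ISO 639-3 individual codes under the inclusive code "kur".
MACROLANGUAGE = "kur"
KURDISH_CODES = {
    "ckb": KurdishVariety("Central Kurdish (Sorani)", "Arabic/Persian"),
    "kmr": KurdishVariety("Northern Kurdish (Kurmanji)", "Latin"),
    "sdh": KurdishVariety("Southern Kurdish", None),
}


def describe(code: str) -> str:
    """Return a one-line description of a code, or flag it as too broad or unknown."""
    if code == MACROLANGUAGE:
        return "kur: inclusive code, does not identify a single language or script"
    variety = KURDISH_CODES.get(code)
    if variety is None:
        return f"{code}: not a Kurdish code listed here"
    script = variety.script or "script not specified"
    return f"{code}: {variety.name}, {script}"


for code in ("kur", "ckb", "kmr", "sdh"):
    print(describe(code))
```

Keeping the inclusive code separate from the individual codes reflects the point made in this discussion: `kur` alone says nothing about which dictionary or script a model was trained on.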

Shreeshrii · 2019-02-16T12:49:26Z

@zdenop This issue can be closed as Kurdish-Kurmanji is available at

https://github.com/tesseract-ocr/tessdata_best/blob/master/kmr.traineddata
https://github.com/tesseract-ocr/tessdata_fast/blob/master/kmr.traineddata

https://github.com/tesseract-ocr/langdata_lstm/tree/master/kmr

keyochali · 2019-09-11T09:08:03Z

I want to work on improving Kurdish Kurmanji and Sorani, for both Latin and Arabic scripts.
How can I do that?
What needs to be done?
How can I submit it?

Shreeshrii · 2019-09-11T10:52:39Z

Kurdish-Kurmanji is available at

https://github.com/tesseract-ocr/tessdata_best/blob/master/kmr.traineddata
https://github.com/tesseract-ocr/tessdata_fast/blob/master/kmr.traineddata

https://github.com/tesseract-ocr/langdata_lstm/tree/master/kmr

You can test it to identify the improvements needed.

Shreeshrii · 2019-09-11T10:55:33Z

Sorani in Arabic script is not available in tessdata_best and tessdata_fast.

https://github.com/tesseract-ocr/langdata_lstm/tree/master/kur
has minimal language data available for it.

You will need to identify Unicode fonts and training text for Sorani before running any training.

keyochali · 2019-09-11T11:12:27Z

Thank you so much.
I just tested the kmr.traineddata.

But for Sorani in Arabic script, where can I train it?
What data is needed besides the text?

I mean, do I need to create box files for some images?
Can you show me an example of training in Tesseract?
Is there any documentation for training it?

Shreeshrii · 2019-09-11T11:20:48Z

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

https://github.com/Shreeshrii/tess4training

https://github.com/tesseract-ocr/tesstrain

keyochali · 2019-09-11T11:32:20Z

wonderful
thanks a lot

keyochali · 2019-09-12T08:45:57Z

How can I do it on Windows?
When Tesseract is installed, it comes with the training tools.
How can I use them?
Is there a Python script to do it?
What is a text line (a word, or the whole line)?

"kur" no longer exists. It might now be named "kur_ara" (the old "kur_ara" is now "kmr", which is actually Latin), but "kur" is not present in tessdata_fast or in tessdata_best. [1] [2]
"tgl" (Tagalog) is now named "fil" (Filipino). [3]

[1] tesseract-ocr/langdata#124
[2] tesseract-ocr/tessdata_best#23
[3] tesseract-ocr/langdata#84

This was referenced Apr 25, 2018

correct name kur_ara to kmr - Kurmanji (Latin script) tesseract-ocr/tessdata_best#26

Merged

correct name kur_ara to kmr - Kurmanji (Latin script) tesseract-ocr/tessdata_fast#16

Merged

zdenop closed this as completed Feb 16, 2019

MerlijnWajer mentioned this issue Dec 1, 2020

Remove references to "kur" and "tgl", add "fil" to man page tesseract-ocr/tesseract#3165

Merged

cyanfish mentioned this issue Sep 6, 2023

Kurdish Language cyanfish/naps2#181

Closed
