“no best words!!” on mixed language (fra+ara) items #235
Comments
Seems like yet another OpenCL bug report... Try this: |
I had the same question, but the behaviour is identical whether I set that environment variable or use a Tesseract build without OpenCL at all:
|
I suggest changing the title so it will contain the word "Arabic" or "ara". There were several reports in the past about problems when using Arabic+other lang.
In general, Arabic uses a special engine called 'Cube', most other languages use another engine. |
I updated the title. It wasn't clear that these were related since Arabic works fine on its own. The commit which you referenced is shown as being included in the version (3.04.01) I'm using. |
I just ran into the exact same problem. Arabic alone is processed successfully, but when I try to have Arabic and English read at the same time, Tesseract crashes. I'm using the Windows version 3.05.00dev. Another question (I'm totally new to Tesseract): when I use Arabic language recognition on a text with Arabic letters but Latin numbers, the Latin numbers are not recognized (that's why I wanted to add English as a recognition language). In the file "ara.cube.lm" I found the line
Does this mean Latin numbers should be recognized when I only use Arabic as the recognition language? |
Here's the stack trace for the crash
but I think the problem is actually in either The latter is easier to do and I was too lazy to dig further into the recognizer, so I generated a patch for that which I'll post. |
Hi all, does Tesseract support script identification? I have bilingual pages and two different models for the different scripts. I want to run a script identifier on each word and call my models accordingly for recognition. |
@anupamaray Please use the mailing list for questions (and don't hijack issues about unrelated topics). You'll get better answers if you include more details about the scripts, languages, etc. |
Hi @anupamaray ! Please read this: Try asking your question in the users mailing-list |
Does this issue still exist in 4.00 (code in master)? Probably not, since cube was removed. @Shreeshrii |
This is still an issue: it crashes with
The crash is obviously unrelated to OpenCL, as it crashed here without using OpenCL. |
--oem 0 and --oem 2 - both use the tesseract mode, so the problem is in that code.
|
Here is the recognition of the original sepia image - |
It was also sufficient to specify |
I just unpacked the ara.traineddata - it does not have the tesseract model
files in it.
combine_tessdata -u ara.traineddata ara.
Extracting tessdata components from ara.traineddata
Wrote ara.config
Wrote ara.unicharset
Wrote ara.punc-dawg
Wrote ara.word-dawg
Wrote ara.number-dawg
Wrote ara.freq-dawg
Wrote ara.lstm
Wrote ara.lstm-punc-dawg
Wrote ara.lstm-word-dawg
Wrote ara.lstm-number-dawg
ShreeDevi
On Sat, Dec 31, 2016 at 8:57 PM, Stefan Weil ***@***.***> wrote:
It was also sufficient to specify -l ara in my test.
|
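The component listing above can also be checked mechanically. This is a minimal sketch, not part of Tesseract: it parses the "Wrote <lang>.<component>" lines printed by combine_tessdata -u and reports which engines the file appears to carry. The function names are hypothetical, and treating inttemp, pffmtable and normproto as the markers of a legacy (non-LSTM) model is an assumption made for illustration.

```python
# Assumption: these components indicate that a legacy Tesseract model is present.
LEGACY_MARKERS = {"inttemp", "pffmtable", "normproto"}


def extracted_components(listing, lang):
    """Collect component names from 'Wrote <lang>.<component>' lines."""
    prefix = "Wrote " + lang + "."
    return {
        line.strip()[len(prefix):]
        for line in listing.splitlines()
        if line.strip().startswith(prefix)
    }


def engines_available(components):
    """Report which engines the extracted components appear to support."""
    return {
        "legacy": LEGACY_MARKERS <= components,
        "lstm": "lstm" in components,
    }


# Listing copied from the comment above.
ARA_LISTING = """\
Extracting tessdata components from ara.traineddata
Wrote ara.config
Wrote ara.unicharset
Wrote ara.punc-dawg
Wrote ara.word-dawg
Wrote ara.number-dawg
Wrote ara.freq-dawg
Wrote ara.lstm
Wrote ara.lstm-punc-dawg
Wrote ara.lstm-word-dawg
Wrote ara.lstm-number-dawg
"""

if __name__ == "__main__":
    print(engines_available(extracted_components(ARA_LISTING, "ara")))
```

For the listing in this thread the sketch finds the lstm component and none of the assumed legacy markers, which matches the observation that ara.traineddata has no Tesseract model files in it.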
Last time we talked Ray said he was leaning toward deletion of the non-LSTM
recognizer.
|
It never had tesseract model files in it... |
The original issue here should be tagged and tested against the 3.05 branch, since it is related to cube. The ara.config file in ara.traineddata uses oem 1 (originally for cube and now for LSTM). The current issue seen with 4.0alpha, ara not working for --oem 0 and --oem 2, is to be expected since there is no Tesseract model for the Arabic language. So, instead of a segfault, the message displayed should be something like the following: "Tesseract requested but not present, LSTM engine used instead". Later, if the non-LSTM recognizer is removed, this will not apply. |
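The fallback proposed in the comment above could look roughly like the following. This is only a sketch of the suggested behaviour, not Tesseract's actual code: the function name resolve_oem is hypothetical, the numeric modes follow the --oem values used in this thread (0 = Tesseract, 1 = LSTM, 2 = both), and the second warning text is an invented counterpart to the message quoted in the comment.

```python
OEM_TESSERACT, OEM_LSTM, OEM_COMBINED = 0, 1, 2


def resolve_oem(requested, has_legacy, has_lstm):
    """Return (effective_oem, warning) instead of crashing on a missing model.

    warning is None when the requested mode can be used unchanged.
    """
    needs_legacy = requested in (OEM_TESSERACT, OEM_COMBINED)
    needs_lstm = requested in (OEM_LSTM, OEM_COMBINED)

    if needs_legacy and not has_legacy:
        if has_lstm:
            return OEM_LSTM, (
                "Tesseract requested but not present, LSTM engine used instead"
            )
        raise ValueError("traineddata contains no usable model")

    if needs_lstm and not has_lstm:
        if has_legacy:
            # Hypothetical message, mirroring the one proposed in the thread.
            return OEM_TESSERACT, (
                "LSTM requested but not present, Tesseract engine used instead"
            )
        raise ValueError("traineddata contains no usable model")

    return requested, None


if __name__ == "__main__":
    # ara.traineddata in this thread: LSTM model only.
    for oem in (0, 1, 2):
        print(oem, resolve_oem(oem, has_legacy=False, has_lstm=True))
```

With an LSTM-only file such as ara.traineddata, requests for --oem 0 and --oem 2 both resolve to mode 1 with a warning, while --oem 1 passes through unchanged.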
Yes Shree, you are right. |
Problem still there in 3.05.01. Arabic can be used ONLY in --oem 1 mode (cube in 3.05). Combined language mode tries to apply the same --oem to both languages. So, if Arabic is one of the languages, --oem 1 is needed. However, what would be the result if the second language does have --oem 1 option? |
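Since combined language mode applies one --oem value to every language, the usable modes for a spec such as fra+ara are the intersection of what each language supports. The sketch below illustrates that reasoning only. The function name and the SUPPORTED table are hypothetical: the entry for ara follows this comment, while the entries for fra and xyz are made-up examples.

```python
# Hypothetical capability table: language -> set of usable --oem values.
SUPPORTED = {
    "ara": {1},      # per this thread: only --oem 1
    "fra": {0, 1},   # assumed for illustration
    "xyz": {0},      # made-up language with no --oem 1 support
}


def shared_oem_modes(lang_spec, supported):
    """Return the sorted --oem values usable by every language in 'a+b' form."""
    langs = [name for name in lang_spec.split("+") if name]
    if not langs:
        raise ValueError("no languages given")
    unknown = [name for name in langs if name not in supported]
    if unknown:
        raise KeyError("no capability data for: " + ", ".join(unknown))
    common = set.intersection(*(set(supported[name]) for name in langs))
    return sorted(common)


if __name__ == "__main__":
    print(shared_oem_modes("fra+ara", SUPPORTED))
    print(shared_oem_modes("xyz+ara", SUPPORTED))
```

An empty result, as for the made-up xyz+ara pair, corresponds to the open question in the comment: no single --oem value suits both languages.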
More mixed language issues reported in forum |
recently filed issue - |
What's the output with current master code?
for each option above try:
Also try best/fast 'Arabic' (alone, no '+'). |
@amitdo Here are the results - console output, without any pre-processing for image. I also have the OCRed output texts, if you want.
|
Shree, thank you very much for your testing! |
Solution for 4.0.0: When using 'ara', only use traineddata files from best or fast repos. 3.0x versions are not supported by the Tesseract team anymore. |
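One way to follow this advice from a script is to keep the best or fast files in a dedicated directory and pass it explicitly. The sketch below only builds the argument list and checks that each <lang>.traineddata file exists there. The function name and all paths are hypothetical. It cannot tell whether the files really come from the best or fast repos, and it does not run Tesseract.

```python
from pathlib import Path


def tesseract_argv(image, output_base, langs, tessdata_dir):
    """Build a tesseract command line that uses an explicit tessdata directory."""
    tessdata_dir = Path(tessdata_dir)
    missing = [
        lang for lang in langs
        if not (tessdata_dir / (lang + ".traineddata")).is_file()
    ]
    if missing:
        raise FileNotFoundError(
            "missing traineddata for: " + ", ".join(missing)
        )
    return [
        "tesseract", str(image), str(output_base),
        "--tessdata-dir", str(tessdata_dir),
        "-l", "+".join(langs),
    ]
```

The returned list can be handed to subprocess.run; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.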
Hi @MariamHijazi, https://github.com/tesseract-ocr/tesseract/blob/master/CONTRIBUTING.md
Title of this issue: “no best words!!” on mixed language (fra+ara) items If you are using tesseract command line program and both of your traineddata files are from best or fast, you probably don't get this error message, so it's not the same issue. |
what do we do? |
I've noticed a couple of mixed language items which cause Tesseract v3.04.01 (Leptonica 1.72) to crash:
Here's an example image:
Interestingly, this appears to depend on the order of the languages – using -l ara or -l fra alone avoids the crash, but specifying both in either order will cause it to crash.