This repository has been archived by the owner on Mar 17, 2022. It is now read-only.

Special requirements for Hindi and Arabic OCR #239

Closed
HarshitD opened this issue Mar 27, 2018 · 14 comments

@HarshitD

Summary:
I am new to Tesseract and Android Studio. I am trying to build an Android app for OCR using tess-two. With help from online resources I got it working for many languages, but not for Hindi: with Hindi selected, the app crashes right after it opens.

Expected result:
Hindi language should also work along with all other languages.

Actual result:
The app crashes when I add the hin.traineddata file and change the language to Hindi.

Tess-two version:
tess-two:5.4.1

Android version:
7.1.2

Phone/device model:
Xiaomi Redmi 4

Phone/device architecture (armeabi, armeabi-v7a, x86, mips, arm64-v8a, x86_64, mips64):

Link to training data used:
https://github.com/tesseract-ocr/tessdata/tree/3.04.00

Link to image used as input:
test24_hin

@rmtheis
Owner

rmtheis commented Mar 28, 2018

Hmm, can you try again using tess-two version 8.0.0? Hindi is working OK for me in both Tesseract and Cube modes on version 8.0.0.

@HarshitD
Author

Thanks for the reply. I tried version 8.0.0 but still get the same issue.
In the app's build.gradle file, I changed the version to
compile 'com.rmtheis:tess-two:8.0.0'
I am using this code directly: https://github.com/imperialsoup/SimpleTesseractExample
Does this code need some modification to work for Hindi?
I have the hin.traineddata file along with all the .cube files under the app>assets>tessdata folder.
Could you describe how to make it work for Hindi?
Thanks in advance!

@rmtheis
Owner

rmtheis commented Mar 28, 2018

What's the error message that's printed to the device log when your app crashes?

@HarshitD
Author

Here is the error summary displayed on my Android phone. Screenshot 1 Screenshot 2
From what I have found, I think the problem is that for Hindi I also have to use the .cube files, because Tesseract 3 requires them and tess-two is built on Tesseract 3. I cannot figure out how to use these .cube files. Simply putting them in the same folder as the hin.traineddata file doesn't work.

Thanks for your help.

@rmtheis
Owner

rmtheis commented Mar 30, 2018

I can't reproduce the error that you're seeing. Make sure you're using the correct training data file, from the 3.04.00 tag of the tessdata project.

I get the following result for your input image when using the default settings (OEM_TESSERACT_ONLY and PageSegMode.PSM_SINGLE_BLOCK):

राहुल ने तंज कसते हुए कहा कि कि स'घ का उद्देश्य महिलाओं कं! असशक्त करना है. आरएसएस मैं महिलाओं की कोई
जगह नहीं है. यथा कांई जानता हैं कि कोई महिला २55 से संबंधित हो और नेतृत्व कर रही हो माल अगर साप महात्मा
गांधी की तस्वीर देखेंगे तो उनके दाई और बाई और महिलाओं कं! पाएंगे, मार आप मोहन भागवत की तस्वीर देखेंगे तो या
तो दो अकेले होंगे या फिर पुरुषों से घिरे होंगे

राहुल गांधी ने कहा कि अगा हम अंदर की रस्ता में आते है तो हम जीएत्तटी की संरचना में बदलाव लाएंगै और इसे काफी
सरल बनाएंगे. उन्होंने कहा कि कांम्रेरर में सबसे अह्म रूप से इस बात का संतुलन रखा गया है कि महिला और पुरुषों की
संख्या मैं ज्यादा अंतर नहीं अम मैं मेघालय में पाती की महिलाओं की आमंहिरत करना चाहूगा कि दो पार्टी मैं शामिल हाँ
त्ताब्सि हमारे षाटींमें अधिक से अधिक महिलाएं चुनी जा सकें और उन्हें नौका मिलरस्के.

@rmtheis closed this as completed Mar 30, 2018

@HarshitD
Author

Thanks for your reply. However, I still could not resolve the error. I have tried the training data file from here. This page also says that "For Arabic and Hindi you need both the traineddata file and the cube data files."
I have searched the internet, and many people have faced a similar problem, with the app crashing for Hindi and Arabic, but I found no answer anywhere. The closest suggestion was to put the cube data files in the same folder as the training data file, but that doesn't help either. Could you please tell me how you made it run for Hindi?

Thanks a lot for your help.

@rmtheis
Owner

rmtheis commented Mar 30, 2018

Yes, you need to install hin.* from https://github.com/tesseract-ocr/tessdata/tree/3.04.00

Thanks for reporting this issue. I've created a task (#240) for myself to improve the training data checking for Arabic and Hindi so developers get a clear error message rather than a crash when using the wrong training data files.

@HarshitD
Author

Thanks for your reply. I installed all the hin.* files from the link you provided, but the app still crashes.
Could you tell me how you made it work for Hindi, or share the relevant code?

Thanks for your help.

@HarshitD
Author

HarshitD commented Apr 6, 2018

The problem is solved. Thanks for your help.
The problem was in TessBaseAPI.init(). As I am new to it, I couldn't understand it earlier. After I switched to OEM_TESSERACT_ONLY, it worked.

Thanks a lot for your help.
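
For readers who hit the same crash, the fix described above comes down to passing the engine mode explicitly to `TessBaseAPI.init()`. The sketch below is a plain-Java pre-flight check, not code from this thread: the class and method names (`TessDataLayout`, `trainedDataFile`, `hasTrainedData`) are hypothetical, and it assumes the data path given to `init()` is the parent of a `tessdata` folder. The `init` call itself appears only as a comment because it needs the Android library.

```java
import java.io.File;

// Hypothetical helper (not part of tess-two): checks the on-device file
// layout before TessBaseAPI.init() is called.
final class TessDataLayout {
    private TessDataLayout() {}

    // Assumption: init() is given the PARENT of the "tessdata" folder,
    // so the training file lives at <dataDir>/tessdata/<lang>.traineddata.
    static File trainedDataFile(File dataDir, String lang) {
        return new File(new File(dataDir, "tessdata"), lang + ".traineddata");
    }

    // True only if the training file exists and is a regular file.
    static boolean hasTrainedData(File dataDir, String lang) {
        return trainedDataFile(dataDir, lang).isFile();
    }
}

// On Android, with tess-two on the classpath, the call might then look like:
//
//   TessBaseAPI api = new TessBaseAPI();
//   if (TessDataLayout.hasTrainedData(dataDir, "hin")) {
//       api.init(dataDir.getAbsolutePath(), "hin", TessBaseAPI.OEM_TESSERACT_ONLY);
//   }
```

Checking for the file first turns a missing or misplaced hin.traineddata into an ordinary `false` that the app can report, rather than a failure inside `init()`.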

@rmtheis
Owner

rmtheis commented Apr 6, 2018

Glad you were able to solve the problem!

@rmtheis
Owner

rmtheis commented Apr 19, 2018

Thanks for looking into this issue. After taking a second look at this, I want to make a note here for reference.

Special requirements for Hindi and Arabic OCR

Arabic and Hindi OCR requires the installation of all Cube data files when using OEM_DEFAULT.

Hindi OCR also works using OEM_TESSERACT_ONLY when the hin.traineddata file is installed, and Hindi also works using OEM_CUBE_ONLY or OEM_TESSERACT_CUBE_COMBINED when the Cube data files are additionally installed.
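
The rules in this note can be read as a small decision table: hin.traineddata is always needed, and every engine mode except OEM_TESSERACT_ONLY additionally needs the Cube data files. The following is a minimal sketch of that table in plain Java, under stated assumptions: `CubeRequirement`, its `Mode` enum, and `problems` are hypothetical names standing in for the `TessBaseAPI.OEM_*` constants, and the check only looks for any file named `<lang>.cube.*`, since the thread does not list the individual Cube file names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper (not part of tess-two) encoding the note above.
final class CubeRequirement {
    private CubeRequirement() {}

    // Stand-in for the OEM_* engine modes named in the note.
    enum Mode { TESSERACT_ONLY, CUBE_ONLY, TESSERACT_CUBE_COMBINED, DEFAULT }

    // Per the note, only OEM_TESSERACT_ONLY works without Cube data files.
    static boolean needsCubeFiles(Mode mode) {
        return mode != Mode.TESSERACT_ONLY;
    }

    // Returns human-readable problems for the given file names; an empty
    // list means the file set is consistent with the note. It does NOT
    // verify that the set of Cube files is complete.
    static List<String> problems(String lang, Mode mode, Collection<String> fileNames) {
        List<String> out = new ArrayList<>();
        if (!fileNames.contains(lang + ".traineddata")) {
            out.add("missing " + lang + ".traineddata");
        }
        if (needsCubeFiles(mode)) {
            boolean anyCube = false;
            for (String name : fileNames) {
                if (name.startsWith(lang + ".cube.")) {
                    anyCube = true;
                    break;
                }
            }
            if (!anyCube) {
                out.add("no " + lang + ".cube.* files found for mode " + mode);
            }
        }
        return out;
    }
}
```

The file names could come from `new File(dataDir, "tessdata").list()` on the device; running the check before `init()` gives a readable message instead of a crash when the files and the engine mode do not match.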

@rmtheis reopened this Apr 19, 2018
@rmtheis changed the title from "Hindi language OCR not working" to "Special requirements for Hindi and Arabic OCR" Apr 19, 2018
@rmtheis closed this as completed Jul 2, 2018

@singhmeenu

I am trying to build an Android app for Hindi OCR using tess-two. It runs for many languages except Hindi: the app crashes when I try to scan any Hindi text. I tried OEM_TESSERACT_ONLY, OEM_TESSERACT_CUBE_COMBINED, and OEM_CUBE_ONLY, as well as PSM_SINGLE_BLOCK, but the app still does not work. Please suggest a solution.

Crash:
java.lang.IllegalArgumentException: Cube data files not found. See #239
at com.googlecode.tesseract.android.TessBaseAPI.init(TessBaseAPI.java:347)
at com.googlecode.tesseract.android.TessBaseAPI.init(TessBaseAPI.java:303)
at com.ashomok.tesseractsample.MainActivity.extractText(MainActivity.java:352)

@DorisGM

DorisGM commented Apr 9, 2019

I included ara.cube.* and used OEM_TESSERACT_ONLY, but the app still crashes.

@DorisGM

DorisGM commented Apr 9, 2019

I included ara.cube.* and used OEM_TESSERACT_ONLY, but the app still crashes.

I also tried OEM_CUBE_ONLY.
