Description
What were you trying to do?
Trying to use the "best" English Tesseract data, and OCRmyPDF fails completely. The issue is unrelated to the input; it is about the dependency on Tesseract scripts.
This is due to the "hocr" and "txt" script parameters that are passed to Tesseract.
I downloaded eng.traineddata from https://github.com/tesseract-ocr/tessdata_best and set TESSDATA_PREFIX to point at that directory (as most Tesseract installs only include the "fast" data), but that directory did not contain:
configs/hocr
configs/txt
Providing those avoids this problem completely :-)
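For anyone hitting the same thing, here is a minimal sketch of a check for the two files named above. It assumes TESSDATA_PREFIX points directly at the directory holding eng.traineddata, as in my setup; the function name `missing_tesseract_configs` is made up for illustration and is not part of OCRmyPDF or Tesseract.

```python
import os
from pathlib import Path

# Config files this report found to be required (from the log output).
REQUIRED_CONFIGS = ("hocr", "txt")


def missing_tesseract_configs(tessdata_dir, required=REQUIRED_CONFIGS):
    """Return the names of required config files absent from <tessdata_dir>/configs.

    Hypothetical helper: assumes the configs live in a "configs"
    subdirectory of the tessdata directory, as described above.
    """
    configs_dir = Path(tessdata_dir) / "configs"
    return [name for name in required if not (configs_dir / name).is_file()]


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix is None:
        print("TESSDATA_PREFIX is not set")
    else:
        missing = missing_tesseract_configs(prefix)
        if missing:
            print(f"Missing under {prefix}: " + ", ".join(f"configs/{m}" for m in missing))
        else:
            print("All required config files are present")
```

Running this before OCRmyPDF would have shown me straight away that configs/hocr and configs/txt were absent from the downloaded tessdata_best directory.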
One of the really great things about OCRmyPDF is that it will print out suggestions when things are missing (like Ghostscript).
Ideas:
- add logic to check for these errors and emit useful information
- add a note to https://ocrmypdf.readthedocs.io/en/latest/errors.html
- document the Tesseract scripts requirement in the installation docs
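As a rough illustration of the first idea, the sketch below scans captured Tesseract output for the `read_params_file: Can't open ...` lines shown in the log and turns them into a hint. This is not OCRmyPDF's actual code: the function name `missing_config_hint` and the wording of the hint are assumptions, and the pattern is based only on the log lines in this report.

```python
import re

# Matches lines such as: "read_params_file: Can't open hocr"
MISSING_CONFIG_RE = re.compile(r"read_params_file: Can't open (\S+)")


def missing_config_hint(tesseract_output):
    """Return a user-facing hint if Tesseract reported unopenable config files.

    Hypothetical helper: returns None when no such message is found.
    """
    names = sorted(set(MISSING_CONFIG_RE.findall(tesseract_output)))
    if not names:
        return None
    return (
        "Tesseract could not open config file(s): "
        + ", ".join(names)
        + ". Check that the directory named by TESSDATA_PREFIX contains "
        "a configs/ subdirectory with these files."
    )
```

Deduplicating the names matters here because the same message repeats once per page worker, as the log output shows.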
It took me a while to debug this, hence opening this issue for discussion. What made this tricky (for me, as a person new to OCRmyPDF) was the lack of a clear error from Tesseract.
Semi-related:
Thanks for such a useful tool!
Where are you installing/running from?
Windows package manager (chocolatey, etc.)
OCRmyPDF version
16.10.4
What operating system are you working on?
Windows
Operating system details and version
Windows 11 - Microsoft Windows [Version 10.0.22631.5624]
Simple sanity checks
- Operating system is currently supported by its vendor (not end of life)
- Python version is compatible with OCRmyPDF
- This issue is not about a specific input file
Relevant log output
Scanning contents ---------------------------------------- 100% 88/88 0:00:00
Start processing 16 pages concurrently ocr.py:96
2 [tesseract] read_params_file: Can't open hocr tesseract.py:257
2 [tesseract] read_params_file: Can't open txt tesseract.py:257
7 [tesseract] read_params_file: Can't open hocr tesseract.py:257
7 [tesseract] read_params_file: Can't open txt tesseract.py:257
3 [tesseract] read_params_file: Can't open hocr tesseract.py:257
3 [tesseract] read_params_file: Can't open txt tesseract.py:257
14 [tesseract] read_params_file: Can't open hocr tesseract.py:257
14 [tesseract] read_params_file: Can't open txt tesseract.py:257
1 [tesseract] read_params_file: Can't open hocr tesseract.py:257
1 [tesseract] read_params_file: Can't open txt tesseract.py:257
5 [tesseract] read_params_file: Can't open hocr tesseract.py:257
with an eventual stack trace and failure.