[Bug]: (possibly a feature request) error: read_params_file: Can't open hocr AND read_params_file: Can't open txt #1567

@clach04

Description

What were you trying to do?

I tried to use the "best" English Tesseract data, and OCRmyPDF fails completely. The issue is unrelated to the input file; it is about OCRmyPDF's dependency on Tesseract scripts.

This is due to the "hocr" and "txt" script parameters that are passed to tesseract.

I downloaded eng.traineddata from https://github.com/tesseract-ocr/tessdata_best and set TESSDATA_PREFIX to point at that directory (most Tesseract installs only include the "fast" data), but that directory did not contain:

  • configs/hocr
  • configs/txt

Providing those two files avoids this problem completely :-)
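
For anyone hitting the same thing, here is a minimal sketch of a pre-flight check for the two files listed above. The function name `missing_tesseract_configs` is hypothetical (it is not part of OCRmyPDF or Tesseract), and the sketch assumes the configs are expected at `configs/<name>` directly under the directory that TESSDATA_PREFIX points at, which matches what I observed.

```python
from pathlib import Path


def missing_tesseract_configs(tessdata_dir, names=("hocr", "txt")):
    """Return the config names that are absent from <tessdata_dir>/configs.

    Hypothetical helper: assumes each config is a plain file named
    configs/<name> under the given tessdata directory.
    """
    configs_dir = Path(tessdata_dir) / "configs"
    return [name for name in names if not (configs_dir / name).is_file()]
```

Calling it as `missing_tesseract_configs(os.environ["TESSDATA_PREFIX"])` should return an empty list once both files are in place.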

One of the really great things about OCRmyPDF is that it will print out suggestions when things are missing (like Ghostscript).

Ideas:

  1. add logic to check for these errors and emit useful information
  2. add a note to https://ocrmypdf.readthedocs.io/en/latest/errors.html
  3. document the Tesseract scripts requirement in the installation docs
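
Idea 1 could look roughly like the sketch below. This is only an illustration, not how OCRmyPDF is structured internally: the function name `hint_for_tesseract_output` and the wording of the hint are my own assumptions. It scans captured Tesseract output for the `read_params_file: Can't open ...` message seen in the log and turns it into a suggestion.

```python
import re

# Matches the message Tesseract printed in the log from this issue.
_CANT_OPEN = re.compile(r"read_params_file: Can't open (\S+)")


def hint_for_tesseract_output(output):
    """Return a human-readable hint, or None if the error is not present.

    Hypothetical helper: `output` is the text captured from Tesseract.
    """
    names = sorted(set(_CANT_OPEN.findall(output)))
    if not names:
        return None
    files = ", ".join(f"configs/{name}" for name in names)
    return (
        f"Tesseract could not open: {', '.join(names)}. "
        f"Check that the directory TESSDATA_PREFIX points at contains {files}."
    )
```

Deduplicating the names keeps the hint to one line even when every page worker reports the same failure, as in the log in this issue.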

It took me a while to debug this, hence opening up the issue for discussion. What made this tricky (for me, as a person new to OCRmyPDF) was the missing errors from Tesseract.

Semi-related:

Thanks for such a useful tool!

Where are you installing/running from?

Windows package manager (chocolatey, etc.)

OCRmyPDF version

16.10.4

What operating system are you working on?

Windows

Operating system details and version

Windows 11 - Microsoft Windows [Version 10.0.22631.5624]

Simple sanity checks

  • Operating system is currently supported by its vendor (not end of life)
  • Python version is compatible with OCRmyPDF
  • This issue is not about a specific input file

Relevant log output

Scanning contents     ---------------------------------------- 100% 88/88 0:00:00
Start processing 16 pages concurrently                                                                                                                                            ocr.py:96
    2 [tesseract] read_params_file: Can't open hocr                                                                                                                        tesseract.py:257
    2 [tesseract] read_params_file: Can't open txt                                                                                                                         tesseract.py:257
    7 [tesseract] read_params_file: Can't open hocr                                                                                                                        tesseract.py:257
    7 [tesseract] read_params_file: Can't open txt                                                                                                                         tesseract.py:257
    3 [tesseract] read_params_file: Can't open hocr                                                                                                                        tesseract.py:257
    3 [tesseract] read_params_file: Can't open txt                                                                                                                         tesseract.py:257
   14 [tesseract] read_params_file: Can't open hocr                                                                                                                        tesseract.py:257
   14 [tesseract] read_params_file: Can't open txt                                                                                                                         tesseract.py:257
    1 [tesseract] read_params_file: Can't open hocr                                                                                                                        tesseract.py:257
    1 [tesseract] read_params_file: Can't open txt                                                                                                                         tesseract.py:257
    5 [tesseract] read_params_file: Can't open hocr                                                                                                                        tesseract.py:257


with an eventual stack trace and failure.

Labels

triage (Issue needs triage)
