Add early validation with clear error messages for unsupported image formats #973

Open
alexisabadger opened this issue Nov 22, 2024 · 1 comment

alexisabadger commented Nov 22, 2024

Is your feature request related to a problem? Please describe.
When users attempt to use unsupported image formats (like PDF) with Tesseract.js, they receive a cryptic error message:
Error in pixReadStream: Pdf reading is not supported
Error: Error attempting to read image.
This error message doesn't clearly tell users what formats are supported, leading to confusion and unnecessary debugging time.

Describe the solution you'd like
Add early validation of image formats before processing begins, with a clear error message that lists all supported formats. The error message should look like:
Error: Unsupported image format: pdf. Tesseract.js supports: png, jpg, bmp, pbm, webp, gif

This would:

  1. Fail fast before unnecessary processing
  2. Clearly indicate what went wrong
  3. Show users which formats they can use instead
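A minimal sketch of what such an early check could look like, assuming the input is a file path or URL string. The names `SUPPORTED_FORMATS`, `getExtension`, and `validateImageFormat` are hypothetical and not part of the Tesseract.js API; the format list is copied from the proposed error message above.

```javascript
// Hypothetical helper, not part of the Tesseract.js API.
// Format list copied from the error message proposed in this issue.
const SUPPORTED_FORMATS = ["png", "jpg", "bmp", "pbm", "webp", "gif"];

// Returns the lowercase file extension of a path or URL string,
// or null when no extension can be determined.
function getExtension(input) {
  if (typeof input !== "string" || input.startsWith("data:")) return null;
  let path = input;
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(input)) {
    // For URLs, only look at the path so the host name is never
    // mistaken for a file name (e.g. "https://example.com").
    try {
      path = new URL(input).pathname;
    } catch (err) {
      return null;
    }
  } else {
    // Drop a trailing query string or fragment, if any.
    path = input.split(/[?#]/)[0];
  }
  const name = path.split(/[\\/]/).pop();
  const dot = name.lastIndexOf(".");
  if (dot <= 0) return null; // no extension, or a dotfile
  return name.slice(dot + 1).toLowerCase();
}

// Throws before any processing starts when the extension is known
// and not in the supported list; otherwise does nothing.
function validateImageFormat(input) {
  const ext = getExtension(input);
  if (ext !== null && !SUPPORTED_FORMATS.includes(ext)) {
    throw new Error(
      `Unsupported image format: ${ext}. ` +
        `Tesseract.js supports: ${SUPPORTED_FORMATS.join(", ")}`
    );
  }
}
```

Inputs without a recognizable extension (buffers, data URLs, extension-less URLs) pass through unchecked in this sketch. Note that a strict allowlist like this one rejects every extension not in the list, for example `.jpeg`, so the list would need to cover all extensions of valid inputs before a check like this could be enabled safely.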

Describe alternatives you've considered

  1. Add format documentation to README (but users might not see it)
  2. Update the existing error message in pixReadStream (but that happens too late in the process)
  3. Add format validation in the example scripts (but that wouldn't help library users)

Additional context
I have a pull request ready for review that implements this feature by adding format validation in createWorker.js, using the existing FORMATS constant from tests/constants.js.

Balearica (Member) commented

Although there is a section in the FAQ that discusses PDF support, I agree that the subject comes up enough to warrant mentioning it in the README. I will edit it at some point in the next week. I also agree that, when users attempt to recognize a file that is clearly a .pdf, it could be useful to show an error message that lists supported formats and/or links to the FAQ.

Regarding input validation in general, I would only want to throw new errors in cases where we are extremely confident that the input would be rejected by Tesseract. For example, the case where the input is a file with a .pdf extension would qualify. Especially given the number of supported image formats and data types, throwing more input validation errors is inherently high-risk, as any bug or unforeseen edge-case that results in valid inputs being incorrectly rejected would break somebody's application. For example, it looks like the checks implemented in your PR cause several automated tests to fail due to incorrectly rejecting valid inputs.
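A conservative check along these lines could be sketched as follows. It rejects only inputs that are clearly PDFs (a `.pdf` extension, a `data:application/pdf` URL, or bytes starting with the `%PDF-` signature) and lets everything else through for Tesseract to handle. The function names are hypothetical and not part of the Tesseract.js API, and the error text reuses the message proposed in this issue.

```javascript
// Hypothetical helpers; names and placement are assumptions,
// not part of the Tesseract.js API.

// Byte signature at the start of a PDF file: "%PDF-"
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Returns true only when the input is clearly a PDF.
// Anything ambiguous returns false so that valid inputs are never rejected.
function looksLikePdf(input) {
  if (typeof input === "string") {
    if (input.startsWith("data:")) {
      return input.startsWith("data:application/pdf");
    }
    // Ignore a trailing query string or fragment before checking the extension.
    const path = input.split(/[?#]/)[0];
    return path.toLowerCase().endsWith(".pdf");
  }
  if (input instanceof Uint8Array) {
    // Node.js Buffer is a Uint8Array subclass, so it is covered here too.
    return (
      input.length >= PDF_MAGIC.length &&
      PDF_MAGIC.every((byte, i) => input[i] === byte)
    );
  }
  return false; // unknown input type: defer to Tesseract
}

// Throws a descriptive error for obvious PDFs and does nothing otherwise.
function rejectObviousPdf(input) {
  if (looksLikePdf(input)) {
    throw new Error(
      "Unsupported image format: pdf. " +
        "Tesseract.js supports: png, jpg, bmp, pbm, webp, gif"
    );
  }
}
```

Because this is a denylist for a single known-bad format rather than an allowlist of good ones, a bug or unforeseen input type results in a missed rejection (and the existing lower-level error) instead of a valid image being refused.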
