autodetect pdf type #343
Comments
It can be done simply with if-else conditions. result = extract_data(filename, templates=templates)
Hello, I'm trying to apply what you are suggesting, but I'm getting the following error: "NameError: name 'tesseract' is not defined". It also happens when I set input_module to "pdftotext" or any of the other modules. invoice2data works well for me with normal PDFs, but in this case I'm trying to process a scanned PDF, which is why I need to specify tesseract as the input_module. I hope you can help me.
Could be, but how to handle corner cases? The invoice line part is the same. (Or another company that issues invoices with its company info as a flat image and the rest of the invoice as text.)
Previously there was a function in invoice2data that checked the PDF output. It was something like: if the output is fewer than 80 characters, fall back on Tesseract to OCR the PDF. Maybe this does not need to be solved in invoice2data itself. Maybe we just need to update the documentation on how to do it.
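For illustration, the fallback described above could look roughly like the sketch below. This is not the function that used to live in invoice2data; the name extract_with_fallback, the two callables, and the min_chars parameter are hypothetical, and the 80-character threshold is simply the figure mentioned in the comment.

```python
from typing import Callable, Tuple

# Threshold taken from the comment above; treat it as a tunable assumption.
MIN_TEXT_CHARS = 80


def extract_with_fallback(
    path: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    min_chars: int = MIN_TEXT_CHARS,
) -> Tuple[str, str]:
    """Try the primary extractor; use the fallback if too little text comes back.

    Returns (text, source), where source is "primary" or "fallback".
    Both extractors are hypothetical callables that take a file path
    and return the extracted text as a str.
    """
    text = primary(path) or ""
    # Whitespace-only output (typical for scanned PDFs) should not count as text.
    if len(text.strip()) >= min_chars:
        return text, "primary"
    return fallback(path) or "", "fallback"
```

In practice, primary and fallback would presumably be thin wrappers around a text-layer extractor (such as pdftotext) and an OCR extractor (such as Tesseract), each returning a str. A length check like this will not catch the mixed case raised earlier in the thread, where only the company header is an image and the rest is text.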
Is there already a solution to detect whether a PDF is searchable (pdftotext) or an image (OCR, Tesseract) and use the appropriate method for text extraction?