autodetect pdf type #343

gregoribic · 2021-05-10T10:12:31Z

Is there already a way to detect whether a PDF is searchable (pdftotext) or an image (OCR, tesseract) and then use the appropriate method for text extraction?

nayyhah · 2022-01-23T12:25:31Z

This can be done with a simple if-else condition.
First try to extract the PDF with pdftotext; if the result is False, fall back and try again with OCR (tesseract):

result = extract_data(filename, templates=templates)
if not result:
    result = extract_data(filename, templates=templates, input_module=tesseract)

manuel-barreiro · 2022-04-08T18:12:17Z

Hello, I'm trying to apply what you are suggesting, but I'm getting the following error: "NameError: name 'tesseract' is not defined"

It also happens when I set input_module to "pdftotext" or any of the other modules. Invoice2data works well for me with normal PDFs, but in this case I'm trying to process a scanned PDF, which is why I need to specify tesseract as the input_module.

Hope you can help me.
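
The NameError suggests that `tesseract` was passed as a bare name without being imported first. A minimal sketch of one way around it, assuming the input modules live in a package named `invoice2data.input` (so that `from invoice2data.input import tesseract` would also work) and that `extract_data` returns a falsy value when nothing could be extracted. `load_input_module` and `extract_with_fallback` are hypothetical helper names, not part of invoice2data:

```python
import importlib


def load_input_module(name, package="invoice2data.input"):
    """Import and return the module `<package>.<name>`.

    The default package name is an assumption about invoice2data's layout.
    """
    return importlib.import_module(f"{package}.{name}")


def extract_with_fallback(extract, filename, modules, **kwargs):
    """Call `extract` once per input module; return the first truthy result."""
    for module in modules:
        result = extract(filename, input_module=module, **kwargs)
        if result:
            return result
    return False


# Hypothetical wiring (not executed here, requires invoice2data):
# from invoice2data import extract_data
# modules = [load_input_module("pdftotext"), load_input_module("tesseract")]
# result = extract_with_fallback(extract_data, filename, modules, templates=templates)
```

Passing module objects rather than bare names keeps the fallback order explicit and avoids the undefined-name problem.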

bosd · 2022-08-26T07:16:27Z

That could work, but how would it handle corner cases?
I've got a couple of invoices where the company info is placed in the image header of the invoice.

The invoice line part is the same.
Branch A --> Shows header image with Branch A business Info
Branch B --> Shows header image with Branch B business Info

(Or another company that issues invoices with its company info as a flat image and the rest of the invoice as text.)

bosd · 2022-08-26T07:19:55Z

Previously there was a function in invoice2data that checked the PDF output. It was something like: if the output is fewer than 80 characters, fall back on Tesseract to OCR the PDF.
Was it removed because of stability issues?
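
A rough sketch of that kind of length check, for illustration only. This is not the code that was removed from invoice2data; `needs_ocr`, `to_text_with_fallback`, and the two extractor callables are hypothetical names, and 80 is simply the threshold recalled above:

```python
def needs_ocr(text, min_chars=80):
    """Return True when the extracted text is too short to be a real text layer."""
    return len(text.strip()) < min_chars


def to_text_with_fallback(path, text_extractor, ocr_extractor, min_chars=80):
    """Try the text-layer extractor first; use OCR if it yields too little text."""
    text = text_extractor(path)
    if needs_ocr(text, min_chars):
        return ocr_extractor(path)
    return text
```

Note that a check like this would not catch the mixed case described earlier in the thread: an invoice whose body is real text but whose header is an image passes the threshold, so the header is never sent to OCR.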

Maybe this does not need to be solved in invoice2data, since pdfminer supports hOCR now:
pdfminer/pdfminer.six#651

Maybe we need to update the documentation to explain how to use it.
