GCV to HOCR or PAGE conversion not working #121
Comments
To convert a PDF to Google Cloud Vision JSON, you need to use Google Cloud Vision, which is commercial cloud software that we neither support nor endorse. Once you have that JSON data from their service, you can convert it to hOCR.
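As a rough illustration of what such a conversion involves (this is not the code used by gcv2hocr or ocr-fileformat), here is a minimal Python sketch that walks one `fullTextAnnotation` page and emits hOCR-style word spans. It assumes the documented Vision API nesting of pages, blocks, paragraphs, words and symbols, and it assumes that PDF results may carry `normalizedVertices` (fractions of the page size) where image results carry pixel `vertices`. The helper names `word_boxes` and `hocr_word` and the sample page are made up for the example.

```python
from html import escape


def word_boxes(page):
    """Yield (text, (x0, y0, x1, y1)) for each word on one fullTextAnnotation page."""
    width, height = page.get("width", 1), page.get("height", 1)
    for block in page.get("blocks", []):
        for paragraph in block.get("paragraphs", []):
            for word in paragraph.get("words", []):
                text = "".join(s.get("text", "") for s in word.get("symbols", []))
                box = word.get("boundingBox", {})
                if "normalizedVertices" in box:
                    # Assumption: normalized coordinates are fractions of the page size.
                    points = [(v.get("x", 0) * width, v.get("y", 0) * height)
                              for v in box["normalizedVertices"]]
                else:
                    # Zero-valued coordinates can be omitted in the JSON, hence the defaults.
                    points = [(v.get("x", 0), v.get("y", 0))
                              for v in box.get("vertices", [])]
                if not points:
                    continue
                xs = [p[0] for p in points]
                ys = [p[1] for p in points]
                yield text, (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))


def hocr_word(text, bbox):
    """Format one word as an hOCR ocrx_word span."""
    return "<span class='ocrx_word' title='bbox {} {} {} {}'>{}</span>".format(
        *bbox, escape(text))


# Hypothetical one-word page, only to exercise the functions.
page = {
    "width": 600,
    "height": 800,
    "blocks": [{"paragraphs": [{"words": [{
        "symbols": [{"text": "H"}, {"text": "i"}],
        "boundingBox": {"normalizedVertices": [
            {"x": 0.1, "y": 0.1}, {"x": 0.2, "y": 0.1},
            {"x": 0.2, "y": 0.15}, {"x": 0.1, "y": 0.15}]},
    }]}]}],
}

for text, bbox in word_boxes(page):
    print(hocr_word(text, bbox))
```

A converter that only reads pixel `vertices` would find no coordinates in a file that uses normalized ones, which is one possible (unconfirmed) reason for getting output that contains little beyond the header.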
You could also convert to PAGE via hOCR and try https://github.com/PRImA-Research-Lab/prima-page-to-pdf
Hi @kba, thank you for the answer. I think I may not have explained it correctly, or you misunderstood me: when I try to convert it to PAGE instead, I get this result:
So I'm trying to understand why the GCV converters in this module are not working for me, even though I have a perfectly viable GCV JSON. I can send you the JSON generated by GCV so you can try converting it yourself, if that helps. Thanks!
extracted_pdf.pdfoutput-1-to-1.txt
This is the GCV JSON that I'm using (I changed the extension to .txt so I could upload it here). It is the JSON for the sample document that Google uses in the tutorial.
Then it's best to ask @dinosauria123 (not sure whether they're subscribed to issues here but they should see the mention). The code is at https://github.com/dinosauria123/gcv2hocr
Hi,
OK @dinosauria123! Thanks
Is this issue still live? I'm getting a similar error (
@sarepal I'm still having issues converting GCV to HOCR and, I could be wrong, I think the conversion to PAGE goes via HOCR. Are you using a result from
Hi all,
I am new to using this software so please bear with me if this has been asked before or I'm not using the tool correctly.
I have the JSON output of Google Cloud Vision OCR for a PDF (emphasis on PDF, not an image).
I would like to create a searchable version of that PDF using the OCR results. I have tried using gcv2hocr, but it doesn't seem to work on PDFs, or it has some other error, because the HOCR output I get from it is basically just the metadata. I tried using ocr-fileformat on the same file, but once again I get only the metadata as a result. Trying to convert it to PAGE fails as well, and the output is some Java lines indicating that exceptions have occurred. Does ocr-fileformat support GCV JSON generated from a PDF?
The file I'm trying to run it on is the sample file from Google:
gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf
And the JSON is generated following this tutorial:
https://cloud.google.com/vision/docs/pdf#vision_text_detection_pdf_gcs-python
If you could assist me or point me toward a solution, I would be very grateful, as I urgently need to solve this issue.
Thanks in advance!
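For anyone trying to reproduce this, a small Python sketch like the following can show what a GCV JSON file actually contains before it is handed to a converter. It assumes the PDF output file has a top-level `responses` list with one entry per page, each holding a `fullTextAnnotation` and a `context.pageNumber`, as in the tutorial's own parsing code. The function name `summarize_gcv` and the trimmed sample are made up for illustration, and whether a given converter needs `textAnnotations` or `fullTextAnnotation` is something to check against that converter.

```python
import json


def summarize_gcv(doc):
    """Return one summary dict per entry in the top-level "responses" list."""
    summaries = []
    for index, response in enumerate(doc.get("responses", [])):
        full = response.get("fullTextAnnotation", {})
        summaries.append({
            "index": index,
            "page_number": response.get("context", {}).get("pageNumber"),
            "has_text_annotations": "textAnnotations" in response,
            "has_full_text_annotation": "fullTextAnnotation" in response,
            "text_length": len(full.get("text", "")),
        })
    return summaries


# Hypothetical, heavily trimmed stand-in for one PDF output file.
sample = json.loads("""
{
  "inputConfig": {"mimeType": "application/pdf"},
  "responses": [
    {"fullTextAnnotation": {"text": "Census 2010"}, "context": {"pageNumber": 1}},
    {"context": {"pageNumber": 2}}
  ]
}
""")

for row in summarize_gcv(sample):
    print(row)
```

To inspect a real file, replace the sample with `json.load(open(path, encoding="utf-8"))`, where `path` points at the downloaded output JSON.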