Failed Extraction - cmap font missing #33

Closed
s4zuk3 opened this issue Nov 20, 2024 · 2 comments · Fixed by #42
Labels
bug Something isn't working

Comments

@s4zuk3
Contributor

s4zuk3 commented Nov 20, 2024

Hello! While trying to extract content from a PDF, I got the following error with very little information:

  • ParseError("Parse error occurred: Unable to extract PDF content").

After modifying the code, I was able to extract the full error, which is as follows:

Stack trace: org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
    at ai.yobix.TikaNativeMain.parseToStringWithConfig(TikaNativeMain.java:199)
    at ai.yobix.TikaNativeMain.parseFileToString(TikaNativeMain.java:103)
Caused by: java.io.IOException: Error: Could not find referenced cmap stream Identity-V
    at org.apache.fontbox.cmap.CMapParser.getExternalCMap(CMapParser.java:508)
    at org.apache.fontbox.cmap.CMapParser.parsePredefined(CMapParser.java:99)
    at org.apache.pdfbox.pdmodel.font.CMapManager.getPredefinedCMap(CMapManager.java:54)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.readEncoding(PDType0Font.java:287)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:204)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:171)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:959)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:532)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:507)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:151)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1369)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
    ... 6 more

From what I can see, several CMap resource files from org.apache.fontbox.cmap (Identity-V among them) are missing from the Tika native image and need to be included. I only see a few of them in the configuration. Is it possible to include all of them?
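
To check which predefined CMaps are actually reachable, a small classpath probe along these lines might help. This is only a sketch: the class name `CMapResourceCheck` and the helper `isOnClasspath` are made up for illustration, and the directory `org/apache/fontbox/cmap/` is an assumption based on the package shown in the stack trace above. In a native image, a resource that was not registered at build time would be expected to show up as missing here.

```java
import java.util.List;

public class CMapResourceCheck {
    // Assumed location of FontBox's predefined CMap files on the classpath
    // (inferred from the org.apache.fontbox.cmap package in the stack trace).
    static final String CMAP_DIR = "org/apache/fontbox/cmap/";

    // Hypothetical helper: true if the resource can be located by the class loader.
    static boolean isOnClasspath(String resourcePath) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourcePath) != null;
    }

    public static void main(String[] args) {
        // Identity-V is the CMap named in the error; Identity-H is its horizontal counterpart.
        for (String name : List.of("Identity-H", "Identity-V")) {
            String status = isOnClasspath(CMAP_DIR + name) ? "found" : "MISSING";
            System.out.println(name + ": " + status);
        }
    }
}
```

Run with the FontBox jar on the classpath (or compiled into the native image), this prints one line per CMap name, which makes it easy to see which entries still need to be added to the resource configuration.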

Thanks!

@KapiWow
Collaborator

KapiWow commented Nov 24, 2024

Hello @s4zuk3!
Yes, we can add the missing entries to the config. Could you provide a PDF for testing? A short example that reproduces the failure would also help.

@nmammeri nmammeri added the bug Something isn't working label Nov 25, 2024
@s4zuk3
Contributor Author

s4zuk3 commented Nov 25, 2024

cmap_issue.pdf

Thanks!

@nmammeri nmammeri linked a pull request Dec 20, 2024 that will close this issue