better pdf parsing #754
❌ Changes requested. Reviewed everything up to 57a8c21 in 1 minute and 6.7 seconds.
More details:
- Looked at 409 lines of code in 10 files
- Skipped 2 files when reviewing
- Skipped posting 0 drafted comments based on config settings
Workflow ID: wflow_3Jv7F8oL8WjV6Qg4
print(f"Overriding parser for {parser_override['document_type']} with {parser_override['parser']}")
for doc_type, parser_infos in self.AVAILABLE_PARSERS.items():
The implementation of ParsingPipe does not correctly handle multiple parsers for a single document type. It always selects the first parser in the list, which might not be the intended behavior if specific parsers are better suited for certain PDFs. Consider implementing a mechanism to select the appropriate parser based on the document's characteristics or user configuration.
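One possible shape for such a selection mechanism is sketched below. It is a sketch under stated assumptions, not the code in this PR: the registry layout, the parser names ("ocr", "text"), the "scanned" document flag, and the select_parser helper are all hypothetical. Only the override shape (a dict with 'document_type' and 'parser' keys) mirrors the diff fragment above. The idea is to honor an explicit override first, then pick the first parser whose predicate accepts the document, and only then fall back to the first parser in the list.

```python
# Hypothetical registry: document type -> list of (parser_name, predicate),
# in priority order. This is NOT the actual AVAILABLE_PARSERS structure.
AVAILABLE_PARSERS = {
    "pdf": [
        ("ocr", lambda doc: doc.get("scanned", False)),  # suits scanned PDFs
        ("text", lambda doc: True),                      # general-purpose default
    ],
}


def select_parser(doc_type, doc, overrides=None):
    """Return the name of the parser to use for `doc` (hypothetical helper)."""
    candidates = AVAILABLE_PARSERS.get(doc_type, [])
    if not candidates:
        raise ValueError(f"No parsers registered for {doc_type!r}")

    # 1. An explicit user override wins, if it names a registered parser.
    for override in overrides or []:
        if override["document_type"] == doc_type:
            names = [name for name, _ in candidates]
            if override["parser"] not in names:
                raise ValueError(
                    f"Unknown parser {override['parser']!r} for {doc_type!r}"
                )
            return override["parser"]

    # 2. Otherwise pick the first parser whose predicate accepts the document.
    for name, accepts in candidates:
        if accepts(doc):
            return name

    # 3. Fall back to the first registered parser.
    return candidates[0][0]


if __name__ == "__main__":
    print(select_parser("pdf", {"scanned": True}))
    print(select_parser("pdf", {}, [{"document_type": "pdf", "parser": "ocr"}]))
```

Raising on an unknown override name, rather than silently falling back, is a design choice in this sketch; it keeps a misconfigured override from being masked by the default parser.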
Summary:
Introduced new PDF parsers and updated the parsing pipeline to support multiple and overridable parsers.
Key points:
- Added PDFParserUnstructured, PDFParserLocal, PDFParserVLM, and PDFParserMarker in r2r/parsers/media/pdf_parser.py.
- Updated __init__.py files in r2r and r2r/parsers to include new PDF parsers.
- Updated r2r/main/assembly/factory.py to pass override_parsers to create_parsing_pipe.
- Updated ParsingPipe in r2r/pipes/ingestion/parsing_pipe.py to support multiple parsers for a document type and handle parser overrides.
- Added ParserComponentLayoutExtraction and ParserComponentOCR in r2r/parsers/common/layout_extraction.py for layout extraction and OCR.
- Updated IngestionService in r2r/main/services/ingestion_service.py to remove unnecessary blank line.
Generated with ❤️ by ellipsis.dev