better pdf parsing #754

Merged: 10 commits merged on Jul 24, 2024

Conversation

@shreyaspimpalgaonkar (Member) commented Jul 24, 2024

🚀 This description was created by Ellipsis for commit 57a8c21

Summary:

Introduced new PDF parsers and updated the parsing pipeline to support multiple and overrideable parsers.

Key points:

  • Added PDFParserUnstructured, PDFParserLocal, PDFParserVLM, and PDFParserMarker in r2r/parsers/media/pdf_parser.py.
  • Updated __init__.py files in r2r and r2r/parsers to include new PDF parsers.
  • Modified r2r/main/assembly/factory.py to pass override_parsers to create_parsing_pipe.
  • Updated ParsingPipe in r2r/pipes/ingestion/parsing_pipe.py to support multiple parsers for a document type and handle parser overrides.
  • Added ParserComponentLayoutExtraction and ParserComponentOCR in r2r/parsers/common/layout_extraction.py for layout extraction and OCR.
  • Updated IngestionService in r2r/main/services/ingestion_service.py to remove an unnecessary blank line.
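
As a rough illustration of the override behaviour described in the key points, here is a minimal sketch and not the code from this PR. It assumes overrides arrive as dicts with `document_type` and `parser` keys (the keys visible in the reviewed snippet); the registry contents, the short parser names, and the `resolve_parsers` helper are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical registry: document type -> parser names, default listed first.
AVAILABLE_PARSERS: Dict[str, List[str]] = {
    "pdf": ["basic", "unstructured", "local", "vlm", "marker"],
    "txt": ["text"],
}


def resolve_parsers(
    available: Dict[str, List[str]],
    override_parsers: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, str]:
    """Pick one parser per document type, letting overrides replace the default."""
    # Start from the default: the first parser registered for each type.
    selected = {doc_type: names[0] for doc_type, names in available.items() if names}

    for override in override_parsers or []:
        doc_type = override["document_type"]
        parser = override["parser"]
        # Reject overrides that name a parser not registered for this type.
        if parser not in available.get(doc_type, []):
            raise ValueError(f"No parser {parser!r} registered for {doc_type!r}")
        selected[doc_type] = parser

    return selected
```

Calling `resolve_parsers(AVAILABLE_PARSERS, [{"document_type": "pdf", "parser": "marker"}])` would select the hypothetical `marker` entry for PDFs and leave the other document types on their defaults.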

Generated with ❤️ by ellipsis.dev

@ellipsis-dev (bot, Contributor) left a comment

❌ Changes requested. Reviewed everything up to 57a8c21 in about 1 minute and 7 seconds

More details
  • Looked at 409 lines of code in 10 files
  • Skipped 2 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_3Jv7F8oL8WjV6Qg4




print(f"Overriding parser for {parser_override['document_type']} with {parser_override['parser']}")

for doc_type, parser_infos in self.AVAILABLE_PARSERS.items():

The implementation of ParsingPipe does not correctly handle multiple parsers for a single document type. It always selects the first parser in the list, which might not be the intended behavior if specific parsers are better suited for certain PDFs. Consider implementing a mechanism to select the appropriate parser based on the document's characteristics or user configuration.
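
One possible shape for the mechanism suggested above, offered as a sketch under assumptions and not as the project's implementation: each candidate parser carries a predicate over document metadata, an explicit user preference wins, and otherwise the first parser whose predicate accepts the document is used. `ParserInfo`, `choose_parser`, the `scanned` metadata key, and the parser names are all hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class ParserInfo(NamedTuple):
    name: str
    # Returns True when this parser is suitable for the given document metadata.
    accepts: Callable[[Dict[str, object]], bool]


def choose_parser(
    candidates: List[ParserInfo],
    doc_meta: Dict[str, object],
    preferred: Optional[str] = None,
) -> str:
    """Return the name of the parser to use for one document."""
    if preferred is not None:
        # An explicit user choice takes priority over the predicates.
        for info in candidates:
            if info.name == preferred:
                return info.name
        raise ValueError(f"Unknown parser {preferred!r}")

    # Otherwise take the first candidate that accepts the document.
    for info in candidates:
        if info.accepts(doc_meta):
            return info.name
    raise LookupError("No registered parser accepts this document")


# Hypothetical candidates: a vision-model parser for scanned PDFs, then a catch-all.
PDF_PARSERS = [
    ParserInfo("vlm", lambda meta: bool(meta.get("scanned", False))),
    ParserInfo("local", lambda meta: True),
]
```

With these candidates, a document flagged as scanned would go to the hypothetical `vlm` parser, any other PDF would fall through to `local`, and passing `preferred="local"` would force that parser regardless of the metadata. Ordering the list from most specific to most general keeps the catch-all from shadowing the specialised parsers.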

@shreyaspimpalgaonkar merged commit c2008d1 into dev on Jul 24, 2024
@shreyaspimpalgaonkar deleted the shreyas/parsing branch on August 2, 2024 17:31