better pdf parsing #754

shreyaspimpalgaonkar · 2024-07-24T18:58:21Z

🚀 This description was created by Ellipsis for commit `57a8c21`

Summary:

Introduced new PDF parsers and updated the parsing pipeline to support multiple parsers per document type, with user-supplied parser overrides.

Key points:

Added PDFParserUnstructured, PDFParserLocal, PDFParserVLM, and PDFParserMarker in r2r/parsers/media/pdf_parser.py.
Updated __init__.py files in r2r and r2r/parsers to include new PDF parsers.
Modified r2r/main/assembly/factory.py to pass override_parsers to create_parsing_pipe.
Updated ParsingPipe in r2r/pipes/ingestion/parsing_pipe.py to support multiple parsers for a document type and handle parser overrides.
Added ParserComponentLayoutExtraction and ParserComponentOCR in r2r/parsers/common/layout_extraction.py for layout extraction and OCR.
Removed an unnecessary blank line from IngestionService in r2r/main/services/ingestion_service.py.

Generated with ❤️ by ellipsis.dev

ellipsis-dev

❌ Changes requested. Reviewed everything up to 57a8c21 in 1 minute and 6.66 seconds

More details

Looked at 409 lines of code in 10 files
Skipped 2 files when reviewing.
Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_3Jv7F8oL8WjV6Qg4

ellipsis-dev · 2024-07-24T19:02:35Z

r2r/pipes/ingestion/parsing_pipe.py

+
+            print(f"Overriding parser for {parser_override['document_type']} with {parser_override['parser']}")
+
+        for doc_type, parser_infos in self.AVAILABLE_PARSERS.items():


The implementation of ParsingPipe does not correctly handle multiple parsers for a single document type. It always selects the first parser in the list, which might not be the intended behavior if specific parsers are better suited for certain PDFs. Consider implementing a mechanism to select the appropriate parser based on the document's characteristics or user configuration.
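
One way to address this, sketched under stated assumptions: resolve the parser for a document type by checking a user override first, then a per-parser suitability check against the document, and only then falling back to the first registered parser. The `ParserInfo` dataclass, the `select_parser` function, the `has_text_layer` flag, and the parser names `local` and `unstructured` are hypothetical illustrations, not the actual r2r API. Only the override keys `document_type` and `parser` are taken from the diff above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ParserInfo:
    """Hypothetical registry entry: a parser name plus a suitability check."""

    name: str
    # Returns True when this parser is a good fit for the document metadata.
    is_suitable: Callable[[dict], bool] = lambda doc: True


# Hypothetical registry; the real AVAILABLE_PARSERS layout in r2r may differ.
AVAILABLE_PARSERS: dict[str, list[ParserInfo]] = {
    "pdf": [
        # Assumed heuristic: a plain text extractor only suits PDFs with a text layer.
        ParserInfo("local", lambda doc: doc.get("has_text_layer", False)),
        ParserInfo("unstructured"),
    ],
}


def select_parser(
    doc_type: str,
    doc: dict,
    overrides: Optional[list[dict]] = None,
) -> str:
    """Pick a parser name: explicit override, then suitability, then default."""
    candidates = AVAILABLE_PARSERS.get(doc_type)
    if not candidates:
        raise ValueError(f"No parsers registered for document type {doc_type!r}")

    known_names = {info.name for info in candidates}

    # 1. An explicit user override wins, but must name a registered parser.
    for override in overrides or []:
        if override["document_type"] == doc_type:
            if override["parser"] not in known_names:
                raise ValueError(
                    f"Unknown parser {override['parser']!r} for {doc_type!r}"
                )
            return override["parser"]

    # 2. Otherwise take the first parser that reports itself suitable.
    for info in candidates:
        if info.is_suitable(doc):
            return info.name

    # 3. Fall back to the first registered parser (the current behavior).
    return candidates[0].name
```

In this sketch, `select_parser("pdf", {"has_text_layer": False})` returns `"unstructured"`, while passing an override for `pdf` bypasses the suitability checks entirely. Rejecting unknown override names early keeps a configuration typo from silently selecting the default parser.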

shreyaspimpalgaonkar added 9 commits July 11, 2024 16:11

Add parser component · ab03aeb
checkin · 8320290
update example · b758b1c
up · c293552
link parsers · eb2a9ee
working parser · 5572a28
rm unnecessary files · ff87178
remove more files · 882d3b4
remove more files · 57a8c21

ellipsis-dev bot reviewed Jul 24, 2024

emrgnt-cmplxty approved these changes Jul 24, 2024

emrgnt-cmplxty deleted the branch dev July 24, 2024 22:07

emrgnt-cmplxty closed this Jul 24, 2024

shreyaspimpalgaonkar reopened this Jul 24, 2024

Merge branch 'dev' into shreyas/parsing · 298068b

shreyaspimpalgaonkar merged commit c2008d1 into dev Jul 24, 2024

shreyaspimpalgaonkar deleted the shreyas/parsing branch August 2, 2024 17:31
