Multimodal PDF support #1047
Pull Request Overview
This PR implements MVP multimodal support for PaperQA, allowing the system to parse and utilize images and tables from PDFs alongside text content. The implementation introduces an opt-out setting for multimodal parsing and adds comprehensive support across both PyPDF and PyMuPDF parsers.
- Adds multimodal parsing capability to PDF readers supporting images, tables, and full-page screenshots
- Introduces multimodal configuration setting (defaults to True) to control image/table parsing
- Creates comprehensive test coverage for table querying and multimodal functionality
Reviewed Changes
Copilot reviewed 11 out of 16 changed files in this pull request and generated 5 comments.
File | Description |
---|---|
src/paperqa/settings.py | Adds multimodal boolean field to ParsingSettings for controlling image/table parsing |
src/paperqa/docs.py | Integrates multimodal setting into document reading pipeline via parse_images parameter |
packages/paper-qa-pypdf/src/paperqa_pypdf/reader.py | Implements multimodal support with pypdfium2 for full-page screenshots and media parsing |
packages/paper-qa-pymupdf/src/paperqa_pymupdf/reader.py | Extends PyMuPDF parser to support drawings, tables, and full-page screenshots with clustering |
tests/test_paperqa.py | Updates existing tests for multimodal behavior and adds new table querying test |
tests/test_agents.py | Updates file count expectations to include new influence.pdf test file |
Various test files | Comprehensive test coverage for multimodal parsing functionality across both PDF parsers |
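As a rough illustration of how the multimodal setting summarized above might flow into a reader through parse_images, here is a minimal sketch. The `ParsingSettings` dataclass, `read_document`, and `read_with_settings` below are hypothetical stand-ins, not PaperQA's actual classes or signatures; only the `multimodal` field name, its default of True, and the `parse_images` parameter name come from the summary.

```python
from dataclasses import dataclass


@dataclass
class ParsingSettings:
    """Hypothetical stand-in for the settings class; only `multimodal` is modeled."""

    multimodal: bool = True  # opt-out: media parsing is on unless disabled


def read_document(path: str, parse_images: bool) -> dict:
    """Hypothetical reader stub that records whether media would be parsed."""
    return {"path": path, "parse_images": parse_images}


def read_with_settings(path: str, settings: ParsingSettings) -> dict:
    # Forward the setting as the reader's parse_images argument.
    return read_document(path, parse_images=settings.multimodal)


default_result = read_with_settings("paper.pdf", ParsingSettings())
text_only_result = read_with_settings("paper.pdf", ParsingSettings(multimodal=False))
```

With the default settings the stub is asked to parse media; passing `multimodal=False` opts out and yields a text-only read.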
0bdcedc to 5bd9b72
media: list[ParsedMedia] = []
if parse_media:
    if full_page:  # Capture the entire page as one image
        pix = page.get_pixmap(dpi=image_dpi)
Could we add some error handling here in case a bad image is hit, raising an ImpossibleParsingError?
We already protect page = file.load_page(...) with ImpossibleParsingError, so I think once a Page is loaded in and constructed, we should be good.
I haven't seen a get_pixmap crash so far, so I'd like to hold off on this for the scope of this PR.
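For reference, the guard requested in this thread could look roughly like the sketch below. It is not part of the PR: `render_full_page` is a hypothetical helper, the `ImpossibleParsingError` class here is a local stand-in for the project's exception, and which exception types `get_pixmap` might raise on a bad page is an assumption (the author notes no crash has been observed so far).

```python
class ImpossibleParsingError(Exception):
    """Local stand-in for the project's exception of the same name."""


def render_full_page(page, image_dpi: int):
    """Render a page to a pixmap, converting render failures to ImpossibleParsingError.

    `page` is any object exposing get_pixmap(dpi=...), as in the snippet under review.
    """
    try:
        return page.get_pixmap(dpi=image_dpi)
    except (RuntimeError, ValueError) as exc:  # assumed failure types
        raise ImpossibleParsingError(
            f"Full-page render failed at {image_dpi} dpi."
        ) from exc


class _BrokenPage:
    """Fake page used only to exercise the error path."""

    def get_pixmap(self, dpi: int):
        raise RuntimeError("corrupt image stream")


class _GoodPage:
    """Fake page that returns a placeholder instead of a real pixmap."""

    def get_pixmap(self, dpi: int):
        return ("pixmap", dpi)


try:
    render_full_page(_BrokenPage(), image_dpi=150)
    outcome = "rendered"
except ImpossibleParsingError:
    outcome = "impossible to parse"

good_pixmap = render_full_page(_GoodPage(), image_dpi=150)
```

Chaining with `from exc` keeps the original failure attached to the raised error, so the underlying rendering problem stays visible in the traceback.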
lgtm -- some minor comments
5bd9b72 to addd2f1
addd2f1 to 4b0d848
This PR completes MVP multimodal support for PaperQA: