jamesbraza
Collaborator

This PR completes MVP multimodal support for PaperQA:

  1. Moves PDF readers to support images and tables
    • With several routes, such as a full-page screenshot vs. an image screenshot
  2. Creates an opt-out setting for multimodal parsing
  3. Adds a test PDF to check that we can parse table data

@jamesbraza jamesbraza self-assigned this Aug 6, 2025
@jamesbraza jamesbraza added the enhancement New feature or request label Aug 6, 2025
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Aug 6, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements MVP multimodal support for PaperQA, allowing the system to parse and utilize images and tables from PDFs alongside text content. The implementation introduces an opt-out setting for multimodal parsing and adds comprehensive support across both PyPDF and PyMuPDF parsers.

  • Adds multimodal parsing capability to PDF readers supporting images, tables, and full-page screenshots
  • Introduces multimodal configuration setting (defaults to True) to control image/table parsing
  • Creates comprehensive test coverage for table querying and multimodal functionality

Reviewed Changes

Copilot reviewed 11 out of 16 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/paperqa/settings.py | Adds multimodal boolean field to ParsingSettings for controlling image/table parsing |
| src/paperqa/docs.py | Integrates multimodal setting into document reading pipeline via parse_images parameter |
| packages/paper-qa-pypdf/src/paperqa_pypdf/reader.py | Implements multimodal support with pypdfium2 for full-page screenshots and media parsing |
| packages/paper-qa-pymupdf/src/paperqa_pymupdf/reader.py | Extends PyMuPDF parser to support drawings, tables, and full-page screenshots with clustering |
| tests/test_paperqa.py | Updates existing tests for multimodal behavior and adds new table querying test |
| tests/test_agents.py | Updates file count expectations to include new influence.pdf test file |
| Various test files | Comprehensive test coverage for multimodal parsing functionality across both PDF parsers |
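
To illustrate the opt-out flow summarized above (a `multimodal` flag on `ParsingSettings` that is forwarded to the reader as `parse_images`), here is a minimal, self-contained sketch. It is not the PR's actual code: the dataclasses and the `read_page` and `read_with_settings` helpers are hypothetical stand-ins, and only the names `ParsingSettings`, `multimodal`, and `parse_images` come from the summary.

```python
from dataclasses import dataclass, field


@dataclass
class ParsingSettings:
    """Hypothetical stand-in for the settings class; only the flag matters here."""

    multimodal: bool = True  # opt-out: media parsing stays on unless disabled


@dataclass
class ParsedPage:
    """Hypothetical container for one page's text and any extracted media."""

    text: str
    media: list[bytes] = field(default_factory=list)


def read_page(text: str, images: list[bytes], parse_images: bool) -> ParsedPage:
    """Toy reader: attach images only when parse_images is set."""
    return ParsedPage(text=text, media=list(images) if parse_images else [])


def read_with_settings(
    text: str, images: list[bytes], settings: ParsingSettings
) -> ParsedPage:
    """Forward the setting to the reader as its parse_images argument."""
    return read_page(text, images, parse_images=settings.multimodal)


# Default settings keep media; opting out yields a text-only page
default_page = read_with_settings("Table 1 caption", [b"png-bytes"], ParsingSettings())
text_only_page = read_with_settings(
    "Table 1 caption", [b"png-bytes"], ParsingSettings(multimodal=False)
)
```

Defaulting the flag to `True` means existing callers get media parsing without changes, and a text-only pipeline has to opt out explicitly.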

@jamesbraza jamesbraza force-pushed the multimodal-pdfs branch 2 times, most recently from 0bdcedc to 5bd9b72 August 6, 2025 20:31
media: list[ParsedMedia] = []
if parse_media:
    if full_page:  # Capture the entire page as one image
        pix = page.get_pixmap(dpi=image_dpi)
Collaborator

Could we add some error handling here, in case a bad image is hit, that raises an ImpossibleParsingError?

Collaborator Author

We already protect page = file.load_page(...) with ImpossibleParsingError, so I think that once a Page is loaded and constructed, we should be good.

I haven't seen a get_pixmap crash so far, so I'd like to hold off on this for the scope of this PR.
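
For reference, the guard requested in this thread might look roughly like the sketch below. It is an illustration under assumptions, not code from this PR: `FakePage` is a hypothetical stand-in for a loaded page, `render_full_page` is a made-up helper name, and `ImpossibleParsingError` is redefined locally so the example is self-contained.

```python
class ImpossibleParsingError(Exception):
    """Local stand-in for the error named in this thread."""


class FakePage:
    """Hypothetical page object whose rendering can be made to fail."""

    def __init__(self, number: int, corrupt: bool = False) -> None:
        self.number = number
        self.corrupt = corrupt

    def get_pixmap(self, dpi: int) -> bytes:
        if self.corrupt:
            raise RuntimeError("could not render page")
        return f"pixmap-page{self.number}-{dpi}dpi".encode()


def render_full_page(page: FakePage, image_dpi: int = 150) -> bytes:
    """Render one page, converting any rendering failure into a parsing error."""
    try:
        return page.get_pixmap(dpi=image_dpi)
    except Exception as exc:
        # Chain the original exception so the root cause stays visible
        raise ImpossibleParsingError(
            f"Failed to render page {page.number} at {image_dpi} DPI."
        ) from exc
```

Callers that already handle `ImpossibleParsingError` from page loading would then handle rendering failures the same way.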

Collaborator

@mskarlin mskarlin left a comment

lgtm -- some minor comments

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Aug 6, 2025
@jamesbraza jamesbraza merged commit f71d023 into main Aug 7, 2025
7 checks passed
@jamesbraza jamesbraza deleted the multimodal-pdfs branch August 7, 2025 05:12
jamesbraza added a commit that referenced this pull request Aug 25, 2025
jamesbraza added a commit that referenced this pull request Aug 26, 2025