Conversation

@jamesbraza (Collaborator)

We retrieve relevant Texts using embeddings of the text. Since our embeddings aren't multimodal, the contents of media (images, tables, etc.) are not reflected in the embedding.

This leads to a failure mode where a given media item goes unassociated with its colocated text. For example:

  • The media landed on a separate page from the body text, due to LaTeX PDF compilation
  • The media's caption is minimalist and doesn't connect well to the media's contents

This PR fixes that by:

  1. Adding enrichment settings (e.g. prompt template and desired description length)
  2. Integrating async enrichment into read_doc, after PDF parsing and before chunking
  3. Incorporating enriched descriptions into Text.embeddings
  4. Creating an integration test to confirm this feature's usefulness
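To illustrate item 3, here is a minimal sketch of how an enriched description can be folded into the embeddable text without touching the quoted text itself. The class and field names below are illustrative stand-ins, not the actual paper-qa implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Media:
    data: bytes
    info: dict = field(default_factory=dict)  # enriched description lives in metadata


@dataclass
class Text:
    text: str
    media: list = field(default_factory=list)

    def get_embeddable_text(self) -> str:
        # Append any enriched media descriptions so the embedding reflects
        # media contents; the raw text is left untouched.
        descriptions = [
            m.info["enriched_description"]
            for m in self.media
            if "enriched_description" in m.info
        ]
        return "\n".join([self.text, *descriptions])


chunk = Text(
    text="See Figure 2 for ablation results.",
    media=[Media(b"...", {"enriched_description": "Bar chart comparing F1 across four ablations."})],
)
print(chunk.get_embeddable_text())
```

Only the embedding input changes; anything that reads `chunk.text` directly is unaffected.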

@jamesbraza jamesbraza self-assigned this Oct 16, 2025
@Copilot Copilot AI review requested due to automatic review settings October 16, 2025 21:57
@jamesbraza jamesbraza added the enhancement New feature or request label Oct 16, 2025
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 16, 2025
@dosubot (bot) commented Oct 16, 2025

Related Documentation

1 document(s) may need updating based on files changed in this PR


Copilot AI (Contributor) left a comment

Pull Request Overview

This PR enables media enrichment in embeddings by generating LLM-based descriptions of images and tables in documents. The enrichment allows better retrieval of text chunks containing media by incorporating visual content into embeddings without modifying the actual quoted text.

Key changes:

  • Added MultimodalOptions enum with ON_WITH_ENRICHMENT setting to control media enrichment
  • Integrated async media enrichment into the document reading pipeline before chunking
  • Modified embedding generation to optionally include enriched descriptions via Text.get_embeddable_text()

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/test_paperqa.py | Added integration test validating enrichment improves FigQA-style question answering |
| tests/cassettes/test_image_enrichment.yaml | VCR cassette recording API interactions for the enrichment test |
| src/paperqa/types.py | Added get_embeddable_text() method to Text class for optional enrichment inclusion |
| src/paperqa/settings.py | Added enrichment configuration fields and make_media_enricher() factory method |
| src/paperqa/readers.py | Integrated enrichment call in read_doc() before chunking |
| src/paperqa/prompts.py | Added media_enrichment_prompt_template for LLM-based media descriptions |
| src/paperqa/docs.py | Updated embedding calls to use get_embeddable_text() with enrichment flag |
| src/paperqa/core.py | Modified to only include table text (not all media text) in summaries |
| README.md | Documented new enrichment-related settings |


@whitead
Copy link
Collaborator

whitead commented Oct 16, 2025

How will this change caching behavior of parsed docs? Like will the parsing config be a hash later on? I'm trying to grok how many parameters now affect parsing.

Also - does it matter that our parsing will now be stochastic? Any downstream effects of this?

@jamesbraza jamesbraza force-pushed the image-descriptions branch 4 times, most recently from 15a8eb7 to 94088cf Compare October 17, 2025 22:03
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Oct 17, 2025
@jamesbraza
Copy link
Collaborator Author

jamesbraza commented Oct 17, 2025

Thanks for the questions. To go over it holistically:

  1. First we parse: text into ParsedText.content and zero or more media into ParsedMedia.data
    • Enrichment hasn't happened yet at this point, so to be clear, this PR doesn't impact parsing
  2. Next we chunk, which starts with enrichment
    • Enrichment is write-once, so once a media item has been enriched, we won't overwrite it
    • The enriched caption is placed into the ParsedMedia.info metadata dict -- so it's not "first class", and we know it's synthetic
    • Enrichment hyperparameters are reflected in chunk metadata
    • Otherwise chunking (which operates on first-class text only) is unchanged by this PR
  3. Then in Docs.aadd_texts, the enriched text is included in just the embeddings
    • This influences retrieval of multimodal content for Context creation -- the point of this PR
    • Note it doesn't shift any actual text content, so features such as 'snippets' can't be impacted
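A toy sketch of the ordering above, including the write-once guarantee (illustrative names only, not the actual paper-qa API):

```python
import asyncio


async def enrich(media: dict) -> None:
    # Write-once: setdefault never overwrites an existing enrichment
    media.setdefault("info", {}).setdefault(
        "enriched_description", f"LLM description of {media['name']}"
    )


async def read_doc(text: str, medias: list) -> list:
    # Parsing already produced `text` and `medias` (step 1);
    # chunking begins with enriching each media item (step 2)
    await asyncio.gather(*(enrich(m) for m in medias))
    # Chunk the first-class text only -- enrichment doesn't change this
    size = 20
    return [text[i : i + size] for i in range(0, len(text), size)]


medias = [{"name": "fig1"}]
chunks = asyncio.run(read_doc("x" * 45, medias))
print(len(chunks), medias[0]["info"]["enriched_description"])
```

Running read_doc again on the same medias would leave the existing descriptions untouched, which is the write-once behavior described above.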

> How will this change caching behavior of parsed docs? Like will the parsing config be a hash later on?

This PR doesn't change our parsing's caching (thanks to #1132). Here's what the metadata looks like:

  1. ParsedMetadata.name (with or without enrichment, it's the same): pdf|pipeline=StandardPdfPipeline|multimodal|images_scale=4.166666666666667
  2. ChunkMetadata.name: (without enrichment) paper-qa=5.29.2.dev60+gf88b581e1.d20251017|algorithm=overlap-pdf|size=5000|overlap=250
  3. ChunkMetadata.name: (with enrichment) paper-qa=5.29.2.dev60+gf88b581e1.d20251017|algorithm=overlap-pdf|size=5000|overlap=250|enriched=17|radius=1
    • With enrichment, a different hash gets used
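Since the enrichment fields are part of ChunkMetadata.name, enriched and non-enriched chunkings necessarily get distinct cache keys. A quick sketch (the actual hashing scheme in paper-qa may differ):

```python
import hashlib

# Name strings patterned after the ChunkMetadata.name examples above
base = "paper-qa=5.x|algorithm=overlap-pdf|size=5000|overlap=250"
enriched = base + "|enriched=17|radius=1"

# Different names hash to different cache keys, so the two chunkings
# can never collide in the cache
print(hashlib.sha256(base.encode()).hexdigest()[:12])
print(hashlib.sha256(enriched.encode()).hexdigest()[:12])
```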

> Also - does it matter that our parsing will now be stochastic? Any downstream effects of this?

It's not that stochastic, because enrichment is write-once -- once we have it, we keep it. When storing a ParsedMedia in a DB, one should be storing the info attribute, and that's unchanged.

Let me know if you have any follow-up questions/comments.

Edit: I added some of this information to the README so it's documented beyond this PR

@jamesbraza jamesbraza requested review from whitead and removed request for whitead October 17, 2025 22:04
@jamesbraza jamesbraza requested a review from Copilot October 18, 2025 03:46
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.



@whitead (Collaborator) left a comment

This looks good. It would be nice to keep all our defaults on one LLM provider (right now, this is OpenAI). Can we do the same for the default enrichment model, rather than sonnet-4.5?

```diff
  texts,
- await embedding_model.embed_documents(texts=[t.text for t in texts]),
+ await embedding_model.embed_documents(
+     texts=await asyncio.gather(
```
Comment from a Collaborator:

Do we need to semaphore this? You can say no and keep the code simple. Just wondering about LLM rate limits here.
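For reference, one common way to bound concurrency on a gather is wrapping each coroutine in a semaphore (a generic sketch, not code from this PR):

```python
import asyncio


async def bounded_gather(coros, limit: int = 8):
    # Cap the number of in-flight coroutines at `limit` to stay
    # under provider rate limits; result order is preserved.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))


async def fake_embed(i: int) -> int:
    await asyncio.sleep(0)  # stand-in for a network call
    return i * i


results = asyncio.run(bounded_gather([fake_embed(i) for i in range(5)], limit=2))
print(results)  # [0, 1, 4, 9, 16]
```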

@jamesbraza (Collaborator, Author) replied:

Yeah, good question. Our enrichment process takes place within read_doc, and this code in aadd_texts runs after that. So at the moment this asyncio.gather is a placeholder.

I had made t.get_embeddable_text async so that, if ever desired, we could support:

  • Just-in-time enrichment
  • Fetching enrichment from external storage

I went ahead and documented this more extensively in the get_embeddable_text docstring
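A sketch of what the async signature leaves room for -- lazily producing or fetching the enrichment the first time the embeddable text is requested (hypothetical class, not the actual paper-qa Text implementation):

```python
import asyncio
from typing import Optional


class Text:
    def __init__(self, text: str, enrichment: Optional[str] = None):
        self.text = text
        self._enrichment = enrichment

    async def _fetch_enrichment(self) -> str:
        # Stand-in for a just-in-time LLM call or an external-storage fetch
        await asyncio.sleep(0)
        return "fetched description"

    async def get_embeddable_text(self) -> str:
        # Lazily populate the enrichment on first use, then cache it
        if self._enrichment is None:
            self._enrichment = await self._fetch_enrichment()
        return f"{self.text}\n{self._enrichment}"


t = Text("Figure caption text.")
print(asyncio.run(t.get_embeddable_text()))
```

With enrichment already populated during read_doc, the await resolves immediately; the async signature just keeps the door open for the lazy paths above.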

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 20, 2025