Conversation

@jamesbraza (Collaborator)

We retrieve relevant Texts using embeddings of the text. Since our embeddings aren't multimodal, the contents of media (images, tables, etc.) are not reflected in the embedding.

This leads to a failure mode where a given media item goes unassociated with its colocated text. For example:

  • The media landed on a separate page from the body text, due to LaTeX PDF compilation
  • The media's caption is minimalist and doesn't connect well to the media's contents

This PR fixes that by:

  1. Adding enrichment settings (e.g. prompt template and desired description length)
  2. Integrating async enrichment into read_doc, after PDF parsing and before chunking
  3. Incorporating enriched descriptions into Text.embeddings
  4. Creating an integration test to confirm this feature's usefulness
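To illustrate item 3, here is a minimal sketch of how an enriched description can be folded into the embeddable text without touching the quoted text itself. The class and field names below are illustrative stand-ins, not the actual paper-qa implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Media:
    data: bytes
    info: dict = field(default_factory=dict)  # enriched description lives in metadata


@dataclass
class Text:
    text: str
    media: list = field(default_factory=list)

    def get_embeddable_text(self) -> str:
        # Append any enriched media descriptions so the embedding reflects
        # media contents; the raw text is left untouched.
        descriptions = [
            m.info["enriched_description"]
            for m in self.media
            if "enriched_description" in m.info
        ]
        return "\n".join([self.text, *descriptions])


chunk = Text(
    text="See Figure 2 for ablation results.",
    media=[Media(b"...", {"enriched_description": "Bar chart comparing F1 across four ablations."})],
)
print(chunk.get_embeddable_text())
```

Only the embedding input changes; anything that reads `chunk.text` directly is unaffected.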

@jamesbraza jamesbraza self-assigned this Oct 16, 2025
@Copilot Copilot AI review requested due to automatic review settings October 16, 2025 21:57
@jamesbraza jamesbraza added the enhancement New feature or request label Oct 16, 2025
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 16, 2025
@dosubot (bot) commented Oct 16, 2025

Related Documentation

1 document(s) may need updating based on files changed in this PR


Copilot AI (Contributor) left a comment

Pull Request Overview

This PR enables media enrichment in embeddings by generating LLM-based descriptions of images and tables in documents. The enrichment allows better retrieval of text chunks containing media by incorporating visual content into embeddings without modifying the actual quoted text.

Key changes:

  • Added MultimodalOptions enum with ON_WITH_ENRICHMENT setting to control media enrichment
  • Integrated async media enrichment into the document reading pipeline before chunking
  • Modified embedding generation to optionally include enriched descriptions via Text.get_embeddable_text()

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/test_paperqa.py | Added integration test validating enrichment improves FigQA-style question answering |
| tests/cassettes/test_image_enrichment.yaml | VCR cassette recording API interactions for the enrichment test |
| src/paperqa/types.py | Added get_embeddable_text() method to Text class for optional enrichment inclusion |
| src/paperqa/settings.py | Added enrichment configuration fields and make_media_enricher() factory method |
| src/paperqa/readers.py | Integrated enrichment call in read_doc() before chunking |
| src/paperqa/prompts.py | Added media_enrichment_prompt_template for LLM-based media descriptions |
| src/paperqa/docs.py | Updated embedding calls to use get_embeddable_text() with enrichment flag |
| src/paperqa/core.py | Modified to only include table text (not all media text) in summaries |
| README.md | Documented new enrichment-related settings |


@whitead
Copy link
Collaborator

whitead commented Oct 16, 2025

How will this change caching behavior of parsed docs? Like will the parsing config be a hash later on? I'm trying to grok how many parameters now affect parsing.

Also - does it matter that our parsing will now be stochastic? Any downstream effects of this?

@jamesbraza jamesbraza force-pushed the image-descriptions branch 4 times, most recently from 15a8eb7 to 94088cf Compare October 17, 2025 22:03
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Oct 17, 2025
@jamesbraza
Copy link
Collaborator Author

jamesbraza commented Oct 17, 2025

Thanks for the questions. To go over it holistically:

  1. First we parse: text into ParsedText.content and zero or more media into ParsedMedia.data
    • Enrichment hasn't happened yet at this point, so to be clear, this PR doesn't impact parsing
  2. Next we chunk, which starts with enrichment
    • Enrichment is write-once, so once a media item has been enriched, we won't overwrite it
    • The enriched caption is placed into the ParsedMedia.info metadata dict -- so it's not "first class", and we know it's synthetic
    • Enrichment hyperparameters are reflected in chunk metadata
    • Otherwise chunking (which operates on first-class text only) is unchanged by this PR
  3. Then in Docs.aadd_texts, the enriched text is included in just the embeddings
    • This influences retrieval of multimodal content for Context creation -- the point of this PR
    • Note it doesn't shift any actual text content, so features such as 'snippets' can't be impacted
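A toy sketch of the ordering above, including the write-once guarantee (illustrative names only, not the actual paper-qa API):

```python
import asyncio


async def enrich(media: dict) -> None:
    # Write-once: setdefault never overwrites an existing enrichment
    media.setdefault("info", {}).setdefault(
        "enriched_description", f"LLM description of {media['name']}"
    )


async def read_doc(text: str, medias: list) -> list:
    # Parsing already produced `text` and `medias` (step 1);
    # chunking begins with enriching each media item (step 2)
    await asyncio.gather(*(enrich(m) for m in medias))
    # Chunk the first-class text only -- enrichment doesn't change this
    size = 20
    return [text[i : i + size] for i in range(0, len(text), size)]


medias = [{"name": "fig1"}]
chunks = asyncio.run(read_doc("x" * 45, medias))
print(len(chunks), medias[0]["info"]["enriched_description"])
```

Running read_doc again on the same medias would leave the existing descriptions untouched, which is the write-once behavior described above.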

> How will this change caching behavior of parsed docs? Like will the parsing config be a hash later on?

This PR doesn't change our parsing's caching (thanks to #1132). Here's what the metadata looks like:

  1. ParsedMetadata.name (with or without enrichment, it's the same): pdf|pipeline=StandardPdfPipeline|multimodal|images_scale=4.166666666666667
  2. ChunkMetadata.name: (without enrichment) paper-qa=5.29.2.dev60+gf88b581e1.d20251017|algorithm=overlap-pdf|size=5000|overlap=250
  3. ChunkMetadata.name: (with enrichment) paper-qa=5.29.2.dev60+gf88b581e1.d20251017|algorithm=overlap-pdf|size=5000|overlap=250|enriched=17|radius=1
    • With enrichment, a different hash gets used
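Since the enrichment fields are part of ChunkMetadata.name, enriched and non-enriched chunkings necessarily get distinct cache keys. A quick sketch (the actual hashing scheme in paper-qa may differ):

```python
import hashlib

# Name strings patterned after the ChunkMetadata.name examples above
base = "paper-qa=5.x|algorithm=overlap-pdf|size=5000|overlap=250"
enriched = base + "|enriched=17|radius=1"

# Different names hash to different cache keys, so the two chunkings
# can never collide in the cache
print(hashlib.sha256(base.encode()).hexdigest()[:12])
print(hashlib.sha256(enriched.encode()).hexdigest()[:12])
```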

> Also - does it matter that our parsing will now be stochastic? Any downstream effects of this?

It's not that stochastic, because enrichment is write-once -- once we have it, we keep it. When storing a ParsedMedia in a DB, one should be storing the info attribute, and that's unchanged.

Let me know if you have any follow-up questions/comments.

Edit: I added some of this information to the README so it's documented beyond this PR

@jamesbraza jamesbraza requested review from whitead and removed request for whitead October 17, 2025 22:04
@jamesbraza jamesbraza requested a review from Copilot October 18, 2025 03:46
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.



@whitead (Collaborator) left a comment

This looks good. It would be nice to keep all our defaults on one LLM provider (right now, this is OpenAI). Can we do the same for the default enrichment model, rather than sonnet-4.5?

```diff
  texts,
- await embedding_model.embed_documents(texts=[t.text for t in texts]),
+ await embedding_model.embed_documents(
+     texts=await asyncio.gather(
```
Comment from a Collaborator:

Do we need to semaphore this? You can say no and keep the code simple. Just wondering about LLM rate limits here.
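For reference, one common way to bound concurrency on a gather is wrapping each coroutine in a semaphore (a generic sketch, not code from this PR):

```python
import asyncio


async def bounded_gather(coros, limit: int = 8):
    # Cap the number of in-flight coroutines at `limit` to stay
    # under provider rate limits; result order is preserved.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))


async def fake_embed(i: int) -> int:
    await asyncio.sleep(0)  # stand-in for a network call
    return i * i


results = asyncio.run(bounded_gather([fake_embed(i) for i in range(5)], limit=2))
print(results)  # [0, 1, 4, 9, 16]
```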

@jamesbraza (Collaborator, Author) replied:

Yeah, good question. Our enrichment process takes place within read_doc, and this code in aadd_texts runs after that. So at the moment this asyncio.gather is a placeholder.

I had made t.get_embeddable_text async so that, if ever desired, we could support:

  • Just-in-time enrichment
  • Fetching enrichment from external storage

I went ahead and documented this more extensively in the get_embeddable_text docstring
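A sketch of what the async signature leaves room for -- lazily producing or fetching the enrichment the first time the embeddable text is requested (hypothetical class, not the actual paper-qa Text implementation):

```python
import asyncio
from typing import Optional


class Text:
    def __init__(self, text: str, enrichment: Optional[str] = None):
        self.text = text
        self._enrichment = enrichment

    async def _fetch_enrichment(self) -> str:
        # Stand-in for a just-in-time LLM call or an external-storage fetch
        await asyncio.sleep(0)
        return "fetched description"

    async def get_embeddable_text(self) -> str:
        # Lazily populate the enrichment on first use, then cache it
        if self._enrichment is None:
            self._enrichment = await self._fetch_enrichment()
        return f"{self.text}\n{self._enrichment}"


t = Text("Figure caption text.")
print(asyncio.run(t.get_embeddable_text()))
```

With enrichment already populated during read_doc, the await resolves immediately; the async signature just keeps the door open for the lazy paths above.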

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 20, 2025