Supporting media enrichment in embeddings #1143
Conversation
Related Documentation: 1 document(s) may need updating based on files changed in this PR.
Pull Request Overview
This PR enables media enrichment in embeddings by generating LLM-based descriptions of images and tables in documents. The enrichment allows better retrieval of text chunks containing media by incorporating visual content into embeddings without modifying the actual quoted text.
Key changes:
- Added MultimodalOptions enum with ON_WITH_ENRICHMENT setting to control media enrichment
- Integrated async media enrichment into the document reading pipeline before chunking
- Modified embedding generation to optionally include enriched descriptions via Text.get_embeddable_text()
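For orientation, here is a minimal sketch of what the new enum could look like. Only the names MultimodalOptions and ON_WITH_ENRICHMENT appear in this thread; the other members, string values, and the Python 3.11+ StrEnum base are assumptions for illustration:

```python
from enum import StrEnum


class MultimodalOptions(StrEnum):
    # OFF and ON are assumed members; only ON_WITH_ENRICHMENT is named in the PR.
    OFF = "off"  # ignore media during parsing
    ON = "on"  # parse media, but embed only the raw chunk text
    ON_WITH_ENRICHMENT = "on_with_enrichment"  # also fold LLM media descriptions into embeddings
```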
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/test_paperqa.py | Added integration test validating enrichment improves FigQA-style question answering |
| tests/cassettes/test_image_enrichment.yaml | VCR cassette recording API interactions for the enrichment test |
| src/paperqa/types.py | Added get_embeddable_text() method to Text class for optional enrichment inclusion |
| src/paperqa/settings.py | Added enrichment configuration fields and make_media_enricher() factory method |
| src/paperqa/readers.py | Integrated enrichment call in read_doc() before chunking |
| src/paperqa/prompts.py | Added media_enrichment_prompt_template for LLM-based media descriptions |
| src/paperqa/docs.py | Updated embedding calls to use get_embeddable_text() with enrichment flag |
| src/paperqa/core.py | Modified to only include table text (not all media text) in summaries |
| README.md | Documented new enrichment-related settings |
How will this change caching behavior of parsed docs? Like will the parsing config be a hash later on? I'm trying to grok how many parameters now affect parsing. Also - does it matter that our parsing will now be stochastic? Any downstream effects of this?
Thanks for the questions. To go over it holistically:

This PR doesn't change our parsing's caching (thanks to #1132). Here's what the metadata looks like:

It's not that stochastic, because enrichment is write-once -- once we have it, we keep it around. When storing a

Let me know if you have any follow-up questions/comments.

Edit: I added some of this information to the
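To illustrate the write-once behavior described above, here is a minimal sketch; the enrichment attribute and the enricher API are hypothetical names, not the PR's actual ones:

```python
async def maybe_enrich(text, enricher) -> None:
    # Write-once: if a description already exists (e.g. loaded from a
    # previously stored Docs object), keep it instead of re-calling the LLM,
    # so the stochastic step runs at most once per media.
    if getattr(text, "enrichment", None):
        return
    text.enrichment = await enricher.describe(text.media)  # hypothetical API
```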
Pull Request Overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
This looks good. It would be nice to keep all our defaults on one LLM provider (right now, that's OpenAI). Can we do the same for the default enrichment model, rather than Sonnet 4.5?
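Even if the default changes, users could presumably pin the enrichment model themselves. The exact field path below is an assumption, since the PR's settings names aren't shown in this thread:

```python
from paperqa import Settings

settings = Settings()
# Hypothetical attribute: the PR adds enrichment configuration fields to
# Settings, but the precise name/location is an assumption here.
settings.parsing.enrichment_llm = "gpt-4o"
```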
```diff
  texts,
- await embedding_model.embed_documents(texts=[t.text for t in texts]),
+ await embedding_model.embed_documents(
+     texts=await asyncio.gather(
```
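Pieced together from the diff above, the new embedding call presumably reads roughly like the following; the variable names come from the snippet, and the function wrapper is added here just to make the sketch self-contained:

```python
import asyncio
from collections.abc import Sequence


async def embed_texts(embedding_model, texts: Sequence) -> list:
    # Embed the enrichment-aware text rather than the raw chunk text.
    embeddable = await asyncio.gather(*(t.get_embeddable_text() for t in texts))
    return await embedding_model.embed_documents(texts=embeddable)
```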
Do we need to semaphore this? You can say no and keep the code simple. Just wondering about LLM rate limits here.
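For reference, if rate limits ever did bite here, bounding the gather is a small change. A sketch using standard asyncio, not code from this PR:

```python
import asyncio
from collections.abc import Awaitable, Iterable


async def gather_bounded(coros: Iterable[Awaitable], limit: int = 8) -> list:
    # Cap concurrent in-flight calls so large documents don't trip LLM rate limits.
    semaphore = asyncio.Semaphore(limit)

    async def run(coro: Awaitable):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```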
Yeah, good question. Our enrichment process takes place within read_doc, and this code in aadd_texts runs after that, so at the moment this asyncio.gather is a placeholder.

I had made t.get_embeddable_text async so that, if ever desired, we could support:
- Just-in-time enrichment
- Fetching enrichment from external storage

I went ahead and documented this more extensively in the get_embeddable_text docstring.
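A sketch of how the async hook leaves room for those cases; only the method name, its async-ness, and the existence of an enrichment flag come from this thread, while the attribute and flag names are assumptions:

```python
class Text:  # simplified stand-in for paperqa's Text model
    def __init__(self, text: str, enrichment: str | None = None):
        self.text = text
        self.enrichment = enrichment  # hypothetical field name

    async def get_embeddable_text(self, use_enrichment: bool = True) -> str:
        # Async so a future version could enrich just-in-time or fetch the
        # description from external storage; today it's a plain lookup.
        if use_enrichment and self.enrichment:
            return f"{self.text}\n\n{self.enrichment}"
        return self.text
```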
Integrated media enrichment into settings and Docs.aadd
We retrieve relevant Texts using embeddings of the text. Since our embeddings aren't multimodal, the contents of media do not get reflected in the embedding. This leads to a failure mode where a given media is unassociated with its colocated text.

This PR fixes that by:
- Integrating async enrichment into read_doc, after PDF parsing and before chunking
- Optionally including the enriched descriptions in Text embeddings
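End to end, enabling the behavior might look like the following; the settings field path and string value are assumptions based on the summary above, while Docs.aadd is the existing paperqa entry point:

```python
import asyncio

from paperqa import Docs, Settings


async def main() -> None:
    settings = Settings()
    # Assumed field path: the PR gates enrichment behind
    # MultimodalOptions.ON_WITH_ENRICHMENT somewhere under Settings.
    settings.parsing.multimodal = "on_with_enrichment"

    docs = Docs()
    await docs.aadd("paper_with_figures.pdf", settings=settings)


asyncio.run(main())
```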