Depending on the source document, the same image can appear multiple times.
Thus, clients should consider media databases
to have a many-to-many relationship with chunks.

Since PaperQA's evidence gathering process centers on text-based retrieval,
it's possible relevant image(s) or table(s) aren't retrieved
because their associated text content is irrelevant to the query.
For a concrete example, imagine a figure in a paper has a terse caption
and is placed one page after the relevant main-text discussion.
To solve this problem, PaperQA supports media enrichment at document read-time.
After reading in the PDF,
the `parsing.enrichment_llm` is given the `parsing.enrichment_prompt`
and co-located text to generate a synthetic caption for every image/table.
The synthetic captions are used to shift the embeddings of each text chunk,
but are kept separate from the actual source text.
This way, evidence gathering can fetch relevant images/tables
without risk of polluting contextual summaries with LLM-generated captions.

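For illustration, here is a minimal sketch of tuning enrichment through a `Settings` object;
the model name and prompt wording below are assumptions rather than PaperQA defaults,
and the resulting `Settings` can be passed wherever one is accepted:

```python
from paperqa import Settings

# Minimal sketch: the nested dict is validated into the parsing settings.
# The model name and prompt text are illustrative assumptions, not defaults.
settings = Settings(
    parsing={
        "enrichment_llm": "gpt-4o",  # any multimodal-capable LLM
        "enrichment_prompt": (
            "Write a short synthetic caption for this image or table,"
            " using the co-located text as context,"
            " so it can be found by text-based retrieval."
        ),
    }
)
```
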
If you want multimodal PDF reading but do not want enrichment
(since it adds one LLM prompt per media item at read-time),
enrichment can be disabled by setting `parsing.multimodal` to `ON_WITHOUT_ENRICHMENT`.

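A minimal sketch of that configuration;
passing the value as a plain string assumes it is coerced into the underlying setting's enum,
otherwise use the corresponding enum member from PaperQA's settings module:

```python
from paperqa import Settings

# Minimal sketch: keep multimodal PDF reading, but skip the per-media
# enrichment LLM call at read-time. Passing the string assumes it is
# validated into the underlying enum value.
settings = Settings(parsing={"multimodal": "ON_WITHOUT_ENRICHMENT"})
```
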
When creating contextual summaries on a given chunk (a `Text`),
the summary LLM is passed both the chunk's text and the chunk's associated media,
but the output contextual summary itself remains text-only.