doc-loader: retain Azure Doc Intelligence API metadata in Document parser #28382

jmohren · 2024-11-27T10:24:35Z

Description:
This PR modifies the doc_intelligence.py parser in the community package to include all metadata returned by the Azure Doc Intelligence API in the Document object. Previously, only the parsed content (markdown) was retained, while other important metadata such as bounding boxes (bboxes) for images and tables was discarded. These image bboxes are crucial for supporting use cases like multi-modal RAG workflows when using Azure Doc Intelligence.

The change ensures that all information returned by the Azure Doc Intelligence API is preserved by setting the metadata attribute of the Document object to the entire result returned by the API, rather than an empty dictionary. This extends the parser's utility for complex use cases without breaking existing functionality.
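As a rough illustration of what a downstream consumer gains, the sketch below pulls figure bounding regions out of a metadata dictionary. The sample dictionary is hand-written, and its key names (`figures`, `boundingRegions`, `pageNumber`, `polygon`) are an assumption about the shape of the serialized API result, so check them against the response you actually receive. `figure_regions` is a hypothetical helper and not part of LangChain.

```python
from typing import Any, Dict, List, Tuple

# Hand-written stand-in for Document.metadata after this change.
# The key names are an ASSUMPTION about the serialized API result.
sample_metadata: Dict[str, Any] = {
    "content": "# Report",
    "figures": [
        {
            "id": "1.1",
            "boundingRegions": [
                {"pageNumber": 1, "polygon": [1.0, 2.0, 4.0, 2.0, 4.0, 5.0, 1.0, 5.0]}
            ],
        }
    ],
}


def figure_regions(metadata: Dict[str, Any]) -> List[Tuple[int, List[float]]]:
    """Return (page_number, polygon) pairs for every figure region found."""
    regions: List[Tuple[int, List[float]]] = []
    # Tolerate a missing or empty "figures" entry (e.g. a text-only document).
    for figure in metadata.get("figures") or []:
        for region in figure.get("boundingRegions") or []:
            regions.append((region["pageNumber"], region["polygon"]))
    return regions


regions = figure_regions(sample_metadata)
```

With the previous behavior (`metadata={}`), the same helper would simply return an empty list, which is the limitation this PR removes.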

Issue:
This change does not address a specific issue number, but it resolves a critical limitation in supporting multimodal workflows when using the LangChain wrapper for the Azure API.

Dependencies:
No additional dependencies are required for this change.

ccurme · 2024-11-27T15:39:02Z

libs/community/langchain_community/document_loaders/parsers/doc_intelligence.py

@@ -71,7 +71,7 @@ def _generate_docs_page(self, result: Any) -> Iterator[Document]:
            yield d

    def _generate_docs_single(self, result: Any) -> Iterator[Document]:
-        yield Document(page_content=result.content, metadata={})
+        yield Document(page_content=result.content, metadata=result)


is result always a dict? I see it typed as Any.

jmohren · Nov 27, 2024

No, you are right: it is of type AnalyzeResult, which has an as_dict method. I will adjust accordingly. Thanks for pointing this out!

jmohren · Nov 27, 2024

The Azure result function is implemented generically using a type variable (PollingReturnType_co), which keeps it flexible across use cases. This is why the docstring states Any. In this context, however, the polling method set up during initialization fixes the return type to AnalyzeResult, because it deserializes the server response into that strongly typed object.
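The typing argument above can be sketched with a small generic poller. Only the type variable name `PollingReturnType_co` comes from the discussion; `SketchPoller` and `SketchResult` are hypothetical stand-ins, not the Azure SDK classes, and the deserializer is a plain callable chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Generic, TypeVar

# Same name as the type variable mentioned above; everything else is a stand-in.
PollingReturnType_co = TypeVar("PollingReturnType_co", covariant=True)


@dataclass
class SketchResult:
    """Hypothetical stand-in for the strongly typed result object."""

    content: str


class SketchPoller(Generic[PollingReturnType_co]):
    """Hypothetical poller that is generic in what result() returns."""

    def __init__(
        self,
        raw_response: Dict[str, Any],
        deserialize: Callable[[Dict[str, Any]], PollingReturnType_co],
    ) -> None:
        self._raw_response = raw_response
        self._deserialize = deserialize

    def result(self) -> PollingReturnType_co:
        # The concrete return type is fixed by the deserializer
        # supplied when the poller was constructed.
        return self._deserialize(self._raw_response)


poller: SketchPoller[SketchResult] = SketchPoller(
    {"content": "# Title"},
    lambda raw: SketchResult(content=raw["content"]),
)
result = poller.result()
```

A caller that only sees the unparameterized poller has no better annotation than Any, even though the object returned at runtime is always the concrete result type.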

jmohren · Dec 4, 2024

@ccurme, is there any update on the review? :)
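The adjustment agreed on in this thread (converting the result with as_dict before storing it as metadata) might look roughly like the sketch below. `SketchDocument` and `SketchAnalyzeResult` are hypothetical stand-ins so the example runs without LangChain or the Azure SDK; only the `content` attribute and the `as_dict` method are taken from the discussion.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator


@dataclass
class SketchDocument:
    """Stand-in for LangChain's Document (page_content plus a metadata dict)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SketchAnalyzeResult:
    """Stand-in for AnalyzeResult: exposes .content and an as_dict() method."""

    def __init__(self, content: str, extra: Dict[str, Any]) -> None:
        self.content = content
        self._extra = extra

    def as_dict(self) -> Dict[str, Any]:
        return {"content": self.content, **self._extra}


def generate_docs_single(result: SketchAnalyzeResult) -> Iterator[SketchDocument]:
    # Convert the result object to a plain dict before using it as metadata,
    # so that metadata is always a dictionary rather than an SDK object.
    yield SketchDocument(page_content=result.content, metadata=result.as_dict())


fake_result = SketchAnalyzeResult("# Title", {"figures": []})
doc = next(generate_docs_single(fake_result))
```

Storing a plain dict rather than the SDK object keeps the metadata usable by code that expects dictionary access, which is the concern raised in the review question.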

keep information returned by doc intelligence api

c0a379c

dosubot bot added the size:XS label Nov 27, 2024

dosubot bot added the community and Ɑ: doc loader labels Nov 27, 2024

ccurme reviewed Nov 27, 2024

transform AnalyzeResult to dict

121530d

jmohren requested a review from ccurme November 29, 2024 08:33

efriis assigned ccurme Dec 9, 2024

ccurme approved these changes Dec 10, 2024

dosubot bot added the lgtm label Dec 10, 2024

ccurme merged commit c1d348e into langchain-ai:master Dec 10, 2024
18 checks passed
