doc-loader: retain Azure Doc Intelligence API metadata in Document parser #28382

Merged
merged 2 commits into from
Dec 10, 2024

Contributor

@jmohren jmohren commented Nov 27, 2024

Description:
This PR modifies the doc_intelligence.py parser in the community package to include all metadata returned by the Azure Doc Intelligence API in the Document object. Previously, only the parsed content (markdown) was retained, while other important metadata such as bounding boxes (bboxes) for images and tables was discarded. These image bboxes are crucial for supporting use cases like multi-modal RAG workflows when using Azure Doc Intelligence.

The change ensures that all information returned by the Azure Doc Intelligence API is preserved by setting the metadata attribute of the Document object to the entire result returned by the API, rather than an empty dictionary. This extends the parser's utility for complex use cases without breaking existing functionality.
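
As a rough illustration of why the retained metadata matters, the sketch below pulls figure bounding regions out of a document's metadata. It assumes the metadata is a plain dict (as produced by the as_dict method discussed later in the review) and that figures appear under a "figures" key with "boundingRegions", "pageNumber" and "polygon" entries. Those key names, the Document stand-in, and the figure_regions helper are assumptions for illustration, not part of this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in for LangChain's Document, to keep the sketch dependency-free."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def figure_regions(doc: Document) -> List[Dict[str, Any]]:
    """Collect the page number and polygon of every figure found in the metadata.

    Hypothetical helper; the key names below are assumed, not confirmed by the PR.
    """
    regions: List[Dict[str, Any]] = []
    for figure in doc.metadata.get("figures", []):
        for region in figure.get("boundingRegions", []):
            regions.append(
                {"page": region.get("pageNumber"), "polygon": region.get("polygon")}
            )
    return regions


# Hand-written sample metadata, not real API output.
sample = Document(
    page_content="# Report",
    metadata={
        "figures": [
            {
                "boundingRegions": [
                    {"pageNumber": 1, "polygon": [1.0, 1.0, 3.0, 1.0, 3.0, 2.0, 1.0, 2.0]}
                ]
            }
        ]
    },
)

regions = figure_regions(sample)
print(regions)
```

With metadata left as an empty dict, as before this change, the same helper would simply return an empty list, which is the limitation the PR describes.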

Issue:
This change does not address a specific issue number, but it resolves a critical limitation in supporting multimodal workflows when using the LangChain wrapper for the Azure API.

Dependencies:
No additional dependencies are required for this change.

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Nov 27, 2024
@dosubot dosubot bot added community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Nov 27, 2024
@@ -71,7 +71,7 @@ def _generate_docs_page(self, result: Any) -> Iterator[Document]:
             yield d

     def _generate_docs_single(self, result: Any) -> Iterator[Document]:
-        yield Document(page_content=result.content, metadata={})
+        yield Document(page_content=result.content, metadata=result)
Collaborator

Is result always a dict? I see it is typed as Any.

Contributor Author

@jmohren jmohren Nov 27, 2024


No, you are right: it is of type AnalyzeResult, which has an as_dict method. I will adjust accordingly. Thanks for pointing that out!

The Azure result function is implemented generically using a type variable (PollingReturnType_co), which lets it serve different use cases. This is why the docstring states Any. In this context, however, the specific polling method set up during initialization fixes the return type to AnalyzeResult, because it deserializes the server response into this strongly typed object.
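
The typing point above can be sketched in a few lines. This is a simplified, dependency-free illustration, not the Azure SDK's actual code: Poller, FakeAnalyzeResult, the Document stand-in and generate_docs_single are all hypothetical names. It shows a generic poller whose result() is typed by a covariant type variable, and a parser that converts the concrete result to a dict before storing it as metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Generic, Iterator, List, TypeVar

# Covariant type variable, playing the role PollingReturnType_co plays in the SDK.
T_co = TypeVar("T_co", covariant=True)


class Poller(Generic[T_co]):
    """Hypothetical generic poller: result() returns whatever type it wraps."""

    def __init__(self, value: T_co) -> None:
        self._value = value

    def result(self) -> T_co:
        return self._value


@dataclass
class FakeAnalyzeResult:
    """Stand-in for AnalyzeResult: an object, not a dict, exposing as_dict()."""

    content: str
    figures: List[Dict[str, Any]] = field(default_factory=list)

    def as_dict(self) -> Dict[str, Any]:
        return {"content": self.content, "figures": list(self.figures)}


@dataclass
class Document:
    """Stand-in for LangChain's Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def generate_docs_single(result: FakeAnalyzeResult) -> Iterator[Document]:
    # Convert the typed result to a plain dict so metadata is always a dict.
    yield Document(page_content=result.content, metadata=result.as_dict())


# The generic parameter is fixed at construction, so result() is concretely typed.
poller: Poller[FakeAnalyzeResult] = Poller(FakeAnalyzeResult(content="# Title"))
doc = next(generate_docs_single(poller.result()))
print(type(doc.metadata).__name__)
```

Passing the result object itself as metadata would hand a non-dict to a field that is expected to hold a dict, which is what the reviewer's question was probing.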

Contributor Author


fixed!

Contributor Author


@ccurme I kindly ask if there is an update in the review process :)

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Dec 10, 2024
@ccurme ccurme merged commit c1d348e into langchain-ai:master Dec 10, 2024
18 checks passed