doc-loader: retain Azure Doc Intelligence API metadata in Document parser #28382
@@ -71,7 +71,7 @@ def _generate_docs_page(self, result: Any) -> Iterator[Document]:
             yield d

     def _generate_docs_single(self, result: Any) -> Iterator[Document]:
-        yield Document(page_content=result.content, metadata={})
+        yield Document(page_content=result.content, metadata=result)
Is result always a dict? I see it typed as Any.
No, you are right: it is of type AnalyzeResult, which has an as_dict method. I will adjust accordingly. Thanks for pointing that out!
The Azure result function is implemented generically using a type variable (PollingReturnType_co), which allows flexibility for different use cases. This is why the docstring states Any. In this context, however, the specific polling method set up during initialization deserializes the server response into a strongly typed object, so the return type is always AnalyzeResult.
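To illustrate the typing point above, here is a minimal, self-contained sketch of a poller that is generic over a covariant type variable. It does not use the Azure SDK: PollerStub, AnalyzeResultStub, and ResultT_co are hypothetical stand-ins, and the real poller's internals may differ. It only shows how a generically typed result() can still return one concrete type once the deserializer is fixed at construction time.

```python
from typing import Any, Callable, Dict, Generic, TypeVar

# Hypothetical stand-in for the SDK's covariant type variable.
ResultT_co = TypeVar("ResultT_co", covariant=True)


class AnalyzeResultStub:
    """Hypothetical stand-in for AnalyzeResult; only content and as_dict are modelled."""

    def __init__(self, content: str) -> None:
        self.content = content

    def as_dict(self) -> Dict[str, Any]:
        return {"content": self.content}


class PollerStub(Generic[ResultT_co]):
    """Generic poller: the deserializer passed in decides the concrete result type."""

    def __init__(
        self,
        deserialize: Callable[[Dict[str, Any]], ResultT_co],
        raw_response: Dict[str, Any],
    ) -> None:
        self._deserialize = deserialize
        self._raw_response = raw_response

    def result(self) -> ResultT_co:
        # Declared generically, but always yields whatever the deserializer builds.
        return self._deserialize(self._raw_response)


poller: PollerStub[AnalyzeResultStub] = PollerStub(
    lambda raw: AnalyzeResultStub(raw["content"]),
    {"content": "# Title"},
)
analysis = poller.result()
```

Under these assumptions, analysis is an AnalyzeResultStub, so calling as_dict() on it is safe even though a loosely annotated caller would only see Any.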
fixed!
@ccurme I kindly ask if there is an update in the review process :)
Description:
This PR modifies the doc_intelligence.py parser in the community package to include all metadata returned by the Azure Doc Intelligence API in the Document object. Previously, only the parsed content (markdown) was retained, while other important metadata such as bounding boxes (bboxes) for images and tables was discarded. These image bboxes are crucial for supporting use cases like multi-modal RAG workflows when using Azure Doc Intelligence.
The change ensures that all information returned by the Azure Doc Intelligence API is preserved by setting the metadata attribute of the Document object to the entire result returned by the API, rather than an empty dictionary. This extends the parser's utility for complex use cases without breaking existing functionality.
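As a rough sketch of the resulting behaviour, the snippet below mirrors _generate_docs_single with stand-in classes so it runs without langchain or the Azure SDK. Document and AnalyzeResultStub here are simplified, hypothetical substitutes, and the keys in the sample payload (figures, boundingRegions, pageNumber, polygon) are illustrative assumptions rather than a guaranteed API schema. It uses as_dict(), following the review discussion, so that metadata is a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Document:
    """Simplified stand-in for the LangChain Document class."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class AnalyzeResultStub:
    """Hypothetical stand-in for AnalyzeResult with content and as_dict only."""

    def __init__(self, content: str, figures: List[Dict[str, Any]]) -> None:
        self.content = content
        self.figures = figures

    def as_dict(self) -> Dict[str, Any]:
        return {"content": self.content, "figures": self.figures}


def generate_docs_single(result: AnalyzeResultStub) -> Iterator[Document]:
    # Keep the full API response as metadata instead of an empty dict.
    yield Document(page_content=result.content, metadata=result.as_dict())


# Illustrative payload; the key names are assumptions for this sketch.
sample = AnalyzeResultStub(
    content="# Report",
    figures=[
        {
            "id": "1.1",
            "boundingRegions": [
                {"pageNumber": 1, "polygon": [0.5, 1.0, 3.0, 1.0, 3.0, 2.5, 0.5, 2.5]}
            ],
        }
    ],
)
doc = next(generate_docs_single(sample))
first_region = doc.metadata["figures"][0]["boundingRegions"][0]
```

A downstream multi-modal step could then read first_region to crop the figure from the page image, which is the use case the description refers to.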
Issue:
This change does not address a specific issue number, but it resolves a critical limitation in supporting multimodal workflows when using the LangChain wrapper for the Azure API.
Dependencies:
No additional dependencies are required for this change.