feat: Add azure-doc-intelligence integration #2717
Conversation
julian-risch left a comment
A couple of smaller things that I'd like to see changed, but overall it looks quite good to me already.
...lligence/src/haystack_integrations/components/converters/azure_doc_intelligence/converter.py
@@ -0,0 +1,173 @@
# azure-doc-intelligence-haystack
I suggest we keep this README minimal, similar to the READMEs of other integrations, for example OpenSearch. The advantage is that if there are changes to the integration, we don't need to update many different README files. The content you suggest here is better suited for the haystack-integrations repo.
"pytest-rerunfailures",
"mypy",
"pip",
"pandas",
It took me a moment to understand that pandas is here because it's the extra dependency installed by azure-doc-intelligence-haystack[csv]. A short comment here would help document that.
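The requested comment could look like this in the test environment's dependency list (a sketch; the surrounding hatch table name is an assumption based on how these integrations usually configure tests):

```toml
[tool.hatch.envs.test]
dependencies = [
  "pytest-rerunfailures",
  "mypy",
  "pip",
  # pandas is the extra dependency installed by azure-doc-intelligence-haystack[csv];
  # it is listed here so the csv table-format tests can run.
  "pandas",
]
```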
from haystack.components.converters.utils import get_bytestream_from_source, normalize_metadata
from haystack.dataclasses import ByteStream
from haystack.utils import Secret, deserialize_secrets_inplace
from pandas import DataFrame
pandas is optional, right? We probably want to move this to _extract_csv_tables?
If we only support markdown, as I suggest here, we can drop this dependency.
...lligence/src/haystack_integrations/components/converters/azure_doc_intelligence/converter.py
...lligence/src/haystack_integrations/components/converters/azure_doc_intelligence/converter.py
output_format: Literal["text", "markdown"] = "markdown",
table_format: Literal["csv", "markdown"] = "markdown",
For simplicity's sake, I'd actually advocate for only supporting the markdown format for now, since it's a built-in feature of the new SDK and is the easiest for us to integrate with.
The text and csv formats are specific to custom preprocessing functions we have that I'd say only perform okay. I think they could be worth having, but I'd push for adding them in a future PR, especially since I made a better-working version of them here.
Ok, leaving only markdown.
Some other files to be updated:
...lligence/src/haystack_integrations/components/converters/azure_doc_intelligence/converter.py
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
…components/converters/azure_doc_intelligence/converter.py Co-authored-by: Julian Risch <julian.risch@deepset.ai>
.github/labeler.yml
integration:azure-doc-intelligence:
  - changed-files:
      - any-glob-to-any-file: "integrations/azure_doc_intelligence/**/*"
      - any-glob-to-any-file: ".github/workflows/azure_doc_intelligence.yml"
I know @anakin87 likes to keep this file alphabetized :D So could we move this below azure-ai-search?
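The alphabetized result the reviewer asks for might look like the following; the azure-ai-search entry's glob patterns are an assumption for illustration, not copied from the real labeler.yml:

```yaml
# .github/labeler.yml (sketch): entries kept in alphabetical order
integration:azure-ai-search:
  - changed-files:
      - any-glob-to-any-file: "integrations/azure_ai_search/**/*"
integration:azure-doc-intelligence:
  - changed-files:
      - any-glob-to-any-file: "integrations/azure_doc_intelligence/**/*"
      - any-glob-to-any-file: ".github/workflows/azure_doc_intelligence.yml"
```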
self,
endpoint: str,
*,
api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
Should this be DI instead of AI? Or is the API key name the same between the doc intelligence and AI services?
Azure model to use for analysis. Options:
- "prebuilt-read": Fast OCR for text extraction (default)
- "prebuilt-layout": Enhanced layout analysis with better table/structure detection
- "prebuilt-document": General document analysis
If I recall correctly, I found prebuilt-document to perform well and much better than prebuilt-read, so I wonder if we should make this the default even though it's more expensive. WDYT?
self.client = DocumentIntelligenceClient(
    endpoint=endpoint, credential=AzureKeyCredential(api_key.resolve_value() or "")
)
Could we move this connection into the warm_up function? That way it's easier to perform validation on a pipeline using this component (e.g., that all connections work) before needing to supply the API key or make an external API request.
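The pattern being asked for can be sketched as follows: store only the configuration in `__init__`, create the client lazily in `warm_up()`, and have `run()` auto-warm instead of raising (which a later review comment also requests). The tuple stands in for the real `DocumentIntelligenceClient`; method and attribute names beyond `warm_up`/`run` are assumptions.

```python
class AzureDocumentIntelligenceConverter:
    """Sketch of deferred client creation; not the integration's actual code."""

    def __init__(self, endpoint: str, api_key: str):
        self.endpoint = endpoint
        self.api_key = api_key
        self._client = None  # no network client is created at init time

    def warm_up(self) -> None:
        if self._client is None:
            # real code would build DocumentIntelligenceClient(endpoint=...,
            # credential=AzureKeyCredential(...)) here
            self._client = ("client", self.endpoint)

    def run(self, sources: list) -> dict:
        if self._client is None:
            self.warm_up()  # auto-run warm-up instead of throwing an error
        return {"documents": sources}
```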
@vblagoje looking good! Just a few remaining comments
Should be gtg now @sjrl
julian-risch left a comment
Only one last change request: let's make sure the component auto-runs warm_up when necessary instead of throwing an error, similar to the work that happened as part of issue #2592.
Should be gtg now @julian-risch
julian-risch left a comment
Looks good to me! Thanks for addressing the numerous change requests! 👍
very nice seeing this implemented 🙂
Why
Adds a new integration for Azure Document Intelligence (azure-doc-intelligence-haystack), providing a Haystack component that converts documents (PDF, images, Office files) to Haystack Documents using Azure's Document Intelligence service.
Azure deprecated azure-ai-formrecognizer in favor of azure-ai-documentintelligence (v1.0.0, GA Dec 2024); see haystack#8404 on the azure-ai-documentintelligence library. This new integration provides a clean slate with:
What
Added AzureDocumentIntelligenceConverter component:
Usage
Testing
Notes for reviewer