@vblagoje
Member

Why

Adds a new integration for Azure Document Intelligence (azure-doc-intelligence-haystack), providing a Haystack component that converts documents (PDF, images, Office files) to Haystack Documents using Azure's Document Intelligence service.

Azure deprecated azure-ai-formrecognizer in favor of azure-ai-documentintelligence (v1.0.0, GA Dec 2024). This new integration provides a clean slate with:

  • Markdown output format (GitHub Flavored Markdown): Better suited for RAG/LLM applications - tables inline with context, preserved document structure (headings, lists), no manual assembly required
  • Modern API: Uses the 2024-11-30 API version with improved table and structure detection
  • Simplified API: Removed deprecated parameters and streamlined the interface

What

Added AzureDocumentIntelligenceConverter component:

  • Uses azure-ai-documentintelligence>=1.0.0 package
  • Markdown output mode (default): Single document with inline tables and preserved structure
  • Text output mode (backward compatibility): Separate CSV table documents or markdown tables
  • Multiple model support: prebuilt-read (fast OCR), prebuilt-layout (enhanced structure), prebuilt-document (general), or custom models

Usage

  import os
  from haystack_integrations.components.converters.azure_doc_intelligence import (
      AzureDocumentIntelligenceConverter,
  )
  from haystack.utils import Secret

  # Markdown mode (recommended for RAG)
  converter = AzureDocumentIntelligenceConverter(
      endpoint=os.environ["AZURE_DI_ENDPOINT"],
      api_key=Secret.from_env_var("AZURE_AI_API_KEY"),
      output_format="markdown"
  )
  results = converter.run(sources=["invoice.pdf"])
  # Returns single document with markdown, tables inline

  # Text mode with CSV tables (backward compatibility)
  converter = AzureDocumentIntelligenceConverter(
      endpoint=os.environ["AZURE_DI_ENDPOINT"],
      api_key=Secret.from_env_var("AZURE_AI_API_KEY"),
      output_format="text",
      table_format="csv"
  )
  # Returns separate CSV table documents + text document

Testing

  • 3 unit tests (init, to_dict, from_dict)
  • 4 integration tests with real Azure API (markdown output, text+CSV tables, metadata handling, multiple files)

Notes for reviewer

  • Package follows the standard haystack-core-integrations structure
  • Includes optional [csv] extra for tabulate dependency
  • CI workflow added: .github/workflows/azure_doc_intelligence.yml
  • Integration added to root README.md inventory table

@github-actions github-actions bot added topic:CI type:documentation Improvements or additions to documentation labels Jan 12, 2026
@vblagoje vblagoje marked this pull request as ready for review January 12, 2026 15:03
@vblagoje vblagoje requested a review from a team as a code owner January 12, 2026 15:03
@vblagoje vblagoje requested review from julian-risch and sjrl and removed request for a team January 12, 2026 15:03
Member

@julian-risch julian-risch left a comment

A couple of smaller things that I'd like to see changed, but overall this looks quite good to me already.

@@ -0,0 +1,173 @@
# azure-doc-intelligence-haystack
Member

I suggest we keep this README minimal, similar to the READMEs of other integrations, for example OpenSearch. The advantage is that if the integration changes, we don't need to update many different README files. The content you suggest here is better suited for the haystack-integrations repo.

"pytest-rerunfailures",
"mypy",
"pip",
"pandas",
Member

It took me a moment to understand that pandas is here because it's the extra dependency installed by azure-doc-intelligence-haystack[csv]. A short comment here would help to document that.

from haystack.components.converters.utils import get_bytestream_from_source, normalize_metadata
from haystack.dataclasses import ByteStream
from haystack.utils import Secret, deserialize_secrets_inplace
from pandas import DataFrame
Member

pandas is optional, right? We probably want to move this to _extract_csv_tables?

Contributor

If we only support markdown as I suggest here we can drop this dependency

Comment on lines 91 to 92
output_format: Literal["text", "markdown"] = "markdown",
table_format: Literal["csv", "markdown"] = "markdown",
Contributor

For simplicity's sake, I'd actually advocate for only supporting the markdown format for now, since it's a new built-in feature of the new SDK and is the easiest for us to integrate with.

The text and csv formats are specific to custom preprocessing functions we have that I'd say only perform okay. I think they could be worth having, but I'd push for adding them into a future PR. Especially since I made a better working version of them here

Member Author

Ok leaving only markdown

@sjrl
Contributor

sjrl commented Jan 14, 2026

Some other files to be updated:

vblagoje and others added 8 commits January 15, 2026 10:59
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
…components/converters/azure_doc_intelligence/converter.py

@vblagoje vblagoje requested review from julian-risch and sjrl January 15, 2026 15:29
Comment on lines 27 to 30
integration:azure-doc-intelligence:
  - changed-files:
      - any-glob-to-any-file: "integrations/azure_doc_intelligence/**/*"
      - any-glob-to-any-file: ".github/workflows/azure_doc_intelligence.yml"
Contributor

I know @anakin87 likes to keep this file alphabetized :D So could we move this below azure-ai-search?

self,
endpoint: str,
*,
api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
Contributor

Should this be DI instead of AI? Or is the API key name the same between the doc intelligence and AI services?

Azure model to use for analysis. Options:
- "prebuilt-read": Fast OCR for text extraction (default)
- "prebuilt-layout": Enhanced layout analysis with better table/structure detection
- "prebuilt-document": General document analysis
Contributor

If I recall correctly, I found prebuilt-document to perform well and much better than prebuilt-read, so I wonder if we should make it the default even though it's more expensive. WDYT?

Comment on lines 90 to 92
self.client = DocumentIntelligenceClient(
    endpoint=endpoint, credential=AzureKeyCredential(api_key.resolve_value() or "")
)
Contributor

Could we move this connection into the warm_up function? This way it's easier to perform validation on a pipeline using this component (e.g. all connections work, etc.) before needing to supply the api key or make an external API request.
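
A minimal sketch of what this request amounts to, assuming the component keeps only configuration in `__init__` and builds the client in `warm_up`. `FakeClient` and `LazyConverter` are hypothetical stand-ins so the snippet runs without Azure or Haystack installed; the real component would construct a `DocumentIntelligenceClient` and resolve a `Secret` instead.

```python
from typing import Callable, Optional


class FakeClient:
    """Stand-in for DocumentIntelligenceClient so the sketch runs without Azure."""

    def __init__(self, endpoint: str, key: str) -> None:
        self.endpoint = endpoint
        self.key = key


class LazyConverter:
    """Hypothetical component: __init__ stores config only, warm_up() connects."""

    def __init__(self, endpoint: str, resolve_key: Callable[[], Optional[str]]) -> None:
        self.endpoint = endpoint
        # In the real component this role is played by api_key.resolve_value.
        self._resolve_key = resolve_key
        # No client and no secret lookup at construction time.
        self.client: Optional[FakeClient] = None

    def warm_up(self) -> None:
        """Create the client once; repeated calls are no-ops."""
        if self.client is None:
            self.client = FakeClient(self.endpoint, self._resolve_key() or "")
```

Because the key is resolved only inside `warm_up`, a pipeline containing the component can be built and validated in an environment where the API key is not yet available.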

@sjrl
Contributor

sjrl commented Jan 16, 2026

@vblagoje looking good! Just a few remaining comments

@vblagoje
Member Author

> @vblagoje looking good! Just a few remaining comments

Fair points @sjrl - will finish this up today!

@vblagoje vblagoje requested a review from sjrl January 19, 2026 09:53
@vblagoje
Member Author

Should be gtg now @sjrl

Member

@julian-risch julian-risch left a comment

Only one last change request: let's make sure the component auto-runs warm up when necessary instead of throwing an error. Similar to the work that happened as part of this issue #2592
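
As a rough illustration of the requested behavior (not the merged implementation), `run` can check whether the client exists and call `warm_up` itself. `AutoWarmUpConverter`, the `warm_up_calls` counter, and the placeholder output are all hypothetical and exist only to make the pattern observable.

```python
from typing import Any, Optional


class AutoWarmUpConverter:
    """Hypothetical component whose run() warms up on demand."""

    def __init__(self) -> None:
        self.client: Optional[object] = None
        self.warm_up_calls = 0  # only here to make the behavior observable

    def warm_up(self) -> None:
        """Create the client on first use; later calls do nothing."""
        if self.client is None:
            self.client = object()  # placeholder for the real Azure client
            self.warm_up_calls += 1

    def run(self, sources: list[str]) -> dict[str, Any]:
        if self.client is None:
            # Warm up lazily instead of raising an error to the caller.
            self.warm_up()
        # Placeholder for the actual analysis call made through self.client.
        return {"documents": [f"converted:{source}" for source in sources]}
```

Calling `run` directly therefore works for standalone use, while pipelines that call `warm_up` explicitly still initialize the client only once.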

@vblagoje
Member Author

> Only one last change request: let's make sure the component auto-runs warm up when necessary instead of throwing an error. Similar to the work that happened as part of this issue

Should be gtg now @julian-risch

Member

@julian-risch julian-risch left a comment

Looks good to me! Thanks for addressing the numerous change requests! 👍

@vblagoje vblagoje merged commit 14f9ae0 into main Jan 19, 2026
9 checks passed
@vblagoje vblagoje deleted the azure_di branch January 19, 2026 11:47
@vblagoje
Member Author

Released as https://pypi.org/project/azure-doc-intelligence-haystack/

@nils-hde

very nice seeing this implemented 🙂

Development

Successfully merging this pull request may close these issues.

feat: Add a Azure OCR Converter that uses the azure-ai-documentintelligence library

5 participants