feat: Add azure-doc-intelligence integration #2717
Conversation
julian-risch left a comment
A couple of smaller things that I'd like to see changed, but overall it looks quite good to me already.
...lligence/src/haystack_integrations/components/converters/azure_doc_intelligence/converter.py
@@ -0,0 +1,173 @@
# azure-doc-intelligence-haystack
I suggest we keep this README minimal, similar to the READMEs of other integrations, for example OpenSearch. The advantage is that if there are changes to the integration, we don't need to update many different README files. The content you suggest here is better suited for the haystack-integrations repo.
"pytest-rerunfailures",
"mypy",
"pip",
"pandas",
It took me a moment to understand that pandas is here because it's the extra dependency installed by azure-doc-intelligence-haystack[csv]. A short comment here would help document that.
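The requested comment could look like this in the test environment's dependency list (a sketch; the surrounding hatch table name is an assumption based on how these integrations usually configure tests):

```toml
[tool.hatch.envs.test]
dependencies = [
  "pytest-rerunfailures",
  "mypy",
  "pip",
  # pandas is the extra dependency installed by azure-doc-intelligence-haystack[csv];
  # it is listed here so the csv table-format tests can run.
  "pandas",
]
```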
from haystack.components.converters.utils import get_bytestream_from_source, normalize_metadata
from haystack.dataclasses import ByteStream
from haystack.utils import Secret, deserialize_secrets_inplace
from pandas import DataFrame
pandas is optional, right? We probably want to move this to _extract_csv_tables?
If we only support markdown, as I suggest here, we can drop this dependency.
...lligence/src/haystack_integrations/components/converters/azure_doc_intelligence/converter.py
...lligence/src/haystack_integrations/components/converters/azure_doc_intelligence/converter.py
output_format: Literal["text", "markdown"] = "markdown",
table_format: Literal["csv", "markdown"] = "markdown",
For simplicity's sake, I'd actually advocate for only supporting the markdown format for now, since it's a built-in feature of the new SDK and is the easiest for us to integrate with.
The text and csv formats are specific to custom preprocessing functions we have that I'd say only perform okay. I think they could be worth having, but I'd push for adding them in a future PR, especially since I made a better-working version of them here.
Ok, leaving only markdown.
Some other files to be updated:
...lligence/src/haystack_integrations/components/converters/azure_doc_intelligence/converter.py
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
…components/converters/azure_doc_intelligence/converter.py Co-authored-by: Julian Risch <julian.risch@deepset.ai>
.github/labeler.yml
integration:azure-doc-intelligence:
  - changed-files:
      - any-glob-to-any-file: "integrations/azure_doc_intelligence/**/*"
      - any-glob-to-any-file: ".github/workflows/azure_doc_intelligence.yml"
I know @anakin87 likes to keep this file alphabetized :D So could we move this below azure-ai-search?
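The alphabetized result the reviewer asks for might look like the following; the azure-ai-search entry's glob patterns are an assumption for illustration, not copied from the real labeler.yml:

```yaml
# .github/labeler.yml (sketch): entries kept in alphabetical order
integration:azure-ai-search:
  - changed-files:
      - any-glob-to-any-file: "integrations/azure_ai_search/**/*"
integration:azure-doc-intelligence:
  - changed-files:
      - any-glob-to-any-file: "integrations/azure_doc_intelligence/**/*"
      - any-glob-to-any-file: ".github/workflows/azure_doc_intelligence.yml"
```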
self,
endpoint: str,
*,
api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
Should this be DI instead of AI? Or is the API key name the same between the doc intelligence and AI services?
Azure model to use for analysis. Options:
- "prebuilt-read": Fast OCR for text extraction (default)
- "prebuilt-layout": Enhanced layout analysis with better table/structure detection
- "prebuilt-document": General document analysis
If I recall correctly, I found prebuilt-document to perform well and much better than prebuilt-read, so I wonder if we should make this the default even though it's more expensive. WDYT?
self.client = DocumentIntelligenceClient(
    endpoint=endpoint, credential=AzureKeyCredential(api_key.resolve_value() or "")
)
Could we move this connection into the warm_up function? That way it's easier to perform validation on a pipeline using this component (e.g., that all connections work) before needing to supply the API key or make an external API request.
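The pattern being asked for can be sketched as follows: store only the configuration in `__init__`, create the client lazily in `warm_up()`, and have `run()` auto-warm instead of raising (which a later review comment also requests). The tuple stands in for the real `DocumentIntelligenceClient`; method and attribute names beyond `warm_up`/`run` are assumptions.

```python
class AzureDocumentIntelligenceConverter:
    """Sketch of deferred client creation; not the integration's actual code."""

    def __init__(self, endpoint: str, api_key: str):
        self.endpoint = endpoint
        self.api_key = api_key
        self._client = None  # no network client is created at init time

    def warm_up(self) -> None:
        if self._client is None:
            # real code would build DocumentIntelligenceClient(endpoint=...,
            # credential=AzureKeyCredential(...)) here
            self._client = ("client", self.endpoint)

    def run(self, sources: list) -> dict:
        if self._client is None:
            self.warm_up()  # auto-run warm-up instead of throwing an error
        return {"documents": sources}
```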
@vblagoje looking good! Just a few remaining comments
Should be gtg now @sjrl
julian-risch left a comment
Only one last change request: let's make sure the component auto-runs warm_up when necessary instead of throwing an error, similar to the work that happened as part of issue #2592.
Should be gtg now @julian-risch
julian-risch left a comment
Looks good to me! Thanks for addressing the numerous change requests! 👍
very nice seeing this implemented 🙂
Why
Adds a new integration for Azure Document Intelligence (azure-doc-intelligence-haystack), providing a Haystack component that converts documents (PDF, images, Office files) to Haystack Documents using Azure's Document Intelligence service.
Azure deprecated azure-ai-formrecognizer in favor of azure-ai-documentintelligence (v1.0.0, GA Dec 2024); see haystack#8404 on the azure-ai-documentintelligence library. This new integration provides a clean slate with:
What
Added AzureDocumentIntelligenceConverter component:
Usage
Testing
Notes for reviewer