@vblagoje vblagoje commented Jan 8, 2026

Why

Azure deprecated azure-ai-formrecognizer in favor of azure-ai-documentintelligence (v1.0.0, GA Dec 2024). The new package supports Markdown output (GitHub Flavored Markdown), which is better suited to RAG/LLM applications: tables appear inline with their surrounding context, document structure (headings, lists) is preserved, and no manual assembly is required.

What

Added AzureDocumentIntelligenceConverter component:

  • Uses azure-ai-documentintelligence>=1.0.0 package (2024-11-30 API)
  • Markdown output mode (default): single document with inline tables, preserved structure
  • Text output mode (backward compat): separate CSV table documents or markdown tables
  • Simplified API: removed page_layout, threshold_y, preceding_context_len, following_context_len, merge_multiple_column_headers
  • Added output_format (markdown/text), table_format (csv/markdown)
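
To make the two table_format options concrete, here is a minimal sketch of how the same extracted table could look as CSV and as a GitHub Flavored Markdown table. The helpers table_to_csv and table_to_markdown are hypothetical names used only for illustration; they are not the component's actual implementation, and the sample rows are made up.

```python
import csv
import io


def table_to_csv(rows):
    """Render a list of rows (each a list of cell strings) as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def table_to_markdown(rows):
    """Render the same rows as a GFM table; the first row is the header.

    Assumption: cells contain no "|" characters, so no escaping is done.
    """
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Illustrative data only, not output from the Azure API.
rows = [["Item", "Qty"], ["Widget", "2"]]
print(table_to_csv(rows))
print(table_to_markdown(rows))
```

The CSV form suits a standalone table document, while the Markdown form can sit inline next to the surrounding text.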

Deprecated AzureOCRDocumentConverter (removal in Haystack 2.25)

How can it be used

  import os

  from haystack.components.converters import AzureDocumentIntelligenceConverter
  from haystack.utils import Secret

  # Markdown mode (recommended for RAG)
  converter = AzureDocumentIntelligenceConverter(
      endpoint=os.environ["AZURE_DI_ENDPOINT"],
      api_key=Secret.from_env_var("AZURE_AI_API_KEY"),
      output_format="markdown"
  )
  results = converter.run(sources=["invoice.pdf"])
  # Returns single document with markdown, tables inline

  # Text mode (backward compat)
  converter = AzureDocumentIntelligenceConverter(
      endpoint=os.environ["AZURE_DI_ENDPOINT"],
      api_key=Secret.from_env_var("AZURE_AI_API_KEY"),
      output_format="text",
      table_format="csv"
  )
  results = converter.run(sources=["invoice.pdf"])
  # Returns separate CSV table documents + text document

How did you test it

  • 3 unit tests (init, to_dict, from_dict)
  • 4 integration tests with real Azure API (markdown output, text+CSV tables, metadata handling, multiple files)

Notes for the reviewer

Migration path from old converter:

  • page_layout="natural" → output_format="markdown"
  • Remove context/layout params (Azure API handles this now)
  • Tables inline in markdown mode vs separate CSV docs
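
The migration steps above could be sketched as a small helper that rewrites a dictionary of old init arguments. This is only an illustration: migrate_init_kwargs is a hypothetical function, not part of Haystack, and it assumes the only translation needed is the page_layout="natural" mapping listed above, with the removed parameters simply dropped.

```python
# Parameters the PR description lists as removed from the new converter.
REMOVED_PARAMS = {
    "page_layout",
    "threshold_y",
    "preceding_context_len",
    "following_context_len",
    "merge_multiple_column_headers",
}


def migrate_init_kwargs(old_kwargs):
    """Hypothetical helper: translate old converter kwargs to new ones."""
    # Drop everything the new component no longer accepts.
    new_kwargs = {k: v for k, v in old_kwargs.items() if k not in REMOVED_PARAMS}
    # Only the "natural" mapping is stated in the PR; other page_layout
    # values are dropped, leaving output_format at the component default.
    if old_kwargs.get("page_layout") == "natural":
        new_kwargs.setdefault("output_format", "markdown")
    return new_kwargs


# Illustrative values only; the endpoint is a placeholder.
old = {
    "endpoint": "https://example.invalid",
    "page_layout": "natural",
    "threshold_y": 0.05,
}
print(migrate_init_kwargs(old))
```

Pipelines that consumed the separate CSV table documents still need a manual check, since markdown mode returns tables inline in a single document.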

anakin87 commented Jan 9, 2026

@vblagoje @sjrl could this be a good opportunity to move this component to core-integrations?
Just an idea, curious to hear your opinions.

sjrl commented Jan 9, 2026

Yeah, I think this would be a good time to move this to core integrations.

See this comment #8404 (comment) from the issue thread

@vblagoje

Superseded by deepset-ai/haystack-core-integrations#2717. Closing.

@vblagoje vblagoje closed this Jan 12, 2026
Development

Successfully merging this pull request may close these issues.

feat: Add a Azure OCR Converter that uses the azure-ai-documentintelligence library
