Conversation

@luisquintanilla
Contributor

@luisquintanilla luisquintanilla commented Oct 9, 2025

This pull request introduces a new conceptual article on data ingestion for AI applications and updates the documentation table of contents to include it. The new article explains the fundamentals of data ingestion, its importance for AI and RAG scenarios, and details the architecture and building blocks provided by the Microsoft.Extensions.DataIngestion library.

Documentation Additions and Updates:

  • Added a comprehensive conceptual article, data-ingestion.md, covering the definition, importance, and technical foundations of data ingestion, with a focus on .NET AI workflows and the Microsoft.Extensions.DataIngestion library.
  • Updated the documentation table of contents (toc.yml) to include the new "Data ingestion" article under the AI conceptual section.

Internal previews

| 📄 File | 🔗 Preview link |
|---|---|
| docs/ai/conceptual/data-ingestion.md | Data Ingestion |
| docs/ai/toc.yml | docs/ai/toc |

@luisquintanilla luisquintanilla requested review from a team and gewarren as code owners October 9, 2025 20:19
Copilot AI review requested due to automatic review settings October 9, 2025 20:19
@dotnetrepoman dotnetrepoman bot added this to the October 2025 milestone Oct 9, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This pull request introduces comprehensive documentation for data ingestion in AI applications, focusing on the Microsoft.Extensions.DataIngestion library and its integration with .NET AI workflows.

Key changes:

  • Added a new conceptual article explaining data ingestion fundamentals, architecture, and building blocks
  • Updated the table of contents to include the new data ingestion documentation

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| docs/ai/toc.yml | Added "Data ingestion" entry to the conceptual documentation navigation |
| docs/ai/conceptual/data-ingestion.md | New comprehensive article covering data ingestion concepts, architecture, and the Microsoft.Extensions.DataIngestion library components |

@adamsitnik adamsitnik requested a review from Copilot November 21, 2025 18:01
Copilot finished reviewing on behalf of adamsitnik November 21, 2025 18:04
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 19 comments.

@adamsitnik adamsitnik force-pushed the data-ingestion-extensions branch from 5f3bcfa to 6dbac1c Compare November 21, 2025 18:30
@adamsitnik adamsitnik force-pushed the data-ingestion-extensions branch from 6dbac1c to 93a7d71 Compare November 21, 2025 18:45
@adamsitnik adamsitnik marked this pull request as ready for review November 21, 2025 18:57
@adamsitnik
Member

@gewarren the PR is ready for review, PTAL

Contributor

Couple spelling mistakes in this image - Ingegrate and Microosft.Extensions.DataIngestion.

Comment on lines +11 to +13
# Data Ingestion

## What is data ingestion?
Contributor

Suggested change
# Data Ingestion
## What is data ingestion?
# Data ingestion


## What is data ingestion?

Data ingestion is the process of collecting, reading, and preparing data from different sources such as files, databases, APIs, or cloud services so it can be used in downstream applications. In practice, this follows the familiar Extract-Transform-Load (ETL) workflow:
Contributor

Suggested change
Data ingestion is the process of collecting, reading, and preparing data from different sources such as files, databases, APIs, or cloud services so it can be used in downstream applications. In practice, this follows the familiar Extract-Transform-Load (ETL) workflow:
Data ingestion is the process of collecting, reading, and preparing data from different sources such as files, databases, APIs, or cloud services so it can be used in downstream applications. In practice, this process follows the Extract-Transform-Load (ETL) workflow:


This is where data ingestion becomes critical. You need to extract text from different file formats, break large documents into smaller chunks that fit within AI model limits, enrich the content with metadata, generate embeddings for semantic search, and store everything in a way that enables fast retrieval. Each step requires careful consideration of how to preserve the original meaning and context.
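The steps described above can be sketched end to end. This is an illustrative outline only, not the library's API: `ExtractText`, `SplitIntoChunks`, `EmbedAsync`, and `StoreAsync` are hypothetical placeholder helpers.

```csharp
// Illustrative sketch of a typical ingestion flow. The helper methods
// (ExtractText, SplitIntoChunks, EmbedAsync, StoreAsync) are hypothetical
// placeholders, not part of any specific library.
string text = ExtractText("report.pdf");                              // extract text from the source format
IReadOnlyList<string> chunks = SplitIntoChunks(text, maxTokens: 512); // fit within AI model limits
foreach (string chunk in chunks)
{
    float[] embedding = await EmbedAsync(chunk); // generate an embedding for semantic search
    await StoreAsync(chunk, embedding);          // persist for fast retrieval
}
```

Each step trades off fidelity against retrievability, which is why the article stresses preserving the original meaning and context throughout.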

## What is Microsoft.Extensions.DataIngestion?
Contributor

Suggested change
## What is Microsoft.Extensions.DataIngestion?
## The Microsoft.Extensions.DataIngestion library


## What is Microsoft.Extensions.DataIngestion?

[Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios.
Contributor

Suggested change
[Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios.
The [📦 Microsoft.Extensions.DataIngestion package](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios.


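For readers who want to try the package, it can typically be added with the .NET CLI; if the package is still in preview at the time of reading, the `--prerelease` flag may be required:

```shell
dotnet add package Microsoft.Extensions.DataIngestion --prerelease
```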
### Document Writer and Storage

The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData.
Contributor

Suggested change
The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData.
<xref:Microsoft.Extensions.DataIngestion.IngestionChunkWriter`1> stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the <xref:Microsoft.Extensions.DataIngestion.VectorStoreWriter`1> class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData.


The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData.

This includes popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, making them ready for semantic search and retrieval scenarios.
Contributor

Suggested change
This includes popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, making them ready for semantic search and retrieval scenarios.
Vector stores include popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, readying them for semantic search and retrieval scenarios.

```csharp
using VectorStoreWriter<string> writer = new(vectorStore, dimensionCount: 1536);
```
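A minimal sketch of how such a writer might be used once chunks have been produced. The `chunks` variable and the `WriteAsync` method name are assumptions for illustration; the actual write method on `IngestionChunkWriter<T>` may be named differently.

```csharp
// Hypothetical usage sketch: 'chunks' is a previously produced collection of
// processed chunks, and WriteAsync is an assumed method name for persisting
// them (with generated embeddings) to the configured vector store.
using VectorStoreWriter<string> writer = new(vectorStore, dimensionCount: 1536);
await writer.WriteAsync(chunks);
```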

### Document Processing Pipeline
Contributor

Suggested change
### Document Processing Pipeline
### Document processing pipeline


### Document Processing Pipeline

The `IngestionPipeline<T>` API allows you to chain together the various data ingestion components into a complete workflow. You can combine:
Contributor

Suggested change
The `IngestionPipeline<T>` API allows you to chain together the various data ingestion components into a complete workflow. You can combine:
The <xref:Microsoft.Extensions.DataIngestion.IngestionPipeline`1> API allows you to chain together the various data ingestion components into a complete workflow. You can combine:


A single document ingestion failure should not fail the whole pipeline, that is why the `IngestionPipeline.ProcessAsync` implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by re-trying failed documents or stopping on first error).
Contributor

Suggested change
A single document ingestion failure should not fail the whole pipeline, that is why the `IngestionPipeline.ProcessAsync` implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by re-trying failed documents or stopping on first error).
A single document ingestion failure shouldn't fail the whole pipeline. That's why <xref:Microsoft.Extensions.DataIngestion.IngestionPipeline`1.ProcessAsync*?displayProperty=nameWithType> implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by retrying failed documents or stopping on first error).
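The partial-success pattern described above can be consumed with `await foreach`. The exact shape of `IngestionResult` is not shown in this discussion, so the members read below (`Succeeded`, `DocumentId`, `Exception`) are assumptions for illustration:

```csharp
// Sketch of per-document result handling. IngestionResult's exact members
// (Succeeded, DocumentId, Exception) are assumed here for illustration;
// consult the library's API reference for the real shape.
await foreach (IngestionResult result in pipeline.ProcessAsync(documents))
{
    if (!result.Succeeded)
    {
        // The caller decides the policy: retry the document, log and
        // continue, or stop on the first error.
        Console.WriteLine($"Ingestion failed for {result.DocumentId}: {result.Exception?.Message}");
    }
}
```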

