Data Ingestion #49051

luisquintanilla · 2025-10-09T20:19:43Z

This pull request introduces a new conceptual article on data ingestion for AI applications and updates the documentation table of contents to include it. The new article explains the fundamentals of data ingestion, its importance for AI and RAG scenarios, and details the architecture and building blocks provided by the Microsoft.Extensions.DataIngestion library.

Documentation Additions and Updates:

- Added a comprehensive conceptual article, data-ingestion.md, covering the definition, importance, and technical foundations of data ingestion, with a focus on .NET AI workflows and the Microsoft.Extensions.DataIngestion library.
- Updated the documentation table of contents (toc.yml) to include the new "Data ingestion" article under the AI conceptual section.

Internal previews

| 📄 File | 🔗 Preview link |
|---|---|
| docs/ai/conceptual/data-ingestion.md | Data Ingestion |
| docs/ai/toc.yml | docs/ai/toc |

Copilot

Pull Request Overview

This pull request introduces comprehensive documentation for data ingestion in AI applications, focusing on the Microsoft.Extensions.DataIngestion library and its integration with .NET AI workflows.

Key changes:

- Added a new conceptual article explaining data ingestion fundamentals, architecture, and building blocks
- Updated the table of contents to include the new data ingestion documentation

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| docs/ai/toc.yml | Added "Data ingestion" entry to the conceptual documentation navigation |
| docs/ai/conceptual/data-ingestion.md | New comprehensive article covering data ingestion concepts, architecture, and the Microsoft.Extensions.DataIngestion library components |

docs/ai/conceptual/data-ingestion.md

Copilot

Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 19 comments.

docs/ai/conceptual/data-ingestion.md

adamsitnik · 2025-11-21T18:58:20Z

@gewarren the PR is ready for review, PTAL

gewarren · 2025-11-24T17:32:13Z

docs/ai/media/data-ingestion/DataIngestion.png

There are a couple of spelling mistakes in this image: "Ingegrate" and "Microosft.Extensions.DataIngestion".

gewarren · 2025-11-24T17:33:01Z

docs/ai/conceptual/data-ingestion.md

+# Data Ingestion
+
+## What is data ingestion?


Suggested change

# Data Ingestion

## What is data ingestion?

# Data ingestion

gewarren · 2025-11-24T17:33:50Z

docs/ai/conceptual/data-ingestion.md

+
+## What is data ingestion?
+
+Data ingestion is the process of collecting, reading, and preparing data from different sources such as files, databases, APIs, or cloud services so it can be used in downstream applications. In practice, this follows the familiar Extract-Transform-Load (ETL) workflow:


Suggested change

Data ingestion is the process of collecting, reading, and preparing data from different sources such as files, databases, APIs, or cloud services so it can be used in downstream applications. In practice, this follows the familiar Extract-Transform-Load (ETL) workflow:

Data ingestion is the process of collecting, reading, and preparing data from different sources such as files, databases, APIs, or cloud services so it can be used in downstream applications. In practice, this process follows the Extract-Transform-Load (ETL) workflow:

gewarren · 2025-11-24T17:35:46Z

docs/ai/conceptual/data-ingestion.md

+
+This is where data ingestion becomes critical. You need to extract text from different file formats, break large documents into smaller chunks that fit within AI model limits, enrich the content with metadata, generate embeddings for semantic search, and store everything in a way that enables fast retrieval. Each step requires careful consideration of how to preserve the original meaning and context.
+
+## What is Microsoft.Extensions.DataIngestion?


Suggested change

## What is Microsoft.Extensions.DataIngestion?

## The Microsoft.Extensions.DataIngestion library

gewarren · 2025-11-24T17:36:19Z

docs/ai/conceptual/data-ingestion.md

+
+## What is Microsoft.Extensions.DataIngestion?
+
+[Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios.


Suggested change

[Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios.

The [📦 Microsoft.Extensions.DataIngestion package](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios.

gewarren · 2025-11-24T17:51:16Z

docs/ai/conceptual/data-ingestion.md

+
+### Document Writer and Storage
+
+The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData.


Suggested change

The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData.

<xref:Microsoft.Extensions.DataIngestion.IngestionChunkWriter`1> stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the <xref:Microsoft.Extensions.DataIngestion.VectorStoreWriter`1> class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData.

gewarren · 2025-11-24T17:51:47Z

docs/ai/conceptual/data-ingestion.md

+
+The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData.
+
+This includes popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, making them ready for semantic search and retrieval scenarios.


Suggested change

This includes popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, making them ready for semantic search and retrieval scenarios.

Vector stores include popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, readying them for semantic search and retrieval scenarios.

gewarren · 2025-11-24T17:52:31Z

docs/ai/conceptual/data-ingestion.md

+using VectorStoreWriter<string> writer = new(vectorStore, dimensionCount: 1536);
+```
+
+### Document Processing Pipeline


Suggested change

### Document Processing Pipeline

### Document processing pipeline

gewarren · 2025-11-24T17:53:54Z

docs/ai/conceptual/data-ingestion.md

+
+### Document Processing Pipeline
+
+The `IngestionPipeline<T>` API allows you to chain together the various data ingestion components into a complete workflow. You can combine:


Suggested change

The `IngestionPipeline<T>` API allows you to chain together the various data ingestion components into a complete workflow. You can combine:

The <xref:Microsoft.Extensions.DataIngestion.IngestionPipeline`1> API allows you to chain together the various data ingestion components into a complete workflow. You can combine:

gewarren · 2025-11-24T17:57:17Z

docs/ai/conceptual/data-ingestion.md

+}
+```
+
+A single document ingestion failure should not fail the whole pipeline, that is why the `IngestionPipeline.ProcessAsync` implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by re-trying failed documents or stopping on first error).


Suggested change

A single document ingestion failure should not fail the whole pipeline, that is why the `IngestionPipeline.ProcessAsync` implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by re-trying failed documents or stopping on first error).

A single document ingestion failure shouldn't fail the whole pipeline. That's why <xref:Microsoft.Extensions.DataIngestion.IngestionPipeline`1.ProcessAsync*?displayProperty=nameWithType> implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by retrying failed documents or stopping on first error).
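
As a rough illustration of the partial-success behavior this suggestion describes, the following C# sketch streams one result per document and leaves failure handling to the caller. It does not use the Microsoft.Extensions.DataIngestion API: `ProcessAllAsync`, `IngestOneAsync`, the tuple result shape, and the file names are hypothetical stand-ins built only on the base class library.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical input: the ".bin" file stands in for a document that fails to ingest.
string[] documents = { "intro.md", "broken.bin", "guide.md" };

var succeeded = new List<string>();
var failed = new List<string>();

// The caller decides what a failure means: here it records the failure and continues.
await foreach (var result in ProcessAllAsync(documents))
{
    if (result.Error is null)
    {
        succeeded.Add(result.DocumentId);
    }
    else
    {
        failed.Add(result.DocumentId);
        Console.WriteLine($"Failed: {result.DocumentId} ({result.Error.Message})");
    }
}

Console.WriteLine($"Succeeded: {succeeded.Count}, failed: {failed.Count}");

// Hypothetical stand-in for a pipeline: yields one result per document
// instead of throwing on the first failure.
static async IAsyncEnumerable<(string DocumentId, Exception? Error)> ProcessAllAsync(
    IEnumerable<string> documentIds)
{
    foreach (string documentId in documentIds)
    {
        Exception? error = null;
        try
        {
            await IngestOneAsync(documentId);
        }
        catch (Exception ex)
        {
            // Capture the failure here; C# doesn't allow 'yield return'
            // inside a try block that has a catch clause.
            error = ex;
        }

        yield return (documentId, error);
    }
}

// Hypothetical single-document step: simulates a reader that rejects an unknown format.
static async Task IngestOneAsync(string documentId)
{
    await Task.Yield();
    if (documentId.EndsWith(".bin", StringComparison.OrdinalIgnoreCase))
    {
        throw new NotSupportedException($"Unsupported format: {documentId}");
    }
}
```

With the three sample names, the loop records two successes and one failure. Stopping on the first error would be a `break` in the failure branch, and retrying would mean passing the collected `failed` names to another `ProcessAllAsync` call.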

Initial commit

a0b3741

luisquintanilla requested review from a team and gewarren as code owners October 9, 2025 20:19

Copilot AI review requested due to automatic review settings October 9, 2025 20:19

dotnetrepoman bot added this to the October 2025 milestone Oct 9, 2025

Copilot AI reviewed Oct 9, 2025

luisquintanilla marked this pull request as draft October 10, 2025 14:29

gewarren mentioned this pull request Oct 14, 2025

Update data ingestion for web chat app #48867

Open

BillWagner modified the milestones: October 2025, November 2025 Nov 3, 2025

adamsitnik added 2 commits November 20, 2025 20:01

first part of the doc update

6b1759d

add missing samples, update type names and some of the descriptions

ef70e3d

adamsitnik requested a review from Copilot November 21, 2025 18:01

Copilot started reviewing on behalf of adamsitnik November 21, 2025 18:01 View session

Copilot finished reviewing on behalf of adamsitnik November 21, 2025 18:04

Copilot AI reviewed Nov 21, 2025

adamsitnik force-pushed the data-ingestion-extensions branch from 5f3bcfa to 6dbac1c Compare November 21, 2025 18:30

address Copilot feedback

93a7d71

adamsitnik force-pushed the data-ingestion-extensions branch from 6dbac1c to 93a7d71 Compare November 21, 2025 18:45

adamsitnik marked this pull request as ready for review November 21, 2025 18:57

gewarren approved these changes Nov 24, 2025
