-
Notifications
You must be signed in to change notification settings - Fork 6.1k
Data Ingestion #49051
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Data Ingestion #49051
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request introduces comprehensive documentation for data ingestion in AI applications, focusing on the Microsoft.Extensions.DataIngestion library and its integration with .NET AI workflows.
Key changes:
- Added a new conceptual article explaining data ingestion fundamentals, architecture, and building blocks
- Updated the table of contents to include the new data ingestion documentation
Reviewed Changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docs/ai/toc.yml | Added "Data ingestion" entry to the conceptual documentation navigation |
| docs/ai/conceptual/data-ingestion.md | New comprehensive article covering data ingestion concepts, architecture, and the Microsoft.Extensions.DataIngestion library components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 19 comments.
5f3bcfa to
6dbac1c
Compare
6dbac1c to
93a7d71
Compare
|
@gewarren the PR is ready for review, PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple spelling mistakes in this image - Ingegrate and Microosft.Extensions.DataIngestion.
| # Data Ingestion | ||
|
|
||
| ## What is data ingestion? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # Data Ingestion | |
| ## What is data ingestion? | |
| # Data ingestion |
|
|
||
| ## What is data ingestion? | ||
|
|
||
| Data ingestion is the process of collecting, reading, and preparing data from different sources such as files, databases, APIs, or cloud services so it can be used in downstream applications. In practice, this follows the familiar Extract-Transform-Load (ETL) workflow: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Data ingestion is the process of collecting, reading, and preparing data from different sources such as files, databases, APIs, or cloud services so it can be used in downstream applications. In practice, this follows the familiar Extract-Transform-Load (ETL) workflow: | |
| Data ingestion is the process of collecting, reading, and preparing data from different sources such as files, databases, APIs, or cloud services so it can be used in downstream applications. In practice, this process follows the Extract-Transform-Load (ETL) workflow: |
|
|
||
| This is where data ingestion becomes critical. You need to extract text from different file formats, break large documents into smaller chunks that fit within AI model limits, enrich the content with metadata, generate embeddings for semantic search, and store everything in a way that enables fast retrieval. Each step requires careful consideration of how to preserve the original meaning and context. | ||
|
|
||
| ## What is Microsoft.Extensions.DataIngestion? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ## What is Microsoft.Extensions.DataIngestion? | |
| ## The Microsoft.Extensions.DataIngestion library |
|
|
||
| ## What is Microsoft.Extensions.DataIngestion? | ||
|
|
||
| [Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| [Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios. | |
| The [📦 Microsoft.Extensions.DataIngestion package](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios. |
|
|
||
| ### Document Writer and Storage | ||
|
|
||
| The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData. | |
| <xref:Microsoft.Extensions.DataIngestion.IngestionChunkWriter`1> stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the <xref:Microsoft.Extensions.DataIngestion.VectorStoreWriter`1> class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData. |
|
|
||
| The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData. | ||
|
|
||
| This includes popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, making them ready for semantic search and retrieval scenarios. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| This includes popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, making them ready for semantic search and retrieval scenarios. | |
| Vectore stores include popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, readying them for semantic search and retrieval scenarios. |
| using VectorStoreWriter<string> writer = new(vectorStore, dimensionCount: 1536); | ||
| ``` | ||
|
|
||
| ### Document Processing Pipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ### Document Processing Pipeline | |
| ### Document processing pipeline |
|
|
||
| ### Document Processing Pipeline | ||
|
|
||
| The `IngestionPipeline<T>` API allows you to chain together the various data ingestion components into a complete workflow. You can combine: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The `IngestionPipeline<T>` API allows you to chain together the various data ingestion components into a complete workflow. You can combine: | |
| The <xref:Microsoft.Extensions.DataIngestion.IngestionPipeline`1> API allows you to chain together the various data ingestion components into a complete workflow. You can combine: |
| } | ||
| ``` | ||
|
|
||
| A single document ingestion failure should not fail the whole pipeline, that is why the `IngestionPipeline.ProcessAsync` implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by re-trying failed documents or stopping on first error). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| A single document ingestion failure should not fail the whole pipeline, that is why the `IngestionPipeline.ProcessAsync` implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by re-trying failed documents or stopping on first error). | |
| A single document ingestion failure shouldn't fail the whole pipeline. That's why <xref:Microsoft.Extensions.DataIngestion.IngestionPipeline`1.ProcessAsync*?displayProperty=nameWithType> implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by retrying failed documents or stopping on first error). |
This pull request introduces a new conceptual article on data ingestion for AI applications and updates the documentation table of contents to include it. The new article explains the fundamentals of data ingestion, its importance for AI and RAG scenarios, and details the architecture and building blocks provided by the
Microsoft.Extensions.DataIngestionlibrary.Documentation Additions and Updates:
data-ingestion.md, covering the definition, importance, and technical foundations of data ingestion, with a focus on .NET AI workflows and theMicrosoft.Extensions.DataIngestionlibrary.toc.yml) to include the new "Data ingestion" article under the AI conceptual section.Internal previews