docs/ai/conceptual/data-ingestion.md

With these building blocks, developers can create robust, flexible, and intelligent data pipelines:

- **Customizable chunking strategies:** Split documents into chunks using token-based, section-based, or semantic-aware approaches, so you can optimize for your retrieval and analysis needs.
- **Production-ready storage:** Store processed chunks in popular vector databases and document stores, with support for embedding generation, making your pipelines ready for real-world scenarios.
- **End-to-end pipeline composition:** Chain together readers, processors, chunkers, and writers with the `IngestionPipeline` API, reducing boilerplate and making it easy to build, customize, and extend complete workflows.
- **Performance and scalability:** Designed for scalable data processing, these components can handle large volumes of data efficiently, making them suitable for enterprise-grade applications.

All of these components are open and extensible by design. You can add custom logic, new connectors, and extend the system to support emerging AI scenarios. By standardizing how documents are represented, processed, and stored, .NET developers can build reliable, scalable, and maintainable data pipelines without reinventing the wheel for every project.

At the foundation of the library is the `IngestionDocument` type.

The `IngestionDocumentReader` abstraction handles loading documents from various sources, whether local files or streams. A few readers are available today, and we are actively working on adding more (including **LlamaParse** and **Azure Document Intelligence**).

This design means you can work with documents from different sources using the same consistent API, making your code more maintainable and flexible.

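
The following minimal sketch shows the shape this API takes. The `MarkdownReader` type name and the `ReadAsync(path)` signature are assumptions for illustration only; consult the package's API reference for the actual reader types and overloads.

```csharp
using Microsoft.Extensions.DataIngestion; // namespace assumed for illustration

// "MarkdownReader" is a hypothetical implementation name; substitute
// whichever IngestionDocumentReader fits your source format.
IngestionDocumentReader reader = new MarkdownReader();

// The same abstraction works for file paths and streams alike,
// always producing an IngestionDocument.
IngestionDocument document = await reader.ReadAsync("docs/manual.md");
```

Because every reader returns the same `IngestionDocument` model, swapping the source format means swapping only the reader instance.
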
### Document Processing
Document processors apply transformations at the document level to enhance and prepare content. The library currently provides the `ImageAlternativeTextEnricher` class as a built-in processor that uses large language models to generate descriptive alternative text for images within documents.
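
A sketch of how such a processor might be wired up is shown below. The constructor and `ProcessAsync` shapes are assumptions for illustration; only the `ImageAlternativeTextEnricher` class name comes from the documentation above.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DataIngestion; // namespace assumed for illustration

// Any IChatClient implementation (Azure OpenAI, Ollama, ...) can back the enricher.
IChatClient chatClient = GetChatClient(); // placeholder for your own client setup

// Constructor and method shapes are illustrative assumptions.
var imageEnricher = new ImageAlternativeTextEnricher(chatClient);

// The enricher walks the document and attaches descriptive
// alternative text to the images it finds.
IngestionDocument enriched = await imageEnricher.ProcessAsync(document);
```
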
### Chunks and Chunking Strategies
Once you have a document loaded, you typically need to break it down into smaller pieces called chunks. Chunks represent subsections of a document that can be efficiently processed, stored, and retrieved by AI systems. This chunking process is essential for retrieval-augmented generation scenarios where you need to find the most relevant pieces of information quickly.
The library provides several chunking strategies to fit different use cases:

- **Header-based chunking** to split on headers.
- **Section-based chunking** to split on sections (example: pages).

These chunking strategies build on the Microsoft.ML.Tokenizers library to intelligently split text into appropriately sized pieces that work well with large language models. The right chunking strategy depends on your document types and how you plan to retrieve information.
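
As a sketch, header-based chunking bounded by a token budget might look like the following. The `HeaderChunker` name, its constructor parameters, and the `IngestionChunk` type are hypothetical; the `TiktokenTokenizer.CreateForModel` call is real Microsoft.ML.Tokenizers API.

```csharp
using Microsoft.Extensions.DataIngestion; // namespace assumed for illustration
using Microsoft.ML.Tokenizers;

// The tokenizer lets the chunker measure chunk sizes in model tokens.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

// "HeaderChunker" and its parameters are illustrative assumptions:
// split on markdown headers, keeping each chunk within the token budget.
var chunker = new HeaderChunker(tokenizer, maxTokensPerChunk: 512);

IReadOnlyList<IngestionChunk> chunks = await chunker.ProcessAsync(document);
```
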

After documents are split into chunks, you can apply processors to enhance and enrich the content. Chunk processors work on individual pieces and can perform:

- **Content enrichment** including automatic summaries (`SummaryEnricher`), sentiment analysis (`SentimentEnricher`), and keyword extraction (`KeywordEnricher`).
- **Classification** for automated content categorization based on predefined categories (`ClassificationEnricher`).

These processors use [Microsoft.Extensions.AI.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.AI.Abstractions) to leverage large language models for intelligent content transformation, making your chunks more useful for downstream AI applications.

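
Constructing these enrichers might look like the sketch below. The constructor shapes (and the `GetChatClient` helper) are assumptions for illustration; only the enricher class names come from the documentation above.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DataIngestion; // namespace assumed for illustration

// A single chat client can back all of the LLM-based enrichers.
IChatClient chatClient = GetChatClient(); // placeholder for your own client setup

// Constructor shapes are illustrative assumptions.
var summarizer = new SummaryEnricher(chatClient);
var keywords   = new KeywordEnricher(chatClient);

// Classification requires the set of categories to choose from.
var classifier = new ClassificationEnricher(
    chatClient, categories: ["billing", "support", "sales"]);
```

Each enricher adds metadata to the chunks it processes, so downstream retrieval can filter or rank on summaries, keywords, or categories.
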
### Document Writer and Storage
The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData.
This includes popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, making them ready for semantic search and retrieval scenarios.
A single document ingestion failure should not fail the whole pipeline. That's why `IngestionPipeline.ProcessAsync` implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by retrying failed documents or stopping on the first error).
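
A sketch of consuming those results follows. The pipeline constructor wiring and the `IngestionResult` member names (`Succeeded`, `DocumentId`) are assumptions for illustration; only the `ProcessAsync` return type of `IAsyncEnumerable<IngestionResult>` comes from the documentation above.

```csharp
using Microsoft.Extensions.DataIngestion; // namespace assumed for illustration

// Component wiring is illustrative: a reader, a chunker, and a writer
// composed into one end-to-end pipeline.
var pipeline = new IngestionPipeline(reader, chunker, writer);

await foreach (IngestionResult result in pipeline.ProcessAsync(documentPaths))
{
    // One failed document does not stop the rest; inspect each result.
    if (!result.Succeeded) // member names assumed
    {
        Console.WriteLine($"Ingestion failed for {result.DocumentId}");
    }
}
```

Streaming results this way lets callers log, retry, or abort per document without waiting for the entire batch to finish.
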