Commit ef70e3d
add missing samples, update type names and some of the descriptions
1 parent 6b1759d

1 file changed: docs/ai/conceptual/data-ingestion.md (52 additions, 22 deletions)
@@ -39,6 +39,7 @@ With these building blocks, developers can create robust, flexible, and intellig
- **Customizable chunking strategies:** Split documents into chunks using token-based, section-based, or semantic-aware approaches, so you can optimize for your retrieval and analysis needs.
- **Production-ready storage:** Store processed chunks in popular vector databases and document stores, with support for embedding generation, making your pipelines ready for real-world scenarios.
- **End-to-end pipeline composition:** Chain together readers, processors, chunkers, and writers with the `IngestionPipeline` API, reducing boilerplate and making it easy to build, customize, and extend complete workflows.
+- **Performance and scalability:** Designed for scalable data processing, these components can handle large volumes of data efficiently, making them suitable for enterprise-grade applications.

All of these components are open and extensible by design. You can add custom logic, new connectors, and extend the system to support emerging AI scenarios. By standardizing how documents are represented, processed, and stored, .NET developers can build reliable, scalable, and maintainable data pipelines without reinventing the wheel for every project.

@@ -64,59 +65,77 @@ At the foundation of the library is the `IngestionDocument` type, which provides

The `IngestionDocumentReader` abstraction handles loading documents from various sources, whether local files or streams. There are a few readers available:

-- **Azure Document Intelligence**
-- **LlamaParse**
- **[MarkItDown](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion.MarkItDown)**
- **[Markdown](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion.Markdig/)**

-This design means you can work with documents from different sources using the same consistent API, making your code more maintainable and flexible.
+We are actively working on adding more readers, including **LlamaParse** and **Azure Document Intelligence**.

-```csharp
-// TODO: Add code snippet
-```
+This design means you can work with documents from different sources using the same consistent API, making your code more maintainable and flexible.
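
For illustration, loading a local file with one of these readers might look like the following sketch. The `MarkdownReader` type name and the `ReadAsync` call are assumptions made for the example; only the `IngestionDocumentReader` abstraction and the reader packages are named above.

```csharp
// Sketch only: the concrete reader type and the ReadAsync signature are assumptions,
// not APIs confirmed by this commit.
IngestionDocumentReader reader = new MarkdownReader();

// Load a local file into the shared IngestionDocument representation.
IngestionDocument document = await reader.ReadAsync("docs/sample.md");
```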

### Document Processing

-Document processors apply transformations at the document level to enhance and prepare content. The library currently supports:
-
-- **Image processing** to extract text and descriptions from images within documents
-- **Table processing** to preserve tabular data structure and make it searchable
+Document processors apply transformations at the document level to enhance and prepare content. The library currently provides the `ImageAlternativeTextEnricher` class as a built-in processor that uses large language models to generate descriptive alternative text for images within documents.
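
For illustration, this processor could be wired up roughly as follows. The constructor shape is an assumption (it isn't shown here); because the enricher relies on a large language model, it presumably accepts an `IChatClient` from Microsoft.Extensions.AI, created here in the same GitHub Models style as the writer sample further down.

```csharp
// Sketch only: the ImageAlternativeTextEnricher constructor parameters are assumptions.
// The chat client setup mirrors the OpenAIClient pattern used in the writer sample below.
IChatClient chatClient = new OpenAIClient(
        new ApiKeyCredential(Environment.GetEnvironmentVariable("GITHUB_TOKEN")!),
        new OpenAIClientOptions { Endpoint = new Uri("https://models.github.ai/inference") })
    .GetChatClient("gpt-4o")
    .AsIChatClient();

var imageAlternativeTextEnricher = new ImageAlternativeTextEnricher(chatClient);
```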

### Chunks and Chunking Strategies

Once you have a document loaded, you typically need to break it down into smaller pieces called chunks. Chunks represent subsections of a document that can be efficiently processed, stored, and retrieved by AI systems. This chunking process is essential for retrieval-augmented generation scenarios where you need to find the most relevant pieces of information quickly.

The library provides several chunking strategies to fit different use cases:

-- **Token-based chunking** splits text based on token counts, ensuring chunks fit within model limits
-- **Section-based chunking** splits on headers and natural document boundaries
-- **Semantic-aware chunking** preserves complete thoughts and ideas across chunk boundaries
+- **Header-based chunking** to split on headers.
+- **Section-based chunking** to split on sections (for example, pages).
+- **Semantic-aware chunking** to preserve complete thoughts.

These chunking strategies build on the Microsoft.ML.Tokenizers library to intelligently split text into appropriately sized pieces that work well with large language models. The right chunking strategy depends on your document types and how you plan to retrieve information.

```csharp
-// TODO: Add code snippet
+Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
+IngestionChunkerOptions options = new(tokenizer)
+{
+    MaxTokensPerChunk = 2000,
+    OverlapTokens = 0
+};
+IngestionChunker<string> chunker = new HeaderChunker(options);
```

### Chunk Processing and Enrichment

After documents are split into chunks, you can apply processors to enhance and enrich the content. Chunk processors work on individual pieces and can perform:

-- **Content enrichment** including automatic summaries, sentiment analysis, and keyword extraction
-- **PII removal** for privacy-focused content sanitization using large language models
-- **Classification** for automated content categorization based on predefined categories
+- **Content enrichment** including automatic summaries (`SummaryEnricher`), sentiment analysis (`SentimentEnricher`), and keyword extraction (`KeywordEnricher`).
+- **Classification** for automated content categorization based on predefined categories (`ClassificationEnricher`).

-These processors use Microsoft.Extensions.AI to leverage large language models for intelligent content transformation, making your chunks more useful for downstream AI applications.
+These processors use [Microsoft.Extensions.AI.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.AI.Abstractions) to leverage large language models for intelligent content transformation, making your chunks more useful for downstream AI applications.
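
For illustration, constructing a couple of these enrichers might look like the sketch below. The constructor shapes are assumptions rather than confirmed API; because the enrichers delegate to a large language model, an `IChatClient` is presumably required (the `chatClient` from the document-processing sketch above is reused here), and `ClassificationEnricher` would additionally need its predefined categories.

```csharp
// Sketch only: the enricher constructor parameters are assumptions.
// ClassificationEnricher would also take the list of predefined categories.
var summaryEnricher = new SummaryEnricher(chatClient);
var keywordEnricher = new KeywordEnricher(chatClient);
```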

### Document Writer and Storage

-The `DocumentWriter` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and Microsoft.Extensions.VectorData, the library provides abstractions and implementations that support storing chunks in any vector store supported by Microsoft.Extensions.VectorData.
+The `IngestionChunkWriter<T>` stores processed chunks into a data store for later retrieval. Using Microsoft.Extensions.AI and [Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions), the library provides the `VectorStoreWriter<T>` class that supports storing chunks in any vector store supported by Microsoft.Extensions.VectorData.

-This includes popular options like Azure SQL Server, CosmosDB, PostgreSQL, MongoDB, and many others. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, making them ready for semantic search and retrieval scenarios.
+This includes popular options like [Qdrant](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.Qdrant), [SQL Server](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.SqlServer), [CosmosDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.CosmosNoSQL), [MongoDB](https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.MongoDB), [ElasticSearch](https://www.nuget.org/packages/Elastic.SemanticKernel.Connectors.Elasticsearch), and many more. The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, making them ready for semantic search and retrieval scenarios.
+
+```csharp
+OpenAIClient openAIClient = new(
+    new ApiKeyCredential(Environment.GetEnvironmentVariable("GITHUB_TOKEN")!),
+    new OpenAIClientOptions { Endpoint = new Uri("https://models.github.ai/inference") });
+
+IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator =
+    openAIClient.GetEmbeddingClient("text-embedding-3-small").AsIEmbeddingGenerator();
+
+using SqliteVectorStore vectorStore = new(
+    "Data Source=vectors.db;Pooling=false",
+    new()
+    {
+        EmbeddingGenerator = embeddingGenerator
+    });
+
+// The writer requires the embedding dimension count to be specified.
+// For OpenAI's `text-embedding-3-small`, the dimension count is 1536.
+using VectorStoreWriter<string> writer = new(vectorStore, dimensionCount: 1536);
+```

### Document Processing Pipeline

-The `DocumentPipeline` API allows you to chain together the various data ingestion components into a complete workflow. You can combine:
+The `IngestionPipeline<T>` API allows you to chain together the various data ingestion components into a complete workflow. You can combine:

- **Readers** to load documents from various sources
- **Processors** to transform and enrich document content
@@ -126,5 +145,16 @@ The `DocumentPipeline` API allows you to chain together the various data ingesti
This pipeline approach reduces boilerplate code and makes it easy to build, test, and maintain complex data ingestion workflows.

```csharp
-//TODO: Add code snippet
+using IngestionPipeline<string> pipeline = new(reader, chunker, writer, loggerFactory: loggerFactory)
+{
+    DocumentProcessors = { imageAlternativeTextEnricher },
+    ChunkProcessors = { summaryEnricher }
+};
+
+await foreach (var result in pipeline.ProcessAsync(new DirectoryInfo("."), searchPattern: "*.md"))
+{
+    Console.WriteLine($"Completed processing '{result.DocumentId}'. Succeeded: '{result.Succeeded}'.");
+}
```
+
+A single document ingestion failure should not fail the whole pipeline, which is why `IngestionPipeline.ProcessAsync` implements partial success by returning `IAsyncEnumerable<IngestionResult>`. The caller is responsible for handling any failures (for example, by retrying failed documents or stopping on the first error).
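
For illustration, a minimal sketch of one such policy, collecting failed `IngestionResult` instances so the caller can retry or report them afterwards, using only the members shown in the sample above:

```csharp
List<IngestionResult> failures = new();

await foreach (var result in pipeline.ProcessAsync(new DirectoryInfo("."), searchPattern: "*.md"))
{
    if (!result.Succeeded)
    {
        // Record the failed result; alternatively, break here to stop on the first error.
        failures.Add(result);
    }
}

// The caller decides what happens next: log the failures, retry just those documents, or fail fast.
Console.WriteLine($"Failed documents: {failures.Count}");
```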
