Commit 6b1759d

committed
first part of the doc update
1 parent a0b3741 commit 6b1759d

1 file changed, +10 -10 lines changed

docs/ai/conceptual/data-ingestion.md

Lines changed: 10 additions & 10 deletions
@@ -29,24 +29,24 @@ This is where data ingestion becomes critical. You need to extract text from dif

 ## What is Microsoft.Extensions.DataIngestion?

-Microsoft.Extensions.DataIngestion provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios.
+[Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios.

 With these building blocks, developers can create robust, flexible, and intelligent data ingestion pipelines tailored for their application needs:

 - **Unified document representation:** Represent any file type (PDF, Image, Microsoft Word, etc.) in a consistent format that works well with large language models.
 - **Flexible data ingestion:** Read documents from both cloud services and local sources using multiple built-in readers, making it easy to bring in data from wherever it lives.
-- **Built-in AI enhancements:** Automatically enrich content with summaries, sentiment analysis, keyword extraction, privacy-focused PII removal, and classification, preparing your data for intelligent workflows.
+- **Built-in AI enhancements:** Automatically enrich content with summaries, sentiment analysis, keyword extraction and classification, preparing your data for intelligent workflows.
 - **Customizable chunking strategies:** Split documents into chunks using token-based, section-based, or semantic-aware approaches, so you can optimize for your retrieval and analysis needs.
 - **Production-ready storage:** Store processed chunks in popular vector databases and document stores, with support for embedding generation, making your pipelines ready for real-world scenarios.
-- **End-to-end pipeline composition:** Chain together readers, processors, chunkers, and writers with the DocumentPipeline API, reducing boilerplate and making it easy to build, customize, and extend complete workflows.
+- **End-to-end pipeline composition:** Chain together readers, processors, chunkers, and writers with the `IngestionPipeline` API, reducing boilerplate and making it easy to build, customize, and extend complete workflows.

 All of these components are open and extensible by design. You can add custom logic, new connectors, and extend the system to support emerging AI scenarios. By standardizing how documents are represented, processed, and stored, .NET developers can build reliable, scalable, and maintainable data pipelines without reinventing the wheel for every project.

 ### Building on stable foundations

 ![Data Ingestion Architecture Diagram](../media/data-ingestion/DataIngestion.png)

-These new data ingestion abstractions are built on top of proven and extensible components in the .NET ecosystem, ensuring reliability, interoperability, and seamless integration with existing AI workflows:
+These new data ingestion building blocks are built on top of proven and extensible components in the .NET ecosystem, ensuring reliability, interoperability, and seamless integration with existing AI workflows:

 - **Microsoft.ML.Tokenizers:** Tokenizers provide the foundation for chunking documents based on tokens. This enables precise splitting of content, which is essential for preparing data for large language models and optimizing retrieval strategies.
 - **Microsoft.Extensions.AI:** This set of libraries powers enrichment transformations using large language models. It enables features like summarization, sentiment analysis, keyword extraction, and embedding generation, making it easy to enhance your data with intelligent insights.
@@ -56,18 +56,18 @@ In addition to familiar patterns and tools, these abstractions build on already

 ## Data ingestion building blocks

-The Microsoft.Extensions.DataIngestion library is built around several key components that work together to create a complete data processing pipeline. Let's explore each component and how they fit together.
+The [Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) library is built around several key components that work together to create a complete data processing pipeline. Let's explore each component and how they fit together.

 ### Documents and Document Readers

-At the foundation of the library is the `Document` type, which provides a unified way to represent any file format without losing important information. The `Document` is Markdown-centric because large language models work best with Markdown formatting.
+At the foundation of the library is the `IngestionDocument` type, which provides a unified way to represent any file format without losing important information. The `IngestionDocument` is Markdown-centric because large language models work best with Markdown formatting.

-The `DocumentReader` abstraction handles loading documents from various sources, whether local files or remote URIs. The library includes several built-in readers:
+The `IngestionDocumentReader` abstraction handles loading documents from various sources, whether local files or streams. There are few readers available:

 - **Azure Document Intelligence**
 - **LlamaParse**
-- **MarkItDown**
-- **Markdown**
+- **[MarkItDown](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion.MarkItDown)**
+- **[Markdown](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion.Markdig/)**

 This design means you can work with documents from different sources using the same consistent API, making your code more maintainable and flexible.

@@ -127,4 +127,4 @@ This pipeline approach reduces boilerplate code and makes it easy to build, test

 ```csharp
 //TODO: Add code snippet
-```
+```
