Commit 93a7d71: address Copilot feedback
1 parent: ef70e3d

1 file changed: docs/ai/conceptual/data-ingestion.md (9 additions, 8 deletions)
@@ -5,6 +5,7 @@ author: luisquintanilla
 ms.author: luquinta
 ms.date: 11/11/2025
 ms.topic: concept-article
+ai-usage: ai-assisted
 ---
 
 # Data Ingestion
@@ -21,7 +22,7 @@ For AI and machine learning scenarios, especially Retrieval-Augmented Generation
 
 ## Why data ingestion matters for AI applications
 
-Imagine you're building a RAG-powered chatbot to help employees find information across your company's vast collection of documents. These documents might include PDFs, Word files, PowerPoint presentations, and web pages scattered across different systems.
+Imagine you're building a RAG-powered chatbot to help employees find information across your company's vast collection of documents. These documents might include PDFs, Word files, PowerPoint presentations, and web pages scattered across different systems.
 
 Your chatbot needs to understand and search through thousands of documents to provide accurate, contextual answers. But raw documents aren't suitable for AI systems. You need to transform them into a format that preserves meaning while making them searchable and retrievable.
@@ -35,7 +36,7 @@ With these building blocks, developers can create robust, flexible, and intellig
 
 - **Unified document representation:** Represent any file type (PDF, Image, Microsoft Word, etc.) in a consistent format that works well with large language models.
 - **Flexible data ingestion:** Read documents from both cloud services and local sources using multiple built-in readers, making it easy to bring in data from wherever it lives.
-- **Built-in AI enhancements:** Automatically enrich content with summaries, sentiment analysis, keyword extraction and classification, preparing your data for intelligent workflows.
+- **Built-in AI enhancements:** Automatically enrich content with summaries, sentiment analysis, keyword extraction, and classification, preparing your data for intelligent workflows.
 - **Customizable chunking strategies:** Split documents into chunks using token-based, section-based, or semantic-aware approaches, so you can optimize for your retrieval and analysis needs.
 - **Production-ready storage:** Store processed chunks in popular vector databases and document stores, with support for embedding generation, making your pipelines ready for real-world scenarios.
 - **End-to-end pipeline composition:** Chain together readers, processors, chunkers, and writers with the `IngestionPipeline` API, reducing boilerplate and making it easy to build, customize, and extend complete workflows.
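To make the chunking bullet above concrete, here is a minimal sketch of configuring a token-based chunker. The doc confirms only the three strategy names; the `TokenBasedChunker` type, its parameters, and the `IngestionChunker<string>` base type used here are assumptions for illustration, not confirmed API.

```csharp
using Microsoft.Extensions.DataIngestion;

// Sketch only: the doc names token-based, section-based, and semantic-aware
// chunking strategies, but this concrete type and its parameters are
// hypothetical, not confirmed API.
IngestionChunker<string> chunker = new TokenBasedChunker(
    maxTokensPerChunk: 512, // keep each chunk within the embedding model's budget
    overlapTokens: 64);     // overlap adjacent chunks so context isn't cut mid-sentence
```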
@@ -63,12 +64,12 @@ The [Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsof
 
 At the foundation of the library is the `IngestionDocument` type, which provides a unified way to represent any file format without losing important information. The `IngestionDocument` is Markdown-centric because large language models work best with Markdown formatting.
 
-The `IngestionDocumentReader` abstraction handles loading documents from various sources, whether local files or streams. There are few readers available:
+The `IngestionDocumentReader` abstraction handles loading documents from various sources, whether local files or streams. A few readers are available:
 
 - **[MarkItDown](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion.MarkItDown)**
 - **[Markdown](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion.Markdig/)**
 
-And we are actively working on adding more readers (including **LlamaParse** and **Azure Document Intelligence**).
+We're actively working on adding more readers (including **LlamaParse** and **Azure Document Intelligence**).
 
 This design means you can work with documents from different sources using the same consistent API, making your code more maintainable and flexible.
 
@@ -137,10 +138,10 @@ using VectorStoreWriter<string> writer = new(vectorStore, dimensionCount: 1536);
 
 The `IngestionPipeline<T>` API allows you to chain together the various data ingestion components into a complete workflow. You can combine:
 
-- **Readers** to load documents from various sources
-- **Processors** to transform and enrich document content
-- **Chunkers** to break documents into manageable pieces
-- **Writers** to store the final results in your chosen data store
+- **Readers** to load documents from various sources.
+- **Processors** to transform and enrich document content.
+- **Chunkers** to break documents into manageable pieces.
+- **Writers** to store the final results in your chosen data store.
 
 This pipeline approach reduces boilerplate code and makes it easy to build, test, and maintain complex data ingestion workflows.
 