docs/ai/conceptual/data-ingestion.md
9 additions & 8 deletions
@@ -5,6 +5,7 @@ author: luisquintanilla
ms.author: luquinta
ms.date: 11/11/2025
ms.topic: concept-article
+ai-usage: ai-assisted
---
# Data Ingestion
@@ -21,7 +22,7 @@ For AI and machine learning scenarios, especially Retrieval-Augmented Generation
## Why data ingestion matters for AI applications
-Imagine you're building a RAG-powered chatbot to help employees find information across your company's vast collection of documents. These documents might include PDFs, Word files, PowerPoint presentations, and web pages scattered across different systems.
+Imagine you're building a RAG-powered chatbot to help employees find information across your company's vast collection of documents. These documents might include PDFs, Word files, PowerPoint presentations, and web pages scattered across different systems.
Your chatbot needs to understand and search through thousands of documents to provide accurate, contextual answers. But raw documents aren't suitable for AI systems. You need to transform them into a format that preserves meaning while making them searchable and retrievable.
@@ -35,7 +36,7 @@ With these building blocks, developers can create robust, flexible, and intellig
-**Unified document representation:** Represent any file type (PDF, Image, Microsoft Word, etc.) in a consistent format that works well with large language models.
-**Flexible data ingestion:** Read documents from both cloud services and local sources using multiple built-in readers, making it easy to bring in data from wherever it lives.
--**Built-in AI enhancements:** Automatically enrich content with summaries, sentiment analysis, keyword extraction and classification, preparing your data for intelligent workflows.
+-**Built-in AI enhancements:** Automatically enrich content with summaries, sentiment analysis, keyword extraction, and classification, preparing your data for intelligent workflows.
-**Customizable chunking strategies:** Split documents into chunks using token-based, section-based, or semantic-aware approaches, so you can optimize for your retrieval and analysis needs.
-**Production-ready storage:** Store processed chunks in popular vector databases and document stores, with support for embedding generation, making your pipelines ready for real-world scenarios.
-**End-to-end pipeline composition:** Chain together readers, processors, chunkers, and writers with the `IngestionPipeline` API, reducing boilerplate and making it easy to build, customize, and extend complete workflows.
@@ -63,12 +64,12 @@ The [Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsof
At the foundation of the library is the `IngestionDocument` type, which provides a unified way to represent any file format without losing important information. The `IngestionDocument` is Markdown-centric because large language models work best with Markdown formatting.
-The `IngestionDocumentReader` abstraction handles loading documents from various sources, whether local files or streams. There are few readers available:
+The `IngestionDocumentReader` abstraction handles loading documents from various sources, whether local files or streams. A few readers are available: