## What is Microsoft.Extensions.DataIngestion?
[Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET components for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios.

With these building blocks, developers can create robust, flexible, and intelligent data ingestion pipelines tailored for their application needs:
- **Unified document representation:** Represent any file type (PDF, image, Microsoft Word, etc.) in a consistent format that works well with large language models.
- **Flexible data ingestion:** Read documents from both cloud services and local sources using multiple built-in readers, making it easy to bring in data from wherever it lives.
- **Built-in AI enhancements:** Automatically enrich content with summaries, sentiment analysis, keyword extraction, and classification, preparing your data for intelligent workflows.
- **Customizable chunking strategies:** Split documents into chunks using token-based, section-based, or semantic-aware approaches, so you can optimize for your retrieval and analysis needs.
- **Production-ready storage:** Store processed chunks in popular vector databases and document stores, with support for embedding generation, making your pipelines ready for real-world scenarios.
- **End-to-end pipeline composition:** Chain together readers, processors, chunkers, and writers with the `IngestionPipeline` API, reducing boilerplate and making it easy to build, customize, and extend complete workflows (a minimal sketch follows this list).
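
To make that composition concrete, here is a minimal sketch of wiring a pipeline together. The `IngestionPipeline` name comes from this article, but the reader, enrichment, chunker, and writer type names below are illustrative assumptions about the preview API, not confirmed types:

```csharp
using Microsoft.Extensions.DataIngestion;

// All component names below are hypothetical placeholders; check the
// package documentation for the actual reader, chunker, and writer types.
var pipeline = new IngestionPipeline(
    reader: new MarkdownReader(),                   // load files into documents
    processors: [new SummaryEnricher()],            // optional AI enrichment step
    chunker: new TokenBasedChunker(maxTokens: 512), // split documents into chunks
    writer: new InMemoryChunkWriter());             // store the resulting chunks

await pipeline.ProcessAsync("docs/overview.md");
```

The design point is that each stage is an interchangeable abstraction: swapping a local-file reader for a cloud one, or a token-based chunker for a semantic-aware one, leaves the rest of the pipeline unchanged.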
All of these components are open and extensible by design. You can add custom logic and new connectors, and extend the system to support emerging AI scenarios. By standardizing how documents are represented, processed, and stored, .NET developers can build reliable, scalable, and maintainable data pipelines without reinventing the wheel for every project.

These new data ingestion building blocks are built on top of proven and extensible components in the .NET ecosystem, ensuring reliability, interoperability, and seamless integration with existing AI workflows:
- **Microsoft.ML.Tokenizers:** Tokenizers provide the foundation for chunking documents based on tokens. This enables precise splitting of content, which is essential for preparing data for large language models and optimizing retrieval strategies (see the token-counting sketch after this list).
- **Microsoft.Extensions.AI:** This set of libraries powers enrichment transformations using large language models. It enables features like summarization, sentiment analysis, keyword extraction, and embedding generation, making it easy to enhance your data with intelligent insights.
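
As a small, concrete example of the tokenizer layer, Microsoft.ML.Tokenizers can count the tokens in a piece of text; counts like this are what token-based chunking uses to decide where chunk boundaries fall. A minimal sketch of direct tokenizer use:

```csharp
using Microsoft.ML.Tokenizers;

// Create a tokenizer that matches the target model so chunk sizes line up
// with the model's own token accounting.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

string text = "Data ingestion splits documents into token-sized chunks for retrieval.";
int tokenCount = tokenizer.CountTokens(text);

Console.WriteLine($"Token count: {tokenCount}");
```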
## Data ingestion building blocks
The [Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) library is built around several key components that work together to create a complete data processing pipeline. Let's explore each component and how they fit together.
### Documents and Document Readers
At the foundation of the library is the `IngestionDocument` type, which provides a unified way to represent any file format without losing important information. The `IngestionDocument` is Markdown-centric because large language models work best with Markdown formatting.

The `IngestionDocumentReader` abstraction handles loading documents from various sources, whether local files or streams. There are a few built-in readers available.
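
As an illustrative sketch of the reader abstraction (the concrete `MarkdownReader` type and the `ReadAsync` signature below are assumptions about the preview API, not confirmed members), loading a local file might look like this:

```csharp
using Microsoft.Extensions.DataIngestion;

// Hypothetical built-in reader; the package's actual reader types may differ.
IngestionDocumentReader reader = new MarkdownReader();

// Load a local file into the unified, Markdown-centric document model.
IngestionDocument document = await reader.ReadAsync("notes.md");
```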