@@ -3,13 +3,14 @@
<PropertyGroup>
<TargetFrameworks>$(TargetFrameworks);netstandard2.0</TargetFrameworks>
<RootNamespace>Microsoft.Extensions.DataIngestion</RootNamespace>

<!-- we are not ready to publish yet -->
<IsPackable>false</IsPackable>
<Description>Abstractions representing Data Ingestion components for RAG.</Description>
<Workstream>RAG</Workstream>
<Stage>preview</Stage>
<EnablePackageValidation>false</EnablePackageValidation>
<MinCodeCoverage>75</MinCodeCoverage>
<MinMutationScore>75</MinMutationScore>
<!-- Convert abstract class into interface -->
<NoWarn>$(NoWarn);S1694</NoWarn>
<Stage>preview</Stage>
<EnablePackageValidation>false</EnablePackageValidation>
</PropertyGroup>

<ItemGroup>
@@ -0,0 +1,39 @@
# Microsoft.Extensions.DataIngestion.Abstractions

.NET developers need to efficiently process, chunk, and retrieve information from diverse document formats while preserving semantic meaning and structural context. The `Microsoft.Extensions.DataIngestion` libraries provide a unified approach for representing document ingestion components.

## The packages

The [Microsoft.Extensions.DataIngestion.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion.Abstractions) package provides the core exchange types, including [`IngestionDocument`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.ingestiondocument), [`IngestionChunker<T>`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.ingestionchunker-1), [`IngestionChunkProcessor<T>`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.ingestionchunkprocessor-1), and [`IngestionChunkWriter<T>`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.ingestionchunkwriter-1). Any .NET library that provides document processing capabilities can implement these abstractions to enable seamless integration with consuming code.

The [Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) package has an implicit dependency on the `Microsoft.Extensions.DataIngestion.Abstractions` package. This package enables you to easily integrate components such as enrichment processors, vector storage writers, and telemetry into your applications using familiar dependency injection and pipeline patterns. For example, it provides processors for sentiment analysis, keyword extraction, and summarization that can be chained together in ingestion pipelines.

## Which package to reference

Libraries that provide implementations of the abstractions typically reference only `Microsoft.Extensions.DataIngestion.Abstractions`.

To also have access to higher-level utilities for working with document ingestion components, reference the `Microsoft.Extensions.DataIngestion` package instead (which itself references `Microsoft.Extensions.DataIngestion.Abstractions`). Most consuming applications and services should reference the `Microsoft.Extensions.DataIngestion` package along with one or more libraries that provide concrete implementations of the abstractions, such as `Microsoft.Extensions.DataIngestion.MarkItDown` or `Microsoft.Extensions.DataIngestion.Markdig`.

## Install the package

From the command line:

```console
dotnet add package Microsoft.Extensions.DataIngestion.Abstractions --prerelease
```

Or directly in the C# project file:

```xml
<ItemGroup>
  <PackageReference Include="Microsoft.Extensions.DataIngestion.Abstractions" Version="[CURRENTVERSION]" />
</ItemGroup>
```

## Documentation

Refer to the [Microsoft.Extensions.DataIngestion libraries documentation](https://learn.microsoft.com/dotnet/dataingestion/microsoft-extensions-dataingestion) for more information and API usage examples.

## Feedback & Contributing

We welcome feedback and contributions in [our GitHub repo](https://github.com/dotnet/extensions).
@@ -3,11 +3,12 @@
<PropertyGroup>
<TargetFrameworks>$(TargetFrameworks);netstandard2.0</TargetFrameworks>
<RootNamespace>Microsoft.Extensions.DataIngestion</RootNamespace>

<!-- we are not ready to publish yet -->
<IsPackable>false</IsPackable>
<Description>Implementation of IngestionDocumentReader abstraction for MarkItDown.</Description>
<Workstream>RAG</Workstream>
<Stage>preview</Stage>
<EnablePackageValidation>false</EnablePackageValidation>
<MinCodeCoverage>75</MinCodeCoverage>
<MinMutationScore>75</MinMutationScore>
</PropertyGroup>

<ItemGroup>
@@ -0,0 +1,36 @@
# Microsoft.Extensions.DataIngestion.MarkItDown

Provides an implementation of the `IngestionDocumentReader` class for the [MarkItDown](https://github.com/microsoft/markitdown/) utility.

## Install the package

From the command line:

```console
dotnet add package Microsoft.Extensions.DataIngestion.MarkItDown --prerelease
```

Or directly in the C# project file:

```xml
<ItemGroup>
  <PackageReference Include="Microsoft.Extensions.DataIngestion.MarkItDown" Version="[CURRENTVERSION]" />
</ItemGroup>
```

## Usage Examples

### Creating a MarkItDownReader for Data Ingestion

```csharp
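using System.IO;
using Microsoft.Extensions.DataIngestion;

// Point the reader at the MarkItDown executable; replace the placeholder path with its real location.
IngestionDocumentReader reader =
    new MarkItDownReader(new FileInfo(@"pathToMarkItDown.exe"), extractImages: true);

// CreateChunker() and CreateWriter() are placeholders for code you supply; they are expected
// to return an IngestionChunker<string> and an IngestionChunkWriter<string>.
using IngestionPipeline<string> pipeline = new(reader, CreateChunker(), CreateWriter());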
```

## Feedback & Contributing

We welcome feedback and contributions in [our GitHub repo](https://github.com/dotnet/extensions).
@@ -3,11 +3,12 @@
<PropertyGroup>
<TargetFrameworks>$(TargetFrameworks);netstandard2.0</TargetFrameworks>
<RootNamespace>Microsoft.Extensions.DataIngestion</RootNamespace>

<!-- we are not ready to publish yet -->
<IsPackable>false</IsPackable>
<Description>Implementation of IngestionDocumentReader abstraction for Markdown.</Description>
<Workstream>RAG</Workstream>
<Stage>preview</Stage>
<EnablePackageValidation>false</EnablePackageValidation>
<MinCodeCoverage>75</MinCodeCoverage>
<MinMutationScore>75</MinMutationScore>
</PropertyGroup>

<ItemGroup>
35 changes: 35 additions & 0 deletions src/Libraries/Microsoft.Extensions.DataIngestion.Markdig/README.md
@@ -0,0 +1,35 @@
# Microsoft.Extensions.DataIngestion.Markdig

Provides an implementation of the `IngestionDocumentReader` class for Markdown files using the [Markdig](https://github.com/xoofx/markdig) library.

## Install the package

From the command line:

```console
dotnet add package Microsoft.Extensions.DataIngestion.Markdig --prerelease
```

Or directly in the C# project file:

```xml
<ItemGroup>
  <PackageReference Include="Microsoft.Extensions.DataIngestion.Markdig" Version="[CURRENTVERSION]" />
</ItemGroup>
```

## Usage Examples

### Creating a MarkdownReader for Data Ingestion

```csharp
using Microsoft.Extensions.DataIngestion;

IngestionDocumentReader reader = new MarkdownReader();

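// CreateChunker() and CreateWriter() are placeholders for code you supply; they are expected
// to return an IngestionChunker<string> and an IngestionChunkWriter<string>.
using IngestionPipeline<string> pipeline = new(reader, CreateChunker(), CreateWriter());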
```

## Feedback & Contributing

We welcome feedback and contributions in [our GitHub repo](https://github.com/dotnet/extensions).
@@ -3,14 +3,14 @@
<PropertyGroup>
<TargetFrameworks>$(TargetFrameworks);netstandard2.0</TargetFrameworks>
<RootNamespace>Microsoft.Extensions.DataIngestion</RootNamespace>

<Description>Data Ingestion utilities for RAG.</Description>
<Workstream>RAG</Workstream>
<UseLoggingGenerator>true</UseLoggingGenerator>
<DisableMicrosoftExtensionsLoggingSourceGenerator>false</DisableMicrosoftExtensionsLoggingSourceGenerator>

<!-- we are not ready to publish yet -->
<IsPackable>false</IsPackable>
<Stage>preview</Stage>
<EnablePackageValidation>false</EnablePackageValidation>
<MinCodeCoverage>75</MinCodeCoverage>
<MinMutationScore>75</MinMutationScore>
</PropertyGroup>

<ItemGroup>
34 changes: 34 additions & 0 deletions src/Libraries/Microsoft.Extensions.DataIngestion/README.md
@@ -0,0 +1,34 @@
# Microsoft.Extensions.DataIngestion

.NET developers need to efficiently process, chunk, and retrieve information from diverse document formats while preserving semantic meaning and structural context. The `Microsoft.Extensions.DataIngestion` libraries provide a unified approach for representing document ingestion components.

## The packages

The [Microsoft.Extensions.DataIngestion.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion.Abstractions) package provides the core exchange types, including [`IngestionDocument`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.ingestiondocument), [`IngestionChunker<T>`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.ingestionchunker-1), [`IngestionChunkProcessor<T>`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.ingestionchunkprocessor-1), and [`IngestionChunkWriter<T>`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.ingestionchunkwriter-1). Any .NET library that provides document processing capabilities can implement these abstractions to enable seamless integration with consuming code.

The [Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) package has an implicit dependency on the `Microsoft.Extensions.DataIngestion.Abstractions` package. This package enables you to easily integrate components such as enrichment processors, vector storage writers, and telemetry into your applications using familiar dependency injection and pipeline patterns. For example, it provides the [`SentimentEnricher`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.sentimentenricher), [`KeywordEnricher`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.keywordenricher), and [`SummaryEnricher`](https://learn.microsoft.com/dotnet/api/microsoft.extensions.dataingestion.summaryenricher) processors that can be chained together in ingestion pipelines.

## Which package to reference

Libraries that provide implementations of the abstractions typically reference only `Microsoft.Extensions.DataIngestion.Abstractions`.

To also have access to higher-level utilities for working with document ingestion components, reference the `Microsoft.Extensions.DataIngestion` package instead (which itself references `Microsoft.Extensions.DataIngestion.Abstractions`). Most consuming applications and services should reference the `Microsoft.Extensions.DataIngestion` package along with one or more libraries that provide concrete implementations of the abstractions, such as `Microsoft.Extensions.DataIngestion.MarkItDown` or `Microsoft.Extensions.DataIngestion.Markdig`.
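
As a rough sketch of the setup described above, a consuming application's project file might reference the main package together with one reader package. The choice of the Markdig reader is only illustrative, and `[CURRENTVERSION]` is the same unresolved placeholder used elsewhere in this README; substitute the version you actually install.

```xml
<ItemGroup>
  <!-- Higher-level utilities; this package itself references Microsoft.Extensions.DataIngestion.Abstractions -->
  <PackageReference Include="Microsoft.Extensions.DataIngestion" Version="[CURRENTVERSION]" />
  <!-- One concrete reader implementation (illustrative choice; MarkItDown is the other one named above) -->
  <PackageReference Include="Microsoft.Extensions.DataIngestion.Markdig" Version="[CURRENTVERSION]" />
</ItemGroup>
```

A library that only implements the abstractions would instead list `Microsoft.Extensions.DataIngestion.Abstractions` alone.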

## Install the package

From the command line:

```console
dotnet add package Microsoft.Extensions.DataIngestion --prerelease
```

Or directly in the C# project file:

```xml
<ItemGroup>
  <PackageReference Include="Microsoft.Extensions.DataIngestion" Version="[CURRENTVERSION]" />
</ItemGroup>
```

## Feedback & Contributing

We welcome feedback and contributions in [our GitHub repo](https://github.com/dotnet/extensions).