-
Notifications
You must be signed in to change notification settings - Fork 849
Add DocumentTokenChunker #7093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DocumentTokenChunker #7093
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
adamsitnik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is correct, but not optimal. Since the main goal of this chunker is best performance, please improve the implementation based on my feedback.
Thank you for your contribution @KrystofS !
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
6a72a5b to
2a49ac9
Compare
adamsitnik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the code looks good, but we can still avoid some allocations. PTAL at my comments, thank you @KrystofS !
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
…mentTokenChunker.cs Co-authored-by: Adam Sitnik <[email protected]>
…mentTokenChunker.cs Co-authored-by: Adam Sitnik <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
| ReadOnlyMemory<char> contentToProcess = elementContent.AsMemory(); | ||
| while (stringBuilderTokenCount + contentToProcessTokenCount >= _maxTokensPerChunk) | ||
| { | ||
| int index = _tokenizer.GetIndexByTokenCount( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't appear to be making any attempt to move the start/end of the chunk to a "good" location, e.g. this could be in the middle of a word?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. I don't think it is an issue because any RAG system should be resilient enough not to be affected by this assuming reasonable overlap size. Similarly this could take just a part of some table cell etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. But I see other chunking systems going to great lengths to try to find good boundaries for the chunks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephentoub what other chunking systems are you referring to? Langchains TokenTextSplitter could split any word in similar fashion, so could TokenChunker from chonkie. I'd say it's true in general not for this type of token count based chunker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest adding only a warning to documentation and keeping the current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephentoub can we move on with this one? I in my testing this method actually performed the best in RAG tasks with the default settings on my test dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave it up to @adamsitnik.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that it's documented, works fine in some cases and our competitors provide similar feature, I am fine merging it.
I in my testing this method actually performed the best in RAG tasks with the default settings on my test dataset.
Just out of curiosity, have you tried the HeaderChunker I've implemented?
adamsitnik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KrystofS it's almost ready, PTAL at my last comment. Thanks!
src/Libraries/Microsoft.Extensions.DataIngestion/Microsoft.Extensions.DataIngestion.csproj
Show resolved
Hide resolved
| unsafe | ||
| { | ||
| fixed (char* ptr = &MemoryMarshal.GetReference(contentToProcess.Span)) | ||
| { | ||
| _ = stringBuilder.Append(ptr, index); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect you are using unsafe to avoid string allocation of .NET Standard/Full Framework. I don't believe it's worth the struggle (we aim to not use unsafe at all when possible).
Please follow the pattern of passing span to builder on modern .NET and allocating otherwise:
extensions/src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs
Lines 229 to 233 in e801349
| #if NET | |
| stringBuilder.Append(chars); | |
| #else | |
| stringBuilder.Append(chars.ToString()); | |
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommended it. This could be a ton of string allocation, entirely unnecessarily. The unsafe use is very small and scoped and easily audited. I don't see a problem with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with @stephentoub on this one. I agree that unsafe use is contained to a very small portion of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a debate about the same thing in some other PR and I removed the unsafe.
We could at least move the unsafe to !NET:
#if NET
stringBuilder.Append(chars);
#else
unsafe goes here
#endif Or introduce an extension method that does take care of that of !NET
But I don't want to block @KrystofS, we can deal with it later.
cc @EgorBo Who is leading the effort of unsafe removal.
adamsitnik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you for your contribution @KrystofS !
| unsafe | ||
| { | ||
| fixed (char* ptr = &MemoryMarshal.GetReference(contentToProcess.Span)) | ||
| { | ||
| _ = stringBuilder.Append(ptr, index); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a debate about the same thing in some other PR and I removed the unsafe.
We could at least move the unsafe to !NET:
#if NET
stringBuilder.Append(chars);
#else
unsafe goes here
#endif Or introduce an extension method that does take care of that of !NET
But I don't want to block @KrystofS, we can deal with it later.
cc @EgorBo Who is leading the effort of unsafe removal.
| ReadOnlyMemory<char> contentToProcess = elementContent.AsMemory(); | ||
| while (stringBuilderTokenCount + contentToProcessTokenCount >= _maxTokensPerChunk) | ||
| { | ||
| int index = _tokenizer.GetIndexByTokenCount( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that it's documented, works fine in some cases and our competitors provide similar feature, I am fine merging it.
I in my testing this method actually performed the best in RAG tasks with the default settings on my test dataset.
Just out of curiosity, have you tried the HeaderChunker I've implemented?
|
@adamsitnik I have not tried |
Microsoft Reviewers: Open in CodeFlow