@KrystofS KrystofS commented Nov 30, 2025

Microsoft Reviewers: Open in CodeFlow

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Member

@adamsitnik adamsitnik left a comment

This implementation is correct, but not optimal. Since the main goal of this chunker is best performance, please improve the implementation based on my feedback.

Thank you for your contribution @KrystofS !

@KrystofS KrystofS requested a review from adamsitnik December 6, 2025 12:00
@KrystofS KrystofS force-pushed the feature/DocumentTokenChunker branch from 6a72a5b to 2a49ac9 Compare December 8, 2025 16:46
Member

@adamsitnik adamsitnik left a comment

Overall the code looks good, but we can still avoid some allocations. PTAL at my comments, thank you @KrystofS !

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

@KrystofS KrystofS requested a review from stephentoub December 10, 2025 00:15
ReadOnlyMemory<char> contentToProcess = elementContent.AsMemory();
while (stringBuilderTokenCount + contentToProcessTokenCount >= _maxTokensPerChunk)
{
    int index = _tokenizer.GetIndexByTokenCount(
Member

This doesn't appear to be making any attempt to move the start/end of the chunk to a "good" location, e.g. this could be in the middle of a word?

Contributor Author

@KrystofS KrystofS Dec 10, 2025

Correct. I don't think it is an issue, because any RAG system should be resilient enough not to be affected by this, assuming a reasonable overlap size. Similarly, a chunk could end up containing only part of a table cell, etc.

Member

Maybe. But I see other chunking systems going to great lengths to try to find good boundaries for the chunks.
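
For readers following the boundary discussion, here is a minimal sketch of what moving a split point to a "good" location could look like. It is not part of the PR, and it does not describe how any particular chunking library picks boundaries. `ChunkBoundary` and `SnapToWordBoundary` are hypothetical names, and `index` stands in for whatever split position the tokenizer proposed.

```csharp
using System;

internal static class ChunkBoundary
{
    // Hypothetical helper (not in the PR): moves a proposed split index back
    // to the position just after the nearest preceding whitespace character,
    // so that the split does not land in the middle of a word.
    public static int SnapToWordBoundary(ReadOnlySpan<char> text, int index)
    {
        if (index <= 0)
        {
            return 0;
        }

        if (index >= text.Length)
        {
            return text.Length;
        }

        // The split already falls right before whitespace: nothing to do.
        if (char.IsWhiteSpace(text[index]))
        {
            return index;
        }

        // Walk backwards until the character before the split is whitespace.
        for (int i = index; i > 0; i--)
        {
            if (char.IsWhiteSpace(text[i - 1]))
            {
                return i;
            }
        }

        // No whitespace before the split: keep the original index so the
        // caller still makes progress on very long words.
        return index;
    }
}
```

Because the index only ever moves backwards, the resulting chunk never exceeds the original token budget, though it may come out slightly shorter than the maximum.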

Contributor Author

@stephentoub what other chunking systems are you referring to? LangChain's TokenTextSplitter could split any word in a similar fashion, and so could TokenChunker from chonkie. I'd say that's true in general, but not for this type of token-count-based chunker.

Contributor Author

I suggest adding only a warning to the documentation and keeping the current behavior.

Contributor Author

@stephentoub can we move on with this one? In my testing, this method actually performed the best in RAG tasks with the default settings on my test dataset.

Member

I'll leave it up to @adamsitnik.

Member

Assuming that it's documented, works fine in some cases, and our competitors provide a similar feature, I am fine with merging it.

> In my testing, this method actually performed the best in RAG tasks with the default settings on my test dataset.

Just out of curiosity, have you tried the HeaderChunker I've implemented?

@KrystofS KrystofS requested a review from stephentoub December 10, 2025 17:33
Member

@adamsitnik adamsitnik left a comment

@KrystofS it's almost ready, PTAL at my last comment. Thanks!

Comment on lines +69 to +75
unsafe
{
    fixed (char* ptr = &MemoryMarshal.GetReference(contentToProcess.Span))
    {
        _ = stringBuilder.Append(ptr, index);
    }
}
Member

I suspect you are using unsafe to avoid a string allocation on .NET Standard/Full Framework. I don't believe it's worth the struggle (we aim not to use unsafe at all when possible).

Please follow the pattern of passing span to builder on modern .NET and allocating otherwise:

#if NET
stringBuilder.Append(chars);
#else
stringBuilder.Append(chars.ToString());
#endif

Member

I recommended it. This could be a ton of string allocation, entirely unnecessarily. The unsafe use is very small and scoped and easily audited. I don't see a problem with it.

Contributor Author

I'm with @stephentoub on this one. I agree that unsafe use is contained to a very small portion of the code.

Member

We had a debate about the same thing in some other PR and I removed the unsafe.

We could at least move the unsafe to !NET:

#if NET
        stringBuilder.Append(chars);
#else
        // unsafe goes here
#endif

Or introduce an extension method that takes care of that for !NET.

But I don't want to block @KrystofS, we can deal with it later.

cc @EgorBo, who is leading the unsafe removal effort.
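
The extension-method idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the PR: `StringBuilderSpanExtensions` and `AppendSpan` are hypothetical names, the `NET` symbol is assumed to be defined only for modern .NET targets, and the project is assumed to allow unsafe blocks for the older targets.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class StringBuilderSpanExtensions
{
    // Hypothetical helper (not in the PR): appends a span of characters
    // without an intermediate string allocation on every target framework.
    public static StringBuilder AppendSpan(this StringBuilder builder, ReadOnlySpan<char> chars)
    {
#if NET
        // Modern .NET has a span-based Append overload.
        return builder.Append(chars);
#else
        if (chars.IsEmpty)
        {
            return builder;
        }

        // Older targets have no span overload, so the characters are pinned
        // and the pointer-based Append(char*, int) overload is used instead.
        unsafe
        {
            fixed (char* ptr = &MemoryMarshal.GetReference(chars))
            {
                return builder.Append(ptr, chars.Length);
            }
        }
#endif
    }
}
```

With a helper like this, the call site would contain no `unsafe` block and no `#if`; for example, `stringBuilder.AppendSpan(contentToProcess.Span.Slice(0, index))`. The unsafe code would then live in one place, compiled only for the older targets.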

Member

@adamsitnik adamsitnik left a comment

LGTM, thank you for your contribution @KrystofS !

@adamsitnik adamsitnik merged commit 20db541 into dotnet:main Dec 12, 2025
6 checks passed
@KrystofS
Contributor Author

@adamsitnik I have not tried HeaderChunker. I can extend my testing and let you know about the results.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 12, 2026