Add APIs to BlobBuilder for customizing the underlying byte array et al.
#115294
Pull Request Overview
This PR adds new APIs to BlobBuilder for customizing the underlying byte array and updates related encoding and buffer-handling logic across metadata and core libraries. Key changes include replacing legacy UTF-8 encoding code with calls to the new System.Text.Unicode.Utf8 APIs, updating BlobBuilder’s API surface (including new constructors and properties), and adding NET-specific intrinsics support across several core modules.
Reviewed Changes
Copilot reviewed 31 out of 36 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| System/Reflection/Internal/Utilities/StreamExtensions.cs | Removed obsolete TryReadAll overload for Span to rely on newer API paths. |
| System/Reflection/Internal/Utilities/BlobUtilities.cs | Rewrote WriteUtf8 to use Utf8.FromUtf16 for encoding, replacing manual UTF-8 encoding logic. |
| System/Reflection/Metadata.cs | Added new BlobBuilder constructors, properties, and APIs including ReadOnlySpan/WriteBytes overloads. |
| System.Private.CoreLib (various files) | Updated intrinsics and preprocessor conditions (#if NET, #if SYSTEM_PRIVATE_CORELIB) for newer vectorized and ASCII helper routines. |
| Microsoft.Bcl.Memory (PACKAGE.md and others) | Updated documentation and type forwarding to include UTF-8 APIs for NET platforms. |
Files not reviewed (5)
- src/libraries/Microsoft.Bcl.Memory/src/Microsoft.Bcl.Memory.csproj: Language not supported
- src/libraries/Microsoft.Bcl.Memory/tests/Microsoft.Bcl.Memory.Tests.csproj: Language not supported
- src/libraries/System.Reflection.Metadata/System.Reflection.Metadata.sln: Language not supported
- src/libraries/System.Reflection.Metadata/src/Resources/Strings.resx: Language not supported
- src/libraries/System.Reflection.Metadata/src/System.Reflection.Metadata.csproj: Language not supported
Tagging subscribers to this area: @dotnet/area-system-reflection-metadata
Is this needed to introduce the new APIs? The System.Text.Unicode.Utf8 change introduces a new dependency for System.Reflection.Metadata on .NET Framework, which will be extra work to push through the system. It would be better to avoid bundling the two changes together in a single PR.
The other PR, #111292, is approved and ready to merge. After it merges and this branch is rebased against main, those commits will disappear.
The change that introduces the System.Reflection.Metadata dependency on Microsoft.Bcl.Memory won't disappear.
Switched back to the old […]. Tests pass locally. This is ready for review.
What's the performance regression introduced by this rewrite on .NET Framework? Our primary interest in removing unsafe code is on latest .NET. It is fine to keep unsafe code for .NET Framework if it is required for decent performance.
I updated the function to use unsafe code and wrote a benchmark to compare it with my initial safe edition. We cannot compare it with the existing unsafe implementation since the functions don't have the same signature and semantics. The numbers look promising so I switched to the
Benchmark code

```csharp
// See https://aka.ms/new-console-template for more information
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.InteropServices;
using System.Text;

BenchmarkRunner.Run<Utf8Bench>();

public class Utf8Bench
{
    public string TestString = null!;
    public byte[] TestBytes = new byte[2048];

    [Params(16, 128)]
    public int N { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var sb = new StringBuilder();
        for (int i = 0; i < N; i++)
        {
            sb.Append('a');
        }
        for (int i = 0; i < N; i++)
        {
            sb.Append('Θ');
        }
        for (int i = 0; i < N; i++)
        {
            sb.Append("😂");
        }
        TestString = sb.ToString();
        TestBytes = new byte[2048];
    }

    [Benchmark(Baseline = true)]
    public int Safe()
    {
        WriteUtf8Safe(TestString.AsSpan(), TestBytes, out int charsRead, out int bytesWritten, true);
        return charsRead + bytesWritten;
    }

    [Benchmark]
    public int Unsafe()
    {
        WriteUtf8Unsafe(TestString.AsSpan(), TestBytes, out int charsRead, out int bytesWritten, true);
        return charsRead + bytesWritten;
    }

    public static void WriteUtf8Safe(ReadOnlySpan<char> source, Span<byte> destination, out int charsRead, out int bytesWritten, bool allowUnpairedSurrogates)
    {
        // Copy from PR
    }

    public static unsafe void WriteUtf8Unsafe(ReadOnlySpan<char> source, Span<byte> destination, out int charsRead, out int bytesWritten, bool allowUnpairedSurrogates)
    {
        // Copy from PR
    }
}
```
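To sanity-check the benchmark input against the 2048-byte `TestBytes` buffer, the sizes can be worked out directly: each `'a'` is one UTF-8 byte, each `'Θ'` two, and each `"😂"` four (and two UTF-16 code units). The sketch below rebuilds the same string and measures it. `BenchInputSize` and `Measure` are hypothetical names used only for illustration; they are not part of the PR.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the PR: measures the benchmark's input string.
public static class BenchInputSize
{
    // Rebuilds the benchmark input for a given N and returns its
    // UTF-16 length and its UTF-8 byte count.
    public static (int Chars, int Bytes) Measure(int n)
    {
        var sb = new StringBuilder();
        sb.Append('a', n);              // 1 UTF-16 unit, 1 UTF-8 byte each
        sb.Append('Θ', n);              // 1 UTF-16 unit, 2 UTF-8 bytes each
        for (int i = 0; i < n; i++)
        {
            sb.Append("😂");            // 2 UTF-16 units, 4 UTF-8 bytes each
        }
        string s = sb.ToString();
        return (s.Length, Encoding.UTF8.GetByteCount(s));
    }
}
```

For the two `Params` values this gives 4N UTF-16 units and 7N bytes, that is 64 chars / 112 bytes for N = 16 and 512 chars / 896 bytes for N = 128, so both inputs fit in the 2048-byte destination.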
```csharp
/// <summary>
/// Changes the size of the byte array underpinning the <see cref="BlobBuilder"/>.
/// Derived types can override this method to control the allocation strategy.
/// </summary>
/// <param name="capacity">The array's new size.</param>
/// <seealso cref="Capacity"/>
protected virtual void SetCapacity(int capacity)
{
    Array.Resize(ref _buffer, Math.Max(MinChunkSize, capacity));
}
```
I don't understand how subclasses are supposed to override this. They will reassign the Buffer property, but according to a comment in #100418, reassigning Buffer clears the head chunk.
What are the semantics? Should Capacity's setter have additional logic?
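One way to picture the override being asked about is a pooled allocation strategy. The sketch below is an assumption-heavy illustration, not the PR's code: `ToyBuilder` is a hypothetical stand-in that copies only the `SetCapacity` shape from the snippet above, and its `Buffer` setter simply stores the array. Whether the real `BlobBuilder.Buffer` setter has side effects (such as clearing the head chunk) is exactly the open question in this thread, so the sketch does not answer it.

```csharp
using System;
using System.Buffers;

// Hypothetical stand-in for the buffer-owning part of BlobBuilder. NOT the real type.
public class ToyBuilder
{
    protected const int MinChunkSize = 16;
    private byte[] _buffer = Array.Empty<byte>();

    // Assumption: in this toy, assigning Buffer only stores the array.
    protected byte[] Buffer { get => _buffer; set => _buffer = value; }

    public int Capacity { get => _buffer.Length; set => SetCapacity(value); }

    // Same shape as the default implementation shown in the diff.
    protected virtual void SetCapacity(int capacity) =>
        Array.Resize(ref _buffer, Math.Max(MinChunkSize, capacity));
}

// Hypothetical derived type that rents its array from the shared pool.
public sealed class PooledToyBuilder : ToyBuilder
{
    protected override void SetCapacity(int capacity)
    {
        byte[] old = Buffer;
        // Rent may return an array larger than requested.
        byte[] rented = ArrayPool<byte>.Shared.Rent(Math.Max(MinChunkSize, capacity));
        // Preserve as much existing content as fits in the new array.
        old.AsSpan(0, Math.Min(old.Length, rented.Length)).CopyTo(rented);
        Buffer = rented;
        // Only arrays that came from the pool are returned to it.
        if (old.Length != 0)
        {
            ArrayPool<byte>.Shared.Return(old);
        }
    }
}
```

Note that with a pool the resulting `Capacity` can exceed the requested value, which is one more semantic a real `Capacity` setter would have to account for.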
cc: @jaredpar
We use the old `WriteUtf8` function for downlevel frameworks, rewritten to have a span-based signature and eliminate unsafe code.
We must increment the pointers after we write the bytes.
@steveharter could you take a look?
Only for downlevel or everywhere? The numbers you quote are sizeable.
Only for downlevel. On modern .NET we use […]
Can you help me understand then what you were measuring in #115294 (comment)? Is that something else?
We are trying to minimize our use of unsafe throughout, so I expect we'd want to avoid bringing that in here. @jkotas @AaronRobinsonMSFT can you advise on this PR for whether we should pursue getting it into approvable state for .NET 11?
Nothing jumps out to me as being the reason for the slowdown. @teo-tsirpanis Have you tried running this under a profiler and seeing where the hot spots are?
```csharp
        bytesWritten = destinationLength - destination.Length;
    }
#else
    public static void WriteUtf8(ReadOnlySpan<char> source, Span<byte> destination, out int charsRead, out int bytesWritten, bool allowUnpairedSurrogates)
```
Does this need to change at all? If this is for .NET Framework, I would just leave it as-is. There is little benefit to changing anything due to possibility of regressions.
Yes. The method's old signature was harder to work with; it uses pointers, does not clearly state which buffer is read-only, and requires pre-computing the UTF-8 byte count.
I had to refactor SRM's UTF-8 encoder in order to improve code reusability and memory safety in modern frameworks, and to take advantage of the System.Text.Unicode.Utf8 APIs. Originally, I tried adding a reference to the Microsoft.Bcl.Memory package, which has a polyfill for Utf8, so that this method had a single implementation. However, adding the extra dependency to SRM is not going to be simple, hence the re-introduction of the custom code.
Another idea, if we want to avoid the extra package dependency, is to vendor the sources of Utf8 into SRM.
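To illustrate why a span-based signature that reports `charsRead` and `bytesWritten` removes the need to pre-compute the UTF-8 byte count, here is a minimal sketch of encoding a string across fixed-size chunks. It uses the public `System.Text.Unicode.Utf8.FromUtf16` API as a stand-in for the PR's internal `WriteUtf8`; `ChunkedUtf8` and its `Encode` method are hypothetical names, and the way the real `BlobBuilder` allocates chunks is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Unicode;

// Hypothetical illustration, not SRM code.
public static class ChunkedUtf8
{
    // Encodes source into a sequence of byte arrays, each at most chunkSize bytes.
    public static List<byte[]> Encode(ReadOnlySpan<char> source, int chunkSize)
    {
        // A single code point needs up to 4 UTF-8 bytes; smaller chunks could stall.
        if (chunkSize < 4) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var chunks = new List<byte[]>();
        while (!source.IsEmpty)
        {
            byte[] chunk = new byte[chunkSize];
            // Encodes as much as fits; never splits a multi-byte sequence.
            // The status is ignored because progress is tracked via the out values.
            Utf8.FromUtf16(source, chunk, out int charsRead, out int bytesWritten);
            source = source.Slice(charsRead);              // advance past consumed chars
            chunks.Add(chunk.AsSpan(0, bytesWritten).ToArray());
        }
        return chunks;
    }
}
```

Calling `ChunkedUtf8.Encode("aΘ😂", 4)` yields two chunks of 3 and 4 bytes: the 4-byte emoji does not fit in the single byte left in the first chunk, so it moves to the next one. The caller never needs to know the total byte count up front.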
```csharp
_createBlobBuilderFunc = createBlobBuilderFunc ?? (capacity => new BlobBuilder(capacity));
_userStringBuilder = _createBlobBuilderFunc(4 * 1024);
_guidBuilder = _createBlobBuilderFunc(16); // full metadata has just a single guid
```
```diff
- _guidBuilder = _createBlobBuilderFunc(16); // full metadata has just a single guid
+ _guidBuilder = _createBlobBuilderFunc(BlobUtilities.SizeOfGuid); // full metadata has just a single guid
```
The unsafe code is used in .NET Framework polyfills for the most part. There is not much we can do about that without significantly regressing performance of this library on .NET Framework.
These APIs are expected to improve Roslyn performance once Roslyn switches over to use them. Like with any performance related change, I would like to see some numbers that show (1) the refactoring required by these APIs is not regressing Roslyn performance and (2) switching Roslyn to use these APIs is improving Roslyn performance.
Fixes #99244
Fixes #100418
This PR builds on top of @jaredpar's branch to add APIs for customizing the underlying buffer of a `BlobBuilder`. The chunking logic of `BlobBuilder` was updated to allocate multiple additional chunks, each with a user-customizable maximum size. As part of this, we use APIs from `System.Text.Unicode.Utf8` to encode UTF-8 strings, which increases performance and safety, and reduces duplicate code.
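For readers unfamiliar with `System.Text.Unicode.Utf8`, the following is a minimal sketch of the public `Utf8.FromUtf16` call mentioned above. `Utf8Sketch` is a hypothetical wrapper for illustration only. Note one assumption: passing `replaceInvalidSequences: true` substitutes U+FFFD for unpaired surrogates, so this call has no direct counterpart to the `allowUnpairedSurrogates` flag on the PR's internal `WriteUtf8`.

```csharp
using System;
using System.Buffers;
using System.Text.Unicode;

// Hypothetical wrapper, not part of the PR.
public static class Utf8Sketch
{
    // Encodes text into destination and reports how far the operation got.
    public static (OperationStatus Status, int CharsRead, int BytesWritten) Encode(
        string text, Span<byte> destination)
    {
        OperationStatus status = Utf8.FromUtf16(
            text.AsSpan(),
            destination,
            out int charsRead,
            out int bytesWritten,
            replaceInvalidSequences: true,   // invalid UTF-16 becomes U+FFFD
            isFinalBlock: true);             // no more input will follow
        return (status, charsRead, bytesWritten);
    }
}
```

With a 16-byte destination, `Utf8Sketch.Encode("aΘ😂", new byte[16])` completes with `OperationStatus.Done`, having read 4 chars and written 7 bytes. With a 4-byte destination it stops at `OperationStatus.DestinationTooSmall` after 2 chars and 3 bytes, leaving the emoji for the caller to place in another buffer.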