-
Notifications
You must be signed in to change notification settings - Fork 10.5k
Don't mutate strings in Kestrel #17556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
gfoidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes for review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to inline this methods, as in effect it's just the delegate invocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't find a better solution on how to communicate back that non-acii data is available. Except using exceptions, but it's not exceptional and slow / miuse of exceptions.
That's why is marked with a "flag", which is checked above at L167.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm aware of "no further unsafe code in Kestrel", but for code that used hardware intrinsics there is nearly no other way. Therefore I left the signature of this method as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some checks with higher threshold for AVX2 kicking in, but with not so good results. Hence AVX2 kicks in as soon there are enough elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will never be reached, but to indiciate that if Arm / Neon comes up as intrinsics the check may be expanded here. JIT eleminates this branch anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above by AVX2.
|
Did using string.Create cause slowdown that was mitigated by the new usage of hardware intrinsics? |
Yes. Due the overhead of the delegate call.
Not really.
I would not force the inlining of the default implementations, as there is more code working and would result in bigger call-sites. |
src/Servers/Kestrel/Core/src/Internal/Infrastructure/HttpUtilities.cs
Outdated
Show resolved
Hide resolved
This reverts commit 5ae80c2834471baf34d1e5a05a42e3cce1ff02d7. This is a .NET STandard 2.0 project, so no span is available by default. I think it's not worth it to add a reference to System.Memory-package just for this change.
…ing a null check for the compiler generated cached delegate
|
Rebased due conflicts in |
a477a53 to
80a77ff
Compare
| { | ||
| // If Vector not-accelerated or remaining less than vector size | ||
| if (!Vector.IsHardwareAccelerated || input > end - Vector<sbyte>.Count) | ||
| if (Avx2.IsSupported && input <= end - Vector256<sbyte>.Count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pointers are unsigned and end - Vector256<sbyte>.Count can't go negative; so
byte* p = null;
if (p <= p - 15)
{
Console.WriteLine(true);
}Will output true
So its better to switch the subtraction around with a <= test and have input overflow if it goes negative e.g.
- input <= end - Vector256<sbyte>.Count
+ input - Vector256<sbyte>.Count <= endThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this condition won't work.
E.g. this constellation is valid and could be processed by AVX-codepath:
byte* input = 0;
byte* end = 32;
if (input - Vector256<sbyte>.Count <= end)
{
// ...
}Here it will wrap around to 0xffffffffffffffe0 <= 32 which is false.
So it should be input + Vector256<sbyte>.Count <= end which itself can overflow, hence the condition is written with the subtraction.
Your concern is valid, so I could add
public static unsafe bool TryGetAsciiString(byte* input, char* output, int count)
{
+ Debug.Assert(input != null);
+ Debug.Assert(output != null);?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just count >= Vector256<sbyte>.Count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the easy way 😃
In
| } while (input <= end - Vector256<sbyte>.Count); |
And in the lines below -- either for differnt code-path or to process the remainder -- similar conditions are used. To use count here this value needs bookkeeping.
As this is a Try...-method I think it's better to just return false in case input == null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it provably impossible for end < Vector256<sbyte>.Count if we know it's not null? I know we were already doing end - Vector<sbyte>.Count before, but I'm curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory end < Vector256<sbyte>.Count can be true, but not in practice.
Even if count = 0 then end will be the same as input and that pointer won't reside at addresses lower than 32. Hence there is no real problem.
As it's internal I'll add another assert for this.
| } | ||
|
|
||
| [MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
| public static bool CheckBytesInAsciiRange(Vector256<sbyte> check, Vector256<sbyte> zero) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why is the zero vector passed in as an argument in these methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that the "zero register" can be reused. Otherwise the JIT won't reuse that register and xors a register again.
Although zeroing (xor) a register has no latency, the cpu-frontend must fetch and decode the instruction, so it still has some cost, that can be avoided with this way.
Note: this works only because this method will be inlined.
| [MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
| public static bool CheckBytesInAsciiRange(Vector256<sbyte> check, Vector256<sbyte> zero) | ||
| { | ||
| if (Avx2.IsSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not actually possible for this to be called without Avx2.IsSupported being true, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
Right now I ask myself why I didn't go with a Debug.Assert to make this obvious.
There must be some other code pattern that influenced me to write it that way, but I can't remember.
Also why this method is public and not private.
I'll update this to the be simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, I didn't remember my own comment 😉
#17556 (comment)
That's why it's written that way.
Which way to go? With Debug.Assert or leave as is (plus add a comment explaining this).
| } | ||
|
|
||
| var asciiString = new string('\0', span.Length); | ||
| fixed (byte* source = &MemoryMarshal.GetReference(span)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this perform better than fixed (byte* buffer = span) used previously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes (it safes one length == 0 check).
GetPinnableReference needs to check for empty span, this check is already done above, so we are sure that we have a non-empty span, so can skip the check for empty by using MemoryMarshal.GetReference.
|
|
||
| fixed (char* output = asciiString) | ||
| fixed (byte* buffer = span) | ||
| private static readonly SpanAction<char, IntPtr> s_getHeaderName = GetHeaderName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we move the static field to the top of the class and start it with _ for consistency? I know the convention is different in the runtime repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done -- will be updated with the next push.
|
|
||
| fixed (char* output = asciiString) | ||
| fixed (byte* buffer = span) | ||
| private static readonly SpanAction<char, IntPtr> s_getAsciiStringNonNullCharacters = GetAsciiStringNonNullCharacters; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for s_getHeaderName
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done -- will be updated with the next push.
| output += Vector128<sbyte>.Count; | ||
| } while (input <= end - Vector128<sbyte>.Count); | ||
|
|
||
| if (input == end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only helps if the decodes string happens to have a multiple of 16 characters, right? Seems unlikely to be worth it on average.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a single check avoids the 4 check chain: <= sizeof(long), <= sizeof(int), <= sizeof(char), <= 1; though suppose it adds one more to everything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's right. I tend to keep it, as @benaadams explained, and I think it doesn't harm on input-length that are not multiples of 16 chars long.
| return false; | ||
| } | ||
|
|
||
| if (Bmi2.X64.IsSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GrabYourPitchforks with respect to dotnet/runtime#2251 and your PR dotnet/runtime#31904 should Bmi2 be removed here too?
I don't know what's the current stance on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, BMI2 should be removed if possible. For AMD processors this PR is likely to introduce similar regressions as those we are fixing via dotnet/runtime#2251.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> #18949 to track this.
|
Hello @halter73! Because this pull request has the p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (
|
|
Thanks @gfoidl! |
Description
Addresses #8978
Replaced the mutating of strings with
string.Create. To get cheap invocations of theSpanActions they are cached in static readonly fields, as this is quite faster than by caching via the C#-compiler generated code (it safes the null-check amongst some other boilerplate-code).In
TryGetAsciiStringBenchmarksadded specific code-path for AVX2 and SSE2, as the codegen is much tighter than theVector<T>-method. Further made use ofBmi2.ParallelBitDepositto gain some perf in the non-vectorized path.Also changed the algorithm to exit early on non-ascii values, instead of iterating to the end and then returning true/false.
Benchmarks
Code for benchmarks
BytesToStringBenchmark
Before
After
TryGetAsciiStringBenchmarks
ThisPR_ExitEarlyis the method finally used in this PR.Further benchmarks
GetHeaderName
StringCreateis used in this PR.NewStringis a variant with stackallocing and thennew string(Span<char>), as it avoid the delegate-call. But it's slower here.Default_withImprovedStringUtitlitesis the default version (master-branch), but with the improvedTryGetAsciiStringBenchmarks-method.GetAsciiStringNonNullCharactersBenchmark
StringCreateis used in this PR.Default_withImprovedStringUtitlitesis the default version (master-branch), but with the improvedTryGetAsciiStringBenchmarks-method.GetAsciiOrUTF8StringNonNullCharactersBenchmark
StringCreateis used in this PR.Unfortunately BenchmarkDotNet trims the TestData-string so that the cases aren't clearly seen.
But each "set" of lenght has
üinserted at endüinserted at index 0