Reduce allocations in VersionConverter #55564
Tagging subscribers to this area: @eiriktsarpalis, @layomia
...ies/System.Text.Json/src/System/Text/Json/Serialization/Converters/Value/VersionConverter.cs
int maxCharCount = JsonReaderHelper.s_utf8Encoding.GetMaxCharCount(source.Length);
char[]? pooledChars = null;
Span<char> charBuffer = maxCharCount * sizeof(char) <= JsonConstants.StackallocThreshold
    ? stackalloc char[maxCharCount]
I believe the guidance is to just use JsonConstants.StackallocThreshold for the size, since the JIT can optimize a constant-size allocation when calling the method.
In that case Encoding.GetChars() may be unable to write the whole byte source into the char buffer. Indeed, the string representation of a valid Version object is at most 43 chars, i.e. 86 bytes, which is always lower than the current JsonConstants.StackallocThreshold value.
But if we only want string representations of valid Versions to be converted to chars successfully, shouldn't we just use the max length of a valid Version string * sizeof(char) (86 bytes), so that we stackalloc even less than JsonConstants.StackallocThreshold?
If you know that the maximum length is x bytes or chars for any specific data type, it's fine to hardcode that value here. It'll result in somewhat more efficient stack usage than relying on the fallback const.
We only know that the string of a valid Version coming from the CLR is at most 43 chars long. Incoming data, however, can be of any length, or it can contain whitespace as pointed out below, which can potentially cause an exception when getting chars from the encoding if the char buffer can't hold all the data from the byte source. For the netstandard2.0 target no such exception will be thrown, since the buffer is sized by the max char count of the byte source rather than by the max length of a version.
string source = $"{int.MaxValue}.{int.MaxValue}.{int.MaxValue}.{int.MaxValue}";
var s_utf8Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
byte[] ut8Bytes = s_utf8Encoding.GetBytes(source);
Span<char> chars = stackalloc char[source.Length - 1]; // Buffer too small to contain all incoming data
s_utf8Encoding.GetChars(ut8Bytes, chars);
//The line above fails with unhandled exception
//Unhandled exception. System.ArgumentException:
//The output char buffer is too small to contain the decoded characters,
//encoding 'Unicode (UTF-8)' fallback 'System.Text.DecoderExceptionFallback'. (Parameter 'chars')
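The buffer strategy under discussion can be sketched roughly as follows: size the buffer from GetMaxCharCount, use the stack when that is small, and fall back to ArrayPool otherwise, so that GetChars never sees a destination that is too small. This is only a sketch, not the PR's code. Utf8VersionDecoder, TryDecodeAndParse and the 128-char threshold are assumptions made for illustration, and it uses Encoding.UTF8 rather than the throwing encoding instance used inside System.Text.Json. It also performs no JSON-level validation.

```csharp
#nullable enable
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper for illustration; not part of System.Text.Json.
static class Utf8VersionDecoder
{
    // Assumed threshold (in chars) below which the buffer lives on the stack.
    private const int StackallocCharThreshold = 128;

    public static bool TryDecodeAndParse(ReadOnlySpan<byte> utf8, out Version? version)
    {
        // Upper bound on the number of chars the decoder can produce for this input.
        int maxCharCount = Encoding.UTF8.GetMaxCharCount(utf8.Length);

        char[]? pooled = null;
        // Constant-size stackalloc for small inputs, rented array for large ones.
        Span<char> buffer = maxCharCount <= StackallocCharThreshold
            ? stackalloc char[StackallocCharThreshold]
            : (pooled = ArrayPool<char>.Shared.Rent(maxCharCount));
        try
        {
            // The buffer is always at least maxCharCount long, so this cannot
            // fail with "output char buffer is too small".
            int written = Encoding.UTF8.GetChars(utf8, buffer);
            return Version.TryParse(buffer.Slice(0, written), out version);
        }
        finally
        {
            if (pooled != null)
            {
                ArrayPool<char>.Shared.Return(pooled);
            }
        }
    }
}
```

With this shape, oversized input simply takes the pooled path and fails in Version.TryParse instead of throwing from GetChars.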
#if BUILDING_INBOX_LIBRARY
ReadOnlySpan<byte> source = reader.HasValueSequence ? reader.ValueSequence.ToArray() : reader.ValueSpan;
If you're accessing the raw bytes and performing the conversion to chars yourself, this could introduce a security vulnerability because it could allow invalid JSON through the system. Version.TryParse accepts a wider range of characters than the JSON spec allows.
@layomia - You'll recognize this as the same feedback I gave for the recent TimeSpan optimizations. This is now the second PR I've seen that uses this pattern.
Layomi, please schedule some time to audit all the optimizations that were checked in as part of the 6.0 wave and ensure this type of pattern was not actually committed anywhere.
"Version.TryParse accepts a wider range of characters than the JSON spec allows"
Example? Is this about whitespace?
Not just whitespace, but nulls and other weirdness.
string s = "\n10\0.\v20\t\0\0.\r\n30\0"; // lots of invalid JSON
Version v = new Version(s);
Console.WriteLine(v); // prints 10.20.30
Edit: It's very much in the same spirit as the warning comment I put on string.ReplaceLineEndings. Methods like int.Parse, Version.Parse, and others are meant to parse human-readable representations of these values, not protocol-compliant representations of these values. It's up to the caller to ensure that the payload follows the requirements of whatever protocol is being parsed. The earlier version of this code forced the creation of a string, which meant it would ride atop the JSON stack's existing implicit validation logic. The new version bypasses this, going down to the raw bytes, which means that the caller (this routine) is now also taking responsibility for protocol validation.
"Not just whitespace, but nulls and other weirdness"
This is what I meant by whitespace. By default, Int32.{Try}Parse allows a subset of what char.IsWhitespace considers valid, and also ignores trailing nulls. Version.TryParse just inherits that.
What, specifically, is the security vulnerability concern with not having validation fail because of such characters? E.g. how would that manifest as an attack?
We make a guarantee that using default deserialization parameters and built-in converters, the incoming JSON payload is fully validated before a full object graph can be hydrated. This guarantee exists to provide support for heterogeneous environments. For instance: you have a frontend performing an initial pass over the input data, including basic consistency checks and request authentication; then that payload is forwarded to a backend for real processing. The frontend should not allow non-compliant data to reach the backend, as the backend might not be resilient against processing that data, or it might store the data in a manner that corrupts the database. (There were vulnerabilities in DasBlog, aspnet 4.x, and Exchange due to two parsers in a heterogeneous environment interpreting the same payload in different manners.)
One quasi-by-design hole in the System.Text.Json stack is that string data is not validated until the string itself is materialized. For most parsers and converters this behavior is just fine, as they get the string (which validates!), then parse it to create the final result. But in the case of hyper-optimized converters like the one under discussion here, they're dropping down to the raw byte level to get at the underlying data and avoid string materialization, which can suppress the normal correctness guarantees we make for the user.
In the proposed version I added a fail-fast path for strings that contain any escaping.
If escaped string data can still be a valid Version, then this failure should be removed and replaced with unescaping whenever the string has any escaping. That way the byte data would follow basically the same path as a call to reader.GetString, except that it ends with Encoding.GetChars instead of Encoding.GetString. However, using the invalid JSON example you presented earlier, I can't reproduce the validation on materialization that you say the string performs.
string source = "\n10\0.\v20\t\0\0.\r\n30\0"; // lots of invalid JSON
var s_utf8Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
byte[] ut8Bytes = s_utf8Encoding.GetBytes(source);
string convertedString = s_utf8Encoding.GetString(ut8Bytes);
Span<char> chars = stackalloc char[source.Length];
s_utf8Encoding.GetChars(ut8Bytes, chars);
Console.WriteLine(chars.SequenceEqual(convertedString) && chars.SequenceEqual(source)); // Prints true
Version versionFromSource = new Version(source);
Version versionFromConvertedStr = new Version(convertedString);
Version versionFromChars = Version.Parse(chars);
Console.WriteLine(versionFromSource.Equals(versionFromConvertedStr) && versionFromSource.Equals(versionFromChars)
&& versionFromConvertedStr.Equals(versionFromChars)); // Prints true
@GrabYourPitchforks, thanks, I get all that, but I'm not clear how that's applicable to this specific case. We're not talking about taking unvalidated json and storing it somewhere that expects validated json; we're talking about ending up with a well-formed Version object that discarded the unnecessary whitespace/nulls. I'm not saying that's not divergent from the json spec, but I'm also not seeing how it results in a vulnerability.
For consistency, we're taking the approach of disallowing all leading or trailing trivia across all the primitive types that we support across the STJ stack. This involves implementing all the necessary validation prior to calling an underlying lower-level API that does the actual data parsing, e.g. Utf8Parser. This has been a manual process repeated for each supported data type, and we'll do the same for this PR.
"please schedule some time to audit all the optimizations that were checked in as part of the 6.0 wave"
@GrabYourPitchforks the TimeSpan PRs (#54186, #55350) and this PR are the only candidates for .NET 6 where we process data types at this level, so I think these code reviews might suffice.
Here are some somewhat related PRs in case something jumps out:
Thanks for the PR. The implementation needs to support parsing escaped data.
namespace System.Text.Json.Serialization.Converters
{
    internal sealed class VersionConverter : JsonConverter<Version>
    {
        public override Version Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
        {
            if (reader._stringHasEscaping)
            {
                ThrowHelper.ThrowJsonException();
We should be able to parse escaped data for all data types. You could use #54186 (comment) and commits on that PR (as well as #55350) as a reference
Fixed this in 84ab81a. Also added checks that the first and last chars are digits, and ported these checks to netstandard2.0.
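As a rough illustration of the kind of check described here (not the code from commit 84ab81a), the sketch below rejects input whose first or last char is not an ASCII digit before handing it to Version.TryParse. The VersionTriviaCheck class and the TryParseStrict name are hypothetical. The check only covers leading and trailing trivia; characters between components would need separate validation.

```csharp
#nullable enable
using System;

// Hypothetical helper for illustration; not the converter code from the PR.
static class VersionTriviaCheck
{
    // ASCII-only digit test; char.IsDigit would also accept other Unicode digits.
    private static bool IsAsciiDigit(char c) => (uint)(c - '0') <= 9;

    public static bool TryParseStrict(ReadOnlySpan<char> text, out Version? version)
    {
        version = null;

        // Reject empty input and anything that starts or ends with a
        // non-digit, such as whitespace or a null character.
        if (text.IsEmpty || !IsAsciiDigit(text[0]) || !IsAsciiDigit(text[text.Length - 1]))
        {
            return false;
        }

        // Only now defer to the lenient, human-oriented parser.
        return Version.TryParse(text, out version);
    }
}
```

Running the cheap boundary check first keeps the lenient parser from ever seeing padded input.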
Force-pushed: 07b6a9a to 84ab81a
Hey @N0D4N, we're going to move this to 7.0.0 if that's ok. I expect to circle back and review this after we are done with pending 6.0.0 work.
@eiriktsarpalis, it's alright with me. The only issue I see: later we would need to make sure that behavior of refactored converter's
Hi @N0D4N,
Hi @eiriktsarpalis. Can you be more specific about which PR feedback you are talking about? I'm pretty sure that I've dealt with all PR feedback:
Though I'd need to update PR branch with commits from
...ies/System.Text.Json/src/System/Text/Json/Serialization/Converters/Value/VersionConverter.cs
    throw ThrowHelper.GetFormatException(DataType.Version);
}

Span<byte> stackSpan = stackalloc byte[isEscaped ? MaximumEscapedVersionLength : MaximumVersionLength];
Could maxLength be used here?
I followed the same approach as in TimeSpanConverter:
Lines 24 to 38 in 30226bf
int maximumLength = isEscaped ? MaximumEscapedTimeSpanFormatLength : MaximumTimeSpanFormatLength;
ReadOnlySpan<byte> source = stackalloc byte[0];
if (reader.HasValueSequence)
{
    ReadOnlySequence<byte> valueSequence = reader.ValueSequence;
    long sequenceLength = valueSequence.Length;
    if (!JsonHelpers.IsInRangeInclusive(sequenceLength, MinimumTimeSpanFormatLength, maximumLength))
    {
        throw ThrowHelper.GetFormatException(DataType.TimeSpan);
    }
    Span<byte> stackSpan = stackalloc byte[isEscaped ? MaximumEscapedTimeSpanFormatLength : MaximumTimeSpanFormatLength];
As far as I know, the JIT can optimize a stackalloc of a constant number of bytes, but I'm not sure it can do the same with maxLength: it can only be one of two const values, but it isn't itself a const.
cc @GrabYourPitchforks do you reckon the JIT could be more optimal with the second pattern (N) here? maxLength could only be a const value.
…ssert successful formatting on write.
…one more test case that should fail
Force-pushed: d861306 to 11efcbd
Rebased on top of
Added When you commit this breaking change:
Tagging @dotnet/compat for awareness of the breaking change.
@N0D4N we want to throw when we have leading/trailing white space, just like with
cc @dotnet/area-system-text-json thoughts on the breaking change highlighted in #55564 (comment)? I support it since it's an edge case, it's early in the release, and there's an easy workaround (custom converter with the previous implementation).
I favor consistency here, so if we are throwing in the majority of places already then we should throw here as well. Also, from a perf perspective, it is faster to write a parser that doesn't have to check for such whitespace, so I think that should be the default at least. Although technically breaking, I think it would be very rare to have leading/trailing whitespace within the quotes. The workaround, as mentioned, is to add a custom converter that allows it.
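The workaround mentioned here could look roughly like the sketch below: a custom converter that materializes the string and defers to Version.TryParse, so leading or trailing whitespace inside the quotes is tolerated. LenientVersionConverter is a hypothetical name, and this is an approximation of the earlier string-based behavior, not the exact previous implementation.

```csharp
#nullable enable
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical opt-in converter that keeps the lenient, string-based parsing.
public sealed class LenientVersionConverter : JsonConverter<Version>
{
    public override Version? Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
    {
        // GetString() unescapes and validates the JSON string token.
        string? text = reader.GetString();

        // Version.TryParse tolerates whitespace around each component.
        if (text is null || !Version.TryParse(text, out Version? version))
        {
            throw new JsonException();
        }

        return version;
    }

    public override void Write(Utf8JsonWriter writer, Version value, JsonSerializerOptions options)
    {
        writer.WriteStringValue(value.ToString());
    }
}
```

Adding an instance to JsonSerializerOptions.Converters makes it take precedence over the built-in converter for Version.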
@layomia I'd support it as well.
… it to .NetStandard2.0 target.
Should I open an issue at dotnet/docs that describes the breaking change in this PR?
@N0D4N feel free to add to dotnet/docs#26292. I've also mailed the
Thanks for your feedback, everyone!
Resolves #55179
I didn't find any info on creating custom benchmarks to measure the performance of VersionConverter for the current implementation of the .NET runtime and for a custom build with the proposed changes, so I would much appreciate it if you could point me to docs on how to do it, so that I can create and run a benchmark and attach its results here.
Few notes on implementation:
reader.TokenType being string: probably we can add this check. The only problem is that ThrowHelper.GetInvalidOperationException_ExpectedString returns the exception instead of throwing it, so we would have a throw statement directly in the Read method, but I guess that's fine since this method is overridden and won't be inlined anyway;
Write method: the proposed one is more fragile to changes inside the Version class, since it needs to know what size of buffer to allocate and is therefore more dependent on the concrete implementation, such as Version having only 4 components and the components being Int32. Additional Debug.Asserts could probably be added so that they fail in case of breaking changes inside the Version class, but unfortunately I cannot think of any.