New ASCII APIs #75012

adamsitnik · 2022-09-02T18:41:18Z

namespace System.Text;

public static class Ascii
{   
    public static bool IsValid(ReadOnlySpan<byte> value);
    public static bool IsValid(ReadOnlySpan<char> value);

    public static bool IsValid(byte value);
    public static bool IsValid(char value);

    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static OperationStatus ToUpperInPlace(Span<byte> value, out int bytesWritten);
    public static OperationStatus ToUpperInPlace(Span<char> value, out int charsWritten);

    public static OperationStatus ToLowerInPlace(Span<byte> value, out int bytesWritten);
    public static OperationStatus ToLowerInPlace(Span<char> value, out int charsWritten);

    public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus FromUtf16(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static Range Trim(ReadOnlySpan<byte> value);
    public static Range Trim(ReadOnlySpan<char> value);
    public static Range TrimStart(ReadOnlySpan<byte> value);
    public static Range TrimStart(ReadOnlySpan<char> value);
    public static Range TrimEnd(ReadOnlySpan<byte> value);
    public static Range TrimEnd(ReadOnlySpan<char> value);
}

davidfowl · 2022-11-21T04:35:53Z

cc @BrennanConroy

# Conflicts: # src/libraries/System.Private.CoreLib/src/System/Globalization/TextInfo.cs # src/libraries/System.Private.Uri/src/System/DomainNameHelper.cs # src/libraries/System.Private.Uri/src/System/UriHelper.cs

adamsitnik · 2022-12-08T15:11:38Z

@stephentoub the PR is ready, is there any chance you could re-review it?

gfoidl · 2022-12-08T17:39:13Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.CaseConversion.cs

+        private struct ToUpperConversion { }
+        private struct ToLowerConversion { }


Could there be used static abstract interfaces instead?

src/libraries/System.Private.Uri/src/System/DomainNameHelper.cs

stephentoub · 2022-12-08T15:22:51Z

...oreclr/nativeaot/System.Private.CoreLib/src/System/Runtime/InteropServices/PInvokeMarshal.cs

-                    }
-                }
+                OperationStatus conversionStatus = Ascii.FromUtf16(new ReadOnlySpan<char>(pManaged, length), new Span<byte>(pNative, length), out _);
+                Debug.Assert(conversionStatus == OperationStatus.Done);


Not for this PR, but as a potential follow-up, if it'll be common for pNative to be non-NULL and if data suggests it'll be long enough on average, it might be worth trying a fast-path that just does FromUtf16 without first doing IsValid.

src/libraries/System.Net.HttpListener/src/System/Net/HttpListener.cs

stephentoub · 2022-12-08T19:17:22Z

src/libraries/System.Private.CoreLib/src/System/Text/ASCIIEncoding.cs

@@ -188,11 +189,12 @@ private protected sealed override unsafe int GetByteCountFast(char* pChars, int

            if (!(fallback is EncoderReplacementFallback replacementFallback
                && replacementFallback.MaxCharCount == 1
-                && replacementFallback.DefaultString[0] <= 0x7F))
+                && Ascii.IsValid(replacementFallback.DefaultString[0])))


How are we thinking about char.IsAscii(char) vs Ascii.IsValid(char)?
https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Char.cs,33d30f343eda0003,references
Do we need the new char/byte methods at all? Should we have Array.IsValid(int value) instead of both of the char/byte overloads?

Ascii.IsValid(char) was added mostly for completeness and having everything in once place.

Should we have Array.IsValid(int value)

I assume you meant Ascii.IsValid(int)? On the one hand it would be more flexible (users could pass integers as inputs too), on the other I am not sure about codegen differences and whether everyone knows that they can pass byte/char to int-accepting method.

Ascii.IsValid(char) was added mostly for completeness and having everything in once place.

So which should we be using? Should we change all use of char.IsAscii to be Ascii.IsValid?

Please no :) Personally I find char.IsAscii to be more readable.

Then should we replace all Ascii.IsValid(char) with char.IsAscii?

We "just" added char.IsAscii in .NET 6. Now in .NET 8 we're adding an identical Ascii.IsValid(char). I'm trying to rationalize why we have both. I get the consistency argument, but I don't think it's worth consistency just to have a method no one uses.

Hence my question about whether we should just have the one Ascii.IsValid(int), or maybe both IsValid(int) and IsValid(uint). Would there be performance benefits / cons to that? We do have cases today where we check {u}ints for < 0x80, e.g.

runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/IdnMapping.cs

Lines 529 to 531 in 2072f16

private static bool Basic(uint cp) =>

// Is it in ASCII range?

cp < 0x80;

runtime/src/libraries/System.Private.CoreLib/src/System/Text/UnicodeUtility.cs

Line 120 in 57bfe47

public static bool IsAsciiCodePoint(uint value) => value <= 0x7Fu;

runtime/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Transcoding.cs

Lines 647 to 648 in 57bfe47

uint firstByte = pInputBuffer[0];

if (firstByte <= 0x7Fu)

runtime/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Transcoding.cs

Line 1423 in 57bfe47

if (thisChar <= 0x7Fu)

@adamsitnik, opinions?

stephentoub · 2022-12-08T19:25:15Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.CaseConversion.cs

+            where TTo : unmanaged, IBinaryInteger<TTo>
+            where TCasing : struct
+        {
+            if (MemoryMarshal.AsBytes(source).Overlaps(MemoryMarshal.AsBytes(destination)))


Why are the casts to bytes necessary?

TFROM and TTO can be different (char and byte for example) and in theory someone could do sth like this:

runtime/src/libraries/System.Text.Encoding/tests/Ascii/CaseConversionTests.cs

Lines 28 to 29 in c0f38d1

Assert.Throws<InvalidOperationException>(() => Ascii.ToLower(byteBuffer, MemoryMarshal.Cast<byte, char>(byteBuffer), out _));

Assert.Throws<InvalidOperationException>(() => Ascii.ToLower(byteBuffer, MemoryMarshal.Cast<byte, char>(byteBuffer).Slice(1, 3), out _));

Since everything can be casted to bytes I decided to use it and cover all possible scenarios, but to be honest I did it mostly to cover all possible unit testing scenarios rather than expecting someone to actually do it.

stephentoub · 2022-12-08T20:37:06Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Trimming.cs

+                    uint elementValue = uint.CreateTruncating(value[start]);
+                    if ((elementValue > 0x20) || ((TrimMask & (1u << ((int)elementValue - 1))) == 0))


You could make this branch-free with:

uint c = (ushort)(uint.CreateTruncating(value) - '\t'); if ((int)((0xF8000100U << (short)c) & (c - 32)) >= 0)

See

runtime/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

Lines 4599 to 4628 in 2480f01

// Next, handle sets where the high - low + 1 range is <= 32. In that case, we can emit

// a branchless lookup in a uint that does not rely on loading any objects (e.g. the string-based

// lookup we use later). This nicely handles common sets like [\t\r\n ].

if (analysis.OnlyRanges && (analysis.UpperBoundExclusiveIfOnlyRanges - analysis.LowerBoundInclusiveIfOnlyRanges) <= 32)

{

additionalDeclarations.Add("uint charMinusLowUInt32;");

// Create the 32-bit value with 1s at indices corresponding to every character in the set,

// where the bit is computed to be the char value minus the lower bound starting from

// most significant bit downwards.

bool negatedClass = RegexCharClass.IsNegated(charClass);

uint bitmap = 0;

for (int i = analysis.LowerBoundInclusiveIfOnlyRanges; i < analysis.UpperBoundExclusiveIfOnlyRanges; i++)

{

if (RegexCharClass.CharInClass((char)i, charClass) ^ negatedClass)

{

bitmap |= 1u << (31 - (i - analysis.LowerBoundInclusiveIfOnlyRanges));

}

}

// To determine whether a character is in the set, we subtract the lowest char; this subtraction happens before the result is

// zero-extended to uint, meaning that `charMinusLowUInt32` will always have upper 16 bits equal to 0.

// We then left shift the constant with this offset, and apply a bitmask that has the highest

// bit set (the sign bit) if and only if `chExpr` is in the [low, low + 32) range.

// Then we only need to check whether this final result is less than 0: this will only be

// the case if both `charMinusLowUInt32` was in fact the index of a set bit in the constant, and also

// `chExpr` was in the allowed range (this ensures that false positive bit shifts are ignored).

negate ^= negatedClass;

return $"((int)((0x{bitmap:X}U << (short)(charMinusLowUInt32 = (ushort)({chExpr} - {Literal((char)analysis.LowerBoundInclusiveIfOnlyRanges)}))) & (charMinusLowUInt32 - 32)) {(negate ? ">=" : "<")} 0)";

}

for an explanation.

If possible I would prefer to avoid adding further optimizations now, merge the API, add benchmarks to dotnet/performance and then let others tune it.

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.cs

src/libraries/System.Private.Uri/src/System/DomainNameHelper.cs

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

Co-authored-by: Miha Zupan <[email protected]>

jeffhandley

I reviewed this mostly through the lens of how the new methods are used--each usage is a nice improvement in maintainability. I also appreciate the copious comments through these implementations and tests.

dakersnar · 2023-01-05T17:55:37Z

More regressions: dotnet/perf-autofiling-issues#11194
dotnet/perf-autofiling-issues#11132

lewing · 2023-01-10T00:44:09Z

mono interpreter regression dotnet/perf-autofiling-issues#11147
there is likely a wasm regression as well but it won't show up until the benchmarks are fixed there

EgorBo · 2023-01-31T00:53:48Z

It seems this PR still has regressions opened, e.g. these 5 (we don't have a lot of coverage for ToLower so only 5) regressed if you open "full history" tab: dotnet/perf-autofiling-issues#10226

My guess that it's because the SIMD part to change case was a bit more efficient in #78262, e.g.:

static Vector128<sbyte> ToLower(Vector128<sbyte> src)
{
    var lowInd = Vector128.Create((sbyte)63) + src;
    var combInd = Vector128.LessThan(Vector128.Create((sbyte)-103), lowInd);
    return Vector128.AndNot(Vector128.Create((sbyte)0x20), combInd) + src;
}

adamsitnik · 2023-01-31T09:07:53Z

@EgorBo we have an issue that is tracking it: #80245

My guess that it's because the SIMD part to change case was a bit more efficient in

The new methods support 4 different combinations of input and output: byte->byte, byte->char, char->char and char->byte. The generic code is most likely not as optimal as it was for char->char implementation we had:

runtime/src/libraries/System.Private.CoreLib/src/System/Text/Ascii.CaseConversion.cs

Lines 250 to 260 in 1c442fc

    
           TFrom SourceSignedMinValue = TFrom.CreateTruncating(1 << (8 * sizeof(TFrom) - 1)); 
        
           Vector128<TFrom> subtractionVector = Vector128.Create(conversionIsToUpper ? (SourceSignedMinValue + TFrom.CreateTruncating('a')) : (SourceSignedMinValue + TFrom.CreateTruncating('A'))); 
        
           Vector128<TFrom> comparisionVector = Vector128.Create(SourceSignedMinValue + TFrom.CreateTruncating(26 /* A..Z or a..z */)); 
        
           Vector128<TFrom> caseConversionVector = Vector128.Create(TFrom.CreateTruncating(0x20)); // works both directions 
        
           Vector128<TFrom> matches = SignedLessThan((srcVector - subtractionVector), comparisionVector); 
        
           srcVector ^= (matches & caseConversionVector); 
        
           // Now write to the destination. 
        
           ChangeWidthAndWriteTo(srcVector, pDest, 0);

there might be some overhead for pinning the input twice (Solve recently introduce ASCII regression #80709)

GrabYourPitchforks and others added 29 commits July 20, 2022 16:22

Initial ASCII methods

af2b950

Add transcoding APIs

bd2b5f1

Implement Trim

db37d32

Split ASCII utilities into separate files

4a832cc

Add ref asm

e054a01

Fun with case conversion!

36cbfa6

Fun with case conversion!

bba61f5

Updates!

89331d6

Fix incorrect comparison

8333ec8

Fix incorrect precondition checks

841fe3c

Update main vectorized loop

dce2cae

Perf improvements & fix arithmetic error

685b330

tests for Ascii.GetIndexOfFirstNonAsciiByte

4914c65

tests for Ascii.GetIndexOfFirstNonAsciiChar

7204d2d

add tests for Ascii.IsAscii

e3709b7

add tests for Ascii.FromUtf16

fc6db59

add tests for Ascii.ToUtf16

e316789

add tests for Ascii.Trim* and fix bug they have discovered

a5f61b9

ToUpper & ToLower tests

6516ae2

implement the missing pieces for case conversions + fix the tests

6e1ca32

Merge remote-tracking branch 'upstream/main' into asciiAPIs

5c90223

implement TryToLowerInPlace/TryToUpperInPlace

4339af5

implement Ascii.StartsWith* and EndsWith* methods

cc3be10

implement Ascii.Equals* methods

ad4d90b

use self-describing names at a cost of using pragma disable ;)

adc2f53

throw ArgumentException with meaningful error message

aad125a

rename files

f8f98ed

Implement IndexOf and LastIndexOf using narrowing and widening

2b2bcd1

Implement IndexOfIgnoreCase and LastIndexOfIgnoreCase

2d9105e

adamsitnik added the area-System.Buffers label Sep 2, 2022

adamsitnik mentioned this pull request Nov 29, 2022

Create APIs to deal with processing ASCII text (as bytes) #28230

Closed

adamsitnik added 4 commits December 7, 2022 16:26

adjust code after recent API Review

0d69abd

add missing XML docs

bb0a272

Merge remote-tracking branch 'upstream/main' into asciiAPIs

b1f1f07

# Conflicts: # src/libraries/System.Private.CoreLib/src/System/Globalization/TextInfo.cs # src/libraries/System.Private.Uri/src/System/DomainNameHelper.cs # src/libraries/System.Private.Uri/src/System/UriHelper.cs

cleanup

c0f38d1

build-analysis bot mentioned this pull request Dec 8, 2022

SyndicationFeed_Write_RSS_Atom test failing with "IOException : The file '/tmp/' already exists." #78454

Closed

gfoidl reviewed Dec 8, 2022

View reviewed changes

MihaZupan reviewed Dec 8, 2022

View reviewed changes

src/libraries/System.Private.Uri/src/System/DomainNameHelper.cs Outdated Show resolved Hide resolved

stephentoub reviewed Dec 8, 2022

View reviewed changes

adamsitnik and others added 3 commits December 9, 2022 12:23

address code review feedback

da94353

Merge remote-tracking branch 'upstream/main' into asciiAPIs

b975fe1

Update src/libraries/System.Private.Uri/src/System/DomainNameHelper.cs

4483baf

Co-authored-by: Miha Zupan <[email protected]>

adamsitnik requested a review from stephentoub December 12, 2022 09:23

jeffhandley approved these changes Dec 20, 2022

View reviewed changes

adamsitnik merged commit 66e64e5 into dotnet:main Dec 21, 2022

This was referenced Jan 5, 2023

Regressions in System.Tests.Perf_String Case Conversion #80245

Closed

Regressions in System.Text.Perf_Utf8Encoding and System.Text.Tests.Perf_Encoding #80247

Closed

lewing mentioned this pull request Jan 10, 2023

[Perf] Linux/x64: 178 Regressions on 12/21/2022 11:29:15 AM dotnet/perf-autofiling-issues#11147

Closed

This was referenced Jan 10, 2023

Wasm AOT runs not autofiled dotnet/performance#2824

Closed

[Perf] Linux/x64: 77 Regressions on 12/21/2022 11:29:15 AM dotnet/perf-autofiling-issues#11104

Closed

adamsitnik mentioned this pull request Jan 16, 2023

Solve recently introduce ASCII regression #80709

Merged

EgorBo mentioned this pull request Jan 24, 2023

[Perf] Windows/x64: 1 Regression on 1/18/2023 1:01:18 AM dotnet/perf-autofiling-issues#12006

Closed

ghost locked as resolved and limited conversation to collaborators Mar 2, 2023

jeffhandley added the blog-candidate Completed PRs that are candidate topics for blog post coverage label Mar 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New ASCII APIs #75012

New ASCII APIs #75012

adamsitnik commented Sep 2, 2022 •

edited

Loading

davidfowl commented Nov 21, 2022

adamsitnik commented Dec 8, 2022

gfoidl Dec 8, 2022

stephentoub Dec 8, 2022

stephentoub Dec 8, 2022

adamsitnik Dec 9, 2022

stephentoub Jan 4, 2023

MihaZupan Jan 4, 2023

stephentoub Jan 4, 2023 •

edited

Loading

stephentoub Jan 16, 2023

stephentoub Dec 8, 2022

adamsitnik Dec 9, 2022

stephentoub Dec 8, 2022

adamsitnik Dec 9, 2022

jeffhandley left a comment

dakersnar commented Jan 5, 2023 •

edited

Loading

lewing commented Jan 10, 2023

EgorBo commented Jan 31, 2023 •

edited

Loading

adamsitnik commented Jan 31, 2023

		private struct ToUpperConversion { }
		private struct ToLowerConversion { }

	private static bool Basic(uint cp) =>
	// Is it in ASCII range?
	cp < 0x80;

	Assert.Throws<InvalidOperationException>(() => Ascii.ToLower(byteBuffer, MemoryMarshal.Cast<byte, char>(byteBuffer), out _));
	Assert.Throws<InvalidOperationException>(() => Ascii.ToLower(byteBuffer, MemoryMarshal.Cast<byte, char>(byteBuffer).Slice(1, 3), out _));

		uint elementValue = uint.CreateTruncating(value[start]);
		if ((elementValue > 0x20) \|\| ((TrimMask & (1u << ((int)elementValue - 1))) == 0))

	// Next, handle sets where the high - low + 1 range is <= 32. In that case, we can emit
	// a branchless lookup in a uint that does not rely on loading any objects (e.g. the string-based
	// lookup we use later). This nicely handles common sets like [\t\r\n ].
	if (analysis.OnlyRanges && (analysis.UpperBoundExclusiveIfOnlyRanges - analysis.LowerBoundInclusiveIfOnlyRanges) <= 32)
	{
	additionalDeclarations.Add("uint charMinusLowUInt32;");

	// Create the 32-bit value with 1s at indices corresponding to every character in the set,
	// where the bit is computed to be the char value minus the lower bound starting from
	// most significant bit downwards.
	bool negatedClass = RegexCharClass.IsNegated(charClass);
	uint bitmap = 0;
	for (int i = analysis.LowerBoundInclusiveIfOnlyRanges; i < analysis.UpperBoundExclusiveIfOnlyRanges; i++)
	{
	if (RegexCharClass.CharInClass((char)i, charClass) ^ negatedClass)
	{
	bitmap \|= 1u << (31 - (i - analysis.LowerBoundInclusiveIfOnlyRanges));
	}
	}

	// To determine whether a character is in the set, we subtract the lowest char; this subtraction happens before the result is
	// zero-extended to uint, meaning that `charMinusLowUInt32` will always have upper 16 bits equal to 0.
	// We then left shift the constant with this offset, and apply a bitmask that has the highest
	// bit set (the sign bit) if and only if `chExpr` is in the [low, low + 32) range.
	// Then we only need to check whether this final result is less than 0: this will only be
	// the case if both `charMinusLowUInt32` was in fact the index of a set bit in the constant, and also
	// `chExpr` was in the allowed range (this ensures that false positive bit shifts are ignored).
	negate ^= negatedClass;
	return $"((int)((0x{bitmap:X}U << (short)(charMinusLowUInt32 = (ushort)({chExpr} - {Literal((char)analysis.LowerBoundInclusiveIfOnlyRanges)}))) & (charMinusLowUInt32 - 32)) {(negate ? ">=" : "<")} 0)";
	}

New ASCII APIs #75012

New ASCII APIs #75012

Conversation

adamsitnik commented Sep 2, 2022 • edited Loading

davidfowl commented Nov 21, 2022

adamsitnik commented Dec 8, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Jan 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeffhandley left a comment

Choose a reason for hiding this comment

dakersnar commented Jan 5, 2023 • edited Loading

lewing commented Jan 10, 2023

EgorBo commented Jan 31, 2023 • edited Loading

adamsitnik commented Jan 31, 2023

adamsitnik commented Sep 2, 2022 •

edited

Loading

stephentoub Jan 4, 2023 •

edited

Loading

dakersnar commented Jan 5, 2023 •

edited

Loading

EgorBo commented Jan 31, 2023 •

edited

Loading