Avoid extra boundary checks when preceded/succeeded char set is known #118105
If we statically know by construction that what comes before or after a \b is guaranteed to be a word char, then we can avoid half the run-time checks.
This also tweaks the source-generated implementation of IsBoundaryWordChar to avoid an extra branch on every check. It currently delegates to IsWordChar and, if that returns false, checks whether the char is one of the two joiner characters (U+200C and U+200D) that are also part of the boundary set. Instead, this duplicates the IsWordChar implementation (which is just a couple of lines once the helpers are separated out into their own members), so that for ASCII the additional check isn't necessary. The implementations used by the interpreter and RegexCompiler already do this.
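The IsBoundaryWordChar tweak can be sketched as follows. This is a minimal illustration assembled from the helper bodies shown in the generated-source diffs on this page, not the generator's actual output; the class name BoundaryWordCharSketch and the method name IsBoundaryWordCharDelegating are hypothetical, and the `is ... or` pattern assumes C# 9 or later.

```csharp
using System;
using System.Globalization;

static class BoundaryWordCharSketch
{
    // Mask of Unicode categories that combine to form [\w].
    private const int WordCategoriesMask =
        1 << (int)UnicodeCategory.UppercaseLetter |
        1 << (int)UnicodeCategory.LowercaseLetter |
        1 << (int)UnicodeCategory.TitlecaseLetter |
        1 << (int)UnicodeCategory.ModifierLetter |
        1 << (int)UnicodeCategory.OtherLetter |
        1 << (int)UnicodeCategory.NonSpacingMark |
        1 << (int)UnicodeCategory.DecimalDigitNumber |
        1 << (int)UnicodeCategory.ConnectorPunctuation;

    // Bitmap for whether each character 0 through 127 is in [\w].
    private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
    {
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
        0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
    };

    // ASCII goes through the bitmap; everything else queries the Unicode category.
    internal static bool IsWordChar(char ch)
    {
        ReadOnlySpan<byte> ascii = WordCharBitmap;
        int chDiv8 = ch >> 3;
        return (uint)chDiv8 < (uint)ascii.Length ?
            (ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
            (WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0;
    }

    // Previous shape (hypothetical name): an ASCII non-word char such as ' '
    // still falls through to the joiner comparison after IsWordChar returns false.
    internal static bool IsBoundaryWordCharDelegating(char ch) =>
        IsWordChar(ch) || (ch == '\u200C' | ch == '\u200D');

    // New shape: the joiner comparison lives only on the non-ASCII path,
    // so the ASCII path is a single bitmap lookup.
    internal static bool IsBoundaryWordChar(char ch)
    {
        ReadOnlySpan<byte> ascii = WordCharBitmap;
        int chDiv8 = ch >> 3;
        return (uint)chDiv8 < (uint)ascii.Length ?
            (ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
            ((WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0) || (ch is '\u200C' or '\u200D');
    }
}
```

Both shapes should classify every char identically; the difference is only in where the joiner check sits relative to the ASCII fast path.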
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
Pull Request Overview
This PR optimizes word boundary (\b) checking in regular expressions by avoiding redundant runtime checks when the preceding or succeeding character is statically known to be a word character. The optimization reduces boundary check overhead by introducing specialized methods that only validate one side of the boundary when the other side is guaranteed.
Key changes:
- Adds specialized boundary checking methods that skip half the validation when one side is known
- Introduces static analysis to determine when characters are guaranteed to be word characters
- Refactors the source generator to use these optimized boundary checks and improve code structure
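A rough sketch of how the one-sided checks relate to the full boundary check. IsBoundary and IsPostWordCharBoundary follow the bodies shown in the generated-source diff in this thread; the body of IsPreWordCharBoundary is an assumption inferred by symmetry from its name, and IsBoundaryWordChar here is a simplified stand-in that does not reproduce the exact \w set. The class name HalfBoundarySketch is hypothetical.

```csharp
using System;

static class HalfBoundarySketch
{
    // Simplified stand-in (assumption): approximates \w plus U+200C and U+200D.
    internal static bool IsBoundaryWordChar(char ch) =>
        char.IsLetterOrDigit(ch) || ch == '_' || ch == '\u200C' || ch == '\u200D';

    // Full check: classifies the chars on both sides of the index and compares them.
    internal static bool IsBoundary(ReadOnlySpan<char> inputSpan, int index)
    {
        int indexMinus1 = index - 1;
        return ((uint)indexMinus1 < (uint)inputSpan.Length && IsBoundaryWordChar(inputSpan[indexMinus1])) !=
               ((uint)index < (uint)inputSpan.Length && IsBoundaryWordChar(inputSpan[index]));
    }

    // Previous char is already known to be a word char: only the next char matters.
    internal static bool IsPostWordCharBoundary(ReadOnlySpan<char> inputSpan, int index) =>
        (uint)index >= (uint)inputSpan.Length || !IsBoundaryWordChar(inputSpan[index]);

    // Next char is already known to be a word char: only the previous char matters.
    // Assumed body, mirrored from IsPostWordCharBoundary.
    internal static bool IsPreWordCharBoundary(ReadOnlySpan<char> inputSpan, int index) =>
        (uint)(index - 1) >= (uint)inputSpan.Length || !IsBoundaryWordChar(inputSpan[index - 1]);
}
```

For a pattern ending in \d\b, the \d guarantees that the char before the boundary is a word char, so the emitted code can call IsPostWordCharBoundary instead of IsBoundary and skip classifying the preceding char.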
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| RegexRunner.cs | Adds IsPreWordCharBoundary and IsPostWordCharBoundary methods for optimized boundary checking |
| RegexNode.cs | Implements IsKnownPrecededByWordChar and IsKnownSuccededByWordChar methods for static analysis |
| RegexPrefixAnalyzer.cs | Makes FindFirstOrLastCharClass method public to support the new static analysis |
| RegexCompiler.cs | Updates boundary emission to use optimized methods when applicable |
| RegexGenerator.Emitter.cs | Refactors helper generation with improved structure and adds support for new boundary methods |
@MihuBot regexdiff
3865 out of 18857 patterns have generated source code changes.
Examples of GeneratedRegex source diffs
"{\\s*(?<P>\\D\\w*)\\s*\\:\\s*var\\(\\s*(?<B> ..." (9881 uses)
[GeneratedRegex("{\\s*(?<P>\\D\\w*)\\s*\\:\\s*var\\(\\s*(?<B>\\D\\w*)\\s*\\)\\s*(;\\s*(?<P>\\D\\w*)\\s*\\:\\s*var\\(\\s*(?<B>\\D\\w*)\\s*\\)\\s*\\s*)*}")]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsWordChar(char ch)
{
- // Mask of Unicode categories that combine to form [\w]
- const int WordCategoriesMask =
- 1 << (int)UnicodeCategory.UppercaseLetter |
- 1 << (int)UnicodeCategory.LowercaseLetter |
- 1 << (int)UnicodeCategory.TitlecaseLetter |
- 1 << (int)UnicodeCategory.ModifierLetter |
- 1 << (int)UnicodeCategory.OtherLetter |
- 1 << (int)UnicodeCategory.NonSpacingMark |
- 1 << (int)UnicodeCategory.DecimalDigitNumber |
- 1 << (int)UnicodeCategory.ConnectorPunctuation;
-
- // Bitmap for whether each character 0 through 127 is in [\w]
- ReadOnlySpan<byte> ascii = new byte[]
- {
- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
- 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
- };
-
// If the char is ASCII, look it up in the bitmap. Otherwise, query its Unicode category.
+ ReadOnlySpan<byte> ascii = WordCharBitmap;
int chDiv8 = ch >> 3;
return (uint)chDiv8 < (uint)ascii.Length ?
(ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
StackPush(ref stack, ref pos, arg0, arg1, arg2);
}
}
+
+ /// <summary>Provides a mask of Unicode categories that combine to form [\w].</summary>
+ private const int WordCategoriesMask =
+ 1 << (int)UnicodeCategory.UppercaseLetter |
+ 1 << (int)UnicodeCategory.LowercaseLetter |
+ 1 << (int)UnicodeCategory.TitlecaseLetter |
+ 1 << (int)UnicodeCategory.ModifierLetter |
+ 1 << (int)UnicodeCategory.OtherLetter |
+ 1 << (int)UnicodeCategory.NonSpacingMark |
+ 1 << (int)UnicodeCategory.DecimalDigitNumber |
+ 1 << (int)UnicodeCategory.ConnectorPunctuation;
+
+ /// <summary>Gets a bitmap for whether each character 0 through 127 is in [\w]</summary>
+ private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
+ {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
+ 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
+ };
+
}
}
"[A-z-[dDfFiIoOqQuUwWzZ]]\\d[A-z-[dDfFiIoOqQu ..." (5703 uses)
[GeneratedRegex("[A-z-[dDfFiIoOqQuUwWzZ]]\\d[A-z-[dDfFiIoOqQuU]] *\\d[A-z-[dDfFiIoOqQuU]]\\d\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
}
// Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos + 6))
+ if (!Utilities.IsPostWordCharBoundary(inputSpan, pos + 6))
{
return false; // The input didn't match.
}
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Determines whether the specified index is a boundary.</summary>
+ /// <summary>Determines whether the specified index is a boundary word character.</summary>
+ /// <remarks>This is the same as \w plus U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER.</remarks>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
- internal static bool IsBoundary(ReadOnlySpan<char> inputSpan, int index)
+ internal static bool IsBoundaryWordChar(char ch)
{
- int indexMinus1 = index - 1;
- return ((uint)indexMinus1 < (uint)inputSpan.Length && IsBoundaryWordChar(inputSpan[indexMinus1])) !=
- ((uint)index < (uint)inputSpan.Length && IsBoundaryWordChar(inputSpan[index]));
-
- static bool IsBoundaryWordChar(char ch) => IsWordChar(ch) || (ch == '\u200C' | ch == '\u200D');
+ ReadOnlySpan<byte> ascii = WordCharBitmap;
+ int chDiv8 = ch >> 3;
+ return (uint)chDiv8 < (uint)ascii.Length ?
+ (ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
+ ((WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0) || (ch is '\u200C' or '\u200D');
}
- /// <summary>Determines whether the character is part of the [\w] set.</summary>
+ /// <summary>Determines whether the specified index is a boundary.</summary>
+ /// <remarks>This variant is only employed when the previous character has already been validated as a word character.</remarks>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
- internal static bool IsWordChar(char ch)
- {
- // Mask of Unicode categories that combine to form [\w]
- const int WordCategoriesMask =
- 1 << (int)UnicodeCategory.UppercaseLetter |
- 1 << (int)UnicodeCategory.LowercaseLetter |
- 1 << (int)UnicodeCategory.TitlecaseLetter |
- 1 << (int)UnicodeCategory.ModifierLetter |
- 1 << (int)UnicodeCategory.OtherLetter |
- 1 << (int)UnicodeCategory.NonSpacingMark |
- 1 << (int)UnicodeCategory.DecimalDigitNumber |
- 1 << (int)UnicodeCategory.ConnectorPunctuation;
+ internal static bool IsPostWordCharBoundary(ReadOnlySpan<char> inputSpan, int index) =>
+ ((uint)index >= (uint)inputSpan.Length || !IsBoundaryWordChar(inputSpan[index]));
- // Bitmap for whether each character 0 through 127 is in [\w]
- ReadOnlySpan<byte> ascii = new byte[]
+ /// <summary>Provides a mask of Unicode categories that combine to form [\w].</summary>
+ private const int WordCategoriesMask =
+ 1 << (int)UnicodeCategory.UppercaseLetter |
+ 1 << (int)UnicodeCategory.LowercaseLetter |
+ 1 << (int)UnicodeCategory.TitlecaseLetter |
+ 1 << (int)UnicodeCategory.ModifierLetter |
+ 1 << (int)UnicodeCategory.OtherLetter |
+ 1 << (int)UnicodeCategory.NonSpacingMark |
+ 1 << (int)UnicodeCategory.DecimalDigitNumber |
+ 1 << (int)UnicodeCategory.ConnectorPunctuation;
+
+ /// <summary>Gets a bitmap for whether each character 0 through 127 is in [\w]</summary>
+ private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
{
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- // If the char is ASCII, look it up in the bitmap. Otherwise, query its Unicode category.
- int chDiv8 = ch >> 3;
- return (uint)chDiv8 < (uint)ascii.Length ?
- (ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
- (WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0;
- }
/// <summary>Supports searching for characters in or not in "ABCEGHJKLMNPRSTVXY[\\]^_`abceghjklmnprstvxyK".</summary>
internal static readonly SearchValues<char> s_nonAscii_0DD9414ACADF36B5FCB9FD5EDD16B6170F356585861BFF97C0F99F5B6EB09472 = SearchValues.Create("ABCEGHJKLMNPRSTVXY[\\]^_`abceghjklmnprstvxyK");
"^\\w+([_.-]\\w+)*$" (5006 uses)
[GeneratedRegex("^\\w+([_.-]\\w+)*$", RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture)]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsWordChar(char ch)
{
- // Mask of Unicode categories that combine to form [\w]
- const int WordCategoriesMask =
- 1 << (int)UnicodeCategory.UppercaseLetter |
- 1 << (int)UnicodeCategory.LowercaseLetter |
- 1 << (int)UnicodeCategory.TitlecaseLetter |
- 1 << (int)UnicodeCategory.ModifierLetter |
- 1 << (int)UnicodeCategory.OtherLetter |
- 1 << (int)UnicodeCategory.NonSpacingMark |
- 1 << (int)UnicodeCategory.DecimalDigitNumber |
- 1 << (int)UnicodeCategory.ConnectorPunctuation;
-
- // Bitmap for whether each character 0 through 127 is in [\w]
- ReadOnlySpan<byte> ascii = new byte[]
- {
- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
- 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
- };
-
// If the char is ASCII, look it up in the bitmap. Otherwise, query its Unicode category.
+ ReadOnlySpan<byte> ascii = WordCharBitmap;
int chDiv8 = ch >> 3;
return (uint)chDiv8 < (uint)ascii.Length ?
(ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
StackPush(ref stack, ref pos, arg0, arg1);
}
}
+
+ /// <summary>Provides a mask of Unicode categories that combine to form [\w].</summary>
+ private const int WordCategoriesMask =
+ 1 << (int)UnicodeCategory.UppercaseLetter |
+ 1 << (int)UnicodeCategory.LowercaseLetter |
+ 1 << (int)UnicodeCategory.TitlecaseLetter |
+ 1 << (int)UnicodeCategory.ModifierLetter |
+ 1 << (int)UnicodeCategory.OtherLetter |
+ 1 << (int)UnicodeCategory.NonSpacingMark |
+ 1 << (int)UnicodeCategory.DecimalDigitNumber |
+ 1 << (int)UnicodeCategory.ConnectorPunctuation;
+
+ /// <summary>Gets a bitmap for whether each character 0 through 127 is in [\w]</summary>
+ private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
+ {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
+ 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
+ };
+
}
}
"^(\\w*)=(.*?)" (3778 uses)
[GeneratedRegex("^(\\w*)=(.*?)")]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsWordChar(char ch)
{
- // Mask of Unicode categories that combine to form [\w]
- const int WordCategoriesMask =
- 1 << (int)UnicodeCategory.UppercaseLetter |
- 1 << (int)UnicodeCategory.LowercaseLetter |
- 1 << (int)UnicodeCategory.TitlecaseLetter |
- 1 << (int)UnicodeCategory.ModifierLetter |
- 1 << (int)UnicodeCategory.OtherLetter |
- 1 << (int)UnicodeCategory.NonSpacingMark |
- 1 << (int)UnicodeCategory.DecimalDigitNumber |
- 1 << (int)UnicodeCategory.ConnectorPunctuation;
-
- // Bitmap for whether each character 0 through 127 is in [\w]
- ReadOnlySpan<byte> ascii = new byte[]
- {
- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
- 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
- };
-
// If the char is ASCII, look it up in the bitmap. Otherwise, query its Unicode category.
+ ReadOnlySpan<byte> ascii = WordCharBitmap;
int chDiv8 = ch >> 3;
return (uint)chDiv8 < (uint)ascii.Length ?
(ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
(WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0;
}
+
+ /// <summary>Provides a mask of Unicode categories that combine to form [\w].</summary>
+ private const int WordCategoriesMask =
+ 1 << (int)UnicodeCategory.UppercaseLetter |
+ 1 << (int)UnicodeCategory.LowercaseLetter |
+ 1 << (int)UnicodeCategory.TitlecaseLetter |
+ 1 << (int)UnicodeCategory.ModifierLetter |
+ 1 << (int)UnicodeCategory.OtherLetter |
+ 1 << (int)UnicodeCategory.NonSpacingMark |
+ 1 << (int)UnicodeCategory.DecimalDigitNumber |
+ 1 << (int)UnicodeCategory.ConnectorPunctuation;
+
+ /// <summary>Gets a bitmap for whether each character 0 through 127 is in [\w]</summary>
+ private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
+ {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
+ 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
+ };
+
}
}
"^(\\w+\\.)+\\w+$" (2468 uses)
[GeneratedRegex("^(\\w+\\.)+\\w+$")]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsWordChar(char ch)
{
- // Mask of Unicode categories that combine to form [\w]
- const int WordCategoriesMask =
- 1 << (int)UnicodeCategory.UppercaseLetter |
- 1 << (int)UnicodeCategory.LowercaseLetter |
- 1 << (int)UnicodeCategory.TitlecaseLetter |
- 1 << (int)UnicodeCategory.ModifierLetter |
- 1 << (int)UnicodeCategory.OtherLetter |
- 1 << (int)UnicodeCategory.NonSpacingMark |
- 1 << (int)UnicodeCategory.DecimalDigitNumber |
- 1 << (int)UnicodeCategory.ConnectorPunctuation;
-
- // Bitmap for whether each character 0 through 127 is in [\w]
- ReadOnlySpan<byte> ascii = new byte[]
- {
- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
- 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
- };
-
// If the char is ASCII, look it up in the bitmap. Otherwise, query its Unicode category.
+ ReadOnlySpan<byte> ascii = WordCharBitmap;
int chDiv8 = ch >> 3;
return (uint)chDiv8 < (uint)ascii.Length ?
(ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
StackPush(ref stack, ref pos, arg0, arg1);
}
}
+
+ /// <summary>Provides a mask of Unicode categories that combine to form [\w].</summary>
+ private const int WordCategoriesMask =
+ 1 << (int)UnicodeCategory.UppercaseLetter |
+ 1 << (int)UnicodeCategory.LowercaseLetter |
+ 1 << (int)UnicodeCategory.TitlecaseLetter |
+ 1 << (int)UnicodeCategory.ModifierLetter |
+ 1 << (int)UnicodeCategory.OtherLetter |
+ 1 << (int)UnicodeCategory.NonSpacingMark |
+ 1 << (int)UnicodeCategory.DecimalDigitNumber |
+ 1 << (int)UnicodeCategory.ConnectorPunctuation;
+
+ /// <summary>Gets a bitmap for whether each character 0 through 127 is in [\w]</summary>
+ private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
+ {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
+ 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
+ };
+
}
}
"{(?<env>env:)??\\w+(\\s+(\\?\\?)??\\s+\\w+)??}" (2282 uses)
[GeneratedRegex("{(?<env>env:)??\\w+(\\s+(\\?\\?)??\\s+\\w+)??}")]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsWordChar(char ch)
{
- // Mask of Unicode categories that combine to form [\w]
- const int WordCategoriesMask =
- 1 << (int)UnicodeCategory.UppercaseLetter |
- 1 << (int)UnicodeCategory.LowercaseLetter |
- 1 << (int)UnicodeCategory.TitlecaseLetter |
- 1 << (int)UnicodeCategory.ModifierLetter |
- 1 << (int)UnicodeCategory.OtherLetter |
- 1 << (int)UnicodeCategory.NonSpacingMark |
- 1 << (int)UnicodeCategory.DecimalDigitNumber |
- 1 << (int)UnicodeCategory.ConnectorPunctuation;
-
- // Bitmap for whether each character 0 through 127 is in [\w]
- ReadOnlySpan<byte> ascii = new byte[]
- {
- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
- 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
- };
-
// If the char is ASCII, look it up in the bitmap. Otherwise, query its Unicode category.
+ ReadOnlySpan<byte> ascii = WordCharBitmap;
int chDiv8 = ch >> 3;
return (uint)chDiv8 < (uint)ascii.Length ?
(ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
StackPush(ref stack, ref pos, arg0, arg1, arg2);
}
}
+
+ /// <summary>Provides a mask of Unicode categories that combine to form [\w].</summary>
+ private const int WordCategoriesMask =
+ 1 << (int)UnicodeCategory.UppercaseLetter |
+ 1 << (int)UnicodeCategory.LowercaseLetter |
+ 1 << (int)UnicodeCategory.TitlecaseLetter |
+ 1 << (int)UnicodeCategory.ModifierLetter |
+ 1 << (int)UnicodeCategory.OtherLetter |
+ 1 << (int)UnicodeCategory.NonSpacingMark |
+ 1 << (int)UnicodeCategory.DecimalDigitNumber |
+ 1 << (int)UnicodeCategory.ConnectorPunctuation;
+
+ /// <summary>Gets a bitmap for whether each character 0 through 127 is in [\w]</summary>
+ private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
+ {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
+ 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
+ };
+
}
}
", Version=\\d+.\\d+.\\d+.\\d+, Culture=\\w+, ..." (2239 uses)
[GeneratedRegex(", Version=\\d+.\\d+.\\d+.\\d+, Culture=\\w+, PublicKeyToken=\\w+")]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsWordChar(char ch)
{
- // Mask of Unicode categories that combine to form [\w]
- const int WordCategoriesMask =
- 1 << (int)UnicodeCategory.UppercaseLetter |
- 1 << (int)UnicodeCategory.LowercaseLetter |
- 1 << (int)UnicodeCategory.TitlecaseLetter |
- 1 << (int)UnicodeCategory.ModifierLetter |
- 1 << (int)UnicodeCategory.OtherLetter |
- 1 << (int)UnicodeCategory.NonSpacingMark |
- 1 << (int)UnicodeCategory.DecimalDigitNumber |
- 1 << (int)UnicodeCategory.ConnectorPunctuation;
-
- // Bitmap for whether each character 0 through 127 is in [\w]
- ReadOnlySpan<byte> ascii = new byte[]
- {
- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
- 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
- };
-
// If the char is ASCII, look it up in the bitmap. Otherwise, query its Unicode category.
+ ReadOnlySpan<byte> ascii = WordCharBitmap;
int chDiv8 = ch >> 3;
return (uint)chDiv8 < (uint)ascii.Length ?
(ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
(WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0;
}
+ /// <summary>Provides a mask of Unicode categories that combine to form [\w].</summary>
+ private const int WordCategoriesMask =
+ 1 << (int)UnicodeCategory.UppercaseLetter |
+ 1 << (int)UnicodeCategory.LowercaseLetter |
+ 1 << (int)UnicodeCategory.TitlecaseLetter |
+ 1 << (int)UnicodeCategory.ModifierLetter |
+ 1 << (int)UnicodeCategory.OtherLetter |
+ 1 << (int)UnicodeCategory.NonSpacingMark |
+ 1 << (int)UnicodeCategory.DecimalDigitNumber |
+ 1 << (int)UnicodeCategory.ConnectorPunctuation;
+
+ /// <summary>Gets a bitmap for whether each character 0 through 127 is in [\w]</summary>
+ private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
+ {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
+ 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
+ };
+
+
/// <summary>Supports searching for the string ", Version=".</summary>
internal static readonly SearchValues<string> s_indexOfString_F484FBA9DDF61CC32D17E4ED223128BF4D7C62347668A9B369CE2C1E6BBB3513 = SearchValues.Create([", Version="], StringComparison.Ordinal);
}
"^-+ *BEGIN (?<keyName>\\w+( \\w+)*) PRIVATE ..." (1964 uses)
[GeneratedRegex("^-+ *BEGIN (?<keyName>\\w+( \\w+)*) PRIVATE KEY *-+\\r?\\n(Proc-Type: 4,ENCRYPTED\\r?\\nDEK-Info: (?<cipherName>[A-Z0-9-]+),(?<salt>[A-F0-9]+)\\r?\\n\\r?\\n)?(?<data>([a-zA-Z0-9/+=]{1,80}\\r?\\n)+)-+ *END \\k<keyName> PRIVATE KEY *-+", RegexOptions.Multiline)]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsWordChar(char ch)
{
- // Mask of Unicode categories that combine to form [\w]
- const int WordCategoriesMask =
- 1 << (int)UnicodeCategory.UppercaseLetter |
- 1 << (int)UnicodeCategory.LowercaseLetter |
- 1 << (int)UnicodeCategory.TitlecaseLetter |
- 1 << (int)UnicodeCategory.ModifierLetter |
- 1 << (int)UnicodeCategory.OtherLetter |
- 1 << (int)UnicodeCategory.NonSpacingMark |
- 1 << (int)UnicodeCategory.DecimalDigitNumber |
- 1 << (int)UnicodeCategory.ConnectorPunctuation;
-
- // Bitmap for whether each character 0 through 127 is in [\w]
- ReadOnlySpan<byte> ascii = new byte[]
- {
- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
- 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
- };
-
// If the char is ASCII, look it up in the bitmap. Otherwise, query its Unicode category.
+ ReadOnlySpan<byte> ascii = WordCharBitmap;
int chDiv8 = ch >> 3;
return (uint)chDiv8 < (uint)ascii.Length ?
(ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
}
}
+ /// <summary>Provides a mask of Unicode categories that combine to form [\w].</summary>
+ private const int WordCategoriesMask =
+ 1 << (int)UnicodeCategory.UppercaseLetter |
+ 1 << (int)UnicodeCategory.LowercaseLetter |
+ 1 << (int)UnicodeCategory.TitlecaseLetter |
+ 1 << (int)UnicodeCategory.ModifierLetter |
+ 1 << (int)UnicodeCategory.OtherLetter |
+ 1 << (int)UnicodeCategory.NonSpacingMark |
+ 1 << (int)UnicodeCategory.DecimalDigitNumber |
+ 1 << (int)UnicodeCategory.ConnectorPunctuation;
+
+ /// <summary>Gets a bitmap for whether each character 0 through 127 is in [\w]</summary>
+ private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
+ {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
+ 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
+ };
+
+
/// <summary>Supports searching for characters in or not in "0123456789ABCDEF".</summary>
internal static readonly SearchValues<char> s_asciiHexDigitsUpper = SearchValues.Create("0123456789ABCDEF");
"&(?!#?\\w+;)" (1880 uses)
[GeneratedRegex("&(?!#?\\w+;)")]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsWordChar(char ch)
{
- // Mask of Unicode categories that combine to form [\w]
- const int WordCategoriesMask =
- 1 << (int)UnicodeCategory.UppercaseLetter |
- 1 << (int)UnicodeCategory.LowercaseLetter |
- 1 << (int)UnicodeCategory.TitlecaseLetter |
- 1 << (int)UnicodeCategory.ModifierLetter |
- 1 << (int)UnicodeCategory.OtherLetter |
- 1 << (int)UnicodeCategory.NonSpacingMark |
- 1 << (int)UnicodeCategory.DecimalDigitNumber |
- 1 << (int)UnicodeCategory.ConnectorPunctuation;
-
- // Bitmap for whether each character 0 through 127 is in [\w]
- ReadOnlySpan<byte> ascii = new byte[]
- {
- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
- 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
- };
-
// If the char is ASCII, look it up in the bitmap. Otherwise, query its Unicode category.
+ ReadOnlySpan<byte> ascii = WordCharBitmap;
int chDiv8 = ch >> 3;
return (uint)chDiv8 < (uint)ascii.Length ?
(ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
(WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0;
}
+
+ /// <summary>Provides a mask of Unicode categories that combine to form [\w].</summary>
+ private const int WordCategoriesMask =
+ 1 << (int)UnicodeCategory.UppercaseLetter |
+ 1 << (int)UnicodeCategory.LowercaseLetter |
+ 1 << (int)UnicodeCategory.TitlecaseLetter |
+ 1 << (int)UnicodeCategory.ModifierLetter |
+ 1 << (int)UnicodeCategory.OtherLetter |
+ 1 << (int)UnicodeCategory.NonSpacingMark |
+ 1 << (int)UnicodeCategory.DecimalDigitNumber |
+ 1 << (int)UnicodeCategory.ConnectorPunctuation;
+
+ /// <summary>Gets a bitmap for whether each character 0 through 127 is in [\w]</summary>
+ private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
+ {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
+ 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
+ };
+
}
}
"\\A\\s*(?<name>\\w+)(\\s*\\((?<arguments>.*) ..." (1751 uses)
[GeneratedRegex("\\A\\s*(?<name>\\w+)(\\s*\\((?<arguments>.*)\\))?\\s*\\Z", RegexOptions.Singleline)]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsWordChar(char ch)
{
- // Mask of Unicode categories that combine to form [\w]
- const int WordCategoriesMask =
- 1 << (int)UnicodeCategory.UppercaseLetter |
- 1 << (int)UnicodeCategory.LowercaseLetter |
- 1 << (int)UnicodeCategory.TitlecaseLetter |
- 1 << (int)UnicodeCategory.ModifierLetter |
- 1 << (int)UnicodeCategory.OtherLetter |
- 1 << (int)UnicodeCategory.NonSpacingMark |
- 1 << (int)UnicodeCategory.DecimalDigitNumber |
- 1 << (int)UnicodeCategory.ConnectorPunctuation;
-
- // Bitmap for whether each character 0 through 127 is in [\w]
- ReadOnlySpan<byte> ascii = new byte[]
- {
- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
- 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
- };
-
// If the char is ASCII, look it up in the bitmap. Otherwise, query its Unicode category.
+ ReadOnlySpan<byte> ascii = WordCharBitmap;
int chDiv8 = ch >> 3;
return (uint)chDiv8 < (uint)ascii.Length ?
(ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
StackPush(ref stack, ref pos, arg0, arg1, arg2);
}
}
+
+ /// <summary>Provides a mask of Unicode categories that combine to form [\w].</summary>
+ private const int WordCategoriesMask =
+ 1 << (int)UnicodeCategory.UppercaseLetter |
+ 1 << (int)UnicodeCategory.LowercaseLetter |
+ 1 << (int)UnicodeCategory.TitlecaseLetter |
+ 1 << (int)UnicodeCategory.ModifierLetter |
+ 1 << (int)UnicodeCategory.OtherLetter |
+ 1 << (int)UnicodeCategory.NonSpacingMark |
+ 1 << (int)UnicodeCategory.DecimalDigitNumber |
+ 1 << (int)UnicodeCategory.ConnectorPunctuation;
+
+ /// <summary>Gets a bitmap for whether each character 0 through 127 is in [\w]</summary>
+ private static ReadOnlySpan<byte> WordCharBitmap => new byte[]
+ {
+ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
+ 0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
+ };
+
}
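}
For more diff examples, see https://gist.github.com/MihuBot/3dc0c347ab5ededb4c479718266d90f0
JIT assembly changes
For a list of JIT diff regressions, see Regressions.md
Sample source code for further analysis
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Text.RegularExpressions;
const string JsonPath = "RegexResults-1303.json";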
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E2rQ5ESA");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}