@Copilot Copilot AI commented Oct 20, 2025

  • Analyze the issue and understand the optimization needed
  • Locate relevant code files (RegexGenerator.Emitter.cs, RegexFindOptimizations.cs)
  • Understand the current implementation
  • Implement the optimization in EmitTryFindNextPossibleStartingPosition
  • Add tests to verify the optimization
  • Build and test the changes
  • Run CodeQL security checks
  • Manually verify the optimization works
  • Address feedback: Move computation to emitter to avoid interpreter overhead
  • Add optimization to RegexCompiler.cs for consistency
  • Simplify condition checks and remove unnecessary IL labels
  • Remove unnecessary braces and improve comment accuracy
  • Add comprehensive test variations for edge cases

Issue Summary

Optimize fixed-length regex patterns that have both a leading Beginning anchor (^ or \A) and a trailing End anchor (\z). Instead of checking that the input is at least N characters long, check that the input is exactly N characters long AND the position is 0.

Implementation Details (Updated)

Touched four files, one of which was reverted:

  1. RegexFindOptimizations.cs: Reverted changes - no longer computes TrailingAnchor for leading Beginning anchor
  2. RegexGenerator.Emitter.cs: Compute trailing anchor and max length lazily when emitting LeadingAnchor_LeftToRight_Beginning case
  3. RegexCompiler.cs: Added same optimization to compiled regexes
  4. Test files: Added comprehensive test cases covering various edge cases

The optimization only applies when:

  • LeadingAnchor == RegexNodeKind.Beginning (^ or \A)
  • TrailingAnchor == RegexNodeKind.End (\z, not \Z or $)
  • Pattern has fixed length (MinRequiredLength == MaxPossibleLength)
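The difference between the two checks can be sketched in a few lines. This is only an illustration, not the emitted code: `MinimumLengthCheck` and `ExactLengthCheck` are hypothetical helper names, and `minRequiredLength` stands in for the pattern's fixed length (4 for a pattern such as ^1234\z).

```csharp
using System;

// Hypothetical stand-ins for the prefilter in TryFindNextPossibleStartingPosition;
// the real code is emitted by the source generator and the compiler.

// Previous behavior: the input must be at least minRequiredLength long.
static bool MinimumLengthCheck(ReadOnlySpan<char> inputSpan, int pos, int minRequiredLength) =>
    pos <= inputSpan.Length - minRequiredLength && pos == 0;

// Optimized behavior (only valid for ^ or \A ... \z with a fixed length):
// the input must be exactly minRequiredLength long.
static bool ExactLengthCheck(ReadOnlySpan<char> inputSpan, int pos, int minRequiredLength) =>
    pos == 0 && inputSpan.Length == minRequiredLength;

// "12345" is too long for a 4-character pattern anchored at both ends.
Console.WriteLine(MinimumLengthCheck("12345", 0, 4)); // True: the full match attempt still runs
Console.WriteLine(ExactLengthCheck("12345", 0, 4));   // False: rejected before matching

// "1234" has exactly the right length, so both checks let it through.
Console.WriteLine(MinimumLengthCheck("1234", 0, 4));  // True
Console.WriteLine(ExactLengthCheck("1234", 0, 4));    // True
```

With the exact check, an input of the wrong length is rejected without entering TryMatchAtCurrentPosition at all.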

Changes from Previous Versions

  • Moved the trailing anchor and max length computation from RegexFindOptimizations.cs to the emitters (RegexGenerator.Emitter.cs and RegexCompiler.cs)
  • This avoids adding overhead to the interpreter, as the computation now only happens in compiled and source-generated code
  • Added the optimization to RegexCompiler.cs to keep it in sync with the source generator
  • Simplified condition checks to directly compare ComputeMaxLength() with MinRequiredLength
  • Removed unnecessary success label in RegexCompiler IL emission
  • Removed unnecessary braces around the optimization checks
  • Improved comment accuracy and wording
  • Inlined constant values in IL emission
  • Added comprehensive test variations covering edge cases (no starting anchor, Multiline mode, \Z/$ anchors, content before/after anchors)

Test Results

✅ All unit tests pass (1,005 tests)
✅ All functional tests pass (30,391 tests - includes 56 new tests for edge cases)
✅ No security vulnerabilities detected

Security Summary

No security vulnerabilities were introduced or discovered.

Original prompt

This section details the original issue you should resolve

<issue_title>Earlier fail fast for fixed-length regex patterns with anchors</issue_title>
<issue_description>

Description

When having a regex pattern which contains both leading+trailing anchors and the pattern is of fixed length, it seems to me that we might be missing an opportunity to fail fast earlier.

Take e.g. the regex pattern ^1234$.
Currently the emitted TryFindNextPossibleStartingPosition checks that the input has at least 4 characters to continue.

if (pos <= inputSpan.Length - 4 && pos == 0)

When invoking IsMatch("12345") on the generated regex, the flow will be:

  • pass TryFindNextPossibleStartingPosition as the input is longer than 4 characters,
  • enter TryMatchAtCurrentPosition and call .StartsWith("1234"),
  • and only then fail on 5 < slice.Length because the input was too long.

If we tightened the check in TryFindNextPossibleStartingPosition to "exactly 4 characters", we wouldn't have to enter TryMatchAtCurrentPosition.

If I understand the code correctly, we can detect this situation in EmitTryFindNextPossibleStartingPosition when:

  • rm.Tree.FindOptimizations.LeadingAnchor is RegexNodeKind.Beginning and
  • rm.Tree.FindOptimizations.FindMode is FindNextStartingPositionMode.TrailingAnchor_FixedLength_LeftToRight_End or FindNextStartingPositionMode.TrailingAnchor_FixedLength_LeftToRight_EndZ

Am I missing something?</issue_description>

<agent_instructions>If the pattern starts with a begin anchor, ends with a \z (End) anchor, and has a fixed length (its min length equals its max length), then TryFindNextPossibleStartingPosition can be optimized to return true if and only if the current position is 0 and the length of the input matches the guaranteed match length.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@stephentoub Thanks. There are certainly more special-cases we could optimize for; it's primarily a question of how common they are and whether it's worth the extra code to maintain. Looking through https://github.com/dotnet/runtime-assets/blob/main/src/System.Text.RegularExpressions.TestData/Regex_RealWorldPatterns.json, there are several hundred occurrences of this pattern, so I'd be ok with a PR to try to optimize it further.

That said, keep in mind that $ does not mean "exactly at the end". It actually means "at the end or just before an ending \n", so a length check of exactly 4 would be wrong.

For this specific construction, I think the right answer would be to not emit TryFindNextPossibleStartingPosition at all. You can see in the source generator code there's already special-cases when Scan is emitted to control whether or not we emit TryFindNextPossibleStartingPosition or TryMatchAtCurrentPosition. In this case, I think the right answer would simply be to just emit TryMatchAtCurrentPosition and have Scan just call it.</comment_new>
<comment_new>@stephentoub
Looking at this again, I take back what I said about not emitting TryFindNextPossibleStartingPosition; it has the length check, which we still want, as it quickly rules out any candidates that aren't of the right length. The only difference I think we might want would be for that length check to be an exact length check rather than a minimum, i.e. instead of:

if (pos <= inputSpan.Length - 4 && pos == 0)

it'd be:

if (pos == 0 && inputSpan.Length == 4)

But this would only be viable with a \z anchor, not \Z or $, which as previously noted allow for an optional \n at the end, so the usefulness of this optimization is significantly lessened.</comment_new>
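The anchor distinction raised in these comments can be checked directly with System.Text.RegularExpressions. The snippet below is a small illustration using default options (no RegexOptions.Multiline); the inputs are chosen only for demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

// "$" and "\Z" both tolerate a single trailing '\n', so an input of length 5
// can still match a pattern whose fixed length is 4.
Console.WriteLine(Regex.IsMatch("1234\n", @"^1234$"));  // True
Console.WriteLine(Regex.IsMatch("1234\n", @"^1234\Z")); // True

// "\z" matches only at the very end of the input, so the match length and
// the input length must be equal.
Console.WriteLine(Regex.IsMatch("1234\n", @"^1234\z")); // False
Console.WriteLine(Regex.IsMatch("1234", @"^1234\z"));   // True
```

This is why an exact-length check is safe only when the trailing anchor is \z: with $ or \Z, rejecting every input whose length is not 4 would wrongly rule out "1234\n".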

Fixes #118489


Implement fail-fast optimization for patterns like ^abc\z that have both a leading Beginning anchor and trailing End anchor with fixed length. Instead of checking if input is at least N characters, check if input is exactly N characters AND position is 0.

Co-authored-by: stephentoub <[email protected]>
@Copilot Copilot AI changed the title from "[WIP] Improve fail fast behavior for fixed-length regex patterns" to "Optimize regex patterns with both beginning and end anchors for early fail-fast" Oct 20, 2025
Copilot finished work on behalf of stephentoub October 20, 2025 19:54
@Copilot Copilot AI requested a review from stephentoub October 20, 2025 19:54
- Remove trailing anchor computation from RegexFindOptimizations.cs
- Compute trailing anchor lazily in RegexGenerator.Emitter.cs only when needed
- Add same optimization to RegexCompiler.cs for consistency
This avoids adding overhead to the interpreter while keeping benefits for compiled/generated code

Co-authored-by: stephentoub <[email protected]>
@Copilot Copilot AI requested a review from stephentoub October 20, 2025 21:36
Copilot finished work on behalf of stephentoub October 20, 2025 21:36
@stephentoub stephentoub marked this pull request as ready for review October 22, 2025 19:28
@Copilot Copilot AI review requested due to automatic review settings October 22, 2025 19:28
@Copilot Copilot AI left a comment

Pull Request Overview

This PR optimizes regex patterns with both beginning (^ or \A) and end (\z) anchors where the pattern has a fixed length. The optimization enables earlier fail-fast by checking if the input length exactly matches the pattern length (instead of just checking minimum length), avoiding unnecessary entry into TryMatchAtCurrentPosition when the length doesn't match.

  • Moved trailing anchor computation from RegexFindOptimizations.cs to the emitters (source generator and compiler) to avoid interpreter overhead
  • Added exact length check for patterns with both beginning and end anchors with fixed length
  • Added comprehensive test cases covering exact match, too long, too short, and offset scenarios

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| Regex.Match.Tests.cs | Added test cases verifying the optimization works for various scenarios with fixed-length anchored patterns |
| RegexCompiler.cs | Added optimization logic to compiled regexes for exact length checking when both anchors are present |
| RegexGenerator.Emitter.cs | Added optimization logic to source-generated regexes for exact length checking when both anchors are present |

- Simplify trailing anchor check to directly compare ComputeMaxLength() with MinRequiredLength
- Remove unnecessary success label in RegexCompiler and use Bne(returnFalse) directly
- Move minRequiredLength declaration after the condition check

Co-authored-by: stephentoub <[email protected]>
@Copilot Copilot AI requested a review from stephentoub October 22, 2025 19:57
Copilot finished work on behalf of stephentoub October 22, 2025 19:57
- Remove unnecessary braces around if statements (no longer needed after removing extra local variables)
- Update comment in RegexCompiler.cs to use 'minRequiredLength' instead of 'exactLength'
- Update comment in RegexGenerator.Emitter.cs to mention both '^' and '\A' anchors (^/\A)

Co-authored-by: stephentoub <[email protected]>
Copilot finished work on behalf of stephentoub October 22, 2025 20:44
@Copilot Copilot AI requested a review from stephentoub October 22, 2025 20:44
…variations

- Update comment in RegexGenerator.Emitter.cs to use 'leads with' wording
- Inline MinRequiredLength directly in Ldc call in RegexCompiler.cs
- Add test variations for edge cases: without starting anchor, with Multiline, with \Z/$ anchors, with content before/after anchors

Co-authored-by: stephentoub <[email protected]>
@Copilot Copilot AI requested a review from stephentoub October 23, 2025 03:55
Copilot finished work on behalf of stephentoub October 23, 2025 03:55
@dotnet-policy-service

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@stephentoub stephentoub merged commit f3c990f into main Oct 23, 2025
89 checks passed
@stephentoub stephentoub deleted the copilot/optimize-regex-fail-fast-again branch October 23, 2025 17:54