Skip to content

Conversation

@stephentoub
Copy link
Member

@stephentoub stephentoub commented Oct 19, 2025

This has been an issue in the interpreter forever, as far as I can tell. We've had multiple issues over the years all flagging problems with different symptoms that stem from the same core problem.

Fixes #43314
Fixes #58786
Fixes #63385
Fixes #111051
Fixes #114626

The problem came down to how the regex interpreter handled lazy quantifiers over expressions that can match the empty string. When the interpreter reaches one of these lazy loops, it uses an internal instruction called Lazybranchmark to manage entering and potentially looping the subexpression. To keep track of loop state and capture boundaries, the interpreter uses two internal stacks, a grouping stack, that tracks positions relevant to capturing groups (e.g. where a group started), and a backtracking stack, that tracks states that are needed if the engine has to go back and try a different match. The bug occurred in the case when the subpattern inside the lazy loop matches nothing. In this case, the interpreter unconditionally pushed a placeholder onto the grouping stack. If the rest of the pattern then succeeded without backtracking through this loop, that extra placeholder remained on the grouping stack. This polluted the capture bookkeeping: later parts of the pattern popped that placeholder, treating it as a real start position, and shifted captures to the wrong place.

The fix is to stop pushing onto the grouping stack when the loop matches empty. Instead, the interpreter records two things on the backtracking stack: the old group boundary and a flag indicating whether the grouping stack needs to be popped later. If the interpreter ends up backtracking through this lazy loop, it checks the flag: if a grouping stack entry was added earlier, it pops it; if not, it leaves the grouping stack untouched. This keeps the grouping stack and backtracking stack in sync in both forward and backtracking paths. As a result, empty lazy loops no longer leave stray entries on the grouping stack. This also prevents the unbounded stack growth that previously caused overflows or hangs on some patterns involving nested lazy quantifiers.

This has been an issue in the interpreter forever, as far as I can tell. We've had multiple issues over the years all flagging problems with different symptoms that stem from the same core problem.

The problem came down to how the regex interpreter handled lazy quantifiers over expressions that can match the empty string. When the interpreter reaches one of these lazy loops, it uses an internal instruction called `Lazybranchmark` to manage entering and potentially looping the subexpression. To keep track of loop state and capture boundaries, the interpreter uses two internal stacks, a grouping stack, that tracks positions relevant to capturing groups (e.g. where a group started), and a backtracking stack, that tracks states that are needed if the engine has to go back and try a different match. The bug occurred in the case when the subpattern inside the lazy loop matches nothing. In this case, the interpreter unconditionally pushed a placeholder onto the grouping stack. If the rest of the pattern then succeeded without backtracking through this loop, that extra placeholder remained on the grouping stack. This polluted the capture bookkeeping: later parts of the pattern popped that placeholder, treating it as a real start position, and shifted captures to the wrong place.

The fix is to stop pushing onto the grouping stack when the loop matches empty. Instead, the interpreter records two things on the backtracking stack: the old group boundary and a flag indicating whether the grouping stack needs to be popped later. If the interpreter ends up backtracking through this lazy loop, it checks the flag: if a grouping stack entry was added earlier, it pops it; if not, it leaves the grouping stack untouched. This keeps the grouping stack and backtracking stack in sync in both forward and backtracking paths. As a result, empty lazy loops no longer leave stray entries on the grouping stack. This also prevents the unbounded stack growth that previously caused overflows or hangs on some patterns involving nested lazy quantifiers.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a longstanding bug in the regex interpreter related to handling lazy quantifiers over expressions that can match empty strings. The bug caused incorrect capture group positioning and potential stack overflow issues when lazy loops matched empty strings.

  • Fixed stack management in the regex interpreter to properly handle empty matches in lazy quantifiers
  • Updated backtracking logic to track whether grouping stack entries need to be popped
  • Added comprehensive test cases covering the various scenarios that previously failed

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
RegexInterpreter.cs Fixed core bug by modifying stack management logic for lazy quantifiers with empty matches
Regex.Match.Tests.cs Added test cases for lazy loops with empty matches that previously failed
Regex.MultipleMatches.Tests.cs Added test cases for multiple match scenarios with lazy quantifiers

@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Copy link
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed description, without it would be hard to understand the fix!

@stephentoub
Copy link
Member Author

/ba-g deadletter

@stephentoub stephentoub merged commit 48a162c into dotnet:main Oct 19, 2025
81 of 85 checks passed
@stephentoub stephentoub deleted the fixinterpreter branch October 19, 2025 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment