Fix longstanding regex interpreter bug around lazy loops with empty matches #120872

stephentoub · 2025-10-19T03:01:38Z

This has been an issue in the interpreter forever, as far as I can tell. We've had multiple issues over the years all flagging problems with different symptoms that stem from the same core problem.

Fixes #43314
Fixes #58786
Fixes #63385
Fixes #111051
Fixes #114626

The problem came down to how the regex interpreter handled lazy quantifiers over expressions that can match the empty string. When the interpreter reaches one of these lazy loops, it uses an internal instruction called `Lazybranchmark` to manage entering and potentially looping the subexpression. To keep track of loop state and capture boundaries, the interpreter uses two internal stacks: a grouping stack, which tracks positions relevant to capturing groups (e.g. where a group started), and a backtracking stack, which tracks the state needed if the engine has to go back and try a different match. The bug occurred when the subpattern inside the lazy loop matched the empty string. In that case, the interpreter unconditionally pushed a placeholder onto the grouping stack. If the rest of the pattern then succeeded without backtracking through this loop, that extra placeholder remained on the grouping stack. This polluted the capture bookkeeping: later parts of the pattern popped that placeholder, treated it as a real start position, and shifted captures to the wrong place.
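
To make the failure mode concrete, here is a minimal toy model of the two stacks. It is not the `RegexInterpreter` source: the function names, the tuple layout on the backtracking stack, and the positions used are all hypothetical, and Python is used only for brevity. It shows how an unconditional push for an empty iteration can leave a stray entry that a later capture close mistakes for its start.

```python
# Toy model only: names and structure are hypothetical, not the RegexInterpreter source.

def end_lazy_iteration_buggy(grouping, backtracking, loop_start, pos):
    """Model the old behavior at the end of one lazy-loop iteration."""
    if pos != loop_start:
        # Non-empty iteration: remember how to resume the loop on backtrack.
        backtracking.append(("resume", loop_start, pos))
    else:
        # Empty iteration: a placeholder is pushed unconditionally.
        grouping.append(loop_start)
        backtracking.append(("empty", loop_start))


def close_capture(grouping, pos):
    """Close a capture group; its start is whatever is on top of the grouping stack."""
    start = grouping.pop()
    return (start, pos)


grouping = [0]      # an enclosing capture group opened at position 0
backtracking = []

# The lazy loop body matches the empty string at position 3.
end_lazy_iteration_buggy(grouping, backtracking, loop_start=3, pos=3)

# The rest of the pattern succeeds with no backtracking, so nothing removes
# the placeholder before the enclosing group closes at position 5.
capture = close_capture(grouping, pos=5)

print(capture)   # (3, 5): the group appears to start at 3 instead of 0
print(grouping)  # [0]: the real start position is left behind
```

In this model the placeholder is only ever cleaned up on the backtracking path, which is why a match that succeeds straight through ends up with a shifted capture.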

The fix is to stop pushing onto the grouping stack when the loop matches empty. Instead, the interpreter records two things on the backtracking stack: the old group boundary and a flag indicating whether the grouping stack needs to be popped later. If the interpreter ends up backtracking through this lazy loop, it checks the flag: if a grouping stack entry was added earlier, it pops it; if not, it leaves the grouping stack untouched. This keeps the grouping stack and backtracking stack in sync in both forward and backtracking paths. As a result, empty lazy loops no longer leave stray entries on the grouping stack. This also prevents the unbounded stack growth that previously caused overflows or hangs on some patterns involving nested lazy quantifiers.
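
The approach described above can be sketched with a small toy model. Again, this is not the actual interpreter code: the function names and the `(old_mark, pushed)` tuple are hypothetical, and the assumption that the non-empty branch is the one that pushes a grouping entry is made only so the flag takes both values.

```python
# Toy model only: names and structure are hypothetical, not the RegexInterpreter source.

def end_lazy_iteration(grouping, backtracking, old_mark, loop_start, pos):
    """Forward path at the end of one lazy-loop iteration."""
    if pos != loop_start:
        # Assumed for illustration: a non-empty iteration adds a grouping entry.
        grouping.append(pos)
        backtracking.append((old_mark, True))   # flag: an entry must be popped later
    else:
        # Empty iteration: leave the grouping stack alone, record that fact.
        backtracking.append((old_mark, False))


def backtrack_lazy_iteration(grouping, backtracking):
    """Backtracking path: undo exactly what the forward path did."""
    old_mark, pushed = backtracking.pop()
    if pushed:
        grouping.pop()
    return old_mark


# Forward-only success after an empty iteration: no stray entry remains.
grouping, backtracking = [0], []
end_lazy_iteration(grouping, backtracking, old_mark=0, loop_start=3, pos=3)
capture = (grouping.pop(), 5)    # the enclosing group closes at position 5
print(capture)                   # (0, 5)

# Backtracking through a non-empty iteration pops the entry it added.
grouping, backtracking = [0], []
end_lazy_iteration(grouping, backtracking, old_mark=0, loop_start=3, pos=4)
restored = backtrack_lazy_iteration(grouping, backtracking)
print(grouping, restored)        # [0] 0

# Many empty iterations no longer grow the grouping stack.
grouping, backtracking = [0], []
for _ in range(1000):
    end_lazy_iteration(grouping, backtracking, old_mark=0, loop_start=3, pos=3)
print(len(grouping))             # 1
```

Carrying the flag on the backtracking stack means the undo step never has to guess what the forward step did, which is what keeps the two stacks in sync on both paths.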

Copilot

Pull Request Overview

This PR fixes a longstanding bug in the regex interpreter related to handling lazy quantifiers over expressions that can match empty strings. The bug caused incorrect capture group positioning and potential stack overflow issues when lazy loops matched empty strings.

- Fixed stack management in the regex interpreter to properly handle empty matches in lazy quantifiers
- Updated backtracking logic to track whether grouping stack entries need to be popped
- Added comprehensive test cases covering the various scenarios that previously failed

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| RegexInterpreter.cs | Fixed core bug by modifying stack management logic for lazy quantifiers with empty matches |
| Regex.Match.Tests.cs | Added test cases for lazy loops with empty matches that previously failed |
| Regex.MultipleMatches.Tests.cs | Added test cases for multiple match scenarios with lazy quantifiers |

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

dotnet-policy-service · 2025-10-19T03:02:21Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

tarekgh

Thanks for the detailed description; without it, the fix would be hard to understand!

stephentoub · 2025-10-19T18:46:56Z

/ba-g deadletter

stephentoub requested review from danmoseley and tarekgh October 19, 2025 03:01

stephentoub added the area-System.Text.RegularExpressions label Oct 19, 2025

Copilot AI review requested due to automatic review settings October 19, 2025 03:01

dotnet-policy-service bot assigned stephentoub Oct 19, 2025

Copilot AI reviewed Oct 19, 2025

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

Add another test from another issue

63ec098

build-analysis bot mentioned this pull request Oct 19, 2025

Unable to pull image from mcr.microsoft.com #117164

Open

tarekgh approved these changes Oct 19, 2025

stephentoub merged commit 48a162c into dotnet:main Oct 19, 2025
81 of 85 checks passed

stephentoub deleted the fixinterpreter branch October 19, 2025 18:47

dotnet-maestro bot mentioned this pull request Oct 21, 2025

[main] Source code updates from dotnet/runtime dotnet/dotnet#3024

Merged
