Skip to content

Conversation

@muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented Nov 1, 2025

Fixes #5797.

Bug analysis

The matcher contains an optimization for empty matches in a loop that has a minimum number of repetitions set. It's based on the observation that when a repetition matched the empty string, the next repetition can just match the empty string again, so the minimum is essentially lifted. This is achieved by simulating minimum number of reps - k - 1 matches, if the empty match was the k-th one.

However, this means we also have fewer remaining repetitions if there is also a maximum number of repetitions set. If these aren't enough, the matching along this trajectory fails.

This isn't much of an issue, though, if the empty match can happen in any repetition: If we must have at most y empty matches, then we will find the solution in the trajectory where the first empty match happens in repetition (min number of reps - y). (A mathematician might say that repetitions matching an empty string must commute with all other repetitions.)

But it's possible to construct empty matches that can only appear when the repetition happens at specific positions in the input string using assertions: ECMAScript's lookahead assertions, but also caret and dollar anchors. Using these constructs and disjunctions inside the repeated expression, we can construct input-position-dependent empty matches. This means that the bug could be encountered in some ECMAScript, POSIX extended and awk regexes.

What about POSIX basic regexes?

I'm not sure. While such regexes have one more context-dependent construct compared to ECMAScript and POSIX extended, namely backreferences (ECMAScript backreferences are not the right kind of context-dependent because they either stay the same throughout the whole evaluation of the loop or are reset at the start of each repetition), these regexes don't have any disjunctions. I have tried to come up with a problematic basic regex, but have failed.

What about simple loops?

Simple loops implement this optimization as well. But I think context-dependent constructs cannot appear in simple loops currently (though maybe in the future). And even if they could appear, it wouldn't matter: Since simple loops are guaranteed to match strings of the same length in each repetition, a simple loop can only ever match empty strings if it matched an empty string once.

Fix

The easiest fix is to just disable this optimization for non-simple loops if a maximum number of repetitions has been set on the loop.

Alternatives

We could separate the repetition counters for minimum and maximum number of repetitions and only increase the counter used for the minimum if an empty match is encountered. Alternatively, we could add some additional loop state variables that encode that the minimum number of repetitions have essentially been lifted. But this would mean that more loop state would have to be stored and managed even for loops that don't gain anything from this optimization. I doubt that all this work is worth it just to enable this optimization in a few more cases.

Doubts about this optimization

This is second time this optimization is a source of bugs in the matcher (see #5218 (comment) with fix in #5494 after it became clear that this can cause some wrong assignments to capturing groups in ECMAScript as well) because it is applied too aggressively. I can't guarantee that we will not discover another bug caused by it. It also seems to optimize a weird corner case: What's the point of setting the minimum number of repetitions on a loop if the repeated pattern can match the empty string in all repetitions? So I will probably just pull it out if we stumble on another bug that is caused by this optimization.

@muellerj2 muellerj2 requested a review from a team as a code owner November 1, 2025 19:49
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Nov 1, 2025
@StephanTLavavej StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels Nov 1, 2025
@StephanTLavavej StephanTLavavej self-assigned this Nov 1, 2025
@StephanTLavavej StephanTLavavej removed their assignment Nov 3, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Nov 3, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Nov 4, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 3f29c70 into microsoft:main Nov 5, 2025
41 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Nov 5, 2025
@StephanTLavavej
Copy link
Member

🐞 🛠️ 😸

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working regex meow is a substring of homeowner

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

<regex>: Loops with bounded number of repetitions and context-dependent empty alternative are mishandled

2 participants