<regex>: Fix matching of loops with context-dependent empty matches and bounded number of repetitions
#5820
+15
−3
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #5797.
Bug analysis
The matcher contains an optimization for empty matches in a loop that has a minimum number of repetitions set. It's based on the observation that when a repetition matched the empty string, the next repetition can just match the empty string again, so the minimum is essentially lifted. This is achieved by simulating minimum number of reps - k - 1 matches, if the empty match was the k-th one.
However, this means we also have fewer remaining repetitions if there is also a maximum number of repetitions set. If these aren't enough, the matching along this trajectory fails.
This isn't much of an issue, though, if the empty match can happen in any repetition: If we must have at most y empty matches, then we will find the solution in the trajectory where the first empty match happens in repetition (min number of reps - y). (A mathematician might say that repetitions matching an empty string must commute with all other repetitions.)
But it's possible to construct empty matches that can only appear when the repetition happens at specific positions in the input string using assertions: ECMAScript's lookahead assertions, but also caret and dollar anchors. Using these constructs and disjunctions inside the repeated expression, we can construct input-position-dependent empty matches. This means that the bug could be encountered in some ECMAScript, POSIX extended and awk regexes.
What about POSIX basic regexes?
I'm not sure. While such regexes have one more context-dependent construct compared to ECMAScript and POSIX extended, namely backreferences (ECMAScript backreferences are not the right kind of context-dependent because they either stay the same throughout the whole evaluation of the loop or are reset at the start of each repetition), these regexes don't have any disjunctions. I have tried to come up with a problematic basic regex, but have failed.
What about simple loops?
Simple loops implement this optimization as well. But I think context-dependent constructs cannot appear in simple loops currently (though maybe in the future). And even if they could appear, it wouldn't matter: Since simple loops are guaranteed to match strings of the same length in each repetition, a simple loop can only ever match empty strings if it matched an empty string once.
Fix
The easiest fix is to just disable this optimization for non-simple loops if a maximum number of repetitions has been set on the loop.
Alternatives
We could separate the repetition counters for minimum and maximum number of repetitions and only increase the counter used for the minimum if an empty match is encountered. Alternatively, we could add some additional loop state variables that encode that the minimum number of repetitions have essentially been lifted. But this would mean that more loop state would have to be stored and managed even for loops that don't gain anything from this optimization. I doubt that all this work is worth it just to enable this optimization in a few more cases.
Doubts about this optimization
This is second time this optimization is a source of bugs in the matcher (see #5218 (comment) with fix in #5494 after it became clear that this can cause some wrong assignments to capturing groups in ECMAScript as well) because it is applied too aggressively. I can't guarantee that we will not discover another bug caused by it. It also seems to optimize a weird corner case: What's the point of setting the minimum number of repetitions on a loop if the repeated pattern can match the empty string in all repetitions? So I will probably just pull it out if we stumble on another bug that is caused by this optimization.