Alternation match regression #2884
Comments
In some rare cases, it was possible for ripgrep's inner literal detector to extract a set of literals that could produce a false negative. #2884 gives an example: `(?i:e.x|ex)`. In this case, the set extracted can be discovered by running `rg '(?i:e.x|ex)' --trace`: `Seq[E("EX"), E("Ex"), E("eX"), E("ex")]`

This extraction leads to building a multi-substring matcher for `EX`, `Ex`, `eX` and `ex`. Searching the haystack `e-x` with that matcher finds nothing, and thus ripgrep shows no matches. But the regex `(?i:e.x|ex)` matches `e-x`.

The issue at play here was that when two extracted literal sequences were unioned, we were incorrectly unioning their "prefix" attribute. This in turn led to those literal sequences being combined incorrectly via cross product. This case in particular triggers it because two different optimizations combine to produce an incorrect result. Firstly, the regex has a common prefix extracted and is rewritten as `(?i:e(?:.x|x))`. Secondly, the `x` in the first branch of the alternation has its `prefix` attribute set to `false` (correctly), which means it can't be cross producted with another concatenation. But in this case, it is unioned with the `x` from the second branch, and this results in the union result having `prefix` set to `true`. This in turn propagates up and lets it get cross producted with the `e` prefix, producing an incorrect literal sequence.

We fix this by changing the implementation of `union` to return `prefix` set to `true` only when *both* literal sequences being unioned have `prefix` set to `true`.

Doing this exposed a second bug that was present, but was purely cosmetic: the extracted literals in this case, after the fix, are `X` and `x`. They were considered "exact" (i.e., lead to a match), but of course they are not. Observing an `X` or an `x` does not mean there is a match. This was fixed by making `choose` always return an inexact literal sequence. This is perhaps too conservative in aggregate in some cases, but it is always correct.

The idea here is that if one is choosing between two concatenations, then it is likely the case that the sequence returned should be considered inexact. The downside is that this can lead to avoiding cross products in some cases that would otherwise be correct. This is bad because it means extracting shorter literals in some cases. (In general, the longer the literal the better.) But we prioritize correctness for now and accept that cost. You can see a few tests where this shortens some extracted literals.

Fixes #2884
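To make the `prefix` interaction concrete, here is a minimal toy model in Rust. It is only a sketch: `LitSeq`, `union_buggy`, `union_fixed` and `cross` are hypothetical names invented for illustration and are not ripgrep's actual types or functions. In particular, the buggy union is assumed to set `prefix` when *either* side has it set, which is one way to reproduce the behavior described above, not necessarily what the old code did.

```rust
/// Toy stand-in for an extracted literal sequence (hypothetical, not ripgrep's type).
#[derive(Clone, Debug, PartialEq)]
struct LitSeq {
    lits: Vec<String>,
    /// True only if every literal is known to start exactly where its
    /// sub-expression starts, so it is safe to glue onto a preceding literal.
    prefix: bool,
}

fn seq(lits: &[&str], prefix: bool) -> LitSeq {
    LitSeq { lits: lits.iter().map(|s| s.to_string()).collect(), prefix }
}

fn merge(a: &LitSeq, b: &LitSeq) -> Vec<String> {
    let mut lits = a.lits.clone();
    lits.extend(b.lits.iter().cloned());
    lits.sort();
    lits.dedup();
    lits
}

/// ASSUMPTION: models the bug as "prefix if either side is a prefix".
fn union_buggy(a: &LitSeq, b: &LitSeq) -> LitSeq {
    LitSeq { lits: merge(a, b), prefix: a.prefix || b.prefix }
}

/// The fix described above: prefix only when *both* sides are prefixes.
fn union_fixed(a: &LitSeq, b: &LitSeq) -> LitSeq {
    LitSeq { lits: merge(a, b), prefix: a.prefix && b.prefix }
}

/// Cross product of `a` followed by `b`; refused when `b` is not a prefix,
/// because something else may sit between the two literals.
fn cross(a: &LitSeq, b: &LitSeq) -> Option<LitSeq> {
    if !b.prefix {
        return None;
    }
    let mut lits = Vec::new();
    for x in &a.lits {
        for y in &b.lits {
            lits.push(format!("{x}{y}"));
        }
    }
    Some(LitSeq { lits, prefix: a.prefix })
}

fn main() {
    // (?i:e(?:.x|x)): the common prefix `e`, then a union of two branches.
    let e = seq(&["E", "e"], true);
    let x_after_dot = seq(&["X", "x"], false); // from `.x`: not a prefix
    let x_alone = seq(&["X", "x"], true); // from `x`: a prefix

    let bad = cross(&e, &union_buggy(&x_after_dot, &x_alone));
    println!("buggy union: {:?}", bad.map(|s| s.lits));

    let good = cross(&e, &union_fixed(&x_after_dot, &x_alone));
    println!("fixed union: {:?}", good.map(|s| s.lits));
}
```

With the buggy union, the cross product goes ahead and yields the four two-letter literals from the trace output. With the fixed union, the cross product is refused, so the extractor has to fall back to shorter literals.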
Excellent find. When you find weird behaviors like this where small perturbations to the regex make the bug go away, it almost always points toward literal extraction. In this case, the bug was in ripgrep's inner literal extractor (which isn't in the regex engine), and that only runs when the regex engine believes its own optimizations "aren't great." And there are tons of layers here. In this case, multiple optimizations and heuristics have to come together to produce an incorrect result. This is also made clearer in ripgrep's trace logging, which shows the literals it extracted which are clearly incorrect:
Specifically, this bit:
Clearly, this regex won't match the input, and thus ripgrep reports a false negative. I fixed this in #2885.
This should be fixed in the 14.1.1 release.
Thank you @BurntSushi!
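As a rough illustration of why a wrong literal set turns into a false negative, the sketch below models a literal prefilter in Rust using only the standard library. All of it is hypothetical: `regex_matches` is a hand-written, ASCII-only stand-in for `(?i:e.x|ex)`, and `prefilter_passes` and `search` are made-up names, not ripgrep or `grep-regex` APIs.

```rust
/// Hand-written, ASCII-only stand-in for the regex `(?i:e.x|ex)`.
fn regex_matches(line: &str) -> bool {
    let b = line.as_bytes().to_ascii_lowercase();
    // `ex`
    let ex = b.windows(2).any(|w| w[0] == b'e' && w[1] == b'x');
    // `e.x`, where `.` is any byte other than a newline
    let e_dot_x = b
        .windows(3)
        .any(|w| w[0] == b'e' && w[1] != b'\n' && w[2] == b'x');
    ex || e_dot_x
}

/// A line is only handed to the regex if it contains one of the literals.
fn prefilter_passes(line: &str, literals: &[&str]) -> bool {
    literals.iter().any(|lit| line.contains(*lit))
}

/// Prefilter first, regex second. A literal set that is too strict makes
/// this return false for lines the regex would have matched.
fn search(line: &str, literals: &[&str]) -> bool {
    prefilter_passes(line, literals) && regex_matches(line)
}

fn main() {
    let line = "e-x";
    let wrong = ["EX", "Ex", "eX", "ex"]; // literals from the bug report
    let right = ["X", "x"]; // literals after the fix

    println!("regex alone:       {}", regex_matches(line));
    println!("with wrong lits:   {}", search(line, &wrong));
    println!("with correct lits: {}", search(line, &right));
}
```

The regex itself accepts `e-x`, but with the four two-letter literals the prefilter rejects the line before the regex ever runs. With `X` and `x` the prefilter lets the line through and the match is reported.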
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep) | patch | `14.1.0` -> `14.1.1` |

MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot).

**Proposed changes to behavior should be submitted there as MRs.**

---

### Release Notes

<details>
<summary>BurntSushi/ripgrep (BurntSushi/ripgrep)</summary>

### [`v14.1.1`](https://github.com/BurntSushi/ripgrep/blob/HEAD/CHANGELOG.md#1411-2024-09-08)

[Compare Source](BurntSushi/ripgrep@14.1.0...14.1.1)

This is a minor release with a bug fix for a matching bug. In particular, a bug was found that could cause ripgrep to ignore lines that should match. That is, false negatives. It is difficult to characterize the specific set of regexes in which this occurs as it requires multiple different optimization strategies to collide and produce an incorrect result. But as one reported example, in ripgrep, the regex `(?i:e.x|ex)` does not match `e-x` when it should. (This bug is a result of an inner literal optimization performed in the `grep-regex` crate and not in the `regex` crate.)

Bug fixes:

- [BUG #2884](BurntSushi/ripgrep#2884): Fix bug where ripgrep could miss some matches that it should report.

Miscellaneous:

- [MISC #2748](BurntSushi/ripgrep#2748): Remove ripgrep's `simd-accel` feature because it was frequently broken.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this MR and you won't be reminded about this update again.
What version of ripgrep are you using?
14.1.0
How did you install ripgrep?
Cargo
What operating system are you using ripgrep on?
Fedora 40
Describe your bug.
ripgrep fails to return some matches when case is ignored. This only happens with very particular matches and specific characters.
A git bisect seems to suggest this is a regression introduced by ca740d9.
What are the steps to reproduce the behavior?
Run the following:
What is the actual behavior?
What is the expected behavior?
ripgrep should have returned `e-x` instead of returning no matches.

The same occurs when `e` is replaced with `k`, `s`, or `t`, but doesn't with any other alphabetic character, or if case isn't ignored.