Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternation match regression #2884

Closed
1 task done
codido opened this issue Sep 9, 2024 · 3 comments · Fixed by #2885
Closed
1 task done

Alternation match regression #2884

codido opened this issue Sep 9, 2024 · 3 comments · Fixed by #2885
Labels
bug A bug.

Comments

@codido
Copy link

codido commented Sep 9, 2024

Please tick this box to confirm you have reviewed the above.

  • I have a different issue.

What version of ripgrep are you using?

14.1.0

How did you install ripgrep?

Cargo

What operating system are you using ripgrep on?

Fedora 40

Describe your bug.

ripgrep fails to return some matches when case is ignored. This only happens with very particular matches and specific characters.

A git bisect seems to suggest this is a regression introduced by
ca740d9.

What are the steps to reproduce the behavior?

Run the following:

echo -n "e-x" | rg -i "e.x|ex"

What is the actual behavior?

$ echo -n "e-x" | rg --debug -i "e.x|ex"
rg: DEBUG|rg::flags::config|/home/ido/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ripgrep-14.1.0/crates/core/flags/config.rs:41: /home/ido/.ripgreprc: arguments loaded from config file: []
rg: DEBUG|rg::flags::parse|/home/ido/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ripgrep-14.1.0/crates/core/flags/parse.rs:97: no extra arguments found from configuration file
rg: DEBUG|rg::flags::hiargs|/home/ido/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ripgrep-14.1.0/crates/core/flags/hiargs.rs:1099: using heuristics to determine whether to read from stdin or search ./ (is_readable_stdin=true, stdin_consumed=false, mode=Search(Standard))
rg: DEBUG|rg::flags::hiargs|/home/ido/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ripgrep-14.1.0/crates/core/flags/hiargs.rs:1112: heuristic chose to search stdin
rg: DEBUG|rg::flags::hiargs|/home/ido/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ripgrep-14.1.0/crates/core/flags/hiargs.rs:1260: found hostname for hyperlink configuration: sourcerer
rg: DEBUG|rg::flags::hiargs|/home/ido/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ripgrep-14.1.0/crates/core/flags/hiargs.rs:1270: hyperlink format: ""
rg: DEBUG|rg::flags::hiargs|/home/ido/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ripgrep-14.1.0/crates/core/flags/hiargs.rs:174: using 1 thread(s)
rg: DEBUG|grep_regex::literal|/home/ido/.cargo/registry/src/index.crates.io-6f17d22bba15001f/grep-regex-0.1.12/src/literal.rs:117: extracted fast line regex: "(?:(?:EX)|(?:Ex)|(?:eX)|(?:ex))"
rg: DEBUG|globset|/home/ido/.cargo/registry/src/index.crates.io-6f17d22bba15001f/globset-0.4.14/src/lib.rs:453: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
$

What is the expected behavior?

ripgrep should have returned e-x instead of returning no matches.
The same occurs when e is replaced with k, s, or t, but doesn't with any other alphabetic character, or if case isn't ignored.

@BurntSushi BurntSushi added the bug A bug. label Sep 9, 2024
BurntSushi added a commit that referenced this issue Sep 9, 2024
In some rare cases, it was possible for ripgrep's inner literal detector
to extract a set of literals that could produce a false negative. #2884
gives an example: `(?i:e.x|ex)`. In this case, the set extracted can be
discovered by running `rg '(?i:e.x|ex) --trace`:

    Seq[E("EX"), E("Ex"), E("eX"), E("ex")]

This extraction leads to building a multi-substring matcher for `EX`,
`Ex`, `eX` and `ex`. Searching the haystack `e-x` produces no match,
and thus, ripgrep shows no matches. But the regex `(?i:e.x|ex)` matches
`e-x`.

The issue at play here was that when two extracted literal sequences
were unioned, we were correctly unioning their "prefix" attribute.
And this in turn leads to those literal sequences being combined
incorrectly via cross product. This case in particular triggers it
because two different optimizations combine to produce an incorrect
result. Firslty, the regex has a common prefix extracted and is
rewritten as `(?i:e(?:.x|x))`. Secondly, the `x` in the first branch of
the alternation has its `prefix` attribute set to `false` (correctly),
which means it can't be cross producted with another concatenation. But
in this case, it is unioned with the `x` from the second branch, and
this results in the union result having `prefix` set to `true`. This
in turn pops up and lets it get cross producted with the `e` prefix,
producing an incorrect literal sequence.

We fix this by changing the implementation of `union` to return
`prefix` set to `true` only when *both* literal sequences being unioned
have `prefix` set to `true`.

Doing this exposed a second bug that was present, but was purely
cosmetic: the extracted literals in this case, after the fix, are
`X` and `x`. They were considered "exact" (i.e., lead to a match),
but of course they are not. Observing an `X` or an `x` does not mean
there is a match. This was fixed by making `choose` always return
an inexact literal sequence. This is perhaps too conservative in
aggregate in some cases, but always correct. The idea here is that if
one is choosing between two concatenations, then it is likely the case
that the sequence returned should be considered inexact. The issue
is that this can lead to avoiding cross products in some cases that
would otherwise be correct. This is bad because it means extracting
shorter literals in some cases. (In general, the longer the literal the
better.) But we prioritize correctness for now and fix it. You can see
a few tests where this shortens some extracted literals.

Fixes #2884
@BurntSushi
Copy link
Owner

Excellent find. When you find weird behaviors like this where small perturbations to the regex make the bug go away, it almost always points toward literal extraction. In this case, the bug was in ripgrep's inner literal extractor (which isn't in the regex engine), and that only runs when the regex engine believes its own optimizations "aren't great." And there are tons of layers here. In this case, multiple optimizations and heuristics have to come together to produce an incorrect result.

This is also made clearer in ripgrep's trace logging, which shows the literals it extracted which are clearly incorrect:

$ echo "e-x" | rg-14.1.0 "(?i:e.x|ex)" --trace
rg: DEBUG|rg::flags::config|crates/core/flags/config.rs:41: /home/andrew/.ripgreprc: arguments loaded from config file: ["--max-columns-preview", "--colors=match:bg:0xff,0x7f,0x00", "--colors=match:fg:0xff,0xff,0xff", "--colors=line:none", "--colors=line:fg:magenta", "--colors=path:fg:green", "--type-add=got:*_test.go"]
rg: DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:1099: using heuristics to determine whether to read from stdin or search ./ (is_readable_stdin=true, stdin_consumed=false, mode=Search(Standard))
rg: DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:1112: heuristic chose to search stdin
rg: DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:1260: found hostname for hyperlink configuration: duff
rg: DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:1270: hyperlink format: ""
rg: DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:174: using 1 thread(s)
rg: TRACE|grep_regex::matcher|crates/regex/src/matcher.rs:66: final regex: "(?:[Ee](?:(?:[\0-\t\u{b}-\u{10ffff}][Xx])|[Xx]))"
rg: TRACE|grep_regex::literal|crates/regex/src/literal.rs:154: extracted inner literals: Seq[E("EX"), E("Ex"), E("EX"), E("Ex"), E("eX"), E("ex"), E("eX"), E("ex")]
rg: TRACE|grep_regex::literal|crates/regex/src/literal.rs:156: extracted inner literals after optimization: Seq[E("EX"), E("Ex"), E("eX"), E("ex")]
rg: DEBUG|grep_regex::literal|crates/regex/src/literal.rs:117: extracted fast line regex: "(?:(?:EX)|(?:Ex)|(?:eX)|(?:ex))"
rg: DEBUG|globset|crates/globset/src/lib.rs:453: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
rg: TRACE|rg::search|crates/core/search.rs:254: <stdin>: binary detection: BinaryDetection(Convert(0))
rg: TRACE|grep_searcher::searcher|crates/searcher/src/searcher/mod.rs:743: generic reader: searching via roll buffer strategy
rg: TRACE|grep_searcher::searcher::core|crates/searcher/src/searcher/core.rs:65: searcher core: will use fast line searcher

Specifically, this bit:

extracted fast line regex: "(?:(?:EX)|(?:Ex)|(?:eX)|(?:ex))"

Clearly, this regex won't match the input, and thus ripgrep reports a false negative.

I fixed this in #2885.

@BurntSushi
Copy link
Owner

This should be fixed in the 14.1.1 release.

@codido
Copy link
Author

codido commented Sep 9, 2024

Thank you @BurntSushi!

tmeijn pushed a commit to tmeijn/dotfiles that referenced this issue Sep 13, 2024
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep) | patch | `14.1.0` -> `14.1.1` |

MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot).

**Proposed changes to behavior should be submitted there as MRs.**

---

### Release Notes

<details>
<summary>BurntSushi/ripgrep (BurntSushi/ripgrep)</summary>

### [`v14.1.1`](https://github.com/BurntSushi/ripgrep/blob/HEAD/CHANGELOG.md#1411-2024-09-08)

[Compare Source](BurntSushi/ripgrep@14.1.0...14.1.1)

\===================
This is a minor release with a bug fix for a matching bug. In particular, a bug
was found that could cause ripgrep to ignore lines that should match. That is,
false negatives. It is difficult to characterize the specific set of regexes
in which this occurs as it requires multiple different optimization strategies
to collide and produce an incorrect result. But as one reported example, in
ripgrep, the regex `(?i:e.x|ex)` does not match `e-x` when it should. (This
bug is a result of an inner literal optimization performed in the `grep-regex`
crate and not in the `regex` crate.)

Bug fixes:

-   [BUG #&#8203;2884](BurntSushi/ripgrep#2884):
    Fix bug where ripgrep could miss some matches that it should report.

Miscellaneous:

-   [MISC #&#8203;2748](BurntSushi/ripgrep#2748):
    Remove ripgrep's `simd-accel` feature because it was frequently broken.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this MR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this MR, check this box

---

This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy40NDAuNyIsInVwZGF0ZWRJblZlciI6IjM3LjQ0MC43IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJSZW5vdmF0ZSBCb3QiXX0=-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants