support more sophisticated boolean matching operations #875

Open
mqudsi opened this issue Apr 3, 2018 · 57 comments
Labels
enhancement: An enhancement to the functionality of the software.
icebox: A feature that is recognized as possibly desirable, but is unlikely to be implemented any time soon.
question: An issue that is lacking clarity on one or more points.

Comments

@mqudsi

mqudsi commented Apr 3, 2018

With the new convention to use the capitalized version of a short flag to indicate the opposite it's too bad that -E is already used to mean --encoding, as I would like to suggest an "inverse pattern" mode where only lines/words (depending on other parameters as normal) matching pattern e but not matching pattern E are included in the result set.

Andrew, I know you are loath to add more ! support but given the pre-existing -E, perhaps a -e !PATTERN?

@BurntSushi
Owner

BurntSushi commented Apr 3, 2018

The name of the flag is really not the interesting part of this feature request. The interesting part is the request to support more sophisticated boolean tests.

I think if we were to decide to do this, then it needs to be part of a larger story that encompasses more sophisticated expressions. We also need to address the fact that, today, we can actually express quite a bit, but it requires piping. Namely, piping permits expressing "and". Piping plus the -v flag permits any arbitrary boolean expression you might want. For example, rg foo | rg -v bar says "show lines matching foo but do not contain bar," which is exactly your feature request.
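The pipeline semantics described above can be modeled in a few lines of Python by treating each rg stage as a line filter. This is only an illustrative sketch: the helper names `grep` and `grep_v` are made up for this example, and nothing here reflects how ripgrep is implemented.

```python
import re

def grep(pattern, lines):
    """Model of `rg PATTERN`: keep the lines that match."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

def grep_v(pattern, lines):
    """Model of `rg -v PATTERN`: keep the lines that do NOT match."""
    rx = re.compile(pattern)
    return [line for line in lines if not rx.search(line)]

lines = ["foo only", "foo and bar", "bar only", "neither"]

# rg foo | rg bar     -> lines matching foo AND bar
both = grep("bar", grep("foo", lines))

# rg foo | rg -v bar  -> lines matching foo AND NOT bar
foo_not_bar = grep_v("bar", grep("foo", lines))
```

Each additional pipe stage narrows the previous result, which is why chaining gives "and" and adding -v gives "not".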

git grep has support for this via --not, --and and --or. I don't know if I'm willing to add this to ripgrep. There must be a point at which we say, "piping is good enough."

An alternative way to implement this feature is in the regex engine itself (since intersection and complement are available as operations on regular languages), but this is extremely non-trivial to do.

I try not to speak in absolutes, but, "I don't want to add anything else that uses ! in a shell" is as close to an absolute that I can get. Let's drop that idea.

BurntSushi added the question, icebox, and enhancement labels on Apr 3, 2018
BurntSushi changed the title from "Support inverse -e flag" to "support more sophisticated boolean matching operations" on Apr 3, 2018
@mqudsi
Author

mqudsi commented Apr 3, 2018

I understand completely. I currently pipe (to grep, I didn't realize I could pipe to rg itself!) but was wondering from a performance perspective basically about using the regex engine itself to optimize the search with the additional boolean constraints.

Thanks.

@BurntSushi
Owner

BurntSushi commented Apr 3, 2018

but was wondering from a performance perspective basically about using the regex engine itself to optimize the search with the additional boolean constraints.

Well, the "best" way is to, as I hinted at, build complement and intersection into the regex engine. But as I said, this is extremely non-trivial to do efficiently. If we were to implement this, then we'd need an algorithm that selects the (attempted) optimal matching path given all of the boolean conditions. e.g., if you said "x and not y and not z," then ripgrep would search for x and only apply the y and z blacklist on matches to filter them out. If you had x or y or z, then ripgrep would, as it does today, combine them into one regex joined by |. If you had not x and not y and not z, then ripgrep would behave as it would today if you ran rg -v x and then use the y and z blacklists to filter out matches. If you had not x or not y or not z, then ripgrep could behave as it does today if you ran rg -v 'x|y|z'. And so on...
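As a rough sketch of the first two strategies mentioned above ("x and not y and not z", and "x or y or z"), the following Python joins the positive patterns with | and applies the exclusion patterns only to lines that already matched. The function name `search` and the `checks` counter are assumptions made for illustration; this is not ripgrep's code.

```python
import re

def search(lines, any_of, none_of=()):
    """Return (hits, checks).

    hits:   lines matching at least one `any_of` pattern and no `none_of` pattern.
    checks: how many times an exclusion pattern was evaluated, to show that
            exclusions only run on lines that passed the positive search.
    """
    # "x or y or z" becomes a single regex joined by |
    positive = re.compile("|".join(f"(?:{p})" for p in any_of))
    negatives = [re.compile(p) for p in none_of]
    checks = 0
    hits = []
    for line in lines:
        if not positive.search(line):
            continue  # exclusion patterns are never tried on this line
        excluded = False
        for rx in negatives:
            checks += 1
            if rx.search(line):
                excluded = True
                break
        if not excluded:
            hits.append(line)
    return hits, checks

lines = ["x", "x y", "x z", "y z", "w"]

# x and not y and not z
hits, checks = search(lines, any_of=["x"], none_of=["y", "z"])

# y or z
either, either_checks = search(lines, any_of=["y", "z"])
```

When x is rare, the exclusion patterns are evaluated on only a small fraction of the input, which is the kind of plan selection the comment describes.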

It is plausible that this would result in a performance improvement. But you can't just throw that out there as a benefit and expect it to stick. :-) Performance does not exist in a vacuum. Pipelines tend to be constructed in a way that iteratively reduces the search space, which in turn makes performance less and less of an issue. The interesting bits are probably pipelines that start with an inverted match on a rarely occurring pattern, which would not reduce the search space much. Regardless, I personally find this to be a somewhat flimsy motivation for a feature like this unless someone can convince me otherwise. IMO, if we add a feature like this, it should be primarily for the UX.

@kenorb

kenorb commented Apr 11, 2018

Example of using git grep with AND patterns:

git grep -e pattern1 --and -e pattern2 --and -e pattern3

@kenorb

kenorb commented Apr 11, 2018

Example of AND operation using Rust's regex engine:

rg -N '(?P<p1>.*pattern1.*)(?P<p2>.*pattern2.*)(?P<p3>.*pattern3.*)' file.txt

@BurntSushi
Owner

@kenorb That's presumably not the same as what git grep does. git grep -e pattern1 --and -e pattern2 will match pattern2pattern1 but (.*pattern1.*)(.*pattern2.*) will not. The standard way to perform "and" queries in ripgrep is with piping, as I mentioned above in my comment.
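The difference can be checked with Python's `re` module, used here as a stand-in for the regex engine. The helper `matches_all` is a hypothetical name for an order-independent "and".

```python
import re

# The concatenated form requires pattern1 to appear before pattern2.
concatenated = re.compile(r"(.*pattern1.*)(.*pattern2.*)")

def matches_all(line, patterns):
    """Order-independent AND: every pattern must match somewhere in the line."""
    return all(re.search(p, line) for p in patterns)

in_order = "pattern1 then pattern2"
reversed_order = "pattern2pattern1"

concat_in_order = concatenated.search(in_order) is not None
concat_reversed = concatenated.search(reversed_order) is not None
and_reversed = matches_all(reversed_order, ["pattern1", "pattern2"])
```

The concatenated regex accepts only the first line, while the order-independent check accepts both.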

@peterbe

peterbe commented Jun 6, 2018

I quite like the simplicity and "natural feel" of using rg foo | rg bar to do the equivalent of git grep -e foo --and -e bar. The only significant difference is the color.

git grep -e foo --and -e bar
[screenshot: 2018-06-06 at 8:03:13 am]

rg string | rg query
[screenshot: 2018-06-06 at 8:04:42 am]

See, no highlight of the word string in the rg pipe.

@BurntSushi
Owner

@peterbe You should be able to fix that by adding --color always to your first invocation of ripgrep. Not ideal of course.

@peterbe

peterbe commented Jun 6, 2018

I don't even know if it's possible with pipes, but if you could know that the next pipe is another rg, then --color always could be on by default. One can dream.

@elbaro

elbaro commented Jun 29, 2018

Piping loses the file headers.

rg abc

a.txt
4: ...abc...xyz...
7: ...abc...

b.txt
3: ...abc...xyz...
rg abc | rg xyz

4: ...abc...xyz...
3: ...abc...xyz...

@BurntSushi
Owner

That example doesn't look right. It should retain file names not as headers but in each line in standard grep format.

@elbaro

elbaro commented Jun 29, 2018

Sorry my bad. It looks like this:

rg abc | rg xyz
a.txt: ...abc...xyz...
a.txt: ...abc...xyz...
b.txt: ...abc...xyz...
b.txt: ...abc...xyz...

Still hard to parse when there are many files.
I think it's an example where the built-in op can provide better UX than piping.

Another example is piping with -A or -B.

# want to print a line including "abc" and "xyz" with +- 3 lines
rg abc -A 3 -B 3 | rg xyz -A 3 -B 3  # not what we want
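To make the desired behavior concrete, here is a small Python sketch of what a built-in "and" with context could compute: the boolean test is applied first, and context is added only around lines that pass it. The function name `match_with_context` and the sample lines are invented for illustration.

```python
import re

def match_with_context(lines, patterns, context=3):
    """Return sorted 0-based indices of lines to print: every line that
    matches all `patterns`, plus `context` lines before and after it."""
    keep = set()
    for i, line in enumerate(lines):
        if all(re.search(p, line) for p in patterns):
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            keep.update(range(lo, hi))
    return sorted(keep)

lines = ["l0", "abc", "l2", "l3", "abc xyz", "l5", "xyz", "l7", "l8", "l9"]

# Only line 4 contains both abc and xyz, so only its neighbors are shown.
shown = match_with_context(lines, ["abc", "xyz"], context=1)
```

In the piped version, the second stage also sees the context lines emitted by the first stage, so a context line containing xyz can be reported even though it does not contain abc.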

@BurntSushi
Owner

That's certainly part of an argument in favor of this, but I will not allow that argument to be used as a hammer. Taken to its logical conclusion, ripgrep should bundle every conceivable transform on its data. At some point, people need to become OK with piping ripgrep's output and dealing with the different format. Different people will have different opinions on where that line is drawn.

@BatmanAoD

I have definitely wished for an easy way to preserve headers when piping rg to rg. Maybe a flag for "header passthrough" would be useful on its own.

@aldanor

aldanor commented Jan 7, 2019

I have definitely wished for an easy way to preserve headers when piping rg to rg. Maybe a flag for "header passthrough" would be useful on its own.

That would be nice but won't work in all cases. E.g., consider

rg -C5 foo | rg -v bar

Now the context lines around the matched lines in the first rg call are being matched by the second rg call and your output may end up being a bit of a mess and not what you might expect.


IMO, if we add a feature like this, it should be primarily for the UX.

Looking at a few now-closed duplicate issues, what most people want is just "a and not b" with all of the headers/context preserved, which might make sense to special-case if that's much simpler than the general case.

@amitbha

amitbha commented Feb 23, 2019

Files look like this:

a.txt
4: ...abc...
30: ...xyz...

b.txt
4: ...abc...
.....
(no 'xyz' in content)

How to find files like a.txt with 'abc' and 'xyz' in different lines?

@BurntSushi
Owner

BurntSushi commented Feb 23, 2019 via email

Use multiline search.

@amitbha

amitbha commented Feb 23, 2019

Thanks for the reply.
I tried rg -U --multiline-dotall -e 'abc.*xyz', and the right files were found. But there was too much output, like:

4: ...abc...
5: xxxxx
6: xxxxx
...
29: xxxxx
30: ...xyz...

rg -U --multiline-dotall -e 'abc.*xyz' | rg abc
No filename and line-numbers.

rg -U --multiline-dotall -l -e 'abc.*xyz' | rg 'abc' -
No result. How to read path from pipe?

rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do rg 'xyz' "$line"; done
Almost done! But filenames are missing. 😔

rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do echo "$line"; rg 'xyz' "$line"; echo; done
Done! 😌

@BurntSushi
Owner

BurntSushi commented Feb 23, 2019 via email

@amitbha

amitbha commented Feb 24, 2019

rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do rg --with-filename 'xyz' "$line"; echo; done
Got it!
😌
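For comparison, the same file-level "and" can be sketched in Python: select files whose content matches every required pattern on any line, then print only the lines of interest with file name and line number. Unlike 'abc.*xyz' with --multiline-dotall, this sketch does not require abc to come before xyz. The function name and the throwaway demo files are assumptions for illustration.

```python
import re
import tempfile
from pathlib import Path

def lines_in_files_matching_all(root, required, show):
    """For every file under `root` whose content matches all `required`
    patterns (on any lines), return (file name, line number, line) for
    each line matching `show`."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        if not all(re.search(p, text) for p in required):
            continue  # file-level AND failed
        for number, line in enumerate(text.splitlines(), start=1):
            if re.search(show, line):
                results.append((path.name, number, line))
    return results

# Demo on two throwaway files shaped like the a.txt / b.txt example.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("one\n...abc...\ntwo\n...xyz...\n")
    Path(tmp, "b.txt").write_text("...abc...\n")
    demo = lines_in_files_matching_all(tmp, ["abc", "xyz"], "xyz")
```

Only a.txt contains both patterns, so only its xyz line is reported.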

@gd4c

gd4c commented Nov 5, 2020

What's confusing?

Also, forgot to say that piping to rg searches lines in previous stdout, not the matching files!

@BurntSushi
Owner

BurntSushi commented Nov 5, 2020

The upsides and downsides here are well known. I've stated repeatedly what the problems are with rg foo | rg bar. They don't need to keep being repeated. So I'm confused at why you're rehashing things.

Adding this feature reflects significant work. The first step is to come up with a comprehensive UX specification of behavior. That would be useful. Further argumentation about why ripgrep should have this feature is not useful. It's just noise and it's just filling up my inbox.

I said about as much almost a year and a half ago, so now I'm just repeating myself. And I'm confused at why I need to do it.

@gd4c

gd4c commented Nov 5, 2020

Sorry for the misunderstanding. I recognize that you understand its usefulness and that the issue is the complexity. I was just responding to your solution above.

You made a great tool and I am grateful!

Cheers!

@zachriggle

zachriggle commented Nov 6, 2020

My current work-around for this is effectively to use rg -l to find all OR'ed matches, and then pass off to git grep.

A statement that looks roughly like this:

$ my-grep -e foo --and -e bar --and --not '(' -e fizz -e buzz ')'

Gets translated roughly to:

rg -e foo -e bar -l -0 | xargs -0 git grep --threads 12 --no-index -e foo --and -e bar --and --not '(' -e fizz -e buzz ')'

There's a LOT of extra plumbing in my shell script to achieve better performance (e.g. don't have ripgrep search for expressions in an --and --not ( -e fizz -e buzz ) block), but ultimately rg -l -0 | xargs -0 git grep --no-index works pretty effectively, and is much faster than git grep by itself if you make use of e.g. rg type filters (e.g. rg -t c -t py).

This also allows you to specify some git grep specific formatting, like --show-function, in addition to those that rg also supports like --break --heading --line-number.

@bionicles

The first step is to come up with a comprehensive UX specification of behavior. That would be useful.

for UX on multiple patterns with boolean logic,

what if users wrote a double quoted string with 'AND' , 'OR', 'NOT' in it , but only within quotes and in all caps after a certain flag. if necessary could specifically surround query patterns with single quotes or curly brackets to make it easier to pluck them out.

if each query result is a set of matches, the logic can be a tree of set operations like intersection ("AND"), set union ("OR"), complement ("NOT") to get the results which meet the criteria

# rg --logic "practitioner AND surgery"
r1, r2 = find(["practitioner", "surgery"], data)
result = intersection(r1, r2)

# rg --logic "patient AND (diabetes OR cancer) AND statin NOT lung cancer"
r1, r2, r3, r4, r5 = find(["patient", "diabetes", "cancer", "statin", "lung cancer"], data)
result = intersection(r1, union(r2, r3), r4).difference(r5)

could follow a specific order of operations such as listed here: order of operations on sets

as for performance, that could still be single pass over the data (presumably) but would require a check for multiple queries at each point O(N), plus each set operation is (i believe) linear in the number of matches, and thus fairly fast, but this can add a whole new dimension to the capabilities of ripgrep, AND / OR / NOT / Parenthesis is easily interpretable
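A runnable version of the set-based idea sketched above might look like the following, using Python's built-in set operators over line indices. The sample lines and the helper `find` are made up for illustration, and the terms are treated as plain regexes with no DSL parsing.

```python
import re

def find(pattern, lines):
    """Return the set of 0-based line indices where `pattern` matches."""
    rx = re.compile(pattern)
    return {i for i, line in enumerate(lines) if rx.search(line)}

lines = [
    "patient with diabetes on statin",
    "patient with lung cancer on statin",
    "patient with cancer on statin",
    "patient with diabetes",
    "practitioner performed surgery",
]

patient, diabetes, cancer, statin, lung_cancer = (
    find(p, lines)
    for p in ["patient", "diabetes", "cancer", "statin", "lung cancer"]
)

# patient AND (diabetes OR cancer) AND statin NOT lung cancer
result = (patient & (diabetes | cancer) & statin) - lung_cancer
```

Here & is intersection ("AND"), | is union ("OR"), and - is set difference ("NOT").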

@BurntSushi
Owner

@bionicles It's a nice idea, but it's only a start. You're basically saying, "use a DSL." A DSL comes with its own complications. Your examples are simple because none of your queries are regexes. A DSL that embeds regexes means you'll need to handle escaping, and that could become quite annoying. A DSL also implies, to me, that it needs facilities for bulk-including regexes, like via the -f flag.

If I had time to implement this, my first step would be to look at git grep and write a specification based on its man pages and behavior. Then evaluate whether that specification could be improved.

@bionicles

eh, git grep syntax looks pretty bad. weird nested flags --and -e thing1 -e thing2 ... it doesn't read like english

i just believe ripgrep could be a super fast offline-first alternative to algolia ...but perhaps regex could be surrounded by curlies?

rg --logic "{^[A-Z]+_SUSPEND$} OR rustacean"

if somebody needs to regex for curly brackets, or regex for regex patterns, they could use double curly brackets, or suck it up and use something else. and even if the logic matching didnt support regex at all, i'd use it, and it would be extremely useful in the vast majority of use cases where one doesn't need regex

the plan i described is unclear about the order of operations tho, I'd suggest

parenthesis > OR > AND > NOT
because "or" is more inclusive

and just error if there's ambiguity about order of operations

another alternative would be a lispy syntax like

rg --logic "((simple english NOT dsl) AND regex)"

@bionicles

to de-spec and simplify, just some way to do AND queries alone would be 80/20
aka, find files which match 3 different queries:
rg --multi "patient" --multi "pacemaker" --multi "expired"
no DSL required

@zachriggle

zachriggle commented Mar 19, 2021 via email

@BurntSushi
Owner

BurntSushi commented Mar 19, 2021

Right. It being an existing and very popular implementation of the idea gives it credibility. It should be our starting point. Note also though, that I said we may evaluate its specification.

@zachriggle

zachriggle commented Mar 20, 2021

If I can make one VERY small suggestion that differs from the git-grep API, it would be to replace (or permit both of) '(' with [ for expression grouping, since the ( requires quoting like '(' or else the shell tries to do weird things with it. The [ and ] are not treated specially and don't need quotes.

I use a wrapper script (company internal only, can't share sorry) around RipGrep and Git Grep to get the best of both worlds, and it does substitution of [ with '(' to make things easier to read.

@HK47196

HK47196 commented Mar 22, 2022

@gd4c rg foo | rg bar will only print lines that contain both foo and bar.

It will also match lines that merely contain foo but have bar in the filename.

~ echo 'foo' > bar
~ rg foo | rg bar
bar:foo
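The false positive can be modeled in Python by treating the first stage's output as "path:text" lines. The file name notes.txt and both helper functions are hypothetical, and the text-only variant assumes paths contain no ':' character.

```python
import re

# Model of `rg foo` output: each line is "<path>:<text>".
# "bar" is the file from the example above; notes.txt is made up.
first_stage_output = [
    "bar:foo",
    "notes.txt:foo and bar",
    "notes.txt:just foo",
]

def naive_second_stage(pattern, output_lines):
    """What `| rg PATTERN` sees: the whole "path:text" line."""
    return [line for line in output_lines if re.search(pattern, line)]

def text_only_second_stage(pattern, output_lines):
    """Match only the text after the first ':' (assumes no ':' in paths)."""
    return [
        line for line in output_lines
        if re.search(pattern, line.split(":", 1)[1])
    ]

naive = naive_second_stage("bar", first_stage_output)
text_only = text_only_second_stage("bar", first_stage_output)
```

The naive stage reports bar:foo because the pattern matches the path, while the text-only stage reports just the line that contains both words.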

@kenorb

kenorb commented May 2, 2022

I've got the following simple use case to share (useful when preprocessing Markdown text or Wikipedia articles).

So I'm trying to match using OR operator words which have special characters around it and print both combinations. I've tried the following:

$ echo "{{Foo}} [[Bar]]" | rg -o '\w+|\S+'
{{Foo}}
[[Bar]]

With the OR operation, I would expect something like this:

$ echo "{{Foo}} [[Bar]]" | tee >(rg -o '\w+') >(rg -o '\S+') >/dev/null | cat
{{Foo}}
[[Bar]]
Foo
Bar

I've tried with the following pattern file (to use with -f):

\w+|\S+
\S+
\w+

but it's the same result - it stops processing after the first match (I've tried adding -e, but it doesn't make any matches). I've also tried to specify multiple pattern files, or --passthru '^|(\w+)|(\S+)' -o -r '$0', no luck.

@BurntSushi
Owner

@kenorb "or" already exists with |. It looks like what you want is an overlapping query, which we already talked about. That's completely unrelated to boolean queries.

@aboueleyes

Is there a way to have nice formatting while querying

$ rg foo | rg -v bar 

@BurntSushi
Owner

@aboueleyes No.

@timotheecour

timotheecour commented Apr 7, 2024

I don't think this was mentioned earlier:
rg foo | rg bar isn't the same as rg foo --and bar, because rg bar will match also filenames with bar, so will have false positives

and the workaround rg 'foo.*bar|bar.*foo' doesn't scale well
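To illustrate why the alternation workaround scales poorly, this Python sketch generates it for an arbitrary list of terms: the number of branches is the factorial of the number of terms. The function name `any_order_regex` is hypothetical, and the sketch assumes each term is a simple pattern with no top-level | of its own.

```python
from itertools import permutations
from math import factorial

def any_order_regex(terms):
    """Build one regex that matches lines containing all `terms` in any
    order by listing every ordering, e.g. foo.*bar|bar.*foo."""
    return "|".join(".*".join(order) for order in permutations(terms))

two = any_order_regex(["foo", "bar"])
four = any_order_regex(["a", "b", "c", "d"])

# Number of alternation branches for four terms.
branches = four.count("|") + 1
```

Two terms need 2 branches and four terms need 24; by six terms the pattern has 720 branches.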

@hraban

hraban commented Apr 10, 2024

For context the primary reason I was looking for this functionality is because I use rg mostly in Emacs through projectile. From what I understand it parses the highlighted output from rg (highlighting matches) and displays them similarly highlighted in the emacs projectile ripgrep search results window. Doing any kind of piping defeats this, and I'm not sure there's a clean way around it without first-class support from the tool.

Does this sound like a familiar problem to anyone? Did you find a work around?

Basically I have grown to love fzf's "regex-like" style of matching, where individual words are regexes, but everything that is separated by a space character is actually considered a separate match which are all allowed to match anywhere on the line, not just in that order (so it's not like replacing space with .*). It's great UX. Not necessarily ripgrep's responsibility to fill this niche of course, just thought I'd give some context.

@fabian-thomas

I can recommend using ugrep (https://github.com/Genivia/ugrep), which supports boolean matching, for what you @hraban describe.

Together with this simple script:

args=()
query=()
while [[ $# -gt 0 ]]; do
    case "$1" in
        --)
            # for files
            break
            ;;
        -*)
            args+=("$1")
            shift
            ;;
        *)
            query+=(--and "$1")
            shift
    esac
done
ug --column-number "${args[@]}" "${query[@]}" "$@"

You can then find lines that match one and two in any order by calling the-script one two -- /path/to/file /other/file, or for recursive search, the-script one two. This, together with ugrep's support for terminal hyperlinks, results in a really nice experience I make use of daily.

@BurntSushi
Owner

This, together with ugrep's support for terminal hyperlinks results in a really nice experience I make use of daily.

ripgrep also supports terminal hyperlinks.

@bionicles

Here's a NAND regex. I can confirm correct boolean regex is possible in the simple case of non-nested expressions for each of AND/OR/NOT/XOR/NAND via TDD

one issue is, as @hraban mentioned, a natural way to input these is to pass multiple patterns, and the ripgrep cli expects a single pattern and paths as far as i understand right now. these boolean expression regexes get pretty long and arcane, decent user experience needs a builder pattern / DSL sort of thing. No doubt there are many ways to achieve this aim and cramming everything into one regex might not be the move, but since you have some of the best regex code around, compiling boolean expressions to regex is likely faster than doing logic queries outside of the regex machinery

image

currently hella swamped on unpaid work ugh, maybe it's a fun thing to tinker with next month or this fall, wrote this in january, dont want to keep it bottled up, just want to post here in support of this valuable opportunity to further upgrade the already excellent ripgrep cli !

@BurntSushi
Owner

@bionicles Thanks! The problem there is that it relies on look-around, and thus will only work with the -P/--pcre2 switch.
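The regex from the image is not reproduced in this text, so the following is only a generic sketch of look-around-based AND and NAND, written with Python's `re` module, which supports look-around. The helper names are invented, and this is not necessarily the expression that was posted.

```python
import re

def all_of(*patterns):
    """AND: one positive lookahead per pattern, anchored at line start."""
    return "^" + "".join(f"(?=.*(?:{p}))" for p in patterns)

def not_all_of(*patterns):
    """NAND: a negative lookahead wrapped around the AND lookaheads."""
    return "^(?!" + "".join(f"(?=.*(?:{p}))" for p in patterns) + ")"

and_rx = re.compile(all_of("foo", "bar"))
nand_rx = re.compile(not_all_of("foo", "bar"))

and_any_order = and_rx.search("bar then foo") is not None
and_missing_one = and_rx.search("foo only") is not None
nand_missing_one = nand_rx.search("foo only") is not None
nand_both_present = nand_rx.search("foo and bar") is not None
```

Because every piece is a lookahead, patterns of this shape need an engine with look-around support, which is why they would require the -P/--pcre2 switch in ripgrep.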
