support more sophisticated boolean matching operations #875
Comments
The name of the flag is really not the interesting part of this feature request. The interesting part is the request to support more sophisticated boolean tests. I think if we were to decide to do this, then it needs to be part of a larger story that encompasses more sophisticated expressions. We also need to address the fact that, today, we can actually express quite a bit, but it requires piping. Namely, piping permits expressing "and". Piping plus the
An alternative way to implement this feature is in the regex engine itself (since intersection and complement are available as operations on regular languages), but this is extremely non-trivial to do. I try not to speak in absolutes, but, "I don't want to add anything else that uses |
I understand completely. I currently pipe (to Thanks. |
Well, the "best" way is to, as I hinted at, build complement and intersection into the regex engine. But as I said, this is extremely non-trivial to do efficiently. If we were to implement this, then we'd need an algorithm that selects the (attempted) optimal matching path given all of the boolean conditions. e.g., if you said "x and not y and not z," then ripgrep would search for It is plausible that this would result in a performance improvement. But you can't just throw that out there as a benefit and expect it to stick. :-) Performance does not exist in a vacuum. Pipelines tend to be constructed in a way that iteratively reduces the search space, which in turn makes performance less and less of an issue. The interesting bits are probably pipelines that start with an inverted match on a rarely occurring pattern, which would not reduce the search space much. Regardless, I personally find this to be a somewhat flimsy motivation for a feature like this unless someone can convince me otherwise. IMO, if we add a feature like this, it should be primarily for the UX. |
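As a rough illustration of the "x and not y and not z" case described above, here is a minimal Python sketch. It is not how ripgrep works internally; the function name `match_lines` and the sample data are hypothetical. It only shows the idea of testing the positive pattern first so that the negated patterns run on the few lines that survive.

```python
import re

def match_lines(lines, required, excluded):
    """Yield (line_number, line) for lines matching every `required`
    pattern and none of the `excluded` patterns."""
    req = [re.compile(p) for p in required]
    exc = [re.compile(p) for p in excluded]
    for number, line in enumerate(lines, start=1):
        # Positive patterns are checked first; `and` short-circuits, so the
        # negated patterns are only evaluated on lines that already match.
        if all(r.search(line) for r in req) and not any(e.search(line) for e in exc):
            yield number, line

sample = ["x alone", "x and y", "x and z", "nothing here"]
hits = list(match_lines(sample, required=["x"], excluded=["y", "z"]))
```

A real implementation would also have to decide which pattern to search for first based on how selective it is, which is the non-trivial part mentioned above.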
Example of using
|
Example of AND operation using Rust's regex engine:
|
@kenorb That's presumably not the same as what |
@peterbe You should be able to fix that by adding |
I don't even know if it's possible with pipes but if you could know that the next pipe is another |
Piping loses the file headers.
|
That example doesn't look right. It should retain file names not as headers but in each line in standard grep format. |
Sorry my bad. It looks like this:
Still hard to parse when there are many files. Another example is piping with -A or -B.
|
That's certainly part of an argument in favor of this, but I will not allow that argument to be used as a hammer. Taken to its logical conclusion, ripgrep should bundle every conceivable transform on its data. At some point, people need to become OK with piping ripgrep's output and dealing with the different format. Different people will have different opinions on where that line is drawn. |
I have definitely wished for an easy way to preserve headers when piping |
That would be nice but won't work in all cases. E.g., consider
rg -C5 foo | rg -v bar
Now the context lines around the matched lines in the first rg call are being matched by the second rg call, and your output may end up being a bit of a mess and not what you might expect.
Looking at a few now-closed duplicate issues, what most people want is just "a and not b" with all of the headers/context preserved, which might make sense to special-case if that's much simpler than the general case. |
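To make the "a and not b" special case concrete, here is a small Python sketch under stated assumptions: files are given as an in-memory dict of path to text, and the helper name `a_and_not_b` is hypothetical. It is not a ripgrep feature; it only shows that file headers and line numbers survive when both conditions are evaluated in one pass instead of through a pipe.

```python
import re

def a_and_not_b(files, want, reject):
    """Return {path: [(line_number, line), ...]} for lines that match
    `want` and do not match `reject`, grouped under their file name."""
    want_re, reject_re = re.compile(want), re.compile(reject)
    grouped = {}
    for path, text in files.items():
        kept = [
            (n, line)
            for n, line in enumerate(text.splitlines(), start=1)
            if want_re.search(line) and not reject_re.search(line)
        ]
        if kept:  # a header is only emitted for files with surviving lines
            grouped[path] = kept
    return grouped

files = {
    "a.txt": "foo one\nfoo bar\nbaz",
    "b.txt": "foo bar\nqux",
}
result = a_and_not_b(files, "foo", "bar")
```

Context lines are deliberately left out of this sketch, since deciding how they interact with the negated pattern is exactly the open design question.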
Files look like this:
a.txt
4: ...abc...
30: ...xyz...
b.txt
4: ...abc...
.....
(no 'xyz' in content)
How to find files like a.txt with 'abc' and 'xyz' in different lines? |
Use multiline search.
|
Thanks for reply.
|
Please skim the options in the man page. Use the -n and --with-filename
flags.
…On Sat, Feb 23, 2019, 03:25 amitbha ***@***.***> wrote:
Thanks for reply.
I tried rg -U --multiline-dotall -e 'abc.*xyz' and the right files were
found, but the output included every line between the two matches:
4: ...abc...
5: xxxxx
6: xxxxx
...
29: xxxxx
30: ...xyz...
rg -U --multiline-dotall -e 'abc.*xyz' | rg abc
No filenames or line numbers.
rg -U --multiline-dotall -l -e 'abc.*xyz' | rg -e 'abc' -
No result. How do I read paths from a pipe?
rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do rg -e 'xyz' "$line"; done
Almost done! But filenames are missing.
😔
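The goal in the attempts above (files that contain both patterns on different lines, reported with file names and line numbers) can be sketched in Python as follows. This is only an illustration with hypothetical names (`files_with_all`) and in-memory sample data, not a ripgrep invocation.

```python
import re

def files_with_all(files, patterns):
    """For each file whose content matches every pattern somewhere,
    return the matching lines as (path, line_number, line) tuples."""
    compiled = [re.compile(p) for p in patterns]
    results = []
    for path, text in files.items():
        lines = text.splitlines()
        # The file qualifies only if every pattern matches at least one line.
        if all(any(c.search(l) for l in lines) for c in compiled):
            for n, line in enumerate(lines, start=1):
                if any(c.search(line) for c in compiled):
                    results.append((path, n, line))
    return results

files = {
    "a.txt": "header\n...abc...\nfiller\n...xyz...",
    "b.txt": "header\n...abc...\nfiller",
}
found = files_with_all(files, ["abc", "xyz"])
```

Only the lines that match one of the patterns are reported, so the filler between the two matches does not appear in the output.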
|
|
What's confusing? Also, forgot to say that piping to rg searches lines in previous stdout, not the matching files! |
The upsides and downsides here are well known. I've stated repeatedly what the problems are with Adding this feature reflects significant work. The first step is to come up with a comprehensive UX specification of behavior. That would be useful. Further argumentation about why ripgrep should have this feature is not useful. It's just noise and it's just filling up my inbox. I said about as much almost a year and a half ago, so now I'm just repeating myself. And I'm confused at why I need to do it. |
Sorry for the misunderstanding. I recognize that you understand its usefulness and that the issue is the complexity. I was just responding to your solution above. You made a great tool and I am grateful! Cheers! |
My current work-around for this is effectively use A statement that looks roughly like this:
Gets translated roughly to:
There's a LOT of extra plumbing in my shell script to achieve better performance (e.g. don't have ripgrep search for expressions in an This also allows you to specify some |
for UX on multiple patterns with boolean logic, what if users wrote a double quoted string with 'AND', 'OR', 'NOT' in it, but only within quotes and in all caps after a certain flag. if necessary could specifically surround query patterns with single quotes or curly brackets to make it easier to pluck them out. if each query result is a set of matches, the logic can be a tree of set operations like intersection ("AND"), set union ("OR"), complement ("NOT") to get the results which meet the criteria
# rg --logic "practitioner AND surgery"
r1, r2 = find(["practitioner", "surgery"], data)
result = intersection(r1, r2)
# rg --logic "patient AND (diabetes OR cancer) AND statin NOT lung cancer"
r1, r2, r3, r4, r5 = find(["patient", "diabetes", "cancer", "statin", "lung cancer"], data)
result = intersection(r1, union(r2, r3), r4).difference(r5)
could follow a specific order of operations such as listed here: order of operations on sets
as for performance, that could still be single pass over the data (presumably) but would require a check for multiple queries at each point O(N), plus each set operation is (i believe) linear in the number of matches, and thus fairly fast, but this can add a whole new dimension to the capabilities of ripgrep, AND / OR / NOT / Parenthesis is easily interpretable |
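The set-based idea in this comment can be made runnable with Python's built-in set operators. This is a sketch only: `find` here is a hypothetical helper that returns one set of matching line numbers per pattern, the sample data is made up, and the `--logic` flag does not exist in ripgrep.

```python
import re

def find(patterns, lines):
    """Return one set of matching line numbers per pattern."""
    return [
        {n for n, line in enumerate(lines, start=1) if re.search(p, line)}
        for p in patterns
    ]

data = [
    "patient with diabetes on statin",
    "patient with lung cancer on statin",
    "patient with cancer on statin",
    "practitioner notes",
]

# patient AND (diabetes OR cancer) AND statin NOT "lung cancer"
r1, r2, r3, r4, r5 = find(
    ["patient", "diabetes", "cancer", "statin", "lung cancer"], data
)
# & is intersection (AND), | is union (OR), - is difference (NOT)
result = (r1 & (r2 | r3) & r4) - r5
```

Note that this version scans the data once per pattern; doing it in a single pass, as suggested above, would need every pattern to be checked against each line as it is read.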
@bionicles It's a nice idea, but it's only a start. You're basically saying, "use a DSL." A DSL comes with its own complications. Your examples are simple because none of your queries are regexes. A DSL that embeds regexes means you'll need to handle escaping, and that could become quite annoying. A DSL also implies, to me, that it needs facilities for bulk-including regexes, like via the If I had time to implement this, my first step would be to look at |
eh, git grep syntax looks pretty bad. weird nested flags i just believe ripgrep could be a super fast offline-first alternative to algolia ...but perhaps regex could be surrounded by curlies?
if somebody needs to regex for curly brackets, or regex for regex patterns, they could use double curly brackets, or suck it up and use something else. and even if the logic matching didnt support regex at all, i'd use it, and it would be extremely useful in the vast majority of use cases where one doesn't need regex the plan i described is unclear about the order of operations tho, I'd suggest parenthesis > OR > AND > NOT and just error if there's ambiguity about order of operations another alternative would be a lispy syntax like
|
to de-spec and simplify, just some way to do AND queries alone would be 80/20 |
git grep's advanced Boolean expression should be the model for this. It allows nesting and grouped expressions.
git grep -e foo --and --not \( -e fizz -e buzz \) --and -e bar
Will match anything with “foo” and “bar” but exclude anything that contains “fizz” or “buzz”. You can recursively nest things and make very complex queries this way.
Defining a new DSL for ripgrep seems unwise, when there is a battle-proven
interface with git-grep.
IMHO ripgrep should emulate git-grep’s CLI interface.
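For readers unfamiliar with nested boolean expressions over lines, the query above can be modeled as a small tree of predicates. The Python sketch below is an illustration only: the helper names (`pat`, `and_`, `or_`, `not_`) are hypothetical, and it does not reproduce git grep's parser or its operator precedence rules.

```python
import re

def pat(regex):
    """Leaf node: true when the regex matches somewhere in the line."""
    compiled = re.compile(regex)
    return lambda line: compiled.search(line) is not None

def and_(*terms):
    return lambda line: all(t(line) for t in terms)

def or_(*terms):
    return lambda line: any(t(line) for t in terms)

def not_(term):
    return lambda line: not term(line)

# foo AND NOT (fizz OR buzz) AND bar
expr = and_(pat("foo"), not_(or_(pat("fizz"), pat("buzz"))), pat("bar"))

lines = ["foo bar", "foo fizz bar", "foo only", "bar buzz foo"]
matched = [line for line in lines if expr(line)]
```

Because every node is just a function from a line to a boolean, groups can be nested to any depth, which is the property being argued for here.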
|
Right. It being an existing and very popular implementation of the idea gives it credibility. It should be our starting point. Note also though, that I said we may evaluate its specification. |
If I can make one VERY small suggestion that differs from the git-grep API, would be to replace (or permit both of) I use a wrapper script (company internal only, can't share sorry) around RipGrep and Git Grep to get the best of both worlds, and it does substitution of |
It will also match lines that merely contain foo but have bar in the filename.
|
I've got the following simple use case to share (useful when preprocessing Markdown text or Wikipedia articles). I'm trying to use the OR operator to match words which have special characters around them and print both combinations. I've tried the following:
With the OR operation, I would expect something like this:
I've tried with the following pattern file (to use with
but it's the same result - it stops processing after the first match (I've tried adding |
@kenorb "or" already exists with |
Is there a way to have nice formatting while querying $ rg foo | rg -v bar |
@aboueleyes No. |
I don't think this was mentioned earlier: and the workaround |
For context the primary reason I was looking for this functionality is because I use rg mostly in Emacs through projectile. From what I understand it parses the highlighted output from rg (highlighting matches) and displays them similarly highlighted in the emacs projectile ripgrep search results window. Doing any kind of piping defeats this, and I'm not sure there's a clean way around it without first-class support from the tool. Does this sound like a familiar problem to anyone? Did you find a work around? Basically I have grown to love fzf's "regex-like" style of matching, where individual words are regexes, but everything that is separated by a |
I can recommend using ugrep (https://github.com/Genivia/ugrep), which supports boolean matching, for what you @hraban describe. Together with this simple script:
args=()
query=()
while [[ $# -gt 0 ]]; do
case "$1" in
--)
# for files
break
;;
-*)
args+=("$1")
shift
;;
*)
query+=(--and "$1")
shift
esac
done
ug --column-number "${args[@]}" "${query[@]}" "$@"
You can then find lines that match |
ripgrep also supports terminal hyperlinks. |
Here's a NAND regex. I can confirm correct boolean regex is possible in the simple case of non-nested expressions foreach of AND/OR/NOT/XOR/NAND via TDD one issue is, as @hraban mentioned, a natural way to input these is to pass multiple patterns, and the ripgrep cli expects a single pattern and paths as far as i understand right now. these boolean expression regexes get pretty long and arcane, decent user experience needs a builder pattern / DSL sort of thing. No doubt there are many ways to achieve this aim and cramming everything into one regex might not be the move, but since you have some of the best regex code around, compiling boolean expressions to regex is likely faster than doing logic queries outside of the regex machinery currently hella swamped on unpaid work ugh, maybe it's a fun thing to tinker with next month or this fall, wrote this in january, dont want to keep it bottled up, just want to post here in support of this valuable opportunity to further upgrade the already excellent ripgrep cli ! |
@bionicles Thanks! The problem there is that it relies on look-around, and thus will only work with the |
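To show what "boolean regexes built from look-around" means, here is a sketch using Python's `re` module, which supports lookahead. The helper `all_of` and the sample patterns are hypothetical, and this is not the regex posted above; it only demonstrates the general technique for AND and NAND on a single line.

```python
import re

def all_of(*patterns):
    """Build a regex that matches a line containing every pattern,
    in any order, using one positive lookahead per pattern."""
    return "^" + "".join(f"(?=.*(?:{p}))" for p in patterns)

# AND: both "foo" and "bar" must appear somewhere on the line.
and_re = re.compile(all_of("foo", "bar"))

# NAND: a negative lookahead wrapped around the AND lookaheads.
nand_re = re.compile("^(?!(?=.*(?:foo))(?=.*(?:bar)))")

and_hit = and_re.search("bar then foo") is not None
and_miss = and_re.search("foo only") is not None
nand_hit = nand_re.search("foo only") is not None
nand_miss = nand_re.search("foo bar") is not None
```

The expressions grow quickly as conditions are added, which is why a builder or DSL on top of them was suggested, and they depend entirely on the regex engine supporting look-around.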
With the new convention to use the capitalized version of a short flag to indicate the opposite, it's too bad that -E is already used to mean --encoding, as I would like to suggest an "inverse pattern" mode where only lines/words (depending on other parameters as normal) matching pattern e but not matching pattern E are included in the result set.
Andrew, I know you are loath to add more ! support but given the pre-existing -E, perhaps a -e !PATTERN?