support searching across multiple lines #176

isobit · 2016-10-13T21:31:20Z

Say for example I'm trying to find instances of click that reside in a listeners block, like so:

listeners: {
    foo: ...
    click: ....
}

According to the Rust regex docs, I should be able to do: rg '(?s)listeners.+click', but this doesn't seem to work. Does ripgrep not support multiline regex?

BurntSushi · 2016-10-13T22:06:19Z

> Does ripgrep not support multiline regex?

Correct. Not even the s flag will help, because ripgrep explicitly instructs the regex automaton to never match \n. Like grep, ripgrep is a line oriented search tool.

ripgrep can perform a search in two different ways. One of them reads a chunk of bytes at a time and searches it. The other memory maps the file and searches that all at once. The former has a number of advantages, including being faster when searching a large number of small files in parallel and being able to search streams in constant memory. The latter has the advantage of being faster for single files (sometimes) and much simpler to implement.

The former only works because search is line oriented. A multiline regex can technically match, say, 2GB of data, which is completely incompatible with searching small chunks at a time.

The latter could be made to work with multiline search, but memory maps can't search stdin for example. So a multiline search on stdin would have to block and read all of stdin into memory before searching. (There exists a way around even this, but it requires changing the regex engine to be capable of incremental search, which is an even bigger change, but theoretically possible.)
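
To make the distinction concrete, here is a minimal Python sketch of the two strategies. It is an illustration only, not ripgrep's implementation: the function names and the tiny `chunk_size` are assumptions, and Python's `re` stands in for the Rust regex engine. The chunked searcher carries only one partial line between reads, which is what keeps its memory use bounded, and also why the pattern from the original question can never match in it.

```python
import io
import re

# The example block from the original question.
TEXT = "listeners: {\n    foo: ...\n    click: ....\n}\n"


def search_lines_in_chunks(stream, pattern, chunk_size=16):
    """Line-oriented search: read fixed-size chunks, match within single lines only."""
    regex = re.compile(pattern)
    matches = []
    leftover = ""  # trailing partial line carried over to the next chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = leftover + chunk
        # Only complete lines are searched; the last piece may be incomplete.
        *complete, leftover = buf.split("\n")
        matches.extend(line for line in complete if regex.search(line))
    if leftover and regex.search(leftover):
        matches.append(leftover)
    return matches


def search_whole_buffer(text, pattern):
    """Multiline-capable search: the regex sees the entire contents at once."""
    return [m.group(0) for m in re.finditer(pattern, text)]
```

With `TEXT` above, `search_lines_in_chunks(io.StringIO(TEXT), r"click")` finds the `click` line, while `(?s)listeners.+click` only matches through `search_whole_buffer`, because no single line contains both words.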

Multiline search therefore comes with significant implementation complexity and, IMO, is a pretty niche use case. I can also imagine it having a pretty big impact on the printing code. This complexity alone is a good reason why it may never be in ripgrep proper, but perhaps once #162 is done, others can take a crack at it.

This is a good example of a feature that The Silver Searcher has that ripgrep may either never have or won't have for a long time.

isobit · 2016-10-13T23:50:30Z

Gotcha, thanks for the explanation. I really like ripgrep as a tool, just was hoping to use it for this case too 😉 .

BurntSushi · 2016-10-13T23:53:40Z

@joshglendenning Yeah, I admit, it would be nice, and if it were easy, I'd have no problems with it. While I do consider it niche, I have no doubts that it would be quite useful!

Once I split out most of the pieces of ripgrep to library form, perhaps there will be interest in building other tools for more niche use cases! I will keep this case in mind as I do that though.

maxbrunsfeld · 2017-01-09T19:46:01Z

This is a really cool tool, but I might suggest including this as a caveat in the README, alongside the comparisons to ag, since ag does support multi-line patterns.

BurntSushi · 2017-01-10T00:57:12Z

@maxbrunsfeld I've been meaning to add an "anti pitch" section to the README like the one in my blog post. That's now done. Thanks for the reminder!

BurntSushi · 2017-03-17T01:18:30Z

I'm going to re-open this, because it's one of the most highly requested features.

Nothing has changed about the problems I outlined above. However, multiline search needn't be the default. If we provide it as a flag, then we can do what we need to do to support multiline search only when that flag is provided. The critical thing that multiline search needs is a complete sequence of bytes in memory to search. Memory maps can provide this, but failing that, we would need to read the entire file into memory before starting a search.

Other than using heap space proportional to the file being searched, the fundamental issue with this flag is what happens when it's used in conjunction with searching stdin. Namely, ripgrep will need to block until EOF is read on stdin before a search can even start. Alternatively, multiline search simply wouldn't be allowed on stdin. The Silver Searcher will in fact do this silently when searching stdin:

/* TODO: this will only match single lines. multi-line regexes silently don't match */
void search_stream(FILE *stream, const char *path) {
    // ...
}

I don't like the "silent" idea, but stopping ripgrep with an error is certainly something I'd be open to. Neither seems like a good choice to me, but I don't think it should block this feature altogether.
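
The memory requirement described above can be sketched in Python as follows. This is a hedged illustration, not ripgrep's design: `load_for_multiline_search` and `multiline_search` are hypothetical names, and treating `path=None` as "search stdin" is an assumption made for the example. It takes the "block until EOF" option for stdin rather than the "stop with an error" option.

```python
import mmap
import re
import sys


def load_for_multiline_search(path=None):
    """Return the complete contents as one contiguous bytes-like object."""
    if path is None:
        # stdin cannot be memory mapped: block until EOF and keep it all on the heap.
        return sys.stdin.buffer.read()
    with open(path, "rb") as f:
        try:
            # Common case: let the OS page the file in on demand.
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except (ValueError, OSError):
            # Empty or unmappable files: fall back to reading everything.
            return f.read()


def multiline_search(data, pattern: bytes):
    """Run a bytes regex over the whole buffer; matches may span lines."""
    return [m.group(0) for m in re.finditer(pattern, data)]
```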

N.B. This is a significant feature and it would have to be part of the libripgrep effort.

BurntSushi · 2017-03-17T01:22:24Z

The other thing I forgot to mention is that multiline search will negate inner literal optimizations. Normal prefix and, in special cases, suffix, literal optimizations will still be performed as part of the regex engine. (I've long thought about making inner literal optimizations work on arbitrary strings, but it's hard.)

gulshan · 2017-03-17T01:35:31Z

A naive question/suggestion. Assuming single lines are being loaded for search now, can that be changed to n lines, n set to 10 or 20 or something like that? While a line gets in, another gets out of the load in FIFO fashion? This will not be technically correct for all cases, but may be enough for most cases.

d-akara · 2017-03-17T01:36:08Z

How significant are the trade-offs to the user experience?
If doing multiline is more expensive, I'm fine with that as long as single line performance is not impacted.

Would you actually need a special flag to ripgrep? or can you reliably determine from the expression itself?

BurntSushi · 2017-03-17T01:49:17Z

Great questions! Keep 'em coming.

> Assuming single lines are being loaded for search now

They are not. If they were, ripgrep would be very slow. The reasons for this are a bit subtle, but basically, "it's faster to search a huge chunk than it is to break it into little pieces and then search each piece." "Huge chunk" in this case might be the size of some internal buffer, perhaps, 8KB.
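
One way to picture "search the huge chunk" is the Python sketch below. It is an assumption-laden illustration rather than ripgrep's code: `matching_lines` is a hypothetical helper, and it assumes the pattern itself never matches a line feed. The regex runs over the whole buffer, and line boundaries are located only around actual matches instead of splitting every line up front.

```python
import re


def matching_lines(buf: bytes, pattern: bytes):
    """Return the lines of `buf` that contain a match, without pre-splitting lines."""
    regex = re.compile(pattern)
    lines = []
    pos = 0
    while pos <= len(buf):
        m = regex.search(buf, pos)
        if m is None:
            break
        # Expand the match to the enclosing line.
        start = buf.rfind(b"\n", 0, m.start()) + 1
        end = buf.find(b"\n", m.end())
        if end == -1:
            end = len(buf)
        lines.append(buf[start:end])
        # Resume after this line so each line is reported once.
        pos = end + 1
    return lines
```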

If you're curious about how a fast grep tool works in more detail, check out this section in my blog post on ripgrep: http://blog.burntsushi.net/ripgrep/#anatomy-of-a-grep

> can that be changed to n lines, n set to 10 or 20 or something like that? While a line gets in, another gets out of the load in FIFO fashion? This will not be technically correct for all cases, but may be enough for most cases.

If you have a regex like a\s+b, then it's not possible to determine the length of the match up front. You have three choices:

1. You use a regex engine that supports incremental search. (This is somewhat at odds with performance if "incremental" means "byte at a time." So for something like this, you'd need an incremental engine that can process chunks at a time.) ripgrep's regex engine doesn't support this.
2. You feed the regex engine every byte you got. (The Plan.)
3. You arbitrarily cap the size of the match. This will invariably get things wrong and there's no way to escape.
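
A small Python sketch of why capping the window gets things wrong for a pattern like a\s+b. The helper names and the window size are assumptions for illustration, and Python's `re` is used in place of ripgrep's engine. The FIFO window is the suggestion quoted above; feeding the engine everything is option 2.

```python
import re
from collections import deque


def search_sliding_window(lines, pattern, window=3):
    """Option 3: search a FIFO window holding at most `window` lines."""
    regex = re.compile(pattern)
    buf = deque(maxlen=window)  # the oldest line drops out as a new one comes in
    for line in lines:
        buf.append(line)
        if regex.search("".join(buf)):
            return True
    return False


def search_everything(lines, pattern):
    """Option 2: hand the regex engine every byte at once."""
    return re.search(pattern, "".join(lines)) is not None
```

With `["a\n", "\n", "b\n"]` both functions find a match. With five blank lines between `a` and `b`, the window of three never holds both ends of the match, so only `search_everything` succeeds.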

I still actually strongly believe that multiline search is a very niche feature, but it is one that can be quite useful when the situation calls for it. (A text editor is perhaps one such situation, but ripgrep is first and foremost a command line tool where multiline search feels a lot less common.) Therefore, taking approach (3) doesn't seem worth it. In the common case, memory maps will work just fine and your OS will manage the memory for you. It's only the corner cases that are sub-optimal: when memory maps can't be used (e.g., on virtual files or stdin).

> How significant are the trade-offs to the user experience? If doing multiline is more expensive, I'm fine with that as long as single line performance is not impacted.

If --multiline is behind a flag, then I'm pretty confident that the standard UX of ripgrep won't be impacted. Including performance.

> Would you actually need a special flag to ripgrep? or can you reliably determine from the expression itself?

A flag is 100% necessary. A regex like a\s+b shouldn't match across multiple lines by default, because that's what we've all come to expect from line oriented searchers. But it is totally plausible that you might want it to. That's when you'd pass a flag.

d-akara · 2017-03-17T02:57:00Z

> I still actually strongly believe that multiline search is a very niche feature, but it is one that can be quite useful when the situation calls for it.

I would agree use is actually niche, but desire to use is not.

1. It is a bit non-intuitive how to properly write a multiline expression, especially if the engine doesn't support dotall matching for . and, even worse, if you want to constrain the match to a range like the next N lines.
2. Due to 1, many use incomplete results, although not always knowingly. Most coding languages can have line breaks almost anywhere.

I would say if you are searching for 2 terms and completeness is important, then multiline would often be your default. However, writing an expression to find termA followed by termB within 5 or fewer lines is likely not something that rolls off the fingertips of someone who only occasionally uses regular expressions, although I think many would find such expressions useful and would use them if they were more intuitive to write.

BurntSushi · 2017-03-17T10:58:36Z

@dakaraphi Good points. I'd like to use your comment to constrain this feature, namely, that multiline search is the ability to apply a regex whose matches may span an arbitrary number of lines.

With that said:

1. It would be plausible to make . match \n by default if multiline mode is enabled.
2. The use case of "where do A and B co-occur within N lines of each other" is definitely something I agree can be useful. It's possible to some extent to do this with a regex, e.g., A([^\n]*\n){0,5}[^\n]*B|B([^\n]*\n){0,5}[^\n]*A, but that is a little painful. Extending this to three terms would probably be horrifying.
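
For reference, the co-occurrence regex above can be generated rather than typed by hand. The sketch below is a hedged Python illustration: `within_n_lines` is a hypothetical helper, it uses Python's `re` rather than ripgrep's engine, and it swaps the capturing groups for non-capturing ones, which does not change what matches.

```python
import re


def within_n_lines(a: str, b: str, n: int) -> re.Pattern:
    """Build a regex matching literal `a` and `b` at most `n` line breaks apart, in either order."""
    a, b = re.escape(a), re.escape(b)
    # Up to n complete lines, then the rest of the current line.
    gap = r"(?:[^\n]*\n){0,%d}[^\n]*" % n
    return re.compile(f"{a}{gap}{b}|{b}{gap}{a}")
```

For example, `within_n_lines("termA", "termB", 5)` matches when the two terms are a few lines apart in either order, and stops matching once more than five line breaks separate them.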

I think (2) is something that's enabled by multiline search, although, today, you can do something similar with contexts: rg B -C5 | rg A -C5 for example works to some extent. Regardless, it might be wiser to categorize this into a separate feature whose UX can be more thoughtfully designed. Others have requested similarish things, as in #346 and #360. sift is a tool that has support for this kind of matching, so we may be able to crib ideas from them.

With all that said, we must be careful not to get too far away from what ripgrep is supposed to be good at doing: searching lines. :-) I say this because there has to be a point at which "write code for your specialized search" becomes a valid thing to say. The key is figuring out where that point is.

d-akara · 2017-03-17T12:47:42Z

> multiline search is the ability to apply a regex whose matches may span an arbitrary number of lines

Just to make sure I understand the intention, could you state what you see ripgrep not doing that other regex engines possibly do when searching multiline?

BurntSushi · 2017-03-17T12:54:07Z

@dakaraphi Sorry, the intention of me saying that was to push UX concerns like "how do I find co-occurring terms, A and B, within a fixed number of lines" out of multiline support. i.e., I don't think that particular UX should be addressed as part of standard multiline support, but should instead be considered as a separate feature (that may or may not happen). :-)

I don't think there's anything ripgrep would do differently in terms of UX with respect to The Silver Searcher, other than 1) not doing it by default and 2) probably not doing silent things.

BurntSushi · 2017-03-17T12:54:54Z

Are there other tools that support multiline search besides The Silver Searcher?

d-akara · 2017-03-17T14:22:24Z

I'm not sure about command line tools. Prior to using VS Code I was using Brackets which supported multiline file search. I believe other editors like Sublime, Notepad++ etc also support multiline.

d-akara · 2017-03-17T14:25:04Z

> I don't think that particular UX should be addressed as part of standard multiline support, but should instead be considered as a separate feature (that may or may not happen). :-)

OK, right. Yes, I'm not sure whether that really should be part of something like ripgrep. For example, I've been thinking about writing an extension for VS Code, like a regex helper, that would take common patterns or templates for such use cases, let you plug in the values, and generate the regex.

BurntSushi · 2017-03-17T14:38:11Z

@dakaraphi Great! I think we're on the same page now. :-) Thanks for poking!

rshpeley · 2017-04-30T05:34:04Z

@dakaraphi directed me here from Microsoft/vscode #13155

It looks like one of the most common requests for searching across multiple lines is related to text editors. At the moment, my needs are very simple. If I can get a match across multiple files in a project for a multiline selection, even if it's fully literal, I could work with it. For most text editors, the menu option to search across multiple lines is separate from a simple search, and so a ripgrep flag, as @BurntSushi suggested, would naturally fit this use case.

I'm still making it through @BurntSushi's anatomy of a grep link, but it appears to me that a multiline search for text editors mostly requires a literal search with some multiple literals (white space, line endings) and therefore the search won't even make it to the regex engine for these cases.

Isn't the multiple line selection just a contiguous sequence of bytes (in the fully literal case) to be matched in a buffer? Or am I missing something related to optimisation here?

I'm sure people will come up with cases where a regex in a multiline search/replace would be mighty handy, but I think support for the simpler multiple literal multiline case would be a good start to give some text editors (such as vscode and atom) missing functionality.

btw, a most excellent ripgrep article @BurntSushi!

priyadarshan · 2017-04-30T06:30:54Z

Multi-line searching would be a boon to many. See for example this use case.

mateon1 · 2018-08-17T03:00:28Z

Some nits:

> This flag causes '.' to match new lines

Should be newlines for consistency reasons

> requires that each file it searches appear as if it exists contiguously in

appears, but perhaps this section could be worded differently.
Maybe: ... ripgrep requires that the searched file is laid out/mapped/allocated contiguously in memory
I'm unsure which wording is the best (I prefer laid out, but maybe that's not appropriate for documentation), but all three sound better to me than the existing version.

> Specifically, if the --multiline flag is provided by the regex
> cannot match over multiple lines

s/by/but/

waldyrious · 2018-08-17T08:59:38Z

@BurntSushi I'm glad you agree with the suggestions! The reworded sentence is indeed much clearer, after fixing the typo pointed out by @mateon1.

Here's the diff of that sentence, for future reference/convenience:

-That is, even if you use the --multiline flag but your regex cannot
-match over multiple lines, then ripgrep won't consume unnecessary resources.
+Specifically, if the --multiline flag is provided but the regex
+cannot match over multiple lines, then ripgrep won't read each file into memory
+before searching it.

Now that I re-read that, I'm not sure "cannot match" is the best choice of words, since it can imply both a neutral statement or an imperative enforcement. (Not sure I'm being clear myself; let me know if I should rephrase!)

I suppose you're referring to the case where the regex does not contain any patterns that would match newlines, or it contains . without the dotall flag being activated. Is that correct?

BurntSushi · 2018-08-17T10:54:06Z

@mateon1 Thanks! I took your advice, and chose "laid out."

@waldyrious

> I suppose you're referring to the case where the regex does not contain any patterns that would match newlines, or it contains . without the dotall flag being activated. Is that correct?

Yes. Whether dotall is enabled or not is mostly orthogonal; what matters is whether a \n exists in any of the possible matches of a regex. Enabling dotall and uttering . is one way to achieve that, but a literal \n, \s, \p{any} and so on also achieve that.
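
A quick way to see that dotall is only one of several routes to a match containing \n is the Python sketch below. It uses Python's `re` as a stand-in for ripgrep's engine, `match_spans_newline` is a hypothetical helper, and \p{any} is left out because Python's `re` does not support \p{...} classes.

```python
import re

TEXT = "a\nb"


def match_spans_newline(pattern: str, text: str = TEXT) -> bool:
    """Return True if `pattern` finds a match in `text` that contains a line feed."""
    m = re.search(pattern, text)
    return m is not None and "\n" in m.group(0)
```

Against the text `a`, line feed, `b`: the pattern `a.b` does not match, `(?s)a.b` does, and so do `a\s+b` and `a\nb` without any dotall flag.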

It is possible I should just remove this part of the docs. I'm not sure. I put it there as a way of saying that even if you enable multiline mode but don't make use of it, you generally won't pay (much) for it. But maybe that's not that important.

waldyrious · 2018-08-17T14:07:20Z

I think it wouldn't be a problem if it were removed, but it is useful information so I'd have a slight preference to keep it.

IMO changing that sentence to something like this:

"Specifically, if the --multiline flag is provided, but the regex ~~cannot match over multiple lines~~ does not contain patterns that would match \n characters, then ripgrep ~~won't read~~ will automatically avoid reading each file into memory before searching it."

...would make it sufficiently unambiguous.

BurntSushi · 2018-08-17T14:19:19Z

@waldyrious I like it. Much better. Thanks! :)

This commit updates the CHANGELOG to reflect all the work done to make libripgrep a reality.

* Closes #162 (libripgrep)
* Closes #176 (multiline search)
* Closes #188 (opt-in PCRE2 support)
* Closes #244 (JSON output)
* Closes #416 (Windows CRLF support)
* Closes #917 (trim prefix whitespace)
* Closes #993 (add --null-data flag)
* Closes #997 (--passthru works with --replace)
* Fixes #2 (memory maps and context handling work)
* Fixes #200 (ripgrep stops when pipe is closed)
* Fixes #389 (more intuitive `-w/--word-regexp`)
* Fixes #643 (detection of stdin on Windows is better)
* Fixes #441, Fixes #690, Fixes #980 (empty matching lines are weird)
* Fixes #764 (coalesce color escapes)
* Fixes #922 (memory maps failing is no big deal)
* Fixes #937 (color escapes no longer used for empty matches)
* Fixes #940 (--passthru does not impact exit status)
* Fixes #1013 (show runtime CPU features in --version output)

myfairsyer · 2018-08-23T21:25:19Z

Will \n only match \n / 0x0A, or any common single line break (\r?\n), or, if you take the classic MacOS and BBC into account, ((\n\r?)|(\r\n?))?

(I do know that both styles exist among regex engines but couldn't tell which is which)

Sorry if there is an answer to that somewhere.

BurntSushi · 2018-08-23T21:52:35Z

\n only matches \n.

Current master has a --crlf option that causes $ to match \r\n line breaks in addition to \n.

I'm not aware of any regex engines that permit a literal \n to match \r\n. Some regex engines certainly allow for a looser definition of what "line terminator" actually means when necessary, e.g., when matching the ^ or $ anchors. If you know of a regex engine that permits a literal \n to match \r\n then I'd like to have a link to that so I can investigate!

roblourens · 2018-08-23T23:36:21Z

VS Code matches \r\n on \n when ctrl+f searching in a single file. It's useful in an editor, but I wouldn't use that as inspiration for ripgrep.

BurntSushi · 2018-08-23T23:43:12Z

Oh interesting. Is that something VS Code layers on, or is it part of JS regexes?

roblourens · 2018-08-23T23:46:10Z

No, it's just something vscode does.

myfairsyer · 2018-08-24T06:38:59Z

> If you know of a regex engine that permits a literal \n to match \r\n then I'd like to have a link to that so I can investigate!

@BurntSushi Most probably I only encountered it in text editors like VSCode.

> it's useful in an editor but I wouldn't use that as inspiration for ripgrep.

@roblourens Would you mind to elaborate?
And does that mean that VSCode will behave differently inside an editor and when searching across files?

roblourens · 2018-08-24T15:18:25Z

Personally I don't prefer "magic" like that, but yeah I'll have to see whether we can rewrite \n to \r?\n so that search across files works the same as search inside files.
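
The rewrite mentioned here, and the fact that a literal \n matches only a line feed, can be sketched in Python. This is a deliberately naive illustration, not what VS Code or ripgrep actually does: `lf_to_optional_crlf` is a hypothetical helper, and a plain string replacement mishandles patterns that contain an escaped backslash followed by n, or \n inside a character class such as [^\n].

```python
import re


def lf_to_optional_crlf(pattern: str) -> str:
    r"""Naively rewrite each \n escape in a regex to \r?\n (assumption: no classes or escaped backslashes)."""
    return pattern.replace(r"\n", r"\r?\n")


# A literal \n in the pattern does not match a CRLF line break.
crlf_text = "foo\r\nbar"
strict = re.search(r"foo\nbar", crlf_text)
loose = re.search(lf_to_optional_crlf(r"foo\nbar"), crlf_text)
```

Here `strict` is `None` and `loose` is a match; the rewritten pattern still matches plain LF input as well.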

myfairsyer · 2018-08-24T16:11:07Z

> > it's useful in an editor but I wouldn't use that as inspiration for ripgrep.
>
> @roblourens Would you mind to elaborate?

> Personally I don't prefer "magic" like that

@roblourens
I was rather driving at the distinction between text editor and ripgrep.
I couldn't quite follow.
Is it because you consider ripgrep, as a command line tool, to have a more advanced audience that demands more control and less magic than a graphical text editor?

@BurntSushi
I don't want to derail or hijack this thread for irrelevant discussions.
You said you'd like to know more and investigate and found VSCode's behavior interesting.
If you don't anymore, tell me.

wmww · 2018-10-31T20:41:39Z

Currently, if you try to make a multiline search without the -U/--multiline option, ripgrep errors with "the literal '"\n"' is not allowed in a regex". Would it make sense to mention the existence of a multiline enabling option here?

BurntSushi · 2018-10-31T21:08:03Z

@wmww That should already be done on master. See: #1055

Also, please file new issues for new requests.

unphased · 2020-11-25T22:33:00Z

Hi, I'm curious if there is a way to make multiline dot non-greedy? I tried (?s).*? and .*? under --multiline-dotall and neither worked. It seems with multiline mode, the .*? fails to become non-greedy. Is there an underlying reason for this?

~~[^>] and such work under multiline mode, though, which is the less general way to do sort of non-greedy stuff.~~

BurntSushi · 2020-11-25T22:39:22Z

Please file a new issue. And please don't use phrases like "does not work" without actually showing what you mean by it. Please fill out the complete bug template.

unphased · 2020-11-26T00:22:26Z

You're right, sorry for trying to resurrect and derail an old issue. I did more testing on this and I think I neglected to consider something in my test. It all seems to work as expected.

amosbird · 2021-12-01T04:10:36Z

Do we have an option to limit the max number of lines each match can have?

BurntSushi · 2021-12-01T13:07:55Z

No. And I don't see any obvious way to implement that either. You can usually build such limits into your regex instead.

amosbird · 2021-12-01T13:13:32Z

> No. And I don't see any obvious way to implement that either. You can usually build such limits into your regex instead.

Can we build a regex to match the following? foo.*[at most three new lines]bar.*[at most three new lines]baz.* (all together at most three new lines)

BurntSushi · 2021-12-01T13:39:10Z

@amosbird Sure? Why not? foo(.*\n?){0,3}bar(.*\n?){0,3}, or something like that anyway.
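
The pattern above can be tried out quickly in Python. This is a hedged illustration: Python's `re` stands in for ripgrep's engine, and the sample inputs are made up for the example. Each repetition of the group consumes at most one line break, so `bar` has to appear within three line breaks of `foo`.

```python
import re

# Pattern copied from the comment above; . does not match a line feed here.
PATTERN = re.compile(r"foo(.*\n?){0,3}bar(.*\n?){0,3}")

near = "foo\na\nb\nbar\n"       # three line breaks between foo and bar
far = "foo\na\nb\nc\nd\nbar\n"  # five line breaks between foo and bar

near_hit = PATTERN.search(near) is not None
far_hit = PATTERN.search(far) is not None
```

`near_hit` is `True` and `far_hit` is `False`. As the comment says, this limits each gap separately; it does not express "at most three new lines all together" across three terms.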

In the future, I'd really prefer you open new tickets for support questions. Bumping old issues doesn't make these discussions easy to find. There is even a Q&A forum designed for this purpose. Please use it.

amosbird · 2021-12-01T14:11:33Z

> In the future, I'd really prefer you open new tickets for support questions. Bumping old issues doesn't make these discussions easy to find. There is even a Q&A forum designed for this purpose. Please use it.

Sure. Will continue the discussion in the Q&A forum.

BurntSushi closed this as completed Oct 13, 2016

timotheecour mentioned this issue Feb 12, 2017

multiline search for simple cases #360

Closed

BurntSushi reopened this Mar 17, 2017

BurntSushi mentioned this issue Mar 17, 2017

Explore using a native third-party search tool such as ripgrep or Silver Searcher microsoft/vscode#19983

Closed

BurntSushi added the libripgrep label (an issue related to modularizing ripgrep into libraries) Mar 17, 2017

BurntSushi added this to the libripgrep milestone Mar 17, 2017

paldepind added a commit to paldepind/ripgrep that referenced this issue Mar 23, 2017

Remove statement about never supporting multiline search

de3f4a4

After [this comment](BurntSushi#176 (comment)) it seems like the statement about never supporting multiline search should be removed.

jeancroy mentioned this issue Mar 25, 2017

No support for lookbehind atom/find-and-replace#571

Closed

BurntSushi added the enhancement label (an enhancement to the functionality of the software) Apr 9, 2017

d-akara mentioned this issue Apr 29, 2017

Support multi-line search for Global search microsoft/vscode#13155

Closed

TheNetAdmin mentioned this issue Aug 17, 2018

Multiline string highlight error textmate/yaml.tmbundle#28

Open

BurntSushi mentioned this issue Aug 19, 2018

libripgrep: PCRE2 support, multiline search, JSON output and more #1017

Merged

BurntSushi closed this as completed in #1017 Aug 20, 2018
