Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected (or incorrect?) multiline replace behavior #1311

Closed
mqudsi opened this issue Jun 23, 2019 · 7 comments
Closed

Unexpected (or incorrect?) multiline replace behavior #1311

mqudsi opened this issue Jun 23, 2019 · 7 comments
Labels
bug A bug. rollup A PR that has been merged with many others in a rollup.

Comments

@mqudsi
Copy link

mqudsi commented Jun 23, 2019

What version of ripgrep are you using?

ripgrep 11.0.1 (rev 34677d2)
+SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

cargo from git

What operating system are you using ripgrep on?

Distributor ID: Debian
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster

Describe your question, feature request, or bug.

I was trying to replace some of my usages of other commands (sed, tr, ag) with ripgrep now that it has proper multiline and pcre2 support features, and ran into the following odd behavior:

> printf "hello\nworld\n" | rg -U "\n" --color always 
hello
world 

This works fine, everything matches, nothing is highlighted because the newline can't be colored.

But when I try to replace, I get the following behavior:

> printf "hello\nworld\n" | rg -U "\n" -r "?" 
hello?
world?

which isn't correct because the newline has been correctly matched but incorrectly retained in the output.

Whereas the following does work:

> printf "hello\nworld\n" | rg --null-data "\n" -r "?"
hello?world?

But I don't see any documented reason why the former shouldn't give the same output as well. I can understand technically why this is happening (rg internally preserves the concept of lines but merely glosses over them for search purposes with -U), but the confluence of both -U and -r results in incorrect behavior: the output should either include the replacement character and no new lines (desirable) or the newlines and no replacement character (unfortunate but understandable limitation). I don't think including both can be considered correct, however.

@mqudsi
Copy link
Author

mqudsi commented Jun 23, 2019

Actually, this has to be a mere oversight because the following does work:

> printf "hello\nworld\n" | rg -U o\nw -r '?'
hell?orld

and I can even trick rg into giving me the behavior I'm looking for with this:

> printf "hello\nworld\n" | rg -U '(.?)\n(.?)' -r '$1?$2'
hello?world

(although that fails when there are empty lines in the input)

@learnbyexample
Copy link

My understanding is that -U doesn't strictly behave as if entire input is single input string. Rather, it allows the match to cross multiple lines when necessary, like the printf "hello\nworld\n" | rg -U 'o\nw' -r '?' example.

You might feel --null-data is working, but that is simply adding NUL character for each block of output, instead of newline. It so happens you have only one block in your example as there is no NUL character in input. printf "hello\nworld\n" | perl -0777 -pe 's/\n/?/g' is an example for slurping entire input as single string.

@mqudsi
Copy link
Author

mqudsi commented Jun 24, 2019

You might feel --null-data is working, but that is simply adding NUL character for each block of output, instead of newline.

Yes, and that's why I would like -U to work instead ;)

@BurntSushi BurntSushi added the bug A bug. label Jun 24, 2019
@BurntSushi
Copy link
Owner

This is likely a bug in how new lines are handled by the printer. I don't know when I'll get around to it.

@mqudsi
Copy link
Author

mqudsi commented Jun 25, 2019

Thanks for the confirmation anyway, @BurntSushi!

@dkasak
Copy link

dkasak commented Jun 4, 2020

Just found this bug myself. Here's some more weirdness in case it points to the source of the bug.

➜ printf 'foo\n bar' | rg --multiline --passthru --replace '?' '\s+'            
foo?bar

➜ printf 'foo \nbar' | rg --multiline --passthru --replace '?' '\s+' 
foo?
bar

➜ printf 'foo \n bar' | rg --multiline --passthru --replace '?' '\s+'
foo?bar

@BurntSushi BurntSushi added the rollup A PR that has been merged with many others in a rollup. label May 31, 2021
BurntSushi added a commit that referenced this issue Jun 1, 2021
This commit fixes a subtle bug in multi-line replacement of line
terminators.

The problem is that even though ripgrep supports multi-line searches, it
is *still* line oriented. It still needs to print line numbers, for
example. For this reason, there are various parts in the printer that
iterate over lines in order to format them into the desired output.

This turns out to be problematic in some cases. #1311 documents one of
those cases (with line numbers enabled to highlight a point later):

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?
    2:world?

But the desired output is this:

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?world?

At first I had thought that the main problem was that the printer was
taking ownership of writing line terminators, even if the input already
had them. But it's more subtle than that. If we fix that issue, we get
output like this instead:

    $ printf "hello\nworld\n" | rg -n -U "\n" -r "?"
    1:hello?2:world?

Notice how '2:' is printed before 'world?'. The reason it works this way
is because matches are reported to the printer in a line oriented way.
That is, the printer gets a block of lines. The searcher guarantees that
all matches that start or end in any of those lines also end or start in
another line in that same block. As a result, the printer uses this
assumption: once it has processed a block of lines, the next match will
begin on a new and distinct line. Thus, things like '2:' are printed.

This is generally all fine and good, but an impedance mismatch arises
when replacements are used. Because now, the replacement can be used to
change the "block of lines" approach. Now, in terms of the output, the
subsequent match might actually continue the current line since the
replacement might get rid of the concept of lines altogether.

We can sometimes work around this. For example:

    $ printf "hello\nworld\n" | rg -U "\n(.)?" -r '?$1'
    hello?world?

Why does this work? It's because the '(.)' after the '\n' causes the
match to overlap between lines. Thus, the searcher guarantees that the
block sent to the printer contains every line.

And there in lay the solution: all we need to do is tweak the multi-line
searcher so that it combines lines with matches that directly adjacent,
instead of requiring at least one byte of overlap. Fixing that solves
the issue above. It does cause some tests to fail:

* The binary3 test in the searcher crate fails because adjacent line
  matches are now one part of block, and that block is scanned for
  binary data. To preserve the essence of the test, we insert a couple
  dummy lines to split up the blocks.
* The JSON CRLF test. It was testing that we didn't output any messages
  with an empty 'submatches' array. That is indeed still the case. The
  difference is that the messages got combined because of the adjacent
  line merging behavior. This is a slight change to the output, but is
  still correct.

Fixes #1311
@krtab
Copy link

krtab commented Jun 16, 2021

Hi @BurntSushi!

Just wanted to thank you for fixing this. I have just encountered this very bug, and it is very pleasing to see "closed 16 days ago"! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug. rollup A PR that has been merged with many others in a rollup.
Projects
None yet
Development

No branches or pull requests

5 participants