ripgrep doesn't match arbitrary bytes within a file #1339

Closed
sersorrel opened this issue Aug 5, 2019 · 1 comment
Labels
doc An issue with or an improvement to documentation.

Comments

@sersorrel

What version of ripgrep are you using?

ripgrep 11.0.1 (rev 1f1cd9b)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

GitHub deb, I think

What operating system are you using ripgrep on?

Ubuntu 18.04

Describe your question, feature request, or bug.

grep finds arbitrary bytes within a binary file, but ripgrep does not.

If this is a bug, what are the steps to reproduce the behavior?

$ grep $'\xa7' <(printf '\xa7')
Binary file /dev/fd/63 matches
$ echo $?
0
$ rg -uuu --text --binary '\xa7' <(printf '\xa7')
$ echo $?
1

If this is a bug, what is the actual behavior?

$ rg -uuu --text --binary --debug '\xa7' <(printf '\xa7')
DEBUG|grep_regex::literal|grep-regex/src/literal.rs:59: literal prefixes detected: Literals { lits: [Complete(§)], limit_size: 250, limit_class: 10 }
DEBUG|globset|globset/src/lib.rs:435: built glob set; 0 literals, 0 basenames, 11 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|globset/src/lib.rs:435: built glob set; 0 literals, 0 basenames, 11 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|globset/src/lib.rs:435: built glob set; 0 literals, 0 basenames, 11 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

If this is a bug, what is the expected behavior?

Either ripgrep should be able to search for arbitrary bytes, or it should not print a message implying that it can do so:

$ rg $'\xa7'
found invalid UTF-8 in pattern at byte offset 0 (use hex escape sequences to match arbitrary bytes in a pattern, e.g., \xFF): '\xA7'
@BurntSushi
Owner

By default, patterns are Unicode aware, so all escape sequences identify the codepoint to search for, which is not the same as the byte to search for. To search for a raw byte, you need to disable Unicode. For example, (?-u:\xa7) should do what you want. This is documented in the syntax docs linked from the man page.

The error message is indeed wrong, or at least, incomplete.
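
To illustrate the codepoint-versus-byte distinction, here is a minimal sketch in Python rather than with ripgrep itself. It assumes only that a Unicode-aware pattern is matched against the UTF-8 encoding of the codepoint, which is consistent with the `Complete(§)` literal in the debug output above. The names `haystack`, `unicode_needle`, `byte_needle`, and `contains` are made up for this example.

```python
# The file from the report contains the single raw byte 0xA7.
haystack = b"\xa7"

# In Unicode mode, \xa7 names the codepoint U+00A7 (SECTION SIGN).
# Its UTF-8 encoding is two bytes, so the search looks for C2 A7.
unicode_needle = "\u00a7".encode("utf-8")

# With Unicode disabled, \xa7 means the single byte 0xA7.
byte_needle = b"\xa7"


def contains(needle: bytes, data: bytes) -> bool:
    """Return True if the byte sequence `needle` occurs in `data`."""
    return needle in data


print(unicode_needle.hex())                # bytes searched for in Unicode mode
print(contains(unicode_needle, haystack))  # codepoint search against the raw byte
print(contains(byte_needle, haystack))     # raw byte search against the raw byte
```

The second needle corresponds to the `(?-u:\xa7)` pattern suggested above; the first corresponds to a plain `\xa7`, which is why the original command exits with status 1.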

BurntSushi added the doc label (an issue with or an improvement to documentation) on Aug 5, 2019
BurntSushi added a commit that referenced this issue on May 8, 2020:
When a pattern with invalid UTF-8 is given, the error message suggests
unqualified use of hex escape sequences to match arbitrary bytes. But
you *also* need to disable Unicode mode. So include that in the error
message.

Fixes #1339