-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ripgrep slower than grep #838
Comments
Thanks for the bug report! I was able to reproduce this. It looks like most of the time spent here is just compiling the patterns (since there are >100K). As you pointed out on IRC, using Aho-Corasick here should help quite a bit. |
BurntSushi
added
the
libripgrep
An issue related to modularizing ripgrep into libraries.
label
Mar 10, 2018
BurntSushi
removed
the
libripgrep
An issue related to modularizing ripgrep into libraries.
label
Aug 19, 2018
BurntSushi
added
the
icebox
A feature that is recognized as possibly desirable, but is unlikely to implemented any time soon.
label
Jan 27, 2019
BurntSushi
added a commit
that referenced
this issue
Apr 7, 2019
This makes the case of searching for a dictionary of a very large number of literals much much faster. (~10x or so.) In particular, we achieve this by short-circuiting the construction of a full regex when we know we have a simple alternation of literals. Building the regex for a large dictionary (>100,000 literals) turns out to be quite slow, even if it internally will dispatch to Aho-Corasick. Even that isn't quite enough. It turns out that even *parsing* such a regex is quite slow. So when the -F/--fixed-strings flag is set, we short circuit regex parsing completely and jump straight to Aho-Corasick. We aren't quite as fast as GNU grep here, but it's much closer (less than 2x slower). In general, this is somewhat of a hack. In particular, it seems plausible that this optimization could be implemented entirely in the regex engine. Unfortunately, the regex engine's internals are just not amenable to this at all, so it would require a larger refactoring effort. For now, it's good enough to add this fairly simple hack at a higher level. Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will be slower, because of the aforementioned missing optimization. Moreover, passing flags like `-i` or `-S` will cause ripgrep to abandon this optimization and fall back to something potentially much slower. Again, this fix really needs to happen inside the regex engine, although we might be able to special case -i when the input literals are pure ASCII via Aho-Corasick's `ascii_case_insensitive`. Fixes #497, Fixes #838
BurntSushi
added a commit
that referenced
this issue
Apr 7, 2019
This makes the case of searching for a dictionary of a very large number of literals much much faster. (~10x or so.) In particular, we achieve this by short-circuiting the construction of a full regex when we know we have a simple alternation of literals. Building the regex for a large dictionary (>100,000 literals) turns out to be quite slow, even if it internally will dispatch to Aho-Corasick. Even that isn't quite enough. It turns out that even *parsing* such a regex is quite slow. So when the -F/--fixed-strings flag is set, we short circuit regex parsing completely and jump straight to Aho-Corasick. We aren't quite as fast as GNU grep here, but it's much closer (less than 2x slower). In general, this is somewhat of a hack. In particular, it seems plausible that this optimization could be implemented entirely in the regex engine. Unfortunately, the regex engine's internals are just not amenable to this at all, so it would require a larger refactoring effort. For now, it's good enough to add this fairly simple hack at a higher level. Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will be slower, because of the aforementioned missing optimization. Moreover, passing flags like `-i` or `-S` will cause ripgrep to abandon this optimization and fall back to something potentially much slower. Again, this fix really needs to happen inside the regex engine, although we might be able to special case -i when the input literals are pure ASCII via Aho-Corasick's `ascii_case_insensitive`. Fixes #497, Fixes #838
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
What version of ripgrep are you using?
ripgrep 0.8.1 (rev c8e9f25)
+SIMD -AVX
What operating system are you using ripgrep on?
Debian Stretch 64 bit
Describe your question, feature request, or bug.
ripgrep is at least 4 times slower than grep on a certain use case (never waited long enough for it to return). See data with test files attached.
ripgrep_data_test.tar.gz
attached file is around 15 mb uncompressed, may be possible to reduce the sample further but the original files i was dealing with were 100X.
See README.md and test.sh , i tried with and without --dfa-size-limit
Another minor issue I noticed is that a hint to use --fixed-strings us provided to the user even when using -F already, perhaps this only happens when regex-size-limit is not passed in.
Ideally ripgrep would be faster than grep :-)
If this is a bug, what are the steps to reproduce the behavior?
Assumes rg is in the directory or already in the path, then run test.sh
The text was updated successfully, but these errors were encountered: