Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ripgrep slower than grep #838

Closed
bbbsg opened this issue Feb 27, 2018 · 1 comment · Fixed by #1238
Closed

ripgrep slower than grep #838

bbbsg opened this issue Feb 27, 2018 · 1 comment · Fixed by #1238
Labels
bug A bug. icebox A feature that is recognized as possibly desirable, but is unlikely to implemented any time soon.

Comments

@bbbsg
Copy link

bbbsg commented Feb 27, 2018

What version of ripgrep are you using?

ripgrep 0.8.1 (rev c8e9f25)
+SIMD -AVX

What operating system are you using ripgrep on?

Debian Stretch 64 bit

Describe your question, feature request, or bug.

ripgrep is at least 4 times slower than grep on a certain use case (never waited long enough for it to return). See data with test files attached.
ripgrep_data_test.tar.gz

attached file is around 15 mb uncompressed, may be possible to reduce the sample further but the original files i was dealing with were 100X.

See README.md and test.sh , i tried with and without --dfa-size-limit

Another minor issue I noticed is that a hint to use --fixed-strings us provided to the user even when using -F already, perhaps this only happens when regex-size-limit is not passed in.

Ideally ripgrep would be faster than grep :-)

If this is a bug, what are the steps to reproduce the behavior?

Assumes rg is in the directory or already in the path, then run test.sh

@BurntSushi BurntSushi added the bug A bug. label Feb 27, 2018
@BurntSushi
Copy link
Owner

Thanks for the bug report! I was able to reproduce this. It looks like most of the time spent here is just compiling the patterns (since there are >100K). As you pointed out on IRC, using Aho-Corasick here should help quite a bit.

@BurntSushi BurntSushi added the libripgrep An issue related to modularizing ripgrep into libraries. label Mar 10, 2018
@BurntSushi BurntSushi added this to the libripgrep milestone Mar 10, 2018
@BurntSushi BurntSushi removed the libripgrep An issue related to modularizing ripgrep into libraries. label Aug 19, 2018
@BurntSushi BurntSushi removed this from the libripgrep milestone Aug 19, 2018
@BurntSushi BurntSushi added the icebox A feature that is recognized as possibly desirable, but is unlikely to implemented any time soon. label Jan 27, 2019
BurntSushi added a commit that referenced this issue Apr 7, 2019
This makes the case of searching for a dictionary of a very large number
of literals much much faster. (~10x or so.) In particular, we achieve this
by short-circuiting the construction of a full regex when we know we have
a simple alternation of literals. Building the regex for a large dictionary
(>100,000 literals) turns out to be quite slow, even if it internally will
dispatch to Aho-Corasick.

Even that isn't quite enough. It turns out that even *parsing* such a regex
is quite slow. So when the -F/--fixed-strings flag is set, we short
circuit regex parsing completely and jump straight to Aho-Corasick.

We aren't quite as fast as GNU grep here, but it's much closer (less than
2x slower).

In general, this is somewhat of a hack. In particular, it seems plausible
that this optimization could be implemented entirely in the regex engine.
Unfortunately, the regex engine's internals are just not amenable to this
at all, so it would require a larger refactoring effort. For now, it's
good enough to add this fairly simple hack at a higher level.

Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will
be slower, because of the aforementioned missing optimization. Moreover,
passing flags like `-i` or `-S` will cause ripgrep to abandon this
optimization and fall back to something potentially much slower. Again,
this fix really needs to happen inside the regex engine, although we
might be able to special case -i when the input literals are pure ASCII
via Aho-Corasick's `ascii_case_insensitive`.

Fixes #497, Fixes #838
BurntSushi added a commit that referenced this issue Apr 7, 2019
This makes the case of searching for a dictionary of a very large number
of literals much much faster. (~10x or so.) In particular, we achieve this
by short-circuiting the construction of a full regex when we know we have
a simple alternation of literals. Building the regex for a large dictionary
(>100,000 literals) turns out to be quite slow, even if it internally will
dispatch to Aho-Corasick.

Even that isn't quite enough. It turns out that even *parsing* such a regex
is quite slow. So when the -F/--fixed-strings flag is set, we short
circuit regex parsing completely and jump straight to Aho-Corasick.

We aren't quite as fast as GNU grep here, but it's much closer (less than
2x slower).

In general, this is somewhat of a hack. In particular, it seems plausible
that this optimization could be implemented entirely in the regex engine.
Unfortunately, the regex engine's internals are just not amenable to this
at all, so it would require a larger refactoring effort. For now, it's
good enough to add this fairly simple hack at a higher level.

Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will
be slower, because of the aforementioned missing optimization. Moreover,
passing flags like `-i` or `-S` will cause ripgrep to abandon this
optimization and fall back to something potentially much slower. Again,
this fix really needs to happen inside the regex engine, although we
might be able to special case -i when the input literals are pure ASCII
via Aho-Corasick's `ascii_case_insensitive`.

Fixes #497, Fixes #838
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug. icebox A feature that is recognized as possibly desirable, but is unlikely to implemented any time soon.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants