Feature request: optionally enable use of cache also for plain text files #56
Comments
Thank you! :) The index is actually a full copy of previously searched files, not an "index" as you might expect. The reason it's faster for pdfs etc is obviously mostly due to the extractors not having to be run multiple times - but for plain text files this can still be faster:
The search work is repeated every time, yes, but ripgrep is blazingly fast, so in many cases you don't even notice :) The above only applies if your operating system's disk cache is either cold or full, though. If you run rga twice in fairly close succession, your OS will cache all the content of the read files in RAM anyway, so reading those from the rga cache doesn't help and costs extra CPU time.
Since it's not easy to tell when the above benefits do and don't apply, I don't think it would be a good idea to implement text caching. In most cases it would probably just bloat the cache db.
I think what you're basically looking for is a different solution, like recoll (it has a no-GUI option, I think), or something based on https://xapian.org/features . I don't really have an overview of the available tools here. BurntSushi is also thinking about adding indexing support to ripgrep itself, so that may be interesting in the future: BurntSushi/ripgrep#95
That could be possible in theory, since the data.mdb file is basically just a huge list of key-value pairs of the form [filename, mtime, extractor]->[contents], but it's not guaranteed in any way to be complete (files may not be cached for multiple reasons, e.g. too large, extractor runs too fast, etc). So I don't think it would be very useful to implement, except for data recovery reasons.
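To make the key-value layout described above more concrete, here is a minimal Python sketch. It is not rga's actual code: the in-memory dict, the names `cached_extract` and `search_cache`, and the `MAX_CACHED_BYTES` limit are all hypothetical and only model the [filename, mtime, extractor]->[contents] idea and the "too large" reason for skipping a file.

```python
import os

# Hypothetical in-memory stand-in for the on-disk key-value store.
cache = {}
MAX_CACHED_BYTES = 2_000_000  # assumed size limit, for illustration only


def cached_extract(path, extractor_name, extract):
    """Return extracted text, reusing the cached copy while the file is unchanged."""
    # The key mirrors the [filename, mtime, extractor] triple from the comment above.
    key = (os.path.abspath(path), os.stat(path).st_mtime_ns, extractor_name)
    if key in cache:
        return cache[key]
    text = extract(path)  # the expensive step (e.g. a PDF-to-text conversion)
    if len(text.encode("utf-8")) <= MAX_CACHED_BYTES:
        cache[key] = text  # overly large outputs are simply not stored
    return text


def search_cache(needle):
    """Scan cached contents only; files that were never cached are silently missed."""
    return sorted(name for (name, _mtime, _ext), text in cache.items() if needle in text)
```

Searching only the cache, as `search_cache` does, would miss every file that was skipped or never visited, which is why such a search could not be relied on to be complete.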
Sometimes, when I repeatedly run ripgrep-all for directories with nested subdirectories (which probably means the cache is being used):
I'm waiting for #63 so that I can determine the problem and post a detailed bug report. Thanks.
In summary, caching plain text files is possible in theory but not likely to be a very good idea, since the cache would just be a compressed copy of the plain text files.
First of all, thank you for your work on this tool.
It is very useful and cleverly designed.
And thank you for remembering ARM and providing a binary release as well. Otherwise, compiling Rust on the Raspberry Pi and friends takes GBs and hours.
Now for my request.
I'm very interested in this cache functionality.
It works wonderfully.
It allows me to, for example, have a huge collection of academic papers, books, PDFs, EPUBs, etc. that change neither content nor location in the file system, and have a kind of index for fast search. And when I say huge, I mean really huge. No more worrying about which format I got a resource in. I now have a single tool to search them all.
I noticed that ripgrep-all doesn't create the "data.mdb" file when I only search in a collection of plain text files.
Example: rga python /usr/share/nvim/runtime/docs
I'm no programmer, so please correct me if I'm wrong.
1. This "data.mdb" file is a kind of index, right?
Or is it really just a cache, so that the conversion work of "heavy tools" like pdftotext doesn't have to be repeated, while the "search work" of ripgrep is still repeated each time?
I'm asking because I would also like to "index" huge amounts of plain text files: years of notes, wikis in plain Markdown (e.g. git wikis), downloaded websites converted to plain text, user documentation files (e.g. the Vim docs), and so on.
There are other specialized tools for this, for example desktop utilities or recoll.
But I prefer the command line, and to use grep or ripgrep.
2. Would it be possible to search against this "data.mdb" without specifying the location of the files?
rga some-important-fact-regex data.mdb
That way I wouldn't have to specify which file or folder to look in. I don't remember where it is. I just know I read it someday in one of my books/notes/papers. That data.mdb would just be my personal knowledge base.
It's the old dream of having your text files indexed in one file, instead of running grep over them all over again each time you look for something.
Or am I seeing this wrongly, and it doesn't make sense to "index" plain text files because the computational cost of running grep or ripgrep on them is low? Am I confusing indexing and caching? Is an "index" created with ripgrep something completely different?