
Feature request: optionally enable use of cache also for plain text files #56

Closed
m040601 opened this issue May 28, 2020 · 3 comments


m040601 commented May 28, 2020

First of all, thank you for your work on this tool.
Very useful and cleverly designed.
And thank you for remembering ARM
and providing a binary release as well. Otherwise, compiling Rust on the Raspberry Pi and friends takes GBs and hours.

Now for my request.
I'm very interested in this cache functionality.
It works wonderfully.
It allows me to, for example, have a huge
collection of academic books, PDFs, EPUBs, etc. that change neither content nor location in the file system, and have a kind of index for fast search. And when I say huge, I mean really huge. No more worrying about which format I got that resource in. I now have a single tool to search them all.

I noticed ripgrep-all doesn't create the "data.mdb" file when I only
search in a collection of plain text files.
E.g. rga python /usr/share/nvim/runtime/docs

I'm no programmer, so please correct me if I'm wrong.

1. Is this "data.mdb" file a kind of index?

Or is it really just a cache, so that the conversion work of "heavy tools" like pdftotext doesn't have to be repeated, while the "search work" of ripgrep is still repeated each time?

I'm asking because I would also like to "index" huge amounts of
plain text files: years of notes, wikis in plain Markdown (e.g. git wikis), downloaded websites converted to plain text, user documentation files (e.g. the vim docs), and so on.

There are other specialized tools for this,
e.g. desktop utilities or recoll,
but I prefer the command line, and to use grep or ripgrep.

2. Would it be possible to search against this "data.mdb" without specifying the location of the files?

rga some-important-fact-regex data.mdb

So that I don't have to specify which file or folder to look in. I don't remember where it is; I just know I read it someday in one of my books/notes/papers. That data.mdb would just be my personal knowledge base.

It's the old dream of having your text files indexed, all in one file,
instead of running grep over all of them again each time you look for something.

Or am I seeing it wrong, and it doesn't make sense to "index" plain text
files, since the computational cost of running grep or ripgrep on them is low? Am I confusing indexing and caching? Is an "index" created with ripgrep something completely different?


phiresky (Owner) commented Jun 6, 2020

> I'm very interested in this cache functionality.
> It works wonderfully.

Thank you! :)

The index is actually a full copy of previously searched files, not an "index" as you might expect. The reason it's faster for PDFs etc. is obviously mostly that the extractors don't have to be run multiple times - but for plain text files it can still be faster:

  1. The cache is compressed with an awesome compression algorithm. Since text files compress to maybe 10% of their original size, and since zstd can decompress at around 2 GB/s, reading those files may be 10x faster than reading the original text files from disk (assuming the disk is the bottleneck, which it probably is). It also means that cached files may be closer together on disk, so readahead etc. can catch multiple of them at once (see the sketch after this list).
  2. The cache is in your home cache directory, which has a higher probability of being on an SSD than the disk your data is on.
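
To make the numbers in point 1 a bit more concrete, here is a minimal Rust sketch using the `zstd` crate (illustrative only, not rga's actual caching code; the input file name is made up):

```rust
// Minimal sketch: compress some extracted text with zstd and measure the
// size ratio, then decompress it again. Plain prose often shrinks to a
// small fraction of its original size, and decompression is fast enough
// that disk I/O usually remains the bottleneck.
use std::fs;

fn main() -> std::io::Result<()> {
    // Hypothetical input: the plain text that would otherwise be re-read.
    let text = fs::read("extracted.txt")?;

    // Compress at a moderate level (3 is zstd's usual default).
    let compressed = zstd::encode_all(&text[..], 3)?;
    println!(
        "original: {} bytes, compressed: {} bytes ({:.0}% of original)",
        text.len(),
        compressed.len(),
        100.0 * compressed.len() as f64 / text.len() as f64
    );

    // Round-trip to show that the cached copy reproduces the original text.
    let roundtrip = zstd::decode_all(&compressed[..])?;
    assert_eq!(roundtrip, text);
    Ok(())
}
```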

The search work is repeated every time, yes, but ripgrep is blazingly fast, so in many cases you won't even notice :)

The above only applies if your operating system's disk cache is either cold or full, though - if you run rga twice in fairly close succession, your OS will cache all the content of the read files in RAM anyway, so reading them from the rga cache doesn't help at all and costs extra CPU time.

Since it's not easy to tell when the above benefits do and don't apply, I don't think it would be a good idea to implement text caching - in most cases it would probably just bloat the cache db.

I think what you're looking for is basically a different solution, like recoll (which has a no-gui option, I think), or something based on https://xapian.org/features. I don't really have an overview of the available tools here.

burntsushi is also thinking about adding indexing support to ripgrep itself, so that may be interesting in the future: BurntSushi/ripgrep#95

> 2. Would it be possible to search against this "data.mdb" without specifying the location of the files?

That could in theory be possible, since the data.mdb file is basically just a huge list of key-value pairs mapping [filename, mtime, extractor] -> [contents], but it's not guaranteed in any way to be complete (files may not be cached for multiple reasons, e.g. they're too large, the extractor runs too fast, etc.). So I don't think it would be very useful to implement, except for data-recovery reasons.
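
For illustration only (this is not rga's real schema; the struct and field names are just stand-ins), the cache can be pictured as a key-value map like this, with the real data living on disk in the data.mdb file:

```rust
// Illustrative sketch of the cache as a key-value map. rga keeps the real
// cache on disk (the data.mdb file); an in-memory HashMap is only used here
// to show the shape of the keys and values.
use std::collections::HashMap;

#[derive(Debug, Hash, PartialEq, Eq)]
struct CacheKey {
    filename: String,  // path of the original file
    mtime: u64,        // modification time, so changed files get re-extracted
    extractor: String, // which adapter produced the text, e.g. "pdftotext"
}

fn main() {
    // value = extracted (and compressed) text of the file
    let mut cache: HashMap<CacheKey, Vec<u8>> = HashMap::new();

    cache.insert(
        CacheKey {
            filename: "/books/some-paper.pdf".to_string(),
            mtime: 1_590_000_000,
            extractor: "pdftotext".to_string(),
        },
        b"extracted text of the pdf ...".to_vec(),
    );

    // Searching "the cache alone" would mean iterating over these values,
    // but nothing guarantees that every file has an entry here.
    for (key, contents) in &cache {
        println!("{:?} -> {} bytes", key, contents.len());
    }
}
```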

Kristinita commented

> The search work is repeated every time, yes, but ripgrep is blazingly fast, so in many cases you won't even notice :)

Sometimes, when I repeatedly run ripgrep-all on directories with nested subdirectories (which probably means the cache is being used):

  1. my PC hangs, or
  2. ripgrep-all runs for several minutes.

I'm waiting for #63 so that I can diagnose the problem and post a detailed bug report.

Thanks.

phiresky (Owner) commented

In summary, caching plain text files is possible in theory, but it's not likely to be a very good idea, since the cache would just be a compressed copy of the plain text files.
