-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add decompression support on Unix systems #539
Comments
If that's the case, it seems like it would be easy to implement an independent rust library which is able to transparently open compressed files as a io::Read using any method (streaming external tool via pipes, C FFI or native Rust library), without ripgrep itself actually caring about the method. It still needs some thought though:
|
We should probably punt on archives for now. |
What to do think about anycat crate for the task? |
@tailhook No, I don't think so. That seems like a nifty little utility, but it doesn't pull its weight as a library dependency. Also, that's using C libraries, which I talked about in #225. This particular ticket is strictly for an implementation path that involves shelling out to existing utilities. |
I had a quick glance at this and had the following thought. The standard library provides a By using the above, I was able to get a very trivial solution for this problem by just doing the following.
I tested it out and it was able to spawn the process, do the unzipping and print the expected output to stdout. @BurntSushi, My apologies if this looks like a very naive solution and not what you were looking for at all. The only reason I wanted to post that line of thought here is that I felt it achieved the intended purpose: Support decompression on unix systems at minimal cost to the maintainer for the time being. Please let me know whether this is what you had in mind when you opened this ticket. |
@balajisivaraman Woohoo! Yeah that is exactly what I had in mind. :) The trickier issues are thinking through the UX I think. I don't have time to rethink it through all right now but I can rattle off some general principles:
There are maybe other things I've missed. :) |
Yay! 😃 Like you, I will also do some thinking on this and arrive at a cohesive list of things on how the UX should be by maybe tomorrow; adding to the ones you've already posted. Once that is finalised and agreed, I can actually take this up and start working on it, if that is OK with you. 👍 |
@balajisivaraman Yeah that would be great! This would be a significant contribution. It's a really nice feature to have in ripgrep. When thinking through the UX, keep in mind that "let's punt" is a totally valid answer. Keeping things simple and building on it later is a totally fine way to do things. The only downside here is that you need to be a little careful to leave yourself breathing room with respect to breaking changes. i.e., Adding a new flag for this and then removing that flag later is probably something we want to avoid (but adding no flags now and adding more flags later is OK!). (Not that I'm saying we should or shouldn't add flags for this, it's just an example.) |
Ok, I've thought about it a fair bit and here are my thoughts, partly summations of BurntSushi's:
I can't think of any more UX/code-maintencance/platform-support related issues, though I'm sure I will start discovering them as I begin working on the feature. If this sounds good to you, I can begin working on this right away. (I'll mostly put effort into this over nights and weekends.) |
I no longer have Windows to test this, but that sounds like a 32/64-bit mismatch. You should check Event Viewer for any errors in the Application section, or maybe try In any case, this should probably be left for the user to handle. |
Other compression-only formats to think about:
Of those, Compress and LZ4 are the most common i believe.
Only if they're supplied directly on the command line, i assume? Or in the debug messages maybe? It would be irksome if
I agree, i think it would make sense to have a flag that toggles compressed-file support as a whole on or off. That would surely be relevant no matter how the feature may be implemented in the future. (I suppose that users will still be able to ignore files with I wonder also about how I assume that scenario doesn't come up often, but a possibility to consider |
I was thinking the errors would be output when the user specifically asks to search in archive files using the new
One option to circumvent this issue is to use the magic crate suggested earlier in this thread, which is the equivalent of running the
Thanks for the suggestion. I'll cleanup my Windows installation of Rust and the binaries that I have to recheck this. |
@BurntSushi, I've begun working on this on a branch in my fork, going with the simpler file extension based detection instead of the fancier |
(It occurs to be that by "gracefully exit," you mean, "continue searching." Apologies if I misunderstood! But I'll leave the text below as I wrote it before I realize the potential misunderstanding.) This is definitely not OK. ripgrep must not stop working just because an external program isn't installed. If the binary isn't available, then there are two reasonable choices I can see:
My preference is (1). It's the most conservative change and is pretty consistent with ripgrep's general behavior of silently ignoring files that it thinks is irrelevant. In the future, we may of course elect to do (2).
This is good. Notably, I don't think this should be a debug message, but rather, an actual warning emitting to stderr. This is in contrast to how a missing binary is handled.
I would like to see this be a debug message. In particular, it's consistent with how we handle other files that are ignored.
I agree with adding this flag, and by implication, making decompression opt-in.
👍
There is definitely a potential performance impact here by shelling out to another process as opposed to using a library, but that is certainly part of the trade off of doing things this way. Otherwise, yeah, fully agreed. 👍
Aye. I think my recommendation here is to start with something that doesn't do anything platform specific. That is, even if the implementation is always incorrect on Windows, we should be covered by the aforementioned error handling.
That is interesting scenario! I think I'd suggest ignoring it for now, and fixing it if and when somebody files a bug for it. Then we can try and fix it from there. I'm somewhat more encouraged to say this is OK since this decompression behavior will be opt-in. @balajisivaraman I'd probably avoid |
@balajisivaraman Thanks so much for the detailed specification that you wrote! Truly a great example to follow. :-) |
@BurntSushi, Yeah, I meant that |
Repeat comment in #225, @BurntSushi @wavexx You can set environment variable Notice |
On Sun, Jan 28 2018, Liu Yubao wrote:
Repeat comment in #225, @BurntSushi @wavexx
You can set environment variable GREP to "rg", then use
zgrep/bzgrep/xzgrep -Hn to uncompress and search on
.gz/.bz2/.xz/.lzma/.lzo with rg, I feel this is somewhat enough :-)
I don't generally know what sort of compressed files I'm dealing with.
But setting this aside, this bypasses the directory walking mechanism in
rg itself which has larger implications.
|
Usually searching compressed files is only required to handle log files or some archive files, no need to support language types, gitignore etc, the bottleneck in combination of Certainly, it would be great if ripgrep can directly support compressed file and tar files, my workaround is just somewhat enough before tons of work to port and optimize those C decompression libraries to Rust. BTW, after so many years, GNU grep doesn't support compressed files and tar files too, it's a UNIX philosophy to combine and reuse existed little utilities. |
@Dieken Thanks for pointing out a clever work around. I'm sure some people will find it useful. Let's continue to push forward with #751 (which I still need to review), which is a nice intermediate step between "a hack using existing tools" and "port all the things to Rust."
Let's try to avoid this type of argumentation. Appealing to the UNIX philosophy kills debates instead of letting them thrive. Reality will dictate. The inherent nature of a tool that does recursive search is that it must violate the UNIX philosophy because it is combining multiple distinct tasks (directory walking and line oriented searching) into one. We create this coupling for UX and performance reasons, and there is no turning back now. (The irony is that, aesthetically speaking, I am someone that very much appreciates the UNIX philosophy but I simultaneously maintain a tool that unquestionably violates it. The key is that the UNIX philosophy is a means to an end, not an end itself.) |
This commit adds opt-in support for searching compressed files during recursive search. This behavior is only enabled when the `-z/--search-zip` flag is passed to ripgrep. When enabled, a limited set of common compression formats are recognized via file extension, and a new process is spawned to perform the decompression. ripgrep then searches the stdout of that spawned process. Closes #539
This commit adds opt-in support for searching compressed files during recursive search. This behavior is only enabled when the `-z/--search-zip` flag is passed to ripgrep. When enabled, a limited set of common compression formats are recognized via file extension, and a new process is spawned to perform the decompression. ripgrep then searches the stdout of that spawned process. Closes #539
It looks like the |
@BatmanAoD I very much doubt it's as easy as you claim, since zips are archives and not just decompression. Please move discussion to #1112 which is more similar to zip support than this ticket is. |
True. I hadn't realized Using So to me, at least, it seems reasonable to implement simple decompression without extraction, since it can be a useful feature on its own. This should be fairly easy to let users implement for themselves using |
I don't know where you got the impression that
The |
@ssokolow Simply because Andrew, does the fact that we already have an archive format being decompressed but not extracted by the |
@BurntSushi Since RipGrep already supports decompressing-but-not-extracting LZMA, would you object to the same behavior (at least for now) for |
I'd rather not, sorry. Please use |
@wavexx Suggests farming out decompression to established stream decompressors. I am somewhat attracted to this path since it will cover a lot of use cases, and at least on the face of things, seems relatively easy to implement. If this turns out to be a much larger project than I anticipated, then the trade off may no longer make sense. The intention is to support decompression on all supported platforms, so we don't want to invest a bunch of work in a Unix-only option.
I'm not sure when I personally will have time to work on this, but as long as it's easy to maintain, I'm not intrinsically opposed to farming this out to another process. Note that the core search code can work on any arbitrary thing that implements
io::Read
, which means the glue code required for this should be pretty limited. There's already an example of this at play for encoding support, where any supported encoding is first transcoded to UTF-8 before being searched. The scope of that addition was pretty much limited to implementingio::Read
and doesn't require touching any of the core search code.If someone else wanted to work on this, I'd be happy to mentor it. I haven't thought about it too much, but I think these are probably the high level steps one needs to take:
.tar.gz
(probably by skipping them at first, but maybe eventually trying to readtar
archives as well).io::Read
that is suitable to pass to theio::Read
searcher.gzip
command seamlessly? This will be important for eventually migrating to cross platform decompression support.stdin
. Callers can invoke their own decompressor in this case easily enough that we shouldn't do anything smart. (A future path might add a flag to force decompression of a specific format.)The text was updated successfully, but these errors were encountered: