Generic decoder program rather than/in addition to extension-based decompression #978
Comments
This is an interesting request! Thanks for the detailed description! I can definitely see how this could come in handy, and given its generality, I think I'd be comfortable with something like this in ripgrep. If someone wanted to add this, I'd be happy to work through most of the details in a PR. I do have a few constraints/concerns:
Totally fair point, binary detection-wise. Oh well. :-) Environment variable vs. command option for activation and various namings are certainly just a matter of taste. Exporting a (local to/interpreted by all children) environment variable like An environment variable for the child is a simpler interface (IMO). What I do with my little static C preprocessor is optionally take the name of an environment variable which is then the name of the input file. So, all you need is a Also, to not mandate an extra The process creation/file descriptor handling I use in C is:

    int fds[2];
    if (pipe (fds) == -1) die ("cannot create a pipe");
    switch (pid = vfork ()) {       /* vfork much faster but MUST exec|_exit */
    case 0:                         /* child */
        dup2 (desc, 0);             /* stdin = desc on file itself */
        dup2 (fds[1], 1);           /* stdout = write end of pipe [1] */
        close (fds[1]);             /* do not need extra handle on pipe */
        close (fds[0]);             /* close read end of pipe [0] */
        if (filename)
            setenv ("RIPGREP_INPUT", filename, 1);
        execvp (preproc[0], preproc);   /* become a preprocessor */
        fprintf (stderr, "%s: %s\n", preproc[0], strerror (errno));
        _exit (1);
    default:                        /* parent */
        close (fds[1]);             /* close write end of pipe[1] */
        close (desc);               /* close desc on file itself */
        desc = fds[0];              /* replace desc value with read end of pipe [0] */
        break;
    case -1:                        /* vfork failed => resource exhaustion => die */
        die ("cannot vfork");
    }

The only "out of band" set up for that is that Beyond this you just need the option parsing and reaping the children the same way as however the current In truth, I know very little Rust and I only just discovered
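For readers who want to try the fragment above on its own, here is a minimal self-contained sketch of the same idea, including the child reaping that the comment mentions. It is an illustration under assumptions, not the commenter's actual program: the helper names `open_via_preprocessor` and `reap_preprocessor` are hypothetical, `RIPGREP_INPUT` is the variable name used in the fragment above, and plain `fork` is used instead of `vfork` because the child calls `dup2` and `setenv` before `exec`, which POSIX leaves undefined after `vfork`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run preproc[] with stdin = desc (the opened file) and
 * return the read end of a pipe carrying its stdout. Returns -1 on failure.
 * On success, desc is closed and the child's pid is stored in *pid_out. */
int open_via_preprocessor(int desc, char *const preproc[],
                          const char *filename, pid_t *pid_out)
{
    int fds[2];
    pid_t pid;

    assert(preproc != NULL && preproc[0] != NULL && pid_out != NULL);
    if (pipe(fds) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {                 /* resource exhaustion */
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                  /* child */
        dup2(desc, 0);               /* stdin = the file itself */
        dup2(fds[1], 1);             /* stdout = write end of pipe */
        close(fds[0]);
        close(fds[1]);
        if (filename != NULL)        /* path for decoders that need one */
            setenv("RIPGREP_INPUT", filename, 1);
        execvp(preproc[0], preproc); /* become the preprocessor */
        perror(preproc[0]);
        _exit(127);
    }
    close(fds[1]);                   /* parent keeps only the read end */
    close(desc);
    *pid_out = pid;
    return fds[0];
}

/* Hypothetical helper: wait for the child started above. Returns its exit
 * status, or -1 if it could not be reaped or did not exit normally. */
int reap_preprocessor(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would read from the returned descriptor exactly as it would have read from the file, then call `reap_preprocessor` after closing it so that no zombie processes accumulate during a recursive search.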
I'm trying to follow everything you're saying, but you appear to be very deep in the weeds. :-) Which is a good thing, but it's hard for me to follow. In particular, talking about
That's it. No parsing of shell commands. No environment variables.
Indeed, but not just for backward compatibility. The generality of your proposal comes with a cost: the need for another program. There are giant wins to be had for supporting common cases by default, which is ultimately the point of the ultra specific
Ah. Sorry. I never program on Windows, but I wanted to be specific about file descriptor operations since you seemed concerned about that complexity. The problem with your third bullet is that the preprocessor is what classifies the file and dispatches. Hence the child knows whether the decoder program needs or does not need a pathname. So, If you want to simplify that protocol then it wouldn't be so bad to always just pass one argument, the path. Many users would probably use that argument -- racing unnecessarily (though the And, yeah,
I don't know about Rust interfaces for spawning kids, but from a "less-Unixy" point of view, what you need is to either A) regular mode: open a file, or B) preprocess mode: open a pipe to a child, with the child's stdin connected to a file and the child's stdout the replacement for the file you would have used in A). That really shouldn't be hard.
@c-blake Yeah, Rust's standard library has a decent high-level cross-platform API for that sort of stuff. I'm honestly not that worried about the extra forks here, nor am I worried about the racing. I am more than OK with saying that if you need to care about those sorts of things, then you should just go out and write your own program, which will be a feasible thing Real Soon Now. :-)
Looking at your Whether you want to pretty-ify/generalize your Rust code for shelling out to either a decompressor or a general decoder... Well, that is another question. Compared to just adding a new command option and some new code to handle it, that is a bigger change, probably one best left to the primary author. (Or author of a whole new program using
@c-blake Understood. If and when I get to refactoring that code, I'll see about adding this functionality. But I have no idea when that will happen.
Ok. That's totally fair. If and when I learn enough Rust to do a good PR, I'll just add a new CLI option, always both pass the 1 path arg to the kid and attach But someone (anyone?) else might well get to it first. I bet the simple version of this is like a 15 minute job for experienced Rustaceans, or at least less than the time to read this thread, most likely. Maybe more with proper testing. :-)
I was just looking for a way to get The only problem I can think of is printing the page numbers: pdfgrep prints them the same as Searching a whole directory structure of PDF files (esp. lecture slides) is something I do all the time, and it would be great if ripgrep would support this :). Any kind of delay from executing processes is irrelevant here, since just parsing the PDF usually already takes 300ms+.
Actually, never mind: line number mappings can fairly easily be accomplished by just prefixing every line with the current page number of the PDF document.
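As an illustration of the page-prefixing idea above, the following is a minimal sketch of a filter that could sit after a PDF text extractor. It assumes the extractor separates pages with a form feed character (`\f`), as `pdftotext` does; the function name `prefix_pages` and the `page:` prefix format are hypothetical choices for this example, not something from the thread.

```c
#include <assert.h>
#include <stdio.h>

/* Copy `in` to `out`, prefixing every line with "<page>:". The page counter
 * starts at 1 and is incremented on each form feed, which is assumed to be
 * the page separator emitted by the text extractor. */
void prefix_pages(FILE *in, FILE *out)
{
    int page = 1;
    int at_line_start = 1;
    int c;

    assert(in != NULL && out != NULL);
    while ((c = getc(in)) != EOF) {
        if (c == '\f') {              /* page break: do not copy it */
            page++;
            if (!at_line_start) {     /* end an unterminated last line */
                putc('\n', out);
                at_line_start = 1;
            }
            continue;
        }
        if (at_line_start) {          /* first character of a new line */
            fprintf(out, "%d:", page);
            at_line_start = 0;
        }
        putc(c, out);
        if (c == '\n')
            at_line_start = 1;
    }
}
```

Wrapped in a one-line `main` that calls `prefix_pages(stdin, stdout)`, this becomes a filter for the end of a decoder pipeline, so that each matching line already carries its page number.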
What version of ripgrep are you using?
ripgrep 0.8.1
-SIMD -AVX
How did you install ripgrep?
emerge sys-apps/ripgrep
What operating system are you using ripgrep on?
Gentoo Linux (kernel 4.17.5); Rust version 1.27.0, but none of these questions should be relevant to this feature suggestion/request.
Describe your question, feature request, or bug.
In a recursive searching problem setting, one often has many types of (usually binary) files whose "decoded" text one would like to search. Here decoding is something that may even be idiosyncratic to a specific user on a specific system. For example, these may not only be compressed files; other encodings are possible/common, such as PDF files where a text layer is extractable via a program like pdftotext in the poppler package, mutool draw, or some similar tool.

ripgrep already does a better job than most greps here, but it could be much better. Besides an ever growing/ultimately unknowable set of desired encodings/translations, file name extensions are not the most reliable way to detect file types. The file program or its libmagic are more complete, but quite slow, and even they may omit classes of file a user wants to search. Further, the transformation that counts as decoding could be idiosyncratic. I give a concrete example with unidiffs below.

In light of both of the above considerations, the most flexible/robust approach is akin to what less does with its LESSOPEN environment variable: allow a user-specified program to both classify and decode files that ripgrep cannot. The downside of this relative to the current ripgrep filename-extension-oriented approach is a "double or triple exec" cost, but there are many upsides -- total generality, no shared library dependencies and so on. { Many might not even pay attention to these extra fork/exec costs. For example, many programs use system(3) rather than fork and exec, which already incurs an extra exec. These days gunzip is even a shell script wrapper causing an extra exec of /bin/sh. But in a recursive grep context the costs surely add up. }

To be concrete, consider something like this script which a hypothetical new GREP_OPEN or RIPGREP_OPEN variable could point to:

Another possibility using file is:

where this second example assumes a few things -- namely, ripgrep sets up stdin for the forked child and file - will do an lseek backward post classification so that pzstd sees the full input. Also, it assumes a pzstd program is installed and some script/sed script/whatever program to, just as an example, strip the context from a unidiff so that ripgrep only searches through actually changing text. Note as part of this example that people often don't just name unified diff output foo.patch, but sometimes foo.diff or other things. The file type is almost always recognized from contents, not filename extension. The regular file does not do the seek, by the way, but a trivial backward compatible patch repairs that situation:

The /bin/sh-case-file approach can be a little slow, mostly because file takes about 1 ms per file, which can be a lot if files are small and IO is fast. I personally have a little statically linked C program I use for this which looks at magic numbers and re-execs the appropriate decoder/transformer, all with more like 50 microseconds of overhead (plus decoding/transforming time, of course). Of course, you also want all your decoding C programs to be statically linked for minimum overhead. In point of fact, a static C dispatcher + static decoder already have much lower overhead than the current likely deployment of ripgrep doing filename extension dispatch -> dynamically linked decoders.

Anyway, the actual work to implement this in ripgrep is probably quite minor and the result is something much more general/powerful than the statically encoded list of filename extensions. I have a personally patched GNU grep that does this and it is very useful to search through PDF collections of papers in various subdirectories and so on, just as one example use case. I'm sure there are many more -- limited only by one's imagination, and the diversity/entropy of one's collections of files. :-)

It might be nice as an optimization to allow the "binary file" detection to trigger use of RIPGREP_OPEN so that it is only invoked when actually needed, although that would block the use case of transforming ordinary text files such as the unidiff patch file example.