Return first file found and terminate #472

Closed
matthew-piziak opened this issue Aug 20, 2019 · 7 comments · Fixed by #555

Comments

@matthew-piziak

Does fd support an equivalent of `find PATH -name NAME -print -quit`, which finds the first match, prints it, and terminates?

@matthew-piziak
Author

I looked through some closed issues and found `fd --max-buffer-time=0 NAME PATH | head -n 1`, which takes 0.9s of real time compared to 0.5s with `find PATH -name NAME -print -quit`. Am I missing something?

@sharkdp
Owner

sharkdp commented Sep 13, 2019

Thank you for your feedback.

I managed to find a similar example on my filesystem where I could reproduce your results. I think the problem is that piping into head -n 1 doesn't necessarily immediately shut down the process.

As a demonstration, let's look at find first. I am using hyperfine for running the benchmarks:

hyperfine --warmup 3 \
  'find -iname "*flow.yaml"' \
  'find -iname "*flow.yaml" | head -n1' \
  'find -iname "*flow.yaml" -print -quit'

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `find -iname "*flow.yaml"` | 2.558 ± 0.023 | 2.523 | 2.597 | 21.7 |
| `find -iname "*flow.yaml" \| head -n1` | 2.576 ± 0.043 | 2.542 | 2.684 | 21.9 |
| `find -iname "*flow.yaml" -print -quit` | 0.118 ± 0.002 | 0.114 | 0.122 | 1.0 |

Notice how the variant with `| head -n 1` takes just as long. Apparently, `find` just keeps on running in case of a broken pipe (`head` closes its STDIN once the necessary number of lines has been read).

With fd, the results look slightly different (note that these are milliseconds, not seconds like above):

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `fd --max-buffer-time=0 flow.yaml` | 256.8 ± 2.8 | 253.9 | 263.0 | 1.3 |
| `fd --max-buffer-time=0 flow.yaml \| head -n 1` | 191.2 ± 3.5 | 184.4 | 196.6 | 1.0 |

The variant with `head -n 1` is slightly faster. However, when I run fd interactively, I can clearly see that it outputs the first result very quickly and only quits when it is about to print the second result(!). The reason is that this is the first point at which fd notices that its STDOUT pipe (= `head`'s STDIN) has been closed.

We can demonstrate a similar behavior by running:

(echo first; sleep 1; echo second; sleep 100; echo third) | head -n 1

This command runs for one second instead of quitting immediately.
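The same effect can be reproduced from Rust. The following is only a minimal sketch, assuming a Unix-like system with `head` on the `PATH`; the function name `write_after_reader_exits` is made up for illustration and is not part of fd. It shows that the writer only learns about the closed pipe when it attempts its next write.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Hypothetical helper (not fd code): pipes two lines into `head -n 1`
/// and returns the result of the second write.
fn write_after_reader_exits() -> io::Result<()> {
    let mut child = Command::new("head")
        .args(["-n", "1"])
        .stdin(Stdio::piped())
        .stdout(Stdio::null())
        .spawn()?;

    // Take ownership of the pipe so that `wait` below does not close it for us.
    let mut stdin = child.stdin.take().expect("stdin was requested as piped");

    // The first line is accepted: the reader is still alive.
    stdin.write_all(b"first\n")?;

    // `head -n 1` exits after reading one line, which closes the read end.
    child.wait()?;

    // Only now, on the next write, does the writer see the broken pipe.
    stdin.write_all(b"second\n")
}

fn main() {
    match write_after_reader_exits() {
        Ok(()) => println!("second write unexpectedly succeeded"),
        Err(e) => println!("second write failed: {:?}", e.kind()),
    }
}
```

Between the two writes, nothing in the writer changes, which is why a producer that is busy searching (or sleeping) keeps running until it has something new to print.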

To make sure that this is the actual problem with fd as well, I quickly changed the print_entry_uncolorized function to print an additional newline:

--- a/src/output.rs
+++ b/src/output.rs
@@ -90,5 +90,6 @@ fn print_entry_uncolorized(
     let separator = if config.null_separator { "\0" } else { "\n" };
 
     let path_str = path.to_string_lossy();
-    write!(stdout, "{}{}", path_str, separator)
+    write!(stdout, "{}{}", path_str, separator)?;
+    writeln!(stdout)
 }

With this small modification, fd is suddenly blazing fast (a factor of 10 faster than find instead of a factor of 1.6 slower):

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `fd --max-buffer-time=0 flow.yaml \| head -n1` | 11.3 ± 1.0 | 8.7 | 15.0 | 1.0 |

Now, this is obviously not something we want to implement in this way. If anybody has any good suggestions on how to "fix" this, please let us know. One potential way could be to test (however that works) if STDOUT has been closed after printing each result. However, it should be checked if this has any performance impact when not piping to head.

If there is no great solution, we should actually think about implementing a --max-results <count> option (see also #476).

@tavianator
Collaborator

> One potential way could be to test (however that works) if STDOUT has been closed after printing each result.

I believe you can attempt to write 0 bytes to stdout, and you'll get EPIPE back if the pipe is closed (and you're ignoring SIGPIPE like Rust does by default). It's probably not a good idea to do two write syscalls every time you print something though. And I think it's still racy, since if head hasn't finished reading the first line by the time you do the second write, it won't fail.

So maybe have a timer such that if the main thread hasn't received any files to print in a while, it writes 0 bytes to stdout and exits if that fails. Alternatively don't bother, since no other tool seems to.
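A rough sketch of that timer idea, not fd's actual code: the receiving loop waits with a timeout and only probes the output when it has been idle. The names `print_loop`, `Stop`, `emit` and `reader_gone` are hypothetical, and the probe is deliberately left as a caller-supplied closure because, as the follow-up comments in this thread note, how to detect a closed pipe depends on the platform.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why the loop ended (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Stop {
    /// All senders hung up: the search is complete.
    Finished,
    /// The probe reported that nobody is reading our output anymore.
    ReaderGone,
}

/// Receives results and hands them to `emit`. If nothing arrives within
/// `idle`, asks `reader_gone` whether the output has been closed.
fn print_loop<P, F>(rx: Receiver<String>, idle: Duration, mut emit: P, mut reader_gone: F) -> Stop
where
    P: FnMut(&str),
    F: FnMut() -> bool,
{
    loop {
        match rx.recv_timeout(idle) {
            Ok(path) => emit(&path),
            Err(RecvTimeoutError::Timeout) => {
                // Idle period: this is the only place where we pay for a probe.
                if reader_gone() {
                    return Stop::ReaderGone;
                }
            }
            Err(RecvTimeoutError::Disconnected) => return Stop::Finished,
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send("first".to_string()).unwrap();
    // `tx` stays alive, so after "first" the loop sees an idle period
    // rather than a disconnect. The probe here always answers "closed".
    let stop = print_loop(rx, Duration::from_millis(50), |p| println!("{}", p), || true);
    println!("{:?}", stop);
    drop(tx);
}
```

The design choice is that the probe never runs while results are flowing, so the common case (not piping into `head`) costs no extra syscalls per printed line.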

@tavianator
Collaborator

Correction: despite what StackOverflow said, writing 0 bytes to a closed pipe does not trigger EPIPE. I'm not sure there's a non-destructive way to find out if the other end of a pipe is closed.

@tavianator
Collaborator

There is a way, at least on Linux: https://stackoverflow.com/a/57959507/502399

On Windows, apparently the write-zero-bytes thing works.

sharkdp added a commit that referenced this issue Apr 2, 2020
This new option can be used instead of piping to `head -n <count>` for
improved performance:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `fd --max-buffer-time=0 flow.yaml` | 153.9 ± 2.5 | 151.3 | 170.3 | 4.21 ± 5.86 |
| `fd --max-buffer-time=0 flow.yaml \| head -n 1` | 145.3 ± 17.4 | 111.0 | 180.2 | 3.98 ± 5.55 |
| `fd --max-results=1 flow.yaml` | 36.5 ± 50.8 | 7.2 | 145.7 | 1.00 |

Note: there is a large standard deviation on the last result due to the
non-deterministic file system traversal. With `--max-results`, we don't
have to traverse the whole filesystem tree, so it's all about luck.

closes #472
closes #476
@sharkdp
Owner

sharkdp commented Apr 2, 2020

@tavianator Thank you very much for your analysis. I opted to implement --max-results=<count> because that seemed like a much cleaner way of solving this use case.

Please see #555 for benchmark results.
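To illustrate why a result limit sidesteps the pipe problem, here is a minimal sketch of the general idea. It is not taken from fd's source and makes simplifying assumptions: the function name `first_results` is hypothetical, and a plain vector stands in for the file system traversal. The consumer stops after `max_results` items and drops its end of the channel, so the producer finds out on its next `send` without any pipe being involved.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical helper: returns at most `max_results` entries from
/// `candidates`, which stand in for paths found by a directory walker.
fn first_results(candidates: Vec<String>, max_results: usize) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();

    // Producer thread: stands in for the file system traversal.
    let walker = thread::spawn(move || {
        for path in candidates {
            // `send` fails once the receiver has been dropped, which tells
            // the walker that no further results are wanted.
            if tx.send(path).is_err() {
                break;
            }
        }
    });

    // Consumer: take the first `max_results` items. The receiver is dropped
    // at the end of this statement, whether or not the walker is done.
    let results: Vec<String> = rx.into_iter().take(max_results).collect();

    walker.join().expect("walker thread panicked");
    results
}

fn main() {
    let candidates = vec!["a/flow.yaml".to_string(), "b/flow.yaml".to_string()];
    println!("{:?}", first_results(candidates, 1));
}
```

With a single producer the order is preserved; with several walker threads, which result arrives first would depend on scheduling, which matches the note about non-deterministic traversal in the commit message.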

@sharkdp sharkdp added this to the v8.0 milestone Apr 8, 2020
@sharkdp
Owner

sharkdp commented Apr 16, 2020

This has now been released in fd v8.0. We also have -1 as an alias for --max-results=1.


3 participants