io: optimizing the chances of large write in copy_bidirectional and copy #6532
Conversation
I think this change could make the benchmark worse in some cases. I tested using `$ cd benches && cargo bench --bench copy`:
copy_mem_to_slow_hdd time: [767.41 ms 770.31 ms 773.42 ms]
change: [+294.24% +298.58% +302.79%] (p = 0.00 < 0.05)
Performance has regressed.
copy_chunk_to_slow_hdd time: [789.88 ms 792.86 ms 795.96 ms]
change: [+425.55% +433.04% +440.10%] (p = 0.00 < 0.05)
Performance has regressed.
My question here is: once a flush has been started (that is, `poll_flush` has returned `Pending`), is it required that it be polled again until it completes? If so, I think this change is acceptable even if it worsens performance, because this PR is simply trying to do the right thing under that assumption.
Thank you for pointing this out, I had missed this benchmark when looking for it. The huge regression was due to the fact that I tried to flush even when no flush had been issued during a pending read. I've therefore made a few changes to that behavior, which reduce the regression to something more acceptable at the cost of an additional boolean in the copy state (a rough sketch of this flag-based approach follows the benchmark output below). The flushing futures are still completed properly with these changes.

copy_mem_to_mem time: [159.04 µs 161.96 µs 165.48 µs]
change: [-2.3046% +1.7540% +5.9405%] (p = 0.40 > 0.05)
No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
copy_mem_to_slow_hdd time: [181.27 ms 181.42 ms 181.60 ms]
change: [+0.0776% +0.1623% +0.2530%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
14 (14.00%) high severe
copy_chunk_to_mem time: [128.06 ms 128.10 ms 128.15 ms]
change: [-0.1713% -0.0813% -0.0033%] (p = 0.05 < 0.05)
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
copy_chunk_to_slow_hdd time: [176.05 ms 176.75 ms 177.42 ms]
change: [+16.939% +17.675% +18.433%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) low mild

As a nice side effect, it makes the flushing condition more readable, in my opinion.
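A minimal sketch of the flag-based idea described above, under the assumption that a boolean such as `need_flush` (the name is illustrative, not necessarily what tokio uses internally) records whether a previous write left data behind. The flag is only cleared once the flush actually completes, so a flush that returns `Pending` keeps being polled on later calls:

```rust
use std::io;
use std::pin::Pin;
use std::task::{ready, Context, Poll};

use tokio::io::AsyncWrite;

/// Illustrative helper: flush the writer while the read side is pending, but
/// only when a previous write buffered data (`need_flush`). If the flush
/// returns `Pending`, the flag stays set so the flush is polled again until
/// completion on a later call.
fn poll_flush_if_needed<W: AsyncWrite + ?Sized>(
    writer: Pin<&mut W>,
    cx: &mut Context<'_>,
    need_flush: &mut bool,
) -> Poll<io::Result<()>> {
    if *need_flush {
        ready!(writer.poll_flush(cx))?;
        *need_flush = false;
    }
    Poll::Ready(Ok(()))
}
```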
That's my understanding, or it is implementation-defined. Either way, I think it should probably be documented if possible. What is certain is that the flushing future needs to be polled properly until completion, which was not the case before. With the new changes, I think the small performance regression is no longer a big deal (I may be wrong, though). However, note that if we consider that …
I can't completely tell what's going on here, but it sounds like your … If your … The rules must be this way, because otherwise …
As discussed with @Darksonn in the last comments of #6519 (comment), I've adapted this PR to optimize a bit the chances of a large write in `copy_bidirectional` and `copy`.

The benefits are modest, but they might still be worth keeping (2 slight improvements out of 4 benchmarks, one more significant than the other):

copy_mem_to_mem time: [131.88 µs 133.09 µs 134.48 µs]
change: [-4.3908% -2.6619% -1.0456%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
copy_mem_to_slow_hdd time: [181.26 ms 181.30 ms 181.35 ms]
change: [-0.0183% +0.0132% +0.0477%] (p = 0.44 > 0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
copy_chunk_to_mem time: [128.11 ms 128.13 ms 128.16 ms]
change: [-0.0458% -0.0191% +0.0075%] (p = 0.16 > 0.05)
No change in performance detected.
copy_chunk_to_slow_hdd time: [141.29 ms 141.43 ms 141.59 ms]
change: [-5.1731% -4.7362% -4.2937%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) high mild
9 (9.00%) high severe

Although the gains on …
Looks good to me.
I'm having a bit of a hard time understanding what's going on. Can you explain which cases your change would improve?
Of course, sorry if I was unclear. I'm trying to optimize the cases where there is a lot of data to read and the reader is faster than the writer (even by a bit). In such cases, the writer might be pending, and the loop then tries to perform a poll_read. When the reader itself is pending too, poll_read will not be attempted again before the next write, which can lead to a lost chance of a larger write, because the reader might have become ready again by that point. I'll give you a concrete example that I've observed during my tests. Let's imagine that our buffer is 8192 bytes long, with the current implementation of the function.
The goal of this PR is to rework this workflow so that the pending read is retried, while there is free space in the buffer, before the next write attempt, giving that write a chance to be larger (see the sketch below).
The only drawback I can think of is that with my reworked implementation, even if there is only one free byte left in the buffer, a poll_read will still be performed, which could be inefficient if the read is indeed ready. The condition could be slightly modified to overcome this, for instance by requiring that the free space in the buffer be at least a certain size. Note that a pending poll_read only causes a flush or a return from the function when the buffer is empty, to match the current behaviour; it does not make sense to flush or return while we are trying to maximize the chances of a large write, since that could end in a deadlock or reduce performance instead of improving it.

Although some of the benchmarks provided with Tokio are only slightly faster, I noticed a significant improvement in a project of mine: with a TcpStream as a reader, a bounded mpsc Sender as a writer (encapsulated in a lightweight abstraction), and a bit of contention, my modifications improved throughput by about 10% with "only" a gigabyte of data transferred, and it seems it could be even better with more data. I hope it's more understandable now; otherwise I'll be happy to clarify further :)
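To make the intended change in control flow concrete, here is a rough, self-contained sketch of a copy loop reworked along those lines. It is illustrative only: the names (`SketchBuffer`), the buffer management, and the omission of the flushing logic discussed earlier are simplifications, not the actual tokio implementation.

```rust
use std::io;
use std::pin::Pin;
use std::task::{ready, Context, Poll};

use tokio::io::{AsyncRead, AsyncWrite, ReadBuf};

/// Illustrative copy state; not the actual tokio internals.
struct SketchBuffer {
    buf: Box<[u8]>,
    pos: usize, // start of the unwritten data
    cap: usize, // end of the unwritten data
    read_done: bool,
}

impl SketchBuffer {
    fn poll_copy<R, W>(
        &mut self,
        cx: &mut Context<'_>,
        mut reader: Pin<&mut R>,
        mut writer: Pin<&mut W>,
    ) -> Poll<io::Result<()>>
    where
        R: AsyncRead + ?Sized,
        W: AsyncWrite + ?Sized,
    {
        loop {
            // Top up the buffer whenever there is free space, even if an
            // earlier poll_read returned Pending: reading "more often" gives
            // the next write a chance to be larger.
            if self.cap < self.buf.len() && !self.read_done {
                if self.pos > 0 {
                    // Move leftover bytes to the front so reads append to them.
                    self.buf.copy_within(self.pos..self.cap, 0);
                    self.cap -= self.pos;
                    self.pos = 0;
                }
                let mut read_buf = ReadBuf::new(&mut self.buf[self.cap..]);
                match reader.as_mut().poll_read(cx, &mut read_buf) {
                    Poll::Ready(Ok(())) => {
                        let n = read_buf.filled().len();
                        if n == 0 {
                            self.read_done = true;
                        } else {
                            self.cap += n;
                            continue; // keep reading while it makes progress
                        }
                    }
                    Poll::Ready(Err(e)) => return Poll::Ready(Err(e)),
                    // Only park on a pending read when the buffer is empty;
                    // otherwise fall through and write what has been buffered.
                    Poll::Pending if self.pos == self.cap => return Poll::Pending,
                    Poll::Pending => {}
                }
            }

            // Drain whatever is buffered into the writer.
            while self.pos < self.cap {
                let n =
                    ready!(writer.as_mut().poll_write(cx, &self.buf[self.pos..self.cap]))?;
                if n == 0 {
                    return Poll::Ready(Err(io::ErrorKind::WriteZero.into()));
                }
                self.pos += n;
            }

            if self.read_done {
                ready!(writer.as_mut().poll_flush(cx))?;
                return Poll::Ready(Ok(()));
            }
        }
    }
}
```

The key difference from a "write first, then read" ordering is that every re-poll tries to refill the buffer before attempting the write again, so a writer that was pending can later receive one larger write instead of several small ones.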
Sorry for the delay. This looks okay to me.
…nal (#10608) We forked copy_bidirectional to solve some issues like fast-shutdown (disallowing half-open connections) and to introduce better error tracking (which side of the conn closed down). A change recently made its way upstream offering performance improvements: tokio-rs/tokio#6532. These seem applicable to our fork, thus it makes sense to apply them here as well.
Although this PR was originally related to #6519 (comment), it now aims to improve a bit the chances of a large write in `io::copy` and `io::copy_bidirectional`.

Motivation

`tokio::io::copy_bidirectional` and `tokio::io::copy` can, in some circumstances, lead to non-optimal chances of a large write when both writing and reading are pending (see here). For more information, refer to the issue linked above. This PR aims to improve this behavior by reading more often.
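For context, here is a minimal usage example of the user-facing functions this PR touches. The addresses are placeholders, and the example assumes a tokio build with the `net`, `io-util`, `rt-multi-thread`, and `macros` features enabled:

```rust
use tokio::io;
use tokio::net::TcpStream;

#[tokio::main]
async fn main() -> io::Result<()> {
    // Placeholder endpoints for illustration only.
    let mut inbound = TcpStream::connect("127.0.0.1:8080").await?;
    let mut outbound = TcpStream::connect("127.0.0.1:9090").await?;

    // Copy bytes in both directions until both sides reach EOF, returning
    // how many bytes were copied each way.
    let (a_to_b, b_to_a) = io::copy_bidirectional(&mut inbound, &mut outbound).await?;
    println!("{a_to_b} bytes one way, {b_to_a} bytes the other");
    Ok(())
}
```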
Solution
Changelog
Benchmarks