-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sync: mpsc performance optimization avoid false sharing #5829
Conversation
The origin benchmark result: $ cargo bench --bench sync_mpsc running 10 tests test contention_bounded ... bench: 1,008,359 ns/iter (+/- 412,814) test contention_bounded_full ... bench: 1,427,243 ns/iter (+/- 500,287) test contention_unbounded ... bench: 845,013 ns/iter (+/- 394,673) test create_100_000_medium ... bench: 182 ns/iter (+/- 1) test create_100_medium ... bench: 182 ns/iter (+/- 1) test create_1_medium ... bench: 181 ns/iter (+/- 2) test send_large ... bench: 16,525 ns/iter (+/- 329) test send_medium ... bench: 628 ns/iter (+/- 5) test uncontented_bounded ... bench: 478,514 ns/iter (+/- 1,923) test uncontented_unbounded ... bench: 303,990 ns/iter (+/- 1,607) test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured The current benchmark result: $ cargo bench --bench sync_mpsc running 10 tests test contention_bounded ... bench: 606,516 ns/iter (+/- 402,326) test contention_bounded_full ... bench: 727,239 ns/iter (+/- 340,756) test contention_unbounded ... bench: 760,523 ns/iter (+/- 482,628) test create_100_000_medium ... bench: 315 ns/iter (+/- 5) test create_100_medium ... bench: 317 ns/iter (+/- 6) test create_1_medium ... bench: 315 ns/iter (+/- 5) test send_large ... bench: 16,166 ns/iter (+/- 516) test send_medium ... bench: 695 ns/iter (+/- 6) test uncontented_bounded ... bench: 456,975 ns/iter (+/- 18,969) test uncontented_unbounded ... bench: 306,282 ns/iter (+/- 3,058) test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured
What system are you running the benchmark on? |
The unfortunate issue with this change is it makes creating a channel much more expensive for "one shot" operations. |
Could you experiment w/ only padding 64 instead of 128. Does it make a difference? |
On both Linux and Darwin. |
The benchmark difference is almost the same on both systems. |
I known, but this may not the first of all issue. Because of zero cost in rust, in oneshot scene users may use oneshot instead. The external cost may accepted, though I would like to found how it happen. |
Which size to pad, it depends on the which cpu you run. The size of cache line is designed by hardware. On my mac, the cache line is If the
But when it use
And on my linux (which cache line size is
Benchmark performance results might not be stable enough, but aligning the some fields in struct according to the size of the cache line may have better performance in the most benchmark test cases. |
The reason why creating The source code: fn get_s() -> Box<S> {
Box::new(S { i: 0, j: 0 })
}
struct S {
i: usize,
j: usize,
}
#[repr(align(128))]
struct Align{
i:usize,
j:usize,
}
fn get_align()-> Box::<Align> {
Box::new(Align { i: 0, j:0 })
}
fn main(){
let s = get_s();
let a = get_align();
} The assembly code of two function
As we can see, the funciton You can see more detail about the above assembly code in https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=7ddf997fbb5e23ceaf4a9ab485f01771 |
I'm not getting a memcpy in the playground when I try it.
|
Sorry for that, it is a mistake, we should assemble the above code in the release mode. |
This MR has has been at a standstill for a long time, what is the current progress? |
There are some CI failures. Please ensure that CI passes. (And you probably want to merge in master so that you are up-to-date with CI changes.) |
The current CI error is
And another error is the same. |
If the |
The CI passed. |
I'm ok with moving forward with this, but it is worth to consider whether this is the best way to pad the fields. Did you try any other configurations? Why did you pick this one in particular? |
I believe there are three questions, if I do not misunderstand:
(1) Why choose I believe the And here are some historical discussions on it in crossbeam:
The mpsc in crossbeam is merged into I belive the best size of Using Although no detailed testing of false sharing can be provided, the performance is indeed improved, and the std library of high quality and can be trusted. (2) Why just choose Using (3) Why wrap both field Which field to choose to be wrapped in For the use case of
Theoretically, fileds In my test ( Original version benchmark :
Only
We only focus on the results of test contention_bounded, test contention_bounded_full and test contention_unbounded, we can find out that: The performance of
Making |
The other sync tools in I initially wrote a benchmark test:https://github.com/wathenjiang/tokio-2.8.2-coderead/blob/bfe26af3b56fe8eac81b6087829eb4c17c88fb49/benches/sync_broadcast.rs#L1-L46 And apply The following is a performance comparison test before and after applying Before
after
An increase of approximately 11%. |
It may be worth testing on a few different systems here. I'd like to see:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making
semaphore
wrapped inCachePadded
improves performance sightly, and reduces performance ofcreate
slightly. It made me hesitant to wrapsemaphore
inCachePadded
. This can be discussed further.
Okay, it sounds like the fields you used in the PR are close to optimal. That's good enough for me. If wrapping semaphore
doesn't give much additional advantage, then lets keep it non-wrapped.
The other sync tools in
tokio::sync
may could benefit from applyingCachePadded
. For example, thetokio::sync::broadcast
.
I don't think it makes sense to wrap buffer
in CachePadded
. The field is an immutable pointer. The things it points at are modified, but that's in a different cacheline. Let's not do broadcast for now.
It may be worth testing on a few different systems here.
Sure, that would be interesting. But I don't want to block this PR over that.
Let me know if you want to do any more testing, or if you just want me to merge this.
@Noah-Kennedy @Darksonn @carllerche The above tests are on Intel( The following test is on M1 mac, whose CPU architecture is arrch64. The test result tells the same result. Original version:
tx and rx_waker are wrapped in CachePadded:
The performance of
The following test is on Windows( Original version:
tx and rx_waker are wrapped in CachePadded:
The performance of
I believe the benchmark tests should be enough for now, so please merge this. |
Thank you for the PR! |
Motivation
Cache line optimization on mpsc.
Solution
Cacheline optimization on mpsc by using
CachePadded
.Benchmark
The origin benchmark result:
The current benchmark result: