runtime: eliminate unnecessary lfence in fn len
#7390
Conversation
I think
Pointer dereferencing generates more efficient LLVM IR because it better reflects the author's intent. Regarding the loading itself, I would expect LLVM to emit the same instructions for pointer dereferencing and |
Since the modification on head is sequenced-before the Release modification on tail for stealing the tasks, and the Release modification on tail synchronizes-with the Acquire load on tail, I think there is a benefit to using specialized implementations for Local and Steal.
I think there is room to optimise the implementation of Steal::len(), but I haven't thought much about that yet.
|
Thanks for reviewing.
If I understand right, I don't think this is intentional. The head and tail only increase and never decrease. The actual index is
|
4d4520e to 1f53f88
There are some uses of wrapping_add in the implementation.
I'm glad to discuss this topic deeply, but let's align our understanding first, and please correct me if I'm wrong. |
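To make the wrapping arithmetic under discussion concrete, here is a minimal sketch, not tokio's actual code. It assumes plain u32 counters that only ever increase (with wrap-around) and a hypothetical power-of-two CAPACITY; the names slot and len are illustrative only.

```rust
/// Hypothetical capacity for illustration; it must be a power of two so the mask works.
const CAPACITY: u32 = 256;
const MASK: u32 = CAPACITY - 1;

/// Maps a monotonically increasing position to a slot in the ring buffer.
fn slot(pos: u32) -> usize {
    (pos & MASK) as usize
}

/// Number of queued items. `wrapping_sub` keeps the result correct even
/// after `tail` has overflowed past `u32::MAX` while `head` has not.
fn len(head: u32, tail: u32) -> u32 {
    tail.wrapping_sub(head)
}
```

With head = u32::MAX - 1 and tail = head.wrapping_add(5), tail wraps to 3 while len(head, tail) is still 5, so a plain tail - head would underflow where the wrapping form does not.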
|
Oh, I was wrong. I didn’t consider using up all the indexes before (I just found we use Reconsidering all these things, using |
Does an
I think the following 4 code snippets are equivalent to each other for calculating the length of
head.load(Acquire), tail.load(Acquire)
head.load(Acquire), tail.load(Relaxed)
head.load(Relaxed), tail.load(Acquire)
head.load(Relaxed), tail.load(Relaxed)
|
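The four combinations can be written as one helper that takes the orderings as parameters. This is only a sketch: Indices is a simplified stand-in with two separate AtomicU32 fields, not tokio's real queue type, and len_with is a hypothetical name.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Simplified stand-in for the queue's shared indices (an assumption,
/// not tokio's actual layout).
struct Indices {
    head: AtomicU32,
    tail: AtomicU32,
}

impl Indices {
    /// Computes the length with caller-chosen orderings for the two loads.
    fn len_with(&self, head_order: Ordering, tail_order: Ordering) -> u32 {
        let head = self.head.load(head_order);
        let tail = self.tail.load(tail_order);
        tail.wrapping_sub(head)
    }
}
```

On a single thread all four combinations return the same value. The ordering arguments only affect which writes by other threads are guaranteed to be visible, so this sketch does not by itself settle whether the variants are equivalent under concurrency.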
No. What I meant was:
But after reconsidering your reply, I think the latter will never happen.
I think you are right. If I understand correctly, what you mean is that since every Release store on head is sequenced-before an Acquire load on tail and vice versa, no thread could at any moment see head unexpectedly larger than tail, which makes every implementation of the length calculation safe. In conclusion, is changing the implementation to two Relaxed loads the best solution?
Sorry about that. I squashed the commits because I thought they were not steps of one solution but totally different solutions. I will try to think more before pushing commits next time. |
|
The tokio/tokio/src/runtime/scheduler/multi_thread/worker.rs Lines 1159 to 1163 in 714e5b5
This means that when
Now the question is, assuming that no other worker attempts to steal from A, will worker C succeed in stealing the task in A? If the load of
Now, without the acquire/release, it may be the case that worker C thinks that A has not yet finished pushing the task to its local queue, and therefore the task is not yet available to be stolen. Whether this can happen in practice, I don't know, but it seems best to keep the acquire/release. |
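The guarantee that acquire/release provides here can be illustrated with a small two-thread sketch. It is not the tokio queue: slot stands in for the task slot, tail for the published index, and publish_then_observe is a hypothetical name chosen for this example.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Returns the value the "stealer" thread reads from the slot after it
/// has observed the published tail.
fn publish_then_observe() -> usize {
    let slot = Arc::new(AtomicUsize::new(0)); // stands in for the task slot
    let tail = Arc::new(AtomicU32::new(0));

    let producer = {
        let (slot, tail) = (Arc::clone(&slot), Arc::clone(&tail));
        thread::spawn(move || {
            slot.store(42, Ordering::Relaxed); // write the "task"
            tail.store(1, Ordering::Release); // publish it
        })
    };

    let stealer = {
        let (slot, tail) = (Arc::clone(&slot), Arc::clone(&tail));
        thread::spawn(move || {
            // Spin until the published tail becomes visible.
            while tail.load(Ordering::Acquire) == 0 {
                std::hint::spin_loop();
            }
            // The Acquire load that read 1 synchronizes-with the Release
            // store, so the earlier write to `slot` is visible here.
            slot.load(Ordering::Relaxed)
        })
    };

    producer.join().unwrap();
    stealer.join().unwrap()
}
```

Once the stealer sees tail == 1 through the Acquire load, it is guaranteed to read 42 from the slot. If both tail accesses were Relaxed, the language would no longer promise that the slot write is visible at that point, which is the kind of uncertainty the comment above prefers to avoid.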
|
@Darksonn Thanks for your review. I believe you review PRs at a more holistic level, which helps to accurately determine downstream impacts. I just drew a diagram to help me understand your comment; please correct me if I'm wrong. I hope it also helps @Toby-Shi-cloud. |
|
Thanks for the reviews. The diagram is great for understanding the logic. But I still have one question about tokio/tokio/src/runtime/scheduler/multi_thread/queue.rs Lines 463 to 469 in 933fa49
Since thread A synchronizes with line 469, let src_tail = self.0.tail.load(Acquire), is there any case in which worker C thinks that A has not yet finished pushing the task to its local queue? I think that, in this diagram, thread A always happens-before C and thread B always happens-before C, so there is no problem in C. I think the problem may be in thread B: it may read out-of-date values of head and tail, causing it to wake up C by mistake or to fail to wake up C. Please correct me if I'm wrong. |
Worker A's modification on
Sorry, I don't understand what "fail to wake up C" means; could you explain it more? Regarding "wake up C by mistake", consider this scenario: let's say there are only three workers.
Of course, we could execute |
|
I've thought more about this. It's possible that this change is correct, but if it is, then I think the reasons are too complex. I would prefer to keep the
Regardless, thanks for taking the time to submit a PR. |
|
@ADD-SP @Darksonn Thanks very much for reviewing and explaining the details these days.
I agree with this, so I probably won't push this PR any further. Thanks for taking your time to review my PR.
I meant that if we used |

Motivation
Inspired by #7340. Explained in #7385.
Solution
Relaxed load for tail
As explained in #7385, marking the load of head as Relaxed would need an additional comparison. Therefore, I only mark the load of tail as Relaxed.
Merge implementations of Local<T>::len and Steal<T>::len
I cannot see any difference between a direct pointer dereference and an atomic Relaxed load, since the atomic Acquire load already prevents compiler optimization. Therefore, I merged these two implementations.
If I misunderstood something, please let me know.