
Conversation

Toby-Shi-cloud (Author) opened this pull request:

Motivation

Inspired by #7340. Explained in #7385.

As the values returned from Steal::len and Local::len cannot be used for indexing, the loads of head and tail in these functions can be marked Relaxed (but not unsync_load for Steal). After all, no matter which memory ordering we use, the returned value is an approximate size, not the actual size.

Solution

Relaxed load for tail

As explained in #7385, marking the load of head as Relaxed would require an additional comparison. Therefore, I only mark the load of tail as Relaxed.

If we marked the loads of both head and tail as Relaxed, it would be possible to observe head larger than tail (for Steal::len); marking only the load of tail as Relaxed is safe.

Merge implementations of Local<T>::len and Steal<T>::len

I cannot see any difference between a direct pointer dereference and an atomic Relaxed load, since the Acquire load already prevents the relevant compiler optimizations. Therefore, I merged the two implementations.
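For illustration, here is a minimal sketch of what the merged, approximate len could look like under the proposed orderings. The field names and layout are simplified (tokio's real Steal additionally packs a "steal" index into head), so this is not the PR's literal diff:

```rust
use std::sync::atomic::{
    AtomicU32,
    Ordering::{Acquire, Relaxed},
};

struct Queue {
    head: AtomicU32, // advanced by stealers via CAS
    tail: AtomicU32, // advanced by the owner via Release stores
}

impl Queue {
    // Approximate length: Acquire on `head`, Relaxed on `tail`.
    // The result is never used for indexing, so a stale `tail`
    // only makes the estimate slightly out of date.
    fn len(&self) -> u32 {
        let head = self.head.load(Acquire);
        let tail = self.tail.load(Relaxed);
        tail.wrapping_sub(head)
    }
}
```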

If I misunderstood something, please let me know.

The github-actions bot added the R-loom-multi-thread (Run loom multi-thread tests on this PR) label on Jun 6, 2025.
ADD-SP (Member) commented on Jun 6, 2025:

> it's possible to see head larger than tail

I think head > tail is intentional, given that this queue is essentially a ring buffer.

> I cannot see any difference between using a direct pointer dereference and an atomic Relaxed load

Pointer dereferencing generates more efficient LLVM IR because it better reflects the author's intent.

Regarding the loading itself, I would expect LLVM to emit the same instructions for pointer dereferencing and Relaxed loading on Arm and x64. However, Relaxed loading prevents the compiler from performing certain optimisations, such as memory coalescing.

ADD-SP (Member) left a review comment:


Since the modification of head is sequenced-before the Release store to tail when stealing tasks, and the Release store to tail synchronizes-with the Acquire load of tail, I think there is a benefit to using specialized implementations for Local and Steal.

I think there is room to optimise the implementation of Steal::len(), but I haven't thought much about that yet.

Toby-Shi-cloud (Author) commented:

Thanks for reviewing.

> I think head > tail is intentional, given that this queue is essentially a ring buffer.

If I understand correctly, I don't think this is intentional. The head and tail only increase and never decrease; the actual index is head & mask or tail & mask. head > tail should never happen in Local<T>, but it may happen under a particular synchronization order if we use Relaxed. Consider the following example (a loom sketch of it follows the list):

  1. The owner thread A increased tail.
  2. Thread B stole data and increased head.
  3. It's possible for thread C to only see the update written by B while not seeing the update written by A.
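
A hypothetical loom model of this interleaving (not code from the PR): loom exhaustively explores the interleavings the C11 memory model allows, so running it checks whether thread C can really observe B's update without A's when both of C's loads are Relaxed.

```rust
use loom::sync::atomic::{
    AtomicUsize,
    Ordering::{Acquire, Relaxed, Release},
};
use loom::sync::Arc;
use loom::thread;

fn main() {
    loom::model(|| {
        let head = Arc::new(AtomicUsize::new(0));
        let tail = Arc::new(AtomicUsize::new(0));

        // Thread A: the owner publishes a task.
        let t = Arc::clone(&tail);
        let a = thread::spawn(move || t.store(1, Release));

        // Thread B: a stealer observes the task and advances `head`.
        let (h, t) = (Arc::clone(&head), Arc::clone(&tail));
        let b = thread::spawn(move || {
            if t.load(Acquire) == 1 {
                h.store(1, Release);
            }
        });

        // Thread C: reads both counters with Relaxed loads.
        let (h, t) = (Arc::clone(&head), Arc::clone(&tail));
        let c = thread::spawn(move || {
            let head = h.load(Relaxed);
            let tail = t.load(Relaxed);
            // If loom finds an execution where this fails, C saw
            // B's update to `head` without A's update to `tail`.
            assert!(head <= tail, "head={head}, tail={tail}");
        });

        a.join().unwrap();
        b.join().unwrap();
        c.join().unwrap();
    });
}
```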

> However, Relaxed loading prevents the compiler from performing certain optimisations, such as memory coalescing.

I thought the Acquire atomic load already prevents those compiler optimizations, but I am not an expert on compilers or LLVM. Maybe it's a good idea to keep the pointer dereferencing as-is.

Toby-Shi-cloud force-pushed the master branch 2 times, most recently from 4d4520e to 1f53f88, on June 7, 2025.
ADD-SP (Member) commented on Jun 7, 2025:

> The head and tail only increase and never decrease.

There are some usages of wrapping_add in the implementation.

I'm glad to discuss this topic in depth, but let's align our understanding first; please correct me if I'm wrong.

Toby-Shi-cloud (Author) commented:

Oh, I was wrong. I hadn't considered the indexes being used up (I just found that we use u16 on 32-bit machines, and even u32 could run out sometimes). Sorry about that.

Reconsidering all of this, using a Relaxed load on head may be unacceptable, since we would have no way of knowing whether head > tail is due to a synchronization-order issue or is intentional wraparound.

ADD-SP left a comment that was later marked as outdated.

ADD-SP (Member) commented on Jun 8, 2025:

> 1. The owner thread A increased tail.
> 2. Thread B stole data and increased head.
> 3. It's possible for thread C to only see the update written by B while not seeing the update written by A.

Does an Acquire load on head in thread C ensure the Release store on head in thread B is visible in thread C 👀 ?


I think the following four code snippets are equivalent to each other for calculating the length of Steal<T>, because only SeqCst can establish a single total order:

```rust
// 1
head.load(Acquire);
tail.load(Acquire);
// 2
head.load(Acquire);
tail.load(Relaxed);
// 3
head.load(Relaxed);
tail.load(Acquire);
// 4
head.load(Relaxed);
tail.load(Relaxed);
```

Darksonn added the A-tokio (Area: The main tokio crate) and M-runtime (Module: tokio/runtime) labels on Jun 8, 2025.
Toby-Shi-cloud (Author) commented:

> Does an Acquire load on head in thread C ensure the Release store on head in thread B is visible in thread C 👀?

No. What I meant was:

  • With an Acquire load, if thread C reads the same value of head that was stored by thread B, it is guaranteed that the store to tail in thread A is visible in thread C.
  • With only Relaxed loads, no memory ordering is guaranteed, so I think it's possible that tail's update is invisible while head's is visible in thread C.

But after reconsidering your reply, I think the latter will never happen.

> I think the following four code snippets are equivalent to each other for calculating the length of Steal<T>, because only SeqCst can establish a single total order.

I think you are right. If I understand correctly, you mean that since every Release store on head is sequenced-before an Acquire load on tail and vice versa, no thread at any moment could observe head unexpectedly larger than tail, making every implementation of the length calculation safe.

As a conclusion, is changing the implementation to two Relaxed loads the best solution?

> By the way, if you open a larger PR in the future, try to avoid squashing commits during the review process, because it removes the history and makes the PR hard to follow. For a small PR like this it doesn't matter.

Sorry about that. I squashed the commits because I thought they were not steps of the solution but entirely different solutions. I will think more before pushing commits next time.

Darksonn (Contributor) commented on Jun 9, 2025:

The len method is used through is_empty here:

```rust
fn notify_if_work_pending(&self) {
    for remote in &self.shared.remotes[..] {
        if !remote.steal.is_empty() {
            self.notify_parked_local();
            return;
        }
    }
}
```

This means that when head != tail, we wake up a worker so that it can steal tasks. Consider this scenario:

  1. Worker A is pushing a task to its local queue.
  2. Worker A performs a release store to tail.
  3. Worker B gets false from self.shared.remotes[A].steal.is_empty() indicating that a value can be stolen in A.
  4. Worker B calls unpark on worker C.
  5. Worker C wakes up and attempts to steal from A.

Now the question is: assuming that no other worker attempts to steal from A, will worker C succeed in stealing the task in A? If the load of tail is Acquire, then the answer is clearly yes: worker A writing to tail happens-before worker B reading from tail (because of acquire/release), which happens-before worker B calling unpark, which happens-before worker C stealing.

Now, without the acquire/release, it may be the case that worker C thinks that A has not yet finished pushing the task to its local queue, and therefore the task is not yet available to be stolen.

Whether this can happen in practice, I don't know, but it seems best to keep the acquire/release.
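
To make the chain above concrete, here is a minimal hypothetical sketch with std primitives (none of these names come from tokio; thread::spawn stands in for unpark, since both establish a happens-before edge):

```rust
use std::sync::atomic::{
    AtomicU32,
    Ordering::{Acquire, Relaxed, Release},
};
use std::sync::Arc;
use std::thread;

fn main() {
    let tail = Arc::new(AtomicU32::new(0));

    // Worker A: pushes a task and publishes it with a Release store.
    let t = Arc::clone(&tail);
    let a = thread::spawn(move || t.store(1, Release));

    // Worker B: the is_empty() check; the Acquire load synchronizes-with
    // A's Release store, then B "unparks" C.
    let t = Arc::clone(&tail);
    let b = thread::spawn(move || {
        while t.load(Acquire) == 0 {} // spin until B observes the push
        let t2 = Arc::clone(&t);
        // Worker C: because it starts after B observed the push, it is
        // guaranteed to see A's store, even with a Relaxed load here.
        thread::spawn(move || assert_eq!(t2.load(Relaxed), 1))
            .join()
            .unwrap();
    });

    a.join().unwrap();
    b.join().unwrap();
}
```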

ADD-SP (Member) commented on Jun 9, 2025:

@Darksonn Thanks for your review; I believe you review PRs at a more holistic level, which helps to accurately determine downstream impacts.

I just drew a diagram to help me understand your comment; please correct me if I'm wrong. I hope it also helps @Toby-Shi-cloud.

[diagram: the happens-before chain between workers A, B, and C around the run queue]

Toby-Shi-cloud (Author) commented:

Thanks for the reviews. The diagram is great for understanding the logic better. But I still have one question about steal_into2:

```rust
fn steal_into2(&self, dst: &mut Local<T>, dst_tail: UnsignedShort) -> UnsignedShort {
    let mut prev_packed = self.0.head.load(Acquire);
    let mut next_packed;

    let n = loop {
        let (src_head_steal, src_head_real) = unpack(prev_packed);
        let src_tail = self.0.tail.load(Acquire);
        // ...
```

Since thread A synchronizes-with line 469 (let src_tail = self.0.tail.load(Acquire)), is there any case where worker C thinks that A has not yet finished pushing the task to its local queue?
I think that, in this diagram, thread A always happens-before C and thread B always happens-before C, so there is no problem in C.
I think the problem here may be in thread B: thread B may read head and tail out of date, causing it to wake up C by mistake or to fail to wake up C.
Please correct me if I'm wrong.

ADD-SP (Member) commented on Jun 14, 2025:

> Is there any case where worker C thinks that A has not yet finished pushing the task to its local queue?

Worker A's modification of tail is always visible in worker C because of the transitive happens-before relation in this diagram.

> Thread B may read head and tail out of date, causing it to wake up C by mistake or fail to wake up C.

Sorry, I don't understand what "fail to wake up C" means; could you explain it more?

Regarding the "wake up C by mistake", consider this scenario:

Let's say there are only three workers.

  1. Worker A is empty (head == 10 && tail == 10).
  2. Worker A pushes a new task to its local queue (tail.store(11, Release)).
  3. Worker A pops the task from its local queue (head.CAS(10, 11, AcqRel)).
  4. Worker A is preempted.
  5. Worker B gets head.load(Acquire) == 10 && tail.load(Acquire) == 11 from Steal::len(), and wakes up worker C.
  6. Worker B is preempted.
  7. Worker C tries to steal the task at Steal::inner::buffer[10], and then fails at the head.CAS(10, 11):

```rust
let res = self
    .0
    .head
    .compare_exchange(prev_packed, next_packed, AcqRel, Acquire);
```

  8. Worker C cannot steal anything from worker A.

Of course, we could execute fetch_add(0) on both head and tail to avoid this possibility, but that only works in a highly restricted scenario like the example above. If another worker tries to steal tasks from worker A at the same time, worker C might still fail on the above CAS.
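
A hypothetical sketch of that fetch_add(0) idea (not something proposed in the PR): an atomic read-modify-write always reads the latest value in the object's modification order, which a plain Acquire load does not guarantee.

```rust
use std::sync::atomic::{AtomicU32, Ordering::AcqRel};

// Read `head` and `tail` via no-op RMWs to get the freshest values in
// each object's modification order. This is typically more expensive
// than a plain load, since the RMW must gain exclusive access to the
// cache line.
fn freshest(head: &AtomicU32, tail: &AtomicU32) -> (u32, u32) {
    let h = head.fetch_add(0, AcqRel);
    let t = tail.fetch_add(0, AcqRel);
    (h, t)
}
```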

Darksonn (Contributor) commented:

I've thought more about this. It's possible that this change is correct, but if it is, then I think the reasons are too complex. I would prefer to keep the Acquire load as-is for simplicity.

Regardless, thanks for taking the time to submit a PR.

Darksonn closed this on Jun 18, 2025.
Toby-Shi-cloud (Author) commented:

@ADD-SP @Darksonn Thanks very much for reviewing and explaining the details these days.

> It's possible that this change is correct, but if it is, then I think the reasons are too complex. I would prefer to keep the Acquire load as-is for simplicity.

I agree with this, so I probably won't push this PR any further. Thanks for taking the time to review my PR.

>> Thread B may read head and tail out of date, causing it to wake up C by mistake or fail to wake up C.
>
> Sorry, I don't understand what "fail to wake up C" means; could you explain it more?
> Regarding the "wake up C by mistake", consider this scenario:
> ...

I meant that if we used Relaxed instead of Acquire, there would be more possibilities for worker B to get a wrong size for the queue. Although this is not a correctness problem, it may waste more time.
I think the change is correct, but it may not necessarily perform better. This is also why I think keeping it as-is is a good idea.
