
Conversation

Toby-Shi-cloud (Author) opened this pull request:

Motivation

Inspired by #7340. Explained in #7385.

As the values returned from Steal::len and Local::len cannot be used for indexing, the loads of head and tail in these functions can be marked Relaxed (but not unsync_load for Steal). After all, no matter which memory ordering we use, the returned value is an approximate size, not the actual size.

Solution

Relaxed load for tail

As explained in #7385, marking the load of head as Relaxed would require an additional comparison. Therefore, I only mark the load of tail as Relaxed.

If we marked the loads of both head and tail as Relaxed, it would be possible to observe head larger than tail (for Steal::len); marking only the load of tail as Relaxed is safe.

Merge implementations of Local<T>::len and Steal<T>::len

I cannot see any difference between a direct pointer dereference and an atomic Relaxed load, since the Acquire load already prevents the relevant compiler optimizations. Therefore, I merged the two implementations.
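For illustration, here is a minimal sketch of what the merged, approximate len could look like under the proposed orderings. The field names and layout are simplified (tokio's real Steal additionally packs a "steal" index into head), so this is not the PR's literal diff:

```rust
use std::sync::atomic::{
    AtomicU32,
    Ordering::{Acquire, Relaxed},
};

struct Queue {
    head: AtomicU32, // advanced by stealers via CAS
    tail: AtomicU32, // advanced by the owner via Release stores
}

impl Queue {
    // Approximate length: Acquire on `head`, Relaxed on `tail`.
    // The result is never used for indexing, so a stale `tail`
    // only makes the estimate slightly out of date.
    fn len(&self) -> u32 {
        let head = self.head.load(Acquire);
        let tail = self.tail.load(Relaxed);
        tail.wrapping_sub(head)
    }
}
```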

If I misunderstood something, please let me know.

The github-actions bot added the R-loom-multi-thread (Run loom multi-thread tests on this PR) label on Jun 6, 2025.
ADD-SP (Member) commented on Jun 6, 2025:

> it's possible to see head larger than tail

I think head > tail is intentional, given that this queue is essentially a ring buffer.

> I cannot see any difference between using a direct pointer dereference and an atomic Relaxed load

Pointer dereferencing generates more efficient LLVM IR because it better reflects the author's intent.

Regarding the loading itself, I would expect LLVM to emit the same instructions for pointer dereferencing and Relaxed loading on Arm and x64. However, Relaxed loading prevents the compiler from performing certain optimisations, such as memory coalescing.

ADD-SP (Member) left a review comment:


Since the modification of head is sequenced-before the Release store to tail when stealing tasks, and the Release store to tail synchronizes-with the Acquire load of tail, I think there is a benefit to using specialized implementations for Local and Steal.

I think there is room to optimise the implementation of Steal::len(), but I haven't thought much about that yet.

Toby-Shi-cloud (Author) commented:

Thanks for reviewing.

> I think head > tail is intentional, given that this queue is essentially a ring buffer.

If I understand correctly, I don't think this is intentional. The head and tail only increase and never decrease; the actual index is head & mask or tail & mask. head > tail should never happen in Local<T>, but it may happen under a particular synchronization order if we use Relaxed. Consider the following example (a loom sketch of it follows the list):

  1. The owner thread A increased tail.
  2. Thread B stole data and increased head.
  3. It's possible for thread C to only see the update written by B while not seeing the update written by A.
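
A hypothetical loom model of this interleaving (not code from the PR): loom exhaustively explores the interleavings the C11 memory model allows, so running it checks whether thread C can really observe B's update without A's when both of C's loads are Relaxed.

```rust
use loom::sync::atomic::{
    AtomicUsize,
    Ordering::{Acquire, Relaxed, Release},
};
use loom::sync::Arc;
use loom::thread;

fn main() {
    loom::model(|| {
        let head = Arc::new(AtomicUsize::new(0));
        let tail = Arc::new(AtomicUsize::new(0));

        // Thread A: the owner publishes a task.
        let t = Arc::clone(&tail);
        let a = thread::spawn(move || t.store(1, Release));

        // Thread B: a stealer observes the task and advances `head`.
        let (h, t) = (Arc::clone(&head), Arc::clone(&tail));
        let b = thread::spawn(move || {
            if t.load(Acquire) == 1 {
                h.store(1, Release);
            }
        });

        // Thread C: reads both counters with Relaxed loads.
        let (h, t) = (Arc::clone(&head), Arc::clone(&tail));
        let c = thread::spawn(move || {
            let head = h.load(Relaxed);
            let tail = t.load(Relaxed);
            // If loom finds an execution where this fails, C saw
            // B's update to `head` without A's update to `tail`.
            assert!(head <= tail, "head={head}, tail={tail}");
        });

        a.join().unwrap();
        b.join().unwrap();
        c.join().unwrap();
    });
}
```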

> However, Relaxed loading prevents the compiler from performing certain optimisations, such as memory coalescing.

I thought the Acquire atomic load already prevents those compiler optimizations, but I am not an expert on compilers or LLVM. Maybe it's a good idea to keep the pointer dereferencing as-is.

Toby-Shi-cloud force-pushed the master branch 2 times, most recently from 4d4520e to 1f53f88, on June 7, 2025.
ADD-SP (Member) commented on Jun 7, 2025:

> The head and tail only increase and never decrease.

There are some usages of wrapping_add in the implementation.

I'm glad to discuss this topic in depth, but let's align our understanding first; please correct me if I'm wrong.

Toby-Shi-cloud (Author) commented:

Oh, I was wrong. I hadn't considered the indexes being used up (I just found that we use u16 on 32-bit machines, and even u32 could run out sometimes). Sorry about that.

Reconsidering all of this, using a Relaxed load on head may be unacceptable, since we would have no way of knowing whether head > tail is due to a synchronization-order issue or is intentional wraparound.

ADD-SP left a comment that was later marked as outdated.

ADD-SP (Member) commented on Jun 8, 2025:

> 1. The owner thread A increased tail.
> 2. Thread B stole data and increased head.
> 3. It's possible for thread C to only see the update written by B while not seeing the update written by A.

Does an Acquire load on head in thread C ensure the Release store on head in thread B is visible in thread C 👀 ?


I think the following four code snippets are equivalent to each other for calculating the length of Steal<T>, because only SeqCst can establish a single total order:

```rust
// 1
head.load(Acquire);
tail.load(Acquire);
// 2
head.load(Acquire);
tail.load(Relaxed);
// 3
head.load(Relaxed);
tail.load(Acquire);
// 4
head.load(Relaxed);
tail.load(Relaxed);
```

Darksonn added the A-tokio (Area: The main tokio crate) and M-runtime (Module: tokio/runtime) labels on Jun 8, 2025.
Toby-Shi-cloud (Author) commented:

> Does an Acquire load on head in thread C ensure the Release store on head in thread B is visible in thread C 👀?

No. What I meant was:

  • With an Acquire load, if thread C reads the same value of head that was stored by thread B, it is guaranteed that the store to tail in thread A is visible in thread C.
  • With only Relaxed loads, no memory ordering is guaranteed, so I think it's possible that tail's update is invisible while head's is visible in thread C.

But after reconsidering your reply, I think the latter will never happen.

> I think the following four code snippets are equivalent to each other for calculating the length of Steal<T>, because only SeqCst can establish a single total order.

I think you are right. If I understand correctly, you mean that since every Release store on head is sequenced-before an Acquire load on tail and vice versa, no thread at any moment could observe head unexpectedly larger than tail, making every implementation of the length calculation safe.

As a conclusion, is changing the implementation to two Relaxed loads the best solution?

> By the way, if you open a larger PR in the future, try to avoid squashing commits during the review process, because it removes the history and makes the PR hard to follow. For a small PR like this it doesn't matter.

Sorry about that. I squashed the commits because I thought they were not steps of the solution but entirely different solutions. I will think more before pushing commits next time.

Darksonn (Contributor) commented on Jun 9, 2025:

The len method is used through is_empty here:

```rust
fn notify_if_work_pending(&self) {
    for remote in &self.shared.remotes[..] {
        if !remote.steal.is_empty() {
            self.notify_parked_local();
            return;
        }
    }
}
```

This means that when head != tail, we wake up a worker so that it can steal tasks. Consider this scenario:

  1. Worker A is pushing a task to its local queue.
  2. Worker A performs a release store to tail.
  3. Worker B gets false from self.shared.remotes[A].steal.is_empty() indicating that a value can be stolen in A.
  4. Worker B calls unpark on worker C.
  5. Worker C wakes up and attempts to steal from A.

Now the question is: assuming that no other worker attempts to steal from A, will worker C succeed in stealing the task in A? If the load of tail is Acquire, then the answer is clearly yes: worker A writing to tail happens-before worker B reading from tail (because of acquire/release), which happens-before worker B calling unpark, which happens-before worker C stealing.

Now, without the acquire/release, it may be the case that worker C thinks that A has not yet finished pushing the task to its local queue, and therefore the task is not yet available to be stolen.

Whether this can happen in practice, I don't know, but it seems best to keep the acquire/release.
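
To make the chain above concrete, here is a minimal hypothetical sketch with std primitives (none of these names come from tokio; thread::spawn stands in for unpark, since both establish a happens-before edge):

```rust
use std::sync::atomic::{
    AtomicU32,
    Ordering::{Acquire, Relaxed, Release},
};
use std::sync::Arc;
use std::thread;

fn main() {
    let tail = Arc::new(AtomicU32::new(0));

    // Worker A: pushes a task and publishes it with a Release store.
    let t = Arc::clone(&tail);
    let a = thread::spawn(move || t.store(1, Release));

    // Worker B: the is_empty() check; the Acquire load synchronizes-with
    // A's Release store, then B "unparks" C.
    let t = Arc::clone(&tail);
    let b = thread::spawn(move || {
        while t.load(Acquire) == 0 {} // spin until B observes the push
        let t2 = Arc::clone(&t);
        // Worker C: because it starts after B observed the push, it is
        // guaranteed to see A's store, even with a Relaxed load here.
        thread::spawn(move || assert_eq!(t2.load(Relaxed), 1))
            .join()
            .unwrap();
    });

    a.join().unwrap();
    b.join().unwrap();
}
```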

ADD-SP (Member) commented on Jun 9, 2025:

@Darksonn Thanks for your review; I believe you review PRs at a more holistic level, which helps to accurately determine downstream impacts.

I just drew a diagram to help me understand your comment; please correct me if I'm wrong. I hope it also helps @Toby-Shi-cloud.

[diagram: the happens-before chain between workers A, B, and C around the run queue]

Toby-Shi-cloud (Author) commented:

Thanks for the reviews. The diagram is great for understanding the logic better. But I still have one question about steal_into2:

```rust
fn steal_into2(&self, dst: &mut Local<T>, dst_tail: UnsignedShort) -> UnsignedShort {
    let mut prev_packed = self.0.head.load(Acquire);
    let mut next_packed;

    let n = loop {
        let (src_head_steal, src_head_real) = unpack(prev_packed);
        let src_tail = self.0.tail.load(Acquire);
        // ...
```

Since thread A synchronizes-with line 469 (let src_tail = self.0.tail.load(Acquire)), is there any case where worker C thinks that A has not yet finished pushing the task to its local queue?
I think that, in this diagram, thread A always happens-before C and thread B always happens-before C, so there is no problem in C.
I think the problem here may be in thread B: thread B may read head and tail out of date, causing it to wake up C by mistake or to fail to wake up C.
Please correct me if I'm wrong.

ADD-SP (Member) commented on Jun 14, 2025:

> Is there any case where worker C thinks that A has not yet finished pushing the task to its local queue?

Worker A's modification of tail is always visible in worker C because of the transitive happens-before relation in this diagram.

> Thread B may read head and tail out of date, causing it to wake up C by mistake or fail to wake up C.

Sorry, I don't understand what "fail to wake up C" means; could you explain it more?

Regarding the "wake up C by mistake", consider this scenario:

Let's say there are only three workers.

  1. Worker A is empty (head == 10 && tail == 10).
  2. Worker A pushes a new task to its local queue (tail.store(11, Release)).
  3. Worker A pops the task from its local queue (head.CAS(10, 11, AcqRel)).
  4. Worker A is preempted.
  5. Worker B gets head.load(Acquire) == 10 && tail.load(Acquire) == 11 from Steal::len(), and wakes up worker C.
  6. Worker B is preempted.
  7. Worker C tries to steal the task at Steal::inner::buffer[10], and then fails at the head.CAS(10, 11):

```rust
let res = self
    .0
    .head
    .compare_exchange(prev_packed, next_packed, AcqRel, Acquire);
```

  8. Worker C cannot steal anything from worker A.

Of course, we could execute fetch_add(0) on both head and tail to avoid this possibility, but that only works in a highly restricted scenario like the example above. If another worker tries to steal tasks from worker A at the same time, worker C might still fail on the above CAS.
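
A hypothetical sketch of that fetch_add(0) idea (not something proposed in the PR): an atomic read-modify-write always reads the latest value in the object's modification order, which a plain Acquire load does not guarantee.

```rust
use std::sync::atomic::{AtomicU32, Ordering::AcqRel};

// Read `head` and `tail` via no-op RMWs to get the freshest values in
// each object's modification order. This is typically more expensive
// than a plain load, since the RMW must gain exclusive access to the
// cache line.
fn freshest(head: &AtomicU32, tail: &AtomicU32) -> (u32, u32) {
    let h = head.fetch_add(0, AcqRel);
    let t = tail.fetch_add(0, AcqRel);
    (h, t)
}
```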

Darksonn (Contributor) commented:

I've thought more about this. It's possible that this change is correct, but if it is, then I think the reasons are too complex. I would prefer to keep the Acquire load as-is for simplicity.

Regardless, thanks for taking the time to submit a PR.

Darksonn closed this on Jun 18, 2025.
Toby-Shi-cloud (Author) commented:

@ADD-SP @Darksonn Thanks very much for reviewing and explaining the details these days.

> It's possible that this change is correct, but if it is, then I think the reasons are too complex. I would prefer to keep the Acquire load as-is for simplicity.

I agree with this, so I probably won't push this PR any further. Thanks for taking the time to review my PR.

>> Thread B may read head and tail out of date, causing it to wake up C by mistake or fail to wake up C.
>
> Sorry, I don't understand what "fail to wake up C" means; could you explain it more?
> Regarding the "wake up C by mistake", consider this scenario:
> ...

I meant that if we used Relaxed instead of Acquire, there would be more possibilities for worker B to get a wrong size for the queue. Although this is not a correctness problem, it may waste more time.
I think the change is correct, but it may not necessarily perform better. This is also why I think keeping it as-is is a good idea.
