Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time: avoid traversing entries in the time wheel twice #6584

Merged
merged 23 commits into from
Jun 14, 2024

Conversation

wathenjiang
Copy link
Contributor

@wathenjiang wathenjiang commented May 24, 2024

Motivation

Currently we use a pending linked list to temporarily store the timer entry to be fired.

If there are a large number of timers that need to be waken at once, then there will be a cache miss issue here.

Solution

For synchronizing the state between threads, when we want to modify the state of entries in wheel, we must hold the lock of wheel. However, when we hold a lock, in order to avoid the deadlock issue, we are not allowed wake the walker of entries.

Furthermore, to avoid the issue of the infinite loop, we need to take all of the entries off the slot before processing any of them. And, throughout this process, we always hold the lock.

The main reasons for using a pending linked list currently mentioned above.

In addition to supporting random deletion, slot linked list only have FIFO insert_head and pop_back. Given this, if a counter is introduced for each slot, the maximum number of pop_backs can be determined before traversing the slot, which solves the issue of the infinite loop.

Furthermore, by solving the infinite loop issue, we no longer need to take all of the entries off the slot before processing any of them. Finally, in this PR, we removed the pending linked list.

Expected benefits are as follows

  • Since there is no need to traverse the linked list twice, the wake-up time will be reduced.
  • Since there is no need to process all entries in the linked list after locking, the lock contention of time wheel will be reduced.

@github-actions github-actions bot added the R-loom-time-driver Run loom time driver tests on this PR label May 24, 2024
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-time Module: tokio/time labels May 25, 2024
@wathenjiang wathenjiang marked this pull request as ready for review May 26, 2024 11:08
@wathenjiang wathenjiang force-pushed the iterate-time-wheel-once branch from bd79cf8 to d57c0cb Compare May 27, 2024 16:59
Comment on lines 323 to 339
// Gets the number of entries in this slot, which will be
// the maximum number of times we traverse.
// Limiting the maximum number of iterations is very important,
// because it's possible that those entries might need to be
// reinserted into the same slot.

// This happens on the highest level, when an entry is inserted
// more than MAX_DURATION into the future. When this happens, we wrap
// around, and process some entries a multiple of MAX_DURATION before
// they actually need to be dropped down a level. We then reinsert them
// back into the same position; we must make sure we don't then process
// those entries again or we'll end up in an infinite loop.

// This can also happens when other threads concurrently add entries to
// this currently traversing slot.
let count = lock.get_entries_count(&expiration);
for _ in 0..count {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way we usually handle this kind of thing is to move all entries in the list to another list (using the guarded linked list), and then we wake entries from the guarded list in batches. Can we not do the same here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope I understand you correctly.

What I want to address is to avoid moving all entries in the list to another list at once.

I currently see GuardedLinkedList used in broadcast and notify, and it seems that it must remove all entries from the original linked list at once through calling std::mem::take.

Using the std::mem::take, we have no way to address the goal of traversing the Linked List only once. Because once the TimerHandle is removed from the original linked list, other threads need to know that it is no longer in the original linked list (they can be dropped concurrently). Here, synchronization must be achieved through locks. After modifying the state of the TimerHandle, we can not wake it up immediately because the lock is held at this time, and we need to avoid deadlock.

In short,

  • Because all entries are moved at once through std::mem::take, we must traverse the entire list and mark them as no longer in the original linked list.
  • And wakeup does not allow holding locks, so we must temporarily store these entries in the pending linked list.

The master version is:

  • Get the lock
  • Remove all entries from the slot linked list at once through calling std::mem::take.
  • Traverse the slot linked list, mark all entries as STATE_PENDING_FIRE.
  • Release the lock
  • After that, other threads can know their TimerHanles are stored in the pending linked list instead of the original one.
  • Traverse the pending linked list in batch:
    • Get the lock
    • Traverse the pending linked list, mark some entries as STATE_DEREGISTERED in batch, and get their waker (32 in batch).
    • Release the lock
    • After that, other threads can know their TimerHanles are not in the pending linked list or the slot linked list.
    • Wake them up.
    • Get the lock again ...

The current Version is:

Traverse the slot linked list in batch:

  • Get the lock
  • Traverse the slot linked list, mark some entries as STATE_FIRING and then STATE_DEREGISTERED in batch, and get their waker (32 in batch).
  • Release the lock
  • After that, other threads can know their TimerHanles are not in the slot linked list.
  • Wake them up.
  • Get the lock again ...

Comment on lines 360 to 362
Err(expiration_tick) => {
lock.add_entry(entry, expiration.clone(), expiration_tick);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There must be some operation somewhere that moves timers into lower levels. Is this it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is the exactly code block which do such things. The position of TimerEntry will recaculate here.

@FrankReh
Copy link

Drive-by comments.

I have no idea if the maintainers will accept this idea. On the surface, this is what it looks like is happening with these changes.

A timer wheel's slot's entries are now counted. This increases the timer wheel sizes in memory but is meant to allow cycling through them once.

When running through a timer wheel slot list, the original count is used to try to know when to stop popping from the back of the list, because logic is now added that pushes back to the front. But because the lock has to be released and then reacquired when the local stack-frame slice of temporary expired events has become full, another thread may have come in and added an entry (that conceivably has a 'now' deadline) so it seems this new entry will not be seen by the loop.

This is going to be a very rare event, as it also must coincide with the local stack-frame's slice having filled up, so it is conceivable a solution is to punt out of the looping at that point and restart at the top of the function, getting the new total size of the list, starting with an empty stack frame slice too, and just starting again to look for entries that expired 'now'.

It is nice that the poll function logic is simplified. (If that's actually correct - I haven't groked why that would be the case but I'm probably not seeing the whole picture.)

The Expiration Clone isn't necessary as the one use of clone() can be avoided by passing a reference, or just passing the Expriation.deadline directly.

There is a new use of unsafe in wheel/mod.rs, in the new function add_entry. To match other coding style in that file, the add_entry might be labeled unsafe and the SAFETY comment that would be added in time/mod.rs would say why it was safe to be added back.

The idea is commendable. It seems in the common case, the work the runtime has to do is being streamlined. But there's at least the one corner case I mentioned above and the logic is spread out over four files so someone with a clearer idea of all the workings would have to look at this and say that the timer wheel logic was still tight because the corner cases seem too hard to test for exhaustively and even fuzz testing isn't going to be a guarantee the whole thing still hangs together.

But also, is the added size of the timer wheels worth it? The sharding and the lazy initialization changes that went in recently already made the timer wheel much more efficient for busy systems.

@wathenjiang
Copy link
Contributor Author

wathenjiang commented May 29, 2024

Thanks for your comments.

A timer wheel's slot's entries are now counted. This increases the timer wheel sizes in memory but is meant to allow cycling through them once.

Yes. The current approach comes at the cost of increased space usage, which is the cost. :)
And what, Gul'dan, must we give it return?

When running through a timer wheel slot list, the original count is used to try to know when to stop popping from the back of the list, because logic is now added that pushes back to the front. But because the lock has to be released and then reacquired when the local stack-frame slice of temporary expired events has become full, another thread may have come in and added an entry (that conceivably has a 'now' deadline) so it seems this new entry will not be seen by the loop.

This is going to be a very rare event, as it also must coincide with the local stack-frame's slice having filled up, so it is conceivable a solution is to punt out of the looping at that point and restart at the top of the function, getting the new total size of the list, starting with an empty stack frame slice too, and just starting again to look for entries that expired 'now'.

It is nice that the poll function logic is simplified. (If that's actually correct - I haven't groked why that would be the case but I'm probably not seeing the whole picture.)

Yes. But the current design also does not ensures that the very new entry can be seen by the second traverse loop.

Because of we set_elapsed immediately, this will not happen.

The Expiration Clone isn't necessary as the one use of clone() can be avoided by passing a reference, or just passing the Expriation.deadline directly.

Yes, the Expiration Clone isn't necessary, and I will remove the #[derive(Clone)] of that type.

There is a new use of unsafe in wheel/mod.rs, in the new function add_entry. To match other coding style in that file, the add_entry might be labeled unsafe and the SAFETY comment that would be added in time/mod.rs would say why it was safe to be added back.

Yes, we should add SAFETY comments to the new function add_entry in wheel/mod.rs.

I rename it reinsert_entry now.

The idea is commendable. It seems in the common case, the work the runtime has to do is being streamlined. But there's at least the one corner case I mentioned above and the logic is spread out over four files so someone with a clearer idea of all the workings would have to look at this and say that the timer wheel logic was still tight because the corner cases seem too hard to test for exhaustively and even fuzz testing isn't going to be a guarantee the whole thing still hangs together.

But also, is the added size of the timer wheels worth it? The sharding and the lazy initialization changes that went in recently already made the timer wheel much more efficient for busy systems.

Yes. The benefit here depends on how we use the Timer. But if there is a chance to traverse the list only once instead of twice, it is definitely a performance win.

@wathenjiang
Copy link
Contributor Author

The benchmark code is as follows:

#[tokio::main]
async fn main() {
    let now = tokio::time::Instant::now();
    let five_seconds_later = now + tokio::time::Duration::from_secs(5);
    
    let mut handles = vec![];
    for _ in 0..1024 * 1024 {
        handles.push(tokio::spawn(async move {
            tokio::time::sleep_until(five_seconds_later).await;
        }));
    }
    for h in handles {
        let _ = h.await.unwrap();
    }
    println!("{:?}", now.elapsed());
}

The benchmark results of the above are as follows:

  • original version: 5.846156162s

  • this pr version: 5.72068938s

On Linux with the Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz CPU.

This shows there is a time reduction of (0.846156162 - 0.72068938) / 0.846156162 = 14.8%.

@FrankReh
Copy link

FrankReh commented Jun 2, 2024

The benchmark results of the above are as follows:

  • original version: 5.846156162s
  • this pr version: 5.72068938s

On Linux with the Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz CPU.

This shows there is a time reduction of (0.846156162 - 0.72068938) / 0.846156162 = 14.8%.

Why are times 5.8 and 5.7 in one part and 0.8 and 0.7 in another?

@wathenjiang
Copy link
Contributor Author

The benchmark results of the above are as follows:

  • original version: 5.846156162s
  • this pr version: 5.72068938s

On Linux with the Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz CPU.
This shows there is a time reduction of (0.846156162 - 0.72068938) / 0.846156162 = 14.8%.

Why are times 5.8 and 5.7 in one part and 0.8 and 0.7 in another?

Here, I want to measure the time taken for time wheel related operations (which may mainly include traversing the time wheel and waking up and executing tasks).

5 seconds is the time we have to wait, while 0.8 and 0.7 measure the time of related operations on the time wheel.

@wathenjiang
Copy link
Contributor Author

wathenjiang commented Jun 5, 2024

As FrankReh worried, introducing counters for all levels will lead to additional memory consumption. This memory problem is exacerbated because the timer wheel is sharded by worker threads.

I decide to only use a counter for highest level. Now, if our time wheel shard size is 8, the total memory increment of timer wheels would be 3968 bytes. As far as I know, 8 work threads will take 16MB(2MB * 8) as least, so this 4KB extra memory might be acceptable.

@FrankReh
Copy link

FrankReh commented Jun 5, 2024

@wathenjiang Apologies if I have it backwards or if I read the new code incorrectly. Don't you mean to provide counters only for the shortest intervals? I think that would mean level 0, not level 5. But I could easily be missing something.

Also, I do like the idea of treating the most frequently accessed level differently from the rest. It made me wonder if the whole sharding idea could have been done differently, and only shard the first one, two or three levels. But that would come at the cost of more code that needs to be maintained. And I'm not a maintainer. Hardly even a contributor these days. But I hold timer wheel ideas very dear and I haven't been asked not to thrown in my 2 cents now and then.

Also, it would be nice if you got some indications from @Darksonn if you might even be on the right track or if you should just stop. Best I can tell from the comment left by @Darksonn above, your fundamental idea hasn't been given much of a go-ahead yet and you shouldn't take the wrong impression by my comments that your efforts will be rewarded.

@wathenjiang
Copy link
Contributor Author

@wathenjiang Apologies if I have it backwards or if I read the new code incorrectly. Don't you mean to provide counters only for the shortest intervals? I think that would mean level 0, not level 5. But I could easily be missing something.

Also, I do like the idea of treating the most frequently accessed level differently from the rest. It made me wonder if the whole sharding idea could have been done differently, and only shard the first one, two or three levels. But that would come at the cost of more code that needs to be maintained. And I'm not a maintainer. Hardly even a contributor these days. But I hold timer wheel ideas very dear and I haven't been asked not to thrown in my 2 cents now and then.

Also, it would be nice if you got some indications from @Darksonn if you might even be on the right track or if you should just stop. Best I can tell from the comment left by @Darksonn above, your fundamental idea hasn't been given much of a go-ahead yet and you shouldn't take the wrong impression by my comments that your efforts will be rewarded.

I appreciate your comments. If I have been unclear, let me make it clear: we should only deal with the highest level's infinite loop issue.

The idea of the counter is to avoid infinite loops, if there are other solutions, I'm open to them.

Not using counters for the low level is implementation specific, not design specific. I'd love to see the memory savings. On the other hand, if the memory usage is not affected much, we can also use the previous version.

Yes, feedback is important.

@Darksonn
Copy link
Contributor

Darksonn commented Jun 6, 2024

Thank you @FrankReh for prodding me. I will try to summarize my thoughts on this PR. The idea of using a counter to avoid the infinite loop makes me a bit worried, and I can't really tell whether it's correct. It seems to me that the counter is being used to solve a problem that we have encountered many times in Tokio. Previous times we encountered this problem, we solved it using GuardedLinkedList rather than a counter. If we are going to use a different solution here, I need to understand why GuardedLinkedList can't work.

Let me try to summarize my understanding of the PR.

Currently we move items to the pending list one-by-one, and then we traverse pending afterwards. The perf issue is that moving them one-by-one means that we traverse the entries twice.

However, as far as I understand, the items that are moved correspond exactly to those that are stored in a specific bucket, correct? Could we move all of those items into pending in one operation? One of the advantages of linked lists is that you can make that kind of batch operation cheaply, after all.

However, in this comment you object that when we move all of the items into the GuardedLinkedList, we would need to traverse the entire list to mark the entries as being in pending rather than the timer wheel. And that this is because the entries could be dropped while they are in pending.

However, the documentation for LinkedList::remove says this:

/// # Safety
///
/// The caller **must** ensure that exactly one of the following is true:
/// - `node` is currently contained by `self`,
/// - `node` is not contained by any list,
/// - `node` is currently contained by some other `GuardedLinkedList` **and**
/// the caller has an exclusive access to that list. This condition is
/// used by the linked list in `sync::Notify`.
pub(crate) unsafe fn remove(&mut self, node: NonNull<L::Target>) -> Option<L::Handle> {

That is to say, you are allowed to call remove on a linked list in the timer wheel even though the entry is actually in some other GuardedLinkedList as long as the timer wheel and the GuardedLinkedList are protected by the same mutex. In the other users of GuardedLinkedList we have been able to avoid a traversal to mark entries as being in a GuardedLinkedList, so I'm thinking that we can probably also avoid that traversal in this case.

@wathenjiang
Copy link
Contributor Author

Thank you @FrankReh for prodding me. I will try to summarize my thoughts on this PR. The idea of using a counter to avoid the infinite loop makes me a bit worried, and I can't really tell whether it's correct. It seems to me that the counter is being used to solve a problem that we have encountered many times in Tokio. Previous times we encountered this problem, we solved it using GuardedLinkedList rather than a counter. If we are going to use a different solution here, I need to understand why GuardedLinkedList can't work.

Let me try to summarize my understanding of the PR.

Currently we move items to the pending list one-by-one, and then we traverse pending afterwards. The perf issue is that moving them one-by-one means that we traverse the entries twice.

However, as far as I understand, the items that are moved correspond exactly to those that are stored in a specific bucket, correct? Could we move all of those items into pending in one operation? One of the advantages of linked lists is that you can make that kind of batch operation cheaply, after all.

However, in this comment you object that when we move all of the items into the GuardedLinkedList, we would need to traverse the entire list to mark the entries as being in pending rather than the timer wheel. And that this is because the entries could be dropped while they are in pending.

However, the documentation for LinkedList::remove says this:

/// # Safety
///
/// The caller **must** ensure that exactly one of the following is true:
/// - `node` is currently contained by `self`,
/// - `node` is not contained by any list,
/// - `node` is currently contained by some other `GuardedLinkedList` **and**
/// the caller has an exclusive access to that list. This condition is
/// used by the linked list in `sync::Notify`.
pub(crate) unsafe fn remove(&mut self, node: NonNull<L::Target>) -> Option<L::Handle> {

That is to say, you are allowed to call remove on a linked list in the timer wheel even though the entry is actually in some other GuardedLinkedList as long as the timer wheel and the GuardedLinkedList are protected by the same mutex. In the other users of GuardedLinkedList we have been able to avoid a traversal to mark entries as being in a GuardedLinkedList, so I'm thinking that we can probably also avoid that traversal in this case.

Thanks! I have never noticed the comment of the remove method in linked_list. Sorry for mistaking such important information.

I believe this is the best way to address our purpose.

@FrankReh
Copy link

FrankReh commented Jun 7, 2024

I looked it over and actually didn't have anything to add. The GuardedListedList is a very nice resource that tokio provides for some of its drivers. Glad that didn't have to be invented for this. I will say it removes the pending entry so simplifies things in a way and I didn't notice any new uses of unsafe. There is a new constant and the tests around the constants have changed a bit so requires someone who has internalized all the timer wheel workers to say the changes don't open up any holes.

Just wait for @Darksonn to find time. There are a lot of changes so probably takes more than a few minutes to decide if it looks right.

But from my perspective .. nicely done.

@Darksonn
Copy link
Contributor

Darksonn commented Jun 7, 2024

Overall looks reasonable to me.

// Wake a batch of wakers. To avoid deadlock,
// we must do this with the lock temporarily dropped.
drop(lock);
waker_list.wake_all();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about panics here. What happens to entries still in the guarded list?

I think some other examples of this wrap the guarded list in something that removes all entries from the list without waking their wakers.

Copy link
Contributor Author

@wathenjiang wathenjiang Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We shoud ensure the list is empty before dropping it and no double panic, even if it occurs panic. I will wrap it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, we use EntryWaitersList instead. It will address our purpose of memory safety.

Comment on lines 334 to 341
let unguarded_list = lock.take_entries(&expiration);
// It is critical for `GuardedLinkedList` safety that the guard node is
// pinned in memory and is not dropped until the guarded list is dropped.
let guard = TimerShared::new(id);
pin!(guard);
let guard_ptr = NonNull::from(guard.as_ref().get_ref());
// GuardedLinkedList ensures that the concurrent drop of Entry in this slot is safe.
let mut guarded_list = unguarded_list.into_guarded(guard_ptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move the call to lock.take_entries so that we immediately call into_guarded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add the method get_waiters_list to replace it. Is this design reasonable enough for us?

Comment on lines 173 to 178
// This entry is in the `guarded_list`, so it can not be removed
// from the entry list with the same `level` and `slot`.
// Because its state is STATE_DEREGISTERED, it has been fired.
if cur_state == STATE_DEREGISTERED {
break Err(cur_state);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. How do you know that it's in the guarded list? And why is that a problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the Entry which is pop_back ed from guarded_list can be mark_firing. So we can ensure this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we fail to mark a Entry as STATE_FIRING, then we get a problem.

This entry has been fired, so we fail to fire it again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code seems reasonable, I just found the comment confusing. Why is it important that it can't be removed from the entry list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// This entry is in the `guarded_list`, so it can not be removed
// from the entry list with the same `level` and `slot`.

Sorry, this took me some time to think about why I wrote the above comments.

What I means is that if the entry list with the same level and slot is empty, we can not remove the entry from the entry list, but the Entry is in the guarded list.

The comments are very confusing, I decide to delete it.

Copy link
Contributor Author

@wathenjiang wathenjiang Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is actually my explanation of why the state value is STATE_DEREGISTERED here.

Because it can not be removed, but can be fired, so we meet this Entry here.

Please see the clear_entry method of the Handle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can make this more clear by adding some comments.

@wathenjiang wathenjiang force-pushed the iterate-time-wheel-once branch from 8e88893 to 17fc129 Compare June 13, 2024 10:43
@wathenjiang wathenjiang force-pushed the iterate-time-wheel-once branch from 77245fd to 7043eb2 Compare June 13, 2024 12:39
Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

@Darksonn
Copy link
Contributor

This change has been reverted in #6715.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-time Module: tokio/time R-loom-time-driver Run loom time driver tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants