scx_mitosis: move tick() to timer #2759
Conversation
Looks great. A few ideas / questions.
```c
	if (bpf_get_smp_processor_id())
		return;

	int ret;
	if ((ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0))) {
```
From the angle of reducing re-entrancy risk, would it make sense to re-arm the timer at the end of this function instead of the beginning?
After reading, I think re-entrancy on the same core is impossible. This timer callback runs in softirq context: interrupts are enabled, but sleeping is not allowed, and another softirq cannot run on the same core until this one finishes.
Perhaps that implies we should try to keep reconfig short, because it can block processing of other softirqs.
Yeah, let me read up on this; it's not clear whether there's any re-entrancy on a timer. It probably makes sense to re-arm at the end regardless.
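For reference, re-arming at the end would take roughly this shape (a sketch, not the actual patch: the callback name is illustrative, `TIMER_INTERVAL_NS` is from this PR's diff hunk, and the use of `scx_bpf_error()` on failure is an assumption):

```c
/* Sketch: re-arm at the end of the callback rather than the start, so
 * the next firing cannot be queued until this invocation's work is done. */
static int timer_cb(void *map, int *key, struct bpf_timer *timer)
{
	/* ... perform the periodic reconfiguration work ... */

	/* Re-arm only after the work above has completed. */
	if (bpf_timer_start(timer, TIMER_INTERVAL_NS, 0))
		scx_bpf_error("failed to re-arm reconfiguration timer");
	return 0;
}
```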
```c
	bpf_rcu_read_unlock();
	bpf_cgroup_release(root_cgrp_ref);
	return 0;
out_root_cgrp:
	bpf_cgroup_release(root_cgrp_ref);
```
How about using cleanup attributes for these? I find it difficult to keep all the gotos straight. I think it's also an easier way to (not have to) reason about error propagation.
Let me look into this as a followup PR. The cleanup attributes are kind of messy still and may require reordering code in some cases. But in general, I agree that it makes the code cleaner.
Should we call bpf_timer_cancel() somewhere? mitosis_exit()?
What is protecting the core callbacks (enqueue, dispatch, select) from race with reconfiguration? This question isn't new to this diff, I was confused about it in the tick() implementation as well.
I don't have a firm example in mind, but how much trouble could we get in if a cpumask changed in the middle of dispatch()? (Unfortunately I'm thinking in terms of the L3 aware version of dispatch() here.)
> Should we call bpf_timer_cancel() somewhere? mitosis_exit()?
I don't think we need to - the bpf program getting unloaded will kill the timer.
> What is protecting the core callbacks (enqueue, dispatch, select) from race with reconfiguration?
It's inherently racy. The only important properties are that we preserve atomicity (the cpumask a callback sees is always one that the timer work published, never some intermediate state) and that we never miss an update (eventually, we can be sure all scheduled tasks are using, or will use on their next schedule, the latest published cpumask). You'll notice a common pattern to how we mutate cpumasks:
- We operate on the tmp_cpumask associated with a cell. This is only accessed by the tick() or timer() work, so there's no worry about partial reads of this cpumask.
- Once we have completed the operation, we bpf_kptr_xchg() that tmp_cpumask into the real cpumask, which atomically "publishes" the new cpumask.
- Any concurrent enqueue, dispatch, init, select, etc. reads either the previously published cpumask or the newly published one, so atomicity is ensured.
- We sequence the reads/writes of applied_configuration_seq such that we always publish the cpumask first and then bump applied_configuration_seq, and on the read side we always read applied_configuration_seq and then read the cpumask, so it's not possible to miss a cpumask update.
Force-pushed from f096d1a to 0e5aca7:
If cpu 0 is not running any scx_mitosis tasks, we can miss configuration updates. This was due to a misunderstanding of how tick() works. The intended behavior is just to periodically run this logic on some core. A bpf_timer is the appropriate way to do this and so this commit makes that change. There's an additional small fix to the freeing logic ensuring that cpus are given back to the root cgroup. Signed-off-by: Dan Schatzberg <[email protected]>
Force-pushed from 0e5aca7 to b9f1706:
If cpu 0 is not running any scx_mitosis tasks, we can miss configuration updates. This was due to a misunderstanding of how tick() works. The intended behavior is just to periodically run this logic on some core. A bpf_timer is the appropriate way to do this and so this commit makes that change. There's an additional small fix to the freeing logic ensuring that cpus are given back to the root cgroup.
Most of the changes stem from needing to cover more of what was previously the tick() logic under a wider RCU read lock, due to migrating it to a sleepable timer.
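Presumably the widened RCU section takes roughly this shape (a sketch under that assumption, not the actual diff: the walk over cells/cgroups is elided and the callback name is illustrative):

```c
/* Sketch: once the logic moves into a timer callback, the cgroup and
 * cpumask walks need an explicit RCU read-side section spanning their
 * whole extent; error paths must drop references before returning. */
static int reconfiguration_timer(void *map, int *key, struct bpf_timer *timer)
{
	struct cgroup *root_cgrp_ref;
	/* ... acquire a reference on the root cgroup (elided) ... */

	bpf_rcu_read_lock();
	/* ... the former tick() logic: walk cells, stage and publish new
	 * cpumasks, give freed cpus back to the root cgroup ... */
	bpf_rcu_read_unlock();

	bpf_cgroup_release(root_cgrp_ref);
	return 0;
}
```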