
Conversation

dschatzberg (Contributor):

If CPU 0 is not running any scx_mitosis tasks, we can miss configuration updates. This was due to a misunderstanding of how tick() works: the intended behavior is just to periodically run this logic on some core, and a bpf_timer is the appropriate way to do that, so this commit makes the change. There's an additional small fix to the freeing logic, ensuring that CPUs are given back to the root cgroup.

Most of the changes come from needing to cover more of the former tick() logic under a larger RCU lock as a result of migrating it to a sleepable timer.
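
For context, the overall shape of a periodic bpf_timer is sketched below. This is a minimal illustration rather than the patch itself: the map name, callback name, interval value, and init hook are made up.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define CLOCK_MONOTONIC 1 /* clockid macro; not emitted into vmlinux.h */
#define TIMER_INTERVAL_NS (100ULL * 1000 * 1000) /* assumed 100ms interval */

struct timer_wrapper {
	struct bpf_timer timer;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct timer_wrapper);
} reconfig_timer SEC(".maps");

static int reconfig_timer_cb(void *map, int *key, struct timer_wrapper *w)
{
	/* Re-arm first, mirroring the patch; see the review discussion
	 * below about moving this to the end of the callback instead. */
	bpf_timer_start(&w->timer, TIMER_INTERVAL_NS, 0);

	/* ... the former tick() reconfiguration logic runs here ... */
	return 0;
}

/* Called once from an init path to start the periodic work; from then on
 * the callback keeps itself scheduled on some core, independent of CPU 0. */
static int arm_reconfig_timer(void)
{
	u32 key = 0;
	struct timer_wrapper *w = bpf_map_lookup_elem(&reconfig_timer, &key);

	if (!w)
		return -1;

	bpf_timer_init(&w->timer, &reconfig_timer, CLOCK_MONOTONIC);
	bpf_timer_set_callback(&w->timer, reconfig_timer_cb);
	return bpf_timer_start(&w->timer, TIMER_INTERVAL_NS, 0);
}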

tommy-u (Contributor) left a comment:

Looks great. A few ideas / questions.

if (bpf_get_smp_processor_id())
return;
int ret;
if ((ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0))) {
tommy-u (Contributor):

From the angle of reducing re-entrancy risk, would it make sense to re-arm the timer at the end of this function instead of the beginning?
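
Concretely, the suggested reorder would look something like this sketch (reusing the illustrative timer_wrapper from above; do_reconfig() is a stand-in for this function's body):

static int reconfig_timer_cb(void *map, int *key, struct timer_wrapper *w)
{
	int ret;

	/* Do the work first ... */
	do_reconfig();

	/* ... then re-arm, so a slow run can't overlap its own next firing. */
	if ((ret = bpf_timer_start(&w->timer, TIMER_INTERVAL_NS, 0)))
		bpf_printk("failed to re-arm reconfig timer: %d", ret);

	return 0;
}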

tommy-u (Contributor):

After reading, I think re-entrancy on the same core is impossible. This timer callback runs in softirq context: interrupts are enabled, but sleeping is not, and another softirq cannot run on that CPU until this one finishes.

Perhaps that implies we should try to keep reconfiguration short, since it can block other softirqs from being processed.

dschatzberg (Contributor, Author):

Yeah, let me read up on this; it's not clear whether there's any re-entrancy on a timer. It probably makes sense to re-arm at the end regardless.

bpf_rcu_read_unlock();
bpf_cgroup_release(root_cgrp_ref);
return 0;
out_root_cgrp:
bpf_cgroup_release(root_cgrp_ref);
tommy-u (Contributor):

How about using cleanup attributes for these (see the sketch below)? I find it difficult to keep all the gotos straight, and I think it's also an easier way to (not have to) reason about error propagation.
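
For illustration, a sketch of the cleanup-attribute pattern with hypothetical names; whether the BPF verifier accepts this on a given kernel is part of what would need checking:

extern struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
extern void bpf_cgroup_release(struct cgroup *cgrp) __ksym;

/* Hypothetical helper invoked automatically when the variable goes
 * out of scope; bpf_cgroup_release() is the real kfunc. */
static void cgroup_release_cleanup(struct cgroup **cgrpp)
{
	if (*cgrpp)
		bpf_cgroup_release(*cgrpp);
}

#define __free_cgroup __attribute__((cleanup(cgroup_release_cleanup)))

static int example(u64 root_cgid)
{
	struct cgroup *root_cgrp __free_cgroup = bpf_cgroup_from_id(root_cgid);

	if (!root_cgrp)
		return -1;

	/* Early returns no longer need a goto: the cleanup handler runs
	 * and releases the reference on every exit path. */
	return 0;
}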

dschatzberg (Contributor, Author):

Let me look into this as a follow-up PR. The cleanup attributes are still kind of messy and may require reordering code in some cases, but in general I agree that they make the code cleaner.

tommy-u (Contributor):

Should we call bpf_timer_cancel() somewhere? mitosis_exit()?

tommy-u (Contributor):

What is protecting the core callbacks (enqueue, dispatch, select) from racing with reconfiguration? This question isn't new to this diff; I was confused about it in the tick() implementation as well.

I don't have a firm example in mind, but how much trouble could we get in if a cpumask changed in the middle of dispatch()? (Unfortunately I'm thinking in terms of the L3 aware version of dispatch() here.)

dschatzberg (Contributor, Author):

> Should we call bpf_timer_cancel() somewhere? mitosis_exit()?

I don't think we need to - the bpf program getting unloaded will kill the timer.

> What is protecting the core callbacks (enqueue, dispatch, select) from race with reconfiguration?

It's inherently racy. The only important properties are atomicity (the cpumask a reader sees is always something the timer work published, never some intermediate state) and that we never miss an update (i.e. eventually we can be sure all scheduled tasks are using, or will use on their next schedule, the latest published cpumask). You'll notice a common pattern to how we mutate cpumasks (see the sketch after this list):

  1. We operate on the tmp_cpumask associated with a cell. This is only accessed by the tick() or timer() work, so there's no worry about partial reads of this cpumask.
  2. Once we have completed the operation, we bpf_kptr_xchg() that tmp_cpumask into the real cpumask, which atomically "publishes" the new cpumask.
  3. Any concurrent enqueue, dispatch, init, select, etc. reads either the previously published cpumask or the newly published one, so atomicity is ensured.
  4. We sequence the reads and writes of applied_configuration_seq such that we always publish the cpumask first and then bump applied_configuration_seq, while on the read side we always read applied_configuration_seq first and then the cpumask, so it's not possible to miss a cpumask update.
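
A sketch of the writer side of this pattern, under stated assumptions: the struct cell layout, field names, and publish_cell_cpumask() below are illustrative stand-ins, and in the real scheduler cells live in map values (which is what lets bpf_kptr_xchg() operate on these fields):

extern void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym;

struct cell {
	struct bpf_cpumask __kptr *cpumask;     /* published mask */
	struct bpf_cpumask __kptr *tmp_cpumask; /* timer-work scratch */
};

static u64 applied_configuration_seq;

static void publish_cell_cpumask(struct cell *cell)
{
	struct bpf_cpumask *tmp, *old;

	/* Step 1: take the scratch mask; only the timer work touches it,
	 * so it can be mutated without readers seeing partial state. */
	tmp = bpf_kptr_xchg(&cell->tmp_cpumask, NULL);
	if (!tmp)
		return;

	/* ... mutate tmp to reflect the new configuration ... */

	/* Step 2: atomically publish; concurrent readers now see either
	 * the old mask or the new one, never an intermediate. */
	old = bpf_kptr_xchg(&cell->cpumask, tmp);
	if (old)
		bpf_cpumask_release(old);

	/* Step 3 is the reader side: enqueue/dispatch/select read
	 * applied_configuration_seq first, then the published cpumask. */

	/* Step 4: bump the sequence only after publishing, so a reader
	 * that observes the new seq cannot miss the new mask. */
	__sync_fetch_and_add(&applied_configuration_seq, 1);
}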

dschatzberg force-pushed the mitosis-cell-recreation branch from f096d1a to 0e5aca7 on September 15, 2025 at 15:09
dschatzberg force-pushed the mitosis-cell-recreation branch from 0e5aca7 to b9f1706 on September 15, 2025 at 17:54