scx_lavd: support CPU frequency scaling #263
Conversation
To determine the required CPU performance (e.g., frequency), we keep track of 1) the utilization of each CPU and 2) the _performance criticality_ of each task. The performance criticality of a task denotes how critical it is to CPU performance (frequency). Like the notion of latency criticality, we use three factors: the task's average runtime, its wake-up frequency (how often it wakes other tasks), and its waken-up frequency (how often it is woken up). The longer a task's runtime and the higher its two frequencies, the more performance-critical the task is, because it would be a bottleneck in the middle of a task chain. Signed-off-by: Changwoo Min <changwoo@igalia.com>
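For concreteness, here is a minimal sketch of how the three factors might be combined into a single score. The field names, the log2 compression, and the equal weighting are illustrative assumptions, not scx_lavd's actual formula:

```c
#include <stdint.h>

/* Illustrative task stats; names are assumptions for this sketch. */
struct task_ctx {
	uint64_t avg_runtime;	/* average runtime per schedule (ns) */
	uint64_t wake_freq;	/* how often this task wakes other tasks */
	uint64_t waken_freq;	/* how often this task is woken up */
};

/* Integer log2; compresses raw stats so no single factor dominates. */
static uint64_t log2_u64(uint64_t v)
{
	uint64_t r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Longer runtime and higher wake/waken frequencies => more critical. */
static uint64_t calc_perf_cri(const struct task_ctx *taskc)
{
	return log2_u64(taskc->avg_runtime + 1) +
	       log2_u64(taskc->wake_freq + 1) +
	       log2_u64(taskc->waken_freq + 1);
}
```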
Generally looks good to me. Given that the task-to-CPU mapping is completely arbitrary now, the per-CPU determination of the perf target seems like it may lead to arbitrary artifacts, but maybe the future plan is to have better control over how tasks are assigned to CPUs depending on their criticality and other factors?
```c
	 * maximum.
	 */
	max_load = cutil_cur->avg_perf_cri * 1000 /* max cpu util */;
	cpu_load = taskc->perf_cri * cpuc->util;
```
If a CPU has a high average perf criticality, wouldn't that mean that we want the CPU to be running fast even if the current task doesn't happen to have high criticality? For example, in a single-CPU system, if there's an active task that would qualify for the highest performance level, wouldn't that warrant keeping the clock high?
Ah, it compares against the system-wide average performance criticality (cutil_cur->avg_perf_cri).
The key idea is as follows (see the sketch after this list):

- We determine the clock frequency of a CPU using two factors: 1) the current CPU utilization (cpuc->util) and 2) the current task's performance criticality (taskc->perf_cri) compared to the system-wide average performance criticality (cutil_cur->avg_perf_cri).
- When the current CPU utilization is 100% and the current task's performance criticality equals the system-wide average, we set the target CPU frequency to the maximum.
- In other words, even if CPU utilization is not that high, the target CPU frequency can be high when the task's performance criticality is high enough (i.e., boosting the CPU frequency). Conversely, the target CPU frequency can be low even when CPU utilization is high, if a non-performance-critical task is running (i.e., deboosting the CPU frequency).
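Put together as a standalone sketch, the rule could look like the following. Only the max_load/cpu_load lines come from the quoted diff; the [0, 1000] utilization scale and the clamping are assumptions for illustration:

```c
#include <stdint.h>

#define SCX_CPUPERF_ONE	1024	/* maximum perf target in sched_ext */

static uint32_t calc_cpuperf_target(uint64_t task_perf_cri,
				    uint64_t avg_perf_cri,
				    uint64_t cpu_util)
{
	/* 100% utilization at average criticality => maximum frequency. */
	uint64_t max_load = avg_perf_cri * 1000;
	uint64_t cpu_load = task_perf_cri * cpu_util;

	if (!max_load || cpu_load >= max_load)
		return SCX_CPUPERF_ONE;
	return (uint32_t)(cpu_load * SCX_CPUPERF_ONE / max_load);
}
```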
For simplicity, let's assume a single-CPU system with two tasks: one with low criticality and the other with high. The low one is executing and the high one is right behind it. The current logic would put the CPU in a lower clock state, right? Or is it that, since the high-criticality one is likely to have priority over the low one anyway, this mostly doesn't matter?
Yes, your understanding is correct. If that happens, the clock frequency will go up at the next tick, so I assumed it doesn't matter much. One potential solution would be calling scx_bpf_cpuperf_set() at ops.running() too. My slight worry about that approach is the cost of scx_bpf_cpuperf_set() (really, the cost of the scaling driver's hook), especially when a task's runtime is short (say, 100 usec). If scx_bpf_cpuperf_set() is cheap enough, we can call it in two places: at ops.running() for an immediate reaction, and at ops.tick() to reflect the changed system load. What do you think?
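If it turns out to be cheap enough, the shape would be roughly the following sketch (hypothetical, not the code in this PR; calc_cpuperf_target() is an assumed helper standing in for lavd's target computation):

```c
#include <scx/common.bpf.h>

/* Assumed helper; lavd's actual target computation would go here. */
static u32 calc_cpuperf_target(struct task_struct *p)
{
	return SCX_CPUPERF_ONE;	/* placeholder */
}

void BPF_STRUCT_OPS(lavd_running, struct task_struct *p)
{
	/* React immediately to the criticality of the incoming task. */
	scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), calc_cpuperf_target(p));
}

void BPF_STRUCT_OPS(lavd_tick, struct task_struct *p)
{
	/* Periodically reflect the changed system-wide load. */
	scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), calc_cpuperf_target(p));
}
```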
I don't have any practical experience with cpufreq, so my opinion doesn't really matter. I was just trying to read the code and imagine how it would work. In case it might be helpful, here is what I learned from reading kernel code and listening to someone who's been experimenting with this internally:
- The schedutil and cpufreq code does have its own rate-limiting mechanism.
- But still, trying to switch the freq too often does have noticeable overhead.
The reason I kept asking the same question is that the in-kernel implementation seems to 1) use a per-CPU aggregate value to determine the target frequency and 2) avoid ramping the frequency down when utilization is close to 100%. That is, it looks like one of the main goals of the in-kernel implementation is to avoid a lower total amount of work done compared to the max-frequency condition. I was just wondering whether there would be conditions where lavd's implementation would lose total work performed. So, this is more a theoretical curiosity than anything else.
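In case it helps the comparison, here is a rough simplification of that in-kernel shape: scale a per-CPU aggregate utilization with ~25% headroom so that near-100% utilization still maps to the maximum frequency instead of ramping down. This is a sketch of the idea in kernel/sched/cpufreq_schedutil.c, not a copy:

```c
static unsigned long next_freq(unsigned long util, unsigned long max_util,
			       unsigned long max_freq)
{
	/* ~1.25x headroom, so util near max_util saturates at max_freq. */
	unsigned long freq = (max_freq + (max_freq >> 2)) * util / max_util;

	return freq > max_freq ? max_freq : freq;
}
```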
Then I think it would be better to keep the code as it is. I also considered those two things (from the book). Updating the target frequency at ops.tick() is a kind of rate limiting with a bounded delay. I will come up with a more sophisticated approach after reading the kernel code more, to adopt a similar (or better) strategy.
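For comparison with that implicit per-tick limiting, an explicit limiter in the style of schedutil's rate_limit_us might look like the following (hypothetical; the names and the 4 ms bound are assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define FREQ_UPDATE_INTERVAL_NS	(4ULL * 1000 * 1000)	/* assumed bound */

/* Returns true when enough time has passed to issue a new freq target. */
static bool should_update_freq(uint64_t now_ns, uint64_t *last_ns)
{
	if (now_ns - *last_ns < FREQ_UPDATE_INTERVAL_NS)
		return false;	/* too soon; keep the previous target */
	*last_ns = now_ns;
	return true;
}
```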
Signed-off-by: Changwoo Min <changwoo@igalia.com>
This is just the first iteration of frequency scaling. As you correctly pointed out, the current arbitrary task-to-CPU mapping makes frequency management less effective. I feel that clustering tasks with similar performance criticality onto a CPU would be more effective (e.g., triggering hardware turbo boost by under-clocking a few cores) than the conventional cache-affinity-based task placement. I am still trying to figure out how to reconcile performance-criticality-based placement vs. cache-affinity-based placement.
This PR supports CPU frequency scaling in scx_lavd in a minimal form.