scx_lavd: support CPU frequency scaling #263
Conversation
To determine the required CPU performance (e.g., frequency), we keep track of 1) the utilization of each CPU and 2) the _performance criticality_ of each task. The performance criticality of a task denotes how critical it is to CPU performance (frequency). Like the notion of latency criticality, we use three factors: the task's average runtime, its wake-up frequency (how often it wakes other tasks), and its waken-up frequency (how often it is woken up). The longer a task's runtime and the higher its two frequencies, the more performance-critical the task is, because it would be a bottleneck in the middle of a task chain. Signed-off-by: Changwoo Min <changwoo@igalia.com>
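For concreteness, here is a minimal sketch of how the three factors might be combined into a single score. The field names, the log2 compression, and the equal weighting are illustrative assumptions, not scx_lavd's actual formula:

```c
#include <stdint.h>

/* Illustrative task stats; names are assumptions for this sketch. */
struct task_ctx {
	uint64_t avg_runtime;	/* average runtime per schedule (ns) */
	uint64_t wake_freq;	/* how often this task wakes other tasks */
	uint64_t waken_freq;	/* how often this task is woken up */
};

/* Integer log2; compresses raw stats so no single factor dominates. */
static uint64_t log2_u64(uint64_t v)
{
	uint64_t r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Longer runtime and higher wake/waken frequencies => more critical. */
static uint64_t calc_perf_cri(const struct task_ctx *taskc)
{
	return log2_u64(taskc->avg_runtime + 1) +
	       log2_u64(taskc->wake_freq + 1) +
	       log2_u64(taskc->waken_freq + 1);
}
```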
Generally looks good to me. Given that the task-to-CPU mapping is completely arbitrary now, the per-CPU determination of the perf target seems like it may lead to arbitrary artifacts, but maybe the future plan is to have better control over how tasks are assigned to CPUs depending on their criticality and other factors?
```c
	 * maximum.
	 */
	max_load = cutil_cur->avg_perf_cri * 1000 /* max cpu util */;
	cpu_load = taskc->perf_cri * cpuc->util;
```
If a CPU has a high average perf criticality, wouldn't that mean that we want the CPU to be running fast even if the current task doesn't happen to have high criticality? For example, in a single-CPU system, if there's an active task that would qualify for the highest performance level, wouldn't that warrant keeping the clock high?
Ah, it compares against the system-wide average performance criticality (cutil_cur->avg_perf_cri).
The key idea is as follows (see the sketch after this list):

- We determine the clock frequency of a CPU using two factors: 1) the current CPU utilization (cpuc->util) and 2) the current task's performance criticality (taskc->perf_cri) compared to the system-wide average performance criticality (cutil_cur->avg_perf_cri).
- When the current CPU utilization is 100% and the current task's performance criticality equals the system-wide average, we set the target CPU frequency to the maximum.
- In other words, even if CPU utilization is not that high, the target CPU frequency can be high when the task's performance criticality is high enough (i.e., boosting the CPU frequency). Conversely, the target CPU frequency can be low even when CPU utilization is high, if a non-performance-critical task is running (i.e., deboosting the CPU frequency).
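Put together as a standalone sketch, the rule could look like the following. Only the max_load/cpu_load lines come from the quoted diff; the [0, 1000] utilization scale and the clamping are assumptions for illustration:

```c
#include <stdint.h>

#define SCX_CPUPERF_ONE	1024	/* maximum perf target in sched_ext */

static uint32_t calc_cpuperf_target(uint64_t task_perf_cri,
				    uint64_t avg_perf_cri,
				    uint64_t cpu_util)
{
	/* 100% utilization at average criticality => maximum frequency. */
	uint64_t max_load = avg_perf_cri * 1000;
	uint64_t cpu_load = task_perf_cri * cpu_util;

	if (!max_load || cpu_load >= max_load)
		return SCX_CPUPERF_ONE;
	return (uint32_t)(cpu_load * SCX_CPUPERF_ONE / max_load);
}
```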
For simplicity, let's assume a single-CPU system with two tasks: one with low criticality and the other with high. The low one is executing and the high one is right behind it. The current logic would put the CPU in a lower clock state, right? Or is it that, since the high-criticality one is likely to have priority over the low one anyway, this mostly doesn't matter?
Yes, your understanding is correct. If that happens, the clock frequency will go up at the next tick, so I assumed it doesn't matter much. One potential solution would be calling scx_bpf_cpuperf_set() at ops.running() too. My slight worry about that approach is the cost of scx_bpf_cpuperf_set() (really, the cost of the scaling driver's hook), especially when a task's runtime is short (say, 100 usec). If scx_bpf_cpuperf_set() is cheap enough, we can call it in two places: at ops.running() for an immediate reaction, and at ops.tick() to reflect the changed system load. What do you think?
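If it turns out to be cheap enough, the shape would be roughly the following sketch (hypothetical, not the code in this PR; calc_cpuperf_target() is an assumed helper standing in for lavd's target computation):

```c
#include <scx/common.bpf.h>

/* Assumed helper; lavd's actual target computation would go here. */
static u32 calc_cpuperf_target(struct task_struct *p)
{
	return SCX_CPUPERF_ONE;	/* placeholder */
}

void BPF_STRUCT_OPS(lavd_running, struct task_struct *p)
{
	/* React immediately to the criticality of the incoming task. */
	scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), calc_cpuperf_target(p));
}

void BPF_STRUCT_OPS(lavd_tick, struct task_struct *p)
{
	/* Periodically reflect the changed system-wide load. */
	scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), calc_cpuperf_target(p));
}
```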
I don't have any practical experience with cpufreq, so my opinion doesn't really matter. I was just trying to read the code and imagine how it would work. In case it might be helpful, here is what I learned from reading kernel code and listening to someone who's been experimenting with this internally:
- The schedutil and cpufreq code does have its own rate-limiting mechanism.
- But still, trying to switch the freq too often does have noticeable overhead.
The reason I kept asking the same question is that the in-kernel implementation seems to 1) use a per-CPU aggregate value to determine the target frequency and 2) avoid ramping the frequency down when utilization is close to 100%. That is, it looks like one of the main goals of the in-kernel implementation is to avoid a lower total amount of work done compared to the max-frequency condition. I was just wondering whether there would be conditions where lavd's implementation would lose total work performed. So, this is more a theoretical curiosity than anything else.
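In case it helps the comparison, here is a rough simplification of that in-kernel shape: scale a per-CPU aggregate utilization with ~25% headroom so that near-100% utilization still maps to the maximum frequency instead of ramping down. This is a sketch of the idea in kernel/sched/cpufreq_schedutil.c, not a copy:

```c
static unsigned long next_freq(unsigned long util, unsigned long max_util,
			       unsigned long max_freq)
{
	/* ~1.25x headroom, so util near max_util saturates at max_freq. */
	unsigned long freq = (max_freq + (max_freq >> 2)) * util / max_util;

	return freq > max_freq ? max_freq : freq;
}
```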
Then I think it would be better to keep the code as it is. I also considered those two things (from the book). Updating the target frequency at ops.tick() is a kind of rate limiting with a bounded delay. I will come up with a more sophisticated approach after reading the kernel code more, to adopt a similar (or better) strategy.
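For comparison with that implicit per-tick limiting, an explicit limiter in the style of schedutil's rate_limit_us might look like the following (hypothetical; the names and the 4 ms bound are assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define FREQ_UPDATE_INTERVAL_NS	(4ULL * 1000 * 1000)	/* assumed bound */

/* Returns true when enough time has passed to issue a new freq target. */
static bool should_update_freq(uint64_t now_ns, uint64_t *last_ns)
{
	if (now_ns - *last_ns < FREQ_UPDATE_INTERVAL_NS)
		return false;	/* too soon; keep the previous target */
	*last_ns = now_ns;
	return true;
}
```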
Signed-off-by: Changwoo Min <changwoo@igalia.com>
This is just the first iteration of frequency scaling. As you correctly pointed out, the current arbitrary task-to-CPU mapping makes frequency management less effective. I feel that clustering tasks with similar performance criticality onto a CPU would be more effective (e.g., triggering hardware turbo boost by under-clocking a few cores) than the conventional cache-affinity-based task placement. I am still trying to figure out how to reconcile performance-criticality-based placement vs. cache-affinity-based placement.
This PR supports CPU frequency scaling in scx_lavd in a minimal form.