-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libct/cg/sd/v1: add a knob to skip freeze on set #3161
Conversation
Commit b810da1 added support for Device* properties into systemd cgroup v1 controller. Unfortunately, systemd re-applies device rules on every resource update, and does so disruptively, which results in spurious device permission errors. To work around that, systemd cgroup v1 controller freezes the cgroup before (and thaws it after) applying systemd properties. This freeze fixes the issue for runc, but creates a problem for kubernetes, which uses libcontainer/cgroups/systemd to manage pod slice units. When kubelet updates a pod slice, all the containers under that pod are unexpectedly frozen for some time, which leads to other issues. One way to fix this was tried in commit f2db879. By looking into unit properties, we can predict whether systemd is going to update device rules, and skip the freeze if it won't. Unfortunately, the above commit contains a few bugs, resulting in broken device rules check, so it's still not fixing the issue. Upon fixing those bug (see PR 3143), a benchmark was created to assess how expensive getting unit properties is. Turns out, it is -- Set() times went from 200us to 500us (see full benchmark results in that PR). An alternative approach was tried in PR 3151, in which we check devices.list to find out whether they have some rules or just a single "allow-all". Alas, this approach is not working if a cgroup there is a parent with some device rules set. Considering all this, it seems that the best approach is to add a "do not freeze on set" flag, which kubernetes can use when it knows the slice did not and does not have device rules, so freeze can safely be skipped. Remove the (expensive and currently broken) unit properties querying, and add/use a SkipFreezeOnSet flag instead. Since the logic is trivial, no tests are needed. Signed-off-by: Kir Kolyshkin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm 🙌
This will remove (tho it has never worked) support for avoiding the freeze for containers with full access. Should we add this flag and remove it, or should we do both? The overhead of the one/two dbus calls would still be lower than actually freezing containers with a lot of procs.
I'm trying to assess what's better:
I don't like it either way, but the only other choice is
Taking into account that
I think I prefer solution 1. Solution 2 is also OK but I like it less mostly because of complexity. If anyone can work on solution 3, I'm all for it :) |
Well that is all up to you as As long as we can get a flag for k8s so that we can avoid permanently frozen control groups, that's the most important part. |
Here's another option: #3166. As of this morning it seems best. |
Closed in favor of #3166 (which includes this one). |
Commit b810da1 added support for setting Device* properties to
systemd cgroup v1 controller (as otherwise systemd applied "allow-all"
device rule).
Unfortunately, systemd re-applies device rules on every resource update,
and does so in a disruptive way, causing spurious device permission
errors.
To work around that, systemd cgroup v1 controller freezes the
cgroup before (and thaws it after) applying systemd properties.
This freeze fixes the issue for runc, but creates a problem for
kubernetes.
Kubernetes uses libcontainer/cgroups/systemd to manage pod slice
units. When kubelet updates a pod slice, all the containers under that
pod are unexpectedly frozen for some time, which leads to other issues.
One way to fix this was tried in commit f2db879 (PR #3082).
By looking into unit properties, we can predict whether systemd is going
to update device rules, and skip the freeze if it won't.
Unfortunately, the above commit contains a few bugs, resulting in
broken device rules check, so it's still not fixing the issue.
Upon fixing those bug (see PR #3143), a benchmark was created to assess
how expensive getting unit properties is. Turns out, it is --
Set()
times went from 200us to 500us (see full benchmark results here).
An alternative approach was tried in PR #3151, in which we check
devices.list
cgroup file to find out whether the cgroup has some rules or justa single "allow-all". Alas, this approach is not working if a cgroup
has a parent with some device rules set.
Considering all this, it seems that the best approach is to add a
"do not freeze on set" flag, which kubernetes can use when it knows
the slice did not and does not have device rules, so freeze can safely
be skipped.
Remove the (expensive and currently broken) unit properties querying,
and add/use a
SkipFreezeOnSet
flag instead.Since the logic is trivial, no tests are needed.
This is a stop-gap measure. Long term solution is for k8s to stop
using libcontainer/cgroups (kubernetes/kubernetes#104325).