
libct/cg/sd/v1: add a knob to skip freeze on set #3161

Closed
Conversation

kolyshkin (Contributor)

Commit b810da1 added support for setting Device* properties to the
systemd cgroup v1 controller (as otherwise systemd applied an
"allow-all" device rule).

Unfortunately, systemd re-applies device rules on every resource update,
and does so in a disruptive way, causing spurious device permission
errors.

To work around that, systemd cgroup v1 controller freezes the
cgroup before (and thaws it after) applying systemd properties.
This freeze fixes the issue for runc, but creates a problem for
kubernetes.
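The freeze/thaw workaround can be sketched roughly like this (a simplified illustration, not runc's actual implementation; the `cgroup` type and `setUnitProperties` here are hypothetical stand-ins):

```go
package main

import "fmt"

// cgroup is a hypothetical stand-in for a v1 cgroup managed via the
// systemd driver; only the freezer state is modeled here.
type cgroup struct {
	name   string
	frozen bool
}

func (c *cgroup) freeze() { c.frozen = true }
func (c *cgroup) thaw()   { c.frozen = false }

// setUnitProperties sketches the workaround: freeze the cgroup so its
// processes cannot run (and thus never observe the transient device
// rule state systemd creates), apply the properties, then thaw.
func setUnitProperties(c *cgroup, props map[string]string) {
	c.freeze()
	defer c.thaw()
	// In the real driver this is a D-Bus SetUnitProperties call.
	fmt.Printf("applying %d properties to %s (frozen=%v)\n",
		len(props), c.name, c.frozen)
}

func main() {
	c := &cgroup{name: "kubepods.slice"}
	setUnitProperties(c, map[string]string{"CPUShares": "512"})
	fmt.Println("frozen after Set:", c.frozen)
}
```

The kubernetes problem follows directly from this shape: every `Set` on a pod slice freezes everything under it for the duration of the call.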

Kubernetes uses libcontainer/cgroups/systemd to manage pod slice
units. When kubelet updates a pod slice, all the containers under that
pod are unexpectedly frozen for some time, which leads to other issues.

One way to fix this was tried in commit f2db879 (PR #3082).
By looking into unit properties, we can predict whether systemd is going
to update device rules, and skip the freeze if it won't.

Unfortunately, the above commit contains a few bugs that break the
device rules check, so it still does not fix the issue.

After fixing those bugs (see PR #3143), a benchmark was created to assess
how expensive getting unit properties is. It turns out it is expensive:
Set() times went from 200µs to 500µs (see the full benchmark results in that PR).

An alternative approach was tried in PR #3151: check the devices.list
cgroup file to find out whether the cgroup has some rules or just
a single "allow-all" entry. Alas, this approach does not work if the
cgroup has a parent with device rules set.
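The devices.list idea from PR #3151 can be illustrated like this (a sketch under the assumption that an unrestricted v1 cgroup shows exactly one `a *:* rwm` line; the function name is made up):

```go
package main

import (
	"fmt"
	"strings"
)

// hasOnlyAllowAll reports whether a devices.list snapshot contains
// nothing but the single allow-all rule "a *:* rwm". As noted above,
// this heuristic breaks down when a parent cgroup has device rules,
// because the child's devices.list alone is not enough to tell.
func hasOnlyAllowAll(devicesList string) bool {
	lines := strings.Split(strings.TrimSpace(devicesList), "\n")
	return len(lines) == 1 && lines[0] == "a *:* rwm"
}

func main() {
	fmt.Println(hasOnlyAllowAll("a *:* rwm"))            // unrestricted
	fmt.Println(hasOnlyAllowAll("c 1:3 rwm\nc 1:5 rwm")) // has rules
}
```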

Considering all this, it seems that the best approach is to add a
"do not freeze on set" flag, which kubernetes can use when it knows
the slice did not and does not have device rules, so the freeze can
safely be skipped.

Remove the (expensive and currently broken) unit properties querying,
and add/use a SkipFreezeOnSet flag instead.

Since the logic is trivial, no tests are needed.
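From the caller's side, such a knob could look roughly like this (a minimal sketch; the types are simplified stand-ins, not libcontainer's actual configs.Cgroup API):

```go
package main

import "fmt"

// Cgroup is a simplified stand-in for the cgroup configuration;
// SkipFreezeOnSet is the proposed knob.
type Cgroup struct {
	Name            string
	SkipFreezeOnSet bool
}

// set applies properties, freezing first unless the caller opted out.
// It returns whether a freeze happened, for illustration.
func set(cg *Cgroup, applyProps func()) (froze bool) {
	if !cg.SkipFreezeOnSet {
		froze = true // freeze here, thaw via defer in real code
	}
	applyProps()
	return froze
}

func main() {
	// A kubelet-style caller that knows its pod slice has no device
	// rules can skip the freeze:
	pod := &Cgroup{Name: "kubepods-pod1.slice", SkipFreezeOnSet: true}
	fmt.Println("froze:", set(pod, func() {}))

	// Default behavior keeps the freeze for safety:
	ctr := &Cgroup{Name: "ctr.scope"}
	fmt.Println("froze:", set(ctr, func() {}))
}
```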

This is a stop-gap measure. Long term solution is for k8s to stop
using libcontainer/cgroups (kubernetes/kubernetes#104325).

Signed-off-by: Kir Kolyshkin <[email protected]>
@odinuge (Contributor) left a comment

lgtm 🙌

This will remove support (though it has never worked) for avoiding the freeze for containers with full device access. Should we add this flag and remove that support, or should we do both? The overhead of the one or two D-Bus calls would still be lower than actually freezing containers with a lot of procs.

@kolyshkin (Contributor, Author)

I'm trying to assess what's better:

  1. Penalize privileged (no device restrictions) containers by freezing them on Set.
  2. Penalize all containers by making Set slower with the device* properties check.

I don't like it either way, but the only other choice is:

  3. Fix systemd, doing something similar to what @cyphar did in PR #2391 (cgroup: devices: major cleanups and minimal transition rules).

Taking into account that

  • solution 1 is the simplest
  • I haven't seen anyone complaining about freezing their privileged container during update
  • we're switching to cgroup v2, which does not have this problem

I think I prefer solution 1. Solution 2 is also OK but I like it less mostly because of complexity. If anyone can work on solution 3, I'm all for it :)

@odinuge (Contributor)

odinuge commented Aug 15, 2021

Well, that is all up to you as runc maintainers, I guess. The only reason for asking was that when this was implemented instead of a separate flag, the argument was that such privileged containers would also benefit from it. If that decision is changed, I am totally fine with it though.

As long as we can get a flag for k8s so that we can avoid permanently frozen control groups, that's the most important part.

@kolyshkin (Contributor, Author)

Here's another option: #3166. As of this morning it seems best.

@kolyshkin (Contributor, Author)

Closed in favor of #3166 (which includes this one).

@kolyshkin kolyshkin closed this Aug 17, 2021