-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cgroup: devices: major cleanups and minimal transition rules #2391
Conversation
@mYmNeo does this PR fix the original issue for you? |
According to my testing, this does fix #2204. |
Can we have integration tests? |
@cyphar I've tested these patches, it works well. Thank you very much. 👍 |
This comment has been minimized.
This comment has been minimized.
@crosbymichael I'd like to know if you're okay with the |
@AkihiroSuda Integration test added. |
This comment has been minimized.
This comment has been minimized.
tests/integration/update.bats
Outdated
|
||
# Trigger an update. This update doesn't actually change the device rules, | ||
# but it will trigger the devices cgroup code to reapply the current rules. | ||
# We trigger the update a few times to make sure we hit the race. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slow Saturday here... which race?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That the container will try to write to /dev/null
during the window between a *:* rwm
being denied and the rest of the white-list being applied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add code comments for the race?
Ah, I know what the issue is. This is quite scary because during the race window, systemd actually switches the devices cgroup to be allow-all (since that's the default). Which means that a container could theoretically get full access to any block or character device on the system. Oops. Yet another reason to not use the systemd cgroup manager I guess 😉. We'll need to generate the |
I just implemented the above, but it turns out that systemd has the same bug outlined above (if you tell it to reset |
Since this PR solves a few actual security bugs, I might request a CVE. The systemd cgroup one is probably worth a CVE on its own because it means containers could do all sorts of worrying attacks... |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Okay, this requires a bit of explanation. The reason for this emulation is to allow us to have seamless updates of the devices cgroup for running containers. This was triggered by several users having issues where our initial writing of a deny-all rule (in all cases) results in spurrious errors. The obvious solution would be to just remove the deny-all rule, right? Well, it turns out that runc doesn't actually control the deny-all rule because all users of runc have explicitly specified their own deny-all rule for many years. This appears to have been done to work around a bug in runc (which this series has fixed in [1]) where we would actually act as a black-list despite this being a violation of the OCI spec. This means that not adding our own deny-all rule in the case of updates won't solve the issue. However, it will also not solve the issue in several other cases (the most notable being where a container is being switched between default-permission modes). So in order to handle all of these cases, a way of tracking the relevant internal cgroup state (given a certain state of "cgroups.list" and a set of rules to apply) is necessary. That is the purpose of DevicesEmulator. Reading "devices.list" is quite important because that's the only way we can tell if it's safe to skip the troublesome deny-all rules without making potentially-dangerous assumptions about the container. We also are currently bug-compatible with the devices cgroup (namely, removing rules that don't exist or having superfluous rules all works as with the in-kernel implementation). The only exception to this is that we give an error if a user requests to revoke part of a wildcard exception, because allowing such configurations could result in security holes (cgroupv1 silently ignores such rules, meaning in white-list mode that the access is still permitted). [1]: b2bec98 ("cgroup: devices: eradicate the Allow/Deny lists") Signed-off-by: Aleksa Sarai <[email protected]>
Now that all of the infrastructure for devices.Emulator is in place, we can finally implement minimal transition rules for devices cgroups. This allows for minimal disruption to running containers if a rule update is requested. Only in very rare circumstances (black-list cgroups and mode switching) will a clear-all rule be written. As a result, containers should no longer see spurious errors. A similar issue affects the cgroupv2 devices setup, but that is a topic for another time (as the solution is drastically different). Signed-off-by: Aleksa Sarai <[email protected]>
It seems we missed that systemd added support for the devices cgroup, as a result systemd would actually *write an allow-all rule each time you did 'runc update'* if you used the systemd cgroup driver. This is obviously ... bad and was a clear security bug. Luckily the commits which introduced this were never in an actual runc release. So we simply generate the cgroupv1-style rules (which is what systemd's DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it turns out that systemd is susceptible to the same spurrious error failure that we were, so that problem is out of our hands for systemd cgroup users. However, systemd has a similar bug to the one fixed in [1]. It will happily write a disruptive deny-all rule when it is not necessary. Unfortunately, we cannot even use devices.Emulator to generate a minimal set of transition rules because the DBus API is limited (you can only clear or append to the DeviceAllow= list -- so we are forced to always clear it). To work around this, we simply freeze the container during SetUnitProperties. [1]: afe8348 ("cgroupv1: devices: use minimal transition rules with devices.Emulator") Fixes: 1d4ccc8 ("fix data inconsistent when runc update in systemd driven cgroup v1") Fixes: 7682a2b ("fix data inconsistent when runc update in systemd driven cgroup v2") Signed-off-by: Aleksa Sarai <[email protected]>
Unfortunately, runc update doesn't support setting devices rules directly so we have to trigger it by modifying a different rule (which happens to trigger a devices update). Signed-off-by: Aleksa Sarai <[email protected]>
Such containers should remain paused after the update. This has historically been true, but this helps ensure that the systemd cgroup changes (freezing the container during SetUnitProperties) don't break this behaviour. Signed-off-by: Aleksa Sarai <[email protected]>
I've rebased and cleaned up the commit message for the I promise I won't touch this PR anymore. 😉 /cc @opencontainers/runc-maintainers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
https://build.opensuse.org/request/show/804889 by user cyphar + dimstar_suse - Backport opencontainers/runc#2391 to help fix bsc#1168481. + bsc1168481-0001-cgroup-devices-major-cleanups-and-minimal-transition.patch
https://build.opensuse.org/request/show/804891 by user cyphar + dimstar_suse - Backport opencontainers/runc#2391 to help fix bsc#1168481. + bsc1168481-0001-cgroup-devices-major-cleanups-and-minimal-transition.patch
When I try to vendor runc library v1.0.0-rc91 into Docker/Moby, I found an unexpected behavior: moby/moby#41178 (comment) After debugging, I found that the problem was introduced since this PR. PTAL, Thanks! moby/moby#41178 |
Could you open a new issue? |
Aleksa has opened #2529 I don't need to open a new issue. 😸 |
#2366 is mostly fixed by this, though the patch is unfortunately a little more complicated than you might've first expected (it includes an emulator for the devices controller). Unfortunately this logic is necessary in order to be able to calculate and verify the minimal transition rules for a cgroup update.
A similar issue affects the cgroupv2 devices setup, but that is a topic for another time (as the solution is drastically different).
Fixes #2204
Signed-off-by: Aleksa Sarai [email protected]