Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cgroup: devices: major cleanups and minimal transition rules #2391

Merged
merged 10 commits into from
May 14, 2020
Merged

cgroup: devices: major cleanups and minimal transition rules #2391

merged 10 commits into from
May 14, 2020

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented May 8, 2020

#2366 is mostly fixed by this, though the patch is unfortunately a little more complicated than you might've first expected (it includes an emulator for the devices controller). Unfortunately this logic is necessary in order to be able to calculate and verify the minimal transition rules for a cgroup update.

A similar issue affects the cgroupv2 devices setup, but that is a topic for another time (as the solution is drastically different).

Fixes #2204
Signed-off-by: Aleksa Sarai [email protected]

@cyphar
Copy link
Member Author

cyphar commented May 8, 2020

@mYmNeo does this PR fix the original issue for you?

@cyphar
Copy link
Member Author

cyphar commented May 8, 2020

According to my testing, this does fix #2204.

@AkihiroSuda
Copy link
Member

Can we have integration tests?

@mYmNeo
Copy link

mYmNeo commented May 9, 2020

@cyphar I've tested these patches, it works well. Thank you very much. 👍

@cyphar

This comment has been minimized.

@cyphar
Copy link
Member Author

cyphar commented May 9, 2020

@crosbymichael I'd like to know if you're okay with the /dev/console rule removal as well as some of the core refactors. Quite a bit of this code predates even my involvement in libcontainer. In particular -- is there a reason that /dev/console was included in the default white-list?

@cyphar
Copy link
Member Author

cyphar commented May 9, 2020

@AkihiroSuda Integration test added.

@cyphar

This comment has been minimized.


# Trigger an update. This update doesn't actually change the device rules,
# but it will trigger the devices cgroup code to reapply the current rules.
# We trigger the update a few times to make sure we hit the race.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slow Saturday here... which race?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That the container will try to write to /dev/null during the window between a *:* rwm being denied and the rest of the white-list being applied.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add code comments for the race?

@cyphar
Copy link
Member Author

cyphar commented May 10, 2020

Ah, I know what the issue is. systemd now supports device restrictions, but we don't tell it about our rules. So it clears all the rules, which results in runc following up by doing a full rule setup rather than using the minimal transition rules.

This is quite scary because during the race window, systemd actually switches the devices cgroup to be allow-all (since that's the default). Which means that a container could theoretically get full access to any block or character device on the system. Oops. Yet another reason to not use the systemd cgroup manager I guess 😉.

We'll need to generate the DevicesAllow= set...

@cyphar
Copy link
Member Author

cyphar commented May 10, 2020

I just implemented the above, but it turns out that systemd has the same bug outlined above (if you tell it to reset DevicesAllow and then give it a new allow list, it will always write a deny-all blacklist first). So we'll just need to disable this test for systemd cgroups, since this behaviour is coming from systemd.

@cyphar
Copy link
Member Author

cyphar commented May 10, 2020

Since this PR solves a few actual security bugs, I might request a CVE. The systemd cgroup one is probably worth a CVE on its own because it means containers could do all sorts of worrying attacks...

@cyphar

This comment has been minimized.

@AkihiroSuda

This comment has been minimized.

@cyphar

This comment has been minimized.

Okay, this requires a bit of explanation.

The reason for this emulation is to allow us to have seamless updates of
the devices cgroup for running containers. This was triggered by several
users having issues where our initial writing of a deny-all rule (in all
cases) results in spurrious errors.

The obvious solution would be to just remove the deny-all rule, right?
Well, it turns out that runc doesn't actually control the deny-all rule
because all users of runc have explicitly specified their own deny-all
rule for many years. This appears to have been done to work around a bug
in runc (which this series has fixed in [1]) where we would actually act
as a black-list despite this being a violation of the OCI spec.

This means that not adding our own deny-all rule in the case of updates
won't solve the issue. However, it will also not solve the issue in
several other cases (the most notable being where a container is being
switched between default-permission modes).

So in order to handle all of these cases, a way of tracking the relevant
internal cgroup state (given a certain state of "cgroups.list" and a set
of rules to apply) is necessary. That is the purpose of DevicesEmulator.
Reading "devices.list" is quite important because that's the only way we
can tell if it's safe to skip the troublesome deny-all rules without
making potentially-dangerous assumptions about the container.

We also are currently bug-compatible with the devices cgroup (namely,
removing rules that don't exist or having superfluous rules all works as
with the in-kernel implementation). The only exception to this is that
we give an error if a user requests to revoke part of a wildcard
exception, because allowing such configurations could result in security
holes (cgroupv1 silently ignores such rules, meaning in white-list mode
that the access is still permitted).

[1]: b2bec98 ("cgroup: devices: eradicate the Allow/Deny lists")

Signed-off-by: Aleksa Sarai <[email protected]>
Now that all of the infrastructure for devices.Emulator is in place, we
can finally implement minimal transition rules for devices cgroups. This
allows for minimal disruption to running containers if a rule update is
requested. Only in very rare circumstances (black-list cgroups and mode
switching) will a clear-all rule be written. As a result, containers
should no longer see spurious errors.

A similar issue affects the cgroupv2 devices setup, but that is a topic
for another time (as the solution is drastically different).

Signed-off-by: Aleksa Sarai <[email protected]>
It seems we missed that systemd added support for the devices cgroup, as
a result systemd would actually *write an allow-all rule each time you
did 'runc update'* if you used the systemd cgroup driver. This is
obviously ... bad and was a clear security bug. Luckily the commits which
introduced this were never in an actual runc release.

So we simply generate the cgroupv1-style rules (which is what systemd's
DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it
turns out that systemd is susceptible to the same spurrious error
failure that we were, so that problem is out of our hands for systemd
cgroup users.

However, systemd has a similar bug to the one fixed in [1]. It will
happily write a disruptive deny-all rule when it is not necessary.
Unfortunately, we cannot even use devices.Emulator to generate a minimal
set of transition rules because the DBus API is limited (you can only
clear or append to the DeviceAllow= list -- so we are forced to always
clear it). To work around this, we simply freeze the container during
SetUnitProperties.

[1]: afe8348 ("cgroupv1: devices: use minimal transition rules with devices.Emulator")

Fixes: 1d4ccc8 ("fix data inconsistent when runc update in systemd driven cgroup v1")
Fixes: 7682a2b ("fix data inconsistent when runc update in systemd driven cgroup v2")
Signed-off-by: Aleksa Sarai <[email protected]>
Unfortunately, runc update doesn't support setting devices rules
directly so we have to trigger it by modifying a different rule (which
happens to trigger a devices update).

Signed-off-by: Aleksa Sarai <[email protected]>
Such containers should remain paused after the update. This has
historically been true, but this helps ensure that the systemd cgroup
changes (freezing the container during SetUnitProperties) don't break
this behaviour.

Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar
Copy link
Member Author

cyphar commented May 13, 2020

I've rebased and cleaned up the commit message for the DeviceAllow commit to better clarify that it was a security issue but since no released version of runc included it, it didn't deserve a proper security advisory.

I promise I won't touch this PR anymore. 😉

/cc @opencontainers/runc-maintainers

@kolyshkin
Copy link
Contributor

kolyshkin commented May 14, 2020

LGTM

Approved with PullApprove

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AkihiroSuda
Copy link
Member

AkihiroSuda commented May 14, 2020

LGTM

Approved with PullApprove

@AkihiroSuda AkihiroSuda merged commit 3f1e886 into opencontainers:master May 14, 2020
@cyphar cyphar deleted the devices-cgroup branch May 14, 2020 02:45
@cyphar cyphar mentioned this pull request May 14, 2020
bmwiedemann added a commit to bmwiedemann/openSUSE that referenced this pull request May 14, 2020
https://build.opensuse.org/request/show/804889
by user cyphar + dimstar_suse
- Backport opencontainers/runc#2391 to help fix
  bsc#1168481.
  + bsc1168481-0001-cgroup-devices-major-cleanups-and-minimal-transition.patch
bmwiedemann added a commit to bmwiedemann/openSUSE that referenced this pull request May 14, 2020
https://build.opensuse.org/request/show/804891
by user cyphar + dimstar_suse
- Backport opencontainers/runc#2391 to help fix
  bsc#1168481.
  + bsc1168481-0001-cgroup-devices-major-cleanups-and-minimal-transition.patch
@kolyshkin kolyshkin mentioned this pull request Jun 16, 2020
@tao12345666333
Copy link

When I try to vendor runc library v1.0.0-rc91 into Docker/Moby, I found an unexpected behavior: moby/moby#41178 (comment)

After debugging, I found that the problem was introduced since this PR. PTAL, Thanks! moby/moby#41178

@AkihiroSuda
Copy link
Member

Could you open a new issue?

@tao12345666333
Copy link

tao12345666333 commented Jul 28, 2020

sure.

Aleksa has opened #2529 I don't need to open a new issue. 😸

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update command causes 'Operation not permitted'
5 participants