cgroup: devices updates appear to be broken #2366
Comments
Fixing the white-list problem actually isn't too bad -- we just need to re-calculate the device list (merging entries which overlap with each other) rather than blindly applying the provided list of devices in-order. Basically we should pre-calculate the minimal |
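A rough sketch of the "re-calculate the device list" idea, for illustration only. The `Rule` type, the permission bits, and `Merge` are hypothetical names, not runc's actual API. It replays the rules in order from a deny-all baseline and produces an equivalent allow-only list, and it simply gives up with an error when a deny rule would punch a hole in a wider wildcard allow rule. The result is smaller than the input but not guaranteed to be minimal.

```go
package main

import (
	"errors"
	"fmt"
)

// Permission bits (hypothetical encoding of "rwm").
const (
	R uint8 = 1 << iota
	W
	M
)

// Wild stands for the "*" wildcard in a major or minor number.
const Wild int64 = -1

// Rule is a simplified device rule; it is NOT runc's real type.
type Rule struct {
	Type         byte  // 'c', 'b', or 'a' (all types)
	Major, Minor int64 // Wild matches any number
	Perms        uint8
	Allow        bool
}

func numCovers(a, b int64) bool   { return a == Wild || a == b }
func numOverlaps(a, b int64) bool { return a == Wild || b == Wild || a == b }

// covers reports whether every device matched by e is also matched by r.
func covers(r, e Rule) bool {
	return (r.Type == 'a' || r.Type == e.Type) &&
		numCovers(r.Major, e.Major) && numCovers(r.Minor, e.Minor)
}

// overlaps reports whether at least one device is matched by both rules.
func overlaps(r, e Rule) bool {
	typeOK := r.Type == 'a' || e.Type == 'a' || r.Type == e.Type
	return typeOK && numOverlaps(r.Major, e.Major) && numOverlaps(r.Minor, e.Minor)
}

// Merge replays rules in order, starting from "nothing allowed", and
// returns an allow-only list with the same effect.
func Merge(rules []Rule) ([]Rule, error) {
	var allowed []Rule
	for _, r := range rules {
		var next []Rule
		for _, e := range allowed {
			switch {
			case r.Allow && covers(r, e) && covers(e, r):
				// Same device set: fold the old permissions into r.
				r.Perms |= e.Perms
				e.Perms = 0
			case covers(r, e):
				// r decides these permissions for all of e's devices;
				// an allow rule re-adds them when r is appended below.
				e.Perms &^= r.Perms
			case !r.Allow && overlaps(r, e) && e.Perms&r.Perms != 0:
				return nil, errors.New("deny rule punches a hole in a wider allow rule")
			}
			if e.Perms != 0 {
				next = append(next, e)
			}
		}
		if r.Allow {
			next = append(next, r)
		}
		allowed = next
	}
	return allowed, nil
}

func main() {
	out, err := Merge([]Rule{
		{'a', Wild, Wild, R | W | M, false}, // deny everything first
		{'c', 1, 3, R, true},
		{'c', 1, 3, W, true},
		{'c', 1, 5, R | W | M, true},
		{'c', 1, 5, M, false},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range out {
		fmt.Printf("%c %d:%d perms=%03b\n", r.Type, r.Major, r.Minor, r.Perms)
	}
}
```

The merged list could then be diffed against what is currently applied, instead of writing a deny-all followed by every rule.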
And actually this does bring up a different question -- what does |
I've played with it in crun and I think we can use pinning to the bpf file system to get back a fd to the previous eBPF program. What do you think of the solution here: containers/crun#352 ? |
Pinning is another fine solution to the issue (as much as I'm not a fan of how ugly the pinning system is). Another choice would be to use the lookup-by-id system but then we'd have to deal with IDs being able to overflow. But while |
The cgroupv1 portion of this issue was fixed by #2391. |
Changed the milestone from rc92 to 1.0 |
@cyphar Can we postpone this to v1.1.0? |
cgroupv2 device updates are probably fine for now, especially since we use systemd to manage cgroupv2 (and while systemd isn't perfect it does handle updates reasonably). cgroupv1 was the more important one and we've already fixed that. |
We also leak a file descriptor on every |
Indeed, update functionality is not spelled out well enough in the spec. Ideally, for devices there should be some kind of "append or replace" flag (and I also hope one day we'll be able to skip reading all the resources from the state file on update, at least if "replace" is specified for devices). If that's too complicated or redundant, we can say "runc only implements replace for devices" (if the spec allows us to). |
@cyphar We're planning to run k8s+containerd with cgroup v2 in production, but this work is suspended due to this issue. Do you have a date when the issue with cgroup v2 will be fixed? Many thanks. |
Please be aware that k8s still has some issues with cgroup v2 (e.g., kubernetes/kubernetes#99230) |
Thanks @giuseppe, we will keep an eye on this and help out if needed. |
Hi @cyphar. Sorry to bother you. I wonder why we must use the flag BPF_F_ALLOW_MULTI. How about using unix.BPF_F_ALLOW_OVERRIDE to replace BPF_F_ALLOW_MULTI?
if err := prog.Attach(dirFD, ebpf.AttachCGroupDevice, unix.BPF_F_ALLOW_OVERRIDE); err != nil {
	return nilCloser, errors.Wrap(err, "failed to call BPF_PROG_ATTACH (BPF_CGROUP_DEVICE, BPF_F_ALLOW_OVERRIDE)")
}
|
From memory we need to use |
Got it, thanks. BTW, the idea of using a BPF_MAP_TYPE_ARRAY_OF_MAPS map to store rules sounds great. Is anyone working on this? :) @cyphar |
Perhaps we can simplify the task by using It would also help to not call setDevices in case devices are not changed. This requires some API changes though (for cgroup manager's Set to be able to see that current set and new set are identical) |
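A minimal sketch of the "don't call setDevices if nothing changed" idea. The `Rule` and `Manager` types here are made up for illustration and are not the real cgroup manager API. Rules are compared in order, since device rules are order-sensitive.

```go
package main

import "fmt"

// Rule is a simplified, comparable device rule (hypothetical type).
type Rule struct {
	Type         byte
	Major, Minor int64
	Perms        string
	Allow        bool
}

// sameRules reports whether two rule lists are identical, element by
// element and in the same order.
func sameRules(a, b []Rule) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// Manager is a stand-in for a cgroup manager that remembers the last
// device rules it applied.
type Manager struct {
	current []Rule
	applied int // how many times the device update actually ran
}

// Set applies rules only if they differ from the current set.
func (m *Manager) Set(rules []Rule) {
	if sameRules(m.current, rules) {
		return // nothing changed: skip the device update entirely
	}
	m.applied++ // placeholder for the real (expensive) update
	m.current = append([]Rule(nil), rules...)
}

func main() {
	m := &Manager{}
	rules := []Rule{{'c', 1, 3, "rwm", true}}
	m.Set(rules)
	m.Set(rules) // a memory-only update would land here and be skipped
	fmt.Println("device updates performed:", m.applied)
}
```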
This reverts commit 07f5e84. k8s+containerd with cgroup v2 is currently not working correctly due to opencontainers/runc#2366
Merged #2951 |
This affects both versions, but in quite different ways:
For cgroupv1, "Don't deny all devices when update cgroup resource" (#2205) highlighted that on device cgroup updates, we temporarily block all devices. This results in spurious errors in the container (such as programs being unable to open /dev/null). We've seen this happen on customer systems under Kubernetes, so this is definitely a real issue. runc also implements the spec incorrectly here -- technically runc is a black-list by default and users have to convert runc to be a white-list. Aside from not following the spec, this is a worrying security stance.
For cgroupv2, devices cgroup updates are implemented by appending a new BPF program to the cgroup. This means that only new denials have an effect, and thus it's incorrectly implemented. (EDIT: This also means that we "leak" eBPF programs and thus after 64+ applications we start getting errors -- see "api, cgroupv2: skip setting the devices cgroup" (#2474).)
Unfortunately this is a bit complicated to fix, but I have figured out how to do it. We need to make an eBPF map of type BPF_MAP_TYPE_PROG_ARRAY and then tail-call into it from a small stub eBPF program which we attach to the actual cgroup. This will allow us to atomically update the devices cgroup rules (there is no way to atomically replace an eBPF program with BPF_F_ALLOW_MULTI -- and without any program, all device accesses would be permitted).
However, it appears that we cannot use bpf_tail_call from cgroup programs. So we will need to instead implement it through an eBPF map (which we can atomically replace by mis-using BPF_MAP_TYPE_ARRAY_OF_MAPS).
Part of #2315.
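To illustrate why appending a program only ever adds denials, here is a toy model in plain Go (no real eBPF involved). `Device`, `Program`, and `Verdict` are illustrative names. It assumes the behaviour described above: access is permitted only if every attached program permits it, and with no program attached everything is permitted.

```go
package main

import "fmt"

// Device identifies a device node by type and major:minor numbers.
type Device struct {
	Type         byte
	Major, Minor int64
}

// Program models an attached device filter: true means "allow".
type Program func(Device) bool

// Verdict models the combined decision of all attached programs:
// a single denial is enough to block the access.
func Verdict(progs []Program, d Device) bool {
	for _, p := range progs {
		if !p(d) {
			return false
		}
	}
	return true // also the result when nothing is attached
}

func main() {
	null := Device{'c', 1, 3} // /dev/null
	zero := Device{'c', 1, 5} // /dev/zero

	// The original program allows only /dev/null.
	old := Program(func(d Device) bool { return d == null })
	// The "update" wants to additionally allow /dev/zero.
	updated := Program(func(d Device) bool { return d == null || d == zero })

	fmt.Println("appended:", Verdict([]Program{old, updated}, zero))
	fmt.Println("replaced:", Verdict([]Program{updated}, zero))
	fmt.Println("detached:", Verdict(nil, zero))
}
```

In this model the appended case still denies /dev/zero because the old program vetoes it, and the detached case shows why a detach-then-attach update is not safe either.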