cgroups: select which cgroup hierarchy and subsystem state to use #369
Merged
tixxdz force-pushed the pr/tixxdz/cgroup-select-hierarchies branch from d8fd0f8 to 1e56861 (September 5, 2022 12:54)
willfindlay approved these changes on Sep 6, 2022
This fixes a nasty cgroupsv2 bug we've been tracking on 4.19 🎉
tixxdz force-pushed the pr/tixxdz/cgroup-select-hierarchies branch from 175dad1 to 6d50460 (September 7, 2022 14:05)
tixxdz force-pushed the pr/tixxdz/cgroup-select-hierarchies branch from 6d50460 to c69cdef (September 7, 2022 14:25)
This is a preparation patch that adds the tetragon_conf struct, which stores Tetragon runtime configuration, in order to improve the cgroup implementation and how we look up container IDs. User space gathers the information and stores it in a BPF map; the cgroup helpers read it from there and adapt their behavior accordingly.

We have seen events where neither the container ID ('docker') nor the pod fields were set. The reasons are:

In Cgroupv1 mode, systemd by default usually sets up only the 'cpu', 'cpuacct', 'memory', 'devices' and 'pids' controllers. The cpuset controller, which is normally indexed at 0, is not installed, yet it was the default controller that our BPF helpers used to fetch cgroup information, including the name. Since some container runtimes and environments use systemd as the cgroup driver, this can cause us to operate on the wrong hierarchy (controller). Note also that these controllers are kernel compile-time CONFIG_* options, and Tetragon will only work on machines that have the CONFIG_CGROUP_PIDS or CONFIG_CGROUP_MEMORY controllers compiled in. Most production machines these days have, or must have, these compiled in to work properly.

Let's be consistent with systemd, Kubernetes and container runtimes, and select by default either the 'memory' or the 'pids' controller as the tracking Cgroup hierarchy for all processes in Cgroupv1. These two controllers are usually present and set up. We do this by selecting the right hierarchy ID and the cgroup subsystem state index, which is initialized once during boot and propagated to the css_set of every task on the machine.

In Cgroupv2 mode, systemd successfully sets up the related controllers, which are safe to use by default. However, we have seen machines without the cpuset controller, which is rather strange: the controller is not propagated down to services and processes. This ends in the same error as on Cgroupv1.

To avoid such errors we do the same as for Cgroupv1: we gather the Cgroup subsystem state index and pass it into the tetragon_conf struct at startup. To get the Cgroup ID we first use the default Cgroupv2 BPF helpers and, if they fail, fall back to the per-subsystem state index. Finally, to get the Cgroup name we always query the subsystem state index and read the kernfs node name.

This should allow Tetragon to work in most environments, regardless of the Cgroup configuration and driver in use, assuming CONFIG_CGROUP_MEMORY or CONFIG_CGROUP_PIDS is compiled in.

Signed-off-by: Djalal Harouni <[email protected]>
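The lookup order described in this commit message can be modelled in user space as follows. This is only a sketch of the order of operations, not the BPF code from this PR: `lookup_cgroup_id`, `lookup_cgroup_name`, and the list-based stand-in for a task's per-subsystem state are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional, Sequence


def _in_bounds(subsys_idx: int, subsys_count: int, length: int) -> bool:
    # Guard modelled on the CGROUP_SUBSYS_COUNT style in-bound check.
    return 0 <= subsys_idx < subsys_count and subsys_idx < length


def lookup_cgroup_id(
    default_helper: Callable[[], int],
    css_cgroup_ids: Sequence[Optional[int]],
    subsys_idx: int,
    subsys_count: int,
) -> int:
    """Try the default (Cgroupv2) helper first, then the per-subsystem index."""
    cgid = default_helper()
    if cgid:
        return cgid
    if _in_bounds(subsys_idx, subsys_count, len(css_cgroup_ids)):
        return css_cgroup_ids[subsys_idx] or 0
    return 0  # nothing usable was found


def lookup_cgroup_name(
    css_cgroup_names: Sequence[Optional[str]],
    subsys_idx: int,
    subsys_count: int,
) -> str:
    """The name is always taken from the selected subsystem index."""
    if _in_bounds(subsys_idx, subsys_count, len(css_cgroup_names)):
        return css_cgroup_names[subsys_idx] or ""
    return ""


if __name__ == "__main__":
    ids = [None, None, None, 4242]          # made-up per-subsystem cgroup IDs
    print(lookup_cgroup_id(lambda: 0, ids, subsys_idx=3, subsys_count=14))
```

In this model a failing default helper is represented by a return value of 0, which is an assumption about how failure is signalled.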
Update our definition of CGROUP_SUBSYS_COUNT, since new Cgroup controllers were added. This value will be used as an in-bound limit guard. We are only interested in the 'memory' and 'pids' indexes, but we use the values provided here for in-bound safety checks. Since some controllers may not be compiled in, we read /proc/cgroups at startup and update the subsystem indexes at runtime to match the machine where Tetragon is running, which is more flexible. Reference: https://elixir.bootlin.com/linux/v5.19/source/include/linux/cgroup_subsys.h Signed-off-by: Djalal Harouni <[email protected]>
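As an illustration of the runtime discovery step, here is a minimal sketch that parses the /proc/cgroups table. It assumes that the position of a controller in the listing corresponds to its subsystem index; `Controller`, `parse_proc_cgroups`, and the sample text are hypothetical and not taken from the Tetragon sources.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Controller:
    index: int         # position in the listing; assumed to match the css index
    name: str          # controller name, e.g. "memory"
    hierarchy_id: int  # 0 when not attached to a cgroupv1 hierarchy
    enabled: bool


def parse_proc_cgroups(text: str) -> List[Controller]:
    """Parse the '#subsys_name hierarchy num_cgroups enabled' table."""
    controllers: List[Controller] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        if len(fields) < 4:
            # Skipping the line would shift every later index, so fail loudly.
            raise ValueError(f"malformed /proc/cgroups line: {line!r}")
        name, hierarchy, _num_cgroups, enabled = fields[:4]
        controllers.append(
            Controller(len(controllers), name, int(hierarchy), enabled == "1")
        )
    return controllers


# Illustrative sample, not captured from a real machine.
SAMPLE = (
    "#subsys_name\thierarchy\tnum_cgroups\tenabled\n"
    "cpuset\t0\t1\t1\n"
    "cpu\t2\t60\t1\n"
    "cpuacct\t2\t60\t1\n"
    "memory\t5\t120\t1\n"
    "devices\t4\t60\t1\n"
    "pids\t3\t80\t1\n"
)

if __name__ == "__main__":
    for controller in parse_proc_cgroups(SAMPLE):
        print(controller)
```

On a real machine the text would come from reading /proc/cgroups, and any index obtained this way should still be checked against the compile-time upper bound before use.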
Preparation patch that adds a convenience macro to check whether a given named type (struct/union/enum/typedef) exists in the target kernel. We need this to check whether the 'union kernfs_node_id' type exists, which was the ID type for kernfs nodes in kernels prior to 5.5. Code reference from: https://github.com/libbpf/libbpf Signed-off-by: Djalal Harouni <[email protected]>
… older
In newer kernels the kernfs_node id is a u64, but on kernels 5.4 and older it was a union. So let's add the kernfs_node_id union type so that we can check for it at runtime and operate on the right structure layout to get the Cgroup ID on these older kernels. This never worked, and we did not notice because we do not use the Cgroup IDs yet. That will change in the future, so let's fix it now while we are at it. It also helps debugging, since we get the right IDs instead of zeroes. Signed-off-by: Djalal Harouni <[email protected]>
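To make the layout difference concrete, the sketch below models the older union with ctypes. It assumes the pre-5.5 definition consists of an anonymous struct of two 32-bit fields (`ino`, `generation`) overlaid with a 64-bit `id`; the Python class names are hypothetical and the code only illustrates the memory layout, not how the BPF program reads it.

```python
import ctypes
import sys


class _InoGeneration(ctypes.Structure):
    # Assumed layout of the anonymous struct inside the old union.
    _fields_ = [("ino", ctypes.c_uint32), ("generation", ctypes.c_uint32)]


class KernfsNodeIdUnion(ctypes.Union):
    # Model of 'union kernfs_node_id' as assumed for kernels before 5.5.
    _anonymous_ = ("parts",)
    _fields_ = [("parts", _InoGeneration), ("id", ctypes.c_uint64)]


def id_from_raw(raw: bytes) -> int:
    """Read the 64-bit id member from 8 raw bytes of the union."""
    return KernfsNodeIdUnion.from_buffer_copy(raw).id


if __name__ == "__main__":
    node = KernfsNodeIdUnion()
    node.ino = 5
    node.generation = 1
    raw = bytes(node)
    # The union is 8 bytes wide, so its 'id' member lines up with a plain u64.
    print(ctypes.sizeof(KernfsNodeIdUnion), hex(id_from_raw(raw)), sys.byteorder)
```

Under this assumption the `id` member of the union and the plain u64 of newer kernels occupy the same 8 bytes, so what differs is the type that has to be selected, which is why a type-existence check is needed.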
Add our bpf cgroup helpers to operate on the desired cgroups. The helpers allow selecting which css, cgroup and related information to use. Some upstream BPF helpers work only on Cgroupv2, whereas we want more flexibility: to work on both Cgroupv1 and v2 without distinction, and to let Tetragon adapt to the machine where it is running. Signed-off-by: Djalal Harouni <[email protected]>
Use the new bpf cgroup helpers to gather cgroup information during execve() events. This ensures that:
- We operate on the right hierarchy, css and cgroup, where the information we want is available.
- We use the passed subsys index to select the cgroup whose name is read; that name can be transformed into a container ID in user space.
- The cgroup ID lookup is fixed: it never worked on older kernels on cgroupv1 and always returned zero. With this change we get the right cgroup IDs.
Signed-off-by: Djalal Harouni <[email protected]>
This package contains helpers to operate on cgroups:
- Performs cgroup filesystem detection.
- Performs cgroup mode detection based on https://systemd.io/CGROUP_DELEGATION/ but should also work for non-systemd init machines.
- Validates cgroup paths obtained from /proc/self/cgroup for both cgroupv1 and cgroupv2.
All these will be used in follow-up patches. Signed-off-by: Djalal Harouni <[email protected]>
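The mode detection rule from the systemd delegation document, together with parsing of /proc/self/cgroup entries, can be sketched as below. The function names and the sample text are hypothetical; the magic values are the filesystem magic numbers from the Linux headers, and obtaining them with statfs() on /sys/fs/cgroup and /sys/fs/cgroup/unified is left out so that the logic stays a pure function.

```python
from typing import List, Optional, Tuple

CGROUP2_SUPER_MAGIC = 0x63677270  # cgroup2 filesystem
TMPFS_MAGIC = 0x01021994          # tmpfs, used for /sys/fs/cgroup in v1 setups


def detect_cgroup_mode(root_magic: int, unified_magic: Optional[int]) -> str:
    """Classify the cgroup setup from filesystem magic numbers.

    root_magic:    f_type of /sys/fs/cgroup
    unified_magic: f_type of /sys/fs/cgroup/unified, or None if statfs failed
    """
    if root_magic == CGROUP2_SUPER_MAGIC:
        return "unified"
    if root_magic == TMPFS_MAGIC:
        if unified_magic == CGROUP2_SUPER_MAGIC:
            return "hybrid"
        return "legacy"
    return "undefined"


def parse_self_cgroup(text: str) -> List[Tuple[int, List[str], str]]:
    """Parse 'hierarchy-ID:controller-list:path' lines from /proc/self/cgroup."""
    entries = []
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # ignore anything that is not a well-formed entry
        hierarchy_id, controllers, path = parts
        if not path.startswith("/"):
            continue  # cgroup paths are expected to be absolute
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


# Illustrative sample mixing v1 entries and a v2 entry; not from a real machine.
SAMPLE = "5:memory:/docker/abc123\n3:pids:/docker/abc123\n0::/system.slice/x.service\n"

if __name__ == "__main__":
    print(detect_cgroup_mode(TMPFS_MAGIC, CGROUP2_SUPER_MAGIC))
    print(parse_self_cgroup(SAMPLE))
```

A Cgroupv2 entry has hierarchy ID 0 and an empty controller list, which is why an empty list is returned for it rather than a list containing an empty string.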
Signed-off-by: Djalal Harouni <[email protected]>
Add helpers to read and write the Tetragon runtime conf that is stored in a bpf map. UpdateRuntimeConf() gathers information about the Tetragon runtime environment and updates the BPF TetragonConfMap. It detects the CgroupFS magic and the Cgroup runtime mode, discovers the cgroup css's that are registered during boot and propagated to all tasks inside their css_set, and detects the deployment mode: Kubernetes, containers, standalone or systemd service. All discovered information is also logged for debugging purposes. Signed-off-by: Djalal Harouni <[email protected]>
Update TetragonConf bpf map at startup with the gathered cgroup and environment information. Signed-off-by: Djalal Harouni <[email protected]>
Signed-off-by: Djalal Harouni <[email protected]>
tixxdz force-pushed the pr/tixxdz/cgroup-select-hierarchies branch from c69cdef to 8efc6eb (September 7, 2022 15:35)
kevsecurity approved these changes on Sep 7, 2022
This was part of #225; it was cleaned up in order to improve how we operate on cgroup hierarchies.
bpf:cgroups: pass subsys index when operating on cgroup_subsys_state set
Select which cgroup controllers to use at runtime by analyzing the current machine's
cgroup configuration, and adapt the bpf helpers to use the best option.
We have seen events where neither the container ID ('docker') nor the pod fields were set. The reasons are:
In Cgroupv1 mode, systemd by default usually sets up only the 'cpu', 'cpuacct', 'memory', 'devices' and 'pids' controllers; cpuset, which is normally indexed at 0, is not installed. Since some container runtimes and environments use systemd as the cgroup driver, this can cause us to operate on the wrong hierarchy. Note also that these controllers are kernel compile-time CONFIG_* options, and Tetragon will only work on machines that have the CONFIG_CGROUP_PIDS or CONFIG_CGROUP_MEMORY controllers compiled in. Most production machines these days have, or must have, these compiled in to work properly.
Let's be consistent with systemd, Kubernetes and container runtimes, and select by default either the 'memory' or the 'pids' controller as the tracking Cgroup hierarchy for all processes. These two controllers are usually present and set up. We do this by selecting the right hierarchy ID and the cgroup subsystem state index.
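The selection step might look like the sketch below. The preference order (memory before pids), the `select_tracking_controller` name, and the rule that a Cgroupv1 candidate must have a non-zero hierarchy ID are assumptions made for illustration; the input mirrors the columns of /proc/cgroups.

```python
from typing import Dict, Optional, Tuple

# Assumed preference order; the description only says "either memory or pids".
PREFERRED = ("memory", "pids")

# name -> (subsystem index, hierarchy ID, enabled)
ControllerTable = Dict[str, Tuple[int, int, bool]]


def select_tracking_controller(
    controllers: ControllerTable, cgroupv1: bool
) -> Optional[Tuple[str, int, int]]:
    """Return (name, subsystem index, hierarchy ID) of the controller to track."""
    for name in PREFERRED:
        entry = controllers.get(name)
        if entry is None:
            continue  # controller not compiled into this kernel
        index, hierarchy_id, enabled = entry
        if not enabled:
            continue
        if cgroupv1 and hierarchy_id == 0:
            continue  # not attached to any cgroupv1 hierarchy
        return name, index, hierarchy_id
    return None


if __name__ == "__main__":
    table = {
        "cpuset": (0, 0, True),   # present but not set up, as in the report
        "memory": (3, 5, True),
        "pids": (5, 3, True),
    }
    print(select_tracking_controller(table, cgroupv1=True))
```

Returning `None` when neither controller is usable leaves the caller to decide how to report a machine that lacks both.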
In Cgroupv2 mode, systemd successfully sets up the related controllers, which are safe to use by default. However, we have seen machines that did not have the cpuset controller. To avoid such errors we do the same as for Cgroupv1: we gather the Cgroup subsystem state index and pass it into the tetragon_conf struct at startup. To get the Cgroup ID we first use the default Cgroupv2 BPF helpers and, if they fail, fall back to the per-subsystem index. Finally, to get the Cgroup name we always query the subsystem index and read the kernfs node name.
This allows Tetragon to work on different environments, regardless of the Cgroup configuration and driver being used.
Further reference: https://github.com/systemd/systemd/blob/main/src/basic/cgroup-util.h#L20
Signed-off-by: Djalal Harouni [email protected]