
cgroups: select which cgroup hierarchy and subsystem state to use #369

Merged
merged 11 commits into main from pr/tixxdz/cgroup-select-hierarchies on Sep 7, 2022

Conversation

@tixxdz (Member) commented Aug 29, 2022

This was part of #225; it was cleaned up in order to improve how we operate on cgroup hierarchies.

bpf:cgroups: pass subsys index when operating on cgroup_subsys_state set

Select which cgroup controllers to use at runtime by analyzing the current machine's cgroup configuration, and adapt the bpf helpers to use the best option.

We have experienced events that had neither the proper container ID 'docker' nor the pod fields set. The reasons are:

In Cgroupv1 mode, systemd usually sets up by default only the 'cpu', 'cpuacct', 'memory', 'devices' and 'pids' controllers; the cpuset controller, which in normal cases is indexed at 0, is not installed. Since some container runtimes and environments may use systemd as a cgroup driver, this can cause problems where we won't operate on the right hierarchy. We should also note that these controllers are kernel compile-time CONFIG_* options, and Tetragon should only work on machines that have the CONFIG_CGROUP_PIDS or CONFIG_CGROUP_MEMORY controllers compiled in. Most production machines these days have, or must have, these compiled in to work properly.

Let's be consistent with systemd, Kubernetes and container runtimes, and select by default either the 'memory' or 'pids' controller to be used as the tracking cgroup hierarchy for all processes. Usually these two controllers are always present and set. We do this by selecting the right hierarchy ID and the cgroup subsystem state index.

In Cgroupv2 mode, systemd successfully sets up the related controllers that are safe to use by default. However, we have seen machines that did not have the cpuset controller. To avoid such errors we perform the same operation as for Cgroupv1: we gather the cgroup subsystem state index and pass it into the tetragon_conf struct at startup. To get the cgroup ID we first use the default Cgroupv2 BPF helpers; if they fail we fall back to the per-subsystem index. Last, to get the cgroup name we always query the subsystem index and read the kernfs node name.
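A minimal BPF-side sketch of that lookup order, with hypothetical names ('tg_get_cgroup_id' and 'subsys_idx' are illustrations, not the actual patch):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Hypothetical helper: try the Cgroupv2 id first, then fall back to
 * the cgroup attached at the configured subsystem state index. */
static __u64 tg_get_cgroup_id(struct task_struct *task, __u32 subsys_idx)
{
	struct cgroup_subsys_state *css;
	__u64 id;

	id = bpf_get_current_cgroup_id(); /* default Cgroupv2 helper */
	if (id)
		return id;

	/* Fallback: the css at subsys_idx ('memory' or 'pids'), then the
	 * kernfs node id of its cgroup (a plain u64 on kernels >= 5.5). */
	css = BPF_CORE_READ(task, cgroups, subsys[subsys_idx]);
	return BPF_CORE_READ(css, cgroup, kn, id);
}
```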

This allows Tetragon to work in different environments, regardless of the cgroup configuration and driver being used.

Further reference: https://github.com/systemd/systemd/blob/main/src/basic/cgroup-util.h#L20

Signed-off-by: Djalal Harouni <[email protected]>

@tixxdz tixxdz requested a review from a team as a code owner August 29, 2022 16:42
@tixxdz tixxdz requested a review from tpapagian August 29, 2022 16:42
@tixxdz tixxdz force-pushed the pr/tixxdz/cgroup-select-hierarchies branch from d8fd0f8 to 1e56861 on September 5, 2022 12:54
@willfindlay (Contributor) left a comment


This fixes a nasty cgroupsv2 bug we've been tracking on 4.19 🎉

@tixxdz tixxdz force-pushed the pr/tixxdz/cgroup-select-hierarchies branch from 175dad1 to 6d50460 on September 7, 2022 14:05
@tixxdz tixxdz requested a review from kevsecurity September 7, 2022 14:08
@tixxdz tixxdz force-pushed the pr/tixxdz/cgroup-select-hierarchies branch from 6d50460 to c69cdef on September 7, 2022 14:25
This is a preparation patch that adds:

a tetragon_conf struct to store the Tetragon runtime configuration, in order
to improve the cgroup implementation and how we look up container IDs. User space
will gather information and store it in a bpf map, where the cgroup helpers will
read it and adapt their behavior accordingly.

We have experienced events that had neither the proper container ID 'docker'
nor the pod fields set. The reasons are:

In Cgroupv1 mode, systemd usually sets up by default only the 'cpu', 'cpuacct',
'memory', 'devices' and 'pids' controllers. The cpuset controller, which in normal
cases is indexed at 0, is not installed, and it was the default controller that
our bpf helpers used to fetch cgroup information, including the name.

Since some container runtimes and environments may use systemd as a cgroup
driver, this can cause problems where we won't operate on the right
hierarchy (controller).

We should also note that these controllers are kernel compile-time CONFIG_* options,
and Tetragon should only work on machines that have the CONFIG_CGROUP_PIDS or
CONFIG_CGROUP_MEMORY controllers compiled in. Most production machines
these days have, or must have, these compiled in to work properly.

Let's be consistent with systemd, Kubernetes and container runtimes, and select
by default either the 'memory' or 'pids' controller to be used as the tracking
cgroup hierarchy for all processes in Cgroupv1. Usually these two controllers
are always present and set. We do this by selecting the right hierarchy ID and
the cgroup subsystem state index, which is initialized once during boot
and propagated to all css_sets of tasks on the machine.

In Cgroupv2 mode, systemd successfully sets up the related controllers that are
safe to use by default. However, we have seen machines where the cpuset
controller was missing, which is rather strange: the controller is not
propagated down to services and processes. This ends up in the same error as
for Cgroupv1.

To avoid such errors we perform the same operation as for Cgroupv1: we gather
the cgroup subsystem state index and pass it into the tetragon_conf struct at
startup. Then, to get the cgroup ID, we first use the default Cgroupv2 BPF
helpers; if they fail we fall back to the per-subsystem state index.

Last, to get the Cgroup name we always query the subsystem state index and
read the kernfs node name.

This should allow Tetragon to work in most environments, regardless of
the cgroup configuration and driver being used, assuming they have
CONFIG_CGROUP_MEMORY or CONFIG_CGROUP_PIDS compiled in.

Signed-off-by: Djalal Harouni <[email protected]>
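A minimal sketch of that idea, with hypothetical field and map names (the real struct in the patch carries more state):

```c
struct tetragon_conf {
	__u32 cgrp_fs_magic;      /* cgroupfs magic: Cgroupv1 (tmpfs) or v2 */
	__u32 tg_cgrp_hierarchy;  /* cgroup hierarchy ID to track */
	__u32 tg_cgrp_subsys_idx; /* css index to use ('memory' or 'pids') */
	__u32 deployment_mode;    /* kubernetes, container, systemd, ... */
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct tetragon_conf);
} tg_conf_map SEC(".maps");
```

User space fills the single entry at startup; the bpf cgroup helpers look it up and pick the hierarchy and css index accordingly.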
Update our definition of CGROUP_SUBSYS_COUNT since new cgroup controllers
were added. These values will be used as an in-bounds limit guard.

We are only interested in the 'memory' and 'pids' indexes; however, we will use the
values provided here for safety in-bounds checks. Since some controllers may not
be compiled in, we instead read /proc/cgroups at startup and update the right
subsystem indexes at runtime to accommodate the current machine where
Tetragon is running, which is more flexible.

Reference: https://elixir.bootlin.com/linux/v5.19/source/include/linux/cgroup_subsys.h

Signed-off-by: Djalal Harouni <[email protected]>
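A hedged userspace sketch of that scan ('cgroup_find_subsys' is an illustrative name): each line of /proc/cgroups is '<subsys_name> <hierarchy> <num_cgroups> <enabled>', and controllers that are not compiled in simply do not appear, so the line position tracks the kernel's subsystem enumeration. The caller would try 'memory' first and fall back to 'pids'.

```c
#include <stdio.h>
#include <string.h>

/* Find the runtime subsystem index and v1 hierarchy ID of a controller
 * by scanning /proc/cgroups; returns the index or -1 if absent. */
static int cgroup_find_subsys(const char *controller, int *hierarchy)
{
	char name[64];
	int hier, ncgroups, enabled, idx = 0;
	FILE *f = fopen("/proc/cgroups", "r");

	if (!f)
		return -1;
	fscanf(f, "%*[^\n]\n"); /* skip the '#subsys_name ...' header */
	while (fscanf(f, "%63s %d %d %d\n", name, &hier,
		      &ncgroups, &enabled) == 4) {
		if (enabled && !strcmp(name, controller)) {
			*hierarchy = hier;
			fclose(f);
			return idx;
		}
		idx++;
	}
	fclose(f);
	return -1;
}
```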
Preparation patch for later that adds a convenience macro to check
that a provided named type (struct/union/enum/typedef) exists in a
target kernel.

We need this to check that the 'union kernfs_node_id' type exists, which
was the id for kernfs nodes in kernels prior to 5.5.

Code reference from: https://github.com/libbpf/libbpf

Signed-off-by: Djalal Harouni <[email protected]>
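libbpf's bpf_core_read.h provides such a macro as bpf_core_type_exists(); a usage sketch against a locally-defined union, where the check is resolved against the target kernel's BTF at load time:

```c
#include "vmlinux.h"
#include <bpf/bpf_core_read.h>

/* Local definition matching the pre-5.5 layout; CO-RE matches it
 * against the running kernel's BTF by name. */
union kernfs_node_id {
	struct {
		__u32 ino;
		__u32 generation;
	};
	__u64 id;
};

static bool kernfs_id_is_union(void)
{
	/* True only on kernels where 'union kernfs_node_id' still
	 * exists, i.e. 5.4 and older. */
	return bpf_core_type_exists(union kernfs_node_id);
}
```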
… older

In newer kernels the kernfs_node id is of u64 type; however, on kernels
5.4 and older it was a union. So let's add the kernfs_node_id
union type so we can check it at runtime and properly operate on
the right structure layout to get the cgroup ID on these older
kernels.

This never worked, and we did not notice since we do not use the
cgroup IDs. However, this will change in the future, so let's fix
it now while we are at it.

This also helps debugging, so we get the right IDs instead of zeroes.

Signed-off-by: Djalal Harouni <[email protected]>
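A hedged sketch of picking the right layout at runtime, reusing the union definition and existence check from the previous sketch ('tg_kernfs_node_id' is an illustrative name):

```c
/* Return the cgroup ID of a kernfs node on both old and new layouts. */
static __u64 tg_kernfs_node_id(struct kernfs_node *kn)
{
	if (bpf_core_type_exists(union kernfs_node_id)) {
		/* <= 5.4: 'id' is a union; the 64-bit value is its .id */
		union kernfs_node_id old_id;

		bpf_core_read(&old_id, sizeof(old_id), &kn->id);
		return old_id.id;
	}
	/* >= 5.5: 'id' is a plain u64 */
	return BPF_CORE_READ(kn, id);
}
```

On 5.4 the 64-bit value packs ino and generation; exposing it as the union's .id keeps a stable u64 for user space either way.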
Add our bpf cgroup helpers to properly operate on the desired cgroups. The
helpers allow selecting which css, cgroup and related information to use.

Some upstream BPF helpers work only on Cgroupv2, but we want more
flexibility: work on both Cgroupv1 and v2 without distinction and allow
Tetragon to adapt to the current machine where it is running.

Signed-off-by: Djalal Harouni <[email protected]>
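A hedged sketch of the shape of such helpers, with hypothetical names; the selected css is where the cgroup, its kernfs node and its name hang off:

```c
#define CGROUP_SUBSYS_COUNT 14 /* upper bound as of v5.19, see cgroup_subsys.h */

/* Return the css of a task at the configured subsystem index,
 * clamping the index as an in-bounds guard. */
static struct cgroup_subsys_state *
tg_get_task_css(struct task_struct *task, __u32 subsys_idx)
{
	if (subsys_idx >= CGROUP_SUBSYS_COUNT)
		subsys_idx = 0;
	return BPF_CORE_READ(task, cgroups, subsys[subsys_idx]);
}

/* The cgroup name is the kernfs node name of the css's cgroup; this
 * returns a kernel pointer, to be copied out with
 * bpf_probe_read_kernel_str(). */
static const char *tg_get_css_cgroup_name(struct cgroup_subsys_state *css)
{
	return BPF_CORE_READ(css, cgroup, kn, name);
}
```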
Use the new bpf cgroups helpers to gather cgroup information during
execve() events.

This will ensure that:
- We operate on the right hierarchy, css and its cgroup, where the
  information we want is available.
- We use the passed subsys index as a selector to get the cgroup name,
  which can be transformed into a container ID in user space.
- We fix the cgroup ID logic, which never worked on older kernels with
  cgroupv1 and always returned zero; with this change we get the right
  cgroup IDs.

Signed-off-by: Djalal Harouni <[email protected]>
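A sketch of wiring the pieces above into an execve hook; the hook point and names are illustrative and build on the earlier hypothetical sketches, not the patch's actual program:

```c
SEC("tracepoint/sched/sched_process_exec")
int tg_sched_exec(void *ctx)
{
	struct task_struct *task = (struct task_struct *)bpf_get_current_task();
	struct tetragon_conf *conf;
	const char *name;
	__u32 zero = 0, idx = 0;
	__u64 cgrpid;

	/* Read the subsystem index gathered by user space at startup. */
	conf = bpf_map_lookup_elem(&tg_conf_map, &zero);
	if (conf)
		idx = conf->tg_cgrp_subsys_idx;

	cgrpid = tg_get_cgroup_id(task, idx);
	name = tg_get_css_cgroup_name(tg_get_task_css(task, idx));
	/* Both would flow into the execve event; user space turns the
	 * name into a container ID. */
	bpf_printk("execve cgroup id=%llu name=%s", cgrpid, name);
	return 0;
}
```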
This package contains helpers to operate on cgroups:

- Performs cgroup filesystem detection.
- Performs cgroup mode detection based on https://systemd.io/CGROUP_DELEGATION/,
  but should also work on non-systemd init machines.
- Validates cgroup paths obtained from /proc/self/cgroup for both
  cgroupv1 and cgroupv2.

All these will be used in follow up patches.

Signed-off-by: Djalal Harouni <[email protected]>
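The package itself is Go; here is a hedged C sketch of the same mode-detection idea, following the systemd document referenced above: statfs /sys/fs/cgroup and compare the filesystem magic.

```c
#include <sys/vfs.h>
#include <linux/magic.h>

enum cgroup_mode { CGROUP_UNDEF, CGROUP_LEGACY, CGROUP_HYBRID, CGROUP_UNIFIED };

static enum cgroup_mode detect_cgroup_mode(void)
{
	struct statfs st;

	if (statfs("/sys/fs/cgroup", &st) < 0)
		return CGROUP_UNDEF;
	if (st.f_type == CGROUP2_SUPER_MAGIC)
		return CGROUP_UNIFIED;  /* pure Cgroupv2 */
	if (st.f_type == TMPFS_MAGIC) {
		/* v1 mount; hybrid if the unified hierarchy is also there */
		if (statfs("/sys/fs/cgroup/unified", &st) == 0 &&
		    st.f_type == CGROUP2_SUPER_MAGIC)
			return CGROUP_HYBRID;
		return CGROUP_LEGACY;
	}
	return CGROUP_UNDEF;
}
```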
Add helpers to read and write the Tetragon runtime conf that is stored in a bpf map.

UpdateRuntimeConf() gathers information about the Tetragon runtime environment and
updates the BPF TetragonConfMap.

It detects the cgroupfs magic and the cgroup runtime mode, discovers the cgroup css's
that are registered during boot and propagated to all tasks inside their css_set, and
detects the deployment mode: Kubernetes, containers, standalone or systemd services.
All discovered information is also logged for debugging purposes.

Signed-off-by: Djalal Harouni <[email protected]>
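UpdateRuntimeConf() is Go; a hedged libbpf C sketch of the equivalent single-entry map update at startup, reusing the hypothetical names from the earlier sketches:

```c
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static int update_runtime_conf(struct bpf_object *obj,
			       struct tetragon_conf *conf)
{
	__u32 zero = 0;
	int fd = bpf_object__find_map_fd_by_name(obj, "tg_conf_map");

	if (fd < 0)
		return fd;
	/* Single-slot array map: write the gathered configuration at
	 * index 0 so the bpf cgroup helpers can read it. */
	return bpf_map_update_elem(fd, &zero, conf, BPF_ANY);
}
```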
Update TetragonConf bpf map at startup with the gathered cgroup and
environment information.

Signed-off-by: Djalal Harouni <[email protected]>
@tixxdz tixxdz force-pushed the pr/tixxdz/cgroup-select-hierarchies branch from c69cdef to 8efc6eb on September 7, 2022 15:35
@tixxdz tixxdz merged commit b7503f0 into main Sep 7, 2022
@tixxdz tixxdz deleted the pr/tixxdz/cgroup-select-hierarchies branch September 7, 2022 16:59