[BUG] intermittent errors after PR 1202 (drop privileges) #1803

rafaeldtinoco · 2022-06-07T01:24:09Z

Prerequisites

This affects latest released version.
This affects current development tree (origin/HEAD).
There isn't an issue describing the bug.

Select one OR another:

I'm going to create a PR to solve this (assign to yourself).
Someone else should solve this.

Bug description

I'm executing tests in multiple environments through the DAILY TESTS and I have observed that we're currently getting intermittent failures after merging commits 6b0cad4 and bf0600a. I know tests are intermittently failing because after an initial run I got the following tests failing:

CO-RE (TRC-3, focalhwe-5.13)
CO-RE (TRC-4, focalhwe-5.13)
CO-RE (TRC-9, focalhwe-5.13)
CO-RE (TRC-14, focalhwe-5.13)

CO-RE (TRC-4, jammy-5.15)
CO-RE (TRC-11, jammy-5.15)

Beside the known failure of GKE kernel for TRC-9.

After running tests again, most of them passed but:

CO-RE (TRC-4, focalhwe-5.13)

ALL the errors were similar to the one bellow:

Meaning that it might be that tracee dropped privilege BEFORE the eBPF program load+attachment.

Steps to reproduce

Steps to reproduce the issue:

Run the CO-RE daily tests and observe intermittent issues.

Additional info

The full log for the last error can be found at:

https://github.com/aquasecurity/tracee/runs/6765902831?check_suite_focus=true

The text was updated successfully, but these errors were encountered:

rafaeldtinoco · 2022-06-07T01:25:15Z

@AlonZivony would you mind taking a look at this ? Until this is sorted out, tests will intermittently fail.

It looks like we're dropping privileges that we shouldn't in kernels above 5.8 AND this might be causing intermittent errors because of the timing in between dropping the privilege AND loading the eBPF program/attaching it "de facto".

Thank you.

yanivagman · 2022-06-07T07:00:26Z

According to perf_event_open() man page (https://linux.die.net/man/2/perf_event_open) CAP_SYS_ADMIN is required when using pid = -1, while libbpf uses this value when attaching a kprobe (https://github.com/libbpf/libbpf/blob/86eb09863c1c0177e99c2c703092042d3cdba910/src/libbpf.c#L10654).
So it seems that CAP_SYS_ADMIN is required anyways, and we can't just use CAP_BPF and CAP_PERFMON instead (which are available since kernel 5.8)

rafaeldtinoco · 2022-06-08T05:23:18Z

It says CAP_PERFMON (since 5.8) OR CAP_SYS_ADMIN, no ?

Here as well:

And the errors are intermittent:

Which is weird (hopefully not a kernel issue). If the privilege drop is being correctly done (with recent code) then it should either fail all times or succeed all times.

There are multiple places SYS_perf_event_open checks for permissions, returning possible EINVALs:

LSM hook (security_per_event_open w/ PERF_SECURITY_OPEN)
if (attr.namespaces) -> perfmom_capable()
perf_allow_kernel (if sample is PHYS ADDR):

perfmon_capable() +
LSM hook (security_perf_event_open w/ PERF_SECURITY_KERNEL)

security_locked_down(if sample is REGS INTR):

LSM hook (security_locked_down)

if PID is given (not kprobe's case) -> perf_check_permission()

I can trace where the EACCESS is coming from, if needed (probably not #1 or #2 in our case), just want to rule out chances of us doing something wrong in userland before going down that path.

AlonZivony · 2022-06-08T07:15:24Z

All these failures are quite odd, because it seems that they are not consistent.
In the link you sent for the full log, only TRC-4 is failing for the focalhwe distribution.
Because the tests failing are not consistent, I don't know if the answer is a missing capability, at least not for general use-case.
Anyways I will try to reproduce the error and play with the capabilities.

Meaning that it might be that tracee dropped privilege BEFORE the eBPF program load+attachment.

We do drop the capabilities BEFORE the eBPF program load+attach intentionally. The drop happens before any logic at all happens, with the goal to reduce attack surface.

So it seems that CAP_SYS_ADMIN is required anyways, and we can't just use CAP_BPF and CAP_PERFMON instead (which are available since kernel 5.8)

If it was true, all test for new kernel versions should have failed (because non use the CAP_SYS_ADMIN capability).

yanivagman · 2022-06-08T07:30:37Z

It says CAP_PERFMON (since 5.8) OR CAP_SYS_ADMIN, no ?

Yes, you are right. I missed the CAP_PERFMON.

After I wrote this comment, I noticed that it only happens for load_elf_phdrs kprobe (merged a week ago - #1752). It also seems that a recent PR we merged #1791 has possibly "fixed" the issue. This PR marks the load_elf_phdrs probe as not required for sched_process_exec in case of a failure to attach - https://github.com/aquasecurity/tracee/pull/1791/files#diff-89c872feefdac8c2c860803274a0358e8900fc822d8160b712522b2a2d6b353eR4865

So if from this point on the tests will pass for new PRs, the issue might be related to the load_elf_phdrs symbol (which I currently have no idea why this might happen). Otherwise (tests continue to fail for other symbols), it is indeed related to the capability PR.

AlonZivony · 2022-06-08T09:01:56Z

I managed to reproduce the bug, and noticed an interesting thing.
It seems that the failure with Permission Denied error doesn't happen in each run (at least in the environment I tested in).
About third of the times tracee failed to run.
I managed to check the capabilities of tracee-ebpf when failing and running, and it seems that there is no difference in the capabilities in both cases.
This means that the issue here is not the capabilities of the process in the end. Maybe there is a race condition here between reducing capabilities and loading the ebpf programs (maybe it takes time to reduce the capabilities what result that the loading of the eBPF programs occurs most of the times with all privileges).

Another interesting observation I had was that one time I have run tracee and for some reason no capability got dropped (stayed with all root capabilities).
It might reflect that there is an issue with the capabilities dropping mechanism. It needs more investigation.

NDStrahilevitz · 2022-06-08T11:07:07Z

So I got a new version of this problem when running tracee just now:

libbpf: kprobe perf_event_open() failed: Permission denied
libbpf: prog 'trace_security_socket_create': failed to create kprobe 'security_socket_create+0x0' perf event: Permission denied
2022/06/08 11:03:54 error creating Tracee: failed to attach required probe: error attaching probe security_socket_create: failed to attach security_socket_create k(ret)probe to program trace_security_socket_create: permission denied

I don't have any insight to add on this probe, but it is interesting that #1791 didn't "fix" load_elf_phdrs.

Edit:
And now this

libbpf: kprobe perf_event_open() failed: Permission denied
libbpf: prog 'trace_cap_capable': failed to create kprobe 'cap_capable+0x0' perf event: Permission denied
2022/06/08 11:07:54 error creating Tracee: failed to attach required probe: error attaching probe cap_capable: failed to attach cap_capable k(ret)probe to program trace_cap_capable: permission denied

So I think this supports that this might be a kprobe only thing.

rafaeldtinoco · 2022-06-08T11:40:36Z

All these failures are quite odd, because it seems that they are not consistent. In the link you sent for the full log, only TRC-4 is failing for the focalhwe distribution.
Because the tests failing are not consistent, I don't know if the answer is a missing capability, at least not for general use-case.

So, the fact that the error is happening more times in focalhwe image might only be a coincidence because this might be a timing issue indeed.

I don't think, even, it is a timing issue between userland (by the time we drop privileges) and kernel (by the time the kprobe program is attached) because we're not dropping privileges that are needed for the program load / attachment to happen.

I think knowing where in the kernel we're failing might be a good answer @AlonZivony since you told me that in both cases the caps are the same. WDYT ? I can get where in the kernel we're failing and go from there (what could be causing).

yanivagman · 2022-06-08T11:47:09Z

I don't have any insight to add on this probe, but it is interesting that #1791 didn't "fix" load_elf_phdrs.

Well actually it did. Both of the errors you got are not related to load_elf_phdrs.

So these are the things we know (correct me if I'm wrong):

The issue happens when attaching a kprobe using perf_event_open(), which return permission denied error.
The error is not deterministic
The error only happens for kernels > 5.8, meaning that we drop CAP_SYS_ADMIN, and use CAP_BPF and CAP_PERFMON instead, which theoretically should be enough.
Sometimes it seems that no capabilities are dropped

According to the above my guess is that in golang, which has its own threading model, we can't be sure that the thread that dropped the capabilities is the one being used to load/attach the bpf programs. When a different thread is being used to attach the kprobe (with full privileges), no error occurs, while if the thread that dropped capabilities is being used, we get this error. If that's the case, it also means that CAP_PERFMON is not enough to attach a kprobe, and CAP_SYS_ADMIN is required as well

NDStrahilevitz · 2022-06-08T12:26:30Z

I don't have any insight to add on this probe, but it is interesting that #1791 didn't "fix" load_elf_phdrs.

Well actually it did. Both of the errors you got are not related to load_elf_phdrs.

Well I did get the load_elf_phdrs error at times too and it's still produced in our PR workflows so that's why I said it's unfixed. But it does seem the reported end error isn't about attaching but about opening the perf buffer sometimes, so it might not be it too.
I just wanted to add that I am also getting these now.

rafaeldtinoco · 2022-06-08T12:28:16Z

According to the above my guess is that in golang, which has its own threading model, we can't be sure that the thread that dropped the capabilities is the one being used to load/attach the bpf programs. When a different thread is being used to attach the kprobe (with full privileges), no error occurs, while if the thread that dropped capabilities is being used, we get this error. If that's the case, it also means that CAP_PERFMON is not enough to attach a kprobe, and CAP_SYS_ADMIN is required as well

Interesting. I agree to this theory. If that is the case maybe we could try setting Inheritable capabilities right from the beginning to the golang runtime. So every thread created by it would have the same capabilities, no ?

rafaeldtinoco · 2022-06-08T12:30:17Z

Inheritable
       This  is  a  set of capabilities preserved across an execve(2).  Inheritable capabilities remain inheritable
       when executing any program, and inheritable capabilities are added to the permitted  set  when  executing  a
       program that has the corresponding bits set in the file inheritable set.

       Because  inheritable  capabilities  are  not generally preserved across execve(2) when running as a non-root
       user, applications that wish to run helper programs with elevated capabilities should consider using ambient
       capabilities, described below.

No, I think it wont work as well... ^

rafaeldtinoco · 2022-06-08T12:34:46Z

golang/go#1435 <- Related

rafaeldtinoco · 2022-06-08T12:41:46Z

@yanivagman it is a good theory since I can reproduce the issue way more often by doing:

When I reduce number of threads (not sure if I got 1, but I can reproduce it very often).

yanivagman · 2022-06-08T12:48:04Z

While at it, I think we should move from the package we use for managing capabilities github.com/syndtr/gocapability/capability to the (new) official package: https://pkg.go.dev/kernel.org/pub/linux/libs/security/libcap/cap

yanivagman · 2022-06-08T12:59:41Z

The official capabilities package uses the posix semantics to enforce capabilities in the process level, and not in the thread level, which is exactly what we need (see explanations about the posix semantics in the link above)

AlonZivony · 2022-06-08T13:10:24Z

While at it, I think we should move from the package we use for managing capabilities github.com/syndtr/gocapability/capability to the (new) official package: https://pkg.go.dev/kernel.org/pub/linux/libs/security/libcap/cap

Nice, great idea!
I checked and saw that when adding the CAP_SYS_ADMIN capability the issue is resolved in the machine I used.
This means that we can patch it for now by requiring the CAP_SYS_ADMIN capability for all versions, instead of CAP_BPF and CAP_PERFMON in newer once.
I think it would be better to find a way to not use the CAP_SYS_ADMIN capability, since it includes too many privileges. If you have an idea of a missing capability we can use instead write to me.
I will open soon a PR for the bugfix.

rafaeldtinoco · 2022-06-08T13:39:51Z

@AlonZivony and I synchronized and we agreed on:

Solve this issue by a MITIGATION (make it clear it is a mitigation)
Open an issue for "dropping privileges should have worked" (maybe a kernel issue)
Applying permissions to all threads change to this code base (as a fix to the feature).

rafaeldtinoco · 2022-06-08T13:50:27Z

@AlonZivony can you also address requests from @yanivagman from #1202 in this (1) mitigation ?

rafaeldtinoco added the kind/bug label Jun 7, 2022

rafaeldtinoco assigned AlonZivony Jun 7, 2022

rafaeldtinoco mentioned this issue Jun 7, 2022

Feature/drop unnecessary capabilities #1202

Merged

rafaeldtinoco changed the title ~~[BUG]~~ [BUG] intermittent errors after PR 1202 (drop privileges) Jun 7, 2022

This was referenced Jun 8, 2022

[BUG] Capabilities enforcement apply only for 1 thread #1813

Closed

Bugfix/missing capability new kernels #1814

Merged

[BUG] Use elevated capabilities in new kernels #1815

Closed

yanivagman closed this as completed in #1814 Jun 8, 2022

AlonZivony mentioned this issue Jul 17, 2022

Update checkRequiredCapabilities to support newer kernels #747

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] intermittent errors after PR 1202 (drop privileges) #1803

[BUG] intermittent errors after PR 1202 (drop privileges) #1803

rafaeldtinoco commented Jun 7, 2022 •

edited

Loading

rafaeldtinoco commented Jun 7, 2022 •

edited

Loading

yanivagman commented Jun 7, 2022

rafaeldtinoco commented Jun 8, 2022

AlonZivony commented Jun 8, 2022

yanivagman commented Jun 8, 2022 •

edited

Loading

AlonZivony commented Jun 8, 2022

NDStrahilevitz commented Jun 8, 2022 •

edited

Loading

rafaeldtinoco commented Jun 8, 2022

yanivagman commented Jun 8, 2022 •

edited

Loading

NDStrahilevitz commented Jun 8, 2022 •

edited

Loading

rafaeldtinoco commented Jun 8, 2022 •

edited

Loading

rafaeldtinoco commented Jun 8, 2022 •

edited

Loading

rafaeldtinoco commented Jun 8, 2022

rafaeldtinoco commented Jun 8, 2022 •

edited

Loading

yanivagman commented Jun 8, 2022 •

edited

Loading

yanivagman commented Jun 8, 2022

AlonZivony commented Jun 8, 2022

rafaeldtinoco commented Jun 8, 2022

rafaeldtinoco commented Jun 8, 2022

[BUG] intermittent errors after PR 1202 (drop privileges) #1803

[BUG] intermittent errors after PR 1202 (drop privileges) #1803

Comments

rafaeldtinoco commented Jun 7, 2022 • edited Loading

Prerequisites

Bug description

Steps to reproduce

Additional info

rafaeldtinoco commented Jun 7, 2022 • edited Loading

yanivagman commented Jun 7, 2022

rafaeldtinoco commented Jun 8, 2022

AlonZivony commented Jun 8, 2022

yanivagman commented Jun 8, 2022 • edited Loading

AlonZivony commented Jun 8, 2022

NDStrahilevitz commented Jun 8, 2022 • edited Loading

rafaeldtinoco commented Jun 8, 2022

yanivagman commented Jun 8, 2022 • edited Loading

NDStrahilevitz commented Jun 8, 2022 • edited Loading

rafaeldtinoco commented Jun 8, 2022 • edited Loading

rafaeldtinoco commented Jun 8, 2022 • edited Loading

rafaeldtinoco commented Jun 8, 2022

rafaeldtinoco commented Jun 8, 2022 • edited Loading

yanivagman commented Jun 8, 2022 • edited Loading

yanivagman commented Jun 8, 2022

AlonZivony commented Jun 8, 2022

rafaeldtinoco commented Jun 8, 2022

rafaeldtinoco commented Jun 8, 2022

rafaeldtinoco commented Jun 7, 2022 •

edited

Loading

rafaeldtinoco commented Jun 7, 2022 •

edited

Loading

yanivagman commented Jun 8, 2022 •

edited

Loading

NDStrahilevitz commented Jun 8, 2022 •

edited

Loading

yanivagman commented Jun 8, 2022 •

edited

Loading

NDStrahilevitz commented Jun 8, 2022 •

edited

Loading

rafaeldtinoco commented Jun 8, 2022 •

edited

Loading

rafaeldtinoco commented Jun 8, 2022 •

edited

Loading

rafaeldtinoco commented Jun 8, 2022 •

edited

Loading

yanivagman commented Jun 8, 2022 •

edited

Loading