Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] intermittent errors after PR 1202 (drop privileges) #1803

Closed
3 of 5 tasks
rafaeldtinoco opened this issue Jun 7, 2022 · 19 comments · Fixed by #1814
Closed
3 of 5 tasks

[BUG] intermittent errors after PR 1202 (drop privileges) #1803

rafaeldtinoco opened this issue Jun 7, 2022 · 19 comments · Fixed by #1814
Assignees
Labels

Comments

@rafaeldtinoco
Copy link
Contributor

rafaeldtinoco commented Jun 7, 2022

Prerequisites

  • This affects latest released version.
  • This affects current development tree (origin/HEAD).
  • There isn't an issue describing the bug.

Select one OR another:

  • I'm going to create a PR to solve this (assign to yourself).
  • Someone else should solve this.

Bug description

I'm executing tests in multiple environments through the DAILY TESTS and I have observed that we're currently getting intermittent failures after merging commits 6b0cad4 and bf0600a. I know tests are intermittently failing because after an initial run I got the following tests failing:

CO-RE (TRC-3, focalhwe-5.13)
CO-RE (TRC-4, focalhwe-5.13)
CO-RE (TRC-9, focalhwe-5.13)
CO-RE (TRC-14, focalhwe-5.13)

CO-RE (TRC-4, jammy-5.15)
CO-RE (TRC-11, jammy-5.15)

Beside the known failure of GKE kernel for TRC-9.

After running tests again, most of them passed but:

CO-RE (TRC-4, focalhwe-5.13)

ALL the errors were similar to the one bellow:

image

Meaning that it might be that tracee dropped privilege BEFORE the eBPF program load+attachment.

Steps to reproduce

Steps to reproduce the issue:

Run the CO-RE daily tests and observe intermittent issues.

Additional info

The full log for the last error can be found at:

https://github.com/aquasecurity/tracee/runs/6765902831?check_suite_focus=true

@rafaeldtinoco
Copy link
Contributor Author

rafaeldtinoco commented Jun 7, 2022

@AlonZivony would you mind taking a look at this ? Until this is sorted out, tests will intermittently fail.

It looks like we're dropping privileges that we shouldn't in kernels above 5.8 AND this might be causing intermittent errors because of the timing in between dropping the privilege AND loading the eBPF program/attaching it "de facto".

Thank you.

@rafaeldtinoco rafaeldtinoco changed the title [BUG] [BUG] intermittent errors after PR 1202 (drop privileges) Jun 7, 2022
@yanivagman
Copy link
Collaborator

According to perf_event_open() man page (https://linux.die.net/man/2/perf_event_open) CAP_SYS_ADMIN is required when using pid = -1, while libbpf uses this value when attaching a kprobe (https://github.com/libbpf/libbpf/blob/86eb09863c1c0177e99c2c703092042d3cdba910/src/libbpf.c#L10654).
So it seems that CAP_SYS_ADMIN is required anyways, and we can't just use CAP_BPF and CAP_PERFMON instead (which are available since kernel 5.8)

@rafaeldtinoco
Copy link
Contributor Author

image

It says CAP_PERFMON (since 5.8) OR CAP_SYS_ADMIN, no ?

Here as well:

image

And the errors are intermittent:

image

Which is weird (hopefully not a kernel issue). If the privilege drop is being correctly done (with recent code) then it should either fail all times or succeed all times.

There are multiple places SYS_perf_event_open checks for permissions, returning possible EINVALs:

  1. LSM hook (security_per_event_open w/ PERF_SECURITY_OPEN)
  2. if (attr.namespaces) -> perfmom_capable()
  3. perf_allow_kernel (if sample is PHYS ADDR):
  • perfmon_capable() +
  • LSM hook (security_perf_event_open w/ PERF_SECURITY_KERNEL)
  1. security_locked_down(if sample is REGS INTR):
  • LSM hook (security_locked_down)
  1. if PID is given (not kprobe's case) -> perf_check_permission()

I can trace where the EACCESS is coming from, if needed (probably not #1 or #2 in our case), just want to rule out chances of us doing something wrong in userland before going down that path.

@AlonZivony
Copy link
Contributor

All these failures are quite odd, because it seems that they are not consistent.
In the link you sent for the full log, only TRC-4 is failing for the focalhwe distribution.
Because the tests failing are not consistent, I don't know if the answer is a missing capability, at least not for general use-case.
Anyways I will try to reproduce the error and play with the capabilities.

Meaning that it might be that tracee dropped privilege BEFORE the eBPF program load+attachment.

We do drop the capabilities BEFORE the eBPF program load+attach intentionally. The drop happens before any logic at all happens, with the goal to reduce attack surface.

So it seems that CAP_SYS_ADMIN is required anyways, and we can't just use CAP_BPF and CAP_PERFMON instead (which are available since kernel 5.8)

If it was true, all test for new kernel versions should have failed (because non use the CAP_SYS_ADMIN capability).

@yanivagman
Copy link
Collaborator

yanivagman commented Jun 8, 2022

It says CAP_PERFMON (since 5.8) OR CAP_SYS_ADMIN, no ?

Yes, you are right. I missed the CAP_PERFMON.

After I wrote this comment, I noticed that it only happens for load_elf_phdrs kprobe (merged a week ago - #1752). It also seems that a recent PR we merged #1791 has possibly "fixed" the issue. This PR marks the load_elf_phdrs probe as not required for sched_process_exec in case of a failure to attach - https://github.com/aquasecurity/tracee/pull/1791/files#diff-89c872feefdac8c2c860803274a0358e8900fc822d8160b712522b2a2d6b353eR4865

So if from this point on the tests will pass for new PRs, the issue might be related to the load_elf_phdrs symbol (which I currently have no idea why this might happen). Otherwise (tests continue to fail for other symbols), it is indeed related to the capability PR.

@AlonZivony
Copy link
Contributor

I managed to reproduce the bug, and noticed an interesting thing.
It seems that the failure with Permission Denied error doesn't happen in each run (at least in the environment I tested in).
About third of the times tracee failed to run.
I managed to check the capabilities of tracee-ebpf when failing and running, and it seems that there is no difference in the capabilities in both cases.
This means that the issue here is not the capabilities of the process in the end. Maybe there is a race condition here between reducing capabilities and loading the ebpf programs (maybe it takes time to reduce the capabilities what result that the loading of the eBPF programs occurs most of the times with all privileges).

Another interesting observation I had was that one time I have run tracee and for some reason no capability got dropped (stayed with all root capabilities).
It might reflect that there is an issue with the capabilities dropping mechanism. It needs more investigation.

@NDStrahilevitz
Copy link
Collaborator

NDStrahilevitz commented Jun 8, 2022

So I got a new version of this problem when running tracee just now:

libbpf: kprobe perf_event_open() failed: Permission denied
libbpf: prog 'trace_security_socket_create': failed to create kprobe 'security_socket_create+0x0' perf event: Permission denied
2022/06/08 11:03:54 error creating Tracee: failed to attach required probe: error attaching probe security_socket_create: failed to attach security_socket_create k(ret)probe to program trace_security_socket_create: permission denied

I don't have any insight to add on this probe, but it is interesting that #1791 didn't "fix" load_elf_phdrs.

Edit:
And now this

libbpf: kprobe perf_event_open() failed: Permission denied
libbpf: prog 'trace_cap_capable': failed to create kprobe 'cap_capable+0x0' perf event: Permission denied
2022/06/08 11:07:54 error creating Tracee: failed to attach required probe: error attaching probe cap_capable: failed to attach cap_capable k(ret)probe to program trace_cap_capable: permission denied

So I think this supports that this might be a kprobe only thing.

@rafaeldtinoco
Copy link
Contributor Author

All these failures are quite odd, because it seems that they are not consistent. In the link you sent for the full log, only TRC-4 is failing for the focalhwe distribution.
Because the tests failing are not consistent, I don't know if the answer is a missing capability, at least not for general use-case.

So, the fact that the error is happening more times in focalhwe image might only be a coincidence because this might be a timing issue indeed.

I don't think, even, it is a timing issue between userland (by the time we drop privileges) and kernel (by the time the kprobe program is attached) because we're not dropping privileges that are needed for the program load / attachment to happen.

I think knowing where in the kernel we're failing might be a good answer @AlonZivony since you told me that in both cases the caps are the same. WDYT ? I can get where in the kernel we're failing and go from there (what could be causing).

@yanivagman
Copy link
Collaborator

yanivagman commented Jun 8, 2022

I don't have any insight to add on this probe, but it is interesting that #1791 didn't "fix" load_elf_phdrs.

Well actually it did. Both of the errors you got are not related to load_elf_phdrs.

So these are the things we know (correct me if I'm wrong):

  1. The issue happens when attaching a kprobe using perf_event_open(), which return permission denied error.
  2. The error is not deterministic
  3. The error only happens for kernels > 5.8, meaning that we drop CAP_SYS_ADMIN, and use CAP_BPF and CAP_PERFMON instead, which theoretically should be enough.
  4. Sometimes it seems that no capabilities are dropped

According to the above my guess is that in golang, which has its own threading model, we can't be sure that the thread that dropped the capabilities is the one being used to load/attach the bpf programs. When a different thread is being used to attach the kprobe (with full privileges), no error occurs, while if the thread that dropped capabilities is being used, we get this error. If that's the case, it also means that CAP_PERFMON is not enough to attach a kprobe, and CAP_SYS_ADMIN is required as well

@NDStrahilevitz
Copy link
Collaborator

NDStrahilevitz commented Jun 8, 2022

I don't have any insight to add on this probe, but it is interesting that #1791 didn't "fix" load_elf_phdrs.

Well actually it did. Both of the errors you got are not related to load_elf_phdrs.

Well I did get the load_elf_phdrs error at times too and it's still produced in our PR workflows so that's why I said it's unfixed. But it does seem the reported end error isn't about attaching but about opening the perf buffer sometimes, so it might not be it too.
I just wanted to add that I am also getting these now.

@rafaeldtinoco
Copy link
Contributor Author

rafaeldtinoco commented Jun 8, 2022

According to the above my guess is that in golang, which has its own threading model, we can't be sure that the thread that dropped the capabilities is the one being used to load/attach the bpf programs. When a different thread is being used to attach the kprobe (with full privileges), no error occurs, while if the thread that dropped capabilities is being used, we get this error. If that's the case, it also means that CAP_PERFMON is not enough to attach a kprobe, and CAP_SYS_ADMIN is required as well

Interesting. I agree to this theory. If that is the case maybe we could try setting Inheritable capabilities right from the beginning to the golang runtime. So every thread created by it would have the same capabilities, no ?

@rafaeldtinoco
Copy link
Contributor Author

rafaeldtinoco commented Jun 8, 2022

Inheritable
       This  is  a  set of capabilities preserved across an execve(2).  Inheritable capabilities remain inheritable
       when executing any program, and inheritable capabilities are added to the permitted  set  when  executing  a
       program that has the corresponding bits set in the file inheritable set.

       Because  inheritable  capabilities  are  not generally preserved across execve(2) when running as a non-root
       user, applications that wish to run helper programs with elevated capabilities should consider using ambient
       capabilities, described below.

No, I think it wont work as well... ^

@rafaeldtinoco
Copy link
Contributor Author

golang/go#1435 <- Related

@rafaeldtinoco
Copy link
Contributor Author

rafaeldtinoco commented Jun 8, 2022

@yanivagman it is a good theory since I can reproduce the issue way more often by doing:

image

When I reduce number of threads (not sure if I got 1, but I can reproduce it very often).

@yanivagman
Copy link
Collaborator

yanivagman commented Jun 8, 2022

While at it, I think we should move from the package we use for managing capabilities github.com/syndtr/gocapability/capability to the (new) official package: https://pkg.go.dev/kernel.org/pub/linux/libs/security/libcap/cap

@yanivagman
Copy link
Collaborator

The official capabilities package uses the posix semantics to enforce capabilities in the process level, and not in the thread level, which is exactly what we need (see explanations about the posix semantics in the link above)

@AlonZivony
Copy link
Contributor

While at it, I think we should move from the package we use for managing capabilities github.com/syndtr/gocapability/capability to the (new) official package: https://pkg.go.dev/kernel.org/pub/linux/libs/security/libcap/cap

Nice, great idea!
I checked and saw that when adding the CAP_SYS_ADMIN capability the issue is resolved in the machine I used.
This means that we can patch it for now by requiring the CAP_SYS_ADMIN capability for all versions, instead of CAP_BPF and CAP_PERFMON in newer once.
I think it would be better to find a way to not use the CAP_SYS_ADMIN capability, since it includes too many privileges. If you have an idea of a missing capability we can use instead write to me.
I will open soon a PR for the bugfix.

@rafaeldtinoco
Copy link
Contributor Author

@AlonZivony and I synchronized and we agreed on:

  1. Solve this issue by a MITIGATION (make it clear it is a mitigation)
  2. Open an issue for "dropping privileges should have worked" (maybe a kernel issue)
  3. Applying permissions to all threads change to this code base (as a fix to the feature).

@rafaeldtinoco
Copy link
Contributor Author

@AlonZivony can you also address requests from @yanivagman from #1202 in this (1) mitigation ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants