[BUG] intermittent errors after PR 1202 (drop privileges) #1803
Comments
@AlonZivony would you mind taking a look at this? Until this is sorted out, tests will fail intermittently. It looks like we're dropping privileges that we shouldn't on kernels above 5.8, AND this might be causing intermittent errors because of the timing between dropping the privileges AND actually loading/attaching the eBPF program. Thank you.
According to perf_event_open() man page (https://linux.die.net/man/2/perf_event_open) CAP_SYS_ADMIN is required when using pid = -1, while libbpf uses this value when attaching a kprobe (https://github.com/libbpf/libbpf/blob/86eb09863c1c0177e99c2c703092042d3cdba910/src/libbpf.c#L10654). |
It says CAP_PERFMON (since 5.8) OR CAP_SYS_ADMIN, no? Here as well: And the errors are intermittent: Which is weird (hopefully not a kernel issue). If the privilege drop is being done correctly (with recent code), then it should either fail every time or succeed every time. There are multiple places where SYS_perf_event_open checks for permissions, returning possible EINVALs:
I can trace where the EACCES is coming from, if needed (probably not #1 or #2 in our case). I just want to rule out the chance that we are doing something wrong in userland before going down that path.
All these failures are quite odd, because they do not seem to be consistent.
We intentionally drop the capabilities BEFORE the eBPF program load+attach. The drop happens before any other logic runs, with the goal of reducing the attack surface.
If that were true, all tests for new kernel versions should have failed (because none use the
Yes, you are right. I missed CAP_PERFMON. After I wrote this comment, I noticed that it only happens for the load_elf_phdrs kprobe (merged a week ago - #1752). It also seems that a PR we merged recently, #1791, has possibly "fixed" the issue. That PR marks the load_elf_phdrs probe as not required for sched_process_exec in case of a failure to attach - https://github.com/aquasecurity/tracee/pull/1791/files#diff-89c872feefdac8c2c860803274a0358e8900fc822d8160b712522b2a2d6b353eR4865 So if tests pass for new PRs from this point on, the issue might be related to the load_elf_phdrs symbol (and I currently have no idea why that would happen). Otherwise (tests continue to fail for other symbols), it is indeed related to the capability PR.
I managed to reproduce the bug, and noticed an interesting thing. Another interesting observation: one time when I ran tracee, for some reason no capability got dropped (it stayed with all root capabilities).
So I got a new version of this problem when running tracee just now:
I don't have any insight to add on this probe, but it is interesting that #1791 didn't "fix" load_elf_phdrs. Edit:
So I think this supports the idea that this might be a kprobe-only thing.
So, the fact that the error happens more often in the focalhwe image might only be a coincidence, because this might indeed be a timing issue. I don't even think it is a timing issue between userland (when we drop privileges) and kernel (when the kprobe program is attached), because we're not dropping privileges that are needed for the program load/attachment to happen. I think knowing where in the kernel we're failing might give a good answer, @AlonZivony, since you told me that in both cases the caps are the same. WDYT? I can find where in the kernel we're failing and go from there (what could be causing it).
Well actually it did. Both of the errors you got are not related to load_elf_phdrs. So these are the things we know (correct me if I'm wrong):
According to the above, my guess is that in golang, which has its own threading model, we can't be sure that the thread that dropped the capabilities is the one being used to load/attach the bpf programs. When a different thread (with full privileges) is used to attach the kprobe, no error occurs, while if the thread that dropped capabilities is used, we get this error. If that's the case, it also means that CAP_PERFMON is not enough to attach a kprobe, and CAP_SYS_ADMIN is required as well.
Well, I did get the load_elf_phdrs error at times too, and it's still produced in our PR workflows, so that's why I said it's unfixed. But it does seem the reported end error is sometimes not about attaching but about opening the perf buffer, so that might not be it either.
Interesting. I agree with this theory. If that is the case, maybe we could try setting Inheritable capabilities for the golang runtime right from the beginning. So every thread created by it would have the same capabilities, no?
No, I think that won't work either... ^
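The per-thread theory above can be checked from inside the process, since Linux exposes each thread's effective capability mask as the CapEff line of /proc/self/task/<tid>/status. The following is a minimal sketch, not tracee code: parseCapEff, hasCap and perfCapable are hypothetical helper names, and perfCapable only mirrors the "CAP_PERFMON or CAP_SYS_ADMIN" rule discussed in this thread for kernels 5.8 and later (it ignores other factors such as perf_event_paranoid). The capability numbers are the ones from linux/capability.h.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// Capability bit numbers as defined in linux/capability.h.
const (
	capSysAdmin = 21
	capPerfmon  = 38
)

// parseCapEff (hypothetical helper) extracts the effective capability
// mask from the text of a /proc/<pid>/task/<tid>/status file.
func parseCapEff(status string) (uint64, error) {
	for _, line := range strings.Split(status, "\n") {
		if strings.HasPrefix(line, "CapEff:") {
			field := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
			return strconv.ParseUint(field, 16, 64)
		}
	}
	return 0, fmt.Errorf("no CapEff line found")
}

// hasCap reports whether capability number capNum is set in mask.
func hasCap(mask uint64, capNum uint) bool {
	return mask&(uint64(1)<<capNum) != 0
}

// perfCapable is a simplified stand-in for the kernel >= 5.8 rule:
// either CAP_PERFMON or CAP_SYS_ADMIN is enough.
func perfCapable(mask uint64) bool {
	return hasCap(mask, capPerfmon) || hasCap(mask, capSysAdmin)
}

func main() {
	// Each OS thread of the Go runtime has its own entry under task/.
	paths, _ := filepath.Glob("/proc/self/task/*/status")
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the thread may have exited in the meantime
		}
		mask, err := parseCapEff(string(data))
		if err != nil {
			continue
		}
		fmt.Printf("%s CapEff=%016x perfCapable=%v\n", p, mask, perfCapable(mask))
	}
}
```

If the theory holds, running this after the drop should show different CapEff values across the runtime's threads.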
golang/go#1435 <- Related |
@yanivagman it is a good theory, since I can reproduce the issue way more often by doing: When I reduce the number of threads (not sure if I got 1, but I can reproduce it very often).
While at it, I think we should move from the package we use for managing capabilities, github.com/syndtr/gocapability/capability, to the (new) official package: https://pkg.go.dev/kernel.org/pub/linux/libs/security/libcap/cap
The official capabilities package uses POSIX semantics to enforce capabilities at the process level, not at the thread level, which is exactly what we need (see the explanation of POSIX semantics in the link above).
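For contrast with process-level semantics: when capabilities are changed per thread, a capability change and the work that depends on it only see the same capability set if they run on the same OS thread. A minimal stdlib-only sketch of that idea follows, using runtime.LockOSThread. It is an illustration under stated assumptions, not what tracee does and not the libcap/cap API; runPinned and errStepFailed are hypothetical names, and the steps are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// errStepFailed is a placeholder error used only for this illustration.
var errStepFailed = errors.New("step failed")

// runPinned (hypothetical helper) runs all steps in order on a single
// goroutine that is locked to one OS thread, stopping at the first error.
func runPinned(steps ...func() error) error {
	done := make(chan error, 1)
	go func() {
		// Pin this goroutine to its current OS thread. The thread is
		// deliberately never unlocked: when a locked goroutine exits,
		// the Go runtime terminates the thread instead of reusing it,
		// so a thread with altered state does not return to the pool.
		runtime.LockOSThread()
		for _, step := range steps {
			if err := step(); err != nil {
				done <- err
				return
			}
		}
		done <- nil
	}()
	return <-done
}

func main() {
	err := runPinned(
		func() error { fmt.Println("step 1: change thread capabilities (placeholder)"); return nil },
		func() error { fmt.Println("step 2: work that depends on them (placeholder)"); return errStepFailed },
		func() error { fmt.Println("step 3: never reached"); return nil },
	)
	fmt.Println("result:", err)
}
```

Pinning only controls the steps passed to the helper. Any other goroutine may still land on a thread with different capabilities, which is why process-wide semantics are the simpler fit for the problem described here.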
Nice, great idea! |
@AlonZivony and I synchronized and we agreed on:
@AlonZivony can you also address the requests from @yanivagman in #1202 as part of this mitigation (1)?
Prerequisites
Select one OR another:
Bug description
I'm executing tests in multiple environments through the DAILY TESTS and I have observed that we're currently getting intermittent failures after merging commits 6b0cad4 and bf0600a. I know tests are intermittently failing because after an initial run I got the following tests failing:
CO-RE (TRC-3, focalhwe-5.13)
CO-RE (TRC-4, focalhwe-5.13)
CO-RE (TRC-9, focalhwe-5.13)
CO-RE (TRC-14, focalhwe-5.13)
CO-RE (TRC-4, jammy-5.15)
CO-RE (TRC-11, jammy-5.15)
Besides the known failure of the GKE kernel for TRC-9.
After running tests again, most of them passed but:
CO-RE (TRC-4, focalhwe-5.13)
ALL the errors were similar to the one below:
This suggests that tracee may have dropped privileges BEFORE the eBPF program load+attachment.
Steps to reproduce
Steps to reproduce the issue:
Run the CO-RE daily tests and observe intermittent issues.
Additional info
The full log for the last error can be found at:
https://github.com/aquasecurity/tracee/runs/6765902831?check_suite_focus=true