
Create proposal for modern-bfp kernel-side filtering #1867

Open

stevenbrz wants to merge 1 commit into base: master

Conversation

stevenbrz

What type of PR is this?
/kind documentation

Any specific area of the project related to this PR?
/area libscap-engine-modern-bpf
/area proposals

What this PR does / why we need it:
In response to feedback in #1557, this adds a proposal for introducing a kernel-side event filtering feature to the modern-bpf driver. I hope this helps facilitate discussion on how we can best implement this feature.

Further detail for motivation and potential implementation can be found in the proposal.

cc. @bajzekm

@poiana
Contributor

poiana commented May 14, 2024

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@poiana
Contributor

poiana commented May 14, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: stevenbrz
Once this PR has been reviewed and has the lgtm label, please assign mstemm for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana
Contributor

poiana commented May 14, 2024

Welcome @stevenbrz! It looks like this is your first PR to falcosecurity/libs 🎉

@poiana poiana added the size/L label May 14, 2024
@incertum
Contributor

Amazing, we will get to it soon, likely after the next release is out. Thanks a bunch for all the effort you are already putting into this!

@FedeDP FedeDP added this to the TBD milestone May 15, 2024
@Andreagit97
Member

First of all, thank you for this proposal!

my 2 cents:
kernel filtering could help us in at least 2 ways:

  1. reduce/eliminate the number of drops
  2. reduce Falco's CPU usage. Even if we were able to read all the events from the buffers without drops, the CPU would probably run at 100% under some of the high throughput you showed us here.

It is not easy to implement effectively, but we should at least try!

At the moment I have no precise idea about the implementation but I would like to better understand the data we see here in this proposal.

Starting from (scap.evts_rate_sec):

count      1022425.000000 (number of metric samples)
mean         78169.513609
std          56479.322696
min              0.000000
50%          71634.525317
90%         139310.344902
99%         260659.990578
99.9%       565055.582845
99.99%      734383.889699
99.999%     779872.462636
max         824091.991662

Looking at the mean event rate I see ~ 78170 evt/s. This is pretty low...

  • Are you experiencing event drops with these low rates? Even the max rate 824092 evt/s seems not so high. Is this the same machine you showed us the previous time (with 128 CPUs)?
  • How many Falco rules are you loading in this case?
  • Is this Vanilla Falco 0.36.2?

As for drops, it would be interesting to understand how the event rate changes... I expect 2 things:

  • a lower evt rate because we are sending fewer events to userspace, of course.
  • a lower evt rate because the kernel filtering solution slows down the entire machine, so we should see fewer evt/s.

What is not clear to me at the moment is why we have worse 90-99th percentile performance... maybe it is due to a slightly different load on the machine? If I'm not missing something, the kernel filtering patch should always perform better if we only look at the userspace statistics like evt/s and drop percentage, as we are doing here... Do you have an explanation for this?

Falco 0.37.0 with our filtering patch, filtering out 6 commonly seen path prefixes for file open events:

If I understood well, you obtained these results by using a simple filter like

filters:
  - syscall: 257 # syscall number for `openat` on `x86_64` 
    arg: 1 # file path
    prefixes: ["1", "2", "3", "4", "5", "6"]

is this true?
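To make sure we are talking about the same semantics, here is a minimal Python sketch of how I read the proposed filter config: an event is dropped when its syscall number matches an entry and the selected string argument starts with any configured prefix. The `FILTERS` structure, `should_drop` helper, and the example prefixes are all hypothetical illustrations, not the actual driver code.

```python
# Hypothetical illustration of the proposed filter semantics (not the real
# kernel-side implementation): drop an event when its syscall number matches
# and the chosen string argument begins with one of the configured prefixes.

FILTERS = [
    {
        "syscall": 257,   # syscall number for `openat` on x86_64
        "arg": 1,         # index of the file-path argument
        "prefixes": ["/proc/", "/sys/fs/cgroup/"],  # example values only
    },
]

def should_drop(syscall_nr: int, args: list) -> bool:
    """Return True if the event would be filtered out kernel-side."""
    for f in FILTERS:
        if f["syscall"] != syscall_nr:
            continue
        path = args[f["arg"]]
        if any(path.startswith(p) for p in f["prefixes"]):
            return True
    return False
```

For example, an `openat` of `/proc/self/status` would be dropped, while an `openat` of `/etc/passwd` (or any other syscall) would pass through.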

@Andreagit97
Member

Andreagit97 commented May 31, 2024

Just another minor question: have you ever tried the Falco config drop_failed_exit? I'm curious to understand what the impact of failed syscalls on your system could be.

@stevenbrz
Author

  • Are you experiencing event drops with these low rates? Even the max rate 824092 evt/s seems not so high. Is this the same machine you showed us the previous time (with 128 CPUs)?

These datapoints are pulled from all of our hosts ranging in size from 32-128 CPUs.

Here are some crude charts with the data from the filtering patch. It doesn't look like there's a clear correlation between event rate and drop rate:

Linear scale: [chart image]
Log scale: [chart image]

  • How many Falco rules are you loading in this case?

We're using the default rule set plus about 300 lines of override conditions filtering out noise specific to our systems. Previously we ran a test with our custom rule set vs. a single dummy rule against a synthetic workload and didn't notice a significant difference in drop rate, so we concluded we should look beyond the ruleset. If you think it could be contributing, I could revisit with a more realistic test.

  • Is this Vanilla Falco 0.36.2?

Yeah, the stats in the OP are for vanilla Falco 0.36.2 with the same ruleset as mentioned above.

As for drops, it would be interesting to understand how the event rate changes... I expect 2 things:

  • a lower evt rate because we are sending fewer events to userspace, of course.
  • a lower evt rate because the kernel filtering solution slows down the entire machine, so we should see fewer evt/s.

What is not clear to me at the moment is why we have worse 90-99th percentile performance... maybe it is due to a slightly different load on the machine? If I'm not missing something, the kernel filtering patch should always perform better if we only look at the userspace statistics like evt/s and drop percentage, as we are doing here... Do you have an explanation for this?

I agree, I'd expect the same result. The only explanation I can think of is a performance regression introduced between the two versions, since the only functional differences between the vanilla version and our patch are some additional config processing at startup and some additional compute per hook; but that would manifest as an overall system slowdown.

Falco 0.37.0 with our filtering patch, filtering out 6 commonly seen path prefixes for file open events:

If I understood well, you obtained these results by using a simple filter like

filters:
  - syscall: 257 # syscall number for `openat` on `x86_64` 
    arg: 1 # file path
    prefixes: ["1", "2", "3", "4", "5", "6"]

is this true?

Yeah that's right.

Just another minor question, have you ever tried this Falco config drop_failed_exit? i'm curious to understand what could be the impact of failed syscalls on your system

Yeah we use this feature as it seemed like a good way to squeeze out some extra performance.

@Andreagit97
Member

Ok, thank you for the additional info!

Since the problem is already complex and there is a lot of entropy, I would like to focus on a single node, WDYT? Evaluating metrics across nodes with different specs/loads could become more difficult than it should be.

What we could do is:

  • run Falco on the node without the kernel filtering patch, for 1h for example, and check the Falco metrics
  • run the same Falco version on the same machine but this time with the kernel filtering enabled and check again the metrics after 1 hour.

The load during the tests should be similar, otherwise the data won't be comparable. What I'm curious to understand is how the evt/s rate changes with the kernel filtering solution. The idea is to understand how effective the solution is in real-world scenarios.

I would also be curious to see some real filters that you could use in production, again to understand the real effectiveness of the solution. In the proposal I've seen some example paths like "/proc" and "/sys/fs/cgroup"; are these real paths that you are using in production?

BTW, the current approach seems fine to me. Of course, I would prefer an allowlist over a denylist approach; for example, we could take only paths under /etc. But this would be a huge hole in the security posture, since we don't resolve symlinks and we could have relative paths...

The other thing I had in mind was a filter on the thread id. Say you only want processes with a certain command line: you could check the command line in userspace and add the tid to a hash table in the kernel, as many tools already do. This filter is more complex to manage, because you somehow need to free the hash table of stale entries and keep it synchronized with userspace, but on the other side it would give us a lot of flexibility: you could filter processes using almost any filter we have in userspace, for example filtering out the children of a certain parent. Of course, here too we need to understand whether we have concrete use cases in production. Do you see any application of this approach for your use case?

Moreover, this other approach is not in conflict with the one you proposed and could also be built easily on top of that.
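To make the tid-based idea concrete, here is a small Python sketch of the lifecycle it implies. A plain Python set stands in for a kernel-side BPF hash map; the function names (`userspace_mark`, `kernel_should_drop`, `on_thread_exit`) and the "noisy command line" policy are hypothetical, purely for illustration.

```python
# Illustrative sketch of tid-based filtering (Python standing in for a BPF
# hash map keyed by tid): userspace inspects process metadata and marks
# tids whose events should be dropped; the "kernel side" then discards
# events from those tids, and stale entries are freed on thread exit.

dropped_tids = set()  # stands in for a BPF_MAP_TYPE_HASH keyed by tid

def userspace_mark(tid: int, cmdline: str, noisy_cmdlines: set) -> None:
    # Userspace policy (example): drop all events from threads whose
    # command line is known to be noisy.
    if cmdline in noisy_cmdlines:
        dropped_tids.add(tid)

def kernel_should_drop(tid: int) -> bool:
    # Kernel-side check: a single hash lookup per event.
    return tid in dropped_tids

def on_thread_exit(tid: int) -> None:
    # Free stale entries so the table does not grow without bound.
    dropped_tids.discard(tid)
```

The synchronization problem Andrea mentions is visible even here: events emitted between process start and the userspace `userspace_mark` call would still reach userspace, so the filter is best-effort by construction.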

@Andreagit97
Member

Andreagit97 commented Jun 11, 2024

I was looking again at some of your data here (https://gist.github.com/stevenbrz/69391aa71b22d205b6add88ae10fb905#file-gistfile1-txt-L124)

In more detail, I can see that in many cases it is the close syscall that causes a lot of drops.
Looking at one of them in more detail:

base_syscalls: [clone, clone3, fork, vfork, execve, execveat, getsockopt, socket, connect, close]

{
  "hostname": "{redacted}",
  "output": "Falco metrics snapshot",
  "output_fields": {
    "evt.source": "syscall",
    "evt.time": 1701980769449620870,
    "falco.cpu_usage_perc": 31.5,
    "falco.duration_sec": 115,
    "falco.evts_rate_sec": 2128166.9561680453,
    "falco.host_boot_ts": 1700163046000000000,
    "falco.host_num_cpus": 128,
    "falco.hostname": "{redacted}",
    "falco.kernel_release": "6.1.52",
    "falco.memory_pss": 452,
    "falco.memory_rss": 712,
    "falco.memory_vsz": 1013,
    "falco.num_evts": 58861475,
    "falco.num_evts_prev": 37580082,
    "falco.outputs_queue_num_drops": 0,
    "falco.start_ts": 1701980653673532364,
    "falco.version": "0.36.2",
    "scap.engine_name": "modern_bpf",
    "scap.evts_drop_rate_sec": 21251559.170386888,
    "scap.evts_rate_sec": 23875792.173082028,
    "scap.n_drops": 445590948,
    "scap.n_drops_buffer_clone_fork_enter": 5459,
    "scap.n_drops_buffer_clone_fork_exit": 12140,
    "scap.n_drops_buffer_close_exit": 222665600,
    "scap.n_drops_buffer_connect_enter": 6262,
    "scap.n_drops_buffer_connect_exit": 6303,
    "scap.n_drops_buffer_execve_enter": 2675,
    "scap.n_drops_buffer_execve_exit": 3030,
    "scap.n_drops_buffer_other_interest_enter": 240,
    "scap.n_drops_buffer_other_interest_exit": 241,
    "scap.n_drops_buffer_proc_exit": 5869,
    "scap.n_drops_buffer_total": 445590948,
    "scap.n_drops_perc": 89.0088128441085,
    "scap.n_drops_prev": 233078118,
    "scap.n_evts": 511333930,
    "scap.n_evts_prev": 272579111
  },
  "priority": "Informational",
  "rule": "Falco internal: metrics snapshot",
  "source": "internal",
  "time": "2023-12-07T20:26:09.449620870Z"
}

You have 222665600 close exit events in 115 sec... and the close syscall is something that we probably cannot filter with the proposed kernel filtering solution, since its only parameter is the fd... uhm, just thinking about what we could do to mitigate this scenario...

@incertum
Contributor

incertum commented Jun 14, 2024

We urgently need such capabilities. I just wanted to reiterate this!

@stevenbrz

It doesn't look like there's a clear correlation between event rate and drop rate:

I agree, bursts are likely a bigger problem.

@Andreagit97

At the moment I have no precise idea about the implementation

What I'm curious to understand is how the evt/s rate changes with the kernel filtering solution. The idea is to understand how effective the solution is in real-world scenarios.

Of course, I would prefer a whitelist more than a blacklist approach, for example, we could take only paths under /etc but this would be a huge hole in the security posture since we don't resolve symlinks and we could have relative paths...

I agree with all of the above.

The other thing that I had in mind was a filter on the thread-id.

I agree as well. It could be useful to base the high-volume tid identification on a robust combination of fields. However, the downside is that the events would need to reach userspace first. From a security perspective, you need to be careful not to accidentally filter out security-relevant information. For a kernel-side filtering goal, especially in cases with high kernel-side drops, it would be more robust to avoid relying on userspace.

the close syscall is something that we probably cannot filter with the proposed kernel filtering solution since its unique parameter is the fd...uhm just thinking about what we could do to mitigate this scenario...

This is a general problem for kernel-side filtering, as you only have certain information available at a time.


During our last core maintainer call, Andrea, others, and I discussed the proposal a bit. We generally agreed that a hashing approach could be more powerful than string comparison. It would also align this approach with the existing userspace tid suppression and the userspace anomaly detection approach that is currently being developed.

Based on that, I researched a bit and would like to share my initial ideas:

Starting with kernel 5.16 (would this work for you, btw?), we can use an official eBPF-supported approach: https://docs.kernel.org/bpf/map_bloom_filter.html. The advantages include using an established, official method instead of something custom. Additionally, a bloom filter would be an excellent fit because it is a proven algorithm, useful for many problems, that works effectively in production.

For example, with your current approach, you are limited in the number of patterns you can push into the kernel (I read a maximum of 12). This limitation will hinder your ability to achieve the desired results. Instead, with a bloom filter, you could shuffle many distinct patterns into the filter(s), and you would not be limited to string types either.

However, I also read that you wish to filter on variable-length file prefixes. What I was thinking is to traverse the filename and perform a lookup after each sub-directory, and finally on the full path. For example, for ../sys/fs/cgroup, we would do the first lookup for ../, then ../sys/, then ../sys/fs/, and finally ../sys/fs/cgroup. That way we may achieve results similar to the current approach (Big O should be comparable; maybe you know better).
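The prefix-walk lookup can be sketched in a few lines. Here a Python set stands in for `BPF_MAP_TYPE_BLOOM_FILTER` (a real bloom filter can return false positives but never false negatives, which seems acceptable here since a false positive only drops one extra event); the function name and the example prefixes are illustrative assumptions.

```python
# Sketch of the prefix-walk idea: look up each sub-directory prefix of the
# path in turn, and finally the full path. A Python set stands in for
# BPF_MAP_TYPE_BLOOM_FILTER, so membership tests are exact here.

PREFIX_FILTER = {"/proc/", "/sys/fs/cgroup/"}  # example entries only

def path_matches(path: str, bloom=PREFIX_FILTER) -> bool:
    # For /sys/fs/cgroup/x this tests /sys/, /sys/fs/, /sys/fs/cgroup/,
    # then the full path: O(directory depth) lookups per event.
    end = 0
    while True:
        end = path.find("/", end + 1)
        if end == -1:
            return path in bloom        # last chance: the full path
        if path[: end + 1] in bloom:
            return True
```

In the BPF program the loop would need a fixed upper bound on depth to satisfy the verifier, but the per-iteration work stays a single map lookup instead of a string comparison against every pattern.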

As Andrea pointed out, we would also need correlated filtering, such as not sending the close event for fds whose open call we dropped.
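One way to picture the correlated open/close filtering is to remember the (tid, fd) pairs of suppressed opens so the matching close can be suppressed too. The sketch below is a hypothetical Python illustration (a set standing in for a BPF hash map; `on_open_exit`/`on_close_exit` are invented names), not a proposed implementation.

```python
# Sketch of correlated filtering: when an open event is dropped by a path
# filter, remember its (tid, fd) so the matching close event is dropped
# too. A Python set stands in for a BPF hash map keyed by (tid, fd).

suppressed_fds = set()

def on_open_exit(tid: int, fd: int, path: str, drop_path) -> bool:
    """Return True if the open event should be dropped."""
    if fd >= 0 and drop_path(path):      # only track successful opens
        suppressed_fds.add((tid, fd))
        return True
    return False

def on_close_exit(tid: int, fd: int) -> bool:
    """Return True if the close event should be dropped."""
    if (tid, fd) in suppressed_fds:
        suppressed_fds.discard((tid, fd))  # the fd is invalid after close
        return True
    return False
```

Entries are removed as soon as the close is seen, so the map stays roughly bounded by the number of currently open filtered fds; dup/exec edge cases would need extra care in a real implementation.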


nit: we have robust methods in place to map a syscall's string name to its id, hence the config could accept the syscall name instead of the number.

@stevenbrz
Author

Yeah, I can work on some more systematic tests given the bulk results were quite inconclusive. Since our systems run varied workloads, I think it may be better to use synthetic tests like stress-ng, so I'll pursue that.

As for the feedback on the filtering implementation, I agree hashing is vastly preferable to string comparisons. I feel a bloom filter may be overkill, though, since we could just throw every prefix into a hash table and check for its presence kernel-side one directory at a time, as you highlighted above. The limitation on the number of prefixes comes from the verifier's complexity bound when doing linear string comparisons.

We'd still have to do string parsing there, but it should be a lot more efficient given we won't have a nested loop.

@incertum
Contributor

incertum commented Jun 17, 2024

As for the feedback on the filtering implementation, I agree hashing is vastly preferable to string comparisons. I feel a bloom filter may be overkill, though, since we could just throw every prefix into a hash table and check for its presence kernel-side one directory at a time, as you highlighted above. The limitation on the number of prefixes comes from the verifier's complexity bound when doing linear string comparisons.

More thoughts in favor of using the eBPF bloom filter:

  1. An official implementation of this filtering approach would need to be flexible and suitable for many adopters beyond one use case
  2. Memory safety, scalability, and other benefits, e.g. https://boyter.org/posts/bloom-filter/

On that note: I would be curious to see if you could move the needle even more with many more prefixes. It could also help to avoid accidentally filtering out important information, given that the current prefixes are high-level (for example, filtering out the entire /proc folder prevents you from implementing some detections).

We'd still have to do string parsing there, but it should be a lot more efficient given we won't have a nested loop.

Yes, plus this approach can easily be extended to non-string keys; for example, IPs could be a very valuable use case for many adopters.

@poiana
Contributor

poiana commented Sep 15, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@Andreagit97
Member

/remove-lifecycle stale
