Add a unit test for missing exec event in the eventcache #283
Just a minor comment from me.
51a05b4 to c7855b0
I'm somehow missing how a msg in the eventcache gets 're-added' to the cache. The eventcache kicks the event out to the NotifyListeners; can these do an Add() in the eventcache again?
Aha, so I'm fairly sure the issue here is that the dummyInterface is calling HandleMessage, which does not reflect what the actual code run from main does. The flow is pkg/exporter/exporter.go: Implements the Send() and Start() methods to satisfy the interface
Fixing the dummy notifier should resolve the issue, and then we don't need patch 3.
ee4b2c4 to 6129ed4
2ff5479 to 17d4d16
fc6acdb to e93f662
As we have more caches in this path (i.e. procCache in pkg/process/process.go) let's use a different PID in all tests in order not to have any interference. Signed-off-by: Anastasios Papagiannis <[email protected]>
This unit-test emulates the case where we get an exit event but we miss the exec event. In that case, we should wait for the cache iterations to end and then produce an exit event with partial process info, i.e. only the process start time and the process pid. Signed-off-by: Anastasios Papagiannis <[email protected]>
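The behaviour this test targets can be sketched in isolation. The following is a minimal model, not the tetragon eventcache: the `entry`, `cache`, and `exitEvent` types and the `retry` method are hypothetical names, and only the value 15 for `cacheStrikes` comes from this PR. It shows an exit event that never sees its exec event being held for `cacheStrikes` iterations and then released with partial info.

```go
package main

import "fmt"

// cacheStrikes is the retry count used in this PR; everything else
// in this sketch is a hypothetical stand-in for the real eventcache.
const cacheStrikes = 15

// exitEvent carries the fields we always have (pid, start time) plus
// the binary name, which is only known if the exec event was seen.
type exitEvent struct {
	pid       uint32
	startTime uint64
	binary    string
	partial   bool
}

// entry is a cached event together with the number of failed retries.
type entry struct {
	ev      exitEvent
	strikes int
}

type cache struct {
	execs   map[uint32]string // pid -> binary, filled by exec events
	pending []entry
}

// retry runs one cache iteration and returns the events that are
// ready to be exported, either complete or marked as partial.
func (c *cache) retry() []exitEvent {
	var out []exitEvent
	var keep []entry
	for _, e := range c.pending {
		if bin, ok := c.execs[e.ev.pid]; ok {
			e.ev.binary = bin
			out = append(out, e.ev)
			continue
		}
		e.strikes++
		if e.strikes >= cacheStrikes {
			// Give up waiting for exec: export what we have.
			e.ev.partial = true
			out = append(out, e.ev)
			continue
		}
		keep = append(keep, e)
	}
	c.pending = keep
	return out
}

func main() {
	c := &cache{execs: map[uint32]string{}}
	c.pending = append(c.pending, entry{ev: exitEvent{pid: 46987, startTime: 21034975089403}})
	for i := 1; i <= cacheStrikes; i++ {
		for _, ev := range c.retry() {
			fmt.Printf("iteration %d: pid=%d partial=%v\n", i, ev.pid, ev.partial)
		}
	}
}
```

The pid and start time above are arbitrary illustration values.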
Before this commit the high level view of the path that we follow in the unit-tests in pkg/grpc/exec/exec_test.go is:

    HandleMessage() -> ExportEvent
          |
          v
    eventcache()
          |
          v
    HandleMessage() -> ExportEvent

This is not the same as what the tetragon code does. The path in tetragon is:

    HandleMessage() -> ExportEvent
          |
          v
    eventcache() -> ExportEvent

This mismatch also causes the failure of a unit-test introduced in the previous commit (TestGrpcMissingExec). This commit fixes NotifyListener() to reflect the tetragon codepath. Signed-off-by: Anastasios Papagiannis <[email protected]>
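A toy model of the two paths may help when reading this commit. It is a sketch under stated assumptions: the `pipeline` type, its `flush` method, and the `reenter` flag are hypothetical, and the idea that an incomplete event lands in the cache again when it is fed back through `HandleMessage` is an assumption about why the test failed, not something the commit message states.

```go
package main

import "fmt"

// event is a hypothetical stand-in; hasExec is false when the exec
// event for this process was never seen.
type event struct {
	pid     uint32
	hasExec bool
}

type pipeline struct {
	cached   []event
	exported []event
}

// HandleMessage exports complete events and caches incomplete ones.
func (p *pipeline) HandleMessage(ev event) {
	if !ev.hasExec {
		p.cached = append(p.cached, ev)
		return
	}
	p.exported = append(p.exported, ev)
}

// flush models the cache giving up on its pending events.
// reenter=true is the old test path (back through HandleMessage);
// reenter=false is the tetragon path (straight to export).
func (p *pipeline) flush(reenter bool) {
	pending := p.cached
	p.cached = nil
	for _, ev := range pending {
		if reenter {
			p.HandleMessage(ev)
		} else {
			p.exported = append(p.exported, ev)
		}
	}
}

func main() {
	for _, reenter := range []bool{true, false} {
		p := &pipeline{}
		p.HandleMessage(event{pid: 46987})
		p.flush(reenter)
		fmt.Printf("reenter=%v exported=%d cached=%d\n", reenter, len(p.exported), len(p.cached))
	}
}
```

In this model the old path never exports the event, while the direct path exports it after one flush.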
9a20c72 to f63e26f
Thanks!
Nice catch on the timing issue. I believe this will fix the linked issues.
@kkourt It seems that there are still some flakes related to these issues. This makes things better, but I believe that it does not solve them entirely. I'm working on that (this is the reason this PR is in draft).
ce9e722 to dd39a9c
c59b250 to 8acb4c9
Before this commit eventcache tries to purge any events every 10 seconds. This commit reduces this interval to 2 seconds. There is no need to wait for 10 seconds to get an event. We keep the same 30 seconds timeout for missing exec events by increasing the tries from 3 to 15. Signed-off-by: Anastasios Papagiannis <[email protected]>
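The arithmetic behind keeping the 30-second timeout can be checked with a few lines. This is only a sketch; `totalWait` is a hypothetical helper, and the numbers (3 tries at 10 seconds, 15 tries at 2 seconds) are the ones quoted in the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// totalWait returns the longest time an event can stay in the cache:
// one retry interval per try.
func totalWait(tries int, interval time.Duration) time.Duration {
	return time.Duration(tries) * interval
}

func main() {
	before := totalWait(3, 10*time.Second)
	after := totalWait(15, 2*time.Second)
	fmt.Println(before, after, before == after)
}
```

The overall timeout is unchanged, but an event whose exec arrives late can now leave the cache within about 2 seconds instead of about 10.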
We do this so that we can use the value in the json checker and in testing. See the next two commits for details. Signed-off-by: Anastasios Papagiannis <[email protected]>
8acb4c9 to 4dfab55
In the case where we have an OOO exec event, we have to wait for the eventcache which is triggered every eventcache.EventRetryTimer (2) seconds. This commit makes the delay between retries in jsonchecker to be related to the eventcache timeout (+1 second just to be sure). We also reduce the number of retries to keep the total time about the same as before. Partially FIXES: #285 FIXES: #247 Signed-off-by: Anastasios Papagiannis <[email protected]>
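As a rough illustration of how the retry delay and retry count relate, here is a minimal sketch. The 2-second `eventRetryTimer` and the one-second margin come from the commit message; `retryDelay`, `retriesFor`, and the 30-second budget are hypothetical and are not the jsonchecker's actual values.

```go
package main

import (
	"fmt"
	"time"
)

// eventRetryTimer mirrors the 2-second eventcache interval from this PR.
const eventRetryTimer = 2 * time.Second

// retryDelay is one eventcache interval plus a one-second safety margin.
func retryDelay() time.Duration {
	return eventRetryTimer + time.Second
}

// retriesFor returns how many retries fit in a total time budget.
// The budget is a hypothetical parameter for this sketch.
func retriesFor(budget time.Duration) int {
	return int(budget / retryDelay())
}

func main() {
	budget := 30 * time.Second // hypothetical total wait
	fmt.Printf("delay=%v retries=%d\n", retryDelay(), retriesFor(budget))
}
```

Deriving the delay from the cache interval means a change to the interval does not silently make the checker give up before the cache has had a chance to run.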
For missing events we used a fixed amount of time to wait. This commit makes that proportional to eventcache.CacheStrikes. Signed-off-by: Anastasios Papagiannis <[email protected]>
When we run Go tests, we start a single process that runs all tests. For each test case we start a new observer that reads all existing processes from /proc. When a test case ends, the next observer keeps the previous process cache and adds all existing processes from /proc again (without any cleanup). This causes ref counts to drop to 0, and thus these processes are evicted from the cache. This leads to kprobe events that cannot find process info and for this reason stay in the event cache for a long time. To solve this issue, when we start a new observer, even if the process cache already exists, we clean it up and initialize a new, clean proc cache for each test case. This commit also fixes process.cacheGarbageCollector, where it calls .Purge() recursively instead of calling .Purge() for pc.cache and pc.pidMap. Hopefully FIXES: #285 FIXES: #247 Signed-off-by: Anastasios Papagiannis <[email protected]>
4dfab55 to 2557de2
Fixed dummy notifier and addressed the other comments. Ready for review now.
    // garbage collection retries
    cacheStrikes = 15
    // garbage collection run interval
    eventRetryTimer = time.Second * 2
The 10 seconds was about balancing GC performance cost vs. getting events out. I think we should add a configMap option for this.
Created #296 to keep track of that.
This all looks good to me, with one question about making the cache retry time configurable. For testing and watching results over gRPC or the command line it's nice to have the system be more reactive, but when writing to logs for off-system analysis I'm not sure it matters much if it's 2 seconds or 2 minutes.
This unit-test emulates the case where we get an exit event but we miss the exec event. In that case, we should wait for the cache iterations to end and then produce an exit event with partial process info. In that case we only have the process start time and the process pid.
This PR also fixes #285 and #247. OOO events were related to these issues, so I don't use a separate PR for them. To test this I ran 8 consecutive successful runs in the CI, whereas before a kprobe flake appeared roughly every 3 runs.
Signed-off-by: Anastasios Papagiannis [email protected]