Add a unit test for missing exec event in the eventcache #283

tpapagian · 2022-08-03T12:43:26Z

This unit-test emulates the case where we get an exit event but we miss the exec event. In that case, we should wait for the cache iterations to end and then produce an exit event with partial process info. In that case we only have the process start time and the process pid.

This PR also fixes #285 and #247. OOO events was related to these issues and thus I don't use a separate PR for them. To test this I have run 8 successful consecutive runs in the CI while before a kprobe flake appears every ~3 runs.

Signed-off-by: Anastasios Papagiannis [email protected]

willfindlay

Just a minor comment from me.

pkg/api/readyapi/readyapi.go

jrfastab · 2022-08-03T17:47:30Z

I'm somehow missing how a msg in the eventcache gets 'readded' to the cache. The eventcache kicks the event out to the NotifyListeners these can do an Add() in the eventcache again?

jrfastab · 2022-08-03T18:19:44Z

Aha so fairly sure the issue here is the dummyInterface is calling HandleMessage which does not reflect what the actual code does run from main does. The flow is HandleMessage -> Exporter with some constructors through processManager to add the exporter as a listener and then server.go used to construct the exporter correctly. To follow along the relevant files are:

pkg/exporter/exporter.go: Implements the Send() method and Start() methods to satisfy the interface
pkg/server/server.go: Embeds into exporter above and used by exporter to do aggregation and filtering
pkg/grpc//process_manager.go: Has the Add logic to add multiple listeners. This lets us run both gRPC listeners, dbg listeners and json writers.

jrfastab

Fixing dummy notifier should resolve issue and then we don't need patch 3.

As we have more caches in this path (i.e. procCache in pkg/process/process.go) let's use a different PID in all tests in order not to have any interference. Signed-off-by: Anastasios Papagiannis <[email protected]>

This unit-test emulates the case where we get an exit events but we miss the exec event. In that case, we should wait for the cache iterations to end and then produce an exit event with partial process info. In that case we only have the process start time and the process pid. Signed-off-by: Anastasios Papagiannis <[email protected]>

Before this commit the high level view of the path that we follow in the unit-tests in pkg/grpc/exec/exec_test.go is: HandleMessage() -> ExportEvent | v eventcache() | v HandleMessage() -> ExportEvent This is not the same as what the tetragon code does. The path in tetragon is: HandleMessage() -> ExportEvent | v eventcache() -> ExportEvent Which also results in the failure of a unit-test introduced in the previous commit (TestGrpcMissingExec). This commit fixes NotifyListener() to reflect the tetragon codepath. Signed-off-by: Anastasios Papagiannis <[email protected]>

kkourt

Thanks!

Nice catch on the timing issue. I believe this will fix the linked issues.

pkg/eventcache/eventcache.go

tpapagian · 2022-08-05T15:13:37Z

Nice catch on the timing issue. I believe this will fix the linked issues.

@kkourt It seems that it still have some flakes related to these issues. This makes things better but I believe that it does not solve this entirely. Working on that (this is the reason that this PR is in draft).

pkg/eventcache/eventcache.go

pkg/process/cache.go

pkg/eventcache/eventcache.go

Before this commit eventcache tries to purge any events every 10 seconds. This commit reduces this interval to 2 seconds. There is no need to wait for 10 seconds to get an event. We keep the same 30 seconds timeout for missing exec events by increasing the tries from 3 to 15. Signed-off-by: Anastasios Papagiannis <[email protected]>

We do this so that we can use the value in the json checker and in testing. See next two commit for details. Signed-off-by: Anastasios Papagiannis <[email protected]>

In the case where we have an OOO exec event, we have to wait for the eventcache which is triggered every eventcache.EventRetryTimer (2) seconds. This commit makes the delay between retries in jsonchecker to be related to the eventcache timeout (+1 second just to be sure). We also reduce the number of retries to keep the total time about the same as before. Partially FIXES: #285 FIXES: #247 Signed-off-by: Anastasios Papagiannis <[email protected]>

For missing events we used a fix amount of time to wait. This commit makes that proportional to eventcache.CacheStrikes. Signed-off-by: Anastasios Papagiannis <[email protected]>

When we run Go tests, we start a process that will run all tests. For each case we start a new observer that reads all existing processes from /proc. When a test case ends we start again a new observer that keeps the previous process cache and we add again all existing processes from /proc (wihout any cleanup). This causes ref counts to be 0 and thus these processes evicted from the cache. This leads to kprobe events that cannot find process info and for this reason stay for a large period of time in the event cache. To solve this issue, when we start a new observer, even in the case where the process cache already exist, we cleanup it and we initialize a new clean proc cache for each test case. This commit also files process.cacheGarbageCollector where it calls .Purge() recursively insead of calling .Purge() for pc.cache and pc.pidMap. Hopefully FIXES: #285 FIXES: #247 Signed-off-by: Anastasios Papagiannis <[email protected]>

tpapagian · 2022-08-06T11:31:36Z

Fixing dummy notifier should resolve issue and then we don't need patch 3.

Fixed dummy notifier and addressed the other comments. Ready for review now.

jrfastab · 2022-08-08T15:45:34Z

pkg/eventcache/eventcache.go

+	// garbage collection retries
+	cacheStrikes = 15
+	// garbage collection run interval
+	eventRetryTimer = time.Second * 2


The 10seconds was about balancing GC performance cost vs getting events out. We should add a configMap option for this I think.

Created #296 to keep track of that.

jrfastab · 2022-08-08T15:51:48Z

This all looks good to me. With one question about making the time to retry cache configurable. For testing and watching results over grpc or cmd line its nice to have the system be more reactive, but when writing to logs for offsystem analysis I'm not sure it matters much if its 2 seconds or 2minutes.

tpapagian requested a review from jrfastab August 3, 2022 12:43

tpapagian requested a review from a team as a code owner August 3, 2022 12:43

willfindlay approved these changes Aug 3, 2022

View reviewed changes

pkg/api/readyapi/readyapi.go Outdated Show resolved Hide resolved

tpapagian force-pushed the pr/apapag/cache_missing_exec branch from 51a05b4 to c7855b0 Compare August 3, 2022 15:30

jrfastab requested changes Aug 3, 2022

View reviewed changes

tpapagian marked this pull request as draft August 4, 2022 06:30

tpapagian force-pushed the pr/apapag/cache_missing_exec branch 2 times, most recently from ee4b2c4 to 6129ed4 Compare August 4, 2022 07:19

tpapagian marked this pull request as ready for review August 4, 2022 07:32

tpapagian force-pushed the pr/apapag/cache_missing_exec branch 3 times, most recently from 2ff5479 to 17d4d16 Compare August 5, 2022 11:46

tpapagian marked this pull request as draft August 5, 2022 12:00

tpapagian force-pushed the pr/apapag/cache_missing_exec branch 2 times, most recently from fc6acdb to e93f662 Compare August 5, 2022 12:36

tpapagian added 3 commits August 5, 2022 12:54

Change PIDs used in different tests in pkg/grpc/exec/exec_test.go

44b4173

As we have more caches in this path (i.e. procCache in pkg/process/process.go) let's use a different PID in all tests in order not to have any interference. Signed-off-by: Anastasios Papagiannis <[email protected]>

tpapagian force-pushed the pr/apapag/cache_missing_exec branch 5 times, most recently from 9a20c72 to f63e26f Compare August 5, 2022 14:45

kkourt approved these changes Aug 5, 2022

View reviewed changes

pkg/eventcache/eventcache.go Outdated Show resolved Hide resolved

pkg/eventcache/eventcache.go Show resolved Hide resolved

tpapagian force-pushed the pr/apapag/cache_missing_exec branch 2 times, most recently from ce9e722 to dd39a9c Compare August 5, 2022 16:53

willfindlay reviewed Aug 5, 2022

View reviewed changes

pkg/eventcache/eventcache.go Outdated Show resolved Hide resolved

pkg/process/cache.go Outdated Show resolved Hide resolved

pkg/process/cache.go Outdated Show resolved Hide resolved

pkg/eventcache/eventcache.go Outdated Show resolved Hide resolved

tpapagian force-pushed the pr/apapag/cache_missing_exec branch 6 times, most recently from c59b250 to 8acb4c9 Compare August 6, 2022 10:08

tpapagian added 2 commits August 6, 2022 10:15

Make eventRetryTimer and cacheStrikes from eventcache public

6b6c80c

We do this so that we can use the value in the json checker and in testing. See next two commit for details. Signed-off-by: Anastasios Papagiannis <[email protected]>

tpapagian force-pushed the pr/apapag/cache_missing_exec branch from 8acb4c9 to 4dfab55 Compare August 6, 2022 10:18

tpapagian added 3 commits August 6, 2022 10:24

Fix cache timeouts in grpc/exec/exec_test.go

091095b

For missing events we used a fix amount of time to wait. This commit makes that proportional to eventcache.CacheStrikes. Signed-off-by: Anastasios Papagiannis <[email protected]>

tpapagian force-pushed the pr/apapag/cache_missing_exec branch from 4dfab55 to 2557de2 Compare August 6, 2022 10:25

tpapagian marked this pull request as ready for review August 6, 2022 11:27

tpapagian mentioned this pull request Aug 8, 2022

eventcachemetrics: fix eventcache and add error metrics #291

Merged

jrfastab reviewed Aug 8, 2022

View reviewed changes

jrfastab self-requested a review August 8, 2022 15:51

jrfastab approved these changes Aug 8, 2022

View reviewed changes

jrfastab merged commit 3d6995b into main Aug 8, 2022

jrfastab deleted the pr/apapag/cache_missing_exec branch August 8, 2022 15:52

tpapagian mentioned this pull request Aug 8, 2022

Make GC run time interval configurable in eventcache #296

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a unit test for missing exec event in the eventcache #283

Add a unit test for missing exec event in the eventcache #283

tpapagian commented Aug 3, 2022 •

edited

Loading

willfindlay left a comment

jrfastab commented Aug 3, 2022

jrfastab commented Aug 3, 2022

jrfastab left a comment

kkourt left a comment

tpapagian commented Aug 5, 2022

tpapagian commented Aug 6, 2022

jrfastab Aug 8, 2022

tpapagian Aug 8, 2022

jrfastab commented Aug 8, 2022

Add a unit test for missing exec event in the eventcache #283

Add a unit test for missing exec event in the eventcache #283

Conversation

tpapagian commented Aug 3, 2022 • edited Loading

willfindlay left a comment

Choose a reason for hiding this comment

jrfastab commented Aug 3, 2022

jrfastab commented Aug 3, 2022

jrfastab left a comment

Choose a reason for hiding this comment

kkourt left a comment

Choose a reason for hiding this comment

tpapagian commented Aug 5, 2022

tpapagian commented Aug 6, 2022

jrfastab Aug 8, 2022

Choose a reason for hiding this comment

tpapagian Aug 8, 2022

Choose a reason for hiding this comment

jrfastab commented Aug 8, 2022

tpapagian commented Aug 3, 2022 •

edited

Loading