Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a unit test for missing exec event in the eventcache #283

Merged
merged 8 commits into from
Aug 8, 2022

Conversation

tpapagian
Copy link
Member

@tpapagian tpapagian commented Aug 3, 2022

This unit-test emulates the case where we get an exit event but we miss the exec event. In that case, we should wait for the cache iterations to end and then produce an exit event with partial process info. In that case we only have the process start time and the process pid.

This PR also fixes #285 and #247. OOO events was related to these issues and thus I don't use a separate PR for them. To test this I have run 8 successful consecutive runs in the CI while before a kprobe flake appears every ~3 runs.

Signed-off-by: Anastasios Papagiannis [email protected]

@tpapagian tpapagian requested a review from jrfastab August 3, 2022 12:43
@tpapagian tpapagian requested a review from a team as a code owner August 3, 2022 12:43
Copy link
Contributor

@willfindlay willfindlay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor comment from me.

pkg/api/readyapi/readyapi.go Outdated Show resolved Hide resolved
@tpapagian tpapagian force-pushed the pr/apapag/cache_missing_exec branch from 51a05b4 to c7855b0 Compare August 3, 2022 15:30
@jrfastab
Copy link
Contributor

jrfastab commented Aug 3, 2022

I'm somehow missing how a msg in the eventcache gets 'readded' to the cache. The eventcache kicks the event out to the NotifyListeners these can do an Add() in the eventcache again?

@jrfastab
Copy link
Contributor

jrfastab commented Aug 3, 2022

Aha so fairly sure the issue here is the dummyInterface is calling HandleMessage which does not reflect what the actual code does run from main does. The flow is HandleMessage -> Exporter with some constructors through processManager to add the exporter as a listener and then server.go used to construct the exporter correctly. To follow along the relevant files are:

pkg/exporter/exporter.go: Implements the Send() method and Start() methods to satisfy the interface
pkg/server/server.go: Embeds into exporter above and used by exporter to do aggregation and filtering
pkg/grpc//process_manager.go: Has the Add logic to add multiple listeners. This lets us run both gRPC listeners, dbg listeners and json writers.

Copy link
Contributor

@jrfastab jrfastab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing dummy notifier should resolve issue and then we don't need patch 3.

@tpapagian tpapagian marked this pull request as draft August 4, 2022 06:30
@tpapagian tpapagian force-pushed the pr/apapag/cache_missing_exec branch 2 times, most recently from ee4b2c4 to 6129ed4 Compare August 4, 2022 07:19
@tpapagian tpapagian marked this pull request as ready for review August 4, 2022 07:32
@tpapagian tpapagian force-pushed the pr/apapag/cache_missing_exec branch 3 times, most recently from 2ff5479 to 17d4d16 Compare August 5, 2022 11:46
@tpapagian tpapagian marked this pull request as draft August 5, 2022 12:00
@tpapagian tpapagian force-pushed the pr/apapag/cache_missing_exec branch 2 times, most recently from fc6acdb to e93f662 Compare August 5, 2022 12:36
As we have more caches in this path (i.e. procCache in pkg/process/process.go)
let's use a different PID in all tests in order not to have any
interference.

Signed-off-by: Anastasios Papagiannis <[email protected]>
This unit-test emulates the case where we get an exit events but we
miss the exec event. In that case, we should wait for the cache
iterations to end and then produce an exit event with partial process
info. In that case we only have the process start time and the process
pid.

Signed-off-by: Anastasios Papagiannis <[email protected]>
Before this commit the high level view of the path that we follow in
the unit-tests in pkg/grpc/exec/exec_test.go is:

HandleMessage() -> ExportEvent
    |
    v
  eventcache()
    |
    v
HandleMessage() -> ExportEvent

This is not the same as what the tetragon code does. The path in tetragon
is:

HandleMessage() -> ExportEvent
    |
    v
  eventcache() -> ExportEvent

Which also results in the failure of a unit-test introduced in the
previous commit (TestGrpcMissingExec).

This commit fixes NotifyListener() to reflect the tetragon codepath.

Signed-off-by: Anastasios Papagiannis <[email protected]>
@tpapagian tpapagian force-pushed the pr/apapag/cache_missing_exec branch 5 times, most recently from 9a20c72 to f63e26f Compare August 5, 2022 14:45
Copy link
Contributor

@kkourt kkourt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Nice catch on the timing issue. I believe this will fix the linked issues.

pkg/eventcache/eventcache.go Outdated Show resolved Hide resolved
pkg/eventcache/eventcache.go Show resolved Hide resolved
@tpapagian
Copy link
Member Author

Nice catch on the timing issue. I believe this will fix the linked issues.

@kkourt It seems that it still have some flakes related to these issues. This makes things better but I believe that it does not solve this entirely. Working on that (this is the reason that this PR is in draft).

@tpapagian tpapagian force-pushed the pr/apapag/cache_missing_exec branch 2 times, most recently from ce9e722 to dd39a9c Compare August 5, 2022 16:53
pkg/eventcache/eventcache.go Outdated Show resolved Hide resolved
pkg/process/cache.go Outdated Show resolved Hide resolved
pkg/process/cache.go Outdated Show resolved Hide resolved
pkg/eventcache/eventcache.go Outdated Show resolved Hide resolved
@tpapagian tpapagian force-pushed the pr/apapag/cache_missing_exec branch 6 times, most recently from c59b250 to 8acb4c9 Compare August 6, 2022 10:08
Before this commit eventcache tries to purge any events every 10
seconds. This commit reduces this interval to 2 seconds. There is
no need to wait for 10 seconds to get an event.

We keep the same 30 seconds timeout for missing exec events by
increasing the tries from 3 to 15.

Signed-off-by: Anastasios Papagiannis <[email protected]>
We do this so that we can use the value in the json checker and in
testing. See next two commit for details.

Signed-off-by: Anastasios Papagiannis <[email protected]>
@tpapagian tpapagian force-pushed the pr/apapag/cache_missing_exec branch from 8acb4c9 to 4dfab55 Compare August 6, 2022 10:18
In the case where we have an OOO exec event, we have to wait for the eventcache
which is triggered every eventcache.EventRetryTimer (2) seconds.

This commit makes the delay between retries in jsonchecker to be related
to the eventcache timeout (+1 second just to be sure).

We also reduce the number of retries to keep the total time about the
same as before.

Partially
FIXES: #285
FIXES: #247

Signed-off-by: Anastasios Papagiannis <[email protected]>
For missing events we used a fix amount of time to wait. This commit
makes that proportional to eventcache.CacheStrikes.

Signed-off-by: Anastasios Papagiannis <[email protected]>
When we run Go tests, we start a process that will run all tests. For each
case we start a new observer that reads all existing processes from /proc.
When a test case ends we start again a new observer that keeps the previous
process cache and we add again all existing processes from /proc (wihout
any cleanup). This causes ref counts to be 0 and thus these processes evicted
from the cache.

This leads to kprobe events that cannot find process info and for this reason
stay for a large period of time in the event cache.

To solve this issue, when we start a new observer, even in the case where
the process cache already exist, we cleanup it and we initialize a new
clean proc cache for each test case.

This commit also files process.cacheGarbageCollector where it calls
.Purge() recursively insead of calling .Purge() for pc.cache and
pc.pidMap.

Hopefully
FIXES: #285
FIXES: #247

Signed-off-by: Anastasios Papagiannis <[email protected]>
@tpapagian tpapagian force-pushed the pr/apapag/cache_missing_exec branch from 4dfab55 to 2557de2 Compare August 6, 2022 10:25
@tpapagian tpapagian marked this pull request as ready for review August 6, 2022 11:27
@tpapagian
Copy link
Member Author

Fixing dummy notifier should resolve issue and then we don't need patch 3.

Fixed dummy notifier and addressed the other comments. Ready for review now.

// garbage collection retries
cacheStrikes = 15
// garbage collection run interval
eventRetryTimer = time.Second * 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 10seconds was about balancing GC performance cost vs getting events out. We should add a configMap option for this I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #296 to keep track of that.

@jrfastab
Copy link
Contributor

jrfastab commented Aug 8, 2022

This all looks good to me. With one question about making the time to retry cache configurable. For testing and watching results over grpc or cmd line its nice to have the system be more reactive, but when writing to logs for offsystem analysis I'm not sure it matters much if its 2 seconds or 2minutes.

@jrfastab jrfastab self-requested a review August 8, 2022 15:51
@jrfastab jrfastab merged commit 3d6995b into main Aug 8, 2022
@jrfastab jrfastab deleted the pr/apapag/cache_missing_exec branch August 8, 2022 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ci flake: TestKprobeObjectOpenMount occaisionally failing
4 participants