Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tetragon: Teach file writer to exit gracefully #185

Merged
merged 2 commits into from
Jun 28, 2022
Merged

Conversation

jrfastab
Copy link
Contributor

We observed that rarely if we SIGTERM Tetragon (e.g. what pod restart and destroy do before SIGKILL it) that a corrupted json file might get left behind. The theory is this is do to writers being killed in the middle of an operation.

To fix we do two things. First propagate a single context through the processManager and Server objects instead of just using context.Backgroun(). This allows the cancel() call from our SIGTERM handler to be handled by ctx.Done() in the respective go routines. Then we add a waitGroup to the signal handler to ensure it waits until the writer has exited.

I only fix the go routine responsible for writes here. We should audit the other routines as well.

And second I started thinking about how we nest the various objects, at the moment we pass a lot of input types through the New() constructors. But this looked bigger than a fix so held off for now on this.

jrfastab added 2 commits June 22, 2022 12:16
The context 'cancel()' done on closing does not propagate into the
ProcessManager and Server objects, which currently use the basic
context.Background(). The problem with this is on exit we hook
SIGTERM, etc. and call cancel() but, because this doesn't get pushed
back to the JSON writer object its possible to have partial writes
completed.

First to clean this lets ensure we use the ctx from main so cancel
will be pushed into these objects.

Signed-off-by: John Fastabend <[email protected]>
When we receive a signal -- sigterm for example -- we tear down the BPF
state and event handlers in go routines with cancel(). But, we do not
wait to ensure they have received the signal and done any cleanup needed.

This adds a waitgroup to use for this. So objects increment it on New()
create op and then call the Done() when they are complete. Ensuring we
don't exit in the middle of sensitive operations, such as writes to gRPC
streams and JSON files where we can generate corrupted events when we
are unlucky.

Signed-off-by: John Fastabend <[email protected]>
@jrfastab jrfastab requested a review from a team as a code owner June 22, 2022 23:54
@jrfastab jrfastab requested a review from tixxdz June 22, 2022 23:54
@jrfastab jrfastab force-pushed the pr/jrfastab/ctxprop branch from 7b8a132 to 7b4ea4d Compare June 23, 2022 00:17
@kkourt kkourt self-requested a review June 23, 2022 05:22
pkg/server/server.go Show resolved Hide resolved
pkg/server/server.go Outdated Show resolved Hide resolved
@jrfastab jrfastab force-pushed the pr/jrfastab/ctxprop branch from 7b4ea4d to 4b7c8d1 Compare June 24, 2022 05:41
@kkourt kkourt requested a review from willfindlay June 24, 2022 05:59
@jrfastab jrfastab merged commit d2de3fa into main Jun 28, 2022
@jrfastab jrfastab deleted the pr/jrfastab/ctxprop branch June 28, 2022 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants