Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job: Close the health reporter instead of stopping #20

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

joamaki
Copy link
Contributor

@joamaki joamaki commented Oct 10, 2024

When timer/one-shot/observer job finishes it should close the health reporting instead of stopping it in order not to accumulate endless number of health statuses for one-off etc. jobs.

When timer/one-shot/observer job finishes it should close the health
reporting instead of stopping it in order not to accumulate endless
number of health statuses for one-off etc. jobs.

Signed-off-by: Jussi Maki <[email protected]>
@joamaki joamaki requested a review from a team as a code owner October 10, 2024 15:49
@joamaki joamaki requested review from ovidiutirla and removed request for a team October 10, 2024 15:49
@joamaki
Copy link
Contributor Author

joamaki commented Oct 10, 2024

There's also another related issue on the cilium/cilium pkg/hive/health/provider.go side where the Stopped() method only marks StoppedAt time but doesn't set Level nor Message which can lead to confusion. Will do a separate PR for that.

(https://github.com/cilium/cilium/pull/35154/files#diff-f0f7633343b350adb76fe60850c162ed40b56279936699feffa2c0dce5587411)

@joamaki
Copy link
Contributor Author

joamaki commented Oct 10, 2024

I'm slightly unsure of this change. Do we want the jobs themselves to close their health scopes, or should we just have them call Stopped and we'd Close the whole tree when the job.Group is stopped? That might not ever happen though... it's more likely the group stays running and it's used to spawn a lot of one-shot jobs. I do see the benefit of seeing that a one-shot or a timer job was stopped... another way to solve this is to do garbage collection of stopped health entries after awhile...

My main concern is that right now it's way too easy to leave a lot of these around and if we ever start spawning bunch of these with generated names we'll end up with a huge health table, so I feel like erring on the side of cleaning up might be better. Thoughts?

@ovidiutirla
Copy link
Contributor

ovidiutirla commented Oct 13, 2024

I'm more into regular GC of the stopped health entries. Keeping their status stopped as long as the cilium-agent runs sound like it could easily become a memory leak.

JobGroups might never get stopped, I wouldn't rely on that. (agree on this)

Could we mark them as stopped, the metric should report that they were stopped, and immediately after closing them.
Or in between stop and close have a slight delay instead of having a periodic gc?

Closing them instead of reporting as stopped could produce some weird visuals on the metrics which are not that intuitive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants