Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crier failing to report OOMed pods #210

Open
Tracked by #295 ...
BenTheElder opened this issue Jul 25, 2024 · 14 comments
Open
Tracked by #295 ...

Crier failing to report OOMed pods #210

BenTheElder opened this issue Jul 25, 2024 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@BenTheElder
Copy link
Member

In prow.k8s.io we're seeing some pods fail to report status beyond "running" that have really been OOMed (if we manually inspect via the build cluster)

See: https://kubernetes.slack.com/archives/C7J9RP96G/p1721926824700569 and other #testing-ops therads

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 25, 2024
@jihoon-seo
Copy link
Contributor

I guess that you mean crier.. 😊
/retitle Crier failing to report OOMed pods

@dims
Copy link
Member

dims commented Jul 26, 2024

😢

@petr-muller
Copy link
Contributor

I don't think this is crier - that only comes later. I think this is podutils interacting with kubernetes/kubernetes#117793 .

NAME                                   READY   STATUS      RESTARTS   AGE
3b953040-572a-400f-8024-02310b8e54f8   1/2     OOMKilled   0          93m

The above (from the linked Slack thread) indicates one of the two containers in the Pod is still alive - sidecar. That container waits for entrypoint wrapper to write how the actual test process (entrypoints subprocess) ended, but with kubernetes/kubernetes#117793 all processes in the containers get OOMkilled, including entrypoint. In the past only the test process would be killed so entrypoint would be able to collect its exit code and exit normally.

Not sure about what would be the solution yet.

@BenTheElder
Copy link
Member Author

IMHO crier should still eventually report what happened here because it is observing the pod success / fail from the outside, but we see pods being perma reported as running (and if they really are running we should be enforcing a timeout via plank?)

We could either hand off more of this to crier / plank to observe the test container failure, or we could introduce additional heart beating / signaling between the entry point and sidecar?

@petr-muller
Copy link
Contributor

True; you are right that Plank should eventually reap a pod stuck like this - if that does not happen then it's perhaps a separate bug, not sure if plank just uses too soft signals (sidecar iirc ignores some signals in certain state when it thinks it still needs to finish uploading artifacts) or it somehow fails to enforce a timeout entirely.

@BenTheElder
Copy link
Member Author

Left a breadcrumb on kubernetes/kubernetes#117793 because I think it's worth noting when the project breaks itself with breaking changes :+)

This problem is probably going to get worse as cgroup v2 rolls out to the rest of our clusters and more widely in the ecosystem.

@BenTheElder
Copy link
Member Author

@BenTheElder
Copy link
Member Author

BenTheElder commented Aug 30, 2024

I'm seeing a lack of metadata (like the pod json) on pods that received SIGTERM and probably(?) weren't OOMed, at least there's no indication they would be, monitoring suggests they're actually not using much of the memory requested/limited.

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64/1829281539263827968

@BenTheElder
Copy link
Member Author

@oliver-goetz
Copy link
Contributor

True; you are right that Plank should eventually reap a pod stuck like this - if that does not happen then it's perhaps a separate bug, not sure if plank just uses too soft signals (sidecar iirc ignores some signals in certain state when it thinks it still needs to finish uploading artifacts) or it somehow fails to enforce a timeout entirely.

Afaik when a container is OOMKilled it exits with code 137. However, there is no signal sent to other containers of the same pod. Thus, we would need a kind of heart-beat between entrypoint and sidecar container if we would like to implement it there.

A potential plank implementation could check the Pod status for OOMKilled containers. It already checks for things like deleted pods, so it does not sound weird to implement the OOMKilled checker too. This could be done here and looks not that complicated. If it finds an OOMKilled container plank could set the prow job status to failed and delete the pod. The sidecar would have still time to collect and upload artifacts.

I prefer the second option. WDYT?

@BenTheElder
Copy link
Member Author

That sounds reasonable but I haven't been in the prow internals for some time now, @petr-muller @krzyzacy might have thoughts 🙏

@kannon92
Copy link

In 1.32, we have the oom behavior as configurable: kubernetes/kubernetes#126096.

@BenTheElder
Copy link
Member Author

In 1.32, we have the oom behavior as configurable: kubernetes/kubernetes#126096.

We use managed clusters for practical reasons, so we're generally getting the defaults (unless we convince EKS and GKE and someday AKS they should change the defaults).

@krzyzacy
Copy link
Contributor

/cc

will use this as a good ramp up issue and circle back in the next dev cycle

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

8 participants