@sprt sprt commented Feb 24, 2025

Merge Checklist
Summary

Backporting kata-containers#10911

Test Methodology

sprt added 3 commits February 24, 2025 10:31
In the CI, test containers intermittently fail to start after creation,
with an error like below (see kata-containers#10872 for more details):

  #     State:      Terminated
  #       Reason:   StartError
  #       Message:  failed to start containerd task "afd43e77fae0815afbc7205eac78f94859e247968a6a4e8bcbb987690fcf10a6": No such file or directory (os error 2)

I've observed this error reproduce in the following tests, whose
containers are all *very short-lived* by design (more tests might be
affected):

 * k8s-job.bats
 * k8s-seccomp.bats
 * k8s-hostname.bats
 * k8s-policy-job.bats
 * k8s-policy-logs.bats

Furthermore, appending a `; sleep 1` to the command line for those
containers seemed to consistently get rid of the error.

Investigating further, I've uncovered a race between the end of the container
process and the setting up of the cgroup watchers (to report OOMs).

If the process terminates first, the agent will try to watch cgroup
paths that don't exist anymore, and it will fail to start the container.
The added error context in notifier.rs confirms that the error comes
from the missing cgroup:

  https://github.com/kata-containers/kata-containers/actions/runs/13450787436/job/37585901466#step:17:6536
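
As a rough illustration of why path context helps here, the sketch below wraps an open failure so the message names the file that was missing. It uses only the standard library; `open_with_context` and the path are hypothetical and are not taken from notifier.rs, which may attach context differently.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Hypothetical helper: open a file and, on failure, rebuild the error so
/// that its message includes the path that could not be opened.
fn open_with_context(path: &Path) -> io::Result<File> {
    File::open(path).map_err(|e| {
        // Keep the original error kind, but prepend the path to the message.
        io::Error::new(e.kind(), format!("failed to open {}: {}", path.display(), e))
    })
}

fn main() {
    // Made-up path standing in for a cgroup file that no longer exists.
    let missing = Path::new("/nonexistent-cgroup-sketch/memory.events");
    if let Err(e) = open_with_context(missing) {
        eprintln!("{}", e);
    }
}
```

Without the wrapper, the caller would only see the bare "No such file or directory (os error 2)" text, as in the StartError message above.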

The fix is simply to create the watchers *before* we start the
container but still *after* we create it -- this is non-blocking, and IIUC the
cgroup is guaranteed to already be present then.
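
The ordering issue can be sketched with a small, self-contained simulation. This is not the agent's code: `Container`, `watch_oom_events`, and the use of a temporary directory as a stand-in for the cgroup are all assumptions made for illustration. It models the worst case deterministically, where the process has exited and its cgroup is gone by the time start returns.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for a container whose "cgroup" is a plain directory.
struct Container {
    cgroup_dir: PathBuf,
}

impl Container {
    /// "Create": the cgroup directory exists once this returns.
    fn create(cgroup_dir: &Path) -> io::Result<Container> {
        fs::create_dir_all(cgroup_dir)?;
        fs::write(cgroup_dir.join("memory.events"), "oom 0\n")?;
        Ok(Container { cgroup_dir: cgroup_dir.to_path_buf() })
    }

    /// "Start" a very short-lived process: in this model it has already
    /// exited, and its cgroup has been removed, by the time this returns.
    fn start_short_lived(&self) -> io::Result<()> {
        fs::remove_dir_all(&self.cgroup_dir)
    }
}

/// Hypothetical watcher setup: opening the file fails if the cgroup is gone.
fn watch_oom_events(cgroup_dir: &Path) -> io::Result<fs::File> {
    fs::File::open(cgroup_dir.join("memory.events"))
}

/// Racy ordering: start first, then set up the watcher.
fn start_then_watch(dir: &Path) -> io::Result<()> {
    let c = Container::create(dir)?;
    c.start_short_lived()?;
    watch_oom_events(&c.cgroup_dir)?; // fails: the path no longer exists
    Ok(())
}

/// Fixed ordering: set up the watcher after create but before start.
fn watch_then_start(dir: &Path) -> io::Result<()> {
    let c = Container::create(dir)?;
    let _watcher = watch_oom_events(&c.cgroup_dir)?; // cgroup still present
    c.start_short_lived()?;
    Ok(())
}

fn main() {
    let base = std::env::temp_dir()
        .join(format!("cgroup-race-sketch-{}", std::process::id()));
    let racy = start_then_watch(&base.join("racy"));
    let fixed = watch_then_start(&base.join("fixed"));
    println!("start then watch: {:?}", racy.map_err(|e| e.kind()));
    println!("watch then start: {:?}", fixed.map_err(|e| e.kind()));
    let _ = fs::remove_dir_all(&base);
}
```

In the real agent the outcome depends on timing, which is why the failure was intermittent and why `; sleep 1` hid it. The sketch only shows that setting up the watcher between create and start removes the dependency on how long the process lives.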

Fixes: kata-containers#10872

Signed-off-by: Aurélien Bombo <[email protected]>

Doesn't seem like we're going to use this and it's confusing when inspecting
code.

Signed-off-by: Aurélien Bombo <[email protected]>
@sprt sprt added the upstream/merged PRs that have been merged upstream label Feb 24, 2025
@sprt sprt requested review from a team as code owners February 24, 2025 16:35
@sprt sprt merged commit f142629 into msft-main Feb 26, 2025
86 of 107 checks passed