Kubelet: Always restart container in unknown state. #22607
Conversation
Labelling this PR as size/S
GCE e2e build/test passed for commit 45064d7.
cc @kubernetes/rh-cluster-infra @kubernetes/rh-platform-management @smarterclayton
```go
// Always restart container in unknown state now
if status.State == ContainerStateUnknown {
    return true
}
```
Shouldn't this check be after the restart policy checks?
No. If you do it after the restart policy checks, then the restart policy is honored, which isn't what we want here.
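For illustration, here is a minimal, self-contained sketch of the ordering being discussed. The types and the `shouldContainerBeRestarted` helper are simplified stand-ins for the kubelet's real ones, not this PR's actual code:

```go
package main

import "fmt"

// Simplified, illustrative types; the real ones live in the kubelet's
// container packages.
type ContainerState string
type RestartPolicy string

const (
	ContainerStateRunning ContainerState = "running"
	ContainerStateExited  ContainerState = "exited"
	ContainerStateUnknown ContainerState = "unknown"

	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

type ContainerStatus struct {
	State    ContainerState
	ExitCode int
}

// shouldContainerBeRestarted sketches the ordering under discussion: the
// unknown-state check runs before any restart-policy check, so even
// RestartPolicy=Never cannot suppress the retry of a container that was
// created but never successfully started.
func shouldContainerBeRestarted(status ContainerStatus, policy RestartPolicy) bool {
	// Always restart container in unknown state, regardless of policy.
	if status.State == ContainerStateUnknown {
		return true
	}
	// Running containers don't need a restart.
	if status.State == ContainerStateRunning {
		return false
	}
	// Exited containers honor the pod's restart policy.
	switch policy {
	case RestartPolicyNever:
		return false
	case RestartPolicyOnFailure:
		return status.ExitCode != 0
	default: // RestartPolicyAlways
		return true
	}
}

func main() {
	unknown := ContainerStatus{State: ContainerStateUnknown}
	// Prints true even under Never: the policy check is never reached.
	fmt.Println(shouldContainerBeRestarted(unknown, RestartPolicyNever))
}
```

If the unknown-state check were moved below the `switch`, `RestartPolicyNever` would return `false` first, which is exactly the behavior the PR is trying to avoid.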
Assuming it's possible to distinguish "created but failed to start" from "created and we're about to try to start", I would like to treat "created but failed to start" as something that is always retried.
Trying to understand: how is "created but failed to start" different from dead?
@aveshagarwal "created but failed to start" means the state is still created. In my test that I'm trying to reproduce, there is some issue in Docker where starting the container fails because it can't find a cgroup file. "dead" is created, started without issue, ran something, then exited. "Ran something" could return an error (executable not found, something else, whatever). But the container runtime successfully "started" the container prior to it exiting. |
It seems to me that what you are suggesting could be possible: at this point, a new state (createdbutfailedtostart) could be assigned instead of ContainerStateExited, and then it could be retried.
I'm not sure we need a new state necessarily, although it probably would make things clearer. However, this PR as-is does fix the problem I ran into.
We need to be able to distinguish between "created/failed to start" and "created/haven't started yet" so we don't inadvertently try to restart the latter.
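As a purely hypothetical sketch (not this PR's implementation), one way to tell those two "created" cases apart would be to record whether a start call was ever attempted for the container; the `containerRecord` type and `StartAttempted` flag below are invented for illustration:

```go
package main

import "fmt"

// containerRecord is a hypothetical bookkeeping struct, not a kubelet type.
type containerRecord struct {
	DockerStatus   string // e.g. "created", "running", "exited"
	StartAttempted bool   // set just before the runtime's start call
}

// createdButFailedToStart is true only for the case that should always be
// retried; a freshly created container that hasn't been started yet must
// not be "restarted" out from under the kubelet.
func createdButFailedToStart(r containerRecord) bool {
	return r.DockerStatus == "created" && r.StartAttempted
}

func main() {
	fresh := containerRecord{DockerStatus: "created"}
	failed := containerRecord{DockerStatus: "created", StartAttempted: true}
	fmt.Println(createdButFailedToStart(fresh))  // false: leave it alone
	fmt.Println(createdButFailedToStart(failed)) // true: always retry
}
```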
The following gist contains the e2e log and the portion of the kubelet log spanning the lifecycle of the pod in the e2e test. It did take just under 20 seconds for the replacement container to start for some reason.
I just realized that the failure in my e2e test loop wasn't 100% identical to before. This time, it was a failure to start the infra container (System error: The minimum allowed cpu-shares is 1024). It looks like the actual value set was 2. I guess this means I need to keep running my test loop to make sure I get the error condition I was expecting, and that this PR + @pmorie's PR to fix the downward API pod IP flake together fix the flake.
@ncdc, if I understand correctly, you are saying "created/haven't started yet" should not be restarted, right? But this PR seems to be doing exactly that: per the comment in the code in ./pkg/kubelet/dockertools/manager.go, "created/haven't started yet" is a ContainerStateUnknown, and this PR restarts the container in that scenario.
@aveshagarwal I want it to restart.
I am also a little confused by this one. To clarify a bit, there are 2 kinds of cases here.
For 1), the restart policy is always honored, before and now; the behavior has never changed. @ncdc It looks like what you hit should be 2), right?
@Random-Liu I'm talking about 1. In the past, when I encountered this sequence of events (container created, but the start failed), the kubelet created another container to replace the first container that was created but failed to start. I am 100% certain of this.
@ncdc Thanks, I'll look into it.
This makes sense and aligns with what I remember about the past behavior. @ncdc, has your issue been resolved with this PR?
@yujuhong I've yet to reproduce the exact same failure scenario as before. I will keep trying. Hopefully it will show up soon!
kubelet doesn't handle paused and restarting containers since we don't restart containers. We could potentially add support to recognize these containers and delete/recreate them; however, it's a bit of a stretch to assume kubelet can handle them correctly. AFAIK, we used to check whether the container was running, lump all non-running containers into one category, and treat them as "exited" containers.
Ideally we should be able to handle all states. In reality, our SyncPod makes a million changes in one iteration, and observing the intermediate state changes caused by one iteration doesn't help (yet).
We can treat them as exited containers like before, with the caveat of compatibility with rkt (below).
I am a bit concerned that rkt may report a non-running intermediate state after syncing the pod for the first time. @yifan-gu, is that possible?
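For reference, a toy mapping of docker's six status strings onto the three states discussed in this thread. Treating "dead" as unknown rather than exited is an assumption made for this sketch; the real mapping in pkg/kubelet/dockertools may group it differently:

```go
package main

import "fmt"

// toKubeletState is a toy mapping of docker's six status strings onto the
// three states discussed in this thread. Grouping "dead" under unknown is
// an assumption here, not a statement about the real kubelet code.
func toKubeletState(dockerStatus string) string {
	switch dockerStatus {
	case "running":
		return "ContainerStateRunning"
	case "exited":
		return "ContainerStateExited"
	default:
		// "paused", "restarting", "created", "dead": states the kubelet
		// doesn't explicitly handle get lumped into unknown.
		return "ContainerStateUnknown"
	}
}

func main() {
	for _, s := range []string{"paused", "restarting", "running", "dead", "created", "exited"} {
		fmt.Printf("%-10s -> %s\n", s, toKubeletState(s))
	}
}
```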
@k8s-bot test this
Tests are more than 48 hours old. Re-running tests.
GCE e2e build/test passed for commit 45064d7.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit 45064d7.
Automatic merge from submit-queue
Kubelet: Always restart container in unknown state.
Auto commit by PR queue bot
Now if a container is in some state which kubelet doesn't handle, it will be set as `ContainerStateUnknown`.

For docker, there are 6 states: "paused", "restarting", "running", "dead", "created", "exited". (See https://github.com/docker/docker/blob/master/container/state.go#L73)

In the current implementation:
- `ContainerStateRunning` => Kubelet never restarts it.
- `ContainerStateExited` => Kubelet restarts it with respect to `RestartPolicy`.
- `ContainerStateUnknown` => Undefined.

Before #21349, a container in unknown state was restarted with respect to `RestartPolicy` if there was an exited historical container. This behavior was a little weird and unclear.

After #21349, a container in unknown state will always be restarted with respect to `RestartPolicy`. However, @ncdc reports that this causes a container that was created but failed to start to never be restarted again.

This PR makes sure that we always restart containers in unknown state (see the sketch below).

RFC:
@kubernetes/sig-node
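Putting the mapping and the rule in this description together, a compact end-to-end illustration; `restartDecision` is an invented helper, and the grouping of docker states under unknown is the same assumption as in the earlier sketches:

```go
package main

import "fmt"

// restartDecision combines the mapping and the rule in this description:
// running -> never restart; exited -> honor RestartPolicy; everything else
// (unknown) -> always restart.
func restartDecision(dockerStatus, policy string, exitCode int) bool {
	switch dockerStatus {
	case "running":
		return false
	case "exited":
		switch policy {
		case "Never":
			return false
		case "OnFailure":
			return exitCode != 0
		default: // "Always"
			return true
		}
	default: // "paused", "restarting", "created", "dead" -> unknown
		return true
	}
}

func main() {
	// Print the full decision table for every docker state and policy.
	for _, s := range []string{"paused", "restarting", "running", "dead", "created", "exited"} {
		fmt.Printf("%-10s Never=%-5v OnFailure(exit=1)=%-5v Always=%v\n",
			s,
			restartDecision(s, "Never", 0),
			restartDecision(s, "OnFailure", 1),
			restartDecision(s, "Always", 0))
	}
}
```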