-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert 20202. Use other measures to prevent race in test-cmd.sh #21175
Revert 20202. Use other measures to prevent race in test-cmd.sh #21175
Conversation
Labelling this PR as size/M |
GCE e2e test build/test passed for commit f757bf7eb39c575accdafc6a1581b7d2c3d20e6d. |
return | ||
} | ||
// This doesn't entirely avoid the race condition, pod can be deleted and recreated after this check | ||
// but before the deletion request sent to the server. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Is it worth it to add this? ( @smarterclayton, opinions?)
It'd be better to add an optional UID precondition to delete. Like, add UID to the DeleteOptions type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do that for all pod operations? In kubelet, we also have to check if the pod we're getting from the apiserver before a status update is actually the pod we mean to update. The same applies to deletion. Since we use the pod UID as the key, it seems to make sense to be able to send a request to the apiserver with UID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be better to add an optional UID precondition to delete. Like, add UID to the DeleteOptions type.
I can do that. I thought it's an post 1.2 goal. Shall we do it for 1.2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's too risky for 1.2 right now. It's not a regression, it's just undesirable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'll revert the change to node controller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f757bf7
to
dae6091
Compare
@lavalamp I removed the node controller change. PTAL. Thanks. |
GCE e2e build/test failed for commit dae6091b565096511177abfbbc971bbcd807f3ec. |
dae6091
to
314a6ab
Compare
GCE e2e test build/test passed for commit 314a6ab. |
LGTM, thanks! |
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
GCE e2e test build/test passed for commit 314a6ab. |
@caesarxuchao |
1 similar comment
@caesarxuchao |
GCE e2e test build/test passed for commit 314a6ab. |
Still the same issue: /registry/Error deleting container: Error response from daemon: Conflict, You cannot remove a running container. Stop the container before attempting removal or use -f |
1 similar comment
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
GCE e2e test build/test passed for commit 314a6ab. |
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
GCE e2e test build/test passed for commit 314a6ab. |
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
GCE e2e build/test failed for commit 314a6ab. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake. |
GCE e2e build/test failed for commit 314a6ab. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake. |
GCE e2e build/test passed for commit 314a6ab. |
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
GCE e2e build/test failed for commit 314a6ab. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake. |
GCE e2e build/test failed for commit 314a6ab. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake. |
GCE e2e build/test passed for commit 314a6ab. |
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
GCE e2e build/test passed for commit 314a6ab. |
Automatic merge from submit-queue |
Auto commit by PR queue bot
I'm going to have to revert this change. After this went in, kubernetes-go-test started failing: 19:23:25 �(B+++ [0226 19:23:25] Testing resource aliasing |
Reverting made Jenkins go green again. My apologies, but you'll have to redo this PR. |
@fabioy Thanks for taking care of the builds. I think Jenkins didn't re-run the unit tests before merge so the failure was not caught. I'll patch the PR and submit it again. |
#20202 skips the update when a pod is deleted with grace period = 0. That breaks the non-graceful deletion, pods will get a default deletion grace period even if the grace period is set to 0. This PR adds back the update. So the race between API server and nodecontroller/kubelet in a non-graceful deletion comes back. I haven't taken any measure to fix this race in this PR, because this race only occurs if API server is context-switched after it has done the "update" but before it does the "delete", i.e. here, which is unlikely to happen.
Reverting #20202 also lets another kind of race come back: kubelet and nodecontroller might send a deletion request that ends up deleting a deleted and then recreated pod. Kubelet has taken measures to prevent such a race,
I made the node controller to do the same in the second commit.[Update] this is too risky for 1.2, I removed this commit.cc @mikedanese.
I also revised hack/test-cmd.sh to switch namespace before recreating a deleted pod to completely avoid the latter race condition when running that script.
@smarterclayton @yujuhong @lavalamp