Revert 20202. Use other measures to prevent race in test-cmd.sh #21175

Merged: 2 commits, Feb 27, 2016

caesarxuchao
Member

#20202 skips the update when a pod is deleted with grace period = 0. That breaks non-graceful deletion: pods get the default deletion grace period even when the grace period is explicitly set to 0. This PR restores the update, which also brings back the race between the API server and the node controller/kubelet during a non-graceful deletion. I have not taken any measures to fix that race in this PR, because it only occurs if the API server is context-switched after it has done the "update" but before it does the "delete" (i.e. here), which is unlikely to happen.

Reverting #20202 also brings back another kind of race: the kubelet and the node controller might send a deletion request that ends up deleting a pod that was deleted and then recreated. The kubelet already takes measures to prevent this race; I made the node controller do the same in the second commit. [Update] This is too risky for 1.2, so I removed that commit.

cc @mikedanese.

I also revised hack/test-cmd.sh to switch to a different namespace before recreating a deleted pod, which completely avoids the latter race condition when running that script.

@smarterclayton @yujuhong @lavalamp

@k8s-github-robot k8s-github-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 12, 2016
@k8s-github-robot

Labelling this PR as size/M

@k8s-bot

k8s-bot commented Feb 12, 2016

GCE e2e test build/test passed for commit f757bf7eb39c575accdafc6a1581b7d2c3d20e6d.

return
}
// This doesn't entirely avoid the race condition: the pod can be deleted and recreated
// after this check but before the deletion request is sent to the server.
Member

Hm. Is it worth it to add this? ( @smarterclayton, opinions?)

It'd be better to add an optional UID precondition to delete. Like, add UID to the DeleteOptions type.

Contributor

Can we do that for all pod operations? In kubelet, we also have to check if the pod we're getting from the apiserver before a status update is actually the pod we mean to update. The same applies to deletion. Since we use the pod UID as the key, it seems to make sense to be able to send a request to the apiserver with UID.

Member Author

It'd be better to add an optional UID precondition to delete. Like, add UID to the DeleteOptions type.

I can do that. I thought it was a post-1.2 goal. Should we do it for 1.2?

Contributor

I think it's too risky for 1.2 right now. It's not a regression, it's just undesirable.

Member Author

Ok. I'll revert the change to node controller.

Member Author

@yujuhong I added your concern with "Get" operations to #20572.

@caesarxuchao
Member Author

@lavalamp I removed the node controller change. PTAL. Thanks.

@k8s-bot

k8s-bot commented Feb 19, 2016

GCE e2e build/test failed for commit dae6091b565096511177abfbbc971bbcd807f3ec.

@k8s-bot

k8s-bot commented Feb 19, 2016

GCE e2e test build/test passed for commit 314a6ab.

@lavalamp lavalamp added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 19, 2016
@lavalamp
Member

LGTM, thanks!

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented Feb 20, 2016

GCE e2e test build/test passed for commit 314a6ab.

@k8s-github-robot

@caesarxuchao
You must link to the test flake issue which caused you to request this manual re-test.
Re-test requests should be in the form of: k8s-bot test this issue: #<number>
Here is the list of open test flakes.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2016
@k8s-bot

k8s-bot commented Feb 22, 2016

GCE e2e test build/test passed for commit 314a6ab.

@caesarxuchao
Member Author

@k8s-bot unit test this issue: #21451

@caesarxuchao
Member Author

Still the same issue: /registry/Error deleting container: Error response from daemon: Conflict, You cannot remove a running container. Stop the container before attempting removal or use -f
Unrecognized input header

@caesarxuchao
Member Author

@k8s-bot unit test this issue: #21451

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented Feb 25, 2016

GCE e2e test build/test passed for commit 314a6ab.

@yujuhong
Contributor

@k8s-bot unit test this issue: #21451

@yujuhong yujuhong added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 25, 2016
@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented Feb 25, 2016

GCE e2e test build/test passed for commit 314a6ab.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented Feb 25, 2016

GCE e2e build/test failed for commit 314a6ab.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

@caesarxuchao
Member Author

@k8s-bot test this please, issue: #20904, #21487

@k8s-bot

k8s-bot commented Feb 25, 2016

GCE e2e build/test failed for commit 314a6ab.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

@caesarxuchao
Member Author

@k8s-bot e2e test this, issue: #21753

@k8s-bot

k8s-bot commented Feb 26, 2016

GCE e2e build/test passed for commit 314a6ab.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented Feb 26, 2016

GCE e2e build/test failed for commit 314a6ab.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

@caesarxuchao
Member Author

@k8s-bot e2e test this, issue: #21753

@k8s-bot

k8s-bot commented Feb 27, 2016

GCE e2e build/test failed for commit 314a6ab.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

@caesarxuchao
Member Author

@k8s-bot e2e test this, issue: #21484

@k8s-bot

k8s-bot commented Feb 27, 2016

GCE e2e build/test passed for commit 314a6ab.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented Feb 27, 2016

GCE e2e build/test passed for commit 314a6ab.

@k8s-github-robot

Automatic merge from submit-queue

k8s-github-robot pushed a commit that referenced this pull request Feb 27, 2016
@k8s-github-robot k8s-github-robot merged commit a5ceafc into kubernetes:master Feb 27, 2016
@fabioy
Contributor

fabioy commented Feb 27, 2016

I'm going to have to revert this change. After this went in, kubernetes-go-test started failing:

19:23:25 +++ [0226 19:23:25] Testing resource aliasing
19:23:25 replicationcontroller "cassandra" created
19:23:25 I0226 19:23:25.736468 7531 event.go:211] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"default", Name:"cassandra", UID:"767d73df-dd01-11e5-befa-0242ac110003", APIVersion:"v1", ResourceVersion:"1034", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: cassandra-2wxmq
19:23:25 I0226 19:23:25.736613 7531 event.go:211] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"default", Name:"cassandra", UID:"767d73df-dd01-11e5-befa-0242ac110003", APIVersion:"v1", ResourceVersion:"1034", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: cassandra-hsdrt
19:23:26 replicationcontroller "cassandra" scaled
19:23:26 service "cassandra" created
19:23:26
19:23:26 FAIL!
19:23:26 Get all -l'app=cassandra' {{range.items}}{{range .metadata.labels}}{{.}}:{{end}}{{end}}
19:23:26 Expected: cassandra:cassandra:cassandra:
19:23:26 Got: cassandra:cassandra:cassandra:cassandra:
19:23:26
19:23:26 1563 ./hack/test-cmd.sh
19:23:26
19:23:26 !!! Error in ./hack/test-cmd.sh:51
19:23:26 'return 1' exited with status 1
19:23:26 Call stack:
19:23:26 1: ./hack/test-cmd.sh:51 runTests(...)
19:23:26 2: ./hack/test-cmd.sh:1636 main(...)
19:23:26 Exiting with status 1

@fabioy
Contributor

fabioy commented Feb 27, 2016

Reverting made Jenkins go green again. My apologies, but you'll have to redo this PR.

@caesarxuchao
Member Author

@fabioy Thanks for taking care of the builds. I think Jenkins didn't re-run the unit tests before merge so the failure was not caught. I'll patch the PR and submit it again.

@caesarxuchao caesarxuchao mentioned this pull request Feb 29, 2016