Bug 1884053: Configure CoreDNS to shut down gracefully #205
Conversation
This commit fixes bug 1884053. https://bugzilla.redhat.com/show_bug.cgi?id=1884053

* assets/dns/daemonset.yaml: Specify a termination grace period of 120 seconds. Change the readiness probe to use :8181/ready.
* pkg/manifests/bindata.go: Regenerate.
* pkg/operator/controller/controller_dns_configmap.go (corefileTemplate): Configure CoreDNS's health plugin to sleep 60 seconds when CoreDNS is shut down. Enable CoreDNS's ready plugin in order to provide a readiness endpoint on :8181/ready, which stops responding when CoreDNS is shutting down.
* pkg/operator/controller/controller_dns_configmap_test.go (TestDesiredDNSConfigmap): Adjust for changes to corefileTemplate.
* pkg/operator/controller/controller_dns_daemonset.go (daemonsetConfigChanged): Check whether the readiness probe or termination grace period changed, and update them if they did.
* pkg/operator/controller/controller_dns_daemonset_test.go (TestDaemonsetConfigChanged): Add test cases to verify that daemonsetConfigChanged detects updates to .spec.template.spec.containers[].readinessProbe.httpGet and .spec.template.spec.terminationGracePeriodSeconds.
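The daemonset change amounts to a manifest fragment roughly like the following. This is a sketch, not the full asset: the container name `dns` is an assumption, and the probe's period and failure threshold are the values discussed in review.

```yaml
# Sketch of the relevant daemonset fields (not the complete manifest).
spec:
  template:
    spec:
      # Give CoreDNS time to drain before the kubelet sends SIGKILL.
      terminationGracePeriodSeconds: 120
      containers:
      - name: dns  # container name is an assumption
        readinessProbe:
          httpGet:
            path: /ready
            port: 8181
          periodSeconds: 10
          failureThreshold: 3
```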
@Miciah: This pull request references Bugzilla bug 1884053, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@Miciah: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard.
@Miciah, is coredns/coredns#4099 a concern for us? It was rectified in upstream CoreDNS via coredns/coredns#4167, but the fix is currently not available in the openshift fork of CoreDNS.
```diff
  errors
- health
+ health {
+     lameduck 60s
```
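For context, the resulting server block would look roughly like the following. This is an illustrative sketch, not the operator's actual Corefile template: the port is an assumption, and only the errors, health, and ready stanzas reflect this PR.

```
.:5353 {
    errors
    health {
        # Keep serving for 60s after SIGTERM so in-flight queries drain.
        lameduck 60s
    }
    # The ready plugin serves :8181/ready by default and stops
    # responding once CoreDNS begins shutting down.
    ready
}
```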
I'm not trying to suggest that 60s is an unacceptable lameduck period, but I'm curious to hear how you arrived at this value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The readiness probe has a period of 10 seconds and a failure threshold of 3, so I figure it might take 30 seconds for the kubelet to recognize that the pod is unresponsive, and I doubled that for good measure.
Yeah, I saw that, but according to the report, the issue only applies if the connection is re-used. The issue report mentions that Azure re-uses the connection. I assume the kubelet does not re-use the connection, or else someone would have said something.
Alright, fair enough. Just wanted to double-check and raise awareness. Thanks!
/retest
Looks good!
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: Miciah, sgreene570

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
/retest

Please review the full test history for this PR and help us cut down flakes.
@Miciah: All pull requests linked via external trackers have merged: Bugzilla bug 1884053 has been moved to the MODIFIED state.
Likely fallout from this in https://bugzilla.redhat.com/show_bug.cgi?id=1893360 |
Revert inbound in #213 |
* assets/dns/daemonset.yaml: Specify a termination grace period of 120 seconds. Change the readiness probe to use :8181/ready.
* pkg/manifests/bindata.go: Regenerate.
* pkg/operator/controller/controller_dns_configmap.go (corefileTemplate): Configure CoreDNS's health plugin to sleep 60 seconds when CoreDNS is shut down. Enable CoreDNS's ready plugin in order to provide a readiness endpoint on :8181/ready, which stops responding when CoreDNS is shutting down.
* pkg/operator/controller/controller_dns_configmap_test.go (TestDesiredDNSConfigmap): Adjust for changes to corefileTemplate.
* pkg/operator/controller/controller_dns_daemonset.go (daemonsetConfigChanged): Check whether the readiness probe or termination grace period changed, and update them if they did.
* pkg/operator/controller/controller_dns_daemonset_test.go (TestDaemonsetConfigChanged): Add test cases to verify that daemonsetConfigChanged detects updates to .spec.template.spec.containers[].readinessProbe.httpGet and .spec.template.spec.terminationGracePeriodSeconds.

These changes make CoreDNS behave more correctly but do not seem to resolve BZ#1884053 completely: in testing, I still see that the DNS pod becomes unresponsive before the endpoints controller removes the pod's address from the endpoints when I reboot a node. It is not clear whether this is a problem with CoreDNS, the kubelet, the endpoints controller, MCO, or something else.
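For readers unfamiliar with the operator pattern described above, the following is a minimal sketch of the kind of field comparison a daemonsetConfigChanged-style check performs. It is not the operator's actual code: the type and function names here are illustrative stand-ins for the Kubernetes API types, and only the compared fields (readiness probe and termination grace period) come from this PR.

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the Kubernetes API types; the real operator
// compares corev1.Probe and the pod spec's TerminationGracePeriodSeconds.
type HTTPGetAction struct {
	Path string
	Port int
}

type Probe struct {
	HTTPGet          *HTTPGetAction
	PeriodSeconds    int32
	FailureThreshold int32
}

type PodSpec struct {
	TerminationGracePeriodSeconds *int64
	ReadinessProbe                *Probe
}

// configChanged reports whether the fields this PR cares about differ
// between the current and expected specs, so the operator knows when
// to push an update to the daemonset.
func configChanged(current, expected *PodSpec) bool {
	if !reflect.DeepEqual(current.ReadinessProbe, expected.ReadinessProbe) {
		return true
	}
	if !reflect.DeepEqual(current.TerminationGracePeriodSeconds, expected.TerminationGracePeriodSeconds) {
		return true
	}
	return false
}

func main() {
	grace := int64(120)
	oldGrace := int64(30)
	expected := &PodSpec{
		TerminationGracePeriodSeconds: &grace,
		ReadinessProbe: &Probe{
			HTTPGet:          &HTTPGetAction{Path: "/ready", Port: 8181},
			PeriodSeconds:    10,
			FailureThreshold: 3,
		},
	}
	current := &PodSpec{
		TerminationGracePeriodSeconds: &oldGrace,
		ReadinessProbe: &Probe{
			HTTPGet:          &HTTPGetAction{Path: "/health", Port: 8080},
			PeriodSeconds:    10,
			FailureThreshold: 3,
		},
	}
	fmt.Println(configChanged(current, expected)) // true: probe and grace period differ
	fmt.Println(configChanged(expected, expected)) // false: nothing changed
}
```

Using reflect.DeepEqual on the pointer fields means a nil probe and a populated probe compare as different, which is the behavior an operator wants when reconciling a hand-edited daemonset back to its desired state.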