
Conversation


@Miciah Miciah commented Oct 9, 2020

  • assets/dns/daemonset.yaml: Specify a termination grace period of 120 seconds. Change the readiness probe to use :8181/ready.
  • pkg/manifests/bindata.go: Regenerate.
  • pkg/operator/controller/controller_dns_configmap.go: (corefileTemplate): Configure CoreDNS's health plugin to sleep 60 seconds when CoreDNS is shut down. Enable CoreDNS's ready plugin in order to provide a readiness endpoint on :8181/ready, which stops responding when CoreDNS is shutting down.
  • pkg/operator/controller/controller_dns_configmap_test.go (TestDesiredDNSConfigmap): Adjust for changes to corefileTemplate.
  • pkg/operator/controller/controller_dns_daemonset.go (daemonsetConfigChanged): Check if the readiness probe or termination grace period changed, and update them if they did.
  • pkg/operator/controller/controller_dns_daemonset_test.go (TestDaemonsetConfigChanged): Add test cases to verify that daemonsetConfigChanged detects updates to .spec.template.spec.containers[].readinessProbe.httpGet and .spec.template.spec.terminationGracePeriodSeconds.
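Taken together, the Corefile changes described above amount to a fragment along these lines (a sketch only, not the operator's exact generated Corefile; the listening port and plugin ordering here are assumptions):

```
.:5353 {
    # Keep answering on the health endpoint for 60s after SIGTERM so
    # in-flight queries can drain before the pod is killed.
    health {
        lameduck 60s
    }
    # Serve a readiness endpoint on :8181/ready that stops responding
    # as soon as CoreDNS begins shutting down.
    ready
    ...
}
```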

These changes make CoreDNS behave more correctly but do not seem to resolve BZ#1884053 completely: In testing, I still see that the DNS pod becomes unresponsive before the endpoints controller removes the pod's address from the endpoints when I reboot a node. It is not clear whether this is a problem with CoreDNS, the kubelet, the endpoints controller, MCO, or something else.

This commit fixes bug 1884053.

https://bugzilla.redhat.com/show_bug.cgi?id=1884053

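The daemonsetConfigChanged check described above can be sketched as follows. This is a simplified illustration, not the operator's actual code: the local HTTPGetAction/Probe/PodSpec types stand in for the corev1 types, and the real function operates on *appsv1.DaemonSet.

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the corev1 types the real operator compares.
type HTTPGetAction struct {
	Path string
	Port int32
}

type Probe struct {
	HTTPGet *HTTPGetAction
}

type PodSpec struct {
	ReadinessProbe                *Probe
	TerminationGracePeriodSeconds *int64
}

// configChanged reports whether the fields this PR cares about differ
// between the current and expected pod specs, and returns an updated
// copy of the current spec when they do.
func configChanged(current, expected *PodSpec) (bool, *PodSpec) {
	changed := false
	updated := *current
	if !reflect.DeepEqual(current.ReadinessProbe, expected.ReadinessProbe) {
		updated.ReadinessProbe = expected.ReadinessProbe
		changed = true
	}
	if !reflect.DeepEqual(current.TerminationGracePeriodSeconds, expected.TerminationGracePeriodSeconds) {
		updated.TerminationGracePeriodSeconds = expected.TerminationGracePeriodSeconds
		changed = true
	}
	return changed, &updated
}

func main() {
	grace := int64(120)
	current := &PodSpec{ReadinessProbe: &Probe{HTTPGet: &HTTPGetAction{Path: "/health", Port: 8080}}}
	expected := &PodSpec{
		ReadinessProbe:                &Probe{HTTPGet: &HTTPGetAction{Path: "/ready", Port: 8181}},
		TerminationGracePeriodSeconds: &grace,
	}
	changed, updated := configChanged(current, expected)
	fmt.Println(changed, updated.ReadinessProbe.HTTPGet.Path, *updated.TerminationGracePeriodSeconds)
}
```

Comparing with reflect.DeepEqual rather than `==` matters here because both fields are pointers; the design mirrors the PR's intent that an unchanged daemonset produces no update.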
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Oct 9, 2020
@openshift-ci-robot
Contributor

@Miciah: This pull request references Bugzilla bug 1884053, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1884053: Configure CoreDNS to shut down gracefully

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 9, 2020
@openshift-ci-robot
Contributor

@Miciah: The following test failed, say /retest to rerun all failed tests:

Test name        | Commit  | Details | Rerun command
ci/prow/e2e-aws  | f094ddf | link    | /test e2e-aws

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Contributor

@sgreene570 sgreene570 left a comment


@Miciah, is coredns/coredns#4099 a concern for us? It was rectified in upstream CoreDNS via coredns/coredns#4167, but the fix is currently not available in the openshift fork of CoreDNS.

    errors
    health {
        lameduck 60s
Contributor


I'm not trying to suggest that 60s is an unacceptable lameduck period, but I'm curious to hear how you arrived at this value.

Contributor Author


The readiness probe has a period of 10 seconds and a failure threshold of 3, so I figure it might take 30 seconds for the kubelet to recognize that the pod is unresponsive, and I doubled that for good measure.
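That arithmetic (3 failures x 10s period = 30s to mark the pod unready, doubled to 60s for the lameduck window) maps onto the probe settings in the daemonset. A sketch of the relevant fields, with values taken from this PR and the discussion (the container name is a placeholder, and other fields are omitted):

```yaml
spec:
  # 120s covers the 60s lameduck window with margin to spare.
  terminationGracePeriodSeconds: 120
  containers:
  - name: dns
    readinessProbe:
      httpGet:
        path: /ready
        port: 8181
      periodSeconds: 10    # probe every 10 seconds...
      failureThreshold: 3  # ...so up to 3 * 10s = 30s before the kubelet
                           # marks the pod unready; doubled gives 60s.
```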

@Miciah
Contributor Author

Miciah commented Oct 9, 2020

@Miciah, is coredns/coredns#4099 a concern for us? It was rectified in upstream CoreDNS via coredns/coredns#4167, but the fix is currently not available in the openshift fork of CoreDNS.

Yeah, I saw that, but according to the report, the issue only applies if the connection is re-used. The report mentions that Azure re-uses the connection. I assume the kubelet does not re-use the connection, or else someone would have said something.

@sgreene570
Contributor

I assume the kubelet does not re-use the connection or else someone would have said something.

Alright, fair enough. Just wanted to double check and raise awareness. Thanks!

@sgreene570
Contributor

/retest

@sgreene570
Contributor

Looks good!
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 23, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah, sgreene570

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@openshift-merge-robot openshift-merge-robot merged commit f2c1882 into openshift:master Oct 24, 2020
@openshift-ci-robot
Contributor

@Miciah: All pull requests linked via external trackers have merged:

Bugzilla bug 1884053 has been moved to the MODIFIED state.


In response to this:

Bug 1884053: Configure CoreDNS to shut down gracefully


@cgwalters
Member

Likely fallout from this in https://bugzilla.redhat.com/show_bug.cgi?id=1893360

@cgwalters
Member

Revert inbound in #213
