
Conversation


@Miciah Miciah commented Oct 9, 2020

  • assets/dns/daemonset.yaml: Specify a termination grace period of 120 seconds. Change the readiness probe to use :8181/ready.
  • pkg/manifests/bindata.go: Regenerate.
  • pkg/operator/controller/controller_dns_configmap.go: (corefileTemplate): Configure CoreDNS's health plugin to sleep 60 seconds when CoreDNS is shut down. Enable CoreDNS's ready plugin in order to provide a readiness endpoint on :8181/ready, which stops responding when CoreDNS is shutting down.
  • pkg/operator/controller/controller_dns_configmap_test.go (TestDesiredDNSConfigmap): Adjust for changes to corefileTemplate.
  • pkg/operator/controller/controller_dns_daemonset.go (daemonsetConfigChanged): Check if the readiness probe or termination grace period changed, and update them if they did.
  • pkg/operator/controller/controller_dns_daemonset_test.go (TestDaemonsetConfigChanged): Add test cases to verify that daemonsetConfigChanged detects updates to .spec.template.spec.containers[].readinessProbe.httpGet and .spec.template.spec.terminationGracePeriodSeconds.
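Taken together, the Corefile changes described above amount to a fragment along these lines (a sketch only, not the operator's exact generated Corefile; the listening port and plugin ordering here are assumptions):

```
.:5353 {
    # Keep answering on the health endpoint for 60s after SIGTERM so
    # in-flight queries can drain before the pod is killed.
    health {
        lameduck 60s
    }
    # Serve a readiness endpoint on :8181/ready that stops responding
    # as soon as CoreDNS begins shutting down.
    ready
    ...
}
```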

These changes make CoreDNS behave more correctly but do not seem to resolve BZ#1884053 completely: In testing, I still see that the DNS pod becomes unresponsive before the endpoints controller removes the pod's address from the endpoints when I reboot a node. It is not clear whether this is a problem with CoreDNS, the kubelet, the endpoints controller, MCO, or something else.

This commit fixes bug 1884053.

https://bugzilla.redhat.com/show_bug.cgi?id=1884053

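The daemonsetConfigChanged check described above can be sketched as follows. This is a simplified illustration, not the operator's actual code: the local HTTPGetAction/Probe/PodSpec types stand in for the corev1 types, and the real function operates on *appsv1.DaemonSet.

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the corev1 types the real operator compares.
type HTTPGetAction struct {
	Path string
	Port int32
}

type Probe struct {
	HTTPGet *HTTPGetAction
}

type PodSpec struct {
	ReadinessProbe                *Probe
	TerminationGracePeriodSeconds *int64
}

// configChanged reports whether the fields this PR cares about differ
// between the current and expected pod specs, and returns an updated
// copy of the current spec when they do.
func configChanged(current, expected *PodSpec) (bool, *PodSpec) {
	changed := false
	updated := *current
	if !reflect.DeepEqual(current.ReadinessProbe, expected.ReadinessProbe) {
		updated.ReadinessProbe = expected.ReadinessProbe
		changed = true
	}
	if !reflect.DeepEqual(current.TerminationGracePeriodSeconds, expected.TerminationGracePeriodSeconds) {
		updated.TerminationGracePeriodSeconds = expected.TerminationGracePeriodSeconds
		changed = true
	}
	return changed, &updated
}

func main() {
	grace := int64(120)
	current := &PodSpec{ReadinessProbe: &Probe{HTTPGet: &HTTPGetAction{Path: "/health", Port: 8080}}}
	expected := &PodSpec{
		ReadinessProbe:                &Probe{HTTPGet: &HTTPGetAction{Path: "/ready", Port: 8181}},
		TerminationGracePeriodSeconds: &grace,
	}
	changed, updated := configChanged(current, expected)
	fmt.Println(changed, updated.ReadinessProbe.HTTPGet.Path, *updated.TerminationGracePeriodSeconds)
}
```

Comparing with reflect.DeepEqual rather than `==` matters here because both fields are pointers; the design mirrors the PR's intent that an unchanged daemonset produces no update.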
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Oct 9, 2020
@openshift-ci-robot
Contributor

@Miciah: This pull request references Bugzilla bug 1884053, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1884053: Configure CoreDNS to shut down gracefully

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 9, 2020
@openshift-ci-robot
Contributor

@Miciah: The following test failed, say /retest to rerun all failed tests:

Test name        | Commit  | Details | Rerun command
ci/prow/e2e-aws  | f094ddf | link    | /test e2e-aws

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Contributor

@sgreene570 sgreene570 left a comment


@Miciah, is coredns/coredns#4099 a concern for us? It was rectified in upstream CoreDNS via coredns/coredns#4167, but the fix is currently not available in the openshift fork of CoreDNS.

    errors
    health {
        lameduck 60s
Contributor


I'm not trying to suggest that 60s is an unacceptable lameduck period, but I'm curious to hear how you arrived at this value.

Contributor Author


The readiness probe has a period of 10 seconds and a failure threshold of 3, so I figure it might take 30 seconds for the kubelet to recognize that the pod is unresponsive, and I doubled that for good measure.
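That arithmetic (3 failures x 10s period = 30s to mark the pod unready, doubled to 60s for the lameduck window) maps onto the probe settings in the daemonset. A sketch of the relevant fields, with values taken from this PR and the discussion (the container name is a placeholder, and other fields are omitted):

```yaml
spec:
  # 120s covers the 60s lameduck window with margin to spare.
  terminationGracePeriodSeconds: 120
  containers:
  - name: dns
    readinessProbe:
      httpGet:
        path: /ready
        port: 8181
      periodSeconds: 10    # probe every 10 seconds...
      failureThreshold: 3  # ...so up to 3 * 10s = 30s before the kubelet
                           # marks the pod unready; doubled gives 60s.
```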

@Miciah
Contributor Author

Miciah commented Oct 9, 2020

@Miciah, is coredns/coredns#4099 a concern for us? It was rectified in upstream CoreDNS via coredns/coredns#4167, but the fix is currently not available in the openshift fork of CoreDNS.

Yeah, I saw that, but according to the report, the issue only applies if the connection is re-used. The report mentions that Azure re-uses the connection. I assume the kubelet does not re-use the connection, or else someone would have said something.

@sgreene570
Contributor

I assume the kubelet does not re-use the connection or else someone would have said something.

Alright, fair enough. Just wanted to double check and raise awareness. Thanks!

@sgreene570
Contributor

/retest

@sgreene570
Contributor

Looks good!
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 23, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah, sgreene570

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@openshift-merge-robot openshift-merge-robot merged commit f2c1882 into openshift:master Oct 24, 2020
@openshift-ci-robot
Contributor

@Miciah: All pull requests linked via external trackers have merged:

Bugzilla bug 1884053 has been moved to the MODIFIED state.


In response to this:

Bug 1884053: Configure CoreDNS to shut down gracefully


@cgwalters
Member

Likely fallout from this in https://bugzilla.redhat.com/show_bug.cgi?id=1893360

@cgwalters
Member

Revert inbound in #213
