Bug 1709958: Avoid dropping traffic during upgrade by Miciah · Pull Request #280 · openshift/cluster-ingress-operator

Miciah · 2019-08-05T17:45:24Z

`TestDeploymentConfigChanged`: tolerations ordering

pkg/operator/controller/ingress/deployment_test.go (TestDeploymentConfigChanged): Add test case that verifies that the ordering of tolerations is ignored.

Add `davecgh/go-spew` dependency.

Avoid dropping traffic during upgrade

Omit the affinity policy when using host network. The scheduler already prevents colocation of pods that use host networking and request the same ports, so the affinity policy is superfluous in this case.

Change the affinity policy and deployment strategy when using the load-balancer or private endpoint publishing strategy as follows. First, configure affinity for replicas of different generations of same controller. Second, configure anti-affinity for replicas of same generation of same controller. Finally, configure the deployment strategy to surge.

The intention of the changes for the load-balancer and private endpoint publishing strategies is to avoid dropping traffic during rolling updates of an ingress controller's deployment. If a node loses local endpoints for the deployment, then the service proxy will drop traffic to that node for that ingress controller, and it may take some time for the load balancer to stop sending traffic to that node. These changes, combined with a change to the ReplicaSet controller, will ensure that a node that has local endpoints for an ingress controller at the start of a rollout will continue to have local endpoints during and at completion of the rollout, thus preventing traffic from being dropped.

pkg/operator/controller/controller_router_deployment.go (desiredRouterDeployment): Set the deployment's pod selector to a copy of the pod template's labels, so that subsequently mutating labels does not mutate the selector. If the ingress controller uses the host network, do not set any affinity policy. If the ingress controller uses the load-balancer or private endpoint publishing strategy, add a "hash" label to identify the deployment's generation, configure affinity for replicas of different generations of the same ingress controller, configure anti-affinity for the same generation of the same ingress controller, and configure the deployment strategy to surge.
(deploymentHash): New function. Return a stringified hash value for the given deployment, using only the fields that, if changed, should trigger an update.
(hashableDeployment): New function. Return a copy of the given deployment with fields that should be used for computing its hash copied over, fields that are slices sorted, and fields that should be ignored zeroed.
(deepHashObject): New function. Hash the given object.
(deploymentConfigChanged): Compare hashes of the current and expected deployments instead of comparing fields directly. Set labels during an update.
(cmpEnvs, cmpVolumes, cmpSecretVolumeSource, cmpTolerations): Deleted.
pkg/operator/controller/controller_router_deployment_test.go (TestDesiredRouterDeployment): Verify that the hash label is set when it should be and that it has the right hash value.
(TestDeploymentConfigChanged): Add test cases to verify that the hash is ignored in the deployment labels and affinity. Add a test case to verify that deleting the affinity policy is not ignored. Add test cases to verify that ordering of the label selector expressions in affinity terms is ignored.
pkg/operator/controller/names.go (ControllerDeploymentHashLabel): New constant.

openshift-ci-robot · 2019-08-05T17:45:27Z

@Miciah: This pull request references a valid Bugzilla bug. The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Bug 1709958: Avoid dropping traffic during upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ironcladlou · 2019-08-05T20:31:59Z

pkg/operator/controller/names.go

+	// different generations of the same ingress controller, and for
+	// anti-affinity, to prevent colocation of replicas of the same
+	// generation of the same ingress controller.
+	ControllerDeploymentNonceLabel = "ingresscontroller.operator.openshift.io/nonce"


Instead of exposing the host network and generation details through labels, would it be possible for the operator to compute whether a given ingresscontroller is co-locatable and then generically label with something like ingresscontroller.operator.openshift.io/colocation-disabled?

openshift-ci-robot · 2019-08-07T19:29:39Z

@Miciah: This pull request references a valid Bugzilla bug.

Details

In response to this:

Bug 1709958: Avoid dropping traffic during upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2019-08-08T17:49:13Z

@Miciah: This pull request references a valid Bugzilla bug.

Details

In response to this:

Bug 1709958: Avoid dropping traffic during upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Miciah · 2019-08-08T22:51:56Z

I have not been able to diagnose or reproduce the test failures.
/test e2e-aws-operator

ironcladlou · 2019-08-12T14:37:34Z

pkg/operator/controller/ingress/deployment.go

-			MaxUnavailable: pointerTo(intstr.FromString("25%")),
-			MaxSurge:       pointerTo(intstr.FromInt(0)),
-		},
+		// Avoid going from one replica to zero replicas on any given node.


Would you mind also talking about the details of deployments that explain how these values accomplish the stated behavior?

I have updated the comments better to tell the tale of Ingress Operator and the Zero-Downtime Rolling Update.

Miciah · 2019-10-02T14:51:22Z

Rebased.

Miciah · 2019-10-02T15:42:32Z

Provisioning failure.

ironcladlou · 2019-10-02T15:45:10Z

/retest

* pkg/operator/controller/ingress/deployment_test.go (TestDeploymentConfigChanged): Add test case that verifies that the ordering of tolerations is ignored.

Miciah · 2019-10-17T20:02:48Z

Rebased.

Omit the affinity policy when using host network. The scheduler already prevents colocation of pods that use host networking and request the same ports, so the affinity policy is superfluous in this case. Change the affinity policy and deployment strategy when using the load-balancer or private endpoint publishing strategy as follows. First, configure affinity for replicas of different generations of same controller. Second, configure anti-affinity for replicas of same generation of same controller. Finally, configure the deployment strategy to surge. The intention of the changes for the load-balancer and private endpoint publishing strategies is to avoid dropping traffic during rolling updates of an ingress controller's deployment. If a node loses local endpoints for the deployment, then the service proxy will drop traffic to that node for that ingress controller, and it may take some time for the load balancer to stop sending traffic to that node. These changes, combined with a change to the ReplicaSet controller, will ensure that a node that has local endpoints for an ingress controller at the start of a rollout will continue to have local endpoints during and at completion of the rollout, thus preventing traffic from being dropped. This commit is related to bug 1709958. https://bugzilla.redhat.com/show_bug.cgi?id=1709958 * pkg/operator/controller/controller_router_deployment.go (desiredRouterDeployment): Set the deployment's pod selector to a copy of the pod template's labels, so that subsequently mutating labels does not mutate the selector. If the ingress controller uses the host network, do not set any affinity policy. If the ingress controller uses the load-balancer or private endpoint publishing strategy, add a "hash" label to identify the deployment's generation, configure affinity for replicas of different generations of the same ingress controller, configure anti-affinity for the same generation of the same ingress controller, and configure the deployment strategy to surge. (deploymentHash): New function. Return a stringified hash value for the given deployment, using only the fields that, if changed, should trigger an update. (hashableDeployment): New function. Return a copy of the given deployment with fields that should be used for computing its hash copied over, fields that are slices sorted, and fields that should be ignored zeroed. (deepHashObject): New function. Hash the given object. (deploymentConfigChanged): Compare hashes of the current and expected deployments instead of comparing fields directly. Set labels during an update. (cmpEnvs, cmpVolumes, cmpVolumeMounts, cmpConfigMapVolumeSource) (cmpSecretVolumeSource, cmpTolerations): Deleted. * pkg/operator/controller/controller_router_deployment_test.go (TestDesiredRouterDeployment): Verify that the hash label is set when it should be and that it has the right hash value. (TestDeploymentConfigChanged): Add test cases to verify that the hash is ignored in the deployment labels and affinity. Add a test case to verify that deleting the affinity policy is not ignored. Add test cases to verify that ordering of the label selector expressions in affinity terms is ignored. * pkg/operator/controller/names.go (ControllerDeploymentHashLabel): New constant.

Miciah · 2019-10-17T20:19:09Z

Fixed out-of-order imports and updated comments for the deployment strategy and affinity policy.

Miciah · 2019-10-17T21:23:45Z

/test e2e-aws-operator
Failed to bootstrap.

level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: the server could not find the requested resource (get clusteroperators.config.openshift.io)"

Must-gather got a bunch of empty files, so I'm not sure how to diagnose this one.

Miciah · 2019-10-17T21:28:55Z

Must-gather got a bunch of empty files, so I'm not sure how to diagnose this one.

The artifacts do have bootkube logs:

Oct 17 20:40:30 ip-10-0-2-243 bootkube.sh[6256]: Error: unable to pull registry.svc.ci.openshift.org/ci-op-plg9gk2j/stable@sha256:4a93380042c847179af13334187544960da26aded2d5474df9029dd301d1fb3f: unable to pull image: Error reading signatures: Get https://registry.svc.ci.openshift.org/extensions/v2/ci-op-plg9gk2j/stable/signatures/sha256:4a93380042c847179af13334187544960da26aded2d5474df9029dd301d1fb3f: dial tcp 35.196.103.194:443: i/o timeout

Looks like the registry may be down.

Miciah · 2019-10-18T19:21:06Z

/refresh

Miciah · 2019-10-18T19:21:34Z

/retest

knobunc

/lgtm

Wow... this is cunning. Nice work!

openshift-ci-robot · 2019-10-18T19:25:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [Miciah,knobunc]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ironcladlou · 2019-10-18T20:15:31Z

This is a big deal, thank you

openshift-ci-robot · 2019-10-18T22:20:54Z

@Miciah: Bugzilla bug 1709958 is in an unrecognized state (CLOSED (DEFERRED)) and will not be moved to the MODIFIED state.

Details

In response to this:

Bug 1709958: Avoid dropping traffic during upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 5, 2019

openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 5, 2019

openshift-ci-robot requested review from ironcladlou and knobunc August 5, 2019 17:45

ironcladlou reviewed Aug 5, 2019

View reviewed changes

Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from 4c7d7eb to 250280a Compare August 7, 2019 19:26

Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from 250280a to 0437c3f Compare August 7, 2019 20:35

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 8, 2019

Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch 2 times, most recently from 8c2ab4c to fba1678 Compare August 8, 2019 17:48

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 8, 2019

Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from fba1678 to fc456f8 Compare August 9, 2019 19:49

ironcladlou reviewed Aug 12, 2019

View reviewed changes

Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from fc456f8 to b5ff5e2 Compare October 2, 2019 14:51

openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 2, 2019

Miciah added 2 commits October 17, 2019 16:01

TestDeploymentConfigChanged: tolerations ordering

250ec88

* pkg/operator/controller/ingress/deployment_test.go (TestDeploymentConfigChanged): Add test case that verifies that the ordering of tolerations is ignored.

Add davecgh/go-spew dependency.

b4d8faf

Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from b5ff5e2 to 8c18c7b Compare October 17, 2019 20:02

Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from 8c18c7b to 0f6fd1c Compare October 17, 2019 20:18

knobunc approved these changes Oct 18, 2019

View reviewed changes

openshift-ci-robot assigned knobunc Oct 18, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2019

openshift-merge-robot merged commit a7f75ef into openshift:master Oct 18, 2019

This was referenced Oct 23, 2019

Bug 1709958: Use graceful shutdown for router #240

Closed

Add shell script for graceful shutdown openshift/router#28

Closed

Miciah mentioned this pull request Nov 4, 2019

Bug 1766851: Add migration for affinity and deployment strategy #321

Merged

Conversation

Miciah commented Aug 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TestDeploymentConfigChanged: tolerations ordering

Add davecgh/go-spew dependency.

Avoid dropping traffic during upgrade

Uh oh!

openshift-ci-robot commented Aug 5, 2019

Uh oh!

ironcladlou Aug 5, 2019

Choose a reason for hiding this comment

Uh oh!

openshift-ci-robot commented Aug 7, 2019

Uh oh!

openshift-ci-robot commented Aug 8, 2019

Uh oh!

Miciah commented Aug 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ironcladlou Aug 12, 2019

Choose a reason for hiding this comment

Uh oh!

Miciah Oct 17, 2019

Choose a reason for hiding this comment

Uh oh!

Miciah commented Oct 2, 2019

Uh oh!

Miciah commented Oct 2, 2019

Uh oh!

ironcladlou commented Oct 2, 2019

Uh oh!

Miciah commented Oct 17, 2019

Uh oh!

Miciah commented Oct 17, 2019

Uh oh!

Miciah commented Oct 17, 2019

Uh oh!

Miciah commented Oct 17, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Miciah commented Oct 18, 2019

Uh oh!

Miciah commented Oct 18, 2019

Uh oh!

knobunc left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci-robot commented Oct 18, 2019

Uh oh!

ironcladlou commented Oct 18, 2019

Uh oh!

openshift-ci-robot commented Oct 18, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Miciah commented Aug 5, 2019 •

edited

Loading

`TestDeploymentConfigChanged`: tolerations ordering

Add `davecgh/go-spew` dependency.

Miciah commented Aug 8, 2019 •

edited

Loading

Miciah commented Oct 17, 2019 •

edited

Loading