Skip to content

Bug 1709958: Avoid dropping traffic during upgrade#280

Merged
openshift-merge-robot merged 3 commits intoopenshift:masterfrom
Miciah:BZ1709958-avoid-dropping-traffic-during-upgrade
Oct 18, 2019
Merged

Bug 1709958: Avoid dropping traffic during upgrade#280
openshift-merge-robot merged 3 commits intoopenshift:masterfrom
Miciah:BZ1709958-avoid-dropping-traffic-during-upgrade

Conversation

@Miciah
Copy link
Contributor

@Miciah Miciah commented Aug 5, 2019

TestDeploymentConfigChanged: tolerations ordering

  • pkg/operator/controller/ingress/deployment_test.go (TestDeploymentConfigChanged): Add test case that verifies that the ordering of tolerations is ignored.

Add davecgh/go-spew dependency.

Avoid dropping traffic during upgrade

Omit the affinity policy when using host network. The scheduler already prevents colocation of pods that use host networking and request the same ports, so the affinity policy is superfluous in this case.

Change the affinity policy and deployment strategy when using the load-balancer or private endpoint publishing strategy as follows. First, configure affinity for replicas of different generations of same controller. Second, configure anti-affinity for replicas of same generation of same controller. Finally, configure the deployment strategy to surge.

The intention of the changes for the load-balancer and private endpoint publishing strategies is to avoid dropping traffic during rolling updates of an ingress controller's deployment. If a node loses local endpoints for the deployment, then the service proxy will drop traffic to that node for that ingress controller, and it may take some time for the load balancer to stop sending traffic to that node. These changes, combined with a change to the ReplicaSet controller, will ensure that a node that has local endpoints for an ingress controller at the start of a rollout will continue to have local endpoints during and at completion of the rollout, thus preventing traffic from being dropped.

  • pkg/operator/controller/controller_router_deployment.go (desiredRouterDeployment): Set the deployment's pod selector to a copy of the pod template's labels, so that subsequently mutating labels does not mutate the selector. If the ingress controller uses the host network, do not set any affinity policy. If the ingress controller uses the load-balancer or private endpoint publishing strategy, add a "hash" label to identify the deployment's generation, configure affinity for replicas of different generations of the same ingress controller, configure anti-affinity for the same generation of the same ingress controller, and configure the deployment strategy to surge.
    (deploymentHash): New function. Return a stringified hash value for the given deployment, using only the fields that, if changed, should trigger an update.
    (hashableDeployment): New function. Return a copy of the given deployment with fields that should be used for computing its hash copied over, fields that are slices sorted, and fields that should be ignored zeroed.
    (deepHashObject): New function. Hash the given object.
    (deploymentConfigChanged): Compare hashes of the current and expected deployments instead of comparing fields directly. Set labels during an update.
    (cmpEnvs, cmpVolumes, cmpSecretVolumeSource, cmpTolerations): Deleted.
  • pkg/operator/controller/controller_router_deployment_test.go (TestDesiredRouterDeployment): Verify that the hash label is set when it should be and that it has the right hash value.
    (TestDeploymentConfigChanged): Add test cases to verify that the hash is ignored in the deployment labels and affinity. Add a test case to verify that deleting the affinity policy is not ignored. Add test cases to verify that ordering of the label selector expressions in affinity terms is ignored.
  • pkg/operator/controller/names.go (ControllerDeploymentHashLabel): New constant.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 5, 2019
@openshift-ci-robot
Copy link
Contributor

@Miciah: This pull request references a valid Bugzilla bug. The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Bug 1709958: Avoid dropping traffic during upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 5, 2019
// different generations of the same ingress controller, and for
// anti-affinity, to prevent colocation of replicas of the same
// generation of the same ingress controller.
ControllerDeploymentNonceLabel = "ingresscontroller.operator.openshift.io/nonce"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of exposing the host network and generation details through labels, would it be possible for the operator to compute whether a given ingresscontroller is co-locatable and then generically label with something like ingresscontroller.operator.openshift.io/colocation-disabled?

@Miciah Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from 4c7d7eb to 250280a Compare August 7, 2019 19:26
@openshift-ci-robot
Copy link
Contributor

@Miciah: This pull request references a valid Bugzilla bug.

Details

In response to this:

Bug 1709958: Avoid dropping traffic during upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Miciah Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from 250280a to 0437c3f Compare August 7, 2019 20:35
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 8, 2019
@Miciah Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch 2 times, most recently from 8c2ab4c to fba1678 Compare August 8, 2019 17:48
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 8, 2019
@openshift-ci-robot
Copy link
Contributor

@Miciah: This pull request references a valid Bugzilla bug.

Details

In response to this:

Bug 1709958: Avoid dropping traffic during upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Miciah
Copy link
Contributor Author

Miciah commented Aug 8, 2019

I have not been able to diagnose or reproduce the test failures.
/test e2e-aws-operator

@Miciah Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from fba1678 to fc456f8 Compare August 9, 2019 19:49
MaxUnavailable: pointerTo(intstr.FromString("25%")),
MaxSurge: pointerTo(intstr.FromInt(0)),
},
// Avoid going from one replica to zero replicas on any given node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind also talking about the details of deployments that explain how these values accomplish the stated behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the comments better to tell the tale of Ingress Operator and the Zero-Downtime Rolling Update.

@Miciah Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from fc456f8 to b5ff5e2 Compare October 2, 2019 14:51
@Miciah
Copy link
Contributor Author

Miciah commented Oct 2, 2019

Rebased.

@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 2, 2019
@Miciah
Copy link
Contributor Author

Miciah commented Oct 2, 2019

Provisioning failure.

@ironcladlou
Copy link
Contributor

/retest

* pkg/operator/controller/ingress/deployment_test.go
(TestDeploymentConfigChanged): Add test case that verifies that the
ordering of tolerations is ignored.
@Miciah Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from b5ff5e2 to 8c18c7b Compare October 17, 2019 20:02
@Miciah
Copy link
Contributor Author

Miciah commented Oct 17, 2019

Rebased.

Omit the affinity policy when using host network.  The scheduler already
prevents colocation of pods that use host networking and request the same
ports, so the affinity policy is superfluous in this case.

Change the affinity policy and deployment strategy when using the
load-balancer or private endpoint publishing strategy as follows.  First,
configure affinity for replicas of different generations of same
controller.  Second, configure anti-affinity for replicas of same
generation of same controller.  Finally, configure the deployment strategy
to surge.

The intention of the changes for the load-balancer and private endpoint
publishing strategies is to avoid dropping traffic during rolling updates
of an ingress controller's deployment.  If a node loses local endpoints for
the deployment, then the service proxy will drop traffic to that node for
that ingress controller, and it may take some time for the load balancer to
stop sending traffic to that node.  These changes, combined with a change
to the ReplicaSet controller, will ensure that a node that has local
endpoints for an ingress controller at the start of a rollout will continue
to have local endpoints during and at completion of the rollout, thus
preventing traffic from being dropped.

This commit is related to bug 1709958.

https://bugzilla.redhat.com/show_bug.cgi?id=1709958

* pkg/operator/controller/controller_router_deployment.go
(desiredRouterDeployment): Set the deployment's pod selector to a copy of
the pod template's labels, so that subsequently mutating labels does not
mutate the selector.  If the ingress controller uses the host network, do
not set any affinity policy.  If the ingress controller uses the
load-balancer or private endpoint publishing strategy, add a "hash" label
to identify the deployment's generation, configure affinity for replicas of
different generations of the same ingress controller, configure
anti-affinity for the same generation of the same ingress controller, and
configure the deployment strategy to surge.
(deploymentHash): New function.  Return a stringified hash value for the
given deployment, using only the fields that, if changed, should trigger an
update.
(hashableDeployment): New function.  Return a copy of the given deployment
with fields that should be used for computing its hash copied over, fields
that are slices sorted, and fields that should be ignored zeroed.
(deepHashObject): New function.  Hash the given object.
(deploymentConfigChanged): Compare hashes of the current and expected
deployments instead of comparing fields directly.  Set labels during an
update.
(cmpEnvs, cmpVolumes, cmpVolumeMounts, cmpConfigMapVolumeSource)
(cmpSecretVolumeSource, cmpTolerations): Deleted.
* pkg/operator/controller/controller_router_deployment_test.go
(TestDesiredRouterDeployment): Verify that the hash label is set when it
should be and that it has the right hash value.
(TestDeploymentConfigChanged): Add test cases to verify that the hash is
ignored in the deployment labels and affinity.  Add a test case to verify
that deleting the affinity policy is not ignored.  Add test cases to verify
that ordering of the label selector expressions in affinity terms is
ignored.
* pkg/operator/controller/names.go (ControllerDeploymentHashLabel):
New constant.
@Miciah Miciah force-pushed the BZ1709958-avoid-dropping-traffic-during-upgrade branch from 8c18c7b to 0f6fd1c Compare October 17, 2019 20:18
@Miciah
Copy link
Contributor Author

Miciah commented Oct 17, 2019

Fixed out-of-order imports and updated comments for the deployment strategy and affinity policy.

@Miciah
Copy link
Contributor Author

Miciah commented Oct 17, 2019

/test e2e-aws-operator
Failed to bootstrap.

level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: the server could not find the requested resource (get clusteroperators.config.openshift.io)"

Must-gather got a bunch of empty files, so I'm not sure how to diagnose this one.

@Miciah
Copy link
Contributor Author

Miciah commented Oct 17, 2019

Must-gather got a bunch of empty files, so I'm not sure how to diagnose this one.

The artifacts do have bootkube logs:

Oct 17 20:40:30 ip-10-0-2-243 bootkube.sh[6256]: Error: unable to pull registry.svc.ci.openshift.org/ci-op-plg9gk2j/stable@sha256:4a93380042c847179af13334187544960da26aded2d5474df9029dd301d1fb3f: unable to pull image: Error reading signatures: Get https://registry.svc.ci.openshift.org/extensions/v2/ci-op-plg9gk2j/stable/signatures/sha256:4a93380042c847179af13334187544960da26aded2d5474df9029dd301d1fb3f: dial tcp 35.196.103.194:443: i/o timeout

Looks like the registry may be down.

@Miciah
Copy link
Contributor Author

Miciah commented Oct 18, 2019

/refresh

@Miciah
Copy link
Contributor Author

Miciah commented Oct 18, 2019

/retest

Copy link

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Wow... this is cunning. Nice work!

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ironcladlou
Copy link
Contributor

This is a big deal, thank you

@openshift-merge-robot openshift-merge-robot merged commit a7f75ef into openshift:master Oct 18, 2019
@openshift-ci-robot
Copy link
Contributor

@Miciah: Bugzilla bug 1709958 is in an unrecognized state (CLOSED (DEFERRED)) and will not be moved to the MODIFIED state.

Details

In response to this:

Bug 1709958: Avoid dropping traffic during upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants