Bug 1709958: Avoid dropping traffic during upgrade #280
@Miciah: This pull request references a valid Bugzilla bug. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
pkg/operator/controller/names.go
Outdated
// different generations of the same ingress controller, and for
// anti-affinity, to prevent colocation of replicas of the same
// generation of the same ingress controller.
ControllerDeploymentNonceLabel = "ingresscontroller.operator.openshift.io/nonce"
Instead of exposing the host network and generation details through labels, would it be possible for the operator to compute whether a given ingresscontroller is co-locatable and then generically label with something like ingresscontroller.operator.openshift.io/colocation-disabled?
Force-pushed from 4c7d7eb to 250280a.
Force-pushed from 250280a to 0437c3f.
Force-pushed from 8c2ab4c to fba1678.
I have not been able to diagnose or reproduce the test failures.
Force-pushed from fba1678 to fc456f8.
MaxUnavailable: pointerTo(intstr.FromString("25%")),
MaxSurge: pointerTo(intstr.FromInt(0)),
},
// Avoid going from one replica to zero replicas on any given node.
Would you mind also talking about the details of deployments that explain how these values accomplish the stated behavior?
I have updated the comments better to tell the tale of Ingress Operator and the Zero-Downtime Rolling Update.
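For readers following the question about how these values behave, here is a minimal sketch of how percentage-based rolling-update limits resolve to pod counts. It assumes the upstream deployment controller's rounding rules as I understand them: maxSurge rounds up, maxUnavailable rounds down, and if both resolve to zero then maxUnavailable is bumped to 1. The helper names are hypothetical, the sketch takes both limits as percentages (0% stands in for the integer 0 in the excerpt), and it uses only the standard library rather than the real intstr package.

```go
package main

import (
	"fmt"
	"math"
)

// scaledValue resolves a percentage of total to a pod count,
// rounding up or down as requested.
func scaledValue(percent, total int, roundUp bool) int {
	v := float64(percent) * float64(total) / 100
	if roundUp {
		return int(math.Ceil(v))
	}
	return int(math.Floor(v))
}

// resolveFenceposts is a hypothetical stand-in for the controller logic
// (an assumption, not copied from the operator): surge rounds up,
// unavailable rounds down, and the pair may not both be zero.
func resolveFenceposts(surgePercent, unavailablePercent, replicas int) (surge, unavailable int) {
	surge = scaledValue(surgePercent, replicas, true)
	unavailable = scaledValue(unavailablePercent, replicas, false)
	if surge == 0 && unavailable == 0 {
		// Without this, the rollout could never make progress.
		unavailable = 1
	}
	return surge, unavailable
}

func main() {
	// The values from the excerpt above: 25% unavailable, no surge.
	for _, replicas := range []int{2, 4, 8} {
		s, u := resolveFenceposts(0, 25, replicas)
		fmt.Printf("replicas=%d maxSurge=%d maxUnavailable=%d\n", replicas, s, u)
	}
}
```

With two replicas, 25% rounds down to zero and is then raised to one, so a rollout without surge takes one of the two replicas away before its replacement is ready.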
Force-pushed from fc456f8 to b5ff5e2.
Rebased.
Provisioning failure.
/retest
* pkg/operator/controller/ingress/deployment_test.go (TestDeploymentConfigChanged): Add test case that verifies that the ordering of tolerations is ignored.
Force-pushed from b5ff5e2 to 8c18c7b.
Rebased.
Omit the affinity policy when using host network. The scheduler already prevents colocation of pods that use host networking and request the same ports, so the affinity policy is superfluous in this case.
Change the affinity policy and deployment strategy when using the load-balancer or private endpoint publishing strategy as follows. First, configure affinity for replicas of different generations of the same controller. Second, configure anti-affinity for replicas of the same generation of the same controller. Finally, configure the deployment strategy to surge.
The intention of the changes for the load-balancer and private endpoint publishing strategies is to avoid dropping traffic during rolling updates of an ingress controller's deployment. If a node loses local endpoints for the deployment, then the service proxy will drop traffic to that node for that ingress controller, and it may take some time for the load balancer to stop sending traffic to that node. These changes, combined with a change to the ReplicaSet controller, will ensure that a node that has local endpoints for an ingress controller at the start of a rollout will continue to have local endpoints during and at completion of the rollout, thus preventing traffic from being dropped.
This commit is related to bug 1709958.
https://bugzilla.redhat.com/show_bug.cgi?id=1709958
* pkg/operator/controller/controller_router_deployment.go (desiredRouterDeployment): Set the deployment's pod selector to a copy of the pod template's labels, so that subsequently mutating labels does not mutate the selector. If the ingress controller uses the host network, do not set any affinity policy. If the ingress controller uses the load-balancer or private endpoint publishing strategy, add a "hash" label to identify the deployment's generation, configure affinity for replicas of different generations of the same ingress controller, configure anti-affinity for the same generation of the same ingress controller, and configure the deployment strategy to surge.
(deploymentHash): New function. Return a stringified hash value for the given deployment, using only the fields that, if changed, should trigger an update.
(hashableDeployment): New function. Return a copy of the given deployment with fields that should be used for computing its hash copied over, fields that are slices sorted, and fields that should be ignored zeroed.
(deepHashObject): New function. Hash the given object.
(deploymentConfigChanged): Compare hashes of the current and expected deployments instead of comparing fields directly. Set labels during an update.
(cmpEnvs, cmpVolumes, cmpVolumeMounts, cmpConfigMapVolumeSource)
(cmpSecretVolumeSource, cmpTolerations): Deleted.
* pkg/operator/controller/controller_router_deployment_test.go (TestDesiredRouterDeployment): Verify that the hash label is set when it should be and that it has the right hash value.
(TestDeploymentConfigChanged): Add test cases to verify that the hash is ignored in the deployment labels and affinity. Add a test case to verify that deleting the affinity policy is not ignored. Add test cases to verify that ordering of the label selector expressions in affinity terms is ignored.
* pkg/operator/controller/names.go (ControllerDeploymentHashLabel): New constant.
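The hashing approach described in the changelog (copy the relevant fields, sort slices, zero ignored fields, then hash) can be sketched roughly as follows. This is an illustration under stated assumptions, not the operator's code: the types are hypothetical, pared-down stand-ins for the Kubernetes ones, Replicas is treated as an ignored field purely for the example, and FNV over fmt's "%#v" output stands in for whatever hashing and printing the real deepHashObject uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// toleration and deploymentSpec are hypothetical stand-ins for the
// Kubernetes types, reduced to what the illustration needs.
type toleration struct {
	Key, Operator, Effect string
}

type deploymentSpec struct {
	Image       string
	Tolerations []toleration
	Replicas    int // Assumed ignorable here, for illustration only.
}

// hashableSpec returns a copy that keeps only the fields that should
// influence the hash, with the tolerations slice sorted so that
// ordering differences do not change the result.
func hashableSpec(in deploymentSpec) deploymentSpec {
	out := deploymentSpec{Image: in.Image} // Replicas stays zero.
	out.Tolerations = append([]toleration(nil), in.Tolerations...)
	sort.Slice(out.Tolerations, func(i, j int) bool {
		a, b := out.Tolerations[i], out.Tolerations[j]
		if a.Key != b.Key {
			return a.Key < b.Key
		}
		if a.Operator != b.Operator {
			return a.Operator < b.Operator
		}
		return a.Effect < b.Effect
	})
	return out
}

// specHash returns a stringified hash of the normalized spec.
func specHash(in deploymentSpec) string {
	h := fnv.New32a()
	fmt.Fprintf(h, "%#v", hashableSpec(in))
	return fmt.Sprint(h.Sum32())
}

func main() {
	a := deploymentSpec{Image: "router:v1", Replicas: 2, Tolerations: []toleration{
		{Key: "a", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "b", Operator: "Exists", Effect: "NoExecute"},
	}}
	b := deploymentSpec{Image: "router:v1", Replicas: 3, Tolerations: []toleration{
		{Key: "b", Operator: "Exists", Effect: "NoExecute"},
		{Key: "a", Operator: "Exists", Effect: "NoSchedule"},
	}}
	c := a
	c.Image = "router:v2"

	fmt.Println(specHash(a) == specHash(b)) // true: ordering and replicas ignored
	fmt.Println(specHash(a) == specHash(c)) // false: image differs
}
```

Comparing two hashes replaces a family of per-field comparison helpers, at the cost of having to decide up front which fields are copied into the hashable form.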
Force-pushed from 8c18c7b to 0f6fd1c.
Fixed out-of-order imports and updated comments for the deployment strategy and affinity policy.
/test e2e-aws-operator
Must-gather got a bunch of empty files, so I'm not sure how to diagnose this one.
The artifacts do have bootkube logs: Looks like the registry may be down.
/refresh
/retest
knobunc left a comment
/lgtm
Wow... this is cunning. Nice work!
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: knobunc, Miciah
This is a big deal, thank you
@Miciah: Bugzilla bug 1709958 is in an unrecognized state (CLOSED (DEFERRED)) and will not be moved to the MODIFIED state.
TestDeploymentConfigChanged: tolerations ordering
* pkg/operator/controller/ingress/deployment_test.go (TestDeploymentConfigChanged): Add test case that verifies that the ordering of tolerations is ignored.
Add davecgh/go-spew dependency.
Avoid dropping traffic during upgrade
Omit the affinity policy when using host network. The scheduler already prevents colocation of pods that use host networking and request the same ports, so the affinity policy is superfluous in this case.
Change the affinity policy and deployment strategy when using the load-balancer or private endpoint publishing strategy as follows. First, configure affinity for replicas of different generations of the same controller. Second, configure anti-affinity for replicas of the same generation of the same controller. Finally, configure the deployment strategy to surge.
The intention of the changes for the load-balancer and private endpoint publishing strategies is to avoid dropping traffic during rolling updates of an ingress controller's deployment. If a node loses local endpoints for the deployment, then the service proxy will drop traffic to that node for that ingress controller, and it may take some time for the load balancer to stop sending traffic to that node. These changes, combined with a change to the ReplicaSet controller, will ensure that a node that has local endpoints for an ingress controller at the start of a rollout will continue to have local endpoints during and at completion of the rollout, thus preventing traffic from being dropped.
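The reasoning above can be illustrated with a toy simulation. It is a sketch under assumptions, not a model of the real scheduler or controllers: each node starts with one old-generation replica, the affinity rules are assumed to place each new replica on a node that has an old replica and no new one, and the old replica that gets scaled down is assumed to be the one sharing a node with a ready new replica (the role this sketch assigns to the ReplicaSet controller change, which the description does not spell out). All names are hypothetical.

```go
package main

import "fmt"

// node counts the ready replicas of each generation on one node.
type node struct {
	oldGen, newGen int
}

// minLocalEndpoints replaces every old-generation replica, one node at a
// time, and returns the fewest ready replicas seen on any node at any
// step of the rollout.
func minLocalEndpoints(replicas int, surge bool) int {
	nodes := make([]node, replicas)
	for i := range nodes {
		nodes[i].oldGen = 1 // One replica per node at the start.
	}
	lowest := 1
	observe := func() {
		for _, n := range nodes {
			if c := n.oldGen + n.newGen; c < lowest {
				lowest = c
			}
		}
	}
	for i := range nodes {
		if surge {
			nodes[i].newGen++ // New replica becomes ready next to the old one.
			observe()
			nodes[i].oldGen-- // Only then is the old replica removed.
		} else {
			nodes[i].oldGen-- // Old replica is removed first.
			observe()
			nodes[i].newGen++ // Replacement arrives afterwards.
		}
		observe()
	}
	return lowest
}

func main() {
	fmt.Println(minLocalEndpoints(3, true))  // 1: no node is ever left empty
	fmt.Println(minLocalEndpoints(3, false)) // 0: a node briefly has no endpoint
}
```

The second case is the window in which the service proxy would drop traffic that the load balancer still sends to the node.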
* pkg/operator/controller/controller_router_deployment.go (desiredRouterDeployment): Set the deployment's pod selector to a copy of the pod template's labels, so that subsequently mutating labels does not mutate the selector. If the ingress controller uses the host network, do not set any affinity policy. If the ingress controller uses the load-balancer or private endpoint publishing strategy, add a "hash" label to identify the deployment's generation, configure affinity for replicas of different generations of the same ingress controller, configure anti-affinity for the same generation of the same ingress controller, and configure the deployment strategy to surge.
(deploymentHash): New function. Return a stringified hash value for the given deployment, using only the fields that, if changed, should trigger an update.
(hashableDeployment): New function. Return a copy of the given deployment with fields that should be used for computing its hash copied over, fields that are slices sorted, and fields that should be ignored zeroed.
(deepHashObject): New function. Hash the given object.
(deploymentConfigChanged): Compare hashes of the current and expected deployments instead of comparing fields directly. Set labels during an update.
(cmpEnvs, cmpVolumes, cmpVolumeMounts, cmpConfigMapVolumeSource)
(cmpSecretVolumeSource, cmpTolerations): Deleted.
* pkg/operator/controller/controller_router_deployment_test.go (TestDesiredRouterDeployment): Verify that the hash label is set when it should be and that it has the right hash value.
(TestDeploymentConfigChanged): Add test cases to verify that the hash is ignored in the deployment labels and affinity. Add a test case to verify that deleting the affinity policy is not ignored. Add test cases to verify that ordering of the label selector expressions in affinity terms is ignored.
* pkg/operator/controller/names.go (ControllerDeploymentHashLabel): New constant.
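The first changelog entry, about setting the selector to a copy of the pod template's labels, rests on the fact that Go maps are reference types. A minimal sketch of the pitfall follows; the label keys and the copyLabels helper are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// copyLabels returns a new map holding the same entries as in, so that
// later writes to in are not visible through the returned map.
func copyLabels(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

func main() {
	// Hypothetical label keys, not the operator's real ones.
	templateLabels := map[string]string{"app": "router"}

	aliased := templateLabels            // Shares storage with templateLabels.
	copied := copyLabels(templateLabels) // Independent of templateLabels.

	// Adding a per-generation label to the pod template afterwards...
	templateLabels["example/hash"] = "12345"

	// ...leaks into the aliased selector but not into the copy.
	fmt.Println(len(aliased), len(copied)) // 2 1
}
```

Keeping the selector independent matters here because the hash label changes with each generation, and a selector that picked it up would no longer match pods of other generations.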