OCPBUGS-9825: LoadBalancer Templates Merge Downstream: 25th May 2023#1683
Conversation
That's what we actually deploy nowadays. Signed-off-by: Dumitru Ceara <dceara@redhat.com>
These will be used for per node NB configurations (e.g., load balancer templates). Signed-off-by: Dumitru Ceara <dceara@redhat.com>
An upcoming commit will need to make some decisions based on the node IP address family. To avoid re-parsing nodeIPs, just store them as net.IP. Signed-off-by: Dumitru Ceara <dceara@redhat.com>
mergeLBs() was incorrectly only merging switches and routers of load balancers that can be merged. Until now that wasn't a problem because load balancers that used groups were never real candidates for merging (they were cluster-IP services). An upcoming commit will take advantage of this behavior and use LB groups for node port services too. Fixes: 18d11d4 ("Use Load_Balancer_Groups when supported.") Signed-off-by: Dumitru Ceara <dceara@redhat.com>
This better reflects what it's used for. The version of the packages to be used at runtime is actually specified in the corresponding Dockerfiles in dist/images. Signed-off-by: Dumitru Ceara <dceara@redhat.com>
This picks up the new Chassis_Template_Var support in OVN. Signed-off-by: Dumitru Ceara <dceara@redhat.com>
This picks up the following fixes (only mentioning relevant ones):
2bab96e899b5 ("northd: prevents sending packet to conntrack for router ports")
d01fdfdb2c97 ("lb: Allow IPv6 template LBs to use explicit backends.")
6a16c741e5a1 ("controller: lflow: do not use tcp as default IP protocol for ct_snat_to_vip action")
77384b7fe3f7 ("northd: Drop packets for LBs with no backends")
81eaa98bbb60 ("northd: Use generic ct.est flows for LR LBs")
0af110c400cc ("northd: drop ct.inv packets in post snat and lb_aff_learn stages")
89fc85fa7f2b ("controller: Add config option per LB to enable/disable CT flush")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Insert OVSDB mutations for columns of type "map" are performed only if the value didn't already exist in the database (RFC 7047, section 5.1). That means that mutations that update existing values must first delete those. Signed-off-by: Dumitru Ceara <dceara@redhat.com>
For now we only use templates for NodePort services that don't have ETP=local set or affinity timeout. The latter is not supported in core OVN but the former can probably be implemented as a follow up change. With load balancer templates we're able to reduce the size of the NB and amount of work ovn-northd has to perform by a factor of N (where N is the number of nodes). In particular, for a cluster with N nodes and S NodePort services: - before this change we would configure O(N * S) load balancers (two per node for every service) - with this change this gets reduced to O(S) load balancers (four per service). We do have to configure N template variables per service but the impact of those on the amount of work ovn-northd has to perform is minimal as it will only have to propagate those records to the Southbound. Instead of generating a load balancer per service per node we now generate at most four (template) load balancers (IPv4/IPv6 applicable to gateway routers or applicable to node switches). Each template load balancer uses as VIP a template variable denoting the node IP, e.g., NODEIP_IPv4 and as backend another template variable, e.g., SVC_node_port_svc__UDP__30957__node__router__template__IPv4. These template variables have potentially different values on different chassis. For example, NODEIP_IPv4 will expand to chassis-X's IP on chassis-X and to chassis-Y's IP on chassis-Y. Similarly, SVC_node_port_svc__UDP__30957__node__router__template__IPv4 will expand on chassis-X to the set of backends that are required by the NodePort service on chassis-X and that might be different than the set of backends required on chassis-Y. The node IP template variables are managed by the services controller. Together with the nodeInfo structs provided by the node tracker they represent the node-specific information. A RWMutex, nodeInfoRWLock, is used to protect access to this node specific information. When writing this data (node_tracker) the mutex is taken for writing and when using it (service processing) the mutex is taken for reading (to reduce contention between service controller threads). The alreadyAppliedLock is repurposed into a RWMutex too in order to further reduce contention between service controller threads. More information about the way OVN component templates work and synthetic benchmark results can be found at: https://mail.openvswitch.org/pipermail/ovs-dev/2022-November/399522.html and https://www.openvswitch.org/support/ovscon2022/slides/Horizontal-scaling.pdf Signed-off-by: Dumitru Ceara <dceara@redhat.com>
In general, template OVN load balancers used for implementing NodePort services might have completely different sets of backends on different nodes. In particular however, there are quite a few cases in which node port services end up using the same backends on all nodes, e.g., for services that don't have ETP/ITP=local and that select only OVN networked pods as backends. Detect such situations and avoid using templated backends. Use instead a shared list of backend addresses. This reduces NB DB size of the Chassis_Template_Var table, from O(N * S) to O(N), where N is the number of nodes and S is the number of node port services that match the criteria defined above. Signed-off-by: Dumitru Ceara <dceara@redhat.com>
…lb-templates-merge-downstream Conflicts: go-controller/pkg/ovn/controller/services/lb_config.go because we have to take care of ETP=local OCPHACK: openshift#1622 go-controller/pkg/ovn/controller/services/loadbalancer.go because openshift#1652 went in
|
@tssurya: This pull request references Jira Issue OCPBUGS-9825, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/approve |
|
/hold |
|
/retest-required |
1 similar comment
|
/retest-required |
|
@tssurya: This pull request references Jira Issue OCPBUGS-9825, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/test images |
1 similar comment
|
/test images |
|
All right, my next concern is... unit tests succeed after 2 minutes only: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1683/pull-ci-openshift-ovn-kubernetes-master-unit/1662017861620600832 |
|
This openshift-ovn-kubernetes-master-unit CI job has actually been giving this behaviour since mid march. o_O The last job I see that actually ran all unit tests as expected is from March 21st: In particular, see the logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1596/pull-ci-openshift-ovn-kubernetes-master-unit/1638213514390671360/artifacts/test/build-log.txt |
|
/test images |
|
/retest |
ROFL! This is cracking me up so bad.. :) Well glad we caught this snafu and this explains why the OCPHACK tests "passed" for me on the PR and failed for you locally :D |
|
ok let's see how upgrades behaves in this run |
|
/retest-required |
|
/test e2e-aws-ovn-upgrade-local-gateway |
|
@tssurya: This pull request references Jira Issue OCPBUGS-9825, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/test e2e-aws-ovn-upgrade-local-gateway |
3 similar comments
|
/test e2e-aws-ovn-upgrade-local-gateway |
|
/test e2e-aws-ovn-upgrade-local-gateway |
|
/test e2e-aws-ovn-upgrade-local-gateway |
|
@tssurya: This pull request references Jira Issue OCPBUGS-9825, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@tssurya: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
need an override for lgw gateway upgrade. See https://issues.redhat.com/browse/OCPBUGS-14247 |
|
/hold cancel |
|
/test e2e-aws-ovn-upgrade-local-gateway |
|
/hold |
|
I think we found the root cause of lgw upgrade failure: openshift/cluster-node-tuning-operator#668 lets not block this from merging any longer /override ci/prow/e2e-aws-ovn-upgrade-local-gateway |
|
@trozet: Overrode contexts on behalf of trozet: ci/prow/e2e-aws-ovn-upgrade-local-gateway DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/hold cancel |
|
@tssurya: Jira Issue OCPBUGS-9825: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-9825 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Requirements for Merge:
@trozet or @jcaamano or @dcbw for approve.
We also needed an OVN fix for LB templates with V6 family:
I am bumping OVN rpms to
ovn23.03-23.03.0-49.el9fdp. See https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2512984 for a CHANGE log.