Skip to content

OCPBUGS-9825: LoadBalancer Templates Merge Downstream: 25th May 2023#1683

Merged
openshift-merge-robot merged 13 commits intoopenshift:masterfrom
tssurya:lb-templates-merge-downstream
May 30, 2023
Merged

OCPBUGS-9825: LoadBalancer Templates Merge Downstream: 25th May 2023#1683
openshift-merge-robot merged 13 commits intoopenshift:masterfrom
tssurya:lb-templates-merge-downstream

Conversation

@tssurya
Copy link
Contributor

@tssurya tssurya commented May 25, 2023

Requirements for Merge:

@trozet or @jcaamano or @dcbw for approve.

We also needed an OVN fix for LB templates with V6 family:

* Tue Mar 28 2023 Dumitru Ceara <dceara@redhat.com> - 23.03.0-14
- lb: Allow IPv6 template LBs to use explicit backends.
[Upstream: d01fdfdb2c97222cf326c8ab5579f670ded6e3cb]

I am bumping OVN rpms to ovn23.03-23.03.0-49.el9fdp. See https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2512984 for a CHANGE log.

brew list-tagged  rhaos-4.14-rhel-9-candidate | grep ovn
ovn23.03-23.03.0-49.el9fdp                rhaos-4.14-rhel-9-candidate  mmichels
ovn23.03-23.03.0-7.el9fdp                 rhaos-4.14-rhel-9-candidate  dceara

dceara and others added 12 commits April 5, 2023 09:38
That's what we actually deploy nowadays.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
These will be used for per node NB configurations (e.g., load balancer
templates).

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
An upcoming commit will need to make some decisions based on the node
IP address family.  To avoid re-parsing nodeIPs, just store them as
net.IP.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
mergeLBs() was incorrectly only merging switches and routers of load
balancers that can be merged.  Until now that wasn't a problem because
load balancers that used groups were never real candidates for merging
(they were cluster-IP services).  An upcoming commit will take advantage
of this behavior and use LB groups for node port services too.

Fixes: 18d11d4 ("Use Load_Balancer_Groups when supported.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
This better reflects what it's used for.  The version of the packages to
be used at runtime is actually specified in the corresponding
Dockerfiles in dist/images.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
This picks up the new Chassis_Template_Var support in OVN.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
This picks up the following fixes (only mentioning relevant ones):
  2bab96e899b5 ("northd: prevents sending packet to conntrack for router ports")
  d01fdfdb2c97 ("lb: Allow IPv6 template LBs to use explicit backends.")
  6a16c741e5a1 ("controller: lflow: do not use tcp as default IP protocol for ct_snat_to_vip action")
  77384b7fe3f7 ("northd: Drop packets for LBs with no backends")
  81eaa98bbb60 ("northd: Use generic ct.est flows for LR LBs")
  0af110c400cc ("northd: drop ct.inv packets in post snat and lb_aff_learn stages")
  89fc85fa7f2b ("controller: Add config option per LB to enable/disable CT flush")

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Insert OVSDB mutations for columns of type "map" are performed only if
the value didn't already exist in the database (RFC 7047, section 5.1).
That means that mutations that update existing values must first delete
those.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
For now we only use templates for NodePort services that don't have
ETP=local set or affinity timeout.  The latter is not supported in
core OVN but the former can probably be implemented as a follow up
change.

With load balancer templates we're able to reduce the size of the NB and
amount of work ovn-northd has to perform by a factor of N (where N is
the number of nodes).  In particular, for a cluster with N nodes and S
NodePort services:
- before this change we would configure O(N * S) load balancers (two per
  node for every service)
- with this change this gets reduced to O(S) load balancers (four per
  service).  We do have to configure N template variables per service
  but the impact of those on the amount of work ovn-northd has to
  perform is minimal as it will only have to propagate those records
  to the Southbound.

Instead of generating a load balancer per service per node we now
generate at most four (template) load balancers (IPv4/IPv6 applicable to
gateway routers or applicable to node switches).

Each template load balancer uses as VIP a template variable denoting
the node IP, e.g., NODEIP_IPv4 and as backend another template variable,
e.g., SVC_node_port_svc__UDP__30957__node__router__template__IPv4.

These template variables have potentially different values on different
chassis. For example, NODEIP_IPv4 will expand to chassis-X's IP on
chassis-X and to chassis-Y's IP on chassis-Y.

Similarly, SVC_node_port_svc__UDP__30957__node__router__template__IPv4
will expand on chassis-X to the set of backends that are required by
the NodePort service on chassis-X and that might be different than the
set of backends required on chassis-Y.

The node IP template variables are managed by the services controller.
Together with the nodeInfo structs provided by the node tracker they
represent the node-specific information.  A RWMutex, nodeInfoRWLock,
is used to protect access to this node specific information.  When
writing this data (node_tracker) the mutex is taken for writing and
when using it (service processing) the mutex is taken for reading
(to reduce contention between service controller threads).

The alreadyAppliedLock is repurposed into a RWMutex too in order to
further reduce contention between service controller threads.

More information about the way OVN component templates work and
synthetic benchmark results can be found at:
https://mail.openvswitch.org/pipermail/ovs-dev/2022-November/399522.html
and
https://www.openvswitch.org/support/ovscon2022/slides/Horizontal-scaling.pdf

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
In general, template OVN load balancers used for implementing NodePort
services might have completely different sets of backends on different
nodes.

In particular however, there are quite a few cases in which node port
services end up using the same backends on all nodes, e.g., for services
that don't have ETP/ITP=local and that select only OVN networked pods as
backends.

Detect such situations and avoid using templated backends.  Use instead a
shared list of backend addresses.  This reduces NB DB size of the
Chassis_Template_Var table, from O(N * S) to O(N), where N is the
number of nodes and S is the number of node port services that match the
criteria defined above.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
…lb-templates-merge-downstream

Conflicts:
	go-controller/pkg/ovn/controller/services/lb_config.go
because we have to take care of ETP=local OCPHACK:
openshift#1622
	go-controller/pkg/ovn/controller/services/loadbalancer.go
because openshift#1652 went in
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 25, 2023
@openshift-ci-robot
Copy link
Contributor

@tssurya: This pull request references Jira Issue OCPBUGS-9825, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Requirements for Merge:

@trozet or @jcaamano for approve.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dcbw
Copy link
Contributor

dcbw commented May 25, 2023

/approve

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 25, 2023
@dcbw dcbw removed the lgtm Indicates that a PR is ready to be merged. label May 25, 2023
@tssurya
Copy link
Contributor Author

tssurya commented May 25, 2023

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 25, 2023
@tssurya
Copy link
Contributor Author

tssurya commented May 25, 2023

/retest-required

1 similar comment
@tssurya
Copy link
Contributor Author

tssurya commented May 26, 2023

/retest-required

@npinaeva
Copy link
Contributor

/lgtm for https://github.com/openshift/ovn-kubernetes/pull/1652/files conflicts

@openshift-ci-robot
Copy link
Contributor

@tssurya: This pull request references Jira Issue OCPBUGS-9825, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

Details

In response to this:

Requirements for Merge:

@trozet or @jcaamano for approve.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tssurya
Copy link
Contributor Author

tssurya commented May 26, 2023

/test images

1 similar comment
@dcbw
Copy link
Contributor

dcbw commented May 26, 2023

/test images

@ricky-rav
Copy link
Contributor

@ricky-rav
Copy link
Contributor

This openshift-ovn-kubernetes-master-unit CI job has actually been giving this behaviour since mid march. o_O
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-master-unit?buildId=1639190040091299840
It shows it's been successful, but it hasn't run anything.

The last job I see that actually ran all unit tests as expected is from March 21st:

1638213514390671360 | #1596[WIP] [DNM] dummy commit to test aws upgrade (365673e) by trozet | Mar 21 17:18:38Mar 21 2023, 17:18:38 UTC+0100 | 14m37s | SUCCESS

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1596/pull-ci-openshift-ovn-kubernetes-master-unit/1638213514390671360

In particular, see the logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1596/pull-ci-openshift-ovn-kubernetes-master-unit/1638213514390671360/artifacts/test/build-log.txt

@tssurya
Copy link
Contributor Author

tssurya commented May 26, 2023

/test images

@tssurya
Copy link
Contributor Author

tssurya commented May 26, 2023

/retest

@tssurya
Copy link
Contributor Author

tssurya commented May 26, 2023

This openshift-ovn-kubernetes-master-unit CI job has actually been giving this behaviour since mid march. o_O https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-master-unit?buildId=1639190040091299840 It shows it's been successful, but it hasn't run anything.

The last job I see that actually ran all unit tests as expected is from March 21st:

1638213514390671360 | #1596[WIP] [DNM] dummy commit to test aws upgrade (365673e) by trozet | Mar 21 17:18:38Mar 21 2023, 17:18:38 UTC+0100 | 14m37s | SUCCESS

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1596/pull-ci-openshift-ovn-kubernetes-master-unit/1638213514390671360

In particular, see the logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1596/pull-ci-openshift-ovn-kubernetes-master-unit/1638213514390671360/artifacts/test/build-log.txt

ROFL! This is cracking me up so bad.. :) Well glad we caught this snafu and this explains why the OCPHACK tests "passed" for me on the PR and failed for you locally :D

@tssurya
Copy link
Contributor Author

tssurya commented May 26, 2023

ok let's see how upgrades behaves in this run
TODO: Figure out why unit tests are not running and add ricky-rav@d556bde as well to the PR

@dcbw
Copy link
Contributor

dcbw commented May 27, 2023

/retest-required

@tssurya
Copy link
Contributor Author

tssurya commented May 27, 2023

/test e2e-aws-ovn-upgrade-local-gateway

@openshift-ci-robot
Copy link
Contributor

@tssurya: This pull request references Jira Issue OCPBUGS-9825, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

Details

In response to this:

Requirements for Merge:

@trozet or @jcaamano or @dcbw for approve.

We also needed an OVN fix for LB templates with V6 family:

* Tue Mar 28 2023 Dumitru Ceara <dceara@redhat.com> - 23.03.0-14
- lb: Allow IPv6 template LBs to use explicit backends.
[Upstream: d01fdfdb2c97222cf326c8ab5579f670ded6e3cb]

I am bumping OVN rpms to ovn23.03-23.03.0-49.el9fdp. See https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2512984 for a CHANGE log.

brew list-tagged  rhaos-4.14-rhel-9-candidate | grep ovn
ovn23.03-23.03.0-49.el9fdp                rhaos-4.14-rhel-9-candidate  mmichels
ovn23.03-23.03.0-7.el9fdp                 rhaos-4.14-rhel-9-candidate  dceara

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tssurya
Copy link
Contributor Author

tssurya commented May 27, 2023

/test e2e-aws-ovn-upgrade-local-gateway

3 similar comments
@tssurya
Copy link
Contributor Author

tssurya commented May 28, 2023

/test e2e-aws-ovn-upgrade-local-gateway

@tssurya
Copy link
Contributor Author

tssurya commented May 29, 2023

/test e2e-aws-ovn-upgrade-local-gateway

@tssurya
Copy link
Contributor Author

tssurya commented May 30, 2023

/test e2e-aws-ovn-upgrade-local-gateway

@openshift-ci-robot
Copy link
Contributor

@tssurya: This pull request references Jira Issue OCPBUGS-9825, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

Details

In response to this:

Requirements for Merge:

@trozet or @jcaamano or @dcbw for approve.

We also needed an OVN fix for LB templates with V6 family:

* Tue Mar 28 2023 Dumitru Ceara <dceara@redhat.com> - 23.03.0-14
- lb: Allow IPv6 template LBs to use explicit backends.
[Upstream: d01fdfdb2c97222cf326c8ab5579f670ded6e3cb]

I am bumping OVN rpms to ovn23.03-23.03.0-49.el9fdp. See https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2512984 for a CHANGE log.

brew list-tagged  rhaos-4.14-rhel-9-candidate | grep ovn
ovn23.03-23.03.0-49.el9fdp                rhaos-4.14-rhel-9-candidate  mmichels
ovn23.03-23.03.0-7.el9fdp                 rhaos-4.14-rhel-9-candidate  dceara

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 30, 2023

@tssurya: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-openstack-ovn 34cc232 link false /test e2e-openstack-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@tssurya
Copy link
Contributor Author

tssurya commented May 30, 2023

need an override for lgw gateway upgrade. See https://issues.redhat.com/browse/OCPBUGS-14247
@dcbw @trozet

@tssurya
Copy link
Contributor Author

tssurya commented May 30, 2023

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 30, 2023
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 0f2fbd9 and 2 for PR HEAD 34cc232 in total

@tssurya
Copy link
Contributor Author

tssurya commented May 30, 2023

/test e2e-aws-ovn-upgrade-local-gateway

@tssurya
Copy link
Contributor Author

tssurya commented May 30, 2023

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 30, 2023
@trozet
Copy link
Contributor

trozet commented May 30, 2023

I think we found the root cause of lgw upgrade failure: openshift/cluster-node-tuning-operator#668

lets not block this from merging any longer

/override ci/prow/e2e-aws-ovn-upgrade-local-gateway

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 30, 2023

@trozet: Overrode contexts on behalf of trozet: ci/prow/e2e-aws-ovn-upgrade-local-gateway

Details

In response to this:

I think we found the root cause of lgw upgrade failure: openshift/cluster-node-tuning-operator#668

lets not block this from merging any longer

/override ci/prow/e2e-aws-ovn-upgrade-local-gateway

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tssurya
Copy link
Contributor Author

tssurya commented May 30, 2023

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 30, 2023
@openshift-merge-robot openshift-merge-robot merged commit 8adae67 into openshift:master May 30, 2023
@openshift-ci-robot
Copy link
Contributor

@tssurya: Jira Issue OCPBUGS-9825: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-9825 has been moved to the MODIFIED state.

Details

In response to this:

Requirements for Merge:

@trozet or @jcaamano or @dcbw for approve.

We also needed an OVN fix for LB templates with V6 family:

* Tue Mar 28 2023 Dumitru Ceara <dceara@redhat.com> - 23.03.0-14
- lb: Allow IPv6 template LBs to use explicit backends.
[Upstream: d01fdfdb2c97222cf326c8ab5579f670ded6e3cb]

I am bumping OVN rpms to ovn23.03-23.03.0-49.el9fdp. See https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2512984 for a CHANGE log.

brew list-tagged  rhaos-4.14-rhel-9-candidate | grep ovn
ovn23.03-23.03.0-49.el9fdp                rhaos-4.14-rhel-9-candidate  mmichels
ovn23.03-23.03.0-7.el9fdp                 rhaos-4.14-rhel-9-candidate  dceara

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants