
CORENET-5625, OCPBUGS-54245, SDN-5772: Downstream merge 2025-03-21#2501

Merged
openshift-merge-bot[bot] merged 60 commits into openshift:master from jcaamano:dmerge-20250321
Mar 26, 2025


@jcaamano

Contributor

hareeshpc and others added 30 commits February 28, 2025 18:46
For host networking, external bridge acts as the input/output port
with Node IP configured on the bridge itself as a local port.

When hardware-acceleration-capable devices such as ConnectX or
Bluefield2 cards are used, pods can use hardware-accelerated
Virtual Functions (VFs) or SubFunctions (SFs) as interfaces
and fully offload all Kubernetes traffic flows.
But for host networking pods or when the host itself is the traffic
endpoint, not all kubernetes flows are accelerated since current
CT infrastructure cannot offload CT flows where external bridge
is the in/out port.

To allow accelerated traffic flows for host networking, this patch
allows specifying a gateway accelerated interface via the
`--gateway-accelerated-interface` flag. This can either be a
switchdev VF or SF, connected to the external bridge and holding
the Node IP.
                        ┌──────────┐
                        │  br-ext  │
                  ┌─────┴──┐       │    ┌──────────┐
                  │  eth0  │       │    │  br-int  │
                  └─────┬──┘       │    │          │
                        │          X────X          │
   ┌────────┐     ┌─────┴──┐       │    │          │
   │ eth0v0 ├─────┤ eth0_0 │       │    │          │
   └────────┘     └─────┬──┘       │    └──────────┘
     NODE_IP            │          │
                        └──────────┘

where eth0v0 and eth0_0 are, for example, the VF and VF representor of the eth0 uplink.
Note that the netdevice used must be excluded from device plugin pools,
so it won't be used for workload pods.

This flag is mutually exclusive with the existing
`--gateway-interface` gateway option.
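
A minimal sketch of how the mutual exclusion between the two flags could be validated at startup. It uses Go's standard `flag` package rather than whatever CLI library the project actually uses, and `gatewayOpts`, `validateGatewayFlags` and `parseGatewayFlags` are hypothetical names; only the two flag names come from the commit message.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// gatewayOpts is a hypothetical container for the two gateway flags.
type gatewayOpts struct {
	Interface            string // value of --gateway-interface
	AcceleratedInterface string // value of --gateway-accelerated-interface
}

// validateGatewayFlags rejects configurations that set both flags.
func validateGatewayFlags(o gatewayOpts) error {
	if o.Interface != "" && o.AcceleratedInterface != "" {
		return errors.New("--gateway-interface and --gateway-accelerated-interface are mutually exclusive")
	}
	return nil
}

// parseGatewayFlags parses args into gatewayOpts and validates the result.
func parseGatewayFlags(args []string) (gatewayOpts, error) {
	var o gatewayOpts
	fs := flag.NewFlagSet("gateway", flag.ContinueOnError)
	fs.StringVar(&o.Interface, "gateway-interface", "", "gateway interface")
	fs.StringVar(&o.AcceleratedInterface, "gateway-accelerated-interface", "", "accelerated gateway interface (switchdev VF or SF)")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	return o, validateGatewayFlags(o)
}

func main() {
	// Setting both flags is expected to be rejected.
	_, err := parseGatewayFlags([]string{"--gateway-interface=eth0", "--gateway-accelerated-interface=eth0v0"})
	fmt.Println("both flags:", err)

	// Setting only the accelerated interface is accepted.
	o, err := parseGatewayFlags([]string{"--gateway-accelerated-interface=eth0v0"})
	fmt.Println("accelerated only:", o.AcceleratedInterface, err)
}
```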

Signed-off-by: Hareesh Puthalath <hareeshp@nvidia.com>
Use accelerated device as Gateway interface
If MultiProtocol is enabled (default) then a BGP session
carries prefixes of both IPv4 and IPv6 families. Our problem is
that with an IPv4 session, FRR can incorrectly pick the
masquerade IPv6 address (instead of the real address) as next hop
for IPv6 prefixes and that won't work. Note that with a dedicated
IPv6 session that can't happen since FRR will use the same
address that was used to establish the session. Let's
enforce the use of DisableMP for now.
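
A rough sketch of the kind of check this implies: reject any BGP neighbor that does not have DisableMP set. The `neighbor` type below is a simplified, hypothetical stand-in and not the real FRR configuration API type; `validateNeighbors` is likewise an illustrative name.

```go
package main

import "fmt"

// neighbor is a simplified, hypothetical stand-in for a BGP neighbor entry.
type neighbor struct {
	Address   string
	DisableMP bool
}

// validateNeighbors returns an error for the first neighbor that leaves
// DisableMP unset, mirroring the "fail if DisableMP is unset" rule.
func validateNeighbors(neighbors []neighbor) error {
	for _, n := range neighbors {
		if !n.DisableMP {
			return fmt.Errorf("neighbor %s: DisableMP must be set", n.Address)
		}
	}
	return nil
}

func main() {
	ok := []neighbor{{Address: "192.0.2.1", DisableMP: true}}
	bad := []neighbor{{Address: "192.0.2.1", DisableMP: true}, {Address: "2001:db8::1"}}
	fmt.Println(validateNeighbors(ok))
	fmt.Println(validateNeighbors(bad))
}
```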

Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
On every node update we were syncing the node in cluster manager. While
there were checks in place to limit updating the node annotation, there
were no checks in place to limit the other functionality (like marking
subnets allocated). This code would execute every time, which would spam
the logs with messages like:

2025-02-14T01:25:53.598240753Z I0214 01:25:53.598230       1 node_allocator.go:510] Allowed existing subnets [10.132.5.0/24] on node ip-10-0-58-12.us-west-2.compute.internal
2025-02-14T01:25:53.598305025Z I0214 01:25:53.598279       1 node_allocator.go:510] Allowed existing subnets [10.132.8.0/24] on node ip-10-0-114-225.us-west-2.compute.internal
2025-02-14T01:25:53.598305025Z I0214 01:25:53.594125       1 node_allocator.go:488] Valid subnet 10.132.21.0/24 allocated on node ip-10-0-58-12.us-west-2.compute.internal
2025-02-14T01:25:53.598305025Z I0214 01:25:53.594137       1 node_allocator.go:488] Valid subnet 10.132.28.0/24 allocated on node ip-10-0-58-12.us-west-2.compute.internal
2025-02-14T01:25:53.598305025Z I0214 01:25:53.594143       1 node_allocator.go:488] Valid subnet 10.132.4.0/24 allocated on node ip-10-0-58-12.us-west-2.compute.internal
2025-02-14T01:25:53.598305025Z I0214 01:25:53.594148       1 node_allocator.go:488] Valid subnet 10.132.6.0/24 allocated on node ip-10-0-58-12.us-west-2.compute.internal

This floods the log. The "Valid subnet" message just appears whenever a
subnet is marked as allocated. It doesn't mean anything new was allocated.
Removed this log. The "Allowed existing subnets" message just means the
existing subnets on the node were already allocated. These log messages
also don't reference the network name, so they are pretty useless. Logs
remain which indicate if new subnets were allocated and for what network.

Additionally, we don't need to run the update logic if the node was
already sync'ed on node add. Once the node is allocated, nothing changes
on the node that would force us to need to allocate again (other than a
node going from hybrid overlay -> ovn). Added a sync map to track if a
node needs to be updated again.
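
The tracking idea can be sketched with Go's `sync.Map`. This is an illustration only: `nodeTracker` and its methods are hypothetical names, and the real cluster manager code may track more state than a simple "already synced" marker.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeTracker is a hypothetical helper that remembers which nodes have
// already been synced, so update events for them can be skipped.
type nodeTracker struct {
	synced sync.Map // node name -> struct{}
}

// needsSync reports whether the node has not been synced yet.
func (t *nodeTracker) needsSync(node string) bool {
	_, ok := t.synced.Load(node)
	return !ok
}

// forget drops the marker, e.g. on node delete, forcing a later resync.
func (t *nodeTracker) forget(node string) { t.synced.Delete(node) }

// handleUpdate runs syncFn only for nodes that still need a sync. It
// returns whether syncFn ran. On failure the node stays unmarked so the
// next event retries it.
func (t *nodeTracker) handleUpdate(node string, syncFn func(string) error) (bool, error) {
	if !t.needsSync(node) {
		return false, nil
	}
	if err := syncFn(node); err != nil {
		return true, err
	}
	t.synced.Store(node, struct{}{})
	return true, nil
}

func main() {
	tracker := &nodeTracker{}
	calls := 0
	syncFn := func(string) error { calls++; return nil }
	for i := 0; i < 3; i++ {
		if _, err := tracker.handleUpdate("node1", syncFn); err != nil {
			fmt.Println("sync failed:", err)
		}
	}
	fmt.Println("sync calls:", calls) // prints "sync calls: 1"
}
```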

Finally, simplified some of the logic in the sync node network
annotations. No need to annotate the network id on the node unless it
already existed and is somehow incorrect. Also only release the tunnel
ID if it was allocated and failed to be annotated.

Signed-off-by: Tim Rozet <trozet@redhat.com>
On every node event, ZCC will call kube patch. Reduce it to a single
time. Before patch:

trozet@fedora:~/go/src/github.com/ovn-org/ovn-kubernetes/go-controller$ go test -mod=vendor -v ./pkg/clustermanager -ginkgo.v -ginkgo.focus=".*Node subnet allocations.*Linux nodes$" | grep -i "setting annotations"
I0218 11:18:14.168191  310203 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:3] on node node1
I0218 11:18:14.168187  310203 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:2] on node node2
I0218 11:18:14.168200  310203 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:4] on node node3
I0218 11:18:14.168964  310203 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:3] on node node1
I0218 11:18:14.169120  310203 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:4] on node node3
I0218 11:18:14.169152  310203 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:2] on node node2
I0218 11:18:14.169395  310203 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:3] on node node1
I0218 11:18:14.169430  310203 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:2] on node node2
I0218 11:18:14.169492  310203 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:4] on node node3

After patch:
trozet@fedora:~/go/src/github.com/ovn-org/ovn-kubernetes/go-controller$ go test -mod=vendor -v ./pkg/clustermanager -ginkgo.v -ginkgo.focus=".*Node subnet allocations.*Linux nodes$" | grep -i "setting annotations"
I0218 11:28:16.991114  338949 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:2] on node node2
I0218 11:28:16.991133  338949 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:4] on node node3
I0218 11:28:16.991130  338949 kube.go:130] Setting annotations map[k8s.ovn.org/node-id:3] on node node1

Signed-off-by: Tim Rozet <trozet@redhat.com>
Out of an abundance of caution, check that a node has annotations before
skipping it during the update event. The only reasons I can think of
for this being necessary are:

1. we missed an add event (kapi informer bug)
2. someone deleted the annotation on the node

For context:
I don't think our other handlers on the ovnkube-controller side handle
the above scenarios correctly. For example, in the node event handler we
only check the sync map to see whether gatewayInit failed, and if it did
not, we ignore the update. If we missed the add event, we would never
have processed the node, which could result in a permanent failure.

Signed-off-by: Tim Rozet <trozet@redhat.com>
Function was updated for node network controllers but was not for zone
network controllers. It needs to try to find the networkID from the NAD
first instead of the nodes.

Signed-off-by: Tim Rozet <trozet@redhat.com>
RouteAdvertisements: fail if DisableMP is unset
The networkID is stored in the NAD itself, and the network manager code
in OVNK will refuse to start the network controller if it does not have
the networkID. For backwards compatibility, when the NAD syncAll happens
it checks for the networkID on a node and then copies it as well to the
NAD in case it was missing previously.

There were stale functions in these network controllers
that were relying on setting a cached struct value of networkID, derived
from either the NAD or from the annotation on nodes at runtime. This is
duplicate information as the controllers all hold a reference to the NAD
itself, which is updated through network controller reconciliation.

This commit removes the controller struct variables that store networkID
and instead relies on the embedded NAD to get it. It also removes the
network controllers' lookup of networkID from nodes. The controllers should all
have the networkID on start up derived from the associated NAD.

Signed-off-by: Tim Rozet <trozet@redhat.com>
Just make them consistent.

Signed-off-by: Tim Rozet <trozet@redhat.com>
InvalidID was being used for both networkID and tunnelID. noID was
previously used for just tunnelID and I overloaded it to be used for
networkID as well. This was not a great choice as it causes even more
confusion because noID (value 0) has the same value as DefaultNetworkID.

This commit refactors the variables and moves them into our global
constants file. It changes noID to be noTunnelID and declares
DefaultNetworkID there in a single place. It also creates a noNetworkID
with a value that doesn't collide with DefaultNetworkID. Now logically
the code should be much easier to read.
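
To make the collision concrete, here is a small sketch of the constants after the refactor. The names follow the commit message, but the exact casing, package, and in particular the value chosen for `noNetworkID` are assumptions; the commit only states that it must not collide with `DefaultNetworkID`.

```go
package main

import "fmt"

const (
	// DefaultNetworkID is the ID of the default network (0 per the commit message).
	DefaultNetworkID = 0
	// noTunnelID keeps the value the old noID had (0).
	noTunnelID = 0
	// noNetworkID marks "no network ID assigned". The value -1 is an
	// assumption for illustration; it only has to differ from DefaultNetworkID.
	noNetworkID = -1
)

// hasNetworkID reports whether id refers to an assigned network ID.
func hasNetworkID(id int) bool { return id != noNetworkID }

// hasTunnelID reports whether id refers to an assigned tunnel ID.
func hasTunnelID(id int) bool { return id != noTunnelID }

func main() {
	// With a shared "no ID" value of 0, the default network would have
	// looked unassigned. With a distinct noNetworkID it does not.
	fmt.Println(hasNetworkID(DefaultNetworkID)) // true
	fmt.Println(hasNetworkID(noNetworkID))      // false
	fmt.Println(hasTunnelID(noTunnelID))        // false
}
```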

Also removes a function and unit test that are no longer needed.

Signed-off-by: Tim Rozet <trozet@redhat.com>
Limit cluster manager node allocator updates/logs
We add the current host as a printerColumn to have a nicer way to understand
which node is hosting the service:
```
$ kubectl get egressservice
NAME              ASSIGNED HOST
example-service   ovn-worker
```

Signed-off-by: Ori Braunshtein <obraunsh@redhat.com>
EgressService: add additionalPrinterColumn for .status.host
GetActiveNetworkForNamespaceFast returns the primary network for the
namespace if any or the default network otherwise. It is faster than
GetActiveNetworkForNamespace because it does not copy the network and it
does not verify against UDNs. To be used by controllers capable of
reconciling primary network changes.

Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
Add support to advertise EIPs for UDNs in cluster manager
RouteAdvertisements controller.

Selected Egress IPs are those that
- are served on the same namespaces as where the selected
  networks are serving, and
- are assigned to a selected node
- are on the default network subnet for that node
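
The three criteria above can be read as a single predicate. The sketch below is illustrative only: `egressIP`, `selection`, `newSelection` and `selects` are hypothetical names and simplified types, not the RouteAdvertisements controller's actual data model.

```go
package main

import (
	"fmt"
	"net"
)

// egressIP is a hypothetical, simplified view of an assigned Egress IP.
type egressIP struct {
	IP        string // the Egress IP address
	Namespace string // namespace the Egress IP serves
	Node      string // node the Egress IP is assigned to
}

// selection is a hypothetical summary of what a RouteAdvertisements selects.
type selection struct {
	namespaces  map[string]bool       // namespaces served by the selected networks
	nodeSubnets map[string]*net.IPNet // selected node -> its default network subnet
}

// newSelection builds a selection from namespace names and node -> CIDR pairs.
func newSelection(namespaces []string, nodeSubnets map[string]string) (selection, error) {
	s := selection{namespaces: map[string]bool{}, nodeSubnets: map[string]*net.IPNet{}}
	for _, ns := range namespaces {
		s.namespaces[ns] = true
	}
	for node, cidr := range nodeSubnets {
		_, subnet, err := net.ParseCIDR(cidr)
		if err != nil {
			return selection{}, fmt.Errorf("node %s: %w", node, err)
		}
		s.nodeSubnets[node] = subnet
	}
	return s, nil
}

// selects applies the three criteria: namespace, node, and subnet membership.
func (s selection) selects(e egressIP) bool {
	if !s.namespaces[e.Namespace] {
		return false
	}
	subnet, ok := s.nodeSubnets[e.Node]
	if !ok {
		return false
	}
	ip := net.ParseIP(e.IP)
	return ip != nil && subnet.Contains(ip)
}

func main() {
	s, err := newSelection([]string{"blue"}, map[string]string{"node1": "10.0.58.0/24"})
	if err != nil {
		panic(err)
	}
	fmt.Println(s.selects(egressIP{IP: "10.0.58.100", Namespace: "blue", Node: "node1"}))
	fmt.Println(s.selects(egressIP{IP: "192.0.2.10", Namespace: "blue", Node: "node1"}))
}
```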

Egress IPs, just as with Pod IPs, will be advertised on routers
on the target VRF on the selected nodes.

`auto` is not supported as target VRF for Egress IPs.

Better support for Egress IPs on subnets other than the default network
node subnet, including any support for VRF-Lite interface subnets, is
left for a future exercise. We would need cluster manager to be able
to:
- map non VRF-Lite interface subnets to the proper BGP sessions
- tell apart VRF-Lite interface subnets from other secondary interface subnets

Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
In 1.0.1 `endPort` support was added.

Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
For example, to focus on a given test here is what I write (exact string):
Multi Homing multiple pods connected to the same OVN-K secondary network
multi-network policies multi-network policies configure traffic allow
lists for a pure L2 overlay when the multi-net policy describes the
allow-list using pod selectors

Now it will be:
Multi Homing multiple pods connected to the same OVN-K secondary network
with multi-network policies that configure traffic allow lists using
pod selectors for a pure L2 overlay

Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
Add support to advertise EIPs for UDNs
Update community meeting timing and platform details
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Split the NAD Spec generation from the NAD generation.
This will be useful in future commits when only the NAD.spec will need
to be patched.

Signed-off-by: Ram Lavi <ralavi@redhat.com>
Add test to check MTU on pod is updated both before and after NAD
reconcile.

Signed-off-by: Ram Lavi <ralavi@redhat.com>
Add a test that changes the available IP allocation to a specific range,
then makes sure a new pod follows the new restriction.

Signed-off-by: Ram Lavi <ralavi@redhat.com>
Add tests that make sure that:
- the N/S connectivity is broken after the NAD's VLAN-ID is updated.
- the N/S connectivity is restored after the server networking is
reconfigured to the new VLAN-ID.

Signed-off-by: Ram Lavi <ralavi@redhat.com>
KubeVirt v1.5.0 breaks TCP connections at live migration during our
e2e tests. This change pins KubeVirt to the last known good version,
v1.4.0.

https://github.com/kubevirt/kubevirt/releases/tag/v1.5.0

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Ram Lavi <ralavi@redhat.com>
When there are no available IP addresses in the IP pool, there is no
indication sent to the pod, and it ends up hanging with the generic
warning event: failed to get pod annotation.

Add an event indicating that the lack of available IPs in the pool is
the cause of the failure.

Signed-off-by: Ram Lavi <ralavi@redhat.com>
@jluhrsen

Contributor

/retest

@openshift-ci-robot

Contributor

/retest-required

Remaining retests: 0 against base HEAD 12b33c1 and 2 for PR HEAD b8ca158 in total

@maiqueb

maiqueb commented Mar 26, 2025

Contributor

/retitle OCPBUGS-54245, SDN-5772: Downstream merge 2025-03-21

@openshift-ci openshift-ci Bot changed the title from "SDN-5772: Downstream merge 2025-03-21" to "OCPBUGS-54245, SDN-5772: Downstream merge 2025-03-21" Mar 26, 2025
@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Mar 26, 2025
@openshift-ci-robot

openshift-ci-robot commented Mar 26, 2025

Contributor

@jcaamano: This pull request references Jira Issue OCPBUGS-54245, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is MODIFIED instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references SDN-5772 which is a valid jira issue.

Details

In response to this:

cc @trozet @tssurya @hareeshpc @oribon @npinaeva @RamLavi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jcaamano

Contributor Author

/retest

@maiqueb

maiqueb commented Mar 26, 2025

Contributor

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Mar 26, 2025

Contributor

@maiqueb: This pull request references Jira Issue OCPBUGS-54245, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is MODIFIED instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

This pull request references SDN-5772 which is a valid jira issue.

Details

In response to this:

/jira refresh


@maiqueb

maiqueb commented Mar 26, 2025

Contributor

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 26, 2025
@openshift-ci-robot

openshift-ci-robot commented Mar 26, 2025

Contributor

@maiqueb: This pull request references Jira Issue OCPBUGS-54245, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (ysegev@redhat.com), skipping review request.

This pull request references SDN-5772 which is a valid jira issue.

Details

In response to this:

/jira refresh


@openshift-ci

openshift-ci Bot commented Mar 26, 2025

Contributor

@jcaamano: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | b8ca158 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/security | b8ca158 | link | false | /test security |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 0f6638a into openshift:master Mar 26, 2025
@openshift-ci-robot

Contributor

@jcaamano: Jira Issue OCPBUGS-54245: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-54245 has been moved to the MODIFIED state.

Details

In response to this:

cc @trozet @tssurya @hareeshpc @oribon @npinaeva @RamLavi


@openshift-bot

Contributor

[ART PR BUILD NOTIFIER]

Distgit: ovn-kubernetes-base
This PR has been included in build ose-ovn-kubernetes-base-container-v4.20.0-202503262140.p0.g0f6638a.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot

Contributor

[ART PR BUILD NOTIFIER]

Distgit: ovn-kubernetes-microshift
This PR has been included in build ovn-kubernetes-microshift-container-v4.20.0-202503262140.p0.g0f6638a.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot

Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-ovn-kubernetes
This PR has been included in build ose-ovn-kubernetes-container-v4.20.0-202503262140.p0.g0f6638a.assembly.stream.el9.
All builds following this will include this PR.

@openshift-merge-robot

Contributor

Fix included in accepted release 4.19.0-0.nightly-2025-04-02-065200

@openshift-merge-robot

Contributor

Fix included in accepted release 4.19.0-0.nightly-2025-04-02-170034

@openshift-merge-robot

Contributor

Fix included in accepted release 4.19.0-0.nightly-2025-04-04-023411

@openshift-merge-robot

Contributor

Fix included in accepted release 4.19.0-0.nightly-2025-04-04-170728

trozet added a commit to trozet/ovn-kubernetes-1 that referenced this pull request May 30, 2025
Before openshift#2501

Signed-off-by: Tim Rozet <trozet@redhat.com>
@maiqueb

maiqueb commented Mar 19, 2026

Contributor

/retitle CORENET-5625, OCPBUGS-54245, SDN-5772: Downstream merge 2025-03-21

@openshift-ci openshift-ci Bot changed the title from "OCPBUGS-54245, SDN-5772: Downstream merge 2025-03-21" to "CORENET-5625, OCPBUGS-54245, SDN-5772: Downstream merge 2025-03-21" Mar 19, 2026
@openshift-ci-robot

Contributor

@jcaamano: Jira Issue OCPBUGS-54245 is in an unrecognized state (Closed) and will not be moved to the MODIFIED state.

Details

In response to this:

cc @trozet @tssurya @hareeshpc @oribon @npinaeva @RamLavi


@maiqueb

maiqueb commented Mar 19, 2026

Contributor

/jira refresh

@openshift-ci-robot

Contributor

@maiqueb: Jira Issue OCPBUGS-54245 is in an unrecognized state (Closed) and will not be moved to the MODIFIED state.

Details

In response to this:

/jira refresh



Labels

acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
