OCPBUGS-49368, OCPBUGS-62895, OCPBUGS-70130, OCPBUGS-74164, OCPBUGS-66267, OCPBUGS-65114: DownStream Merge [02-12-2026] #2980
Remove the temporary migration code that was added in 2023 to support the transition to OVN Interconnect (IC) architecture. This HACK code tracked whether remote zone nodes had completed migration using the "k8s.ovn.org/remote-zone-migrated" annotation. This code is no longer needed.
Changes:
- Remove OvnNodeMigratedZoneName constant and helper functions (SetNodeZoneMigrated, HasNodeMigratedZone, NodeMigratedZoneAnnotationChanged)
- Remove migrated field from nodeInfo struct in node_tracker.go
- Simplify isLocalZoneNode() in base_network_controller.go and egressip.go
- Remove HACK helper functions (checkOVNSBNodeLRSR, fetchLBNames, lbExists, portExists) and migration sync flow from default_node_network_controller.go
- Remove remote-zone-migrated annotation from webhook allowed annotations
- Update tests to remove references to the migration annotation
Assisted by Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
Signed-off-by: Soli0222 <github@str08.net>
Signed-off-by: Soli0222 <github@str08.net>
Signed-off-by: Soli0222 <github@str08.net>
Signed-off-by: Jean Chen <jechen@redhat.com>
Add PodSecurity compliance to util.go
Remove IC zone migration HACK code
The layer2 UDN cleanup tests for IC clusters were failing because of a
zone mismatch between the controller and the test node:
- Controller zone: read from NBGlobal.Name ("global")
- Node zone: set via annotation ("test" when IC enabled)
This mismatch was previously masked in two spots:
1. The HACK in isLocalZoneNode() (removed by commit 7d408c1):
When the controller's zone was "global" (the default), the HACK
bypassed the zone comparison entirely and instead checked whether
the node had a migration annotation. Since the test node had no
migration annotation, it was treated as local despite the zone
mismatch.
2. Unconditional gateway cleanup in deleteNodeEvent (changed by
commit 8725a93 to only clean up nodes tracked in localZoneNodes)
With both items above removed/changed, the test correctly fails because
the node is treated as remote (zones don't match), so it's not added to
localZoneNodes, and cleanup is skipped.
Fix the test by:
- using setupConfig() to set config.Default.Zone to testICZone when IC
is enabled
- setting NBGlobal.Name to config.Default.Zone (which setupConfig()
already configured correctly)
This ensures the controller and node are in the same zone, so the node
is correctly treated as local and its gateway entities are cleaned up.
🤖 Assisted by [Claude Code](https://claude.com/claude-code)
Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
Fix IC cluster cleanup tests zone configuration
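The zone comparison described in this commit message can be illustrated with a minimal Go sketch. It is not the actual isLocalZoneNode() from base_network_controller.go: the node type, the nodeZone helper, the annotation key and the fallback to "global" for unannotated nodes are assumptions made for illustration only.

```go
package main

import "fmt"

// Assumed annotation key and default zone name, for illustration only.
const (
	nodeZoneAnnotation = "k8s.ovn.org/zone-name"
	defaultZone        = "global"
)

// node is a hypothetical, pared-down stand-in for a Kubernetes Node object.
type node struct {
	name        string
	annotations map[string]string
}

// nodeZone returns the zone recorded on the node, falling back to the
// default zone when the annotation is absent (an assumption of this sketch).
func nodeZone(n node) string {
	if z, ok := n.annotations[nodeZoneAnnotation]; ok && z != "" {
		return z
	}
	return defaultZone
}

// isLocalZoneNode compares zones directly, with no migration special case.
func isLocalZoneNode(controllerZone string, n node) bool {
	return nodeZone(n) == controllerZone
}

func main() {
	testNode := node{
		name:        "test-node",
		annotations: map[string]string{nodeZoneAnnotation: "test"},
	}
	// Controller zone taken from NBGlobal.Name before the test fix.
	fmt.Println(isLocalZoneNode("global", testNode))
	// Controller zone after the test sets NBGlobal.Name to the IC test zone.
	fmt.Println(isLocalZoneNode("test", testNode))
}
```

With a direct comparison like this, the test node is only treated as local once the controller zone and the node annotation agree, which is what the test fix arranges.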
When an EndpointSlice for a UDP NodePort or LoadBalancer service is updated, stale conntrack entries for removed endpoints must be flushed. The existing logic failed to do this correctly if the backend pod was on a different node. This patch fixes the issue by flushing conntrack entries filtered on the nodePort when the node is not hosting the backend pod. When the backend pod is on the same node as the service, the issue does not occur, because deletePodConntrack removes all of the old pod's entries from the node when the pod is deleted. Signed-off-by: Peng Liu <pliu@redhat.com>
It should be able to preserve UDP traffic when server pod cycles for a NodePort service via a different node. Signed-off-by: Peng Liu <pliu@redhat.com>
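As a rough illustration of the selection logic described in the two commits above, the following Go sketch decides which removed endpoints need a nodePort-based conntrack flush on a given node. Every name in it (removedEndpoint, flushFilter, staleNodePortFilters) is hypothetical, and matching on the backend IP in addition to the nodePort is an assumption; the real patch may build its filter differently.

```go
package main

import "fmt"

// removedEndpoint is a hypothetical view of an endpoint that disappeared
// from an EndpointSlice update.
type removedEndpoint struct {
	IP       string
	NodeName string
}

// flushFilter is a hypothetical description of the conntrack entries to delete.
type flushFilter struct {
	Protocol string
	OrigPort int32  // original destination port, i.e. the service nodePort
	Backend  string // IP of the removed backend (assumed part of the match)
}

// staleNodePortFilters returns one filter per removed endpoint hosted on a
// different node. Endpoints on localNode are skipped because, per the commit
// message, their entries are already removed when the pod is deleted.
func staleNodePortFilters(localNode string, nodePort int32, removed []removedEndpoint) []flushFilter {
	var filters []flushFilter
	for _, ep := range removed {
		if ep.NodeName == localNode {
			continue
		}
		filters = append(filters, flushFilter{Protocol: "udp", OrigPort: nodePort, Backend: ep.IP})
	}
	return filters
}

func main() {
	removed := []removedEndpoint{
		{IP: "10.244.1.5", NodeName: "node-a"},
		{IP: "10.244.2.9", NodeName: "node-b"},
	}
	for _, f := range staleNodePortFilters("node-a", 30080, removed) {
		fmt.Printf("flush %s entries: orig dport=%d backend=%s\n", f.Protocol, f.OrigPort, f.Backend)
	}
}
```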
Signed-off-by: Tim Rozet <trozet@nvidia.com>
Even though kind-helm.sh was building ovn-kubernetes, it was pointing to an upstream image and never used the built image unless overridden. Align with kind.sh, which uses the built image by default. Move image functions from kind.sh to kind-common so that both scripts share a single set of functions. Signed-off-by: Tim Rozet <trozet@nvidia.com>
Signed-off-by: Tim Rozet <trozet@nvidia.com>
Would have modified an existing lane, but kind-helm doesn't support IPv6 yet. Will consolidate later. Signed-off-by: Tim Rozet <trozet@nvidia.com>
Fix spelling error in function name. Signed-off-by: Tim Rozet <trozet@nvidia.com>
Signed-off-by: Tim Rozet <trozet@nvidia.com>
There was duplication of a lot of variables. Move them to kind-common.sh. Signed-off-by: Tim Rozet <trozet@nvidia.com>
get_image/tag methods take an argument, but callers never actually pass one. They are only used in one place and each is basically a single operation, so just remove these useless methods. Signed-off-by: Tim Rozet <trozet@nvidia.com>
Signed-off-by: Tim Rozet <trozet@nvidia.com>
Cluster manager RBAC was missing this permission to Get FRR Configurations. Signed-off-by: Tim Rozet <trozet@nvidia.com>
kind-helm had its own version, let's just use the one from kind.sh. Signed-off-by: Tim Rozet <trozet@nvidia.com>
Multihoming was already being totally skipped for ipv6. Skip only the ipv6 and dual stack tests for ipv4. Signed-off-by: Tim Rozet <trozet@nvidia.com>
Signed-off-by: Tim Rozet <trozet@nvidia.com>
Move the labeling and taint removal into a common function used by kind and kind-helm. Ensure the HA labeling is only done when OVN_HA is true. Check whether or not to do taint removal for scheduling regardless. Signed-off-by: Tim Rozet <trozet@nvidia.com>
Extend the SpecGetter interface with GetTransport() and GetEVPNConfiguration() methods to access EVPN fields from ClusterUserDefinedNetwork specs. Add renderEVPNConfig() to translate EVPN configuration from the CUDN API to the CNI NetConf format. Signed-off-by: Matteo Dallaglio <mdallagl@redhat.com>
Implement VID (VLAN ID) allocation for EVPN networks to enable the
Linux bridge to map traffic to the correct VNI on the Single VXLAN
Device (SVD) architecture.
Changes:
- Add vidAllocator to UDN controller for cluster-wide VID allocation
- Allocate one VID per VRF (MAC-VRF and IP-VRF can have different VIDs)
- Add VID field to VRFConfig in CNI types
- Implement VID recovery on controller restart from existing NADs
- Release VIDs when CUDN is deleted
- Expose VID via NetInfo interface (EVPNMACVRFVID, EVPNIPVRFVID)
VIDs are allocated in range 1-4094 and stored in the NAD config.
========================================================================
EVPN VID (VLAN ID) Lifecycle
========================================================================
VIDs are cluster-wide unique identifiers allocated to EVPN networks for
use as VLAN IDs in the data plane, where the Linux bridge maps each VID
to the corresponding VNI. Each VRF (MAC-VRF or IP-VRF) in an EVPN network
requires its own VID, so a symmetric IRB network uses 2 VIDs.
Allocation Keys:
- MAC-VRF: "{networkName}/macvrf"
- IP-VRF: "{networkName}/ipvrf"
Lifecycle:
1. ALLOCATION: When a CUDN/UDN with EVPN transport is reconciled,
VIDs are allocated via allocateEVPNVIDsIfNeeded() and stored in
the NAD's JSON config. The id.Allocator is idempotent - calling
AllocateID with the same key returns the previously allocated VID.
2. PERSISTENCE: VIDs are persisted in the NAD spec.config JSON field.
The in-memory allocator is not persistent across controller restarts.
3. RECOVERY: On controller startup, recoverEVPNVIDs() re-reserves VIDs
in the allocator using NetworkManager's cached NetInfo (which has
already parsed all NADs). This ensures VID consistency after restarts.
4. RELEASE: When a CUDN/UDN is deleted, releaseVIDForNetwork() frees
both the MAC-VRF and IP-VRF VIDs (if allocated) back to the pool.
Design Decision - Why VID persisted in NAD spec.config over annotations/labels:
- Annotations were considered for faster recovery but rejected:
1. CNI plugin on nodes needs VID in spec.config anyway
2. Two copies (annotation + NetConf) creates sync/drift risk
3. Recovery uses NetworkManager's cache (already parsed), so no
startup parsing overhead to optimize away
- CUDN status was rejected: users have copied objects with status
populated causing conflicts; VID isn't user-facing info
Recovery Failure Handling:
- If VID recovery fails for a CUDN (e.g., NAD not in NetworkManager
cache, VID conflict), the error is logged and the CUDN is enqueued
for reconciliation - startup does NOT fail.
- This prevents DoS: a malicious/corrupted NAD cannot crash the
entire cluster-manager.
- During reconciliation, if the NAD exists with a valid VID, the
allocator's idempotency ensures the same VID is re-allocated.
Thread Safety:
- The id.Allocator uses per-key locking, making concurrent
allocations safe.
- Controllers use Threadiness:1, so reconciliations for the same
resource are serialized.
========================================================================
Signed-off-by: Matteo Dallaglio <mdallagl@redhat.com>
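The allocation, recovery and release steps of the VID lifecycle above can be sketched with a small keyed allocator in Go. This is a simplified stand-in, not the id.Allocator used by the controller: the vidAllocator type and its Allocate, Reserve and Release methods are hypothetical names, it guards state with a single mutex rather than per-key locking, and handing out the lowest free VID is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

const (
	minVID = 1
	maxVID = 4094
)

// vidAllocator is a hypothetical in-memory allocator keyed by strings such
// as "{networkName}/macvrf". It is not persistent across restarts.
type vidAllocator struct {
	mu    sync.Mutex
	byKey map[string]int
	used  map[int]string
}

func newVIDAllocator() *vidAllocator {
	return &vidAllocator{byKey: map[string]int{}, used: map[int]string{}}
}

// Allocate is idempotent: the same key always gets the same VID back.
func (a *vidAllocator) Allocate(key string) (int, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if vid, ok := a.byKey[key]; ok {
		return vid, nil
	}
	for vid := minVID; vid <= maxVID; vid++ {
		if _, taken := a.used[vid]; !taken {
			a.byKey[key], a.used[vid] = vid, key
			return vid, nil
		}
	}
	return 0, errors.New("VID range exhausted")
}

// Reserve re-registers a VID read back from persisted config (recovery).
func (a *vidAllocator) Reserve(key string, vid int) error {
	if vid < minVID || vid > maxVID {
		return fmt.Errorf("VID %d out of range", vid)
	}
	a.mu.Lock()
	defer a.mu.Unlock()
	if owner, taken := a.used[vid]; taken && owner != key {
		return fmt.Errorf("VID %d already held by %q", vid, owner)
	}
	if old, ok := a.byKey[key]; ok && old != vid {
		return fmt.Errorf("key %q already holds VID %d", key, old)
	}
	a.byKey[key], a.used[vid] = vid, key
	return nil
}

// Release returns the VID held by key, if any, to the pool.
func (a *vidAllocator) Release(key string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if vid, ok := a.byKey[key]; ok {
		delete(a.used, vid)
		delete(a.byKey, key)
	}
}

func main() {
	a := newVIDAllocator()
	// Recovery: a conflict is reported instead of aborting startup.
	if err := a.Reserve("blue/macvrf", 7); err != nil {
		fmt.Println("recovery failed, would requeue:", err)
	}
	mac, _ := a.Allocate("blue/macvrf") // same key returns the reserved VID
	ip, _ := a.Allocate("blue/ipvrf")   // new key gets a free VID
	fmt.Println(mac, ip)
	a.Release("blue/macvrf")
	a.Release("blue/ipvrf")
}
```

Returning an error from Reserve rather than panicking mirrors the recovery failure handling described above, where a conflicting or corrupted NAD leads to a log message and a requeue, not a failed startup.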
This commit adds a VTEPNotifier and VTEP validation logic for EVPN CUDNs: - Add VTEPNotifier to watch VTEP CRs and trigger CUDN reconciliation - Add validateEVPNVTEP to check VTEP existence during CUDN sync - Report VTEPNotFound status when referenced VTEP doesn't exist - Add ReconcileVTEP to handle VTEP create/delete events - Add RBAC permissions for vteps resource - Add vtepInformer creation in factory when Network Segmentation is enabled Signed-off-by: Matteo Dallaglio <mdallagl@redhat.com>
Signed-off-by: Matteo Dallaglio <mdallagl@redhat.com>
CUDN controller: EVPN configuration translation and VID allocation
|
@jluhrsen: This PR was included in a payload test run from openshift/origin#30560
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f1ba2940-11b8-11f1-9f34-ac35586a2332-0 |
|
/retest-required |
|
@arkadeepsen we need to check why ovnkubenode pods are getting stuck on this job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/2980/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack-bgp/2026476307969216512 |
This seems to be unrelated as for the earlier runs (https://prow.ci.openshift.org/pr-history/?org=openshift&repo=ovn-kubernetes&pr=2980) the cluster deployment went through fine. There were test failures one of which is across all the failed runs is |
|
/override ci/prow/e2e-aws-ovn-serial
I will override the serial job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/2980/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-serial/2026476307918884864
It's failing on unrelated things: https://issues.redhat.com/browse/OCPBUGS-61674 seems like the GCP counterpart for the API LBs issue; there is no known AWS issue so far, and I've tagged @dgoodwin for this one. I have to override this since it's not related to ovnk and it's holding up the merge, and this exact job passed just fine on previous attempts for this PR: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/2980/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-serial/2021995320514187264
As for the sig-instrumentation failure, I can't find any known bugs but it's all over the search: https://search.dptools.openshift.org/?search=when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
The previous runs https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/2980/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-serial/2026304361428160512 and https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/2980/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-serial/2026215750376624128 failed due to a known bug: https://issues.redhat.com/browse/OCPBUGS-77019 |
|
@tssurya: Overrode contexts on behalf of tssurya: ci/prow/e2e-aws-ovn-serial
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/tide refresh |
|
|
|
For both
|
|
@openshift-pr-manager[bot]: The following test failed, say
|
@tssurya the failing jobs have been failing across more than just this PR. Can we override them and merge this? |
|
/override ci/prow/e2e-aws-ovn-serial https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/2980/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-serial/2026553859781955584 |
|
/tide refresh |
|
/override ci/prow/e2e-metal-ipi-ovn-dualstack-bgp https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/2980/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack-bgp/2026553859815510016 also failing because of See https://redhat-internal.slack.com/archives/C01CQA76KMX/p1772017294188469 |
|
/tide refresh |
|
@tssurya: Overrode contexts on behalf of tssurya: ci/prow/e2e-aws-ovn-serial |
|
/override ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw See https://redhat-internal.slack.com/archives/C01CQA76KMX/p1772017294188469 |
|
/tide refresh |
|
@tssurya: Overrode contexts on behalf of tssurya: ci/prow/e2e-metal-ipi-ovn-dualstack-bgp |
|
@tssurya: Overrode contexts on behalf of tssurya: ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw |
|
/override ci/prow/e2e-metal-ipi-ovn-ipv6 |
|
@tssurya: Overrode contexts on behalf of tssurya: ci/prow/e2e-metal-ipi-ovn-ipv6 |
|
/tide refresh |
|
@openshift-pr-manager[bot]: Jira Issue Verification Checks:
- Jira Issue OCPBUGS-49368 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
- Jira Issue OCPBUGS-62895 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
- Jira Issue OCPBUGS-70130 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
- Jira Issue OCPBUGS-74164 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
- Jira Issue OCPBUGS-66267 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
- Jira Issue OCPBUGS-65114 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Fix included in accepted release 4.22.0-0.nightly-2026-03-05-070435 |
|
Fix included in accepted release 4.22.0-0.nightly-2026-03-15-203841 |
|
Fix included in accepted release 4.22.0-0.nightly-2026-03-17-033403 |
Automated merge of upstream/master → master.