
NO-JIRA: DownStream Merge [02-27-2026] #3010

Closed
openshift-pr-manager[bot] wants to merge 90 commits into master from d/s-merge-02-27-2026

Conversation

@openshift-pr-manager

Automated merge of upstream/master → master.

Note: This PR includes an automated sync of test annotations with upstream test changes (go mod vendor + update-tests-annotation.sh).

danwinship and others added 30 commits January 19, 2026 15:18
Signed-off-by: Dan Winship <danwinship@redhat.com>
Ignore whitespace differences.
Sort the output back into the "correct" order.

Signed-off-by: Dan Winship <danwinship@redhat.com>
Replace the custom HTTP server in StartMetricsServer with MetricServer.

Signed-off-by: Lei Huang <leih@nvidia.com>
A DPU firmware settings change can cause the same physical
port to be re-enumerated under a different PCI address after
a host reboot. Previously, Init() only handled missing device
IDs (legacy annotations). Now it also detects when the
annotated device ID is no longer present in the allocator and
falls back to matching by PfId and FuncId.

Signed-off-by: Yury Kulazhenkov <ykulazhenkov@nvidia.com>
Signed-off-by: Tim Rozet <trozet@nvidia.com>
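For illustration, a minimal Go sketch of the fallback described in this commit; the DeviceInfo shape, the field names (PfId, FuncId), and the allocator map are assumptions made for the sketch, not the actual mgmt-port allocator types.

```
package main

import "fmt"

// Illustrative device record: the field names (PciAddr, PfId, FuncId) and the
// allocator map are assumptions for this sketch, not the real mgmt-port types.
type DeviceInfo struct {
	PciAddr string
	PfId    int
	FuncId  int
}

// resolveDevice first tries the PCI address stored in the node annotation; if
// that address is gone (re-enumerated after a firmware change and reboot), it
// falls back to matching the same physical port by PfId and FuncId.
func resolveDevice(annotatedPci string, pfID, funcID int, allocator map[string]DeviceInfo) (DeviceInfo, error) {
	if dev, ok := allocator[annotatedPci]; ok {
		return dev, nil // annotated device ID is still present
	}
	for _, dev := range allocator {
		if dev.PfId == pfID && dev.FuncId == funcID {
			return dev, nil // same physical port under a new PCI address
		}
	}
	return DeviceInfo{}, fmt.Errorf("no device matches pf %d func %d", pfID, funcID)
}

func main() {
	allocator := map[string]DeviceInfo{
		"0000:04:00.0": {PciAddr: "0000:04:00.0", PfId: 0, FuncId: 0},
	}
	dev, err := resolveDevice("0000:03:00.0", 0, 0, allocator) // stale annotation
	fmt.Println(dev, err)
}
```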
`kind setup` starts to fail with error:
```
ERROR: failed to load image: command "docker exec --privileged -i ovn-worker
   ctr --namespace=k8s.io images import --all-platforms --digests
   --snapshotter=overlayfs -" failed with error: exit status 1

Command Output: ctr: content digest sha256:9c04829e9...: not found
```

Related kind issue is kubernetes-sigs/kind#3795.
This change uses the workaround mentioned in the kind issue.

Signed-off-by: Lei Huang <leih@nvidia.com>
fix kind load docker-image content digest not found
Signed-off-by: fangyuchen86 <fangyuchen86@gmail.com>
Signed-off-by: Patryk Diak <pdiak@redhat.com>
The node gateway logic was not taking into account dynamic UDN. Therefore, if a UDN was created with a service but our node was not active, then at startup during syncServices we would fail due to GetActiveNetworkForNamespace failing. After 60 seconds of syncServices failing, it would lead to the OVN-Kube node crashing.

This commit introduces a common helper function in the network manager API, ResolveActiveNetworkForNamespaceOnNode, which allows legacy controllers that are neither per-UDN nor the default controller to find the primary network serving a namespace on their node.

The node/gateway is updated to use this function during sync, allowing us to ignore objects whose network is not on our node with Dynamic UDN.

Additionally it does not fail syncServices when a network is not found.
During NAD controller start up, all networks will have been processed.
If by the time gateway starts up and the network is missing, that means
it is a new event which this node has never seen before. Therefore it is
safe to skip it during syncServices and allow initial add handling to
take care of it later.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
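A hedged sketch of the syncServices behavior described above, assuming a resolver in the shape of ResolveActiveNetworkForNamespaceOnNode; the types and the sentinel error here are illustrative, not the real network manager API.

```
package main

import (
	"errors"
	"fmt"
)

// errNetworkNotFound stands in for the "network not active on this node yet"
// case; the real helper and error types differ from this sketch.
var errNetworkNotFound = errors.New("no active network for namespace on node")

type service struct{ namespace, name string }

// syncServices skips services whose namespace has no active network on this
// node yet: a later add event will handle them once the NAD controller has
// processed the network, so skipping here is safe.
func syncServices(svcs []service, resolve func(ns string) (string, error)) error {
	for _, svc := range svcs {
		netName, err := resolve(svc.namespace)
		if errors.Is(err, errNetworkNotFound) {
			fmt.Printf("skipping %s/%s: network not yet active on this node\n", svc.namespace, svc.name)
			continue
		}
		if err != nil {
			return err // real errors still fail the sync
		}
		fmt.Printf("syncing %s/%s on network %s\n", svc.namespace, svc.name, netName)
	}
	return nil
}

func main() {
	resolve := func(ns string) (string, error) {
		if ns == "tenant-blue" {
			return "", errNetworkNotFound
		}
		return "default", nil
	}
	_ = syncServices([]service{{"tenant-blue", "web"}, {"default", "kubernetes"}}, resolve)
}
```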
Network Policy add was not taking into account dynamic UDN. This was not
a problem for the layer2/layer3 UDN controller side, because if the node
was inactive, then the controllers wouldn't exist. However, it was a
problem for the default network controller, because if the DNC could not
get the active network, it would error and retry to add the KNP over and
over again for other UDNs.

This fixes it by checking the nad controller cache instead, which will
always have the full info to determine if the KNP belongs to CDN.

Furthermore, the delete KNP path was incorrect. It would try to get the
active network which could be gone during deletion. This was unnecessary
as the deleteNetworkPolicy code will check to see if it actually
configured it in the first place, making it a noop to always call
delete.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
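For illustration only, a toy version of the cache check described above; the map standing in for the nad controller cache and the helper name are hypothetical.

```
package main

import "fmt"

// primaryNetworks stands in for the nad controller cache mapping a namespace
// to its primary UDN; the real cache API differs from this sketch.
var primaryNetworks = map[string]string{
	"tenant-blue": "tenant-blue-net",
}

// belongsToDefaultNetwork: the default network controller only handles a KNP
// if the namespace has no primary UDN recorded in the cache.
func belongsToDefaultNetwork(namespace string) bool {
	_, hasUDN := primaryNetworks[namespace]
	return !hasUDN
}

func main() {
	fmt.Println(belongsToDefaultNetwork("tenant-blue")) // false: served by a UDN
	fmt.Println(belongsToDefaultNetwork("default"))     // true: CDN handles it
}
```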
Needed to be updated for the same reasons as network policy. Services
controller is per UDN, and with an inactive node this is not a problem
for UDN controllers as they will not exist. However, for DNC it would
continue failing to get active network here. Use the nad controller
cache and shortcut the checks for default network controller.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
Should always just return default network in that case.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
EF calls GetActiveNetworkForNamespace in an initialSync migration function. This function moves from a cluster port group to namespace port groups. It is old code that could arguably just be removed, but for now it is moved to use the nad controller cache. Also, do not cause OVNK to exit if we cannot get the network name; just skip that entity.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
Egress IP controller runs as part of DNC, is event driven, and retries
on failures. It is also not dynamic UDN aware. This commit aims to fix
this by:

 - Change EgressIP to check with nad controller for network presence
 - If network is not processed/invalid skip retrying in egress IP
   controller
 - Register NAD Reconciler for Egress IP, so that when network becomes
   active Egress IP handles reconciliation.
 - If dynamic UDN is enabled, filter out EgressIP operations for
   inactive nodes.

Overall this should be a quality of life improvement to EgressIP and reduce unnecessary reconciliation with UDN. Future steps will be to break Egress IP into its own level-driven controller.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
Adds a test that creates a primary + secondary UDN, pod, egress IP, KNP,
MNP objects in those UDNs. Then restarts every ovnkube-pod, and ensures
it comes back up in ready state. This is useful in general to make sure
we survive restarts correctly, but especially useful for Dynamic UDN
where a network may not be active on a node and we want to ensure start
up syncing is not failing because of that.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
When a pod is recreated with the same name, the egressIP cache could already
contain a “served” {EgressIP,Node} status and skip programming as a no-op.
Since statusMap keys do not include pod IP, LRP/NAT state could remain stale
and traffic would miss egressIP SNAT.

Fix by detecting pod IP drift from podAssignment.podIPs and forcing a
delete+add reprogram for already-applied statuses:
 - compare cached pod IPs to current pod IPs
 - queue existing statuses for reprogram on IP change
 - delete old assignment state (without standby promotion) and re-add it
 - then update cached pod IPs

Signed-off-by: Tim Rozet <trozet@nvidia.com>
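A minimal sketch of the IP-drift detection described above; the key-building helper and its use are illustrative, not the actual statusMap/podAssignment code.

```
package main

import (
	"fmt"
	"net"
	"sort"
	"strings"
)

// ipsKey builds an order-independent key from a set of pod IPs so a recreated
// pod with the same name but new IPs is detected as drift.
func ipsKey(ips []net.IP) string {
	strs := make([]string, 0, len(ips))
	for _, ip := range ips {
		strs = append(strs, ip.String())
	}
	sort.Strings(strs)
	return strings.Join(strs, ",")
}

// needsReprogram reports whether already-applied egress IP statuses must be
// deleted and re-added because the pod's IPs changed underneath the cache.
func needsReprogram(cached, current []net.IP) bool {
	return ipsKey(cached) != ipsKey(current)
}

func main() {
	cached := []net.IP{net.ParseIP("10.244.1.5")}
	current := []net.IP{net.ParseIP("10.244.2.9")} // pod recreated with a new IP
	fmt.Println(needsReprogram(cached, current))   // true: queue delete+add
}
```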
EgressIP pod handling assumes pod networking setup has already populated
logicalPortCache before egressIP reconciliation runs. That ordering holds
within one controller queue, but breaks for primary UDNs where pod setup
runs in UDN controllers while egressIP pod reconcile runs in the default
controller.

In that cross-controller race, egressIP reconcile can run first, fail to get
pod IPs (stale/missing LSP), and wait for normal retry cadence even after UDN
later updates port cache.

Fix by wiring an immediate egressIP pod retry on logicalPortCache add:
- add a base controller callback hook for logicalPortCache add events
- invoke it from default/UDN pod logical port add paths
- hook it for primary UDN controllers to enqueue no-backoff egressIP pod retry
- centralize retry logic in eIPController.addEgressIPPodRetry()
  (including PodNeedsSNAT filtering)

This preserves existing behavior while removing the UDN/DNC ordering race
window for egressIP pod programming.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
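A small sketch of the callback-hook pattern this commit describes, with a toy portCache type; the real logicalPortCache and retry framework have different signatures.

```
package main

import "fmt"

// portCache is a toy stand-in for logicalPortCache; the real type and the
// base controller callback hook look different.
type portCache struct {
	onAdd func(podKey string) // invoked after a logical port is added
}

func (c *portCache) add(podKey string) {
	fmt.Println("logical port added for", podKey)
	if c.onAdd != nil {
		c.onAdd(podKey) // immediate retry instead of waiting for normal backoff
	}
}

func main() {
	retryQueue := make(chan string, 1)
	cache := &portCache{onAdd: func(podKey string) {
		retryQueue <- podKey // UDN pod setup nudges the default-controller egressIP retry
	}}
	cache.add("tenant-blue/web-0")
	fmt.Println("egressIP pod retry enqueued:", <-retryQueue)
}
```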
Removes UnprocessedActiveNetwork Error, and moves to just using a single
error, InvalidPrimaryNetworkError for everything. Modifies
GetActiveNetworkForNamespace to return nil when there is no active
network due to namespace being removed, or Dynamic UDN filtering.
Callers can then rely on this function to determine whether or not a
network is active versus the network should exist but doesn't (an
error).

Walked through all callers of GetActiveNetworkForNamespace and
GetPrimaryNADForNamespace and tried to simplify the number of calls and
the logic.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
- Removes a second call to GetActiveNetworkForNamespace during egress
  firewall add. We can just use the cache object that already exists.

- Restructure the cache object to be a slice of subnets, rather than a
  string key.

- Fix util function CopyIPNets, which was not doing a deep copy of
  the underlying IP/Mask slices.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
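A sketch of what a correct deep copy looks like for the CopyIPNets fix mentioned above; the function name and signature here are illustrative and may not match the real util function exactly.

```
package main

import (
	"fmt"
	"net"
)

// copyIPNets returns a deep copy: the IP and Mask byte slices are cloned, not
// just the containing structs, so mutating the copy cannot corrupt the
// original slices.
func copyIPNets(nets []*net.IPNet) []*net.IPNet {
	out := make([]*net.IPNet, len(nets))
	for i, n := range nets {
		out[i] = &net.IPNet{
			IP:   append(net.IP(nil), n.IP...),
			Mask: append(net.IPMask(nil), n.Mask...),
		}
	}
	return out
}

func main() {
	_, orig, _ := net.ParseCIDR("10.128.0.0/14")
	cp := copyIPNets([]*net.IPNet{orig})
	cp[0].IP[0] = 192 // does not touch the original's underlying bytes
	fmt.Println(orig, cp[0])
}
```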
Code was modifying the annotations of the informer cache node object. If
this was happening while another goroutine was reading the annotation
map, it would trigger ovnkube to crash!

Fixes: #5950

Signed-off-by: Tim Rozet <trozet@nvidia.com>
Signed-off-by: Tim Rozet <trozet@nvidia.com>
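A minimal sketch of the pattern behind the informer-mutation fix above, assuming the standard k8s.io/api types: mutate a DeepCopy of the cached object, never the shared informer copy.

```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// setAnnotation works on a DeepCopy of the informer object; mutating the
// shared cache object while another goroutine reads its annotation map is a
// data race and can crash the process.
func setAnnotation(cached *corev1.Node, key, value string) *corev1.Node {
	node := cached.DeepCopy()
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[key] = value
	return node // the copy is what gets sent to the API server
}

func main() {
	cached := &corev1.Node{}
	updated := setAnnotation(cached, "k8s.ovn.org/example", "value")
	fmt.Println(len(cached.Annotations), len(updated.Annotations)) // 0 1
}
```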
Gateway egress IP adds IPs to an annotation on the node. The code was
assuming the informer object should have the latest data, then
overwriting the IPs using that information. That isn't reliable as the
informer could have stale data compared to recent kubeclient updates.
This would trigger egress IP logic to corrupt the IPs in the node
annotation, and cause further drift/corruption in subsequent updates.

This fixes it by creating a local cache of IPs for the controller, and
using that as the source of truth, initialized on start up from the node
object. Then updates are driven by what is in the cache, versus what is
in the informer.

Also fixes places where tests should have been using Eventually.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
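For illustration, a hedged sketch of a controller-local cache used as the source of truth, as described above; the type and methods are invented for the sketch and are not the actual gateway egress IP code.

```
package main

import (
	"fmt"
	"sort"
	"sync"
)

// egressIPCache is seeded once from the node annotation at startup and then
// driven purely by add/delete events, never re-read from the (possibly stale)
// informer copy of the node.
type egressIPCache struct {
	mu  sync.Mutex
	ips map[string]struct{}
}

func newEgressIPCache(fromAnnotation []string) *egressIPCache {
	c := &egressIPCache{ips: map[string]struct{}{}}
	for _, ip := range fromAnnotation {
		c.ips[ip] = struct{}{}
	}
	return c
}

func (c *egressIPCache) add(ip string)    { c.mu.Lock(); c.ips[ip] = struct{}{}; c.mu.Unlock() }
func (c *egressIPCache) delete(ip string) { c.mu.Lock(); delete(c.ips, ip); c.mu.Unlock() }

// snapshot is what gets written back to the node annotation.
func (c *egressIPCache) snapshot() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]string, 0, len(c.ips))
	for ip := range c.ips {
		out = append(out, ip)
	}
	sort.Strings(out)
	return out
}

func main() {
	cache := newEgressIPCache([]string{"10.0.0.10"}) // initial annotation contents
	cache.add("10.0.0.11")
	cache.delete("10.0.0.10")
	fmt.Println(cache.snapshot()) // [10.0.0.11]
}
```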
Signed-off-by: fangyuchen86 <fangyuchen86@gmail.com>
EgressIP: Fix crash from mutating node informer object
Fixes missing Dynamic UDN integration and incorrect logic with GetActiveNetworkForNamespace, adds EgressIP NAD Reconciler
In Egress IP tracker when GetPrimaryNADForNamespace returns an
InvalidPrimaryNetworkError we return nil during the sync, as we expect
the NAD controller to deliver the event later when the NAD is processed.

However, in this UT there is no full NAD controller and it relies on the
lister. Therefore the UT may run before the informer cache is populated
and never get notified from the "NAD Controller". To fix it, wait until
the informer cache is populated and then simulate the NAD Controller
behavior by Reconciling the NAD key.

Fixes: #5953

Signed-off-by: Tim Rozet <trozet@nvidia.com>
Skip namespaces with deletionTimestamp set when selecting target
namespaces, which triggers NAD deletion for terminating namespaces.

Signed-off-by: Patryk Diak <pdiak@redhat.com>
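A small sketch of the selection filter described above, using the standard Kubernetes API types; the surrounding selector logic is illustrative only.

```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// selectTargetNamespaces skips namespaces that are already terminating, so no
// NAD is (re)created in a namespace that is about to disappear.
func selectTargetNamespaces(namespaces []*corev1.Namespace) []*corev1.Namespace {
	var out []*corev1.Namespace
	for _, ns := range namespaces {
		if ns.DeletionTimestamp != nil {
			continue // terminating: NAD cleanup is handled instead of selection
		}
		out = append(out, ns)
	}
	return out
}

func main() {
	now := metav1.Now()
	live := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: "blue"}}
	// DeletionTimestamp is set here only to illustrate; in a cluster the API
	// server sets it when the namespace is deleted.
	terminating := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: "red", DeletionTimestamp: &now}}
	for _, ns := range selectTargetNamespaces([]*corev1.Namespace{live, terminating}) {
		fmt.Println(ns.Name) // blue
	}
}
```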
trozet and others added 17 commits February 24, 2026 15:16
In UDNs, goroutines are started for some controllers like NetworkQoS, where a waitgroup is used to add a reference to the goroutine, and stopChan is passed as the mechanism to shut down the NetworkQoS controller.

If the UDN controller starts and then shuts down very quickly, the stopChan is closed and reset to nil. It is set to nil as a pattern we use to guard against multiple Stop calls to the UDN controller (Stop may be called multiple times). However, if the NetworkQoS goroutine does not finish starting before the stopChan is closed and reset to nil, then by the time NetworkQoS gets to read stopChan, it will hang forever, causing the UDN controller waitgroup to wait forever.

This will deadlock the entire network manager, preventing it from
starting or stopping any more UDN controllers!

We can see this behavior in CI here:
I0223 04:37:24.677192      77 network_controller.go:415] [zone-nad-controller network controller]: sync network wpnhc_tenant-blue
I0223 04:37:24.677203      77 localnet_user_defined_network_controller.go:311] Stoping controller for UDN wpnhc_tenant-blue
I0223 04:37:24.677209      77 base_secondary_layer2_network_controller.go:39] Stop secondary localnet network controller of network wpnhc_tenant-blue
I0223 04:37:24.677241      77 obj_retry.go:473] Stop channel got triggered: will stop retrying failed objects of type *v1.Namespace
I0223 04:37:24.677250      77 network_qos_controller.go:215] Starting controller wpnhc_tenant-blue-network-controller
I0223 04:37:24.677256      77 network_qos_controller.go:218] Waiting for informer caches (networkqos,namespace,pod,node) to sync
I0223 04:37:24.677263      77 obj_retry.go:473] Stop channel got triggered: will stop retrying failed objects of type *v1beta2.MultiNetworkPolicy
I0223 04:37:24.677270      77 shared_informer.go:349] "Waiting for caches to sync" controller="wpnhc_tenant-blue-network-controller"
I0223 04:37:24.677339      77 shared_informer.go:356] "Caches are synced" controller="wpnhc_tenant-blue-network-controller"

There is never a "finished syncing network wpnhc_tenant-blue" log for
zone-nad-controller after this, nor for any other network later in the
log. However, there are logs for node-nad-controller, as it did not hit
this race.

To fix this, pass a copy of the oc.stopChan to the goroutines. Channels
are copied as a reference so closing the oc.stopChan still closes the
copy, and we can still allow oc.stopChan to be set to nil as a Stop
guard.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
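A minimal Go sketch of the fix: capture a copy of the stop channel before launching the goroutine, so a later Stop() that nils the field cannot leave the worker blocked on a nil channel. The controller type here is a toy, not the actual UDN controller.

```
package main

import (
	"fmt"
	"sync"
	"time"
)

type controller struct {
	mu       sync.Mutex
	stopChan chan struct{}
	wg       sync.WaitGroup
}

// startWorker copies oc.stopChan into a local variable before launching the
// goroutine. The copy still observes close(oc.stopChan), but is unaffected by
// Stop later setting oc.stopChan to nil, so the worker can never hang reading
// a nil channel and wedge the WaitGroup.
func (oc *controller) startWorker() {
	oc.mu.Lock()
	stop := oc.stopChan // channels are references: closing the original closes this too
	oc.mu.Unlock()
	oc.wg.Add(1)
	go func() {
		defer oc.wg.Done()
		<-stop // would block forever if it read oc.stopChan after it was nil'ed
		fmt.Println("worker stopped")
	}()
}

func (oc *controller) stop() {
	oc.mu.Lock()
	if oc.stopChan != nil {
		close(oc.stopChan)
		oc.stopChan = nil // guard against double Stop
	}
	oc.mu.Unlock()
	oc.wg.Wait()
}

func main() {
	oc := &controller{stopChan: make(chan struct{})}
	oc.startWorker()
	time.Sleep(10 * time.Millisecond)
	oc.stop() // returns: the worker reads the captured copy, not the nil'ed field
}
```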
Fixes:
- #6014
- ovn-kubernetes/ovn-kubernetes#6014

Signed-off-by: Andrés Hernández <tonejito@comunidad.unam.mx>
Fix UDN network controller deadlock due to stopChan nil race
Signed-off-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>
Signed-off-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>
Documentation: UserDefinedNetwork Markdown does not render properly
Allow ICMP and ICMPv6 regardless of network policy
From OKEP #5674

Signed-off-by: Tim Rozet <trozet@nvidia.com>
Handle stale PCI address in mgmt port Init for DPU Host case
(B)ANP conformance: update framework to use retries
The GC CNI command is just a no-op handler, like CHECK.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
Implements DPU Health Check + adds CNI version 1.1.0 support
- go mod vendor
- ./openshift/hack/update-tests-annotation.sh

Automated sync after downstream merge to keep test annotations
in sync with upstream test modifications and rules.go changes.
@openshift-pr-manager
Author

/ok-to-test
/payload 4.22 ci blocking
/payload 4.22 nightly blocking

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 27, 2026
@openshift-ci-robot
Contributor

@openshift-pr-manager[bot]: This pull request explicitly references no jira issue.

Details

In response to this:

Automated merge of upstream/master → master.

Note: This PR includes an automated sync of test annotations with upstream test changes (go mod vendor + update-tests-annotation.sh).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented Feb 27, 2026

@openshift-pr-manager[bot]: trigger 5 job(s) of type blocking for the ci release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aks
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/963de9c0-1399-11f1-92c4-4ba34912ba57-0

trigger 14 job(s) of type blocking for the nightly release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-2of2
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips-no-nat-instance
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv4
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/963de9c0-1399-11f1-92c4-4ba34912ba57-1

@openshift-ci openshift-ci Bot requested review from jcaamano and tssurya February 27, 2026 05:03
@openshift-ci
Contributor

openshift-ci Bot commented Feb 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: openshift-pr-manager[bot]
Once this PR has been reviewed and has the lgtm label, please assign tssurya for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Feb 27, 2026
@jluhrsen
Contributor

/test lint

@jluhrsen
Contributor

/close
in favor of #3011

@openshift-ci openshift-ci Bot closed this Feb 27, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Feb 27, 2026

@jluhrsen: Closed this PR.

Details

In response to this:

/close
in favor of #3011

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
