OCPBUGS-58501,OCPBUGS-59657,OCPBUGS-42303,OCPBUGS-61566: DownStream Merge [09-09-2025] by martinkennelly · Pull Request #2750 · openshift/ovn-kubernetes

martinkennelly · 2025-09-09T11:03:02Z

No description provided.

sometimes a tmp pv image file used for setting up the runner gets leftover in /mnt. Add the removal of that in the "Free up disk space" step. Also move that step to a composite action to deduplicate it for easier maintenance Also added the Runner Diagnostics step as the first step that runs for every job Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>

When multiple networks support was first added, all controllers that were added used the label "Secondary" to indicate they were not "Default". When UDN was added, it allowed "Secondary" networks to function as the primary network for a pod, creating terminology confusion. We now treat non-default networks all as "User-Defined Networks". This commit changes all naming to conform to the latter. The only places secondary is used now is for distinguishing whether or not a UDN is acting as a primary or secondary network for a pod (it's role). The only exception to this is udn-isolation. I did not touch this because it relies on dbIDs, which would impact functionality for upgrade. There is no functional change in this commit. Signed-off-by: Tim Rozet <trozet@nvidia.com>

Rabbit found these during code review. Signed-off-by: Tim Rozet <trozet@nvidia.com>

fix intermittent disk space issue

Moves ACL DBIDs from "Secondary" to "PrimaryUDN". Signed-off-by: Tim Rozet <trozet@nvidia.com>

Signed-off-by: Tim Rozet <trozet@nvidia.com>

Currently, we are force exiting with the trap before the background processes can end, container is removed and the orphaned processes end early causing our config to go into an unknown state because we dont end in an orderly manner. Wait until the pid file for ovnkube controller with node is removed which shows the process has completed. Signed-off-by: Martin Kennelly <mkennell@redhat.com>

Prevent ovn-controller from sending stale GARP by adding drop flows on external bridge patch ports until ovnkube-controller synchronizes the southbound database - henceforth known as "drop flows". This addresses race conditions where ovn-controller processes outdated SB DB state before ovnkube-controller updates it, particularly affecting EIP SNAT configurations attached to logical router ports. Fixes: https://issues.redhat.com/browse/FDP-1537 ovnkube-controller controls the lifecycle of the drop flows. ovs / ovn-controller running is required to configure external bridge. Downstream, the external bridge maybe precreated and ovn-controller will use this. This fix considers three primary scenarios: node, container and pod restart. On Node restart means the ovs flows installed priotior to reboot on the node are cleared but the external bridge exists. Add the flows before ovnkube controller with node starts. The reason to add it here is that our gateway code depends on ovn-controller started and running... There is now a race here between ovn-controller starting (and garping) before we set this flow but I think the risk is low however it needs serious testing. The reason I did not naturally at the drop flows before ovn-controller started is because I have no way to detect if its a node reboot or pod reboot and i dont want to inject drop flows for simple ovn-controller container restart which could disrupt traffic. ovnkube-controller starts, we create a new gateway and apply flows the same flows in-order to ensure we always drop GARP when ovnkube controller hasn't sync. Remove the flows when ovnkube-controller has syncd. There is also a race here between ovnkube-controller removing the flows and ovn-controller GARPing with stale SB DB info. There is no easy way to detect what SB DB data ovn-controller has consumed. On Pod restart, we add the drop flows before exit. ovnkube-controller-with-node will also add it before it starts the go code. Container restart: - ovnkube-controller: adds flows upon start and exit - ovn-controller: no changes While the drop flows are set, OVN may not be able to resolve IPs it doesn't know about in its Logical Router pipelines generation. Following removal of the drop flows, OVN may resolve the IPs using GARP requests. OVN-Controller always sends out GARPs with op code 1 on startup. Signed-off-by: Martin Kennelly <mkennell@redhat.com>

Fix naming of "Secondary" to be "User-Defined"

sriovnet lib contains fixes to handle local and remote VF representors co-existing. this fix is relevant for case where ovn-kubernetes runs in DPU mode and there are VFs for both the external host and local VFs (which are used on the DPU itself) Signed-off-by: adrianc <adrianc@nvidia.com>

add new netlink Route fields: MTULock, RtoMinLock Signed-off-by: adrianc <adrianc@nvidia.com>

run go mod tidy in e2e test folder Signed-off-by: adrianc <adrianc@nvidia.com>

We configure the interface in the pods via CNI, thus it doesn't make sense to configure OVN IPAM, and use it to allocate and configure the pod IP address. Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>

chore: bump sriovnet lib

multi-homing, tests: do not use OVN provided IPAM in L3 nets

The node-encap-ips annotation should not be set when running in DPU host mode since the encap IP is not available on the DPU host. This prevents errors during node initialization when the encap IP cannot be determined. - Add conditional check for NodeModeDPUHost before setting encap IP annotation - Maintain existing behavior for non-DPU host modes - Fixes initialization errors in DPU host mode deployments Signed-off-by: Alin Serdean <aserdean@nvidia.com>

Fixes regression from 1448d5a The previous commit dropped matching on in_port so that localnet ports would also use table 1. This allows reply packets from a localnet pod towards the shared OVN/LOCAL IP to be sent to the correct port. However, a regression was introduced where traffic coming from these localnet ports to any destination would be sent to table 1. Egress traffic from the localnet ports is not committed to conntrack, so by sending to table=1 via CT we were getting a miss. This is especially bad for hardware offload where a localnet port is being used as the Geneve encap port. In this case all geneve traffic misses in CT lookup and is not offloaded. Table 1 is intended to be for handling IP traffic destined to the shared Gateway IP/MAC that both the Host and OVN use. It is also used to handle reply traffic for Egress IP. To fix this problem, we can add dl_dst match criteria to this flow, ensuring that only traffic destined to the Host/OVN goes to table 1. Furthermore, after fixing this problem there still exists the issue that localnet -> host/OVN egress traffic will still enter table 1 and CT miss. Potentially this can be fixed with always committing egress traffic, but it might have performance penalty, so deferring that fix to a later date. Signed-off-by: Tim Rozet <trozet@nvidia.com>

We did this for IPv4 in 1448d5a, but forgot about IPv6. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

Add dl_dst=$breth0 to table=0, prio=50 for IPv6 We want to match in table=1 only conntrack'ed reply traffic whose next hop is either OVN or the host. As a consequence, localnet traffic whose next hop is an external router (and that might or might not be destined to OVN/host) should bypass table=1 and just hit the NORMAL flow in table=0. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

We already tested localnet -> host, let's also cover connections initiated from the host. The localnet uses IPs in the same subnet as the host network. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

We have two non-InterConnect CI lanes for multihoming, while only one with IC enabled (and local gw). We need coverage with IC enabled for both gateway modes, so let's make an existing non-IC lane IC enabled, set it as dualstack and gateway=shared to have better coverage. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

This is needed because we will need to generate IPs from different subnets than just the host subnet. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

The localnet is on a subnet different than the host subnet, the corresponding NAD is configured with a VLAN ID, the localnet pod uses an external router to communicate to cluster pods. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

In testing we saw how an invalid conntrack state would drop all echo requests after the first one. Let's send three pings in each test then. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

martinkennelly · 2025-09-22T15:16:35Z

4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade is permafailing since aug 27th. Its reaching its timeout limit consistently.

martinkennelly · 2025-09-22T15:29:07Z

/test 4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade

..rt-upgrade job previous failures are presistent that the job is failing to deprovision the cloud resources under 1 hour: {"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 1h0m0s timeout","severity":"error","time":"2025-09-11T07:31:59Z"}

There was also what i believe is a flake for test Pre-provisioned PV (default fs)] subPath should support existing directories when readOnly specified in the volumeSource where the test failed to send data. It seems data transmission was interrupted when talking from pod (cluster network) to kapi (host network). Unknown reason.

martinkennelly · 2025-09-23T08:15:03Z

/override ci/prow/lint

https://issues.redhat.com/browse/CORENET-6207

openshift-ci · 2025-09-23T08:15:39Z

@martinkennelly: Overrode contexts on behalf of martinkennelly: ci/prow/lint

Details

In response to this:

/override ci/prow/lint

https://issues.redhat.com/browse/CORENET-6207

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2025-09-23T08:15:42Z

@martinkennelly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-fdp-qe	`95d74c1`	link	false	`/test e2e-aws-ovn-fdp-qe`
ci/prow/qe-perfscale-aws-ovn-small-udn-density-churn-l3	`7d0868e`	link	false	`/test qe-perfscale-aws-ovn-small-udn-density-churn-l3`
ci/prow/security	`7d0868e`	link	false	`/test security`
ci/prow/e2e-aws-ovn-hypershift-conformance-techpreview	`7d0868e`	link	false	`/test e2e-aws-ovn-hypershift-conformance-techpreview`
ci/prow/e2e-aws-ovn-hypershift-kubevirt	`7d0868e`	link	false	`/test e2e-aws-ovn-hypershift-kubevirt`
ci/prow/qe-perfscale-aws-ovn-small-udn-density-l3	`7d0868e`	link	false	`/test qe-perfscale-aws-ovn-small-udn-density-l3`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

martinkennelly · 2025-09-23T08:17:04Z

/override ci/prow/4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade

job 4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade tests passed but deprovisioning timed out. Ill create a bug for it.

openshift-ci · 2025-09-23T08:17:54Z

@martinkennelly: Overrode contexts on behalf of martinkennelly: ci/prow/4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade

Details

In response to this:

/override ci/prow/4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade

job 4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade tests passed but deprovisioning timed out. Ill create a bug for it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

martinkennelly · 2025-09-23T08:30:19Z

Bug for RT upgrade job: https://issues.redhat.com/browse/OCPBUGS-62093

I see ovn in the job title so i am assigning it against ovn-k team.

tssurya · 2025-09-23T08:47:58Z

Bug for RT upgrade job: https://issues.redhat.com/browse/OCPBUGS-62093

I see ovn in the job title so i am assigning it against ovn-k team.

all jobs have ovn in job title, is this really a networking issue? if there is a CI issue around provisioning resources not sure what we can do, unless I amisunderstood the bug title. Better path is to search for OCP CI search sippy for failures around this? (wearing dispatch hat, not enough info to suggest any evidence its us, I'll need to close it :P unless more triage was done)

martinkennelly · 2025-09-23T16:05:09Z

Waiting for other eng to add lgtm and qe to add the necessary label

jluhrsen · 2025-09-23T16:05:50Z

Bug for RT upgrade job: https://issues.redhat.com/browse/OCPBUGS-62093
I see ovn in the job title so i am assigning it against ovn-k team.

all jobs have ovn in job title, is this really a networking issue? if there is a CI issue around provisioning resources not sure what we can do, unless I amisunderstood the bug title. Better path is to search for OCP CI search sippy for failures around this? (wearing dispatch hat, not enough info to suggest any evidence its us, I'll need to close it :P unless more triage was done)

not our problem. it's happening in the periodics as well.

martinkennelly · 2025-09-23T16:07:37Z

@jluhrsen do you know who i can reassign the bug to?

martinkennelly · 2025-09-23T16:28:45Z

/lgtm

openshift-ci · 2025-09-23T16:28:47Z

@martinkennelly: you cannot LGTM your own PR.

Details

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

martinkennelly · 2025-09-23T16:36:33Z

/unhold

tssurya · 2025-09-23T16:42:30Z

/lgtm

on behalf of @martinkennelly

openshift-ci · 2025-09-23T16:43:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: martinkennelly, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [martinkennelly,tssurya]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jechen0648 · 2025-09-23T17:11:54Z

/verified by jechen

openshift-ci-robot · 2025-09-23T17:12:07Z

@jechen0648: This PR has been marked as verified by jechen.

Details

In response to this:

/verified by jechen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2025-09-23T17:16:48Z

@martinkennelly: Jira Issue Verification Checks: Jira Issue OCPBUGS-58501
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-58501 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Jira Issue Verification Checks: Jira Issue OCPBUGS-59657
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-59657 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Jira Issue OCPBUGS-42303: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

openshift/machine-config-operator#5123 is open

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-42303 has not been moved to the MODIFIED state.

This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.

Jira Issue OCPBUGS-61566: Some pull requests linked via external trackers have merged:

openshift/ovn-kubernetes#2750

The following pull request, linked via external tracker, has not merged:

openshift/ovn-kubernetes#2759 is open

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-61566 has not been moved to the MODIFIED state.

This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jluhrsen · 2025-09-23T21:28:29Z

@jluhrsen do you know who i can reassign the bug to?

no, sorry. someone would have to look deeper in to the logs to figure out what is actually causing the slow down in the jobs and assign to whatever component they figure out.

openshift-merge-robot · 2025-09-25T14:39:38Z

Fix included in accepted release 4.21.0-0.nightly-2025-09-25-082813

jluhrsen and others added 30 commits August 25, 2025 14:11

A couple of minor fixes

a393d95

Rabbit found these during code review. Signed-off-by: Tim Rozet <trozet@nvidia.com>

Merge pull request #5392 from jluhrsen/need-more-diags

db5f3e4

fix intermittent disk space issue

Updates UDN Isolation DBIDs

a44f6c1

Moves ACL DBIDs from "Secondary" to "PrimaryUDN". Signed-off-by: Tim Rozet <trozet@nvidia.com>

Update docs for UDN

46fa330

Signed-off-by: Tim Rozet <trozet@nvidia.com>

Merge pull request #5512 from trozet/stop_secondary_madness

56d14a3

Fix naming of "Secondary" to be "User-Defined"

fix: routemanager unit tests

320b2fa

add new netlink Route fields: MTULock, RtoMinLock Signed-off-by: adrianc <adrianc@nvidia.com>

fix: run go mod tidy e2e tests

9633bdf

run go mod tidy in e2e test folder Signed-off-by: adrianc <adrianc@nvidia.com>

multi-homing, tests: do not use OVN provided IPAM in L3 nets

c0c1b26

We configure the interface in the pods via CNI, thus it doesn't make sense to configure OVN IPAM, and use it to allocate and configure the pod IP address. Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>

Merge pull request #5508 from adrianchiris/bump-sriovnet

6d12ab9

chore: bump sriovnet lib

Merge pull request #4885 from maiqueb/remove-secondary-l3-ovn-ipam

380c234

multi-homing, tests: do not use OVN provided IPAM in L3 nets

Openflow: drop in_port from IPv6 dispatch OF rule at prio=50

66d8f14

We did this for IPv4 in 1448d5a, but forgot about IPv6. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

E2E localnet: remove double import of ginkgo

4ce92a9

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

E2E: add test host -> localnet with IP in host subnet

a5029f8

We already tested localnet -> host, let's also cover connections initiated from the host. The localnet uses IPs in the same subnet as the host network. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

E2E localnet: remove references to downstream bugs and stories

6de44ef

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

E2E localnet: specify that the localnet uses IPs from host subnet

c4cc25a

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

E2E localnet: make IP request for localnet pod extensible

eb5f3c1

This is needed because we will need to generate IPs from different subnets than just the host subnet. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

E2E localnet: Fix requirement on number of schedulable nodes

f82e101

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

E2E localnet: host network -> localnet on VLAN with external router

51eae7a

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

E2E localnet: localnet -> host network on VLAN with external router

dea42b4

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

E2E localnet: send three pings instead of just one

b004ed0

In testing we saw how an invalid conntrack state would drop all echo requests after the first one. Let's send three pings in each test then. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2025

openshift-ci bot assigned tssurya Sep 23, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 23, 2025

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Sep 23, 2025

openshift-merge-bot bot merged commit e712193 into openshift:master Sep 23, 2025
44 of 49 checks passed

tsorya mentioned this pull request Sep 24, 2025

[release-4.20] OCPBUGS-61239: Skip node-encap-ips annotation in DPU host mode #2763

Merged

5 tasks

This was referenced Oct 2, 2025

OCPBUGS-61455: [4.18] allow default network -> localnet on the same node for any localnet subnet #2773

Closed

OCPBUGS-61455,OCPBUGS-59744: [4.18] allow default network -> localnet on the same node for any localnet subnet #2785

Merged

Comments

Conversation

martinkennelly commented Sep 9, 2025

Uh oh!

martinkennelly commented Sep 22, 2025

Uh oh!

martinkennelly commented Sep 22, 2025

Uh oh!

martinkennelly commented Sep 23, 2025

Uh oh!

openshift-ci bot commented Sep 23, 2025

Uh oh!

openshift-ci bot commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martinkennelly commented Sep 23, 2025

Uh oh!

openshift-ci bot commented Sep 23, 2025

Uh oh!

martinkennelly commented Sep 23, 2025

Uh oh!

tssurya commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martinkennelly commented Sep 23, 2025

Uh oh!

jluhrsen commented Sep 23, 2025

Uh oh!

martinkennelly commented Sep 23, 2025

Uh oh!

martinkennelly commented Sep 23, 2025

Uh oh!

openshift-ci bot commented Sep 23, 2025

Uh oh!

martinkennelly commented Sep 23, 2025

Uh oh!

tssurya commented Sep 23, 2025

Uh oh!

openshift-ci bot commented Sep 23, 2025

Uh oh!

jechen0648 commented Sep 23, 2025

Uh oh!

openshift-ci-robot commented Sep 23, 2025

Uh oh!

Uh oh!

openshift-ci-robot commented Sep 23, 2025

Uh oh!

jluhrsen commented Sep 23, 2025

Uh oh!

openshift-merge-robot commented Sep 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

openshift-ci bot commented Sep 23, 2025 •

edited

Loading

tssurya commented Sep 23, 2025 •

edited

Loading