OCPBUGS-66428, CORENET-6055, OCPBUGS-66360, OCPBUGS-48710, OCPBUGS-66381: Branch Sync release-4.19 to release-4.18 [12-03-2025] #2895
Conversation
OVN-Kubernetes is always lagging behind on the version of OVN it pins. This causes a lot of trouble with keeping up with bug fixes, and especially CVE fixes, on older branches, resulting in scanners flagging this image with poor security grades and much longer times for bug fixes to be delivered to customers, as the PR backporting process can take weeks or even months. Remove the pin, so every time a new build is released in FDP, it automatically gets into the versions of OpenShift that use it. There is a pre-release testing process in place between FDP and OCP QE that ensures the required test coverage before the new build is released through FDP. Keep the OKD versions separate, since sometimes new major versions are not released at the same time in FDP/RHEL and CentOS, so we may need them to differ at some point in time. Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When multiple networks support was first added, all controllers that were added used the label "Secondary" to indicate they were not "Default". When UDN was added, it allowed "Secondary" networks to function as the primary network for a pod, creating terminology confusion. We now treat all non-default networks as "User-Defined Networks". This commit changes all naming to conform to the latter. The only place "secondary" is used now is for distinguishing whether a UDN is acting as a primary or secondary network for a pod (its role). The only exception to this is udn-isolation. I did not touch this because it relies on dbIDs, which would impact functionality for upgrade. There is no functional change in this commit. Signed-off-by: Tim Rozet <trozet@nvidia.com> (cherry picked from commit bbca874)
The k8s e2e utility functions AddOrUpdateLabelOnNode/RemoveLabelOffNode don't work for labels without a value. The incorrect handling of these labels caused a different sequence of nodes when migrating than what the tests intended to test. Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com> (cherry picked from commit 434b48f)
There are two circumstances in which IPs were being released incorrectly:
* when a live migratable pod completed with no migration ongoing, it was not being released, because IsMigratedSourcePodStale outright assumed a completed pod was stale.
* when a live migratable pod completed on a different node than the VM's original one as part of a migration, it was being released when it shouldn't have been; we were simply not checking whether it was a migration.
It also improves the tests to check for IP release. Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com> (cherry picked from commit 4c34982)
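For clarity, here is a minimal Go sketch of the release rule described in the commit message above. The function and parameter names (shouldReleaseIPs, migrationInProgress, onOriginalVMNode) are hypothetical stand-ins, not the actual ovn-kubernetes kubevirt helpers.

```go
package sketch

// shouldReleaseIPs sketches the corrected rule: IPs of a completed
// live-migratable pod are released, unless the pod completed on a node other
// than the VM's original one as part of an ongoing migration, in which case
// the VM (via the migration's surviving pod) still owns those IPs.
func shouldReleaseIPs(podCompleted, liveMigratable, migrationInProgress, onOriginalVMNode bool) bool {
	if !podCompleted {
		return false // running pods keep their IPs
	}
	if !liveMigratable {
		return true // ordinary completed pods always release
	}
	if migrationInProgress && !onOriginalVMNode {
		return false // completed pod of an in-flight migration on a non-original node: keep the IPs
	}
	return true // completed with no migration ongoing: release
}
```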
Don't attempt to release IPs that are not managed by the local zone, which can happen with live migratable pods; otherwise we would get distracting error logs on release. Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com> (cherry picked from commit 7a155cc)
ConditionalIPRelease would always return false when checking IPs not tracked in the local zone, so in that case we were not correctly checking for colliding pods. This was hidden by the fact that, until a very recent fix, IsMigratedSourcePodStale was used just before instead of AllVMPodsAreCompleted, and that would always return false for a completed live migratable pod. Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com> (cherry picked from commit 0dc8f27)
Or completion of a failed target pod Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com> (cherry picked from commit c1b02b5)
As it is the most complex scenario and a superset of testing without it Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com> (cherry picked from commit ef92f78)
I accidentally removed the check in a recent PR [1], which could have performance consequences, as checking against other pods has a cost. Reintroduce the check with a hopefully useful comment to prevent it from happening again. [1] ovn-kubernetes/ovn-kubernetes#5626 Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com> (cherry picked from commit 76f6439)
When processing pods during an EgressIP status update, the controller used to stop iterating as soon as it encountered a pod in Pending state (in my case, pod IPs are not found when a pod is in Pending state with a ContainerCreating status). This caused any subsequent Running pods to be skipped, leaving their SNAT entries unprogrammed on the egress node. With this change, only Pending pods are skipped, while iteration continues for the rest. This ensures that Running pods are properly processed and their SNAT entries are programmed. This change also skips pods that are unscheduled or use host networking. Signed-off-by: Periyasamy Palanisamy <pepalani@redhat.com> (cherry picked from commit 2afbaf6)
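As a rough illustration of the loop change described above, here is a hedged Go sketch; the function name syncEgressIPPods and the addSNAT callback are made up for this example and are not the real controller code.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// syncEgressIPPods illustrates the iteration change: pods that cannot be
// processed yet are skipped with "continue" instead of aborting the whole
// loop, so later Running pods still get their SNAT entries programmed.
func syncEgressIPPods(pods []*corev1.Pod, addSNAT func(*corev1.Pod) error) error {
	for _, pod := range pods {
		if pod.Spec.NodeName == "" || pod.Spec.HostNetwork {
			continue // unscheduled or host-networked pods never get EgressIP SNATs
		}
		if pod.Status.Phase == corev1.PodPending {
			// Previously the processing stopped at the first Pending pod,
			// leaving every subsequent Running pod unprogrammed.
			continue
		}
		if err := addSNAT(pod); err != nil {
			return err
		}
	}
	return nil
}
```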
…d_4.20 [release-4.20] OCPBUGS-63631: Skip Pending pods in EgressIP status updates
…rry-pick-2721-to-release-4.20 OCPBUGS-63577: [release-4.20] CORENET-6055: Dockerfile: Unpin OVN and consume the latest from FDP.
…rry-pick-2831-to-release-4.19 [release-4.19] OCPBUGS-63660: Skip Pending pods in EgressIP status updates
Configure IP/VRF rules only in full/dpu-host mode, and configure OpenFlow rules only in full/dpu mode. Signed-off-by: Yun Zhou <yunz@nvidia.com> (cherry picked from commit a996442)
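A small, hypothetical Go sketch of that gating might look as follows; the mode strings come from the commit message, while configureNodeRules and the two callbacks are illustrative only.

```go
package sketch

// configureNodeRules sketches the mode gating described above.
func configureNodeRules(mode string, configureIPVRFRules, configureOpenFlowRules func() error) error {
	// IP/VRF (kernel) rules belong to the host side: full and dpu-host modes.
	if mode == "full" || mode == "dpu-host" {
		if err := configureIPVRFRules(); err != nil {
			return err
		}
	}
	// OpenFlow rules belong to the side that owns the OVS bridge: full and dpu modes.
	if mode == "full" || mode == "dpu" {
		if err := configureOpenFlowRules(); err != nil {
			return err
		}
	}
	return nil
}
```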
Signed-off-by: Yun Zhou <yunz@nvidia.com> (cherry picked from commit 60404e5)
OCPBUGS-63007: kubevirt: fix bad release of IPs of live migratable pods
…n are updated Signed-off-by: arkadeepsen <arsen@redhat.com>
…tations are updated Signed-off-by: arkadeepsen <arsen@redhat.com>
Addresses incorrect DNAT rules with <proto>/0 target port when using services with externalTrafficPolicy: Local and named ports. The issue occurred when allocateLoadBalancerNodePorts was false and services referenced pod named ports. The previous implementation used svcPort.TargetPort.IntValue() which returns 0 for named ports, causing invalid DNAT rules. This refactoring introduces/uses structured endpoint types that properly handle port mapping from endpoint slices, ensuring the actual pod port numbers are used instead of attempting to convert named ports to integers. This change unifies endpoint processing logic by having both the services controller and nodePortWatcher use the same GetEndpointsForService function. This ensures consistent endpoint resolution and port mapping behavior across all service-related components, preventing divergence in logic and similar unnoticed port handling issues in the future. Signed-off-by: Andreas Karis <ak.karis@gmail.com> (cherry picked from commit 0651593)
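To illustrate the named-port pitfall and the EndpointSlice-based resolution described above, here is a hedged Go sketch; resolveTargetPort is a made-up helper, not the GetEndpointsForService function from the commit. With a numeric target port both paths agree; with a named port only the EndpointSlice lookup yields the real pod port, which is what keeps the DNAT rule from ending up as <proto>/0.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	discoveryv1 "k8s.io/api/discovery/v1"
)

// resolveTargetPort shows the pitfall: for a named target port, IntValue()
// returns 0, which is what produced the <proto>/0 DNAT rules. Resolving the
// port from the EndpointSlice instead yields the actual container port.
func resolveTargetPort(svcPort corev1.ServicePort, slice discoveryv1.EndpointSlice) int32 {
	if p := svcPort.TargetPort.IntValue(); p != 0 {
		return int32(p) // numeric target port: use it directly
	}
	// Named target port: look up the matching port in the endpoint slice,
	// where it has already been resolved to the pod's real port number.
	for _, port := range slice.Ports {
		if port.Name != nil && *port.Name == svcPort.Name && port.Port != nil {
			return *port.Port
		}
	}
	return 0 // no match found; caller should treat this as "no endpoint"
}
```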
Adds tests for loadBalancer services with named ports and AllocateLoadBalancerNodePorts=False. Adds new test cases in Test_getEndpointsForService. Signed-off-by: Andreas Karis <ak.karis@gmail.com> (cherry picked from commit 282b01e)
Signed-off-by: Andreas Karis <ak.karis@gmail.com> (cherry picked from commit 651759c)
@ricky-rav: This PR has been marked as verified.
@ricky-rav @jluhrsen isn't the windows job supposed to be fixed by now?
@kyrtapz, I don't really know any more. The bug we filed is still assigned to the windows team bot and there's no update to it. I'll comment on it to see if someone from that team can take a look. I would override it for now. Looks like the master job is passing a little bit now, but 4.17, 4.18 and 4.19 have not seen a single pass since mid-November; 4.16 passes sometimes, though.
/retest-required
@ricky-rav, the e2e-aws-ovn-windows job is not going to pass. I got a reply on the bug here, and it looks like we need some backports to land before we can expect this job to pass.
/override e2e-aws-ovn-windows |
@ricky-rav: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.
@kyrtapz could you add the magic label? :) Thanks
/override ci/prow/e2e-aws-ovn-windows |
@ricky-rav: Overrode contexts on behalf of ricky-rav: ci/prow/e2e-aws-ovn-windows
Just need BPRA from @ricky-rav, but for kicks I see the 4.18 backport to make the windows job pass has merged. Let's see:
/test e2e-aws-ovn-windows
/label backport-risk-assessed |
/retest-required |
1 similar comment
/retest-required |
@jluhrsen: all tests passed! Full PR test history. Your PR dashboard.
Merged commit 8036759 into openshift:release-4.18
@jluhrsen: Jira Issue Verification Checks:
Jira Issue OCPBUGS-66428 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
Jira Issue OCPBUGS-66360 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
Jira Issue OCPBUGS-48710: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-48710 has been moved to the MODIFIED state.
Jira Issue OCPBUGS-66381 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
Fix included in accepted release 4.18.0-0.nightly-2025-12-24-222251 |
This is identical to the automated branch sync PR here, except for two one-line changes to fix the lint job. See that commit message for more details.
This PR is actually what enables linting, so we never noticed the issue before; the issue was a harmless no-op anyway.
With this PR we are in sync with 4.19, as we want to be, with the exception of having to re-pin libreswan because of OCPBUGS-55453.