OCPBUGS-62670: [release-4.19] Networking: reset ovn-remote config and allow ovnkube controller to set it#5324

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:release-4.19 from martinkennelly:garp-419
Oct 15, 2025

Conversation

@martinkennelly
Contributor

…et it

This fixes the issue where ovn-remote is set
prior to reboot and, when boot occurs, ovn-controller syncs quickly with a stale SB DB.

This PR is part of the EIP GARP issue fix.
It's required because when the ovnkube-controller and
ovn-controller containers start on boot, there
is no guaranteed order in which they start,
and we don't want ovn-controller to connect to the SB DB before ovnkube-controller has added the drop flows.

Ideally, we would only allow ovn-controller to sync with the SB DB once ovnkube-controller has finished
syncing and the changes are available in the SB DB.
That may be future work.

(cherry picked from commit 567a191)

/hold

Depends on #5317
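
The ordering constraint described above can be modelled in a few lines of Go. This is only a minimal in-memory sketch, not the code from this PR: `nodeState`, `resetOnBoot`, `installDropFlows`, `setRemote`, and `canConnect` are hypothetical names invented for illustration, and nothing here talks to OVS or OVN. It assumes the intent is that the stale ovn-remote value is cleared at boot and is only written again after the drop flows exist.

```go
package main

import "errors"

// nodeState is a hypothetical in-memory model of the per-node settings
// discussed above. It does not talk to OVS or OVN.
type nodeState struct {
	ovnRemote          string // stands in for the ovn-remote setting
	dropFlowsInstalled bool   // true once the drop flows have been added
}

// errDropFlowsMissing is returned when the remote is set too early.
var errDropFlowsMissing = errors.New("drop flows not installed yet")

// resetOnBoot clears any ovn-remote value left over from before the reboot,
// so that ovn-controller has nothing to connect to yet.
func (n *nodeState) resetOnBoot() {
	n.ovnRemote = ""
	n.dropFlowsInstalled = false
}

// installDropFlows models ovnkube-controller adding the drop flows.
func (n *nodeState) installDropFlows() {
	n.dropFlowsInstalled = true
}

// setRemote models ovnkube-controller publishing the SB DB address.
// In this sketch it refuses to do so before the drop flows are in place.
func (n *nodeState) setRemote(addr string) error {
	if !n.dropFlowsInstalled {
		return errDropFlowsMissing
	}
	n.ovnRemote = addr
	return nil
}

// canConnect reports whether ovn-controller has a remote to connect to.
func (n *nodeState) canConnect() bool {
	return n.ovnRemote != ""
}
```

In this model, whichever container starts first, `canConnect` stays false after `resetOnBoot` until `installDropFlows` and then `setRemote` have both run.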

…et it

This fixes the issue where ovn-remote is set
prior to reboot and, when boot occurs, ovn-controller
syncs quickly with a stale SB DB.

This PR is part of the EIP GARP issue fix.
It's required because when the ovnkube-controller and
ovn-controller containers start on boot, there
is no guaranteed order in which they start,
and we don't want ovn-controller to connect to the SB DB
before ovnkube-controller has added the drop flows.

Ideally, we would only allow ovn-controller to sync
with the SB DB once ovnkube-controller has finished
syncing and the changes are available in the SB DB.
That may be future work.

Signed-off-by: Martin Kennelly <mkennell@redhat.com>
(cherry picked from commit 567a191)
openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Oct 2, 2025
openshift-ci-robot added the jira/severity-important (Referenced Jira bug's severity is important for the branch this PR is targeting.) and jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) labels on Oct 2, 2025
@openshift-ci-robot
Contributor

@martinkennelly: This pull request references Jira Issue OCPBUGS-62670, which is invalid:

  • expected dependent Jira Issue OCPBUGS-62671 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is New instead
  • expected dependent Jira Issue OCPBUGS-62671 to target a version in 4.20.0, but it targets "4.18.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

openshift-ci-robot added the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) on Oct 2, 2025
@martinkennelly
Contributor Author

/payload-with-prs 4.19 nightly blocking openshift/cluster-network-operator#2809 openshift/ovn-kubernetes#2774

@openshift-ci
Contributor

openshift-ci bot commented Oct 2, 2025

@martinkennelly: trigger 11 job(s) of type blocking for the nightly release of OCP 4.19

  • periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-azure-aks-ovn-conformance
  • periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-bm
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a89d3170-9f8b-11f0-93fd-ee70f1d60e20-0

@martinkennelly
Contributor Author

martinkennelly commented Oct 7, 2025

@yuqi-zhang Thank you for reviewing the 4.20 PR. It's not merged, but we want the approvers lined up and labels added. It's a critical bug and we have the fastfix label applied. We will only merge when QE has verified. It's a clean cherry-pick.

@yuqi-zhang
Contributor

/approve
/label backport-risk-assessed

The only thing I'd like to add is that currently the manual bugs for 4.19 and 4.18 have weird cloning, and I think Prow expects a clone of the previous version (so in the clone links, the "depends on" should be the 4.20 bug and not the 4.18 bug).

openshift-ci bot added the backport-risk-assessed label (Indicates a PR to a release branch has been evaluated and considered safe to accept.) on Oct 7, 2025
openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Oct 7, 2025
@martinkennelly
Contributor Author

/payload-with-prs 4.19 nightly blocking openshift/cluster-network-operator#2809 openshift/ovn-kubernetes#2774

@openshift-ci
Contributor

openshift-ci bot commented Oct 8, 2025

@martinkennelly: trigger 11 job(s) of type blocking for the nightly release of OCP 4.19

  • periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-azure-aks-ovn-conformance
  • periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-bm
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c3464cd0-a43d-11f0-8c73-68fbceb6751c-0

@martinkennelly
Contributor Author

/test e2e-aws-mco-disruptive

Unrelated:

[sig-arch] events should not repeat pathologically for ns/openshift-dns  0s
{  2 events happened too frequently

event happened 24 times, something is wrong: namespace/openshift-dns hmsg/78f5fd4cde service/dns-default - reason/TopologyAwareHintsEnabled Topology Aware Hints has been enabled, addressType: IPv4 (15:44:17Z) result=reject 
event happened 42 times, something is wrong: namespace/openshift-dns hmsg/1de144762d service/dns-default - reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (5 endpoints, 2 zones), addressType: IPv4 (15:43:02Z) result=reject }


[sig-arch] events should not repeat pathologically for ns/openshift-cluster-storage-operator  0s
{  2 events happened too frequently

event happened 27 times, something is wrong: namespace/openshift-cluster-storage-operator deployment/cluster-storage-operator hmsg/166f46a2bb - reason/OperatorStatusChanged Status for clusteroperator/storage changed: 
Progressing changed from True to False ("AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: DaemonSet is not progressing\nAWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Deployment is not progressing") (15:51:29Z) result=reject 
event happened 26 times, something is wrong: namespace/openshift-cluster-storage-operator deployment/cluster-storage-operator hmsg/8d50b7801f - reason/OperatorStatusChanged Status for clusteroperator/storage changed: Progressing changed from False to True ("AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods") (15:50:37Z) result=reject }

@zshi-redhat

/jira refresh

openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) on Oct 11, 2025
@openshift-ci-robot
Contributor

@zshi-redhat: This pull request references Jira Issue OCPBUGS-62670, which is valid.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.z) matches configured target version for branch (4.19.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-62273 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-62273 targets the "4.20.0" version, which is one of the valid target versions: 4.20.0
  • bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira (jechen@redhat.com), skipping review request.

@martinkennelly
Contributor Author

@yuqi-zhang Hey Jerry, can you take a look? It's missing a label. You looked at the higher-version PR of this. Thank you for your support throughout this.

@martinkennelly
Contributor Author

/unhold

openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Oct 13, 2025
@martinkennelly
Contributor Author

Nightly blocking is good :)

@yuqi-zhang
Contributor

/lgtm

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Oct 14, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: martinkennelly, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@martinkennelly
Contributor Author

/tide refresh

@martinkennelly
Contributor Author

Tide is giving me conflicting info: the GitHub UI says "Waiting for status to be reported — Not mergeable. Job ci/prow/e2e-gcp-op has not succeeded.", but when I click into Tide it says the PR meets the label requirements. Anyway, maybe there's just a lag.

@martinkennelly
Contributor Author

/tide refresh

@martinkennelly
Contributor Author

/test e2e-gcp-op

@jechen0648

/tide refresh

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 21f507d and 2 for PR HEAD 98291f7 in total

@martinkennelly
Contributor Author

martinkennelly commented Oct 15, 2025

Job e2e-gcp-op has now failed 3 times on this PR with basically the same error.

The failing tests are TestMCNScopeSadPath and TestMCNScopeHappyPath - indeed it is sad for me :). Tests flake, but this is the same error signature each time.

func TestMCNScopeSadPath(t *testing.T) {

        cs := framework.NewClientSet("")

        // Grab two random nodes from different pools, so we don't end up testing and targetting the same node.
        nodeUnderTest := helpers.GetRandomNode(t, cs, "worker")
        targetNode := helpers.GetRandomNode(t, cs, "master")

        // Attempt to patch the MCN owned by targetNode from nodeUnderTest's MCD. This should fail.
        // This oc command effectively use the service account of the nodeUnderTest's MCD pod, which should only be able to edit nodeUnderTest's MCN.
        cmdOutput, err := helpers.ExecCmdOnNodeWithError(cs, nodeUnderTest, "chroot", "/rootfs", "oc", "patch", "machineconfignodes", targetNode.Name, "--type=merge", "-p", "{\"spec\":{\"configVersion\":{\"desired\":\"rendered-worker-test\"}}}")
        require.Error(t, err, "No errors found during failure path :%v", err)
        require.Contains(t, cmdOutput, "updates to MCN "+targetNode.Name+" can only be done from the MCN's owner node")
}
=== RUN   TestMCNScopeSadPath
    mcn_test.go:28: 
        	Error Trace:	/go/src/github.com/openshift/machine-config-operator/test/e2e/mcn_test.go:28
        	Error:      	"Error from server: error dialing backend: dial tcp 10.0.128.3:10250: connect: connection refused\n" does not contain "updates to MCN ci-op-xhrct6yq-5691c-qbbwm-master-1 can only be done from the MCN's owner node"
        	Test:       	TestMCNScopeSadPath

Connection refused is coming from the oc patch ... command. It means that either the target or something along the way rejected the connection. Port 10250 is the well-known kubelet port. It's unrelated to this PR, as I also see this test fail in other PRs:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/5300/pull-ci-openshift-machine-config-operator-release-4.19-e2e-gcp-op/1970493026836942848
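
To make the distinction above concrete, here is a minimal sketch of how the command output could be sorted into "expected rejection" versus "could not reach the backend". The helper `classifyPatchOutput` and the `outcome` constants are hypothetical and are not part of the MCO test suite; the matched substrings are copied from the test source and the failure log quoted above.

```go
package main

import "strings"

// outcome is a hypothetical classification of the oc patch output.
type outcome int

const (
	outcomeUnknown         outcome = iota // output matched nothing we recognise
	outcomeRejectedByScope                // the expected MCN ownership rejection
	outcomeInfraFailure                   // the request never reached the backend
)

// classifyPatchOutput inspects the combined command output and decides
// whether the test observed the rejection it is looking for or an
// unrelated transport failure.
func classifyPatchOutput(output string) outcome {
	switch {
	case strings.Contains(output, "can only be done from the MCN's owner node"):
		return outcomeRejectedByScope
	case strings.Contains(output, "error dialing backend"),
		strings.Contains(output, "connection refused"):
		return outcomeInfraFailure
	default:
		return outcomeUnknown
	}
}
```

With the log line from the failing run, this sketch returns `outcomeInfraFailure`, which matches the reading that the failure is environmental rather than caused by this change.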

Requesting an override.

@martinkennelly
Contributor Author

/test e2e-gcp-op

In case the MCO team doesn't act on my request in time.

@martinkennelly
Contributor Author

/test e2e-gcp-op

Failed to build image, unrelated:

Adding transient rw bind mount for /run/secrets/rhsm
STEP 1/4: FROM quay.io/zzlotnik/zacks-openshift-helpers:latest
Trying to pull quay.io/zzlotnik/zacks-openshift-helpers:latest...
error: build error: creating build container: initializing image from source docker://quay.io/zzlotnik/zacks-openshift-helpers:latest: unexpected end of JSON input

@martinkennelly
Contributor Author

Twice in a row, CI is borked for job e2e-gcp-op at the image build step:

Writing manifest to image destination
Adding transient rw bind mount for /run/secrets/rhsm
STEP 1/4: FROM quay.io/zzlotnik/zacks-openshift-helpers:latest
Trying to pull quay.io/zzlotnik/zacks-openshift-helpers:latest...
error: build error: creating build container: initializing...acks-openshift-helpers:latest: unexpected end of JSON input 

I think it should be overridden anyway, based on job history.

@martinkennelly
Contributor Author

martinkennelly commented Oct 15, 2025

/test e2e-gcp-op

I see nothing on test platform regarding any error.

@martinkennelly
Contributor Author

/test e2e-gcp-op

See previous comment - CI was borked when building the image. Still requesting an override based on job history.

@openshift-ci
Contributor

openshift-ci bot commented Oct 15, 2025

@martinkennelly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 98291f7 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/e2e-gcp-mco-disruptive | 98291f7 | link | false | /test e2e-gcp-mco-disruptive |
| ci/prow/e2e-aws-mco-disruptive | 98291f7 | link | false | /test e2e-aws-mco-disruptive |

Full PR test history. Your PR dashboard.

@martinkennelly
Contributor Author

/test e2e-gcp-op

@martinkennelly
Contributor Author

CI is borked :/

@yuqi-zhang
Contributor

/override ci/prow/e2e-gcp-op

This shouldn't affect gcp-op

@openshift-ci
Contributor

openshift-ci bot commented Oct 15, 2025

@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-gcp-op

@openshift-merge-bot openshift-merge-bot bot merged commit 49dbecf into openshift:release-4.19 Oct 15, 2025
19 of 22 checks passed
@openshift-ci-robot
Contributor

@martinkennelly: Jira Issue Verification Checks: Jira Issue OCPBUGS-62670
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-62670 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria
