
@damdo
Member

damdo commented Jun 18, 2025

Originally reverted by #1383

--

Subsequently un-reverted by #1386

--

Currently, if .spec.authoritativeAPI is set to ClusterAPI when a Machine/MachineSet is created, the following race can occur:

  • at creation, .status.authoritativeAPI is empty
  • the MAPI controller reconciles the resource and, because the status is empty, gets past the guard if m.Status.AuthoritativeAPI != "" && m.Status.AuthoritativeAPI != machinev1.MachineAuthorityMachineAPI
  • the paused condition is then set to Paused: False, but the update function not only patches the resource, it also fetches a fresh copy of the MAPI resource
  • meanwhile, the machine/set migration controller has propagated .spec.authoritativeAPI: ClusterAPI to .status
  • at this point the MAPI resource has .status.authoritativeAPI: ClusterAPI but continues to be reconciled by the MAPI controllers, so the MAPI MachineSet controller can create a machine even though the CAPI MachineSet controller is authoritative (sometimes both engines create one, sometimes only CAPI does, sometimes only MAPI does)
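
The sequence above can be replayed in a few lines. This is only an illustrative sketch: the MachineAuthority type and its constants are local stand-ins for the machinev1 names quoted in the list, and oldGuardSkips is a hypothetical helper that wraps the quoted condition. It is not the controller's actual code.

```go
package main

import "fmt"

// MachineAuthority is a local stand-in for the machinev1 authority type,
// defined here only so the sketch compiles on its own.
type MachineAuthority string

const (
	MachineAuthorityMachineAPI MachineAuthority = "MachineAPI"
	MachineAuthorityClusterAPI MachineAuthority = "ClusterAPI"
)

// oldGuardSkips wraps the condition quoted above (hypothetical helper): it
// reports whether the MAPI controller would stop reconciling the resource.
func oldGuardSkips(status MachineAuthority) bool {
	return status != "" && status != MachineAuthorityMachineAPI
}

func main() {
	// The status is empty at creation, so the guard lets the resource through.
	status := MachineAuthority("")
	fmt.Println("skip at guard:", oldGuardSkips(status)) // false

	// The status update re-fetches the resource; by then the migration
	// controller has propagated spec to status.
	status = MachineAuthorityClusterAPI

	// The guard is not evaluated again, although it would now skip.
	fmt.Println("skip if re-checked:", oldGuardSkips(status)) // true
}
```

The point of the sketch is that the guard runs only once, against a status value that can change before the rest of the reconcile uses it.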

This PR fixes that by explicitly checking every possible .Status.AuthoritativeAPI state and acting accordingly.

I didn't simply drop the leading m.Status.AuthoritativeAPI != "" check from if m.Status.AuthoritativeAPI != "" && m.Status.AuthoritativeAPI != machinev1.MachineAuthorityMachineAPI, because the controller would then set Paused: True on the MAPI MachineSet even when it only needed to wait for the authority to propagate from spec to .status.authoritativeAPI.

I also didn't want to change the logic of updateStatus: it is used in various places and is fairly old legacy code at this point, so I don't think that would be wise.
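
As a rough illustration of "checking every possible state", the guard could be pictured as a switch over the status value. This is a minimal sketch under stated assumptions, not the code in this PR: the MachineAuthority stand-in type, the action names, and the decide helper are all hypothetical, and treating the empty status as "wait" is an assumption drawn from the reasoning above about not pausing while the authority is still propagating.

```go
package main

import "fmt"

// MachineAuthority is a local stand-in for the machinev1 authority type,
// defined here only so the sketch compiles on its own.
type MachineAuthority string

const (
	MachineAuthorityMachineAPI MachineAuthority = "MachineAPI"
	MachineAuthorityClusterAPI MachineAuthority = "ClusterAPI"
)

// action is a hypothetical name for what the MAPI controller does next.
type action string

const (
	actionWait      action = "wait"      // status not propagated yet: neither pause nor reconcile (assumption)
	actionReconcile action = "reconcile" // MAPI is authoritative: Paused: False, then reconcile
	actionPause     action = "pause"     // MAPI is not authoritative: Paused: True, then stop
)

// decide maps .status.authoritativeAPI to the next step (hypothetical helper).
func decide(status MachineAuthority) action {
	switch status {
	case "":
		return actionWait
	case MachineAuthorityMachineAPI:
		return actionReconcile
	default:
		// Any other value, for example ClusterAPI, means MAPI must not act.
		return actionPause
	}
}

func main() {
	for _, s := range []MachineAuthority{"", MachineAuthorityMachineAPI, MachineAuthorityClusterAPI} {
		fmt.Printf("%q -> %s\n", s, decide(s))
	}
}
```

Handling the empty value as its own case is what distinguishes this from simply dropping the != "" check, which would send an empty status down the pause path.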

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 18, 2025
@openshift-ci
Contributor

openshift-ci bot commented Jun 18, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@damdo damdo force-pushed the fix-controllers-guard-on-empty-status-authoritativeAPI branch 2 times, most recently from 0e7711d to 9ddbe20 Compare June 19, 2025 16:49
@damdo damdo force-pushed the fix-controllers-guard-on-empty-status-authoritativeAPI branch 2 times, most recently from 257af99 to 384ed7b Compare June 20, 2025 16:15
@damdo damdo changed the title from "fix: controllers: guard on empty .status.authoritativeAPI" to "OCPCLOUD-2986: fix: controllers: guard on empty .status.authoritativeAPI" Jun 20, 2025
@damdo damdo marked this pull request as ready for review June 20, 2025 16:17
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 20, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jun 20, 2025

@damdo: This pull request references OCPCLOUD-2986 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

Details

In response to this:

xRef: openshift/machine-api-provider-aws#134

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 20, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jun 20, 2025

@damdo: This pull request references OCPCLOUD-2986 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

Details

In response to this:

Currently, if .spec.authoritativeAPI is set to ClusterAPI when a Machine/MachineSet is created, the following race can occur:

  • at creation, .status.authoritativeAPI is empty
  • the MAPI controller reconciles the resource and, because the status is empty, gets past the guard if m.Status.AuthoritativeAPI != "" && m.Status.AuthoritativeAPI != machinev1.MachineAuthorityMachineAPI
  • the paused condition is then set to Paused: False, but the update function not only patches the resource, it also fetches a fresh copy of the MAPI resource
  • meanwhile, the machine/set migration controller has propagated .spec.authoritativeAPI: ClusterAPI to .status
  • at this point the MAPI resource has .status.authoritativeAPI: ClusterAPI but continues to be reconciled by the MAPI controllers, so the MAPI MachineSet controller can create a machine even though the CAPI MachineSet controller is authoritative (sometimes both engines create one, sometimes only CAPI does, sometimes only MAPI does)

This PR fixes that by checking .Status.AuthoritativeAPI != MachineAPI right before branching off into the MAPI reconcile logic for the MAPI resource.

I didn't simply drop the leading m.Status.AuthoritativeAPI != "" check from if m.Status.AuthoritativeAPI != "" && m.Status.AuthoritativeAPI != machinev1.MachineAuthorityMachineAPI, because the controller would then set Paused: True on the MAPI MachineSet even when it only needed to wait for the authority to propagate from spec to .status.authoritativeAPI.

I also didn't want to change the logic of updateStatus: it is used in various places and is fairly old legacy code at this point, so I don't think that would be wise.


@damdo
Member Author

damdo commented Jun 20, 2025

/assign @JoelSpeed

@damdo damdo force-pushed the fix-controllers-guard-on-empty-status-authoritativeAPI branch from 384ed7b to 760eca9 Compare June 20, 2025 16:42
Contributor

@JoelSpeed JoelSpeed left a comment

/approve
/assign @mdbooth

@openshift-ci
Contributor

openshift-ci bot commented Jun 23, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2025
@damdo
Member Author

damdo commented Jun 23, 2025

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Jun 23, 2025

@damdo: This pull request references OCPCLOUD-2986 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.


@damdo
Member Author

damdo commented Jun 23, 2025

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Jun 23, 2025

@damdo: This pull request references OCPCLOUD-2986 which is a valid jira issue.


@damdo damdo force-pushed the fix-controllers-guard-on-empty-status-authoritativeAPI branch from 760eca9 to 705e2c7 Compare June 25, 2025 09:55
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2025
@damdo
Member Author

damdo commented Jun 26, 2025

/retest-required

@damdo damdo changed the title from "OCPCLOUD-2986: fix: controllers: guard on empty .status.authoritativeAPI" to "OCPCLOUD-2986,OCPBUGS-56849: fix: controllers: guard on empty .status.authoritativeAPI" Jun 26, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 26, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jun 26, 2025

@damdo: This pull request references OCPCLOUD-2986 which is a valid jira issue.

This pull request references Jira Issue OCPBUGS-56849, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@damdo
Member Author

damdo commented Jun 26, 2025

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Jun 26, 2025

@damdo: This pull request references OCPCLOUD-2986 which is a valid jira issue.

This pull request references Jira Issue OCPBUGS-56849, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@damdo
Member Author

damdo commented Jun 26, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 26, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jun 26, 2025

@damdo: This pull request references OCPCLOUD-2986 which is a valid jira issue.

This pull request references Jira Issue OCPBUGS-56849, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2


@openshift-ci openshift-ci bot requested a review from sunzhaohua2 June 26, 2025 14:09
@damdo
Member Author

damdo commented Jun 26, 2025

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2025
@damdo
Member Author

damdo commented Jun 26, 2025

/test e2e-aws-operator-techpreview

@damdo
Member Author

damdo commented Jun 26, 2025

/test e2e-aws-operator-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Jun 27, 2025

@damdo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-vsphere-ovn-techpreview | 12253b5 | link | false | /test e2e-vsphere-ovn-techpreview
ci/prow/e2e-openstack | 12253b5 | link | false | /test e2e-openstack
ci/prow/e2e-vsphere-ovn-techpreview-serial | 12253b5 | link | false | /test e2e-vsphere-ovn-techpreview-serial

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@damdo
Member Author

damdo commented Jun 27, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 27, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 917e2fa into openshift:main Jun 27, 2025
31 of 34 checks passed
@openshift-ci-robot
Contributor

@damdo: Jira Issue OCPBUGS-56849: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-56849 has been moved to the MODIFIED state.


@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-api-operator
This PR has been included in build ose-machine-api-operator-container-v4.20.0-202506270743.p0.g917e2fa.assembly.stream.el9.
All builds following this will include this PR.

@openshift-ci-robot
Contributor

@damdo: Jira Issue OCPBUGS-56849: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-56849 has not been moved to the MODIFIED state.



Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-low: Referenced Jira bug's severity is low for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.


6 participants