✨ Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) #5543

AndiDog · 2025-06-11T14:18:44Z

(reopened from #5173 because I didn't have enough permissions to force-push)

What type of PR is this?

What this PR does / why we need it:

Changing any relevant spec.* field of an AWSMachinePool triggers a rolling replacement of nodes via ASG instance refresh. If another change happens shortly afterwards, it has to wait until the first rollout is done and only then triggers another instance refresh. Rolling all worker nodes twice in such a case is neither necessary nor desired, and it is much slower. Instead, cancel the pending instance refresh, wait until a new one can be started, and apply the latest change as soon as possible with the second instance refresh.
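The intended flow can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the `ASGRefresher` interface, `reconcileRefresh`, `fakeASG`, and the `Action` values are hypothetical names made up for the example. Only the status strings mirror AWS Auto Scaling instance refresh statuses.

```go
package main

import "fmt"

// RefreshStatus mirrors a subset of the AWS Auto Scaling instance refresh status names.
type RefreshStatus string

const (
	StatusPending    RefreshStatus = "Pending"
	StatusInProgress RefreshStatus = "InProgress"
	StatusCancelling RefreshStatus = "Cancelling"
	StatusCancelled  RefreshStatus = "Cancelled"
)

// ASGRefresher is a hypothetical abstraction over the ASG calls needed here.
type ASGRefresher interface {
	LatestRefreshStatus() (RefreshStatus, error)
	CancelRefresh() error
	StartRefresh() error
}

// Action describes what a single reconcile pass decided to do (hypothetical).
type Action string

const (
	ActionNone          Action = "none"
	ActionCancelRequeue Action = "cancel-and-requeue"
	ActionWaitRequeue   Action = "wait-and-requeue"
	ActionStarted       Action = "started"
)

// reconcileRefresh cancels a running refresh when a newer relevant change
// arrives, waits while the cancellation is in flight, and otherwise starts
// a fresh refresh that carries the latest change.
func reconcileRefresh(asg ASGRefresher, specChanged bool) (Action, error) {
	if !specChanged {
		return ActionNone, nil
	}
	status, err := asg.LatestRefreshStatus()
	if err != nil {
		return ActionNone, err
	}
	switch status {
	case StatusPending, StatusInProgress:
		// An older rollout is still running: cancel it instead of waiting for it.
		if err := asg.CancelRefresh(); err != nil {
			return ActionNone, err
		}
		return ActionCancelRequeue, nil
	case StatusCancelling:
		// A new refresh cannot be started yet: try again on the next reconcile.
		return ActionWaitRequeue, nil
	default:
		if err := asg.StartRefresh(); err != nil {
			return ActionNone, err
		}
		return ActionStarted, nil
	}
}

// fakeASG is an in-memory stand-in for the real AWS API, for illustration only.
type fakeASG struct {
	status             RefreshStatus
	cancelled, started int
}

func (f *fakeASG) LatestRefreshStatus() (RefreshStatus, error) { return f.status, nil }
func (f *fakeASG) CancelRefresh() error                        { f.cancelled++; f.status = StatusCancelling; return nil }
func (f *fakeASG) StartRefresh() error                         { f.started++; f.status = StatusPending; return nil }

func main() {
	asg := &fakeASG{status: StatusInProgress}
	first, _ := reconcileRefresh(asg, true)
	asg.status = StatusCancelled // simulate AWS finishing the cancellation
	second, _ := reconcileRefresh(asg, true)
	fmt.Println(first, second, asg.cancelled, asg.started)
}
```

In this sketch the nodes are rolled only once with the latest change: the first pass cancels the outdated refresh, and a later pass starts the new one as soon as the ASG allows it.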

This change has been running fine in Giant Swarm's CAPA fork for almost a year at the time of opening this PR.

Checklist:

Release note:

Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished

k8s-ci-robot · 2025-06-11T14:18:51Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign dlipovetsky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

AndiDog · 2025-06-12T08:35:19Z

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

AndiDog · 2025-07-16T15:15:27Z

@richardcase I think you have some machine pool experience and could maybe review this?

AndiDog · 2025-08-28T15:21:07Z

Rebased onto AWS SDK Go v2 changes

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

AndiDog · 2025-08-28T22:08:57Z

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

AndiDog · 2025-08-30T11:27:40Z

Rate limit errors in the test, trying again

/test pull-cluster-api-provider-aws-e2e

fiunchinho

/lgtm

k8s-ci-robot · 2025-09-18T11:46:53Z

LGTM label has been added.

Git tree hash: 524f880dcd8a5eff0e7f608006e4e86f3fc3786c

richardcase · 2025-09-18T13:42:45Z

@AndiDog - do you think there is any benefit in adding an e2e that covers this scenario?

AndiDog · 2025-10-01T06:56:07Z

> @AndiDog - do you think there is any benefit in adding an e2e that covers this scenario?

@richardcase I'd say the feature is not critical enough to require an E2E test. The behavior is well covered in exp/controllers/awsmachinepool_controller_test.go (and I'm saying that as a huge fan of those fast-running, mock-based unit tests because they're much easier to write).

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 11, 2025

k8s-ci-robot requested review from damdo and richardcase June 11, 2025 14:18

k8s-ci-robot added needs-priority size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 11, 2025

AndiDog force-pushed the cancel-instance-refresh branch from 56f83a4 to a99b8ca Compare June 11, 2025 15:23

AndiDog force-pushed the cancel-instance-refresh branch from a99b8ca to af59d86 Compare June 24, 2025 07:02

AndiDog force-pushed the cancel-instance-refresh branch 2 times, most recently from 615c57c to d4d3662 Compare July 2, 2025 14:41

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2025

Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token)

baf3527

AndiDog force-pushed the cancel-instance-refresh branch from d4d3662 to baf3527 Compare August 28, 2025 14:15

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2025

AndiDog added the area/machinepool label Sep 16, 2025

fiunchinho approved these changes Sep 18, 2025

k8s-ci-robot assigned fiunchinho Sep 18, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 18, 2025
