✨ Trigger machine pool instance refresh (node rollout) if bootstrap config reference changes #4619

AndiDog · 2023-11-06T20:30:33Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

We discussed this and alternative solutions at length in kubernetes-sigs/cluster-api#8858 and CAPI/CAPA office hours. Eventually, we turned to a solution that is not surprising for users: if the bootstrap config reference (e.g. KubeadmConfig) changes within a MachinePool/AWSMachinePool combo, i.e. its name changes, new nodes should be rolled out. With this, it is not necessary to parse the bootstrap data (= user data) and detect changes within those scripts (cloud-init, ignition, maybe other formats in the future). Instead, we only need to store the previously-used bootstrap data secret key (namespace+name) in the launch template, and compare on every reconciliation whether that reference changed. For example, if an operating user wants to roll out a change to KubeadmConfig.spec.files, i.e. add new files on the nodes, it's now sufficient to create a newly-named KubeadmConfig. For my specific usage, this means naming the object KubeadmConfig/my-config-<hash of its spec> and the Helm chart upgrade will take care to deploy this new object and delete the old one (which I tested and works perfectly fine).
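To illustrate the comparison described above, here is a minimal Go sketch. It is not the code from this PR: `secretKey`, `configName`, and `needsRollout` are hypothetical names, and treating a missing previous key as "no rollout" is an assumption made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// secretKey identifies a bootstrap data secret by namespace and name.
// Hypothetical type for this sketch.
type secretKey struct {
	Namespace string
	Name      string
}

// configName derives an object name of the form <prefix>-<hash of spec>,
// so that any change to the spec yields a newly named object.
func configName(prefix, spec string) string {
	sum := sha256.Sum256([]byte(spec))
	return prefix + "-" + hex.EncodeToString(sum[:4])
}

// needsRollout reports whether the currently referenced secret differs from
// the one recorded on the launch template. A nil previous key means nothing
// was recorded; this sketch assumes that case should not trigger a rollout.
func needsRollout(previous *secretKey, current secretKey) bool {
	if previous == nil {
		return false
	}
	return *previous != current
}

func main() {
	previous := &secretKey{Namespace: "default", Name: configName("my-config", "files: []")}
	current := secretKey{Namespace: "default", Name: configName("my-config", "files: [motd]")}
	fmt.Println("rollout needed:", needsRollout(previous, current))
}
```

The point of the design is that only the reference is compared, so the user data format (cloud-init, ignition) never has to be parsed.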

Testing of this change showed that CAPI+CAPA correctly, and in the correct order without possible race conditions (🤞), render the bootstrap data secret, mark the bootstrap config ready, create a new launch template version and trigger instance refresh (which creates new EC2 instances). The only minor glitch I found is #4618, but that can be solved separately.

Since reconciliation of launch templates wasn't covered in tests at all, I changed the mock interfaces a little so it could be tested, and added some test cases.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Towards kubernetes-sigs/cluster-api#8858 and #4071

For the feature to work, users require a CAPI version that includes kubernetes-sigs/cluster-api#8667. Without that, CAPA won't see an updated MachinePool.Spec.Template.Spec.Bootstrap.DataSecretName to reconcile.

Checklist:

squashed commits
includes documentation
adds unit tests
adds or updates e2e tests

Release note:

Trigger machine pool instance refresh (node rollout) if bootstrap config reference changes

fiunchinho · 2023-11-07T11:39:38Z

exp/controllers/awsmachinepool_controller.go

@@ -225,6 +234,7 @@ func (r *AWSMachinePoolReconciler) reconcileNormal(ctx context.Context, machineP

 	ec2Svc := r.getEC2Service(ec2Scope)
 	asgsvc := r.getASGService(clusterScope)
+	reconSvc := r.getReconcileService(ec2Scope)


I'm not familiar with the code base, so this may be a basic question, but what's the difference between ec2Svc and reconSvc? They both seem to be calling ec2.NewService().

This is because I split interfaces such that I can call the actual reconciliation code in tests while still mocking EC2 API calls. I don't know why reconciliation was part of a mockable interface in the first place, but that's where I started from with my change.
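A rough sketch of the idea, assuming simplified stand-in types rather than the real CAPA interfaces: the reconciliation logic stays real, only the EC2 API behind it is faked, so a test can assert which API calls would have been made. `ec2API`, `recordingEC2`, and `reconciler` are hypothetical names.

```go
package main

import "fmt"

// ec2API is a hypothetical stand-in for the mockable interface that wraps
// raw EC2 API calls.
type ec2API interface {
	CreateLaunchTemplateVersion(templateID string, userDataSecret string) error
}

// recordingEC2 is a hand-written fake that records every call it receives.
type recordingEC2 struct {
	calls []string
}

func (r *recordingEC2) CreateLaunchTemplateVersion(templateID, userDataSecret string) error {
	r.calls = append(r.calls, "CreateLaunchTemplateVersion:"+templateID+":"+userDataSecret)
	return nil
}

// reconciler holds the real (non-mocked) logic and depends only on ec2API.
type reconciler struct {
	ec2 ec2API
}

// ReconcileLaunchTemplate creates a new launch template version only when the
// bootstrap data secret reference changed.
func (r *reconciler) ReconcileLaunchTemplate(templateID, previousSecret, currentSecret string) error {
	if previousSecret == currentSecret {
		return nil
	}
	return r.ec2.CreateLaunchTemplateVersion(templateID, currentSecret)
}

func main() {
	fake := &recordingEC2{}
	r := &reconciler{ec2: fake}
	_ = r.ReconcileLaunchTemplate("lt-1", "default/cfg-a", "default/cfg-b")
	fmt.Println(fake.calls)
}
```

If the reconcile function itself sat behind the mocked interface, a test could only check that it was called, not what it does.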

fiunchinho · 2023-11-07T12:03:34Z

pkg/cloud/services/ec2/launchtemplate.go

-	if len(tags) > 0 {
-		// tag instances
+	// tag instances
+	{


Do we use blocks like { } on other parts of the code? I'm wondering if we want to use it, or if it improves things. If we don't use it anywhere else, I wouldn't use it.

Reading this, the purpose of the if len(tags) > 0 test seems to be to avoid creating and appending a spec element to the tagSpecifications with zero tags in spec.Tags list.

Does the AWS sdk reject a LaunchTemplateTagSpecificationRequest with a list of zero tags? If AWS can support that, we can remove this block and the if len(tags) > 0 test from below, it just makes this code harder to read. Otherwise it seems fine.

If there are zero tags per category (here: ec2.ResourceTypeInstance and ec2.ResourceTypeVolume), we just shouldn't add the tag specification. That's why the guard is there. And I think having a block, whether it's just { ... } or if ... { ... } around each category, together with a comment above each block, separates the logic somewhat cleanly.

cnmcavoy

Take a look at the linter errors failing the tests. Overall this looks solid.

cnmcavoy · 2023-11-09T21:40:53Z

pkg/cloud/services/ec2/launchtemplate.go

-	if len(tags) > 0 {
-		// tag instances
+	// tag instances
+	{


Reading this, the purpose of the if len(tags) > 0 test seems to be to avoid creating and appending a spec element to the tagSpecifications with zero tags in spec.Tags list.

Does the AWS sdk reject a LaunchTemplateTagSpecificationRequest with a list of zero tags? If AWS can support that, we can remove this block and the if len(tags) > 0 test from below, it just makes this code harder to read. Otherwise it seems fine.

cnmcavoy · 2023-11-09T21:41:06Z

pkg/cloud/services/ec2/launchtemplate.go

-		// tag EBS volumes
-		spec = &ec2.LaunchTemplateTagSpecificationRequest{ResourceType: aws.String(ec2.ResourceTypeVolume)}
+	// tag EBS volumes
+	if len(tags) > 0 {


See comment above.

Does the above answer solve your question? Here, tags could be empty and we want an if, while for instance tags, we now have at least one (the newly added tag used for the feature) and need no if.
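For readers following this thread, a simplified sketch of the asymmetry, using plain structs instead of the AWS SDK types. The tag key `example/bootstrap-data-secret` is a placeholder, not the key used in the PR, and `tagSpec` and `buildTagSpecs` are hypothetical names.

```go
package main

import "fmt"

// tagSpec is a simplified stand-in for a launch template tag specification.
type tagSpec struct {
	ResourceType string
	Tags         map[string]string
}

// bootstrapSecretTag is a placeholder tag key for this sketch only.
const bootstrapSecretTag = "example/bootstrap-data-secret"

// buildTagSpecs returns one specification per resource type that has tags.
func buildTagSpecs(commonTags map[string]string, secretKey string) []tagSpec {
	var specs []tagSpec

	// tag instances: always at least one tag (the secret key), so no guard.
	{
		instanceTags := map[string]string{bootstrapSecretTag: secretKey}
		for k, v := range commonTags {
			instanceTags[k] = v
		}
		specs = append(specs, tagSpec{ResourceType: "instance", Tags: instanceTags})
	}

	// tag EBS volumes: may be empty, so skip the specification in that case.
	if len(commonTags) > 0 {
		specs = append(specs, tagSpec{ResourceType: "volume", Tags: commonTags})
	}

	return specs
}

func main() {
	fmt.Println(len(buildTagSpecs(nil, "default/my-config")))
	fmt.Println(len(buildTagSpecs(map[string]string{"team": "a"}, "default/my-config")))
}
```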

cnmcavoy · 2023-11-09T21:52:20Z

pkg/cloud/services/interfaces.go

+// from EC2Interface so that we can test the behavior of our non-mock implementations. For example, by not mocking
+// the ReconcileLaunchTemplate function, but mocking EC2Interface, we can test which EC2 API operations would have
+// been called.
+type ReconcileInterface interface {


Is there a better interface name we can choose? I find it unclear when I would add a method here rather than to EC2Interface, given that both interfaces contain method names with "Reconcile". It may confuse future contributors when they decide where an additional method should go.

I'm not sure what a better name would be, but something closer to the purpose of this?

EC2Interface mocks the AWS EC2 API, so ReconcileLaunchTemplate (which I moved out) and ReconcileTags (out of scope for my PR) should not be in there. If I move the ReconcileTags function as well into the new interface, it should become clearer for someone working on the code, particularly if we state that it's for the functionality of the controller as opposed to raw AWS requests. Given that change, what about MachinePoolReconcileInterface since those functions are exclusively used in awsmachinepool_controller.go and awsmanagedmachinepool_controller.go?

I think that would clear up my confusion, sgtm.

Done. Also fixed lint errors.

fiunchinho · 2023-11-13T08:52:08Z

/lgtm

cnmcavoy · 2023-11-20T18:54:41Z

pkg/cloud/services/ec2/launchtemplate.go

+					launchTemplateUserDataSecretKey = &apimachinerytypes.NamespacedName{
+						Namespace: parts[0],
+						Name:      parts[1],
+					}


Can we return here?

cnmcavoy · 2023-11-20T18:55:20Z

pkg/cloud/services/ec2/launchtemplate.go

 	}

-	return i, userdata.ComputeHash(decodedUserData), nil
+	return i, userdata.ComputeHash(decodedUserData), launchTemplateUserDataSecretKey, nil


Is it okay for launchTemplateUserDataSecretKey to be potentially nil here if the above loop fails? Or should we return an error here?

We must be backward-compatible and support cases where the tag is missing by mistake/bug/intervention. So nil is fine. I changed this line to explicit nil here, and added the return for the found case into the loop as you commented above.
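A small sketch of the resulting behavior, under assumptions: the tag key is a placeholder, the `namespace/name` encoding with a `/` separator is assumed for the example, and `tag`, `namespacedName`, and `secretKeyFromTags` are hypothetical names rather than the PR's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// tag is a simplified key/value pair as found on a launch template version.
type tag struct {
	Key   string
	Value string
}

// namespacedName mirrors the shape of a namespace+name object key.
type namespacedName struct {
	Namespace string
	Name      string
}

// bootstrapSecretTag is a placeholder tag key for this sketch only.
const bootstrapSecretTag = "example/bootstrap-data-secret"

// secretKeyFromTags returns the recorded secret key as soon as it is found.
// It returns nil, not an error, when the tag is missing or malformed, so that
// launch templates without the tag keep working.
func secretKeyFromTags(tags []tag) *namespacedName {
	for _, t := range tags {
		if t.Key != bootstrapSecretTag {
			continue
		}
		parts := strings.SplitN(t.Value, "/", 2)
		if len(parts) == 2 && parts[0] != "" && parts[1] != "" {
			return &namespacedName{Namespace: parts[0], Name: parts[1]}
		}
	}
	return nil
}

func main() {
	found := secretKeyFromTags([]tag{{Key: bootstrapSecretTag, Value: "default/my-config"}})
	fmt.Println(found.Namespace, found.Name)
	fmt.Println(secretKeyFromTags(nil) == nil)
}
```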

cnmcavoy

/lgtm

AndiDog · 2023-12-15T07:14:34Z

Rebased and only one (obvious to me) test line was conflicted. @cnmcavoy @fiunchinho could you take another look? @richardcase we'd need approval here as well. I'm not sure if this should go into v2.3.x?

damdo · 2024-01-18T11:07:02Z

@AndiDog for your e2e failure, see this

AndiDog · 2024-01-18T15:37:01Z

Rebased because that fixed E2E tests in another PR.

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

fiunchinho

/lgtm

AndiDog · 2024-01-18T23:17:42Z

/retest

richardcase · 2024-01-19T08:34:28Z

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

richardcase · 2024-01-23T14:47:33Z

/milestone v2.4.0

AndiDog · 2024-01-23T20:22:19Z

/retest

Ankitasw · 2024-01-25T10:09:53Z

@AndiDog could you rebase the PR and retry E2E tests?

AndiDog · 2024-01-25T10:16:41Z

/retest

damdo · 2024-01-25T11:13:06Z

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

AndiDog · 2024-01-25T14:11:56Z

Still one E2E failure related to LBs. The test function createLBService looks strange to me. It expects some status field to be filled immediately and otherwise silently continues.

Ankitasw · 2024-01-31T10:17:05Z

/test pull-cluster-api-provider-aws-e2e

Ankitasw · 2024-01-31T15:03:26Z

/approve

cc @fiunchinho @cnmcavoy @AndiDog for final review

richardcase

Looks good to me as well

/approve

For the final lgtm:

/assign @cnmcavoy @fiunchinho

k8s-ci-robot · 2024-02-04T12:11:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ankitasw, richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Ankitasw,richardcase]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cnmcavoy · 2024-02-05T17:53:08Z

/lgtm

k8s-ci-robot requested review from AverageMarcus and cnmcavoy November 6, 2023 20:30

AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 3480765 to 1653865 Compare November 6, 2023 20:33

AndiDog mentioned this pull request Nov 7, 2023

Machine pool nodes are not rolled during update giantswarm/roadmap#2217

Closed

fiunchinho reviewed Nov 7, 2023

cnmcavoy reviewed Nov 9, 2023

k8s-ci-robot assigned fiunchinho Nov 13, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2023

AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 1653865 to 2f77269 Compare November 16, 2023 10:54

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2023

cnmcavoy reviewed Nov 20, 2023

AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 2f77269 to 77bb30d Compare November 22, 2023 21:41

AndiDog changed the title ~~Trigger machine pool instance refresh (node rollout) if bootstrap config reference changes~~ ✨ Trigger machine pool instance refresh (node rollout) if bootstrap config reference changes Nov 22, 2023

cnmcavoy approved these changes Nov 22, 2023

k8s-ci-robot assigned cnmcavoy Nov 22, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 22, 2023

This was referenced Dec 14, 2023

Cherry-pick bugfixes into v2.3.x release branch giantswarm/cluster-api-provider-aws#576

Merged

Backport changes for bootstrap config reference rotation into v1.4.x branch giantswarm/cluster-api#21

Merged

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2023

AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 77bb30d to 6f6df3a Compare December 15, 2023 07:14

k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 15, 2023

AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 6f6df3a to 169894f Compare January 18, 2024 15:36

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2024

fiunchinho approved these changes Jan 18, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2024

AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 169894f to f0033b1 Compare January 18, 2024 23:16

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2024

AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from f0033b1 to 2cddfed Compare January 19, 2024 07:50

k8s-ci-robot added this to the v2.4.0 milestone Jan 23, 2024

AndiDog mentioned this pull request Jan 23, 2024

KubeadmConfig changes should be reconciled for machine pools, triggering instance recreation kubernetes-sigs/cluster-api#8858

Closed

Trigger machine pool instance refresh (node rollout) if bootstrap config reference changes

7d2df5d

AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 2cddfed to 7d2df5d Compare January 25, 2024 10:16

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 31, 2024

richardcase reviewed Feb 4, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 5, 2024

k8s-ci-robot merged commit dba86fc into kubernetes-sigs:main Feb 5, 2024
19 checks passed

AndiDog mentioned this pull request Sep 4, 2024

Updating MachinePool, AWSMachinePool, and KubeadmConfig resources does not trigger an ASG instanceRefresh #4071

Closed
