Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Trigger machine pool instance refresh (node rollout) if bootstrap config reference changes #4619

Conversation

AndiDog
Copy link
Contributor

@AndiDog AndiDog commented Nov 6, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

We discussed this and alternative solutions at length in kubernetes-sigs/cluster-api#8858 and CAPI/CAPA office hours. Eventually, we turned to a solution that is not surprising for users: if the bootstrap config reference (e.g. KubeadmConfig) changes within a MachinePool/AWSMachinePool combo, i.e. its name changes, new nodes should be rolled out. With this, it is not necessary to parse the bootstrap data (= user data) and detect changes within those scripts (cloud-init, ignition, maybe other formats in the future). Instead, we only need to store the previously-used bootstrap data secret key (namespace+name) in the launch template, and compare on every reconciliation whether that reference changed. For example, if an operating user wants to roll out a change to KubeadmConfig.spec.files, i.e. add new files on the nodes, it's now sufficient to create a newly-named KubeadmConfig. For my specific usage, this means naming the object KubeadmConfig/my-config-<hash of its spec> and the Helm chart upgrade will take care to deploy this new object and delete the old one (which I tested and works perfectly fine).

Testing of this change showed that CAPI+CAPA correctly, and in the correct order without possible race conditions (🤞), render the bootstrap data secret, mark the bootstrap config ready, create a new launch template version and trigger instance refresh (which creates new EC2 instances). The only minor glitch I found is #4618, but that can be solved separately.

Since reconciliation of launch templates wasn't covered in tests at all, I changed the mock interfaces a little so it could be tested, and added some test cases.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Towards kubernetes-sigs/cluster-api#8858 and #4071

For the feature to work, users require a CAPI version that includes kubernetes-sigs/cluster-api#8667. Without that, CAPA won't see an updated MachinePool.Spec.Template.Spec.Bootstrap.DataSecretName to reconcile.

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Release note:

Trigger machine pool instance refresh (node rollout) if bootstrap config reference changes

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 6, 2023
@AndiDog AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 3480765 to 1653865 Compare November 6, 2023 20:33
@@ -225,6 +234,7 @@ func (r *AWSMachinePoolReconciler) reconcileNormal(ctx context.Context, machineP

ec2Svc := r.getEC2Service(ec2Scope)
asgsvc := r.getASGService(clusterScope)
reconSvc := r.getReconcileService(ec2Scope)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the code base, so this may be a basic question but, what's the difference between ec2Svc and reconSvc? They both seem to be calling ec2.NewService().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because I split interfaces such that I can call the actual reconciliation code in tests while still mocking EC2 API calls. I don't know why reconciliation was part of a mockable interface in the first place, but that's where I started from with my change.

if len(tags) > 0 {
// tag instances
// tag instances
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use blocks like { } on other parts of the code? I'm wondering if we want to use it, or if it improves things. If we don't use it anywhere else, I wouldn't use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this, the purpose of the if len(tags) > 0 test seems to be to avoid creating and appending a spec element to the tagSpecifications with zero tags in spec.Tags list.

Does the AWS sdk reject a LaunchTemplateTagSpecificationRequest with a list of zero tags? If AWS can support that, we can remove this block and the if len(tags) > 0 test from below, it just makes this code harder to read. Otherwise it seems fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are zero tags per category (here: ec2.ResourceTypeInstance and ec2.ResourceTypeVolume), we just shouldn't add the tag specification. That's why the guard is there. And I think having a block, whether it's just { ... } or if ... { ... } around each category, together with a comment above each block, separates the logic somewhat cleanly.

Copy link
Contributor

@cnmcavoy cnmcavoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at the linter errors failing the tests. Overall this looks solid.

if len(tags) > 0 {
// tag instances
// tag instances
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this, the purpose of the if len(tags) > 0 test seems to be to avoid creating and appending a spec element to the tagSpecifications with zero tags in spec.Tags list.

Does the AWS sdk reject a LaunchTemplateTagSpecificationRequest with a list of zero tags? If AWS can support that, we can remove this block and the if len(tags) > 0 test from below, it just makes this code harder to read. Otherwise it seems fine.

// tag EBS volumes
spec = &ec2.LaunchTemplateTagSpecificationRequest{ResourceType: aws.String(ec2.ResourceTypeVolume)}
// tag EBS volumes
if len(tags) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the above answer solve your question? Here, tags could be empty and we want an if, while for instance tags, we now have at least one (the newly added tag used for the feature) and need no if.

// from EC2Interface so that we can test the behavior of our non-mock implementations. For example, by not mocking
// the ReconcileLaunchTemplate function, but mocking EC2Interface, we can test which EC2 API operations would have
// been called.
type ReconcileInterface interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better interface name we can choose? I find that I am confused when I would want to add method here or to the EC2Interface, given that both interfaces contain methods names with "Reconcile" . It may be confusing for future contributors when deciding where an additional method should go and how to decide.

I'm not sure what a better name would be, but something closer to the purpose of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EC2Interface mocks the AWS EC2 API, so ReconcileLaunchTemplate (which I moved out) and ReconcileTags (out of scope for my PR) should not be in there. If I move the ReconcileTags function as well into the new interface, it should become clearer for someone working on the code, particularly if we state that it's for the functionality of the controller as opposed to raw AWS requests. Given that change, what about MachinePoolReconcileInterface since those functions are exclusively used in awsmachinepool_controller.go and awsmanagedmachinepool_controller.go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would clear up my confusion, sgtm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also fixed lint errors.

@fiunchinho
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2023
@AndiDog AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 1653865 to 2f77269 Compare November 16, 2023 10:54
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2023
launchTemplateUserDataSecretKey = &apimachinerytypes.NamespacedName{
Namespace: parts[0],
Name: parts[1],
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we return here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

return i, userdata.ComputeHash(decodedUserData), nil
return i, userdata.ComputeHash(decodedUserData), launchTemplateUserDataSecretKey, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it okay for launchTemplateUserDataSecretKey to be potentially nil here, if the above loop fails? Or should we return an error here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must be backward-compatible and support cases where the tag is missing by mistake/bug/intervention. So nil is fine. I changed this line to explicit nil here, and added the return for the found case into the loop as you commented above.

@AndiDog AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 2f77269 to 77bb30d Compare November 22, 2023 21:41
@AndiDog AndiDog changed the title Trigger machine pool instance refresh (node rollout) if bootstrap config reference changes ✨ Trigger machine pool instance refresh (node rollout) if bootstrap config reference changes Nov 22, 2023
Copy link
Contributor

@cnmcavoy cnmcavoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 22, 2023
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2023
@AndiDog AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 77bb30d to 6f6df3a Compare December 15, 2023 07:14
@AndiDog
Copy link
Contributor Author

AndiDog commented Dec 15, 2023

Rebased and only one (obvious to me) test line was conflicted. @cnmcavoy @fiunchinho could you take another look? @richardcase we'd need approval here as well. I'm not sure if this should go into v2.3.x?

@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 15, 2023
@damdo
Copy link
Member

damdo commented Jan 18, 2024

@AndiDog for your e2e failure, see this

@AndiDog AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 6f6df3a to 169894f Compare January 18, 2024 15:36
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2024
@AndiDog
Copy link
Contributor Author

AndiDog commented Jan 18, 2024

Rebased because that fixed E2E tests in another PR.

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

Copy link
Contributor

@fiunchinho fiunchinho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2024
@AndiDog AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 169894f to f0033b1 Compare January 18, 2024 23:16
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2024
@AndiDog
Copy link
Contributor Author

AndiDog commented Jan 18, 2024

/retest

@AndiDog AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from f0033b1 to 2cddfed Compare January 19, 2024 07:50
@richardcase
Copy link
Member

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@richardcase
Copy link
Member

/milestone v2.4.0

@k8s-ci-robot k8s-ci-robot added this to the v2.4.0 milestone Jan 23, 2024
@AndiDog
Copy link
Contributor Author

AndiDog commented Jan 23, 2024

/retest

@Ankitasw
Copy link
Member

@AndiDog could you rebase the PR and retry E2E tests?

@AndiDog AndiDog force-pushed the machine-pool-bootstrap-config-ref-rotation branch from 2cddfed to 7d2df5d Compare January 25, 2024 10:16
@AndiDog
Copy link
Contributor Author

AndiDog commented Jan 25, 2024

/retest

@damdo
Copy link
Member

damdo commented Jan 25, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@AndiDog
Copy link
Contributor Author

AndiDog commented Jan 25, 2024

Still one E2E failure related to LBs. The test function createLBService look strange to me. It expects some status field to be filled immediately or otherwise silently continues.

@Ankitasw
Copy link
Member

/test pull-cluster-api-provider-aws-e2e

@Ankitasw
Copy link
Member

/approve

cc @fiunchinho @cnmcavoy @AndiDog for final review

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 31, 2024
Copy link
Member

@richardcase richardcase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me as well

/approve

For the final lgtm:

/assign @cnmcavoy @fiunchinho

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ankitasw, richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Ankitasw,richardcase]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cnmcavoy
Copy link
Contributor

cnmcavoy commented Feb 5, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 5, 2024
@k8s-ci-robot k8s-ci-robot merged commit dba86fc into kubernetes-sigs:main Feb 5, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants