Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Machine pool nodes are not rolled during update #2217

Closed
Tracked by #2737
calvix opened this issue Jan 18, 2023 · 18 comments
Closed
Tracked by #2737

Machine pool nodes are not rolled during update #2217

calvix opened this issue Jan 18, 2023 · 18 comments
Assignees
Labels
area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service goal/capa-internal-ga kind/bug provider/cluster-api-aws Cluster API based running on AWS team/phoenix Team Phoenix topic/capi

Comments

@calvix
Copy link

calvix commented Jan 18, 2023

Issue

The machine pool nodes are not rolled if the KubeadmConfig is changed and the nodes needs to be manually rolled.

Details

I noticed during the thunder update, as there were many changes to both cluster-aws and other stuff, it looked like it rolled nodes but the changes for proxy were not rolled as well, once I had deleted the nodes manually they came up with proper config.

We should first confirm if the scenario is reproducible, and if it is we need to find a solution or be aware of the issue.

How to try to reproduce:

  • trigger a cluster update that includes a roll of worker nodes
  • once the update is triggered make a change to a configuration that would trigger yet another change causing and roll of worker nodes
  • observe if the second change is applied to workers or not
@fiunchinho
Copy link
Member

I think it's this bug kubernetes-sigs/cluster-api-provider-aws#4071

@alex-dabija alex-dabija changed the title pontentional unconfirmed bug - if the kubeadmconfig of machinepool changes during ongoing update it wont roll nodes again and you might need to manually roll nodes later if the kubeadmconfig of machinepool changes during ongoing update it wont roll nodes again and you might need to manually roll nodes later Mar 7, 2023
@alex-dabija alex-dabija changed the title if the kubeadmconfig of machinepool changes during ongoing update it wont roll nodes again and you might need to manually roll nodes later Machine pool nodes not rolled during update Mar 27, 2023
@alex-dabija alex-dabija changed the title Machine pool nodes not rolled during update Machine pool nodes are not rolled during update Mar 27, 2023
@alex-dabija
Copy link

I've encountered a similar situation when other machine pool settings were updated. I ended up manually rolling the nodes from the AWS console.

@alex-dabija alex-dabija transferred this issue from another repository Mar 27, 2023
@alex-dabija alex-dabija transferred this issue from another repository Mar 27, 2023
@alex-dabija alex-dabija added area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service team/hydra topic/capi provider/cluster-api-aws Cluster API based running on AWS kind/bug labels Mar 27, 2023
@primeroz
Copy link

FYI CAPz has the a very similar issues with MachinePool Roll updates , https://github.com/giantswarm/giantswarm/issues/25188 , reason why we reverted to MachineDeployments

@AndiDog AndiDog self-assigned this Apr 13, 2023
@AndiDog
Copy link

AndiDog commented Apr 13, 2023

Here's why this happens: upstream comment. I'll try to figure out implementation ideas for the fix.

@AndiDog
Copy link

AndiDog commented May 2, 2023

In kubernetes-sigs/cluster-api-provider-aws#4071 (comment), I figured that cluster-api changes are also needed to speed up time until nodes get refreshed.

Therefore I started working on making our fork https://github.com/giantswarm/cluster-api ready to work with. First change to be upstream: kubernetes-sigs/cluster-api#8586 so we can build only the production components controller images of CAPI, but avoid building not required components (clusterctl, test stuff). Once merged, our fork should not differ from upstream apart from our CircleCI config.

@AndiDog
Copy link

AndiDog commented May 4, 2023

First part of the fix in CAPA: kubernetes-sigs/cluster-api-provider-aws#4245

@alex-dabija alex-dabija added team/phoenix Team Phoenix and removed team/hydra labels May 15, 2023
@T-Kukawka
Copy link
Contributor

@AndiDog do u know if there was any decision made?

@AndiDog
Copy link

AndiDog commented Jun 14, 2023

Next step is to discuss the issue in the CAPI office hours, since in the CAPA meeting we noted that it's generic across providers and CAPI needs to be adapted at best.

@AndiDog
Copy link

AndiDog commented Jun 15, 2023

Continuing the discussion with CAPI maintainers in the new issue kubernetes-sigs/cluster-api#8858 and Slack thread.

@AndiDog
Copy link

AndiDog commented Aug 10, 2023

I'm writing down a proposed solution which requires some contract from CAPI to infra providers (e.g. CAPA) so they can get a checksum of the user data without the bootstrap token – or in other terms, a checksum of KubeadmConfig.spec since we want to roll node on changes to that. That should then trigger some further discussion with upstream maintainers, since it's more actionable. The proposal will go into kubernetes-sigs/cluster-api#8858 very soon, so I won't repeat it here.

In the meantime, I want to try a workaround: CAPA always rolls out the new launch template if AWSMachinePool.spec.additionalTags changes (see if needsUpdate || tagsChanged || ... in code). If we put a checksum of KubeadmConfig.spec in such a tag, we would therefore enforce rolling nodes. The timing issue in CAPI still remains, meaning that the user data (alias bootstrap data secret) only gets updated every 5-15 minutes because CAPI only updates when the bootstrap token TTL is about to expire. That's however much easier to fix and does not need any discussion, since we've already discussed this as clear bug in the upstream meetings. I have a half-ready change that was already working for me.

@AndiDog
Copy link

AndiDog commented Aug 14, 2023

The proposed workaround, i.e. adding a hash of the machine pool's KubeadmConfig.spec as tag value into AWSMachinePool.spec.additionalTags, forces node rollout just fine. However, due to CAPI not immediately updating the bootstrap secret (kubernetes-sigs/cluster-api#8858), the new nodes start with an old version of the launch template. Only the next rollout, if 10 minutes later, would use the new version, since CAPI by then has updated the secret. So my workaround won't work until that is fixed, and I'll concentrate on that first. But even then, I have the feeling that a race condition could occur – what if CAPA's AWSMachinePool controller first triggers a node rollout and CAPI takes a few seconds longer to update the bootstrap secret? Well, we'll give it a try.

@AndiDog
Copy link

AndiDog commented Aug 21, 2023

Proposed solution in the CAPI issue: kubernetes-sigs/cluster-api#8858 (comment)

@AndiDog
Copy link

AndiDog commented Sep 27, 2023

My first proposal would lead to quite some contract changes between CAPI and bootstrap providers, so not sure if it's feasible or desired.

Therefore, to get a quick turnaround, I tried the "swap the MachinePool -> KubeadmConfig object reference" workaround. Unfortunately, it doesn't work and creates miserable problems. But it's a use case that seemingly can be fixed more easily, so there's a chance that I can go forward with maintainers to patch CAPI and CAPA. Details in kubernetes-sigs/cluster-api#8858 (comment).

@JosephSalisbury
Copy link
Contributor

from wg capa migration sync: critical for adidas poc (as we need to do manual work to upgrade the cluster)

@AndiDog
Copy link

AndiDog commented Oct 11, 2023

I have a meeting with upstream this Friday to see how we can go on. None of the workarounds worked until now because of CAPI/CAPA bugs or shortcomings.

@AndiDog
Copy link

AndiDog commented Nov 7, 2023

Working and tested solution in kubernetes-sigs/cluster-api-provider-aws#4619 (combined with newer CAPI version), so moving this to blocked until we progress on the upstream PRs

@AndiDog
Copy link

AndiDog commented Jan 2, 2024

We're almost done here. CAPI/CAPA forks and cluster-aws have the feature. Waiting for the upstream PR (targeting CAPA v2.4.0 because it's a new feature) kubernetes-sigs/cluster-api-provider-aws#4619 to be merged before closing this issue. There's also still a small docs PR open: giantswarm/docs#2028.

@AndiDog
Copy link

AndiDog commented Feb 5, 2024

Upstream PR is merged now

@AndiDog AndiDog closed this as completed Feb 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service goal/capa-internal-ga kind/bug provider/cluster-api-aws Cluster API based running on AWS team/phoenix Team Phoenix topic/capi
Projects
None yet
Development

No branches or pull requests

7 participants