ELB health check fails with Kubernetes >=v1.30.x #5139
Comments
/triage accepted |
I'm running into this as well. |
/priority critical-urgent |
This was discussed at the office hours on 14th October 2024. The summary is that:
|
/help |
@richardcase: Guidelines: Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/milestone v2.8.0 |
/milestone v2.7.2 |
I tried setting the TLS cipher suites, but that didn't work: https://gist.github.com/richardcase/47118a404bc832904c399ba1360462f2 |
@richardcase I wasn't able to just apply your spec directly because of some IAM issues, but I was able to create the cluster by explicitly setting this public AMI:
AWSCluster
and KCP:
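The AWSCluster and KCP snippets from this comment did not survive here. As a minimal, hypothetical sketch of where an explicit AMI ID goes, assuming the control plane machines come from an AWSMachineTemplate referenced by the KubeadmControlPlane (the AMI ID and instance type below are placeholders, not the ones used in this comment):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachineTemplate
metadata:
  name: "${CLUSTER_NAME}-control-plane"
spec:
  template:
    spec:
      # Placeholder: substitute a public CAPA AMI ID that matches your
      # Kubernetes version and region.
      ami:
        id: ami-0123456789abcdef0
      instanceType: t3.large
      iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
```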
|
I got it working with this additional argument in the template (I'm using
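A minimal sketch of what such an additional template argument could look like, assuming it is the API server's tls-cipher-suites flag discussed elsewhere in this thread; the suite list here is illustrative only, not the template's full list:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: "${CLUSTER_NAME}-control-plane"
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-provider: external
          # Illustrative subset only: re-adds cipher suites that the classic
          # ELB SSL health check can negotiate with Kubernetes v1.30+.
          tls-cipher-suites: "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256"
```
|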
Do we really want to hardcode these less secure settings in the template? That makes it very likely that users will blindly copy them. I'm rather thinking of other options:
|
In the office hours we talked about switching the default load balancer type to NLB. Next steps:
|
This issue is labeled with You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted |
The reconciler fails afterwards. It looks like a change in the load balancer naming scheme has happened as well (
Even if it did work, it would break existing clients: there is no way to convert an ELB in place, only to create a new one and then migrate the existing clients manually (or with the help of DNS, think Route 53).
This looks interesting as the enabler of the proper way forward, in my opinion (read on). Since the working (NLB) alternative uses a TCP health check by default, why not have the default classic ELB use TCP health checks as well, as a current workaround? I agree that SSL/TLS verification goes beyond plain TCP, but it is still not as comprehensive as a proper HTTPS health check, so this change does not really dumb down the defaults much in my opinion. Note that there are no raw SSL/TLS checks in non-classic ELBs. The way forward, as I see it, would be to migrate to NLB with HTTPS checks, as they should be the most comprehensive (and working), i.e.:
See also notes below. Other notes:
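A minimal sketch of that proposed end state, assuming the AWSCluster controlPlaneLoadBalancer fields behave as described in this thread:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: "${CLUSTER_NAME}"
spec:
  controlPlaneLoadBalancer:
    # NLB instead of the default classic ELB...
    loadBalancerType: nlb
    # ...with an HTTPS health check against the API server rather than raw TCP.
    healthCheckProtocol: HTTPS
```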
|
/milestone v2.8 |
Talking with @damdo a bit: would it work to update the cipher suites for v2.8, then change the default in v2.9? This gives users some time to prepare for the change in default. Another option could be to use TCP health checks instead of updating the cipher suites, but either way, I think the default will need to change by v2.9, and we'll need to provide either automatic support for the migration or well-tested documentation on how to change over. A sketch of the TCP option follows.
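For the TCP option, a minimal sketch of the explicit setting, assuming healthCheckProtocol can be set alongside the (default) classic load balancer type:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: "${CLUSTER_NAME}"
spec:
  controlPlaneLoadBalancer:
    # Keep the classic ELB default for now, but use a plain TCP health
    # check instead of SSL so it passes on Kubernetes v1.30+.
    loadBalancerType: classic
    healthCheckProtocol: TCP
```
|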
#5338 updated a test case that was breaking. Talking to Richard and Damiano, we think this is how we're going to approach the problem: Starting in v2.8.0 for new installations, we'll update all the templates to use Network Load Balancers. We will also be marking the classic load balancer default as deprecated. This will start a 9 month/3 version timer until the default is updated within the API to be a Network Load Balancer. For existing clusters that are using the defaults, the following logic will apply:
For existing clusters that are not using the defaults, error and inform the user that they'll need to migrate. The webhook will need to be updated to allow the user to make the change themselves. |
I just tested automatically changing the health check protocol and this works nicely with minimal code changes. I will create a PR with the changes after a bit more testing. |
/assign |
I put up:
|
…to 2.8.1 in /hack/third-party/capa (#1089)
Bumps sigs.k8s.io/cluster-api-provider-aws/v2 from 2.7.1 to 2.8.1.

Release notes (sourced from sigs.k8s.io/cluster-api-provider-aws/v2's releases):

v2.8.1: Release notes for Cluster API Provider AWS (CAPA) v2.8.1
Documentation: https://cluster-api-aws.sigs.k8s.io/
Note: there is no v2.8.0 GitHub release due to issues during the release process.

Changelog since v2.7.1

Urgent Upgrade Notes (no, really, you MUST read this before you upgrade):

- Action required: Bump CAPI to v1.9.z. !! ACTION REQUIRED BEFORE UPGRADING !! If you are using the AWSManagedControlPlane to provision EKS clusters and you do not have a spec.Version specified in such a resource (meaning you are relying on the default that AWS provides), you will need to either:
  a) explicitly set the spec.Version field before upgrading CAPA, or
  b) disable the MachineSetPreflightChecks in your cluster, either:
     b1) by setting this core CAPI feature gate to false, or
     b2) by disabling it via the relevant annotation on all the MachineSets belonging to said cluster (follow this guide: https://cluster-api.sigs.k8s.io/tasks/experimental-features/machineset-preflight-checks).
  This is necessary as core CAPI 1.9 introduces a feature gate change, setting MachineSetPreflightChecks=true, which in turn relies on the presence of spec.Version and status.Version on the AWSManagedControlPlane object. We are planning a future refactor of these API fields in v1beta3 (kubernetes-sigs/cluster-api-provider-aws#3853). Other places where you can find details on this are:
  - kubernetes-sigs/cluster-api-provider-aws#5225
  - kubernetes-sigs/cluster-api#11117
  - https://kubernetes.slack.com/archives/CD6U2V71N/p1739783013734149
  (#5209, @damdo)
- Action required: From this release onwards we recommend not creating clusters using the classic ELB (which is the default for the API). Classic ELB support is deprecated and will be removed in a future version. For new and existing clusters that use a classic ELB AND do not specify the health check protocol, the protocol will be changed/set to TCP instead of SSL. If you want to use a classic ELB with an SSL health check, then you will need to specify the cipher suites to use in the KubeadmControlPlane:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: "${CLUSTER_NAME}-control-plane"
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-provider: external
          # This is needed for Kubernetes v1.30+ since else it uses the Go defaults which don't
          # work with AWS classic load balancers, see
          # kubernetes-sigs/cluster-api-provider-aws#5139. If you use
          # another load balancer type such as NLB, this is not needed.
          #
          # The list consists of the secure ciphers from Go 1.23.3, plus some less secure
          # ... (truncated)
```

... (release notes truncated)

Commits:
- 49c75ac Merge pull request #5428 from damdo/bump-cloudbuild-gcb-image-go-1.23
- 468aa37 cloudbuild: bump gcb image to get go 1.23
- 29ef6c7 Merge pull request #5426 from mzazrivec/fix_doc_links
- dea4012 Merge pull request #5422 from nrb/retry-boskos-checkouts
- 6c93889 Merge pull request #5418 from richardcase/fix_efs_e2e_test
- 1e93dc7 Retry boskos account checkouts
- eb3ff6e refactor: remove old setup logic for AWSCluster
- b7fcdff fix: efs e2e test breaking
- 5b02ba0 Merge pull request #5425 from richardcase/5424_ensurepaused_fixed
- afcd118 Fix links to superseded document
- Additional commits viewable in the compare view: https://github.com/kubernetes-sigs/cluster-api-provider-aws/compare/v2.7.1...v2.8.1

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jimmi Dyson <[email protected]> |
/kind bug
What steps did you take and what happened:
Follow the quickstart documentation with Kubernetes v1.30.5 and a custom-built AMI (public AMIs are missing both for that version and for the default v1.31.0).
The ELB health check fails and the cluster is stuck after creating the first control-plane instance. The AWS console shows that 0 of 1 instances are in service.
What did you expect to happen:
The defaults should result in a working cluster.
Anything else you would like to add:
Changing the health check to TCP in the AWS console did fix the check, but this update is not allowed by a webhook here, and even after removing the webhook, the new value from the AWSCluster never got applied.
Setting this on the apiserver and other control-plane components allowed the ELB health check to pass.
Some discussion about this in the Kubernetes Slack: https://kubernetes.slack.com/archives/C3QUFP0QM/p1726622974749509
Environment:
- Kubernetes version (use `kubectl version`): v1.30.5
- OS (e.g. from `/etc/os-release`):