-
Notifications
You must be signed in to change notification settings - Fork 519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
aws-k8s: pass --hostname-override
to kubelet, set it to inst. PrivateDnsName
for k8s-1.26
#3033
Conversation
PrivateDnsName
for k8s-1.26--hostname-override
to kubelet, set it to inst. PrivateDnsName
for k8s-1.26
Push above adds a README entry for the new |
Push above fixes some typos and word changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few nits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
sources/api/migration/migrations/v1.13.4/add-hostname-override-metadata/src/main.rs
Outdated
Show resolved
Hide resolved
sources/models/shared-defaults/kubernetes-aws-external-cloud-provider.toml
Outdated
Show resolved
Hide resolved
// Limit the timeout for the EC2 describe-instances API call to 5 minutes | ||
const EC2_DESCRIBE_INSTANCES_TIMEOUT: Duration = Duration::from_secs(300); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we align this with the timeout that the in-tree aws cloud provider uses for the same API call?
I am wary about creating a launch time regression for anyone using a private VPC with no access to the EC2 API. If this call timed out previously and the cloud provider just used the system hostname, then it may not have been noticed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The in-tree aws cloud provider relies on the default retries/timeout in the aws-sdk-go EC2 client: https://github.com/kubernetes/legacy-cloud-providers/blob/74c983775777a7ddd7ebcf5973420adb405cefd4/aws/instances.go#L136-L147
https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/retries-timeouts/#standard-retryer
They don't modify the default client:
https://github.com/kubernetes/legacy-cloud-providers/blob/74c983775777a7ddd7ebcf5973420adb405cefd4/aws/aws.go#L824-L834
A default value of 2 for maximum retry attempts, making a total of 3 call attempts. This value can be overwritten through the max_attempts configuration parameter.
Any retry attempt will include an exponential backoff by a base factor of 2 for a maximum backoff time of 20 seconds.
Do we want to align on that?
EKS Optimized AMI is doing 10 retries instead:
https://github.com/awslabs/amazon-eks-ami/pull/1264/files#diff-049390d14bc3ea2d7882ff0f108e2802ad9b043336c5fa637e93581d9a7fdfc2R488-R490
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we should put a time bound on the API call still. If we dont, we'll run into #2370 again where pluto
can potentially block boot for 30 mins (default standard retry configuration).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wary about creating a launch time regression for anyone using a private VPC with no access to the EC2 API. If this call timed out previously and the cloud provider just used the system hostname, then it may not have been noticed.
From looking at the legacy cloud provider code, I believe the call would have hanged for longer than 5 mins while blocking kubelet
from progressing. Every AWS SDK implementation has a different default request timeout value it seems... but most are between 100 seconds to 240 seconds. Which across 3 attempts would put it over the 5 min timeout we're enforcing.
In the EKS optimized AMI, I believe the bootstrapping script is probably blocking for longer than 5 mins as well with the 10 retries (each retry is 1 mins according to aws-cli
docs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty happy with this given that we address @bcressey's comment about timeouts.
) -> Result<impl Into<aws_smithy_client::http_connector::HttpConnector>> { | ||
// Determines whether a request of a given scheme, host and port should be proxied | ||
// according to `https_proxy` and `no_proxy`. | ||
let intercept = move |scheme: Option<&str>, host: Option<&str>, _port| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit gnarly and probably deserves tests, but I respect that we are just doing a move refactor here.
Push above addresses @bcressey 's comments (other than #3033 (comment)) |
README.md
Outdated
|
||
**Important note for all Kubernetes variants:** Changing this setting at runtime (not via user-data) can cause issues with kubelet registration, as hostname is closely tied to the identity of the system for both registration and certificates/authorization purposes. | ||
|
||
Most users don't need to change this setting. If left unset, the system hostname will be used instead. The `settings.network.hostname` setting can be used to specify the value for both `kubelet` and the host. Only set this override if you intend for the `kubelet` to register with a different name than the host. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: shouldn't each sentence be its own line?
This adds the 'kubernetes.hostname-override' setting for kubelet's '--hostname-override' option.
Adds a new command for retrieving the instance's PrivateDnsName. Refactors out code for setting up proxy for the AWS API clients so it can be shared between the EKS module and EC2 module.
…s-1.26 This separates the model for aws-k8s-1.26 so that we generate the 'hostname-override' setting and set the cloud-provider to 'external' for aws-k8s-1.26 variants only.
This lets the 'settings.kubernetes.cloud-provider' setting control what gets passed to '--cloud-provider' in kubelet options
This adds a multicall binary to be used as the common entry point for all Kubernetes CIS benchmark checks. Signed-off-by: Sean McGinnis <[email protected]>
@stmcginnis kubernetes-checks 4.2.7 also needs update? |
I responded to that comment, but I'm not sure I'm understanding what you are asking. Can you elaborate? |
@stmcginnis thanks for response and confirmation! I was perhaps confused by the wording and status of the test. I realize that Manual tests are skipped by automation, but they aren't skipped by humans, and it wasn't clear to me that it was the test that was invalid, or that this invalidation was specific to bottlerocket. It looked to me like this flag was added to EKS Optimized AMI as well. https://aws.amazon.com/blogs/containers/amazon-eks-now-supports-kubernetes-version-1-26/
|
Very fair. The There's some talk of adding more documentation around these benchmarks on the Bottlerocket website. I will make sure these inconsistencies in the K8s benchmark are called out to make sure that "not valid" statement is understood. |
Issue number:
Resolves #3028
Description of changes:
Testing done:
On
aws-k8s-1.26
:Launched
aws-k8s-1.26
node into 1.26 cluster in us-west-2 and the node registers itself fine:Checking on the host, the kubelet command line arguments has
--hostname-override
as expected:Running
pluto
separately:On
aws-k8s-1.24
:The node joins successfully and the
hostname-override
setting is not generated on boot as expected:kubelet command line args don't have
--hostname-override
aws-k8s-1.26
in cluster with custom domain names for VPC:VPC has DHCP options set to:
Launched nodes with changes and the nodes join fine whereas they could not before:
On
i-0a163988779026c87
, the hostname is underetung.test
as expected:In my kubelet logs:
Migration testing
Started on the aws-k8s-1.26 x86 1.13.3 release and updated to image built with changes:
After reboot, the
hostname-override
setting exists and is set:Checked datastore:
Then I downgrade back to 1.13.3 via
signpost rollback-to-inactive
The host boots fine back into 1.13.3. The
hostname-override
setting and associated metadata are all gone as expected.Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.