
feat: Discover Instance Type Memory Capacity #7004

Merged
35 commits merged from memory-overhead-controller into aws:main on Oct 21, 2024

Conversation

@jukie (Contributor) commented Sep 12, 2024

Fixes #5161

Description
This adds a new controller that manages a cache on the instancetype provider. The cache stores the memory capacity overhead for each instance type, derived from the actual value observed after a Kubernetes Node registers with the cluster.
NewInstanceType() then consults this cache when calculating memory capacity. If a cached value doesn't exist, it falls back to the existing vmMemoryOverheadPercent logic.

The cache is keyed by the combination of instance type name and a hash of nodeClass.Status.AMIs, so it should always be accurate; when an AMI update is triggered, the key changes and the existing vmMemoryOverheadPercent calculation is used again, ensuring safe instance creation every time.
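
To make the fallback concrete, here is a minimal Go sketch of the lookup described above. It assumes a go-cache backed cache; the function name, key format, and parameters are hypothetical illustrations, not the PR's actual code:

package instancetype

import (
	"fmt"

	"github.com/patrickmn/go-cache"
	"k8s.io/apimachinery/pkg/api/resource"
)

// memoryCapacity returns the memory to advertise for an instance type.
// If the discovery controller has cached a value measured from a registered
// Node (keyed by instance type name plus a hash of nodeClass.Status.AMIs),
// that value is used; otherwise fall back to the vmMemoryOverheadPercent estimate.
func memoryCapacity(c *cache.Cache, instanceTypeName, amiHash string, ec2MemoryMiB int64, vmMemoryOverheadPercent float64) resource.Quantity {
	key := fmt.Sprintf("%s-%s", instanceTypeName, amiHash)
	if v, ok := c.Get(key); ok {
		return v.(resource.Quantity) // capacity actually observed on a Node with this AMI set
	}
	// Existing estimate: subtract a fixed percentage of the EC2-reported memory.
	overheadMiB := int64(float64(ec2MemoryMiB) * vmMemoryOverheadPercent)
	return *resource.NewQuantity((ec2MemoryMiB-overheadMiB)*1024*1024, resource.BinarySI)
}

A cached value only exists for an (instance type, AMI hash) pair that has actually been observed on a registered Node, which is why a new AMI hash naturally drops back to the percentage-based estimate.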

How was this change tested?
Suite tests and live environment

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened:
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jukie jukie requested a review from a team as a code owner September 12, 2024 22:06
@jukie changed the title from "WIP: Add memory overhead tracking per nodeclass and ami family" to "Draft: Add memory overhead tracking per nodeclass and ami family" Sep 12, 2024

netlify bot commented Sep 12, 2024

Deploy Preview for karpenter-docs-prod ready!

Latest commit: 90d9c55
Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/67107692a7a1e80008f75dd1
Deploy Preview: https://deploy-preview-7004--karpenter-docs-prod.netlify.app

@jukie changed the title from "Draft: Add memory overhead tracking per nodeclass and ami family" to "Draft: feat: Discover Instance Type Capacity Memory Overhead" Sep 12, 2024
@jukie changed the title from "Draft: feat: Discover Instance Type Capacity Memory Overhead" to "[DRAFT]: feat: Discover Instance Type Capacity Memory Overhead" Sep 12, 2024
@jukie marked this pull request as draft September 12, 2024 23:53
@jukie changed the title from "[DRAFT]: feat: Discover Instance Type Capacity Memory Overhead" to "[DRAFT] feat: Discover Instance Type Capacity Memory Overhead" Sep 12, 2024
@jukie changed the title from "[DRAFT] feat: Discover Instance Type Capacity Memory Overhead" to "feat: Discover Instance Type Capacity Memory Overhead" Sep 14, 2024
@jukie jukie marked this pull request as ready for review September 14, 2024 08:30
@njtran (Contributor) left a comment:

Thanks for the PR! Looks really good!

pkg/providers/instancetype/instancetype.go (3 review threads, outdated, resolved)
@jukie (Contributor, Author) commented Sep 26, 2024

CC @jmdeal (I think you were the one on the call today?). There's a negligible amount of variance when measuring overhead in Ki, but I was unable to find any variance across instances of the same instance type when measuring in Mi, which is what this PR does (see the sketch after the numbers below).

Here's a simple script that can be used to verify: https://gist.github.com/jukie/df045af8fec68941f5d119044bf04aee

Instance Type: m6i.24xlarge
Variance detected in memory capacities:
 - 389990024
 - 389990016
 - 389990004
 - 389989996
 - 389990000
 - 389990004
 - 389989988
 - 389990000
 - 389990012
 - 389990012
 - 389989988
 - 389989980
 - 389990020
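
Assuming the numbers above are in Ki (as the comment suggests), a tiny illustrative Go snippet (separate from the linked gist) shows why the jitter disappears at Mi granularity: every reported value truncates to the same Mi figure.

package main

import "fmt"

func main() {
	// Observed capacities for the same instance type differ by a few Ki per node...
	capacitiesKi := []int64{389990024, 389990016, 389990004, 389989996, 389989988, 389989980}
	// ...but they all truncate to the same Mi value (1 Mi = 1024 Ki).
	for _, ki := range capacitiesKi {
		fmt.Printf("%d Ki -> %d Mi\n", ki, ki/1024)
	}
}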

@jukie jukie requested a review from njtran September 26, 2024 22:11
pkg/cache/cache.go (review thread, outdated, resolved)
pkg/providers/instancetype/instancetype.go (2 review threads, outdated, resolved)
@jmdeal jmdeal self-assigned this Oct 10, 2024
@jmdeal (Contributor) left a comment:

/karpenter snapshot

A contributor replied:

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-7194ce607171bb52280313c434a574b616877c85.
To install, you must log in to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-7194ce607171bb52280313c434a574b616877c85" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

@jukie jukie requested a review from jmdeal October 13, 2024 20:22
@jukie (Contributor, Author) commented Oct 14, 2024

@jmdeal, could you take another look please?

@coveralls commented Oct 15, 2024

Pull Request Test Coverage Report for Build 11377237253

Details

  • 58 of 83 (69.88%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.2%) to 82.883%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pkg/controllers/controllers.go | 0 | 1 | 0.0%
pkg/operator/operator.go | 0 | 1 | 0.0%
pkg/providers/instancetype/instancetype.go | 43 | 49 | 87.76%
pkg/controllers/providers/instancetype/capacity/controller.go | 10 | 27 | 37.04%
Totals Coverage Status
Change from base Build 11372026312: -0.2%
Covered Lines: 5641
Relevant Lines: 6806

💛 - Coveralls

@jukie (Contributor, Author) commented Oct 15, 2024

Fixed the conflicts

@jukie (Contributor, Author) commented Oct 16, 2024

Any other changes needed here @jmdeal?

@jmdeal (Contributor) left a comment:

One nit, and I'm going to let E2Es run. Otherwise LGTM.

/karpenter snapshot

@jmdeal (Contributor) left a comment:

/karpenter snapshot

A contributor replied:

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-7548492412f9ec41de90fd048119ccc0ba7d0141.
To install, you must log in to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-7548492412f9ec41de90fd048119ccc0ba7d0141" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

@jukie (Contributor, Author) commented Oct 17, 2024

I think the E2Es need a retry after committing your suggestion, but the prior run appeared to pass.

@jmdeal (Contributor) left a comment:

No need to rerun; the only changes since they last ran were the comment nit and some docs changes you merged in. This LGTM 🚀.

@jmdeal jmdeal merged commit f9b3292 into aws:main Oct 21, 2024
17 checks passed
@jukie jukie deleted the memory-overhead-controller branch October 21, 2024 17:24
@jukie (Contributor, Author) commented Nov 1, 2024

Hi @jmdeal, I saw 1.0.7 was released but this wasn't included. Will this make it into the next release?

@jmdeal (Contributor) commented Nov 1, 2024

We weren't planning on including this in a patch release; those are typically reserved for critical bug and security fixes. The plan is to include it in the next minor version release, v1.1.0.

@jukie (Contributor, Author) commented Nov 2, 2024

Thanks!

Successfully merging this pull request may close these issues.

Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent