[DRAFT] Nvidia Settings API changes #4125

Closed
wants to merge 1 commit into from

@monirul monirul commented Aug 3, 2024

Issue number:

Closes #

Description of changes:
This PR introduces new settings APIs for NVIDIA GPUs on the Kubernetes NVIDIA variants.

The new settings are:

| Bottlerocket Setting | Impact | Value |
| --- | --- | --- |
| `settings.nvidia-container-runtime.visible-devices-as-volume-mounts` | Sets the `accept-nvidia-visible-devices-as-volume-mounts` value of the container toolkit on k8s variants | `true` or `false`; default: `true` |
| `settings.nvidia-container-runtime.visible-devices-envvar-when-unprivileged` | Sets the `accept-nvidia-visible-devices-envvar-when-unprivileged` setting of the NVIDIA container runtime on k8s variants | `true` or `false`; default: `false` |
| `settings.kubernetes.device-plugins.nvidia.pass-device-specs` | Sets the device plugin's `pass-device-specs` setting, which passes the list of DeviceSpecs to the kubelet on Allocate | `true` or `false`; default: `true` |
| `settings.kubernetes.device-plugins.nvidia.device-id-strategy` | Sets the device plugin's `device-id-strategy` setting, which specifies how GPUs are identified and selected for workloads running in a Kubernetes cluster | `uuid` or `index`; default: `index` |
| `settings.kubernetes.device-plugins.nvidia.device-list-strategy` | Sets the device plugin's `device-list-strategy` setting, which configures how GPUs are listed and allocated to pods in a Kubernetes cluster | `envvar` or `volume-mounts`; default: `volume-mounts` |
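
For readers who want to see the shape of these settings in code, here is a minimal sketch of how the values and defaults in the table could be modeled in Rust. The type and field names (`NvidiaContainerRuntimeSettings`, `NvidiaDevicePluginSettings`, and so on) are hypothetical and chosen only for illustration; this is not the model code from this PR, and it uses only the standard library rather than any settings SDK.

```rust
use std::str::FromStr;

/// Accepted values for `device-id-strategy` (from the table: `uuid` or `index`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceIdStrategy {
    Uuid,
    Index,
}

/// Accepted values for `device-list-strategy` (`envvar` or `volume-mounts`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceListStrategy {
    Envvar,
    VolumeMounts,
}

impl FromStr for DeviceIdStrategy {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "uuid" => Ok(Self::Uuid),
            "index" => Ok(Self::Index),
            other => Err(format!("invalid device-id-strategy: {other}")),
        }
    }
}

impl FromStr for DeviceListStrategy {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "envvar" => Ok(Self::Envvar),
            "volume-mounts" => Ok(Self::VolumeMounts),
            other => Err(format!("invalid device-list-strategy: {other}")),
        }
    }
}

/// Hypothetical model of `settings.nvidia-container-runtime.*`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NvidiaContainerRuntimeSettings {
    pub visible_devices_as_volume_mounts: bool,
    pub visible_devices_envvar_when_unprivileged: bool,
}

impl Default for NvidiaContainerRuntimeSettings {
    // Defaults taken from the table above.
    fn default() -> Self {
        Self {
            visible_devices_as_volume_mounts: true,
            visible_devices_envvar_when_unprivileged: false,
        }
    }
}

/// Hypothetical model of `settings.kubernetes.device-plugins.nvidia.*`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NvidiaDevicePluginSettings {
    pub pass_device_specs: bool,
    pub device_id_strategy: DeviceIdStrategy,
    pub device_list_strategy: DeviceListStrategy,
}

impl Default for NvidiaDevicePluginSettings {
    // Defaults taken from the table above.
    fn default() -> Self {
        Self {
            pass_device_specs: true,
            device_id_strategy: DeviceIdStrategy::Index,
            device_list_strategy: DeviceListStrategy::VolumeMounts,
        }
    }
}
```

Modeling the two strategy settings as enums, rather than free-form strings, means a value outside the documented set is rejected when it is parsed instead of being passed through to the device plugin.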

Testing done:
Yes.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Release.toml Outdated
@@ -1,4 +1,4 @@
-version = "1.21.0"
+version = "1.21.1"

Typically we would not add new settings to the API in a point release, so these changes should target 1.22. Either way, don't do the release version bump in a feature PR: it's not really related to your feature and creates some churn.

@monirul force-pushed the nvidia-api-kit branch 5 times, most recently from 4b97b65 to 20f5ffc, on August 8, 2024 20:23
@bcressey bcressey left a comment

Looks good apart from the parts that need to be reverted or cleaned up.

It'd be good to test a non-nvidia aws-k8s variant to confirm that the device plugin settings aren't recognized, which would indicate that the feature flag wasn't used at build time.

-p settings-plugin-aws-k8s-nvidia \
%{nil}


remove one of the two newlines:

Suggested change


%description aws-k8s-nvidia
%{summary}.

avoid adding unnecessary whitespace:

Suggested change

cbgbt commented Aug 9, 2024

I spoke with @monirul yesterday about an idea to programmatically verify that feature unification has not taken place. Since we need to do a settings-sdk release for the new models anyway, I think it would be a good idea to make the requisite changes there too.

The basic idea is:

  • Add conditionally compiled const booleans for the enabled feature.
  • Statically assert in the settings models that those flags are as expected.
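
The two steps above can be sketched roughly as follows. This is only an illustration of the idea under stated assumptions: the Cargo feature name `nvidia-device-plugin` and the constant and function names are hypothetical, not the names used in the settings-sdk or in this PR.

```rust
// Step 1: a const boolean whose value depends on whether the
// (hypothetical) Cargo feature `nvidia-device-plugin` is enabled.
#[cfg(feature = "nvidia-device-plugin")]
pub const NVIDIA_DEVICE_PLUGIN_ENABLED: bool = true;

#[cfg(not(feature = "nvidia-device-plugin"))]
pub const NVIDIA_DEVICE_PLUGIN_ENABLED: bool = false;

// Step 2: a static assertion, evaluated at compile time. A settings model
// for a non-NVIDIA variant would expect the feature to be off; if feature
// unification switched it on, the build fails here instead of silently
// exposing the NVIDIA settings.
const _: () = assert!(
    !NVIDIA_DEVICE_PLUGIN_ENABLED,
    "nvidia-device-plugin feature is enabled where it is not expected"
);

/// Hypothetical helper: reports whether the device plugin settings
/// were compiled into this build.
pub fn device_plugin_settings_recognized() -> bool {
    NVIDIA_DEVICE_PLUGIN_ENABLED
}
```

An NVIDIA variant's model would assert the opposite (that the constant is `true`). Because the check is a `const` item, a mismatch is a compile error rather than something that has to be caught by testing a built image.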

@arnaldo2792

This was superseded by #4182.

4 participants