KEP-3063: dynamic resource allocation #3064

Merged
47 changes: 43 additions & 4 deletions keps/sig-node/3063-dynamic-resource-allocation/README.md
@@ -84,6 +84,7 @@ SIG Architecture for cross-cutting KEPs).
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Feature not used](#feature-not-used)
 - [Compromised node](#compromised-node)
+- [Compromised resource driver plugin](#compromised-resource-driver-plugin)
 - [User permissions and quotas](#user-permissions-and-quotas)
 - [Usability](#usability)
 - [Design Details](#design-details)
@@ -250,10 +251,23 @@ limitations of the current approach for the following use cases:
 containers should be able to use other free resources on the same
 device.

-*Limitation*: Current implementation of the device plugin doesn’t
-allow one to allocate part of the device because parameters are too limited
-and Kubernetes doesn't have enough information about the extended
-resources on a node to decide whether they can be shared.
+*Limitation*: For example, newer generations of NVIDIA GPUs have a mode of
+operation called MIG that allows them to be subdivided into a set of
+mini-GPUs (called MIG devices) with varying amounts of memory and compute
+resources provided by each. From a hardware standpoint, configuring a GPU
+into a set of MIG devices is highly dynamic, and creating a MIG device
+tailored to the resource needs of a particular application is well
+supported. However, with the current device plugin API, the only way to make
+use of this feature is to pre-partition a GPU into a set of MIG devices and
+advertise them to the kubelet in the same way a full / static GPU is
+advertised. The user must then pick from this set of pre-partitioned MIG
+devices instead of having one created for them on the fly based on their
+particular resource constraints. Without the ability to create MIG devices
+dynamically (i.e. at the time they are requested), the set of pre-defined MIG
+devices must be carefully tuned to ensure that GPU resources do not go unused
+because some of the pre-partitioned devices are in low demand. It also puts
+the burden on the user to pick a particular MIG device type, rather than
+declaring the resource constraints more abstractly.

Nit: Picking a MIG device type (a name for a pre-configured set of resources) would be the more abstract option. I think what we want (and get) from this KEP is the ability to declare our resource constraints more "explicitly".

@klueska (Contributor), Jun 23, 2022

I guess it depends on how you look at it. By “more abstractly” I meant you might only care about making sure you get at least 5GB of memory on your GPU and then the plugin can either create a MIG device that satisfies that for you, or give you a full GPU if that’s all that’s available and it’s OK with that. With pre-defined MIG devices a user has to explicitly request that device type rather than providing a more abstract set of resource constraints that the plugin can work with to find the best GPU for the job.

@TBBle, Jun 23, 2022

That's fair. I read the case we're trying to enable not as "choosing the best GPU", but as subdividing a GPU to match your explicit resource constraints, e.g., "at least 5GB", on demand, vs. having to specify only a pre-defined GPU division (I thought that's what "MIG device type" meant here), which probably matches the container's requirements but is abstracted behind a name.

Contributor

True. Looking at just the "resource sharing" / subdivision aspect, what you say makes sense. I ended up conflating 2 different advantages of the new proposal into 1, i.e. better support for "resource sharing" and better support for matching workloads to the "right sized" resources based on an abstract set of constraints.
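
To make the two readings in this thread concrete, here is a minimal Python sketch of constraint-based allocation: a request states only "at least N GB", and a driver either carves a right-sized partition on demand or, if permitted, falls back to a whole GPU. Everything here (`Gpu`, `Allocation`, `allocate`, the `partitionable` flag) is hypothetical and invented for illustration; it is not the KEP's API or NVIDIA's, and it ignores that real MIG devices come in a fixed set of profile sizes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gpu:
    name: str
    total_memory_gb: int
    free_memory_gb: int   # memory not yet assigned to any partition
    partitionable: bool   # hypothetical flag: can be subdivided on demand


@dataclass(frozen=True)
class Allocation:
    gpu: str
    memory_gb: int
    kind: str             # "partition" or "full-gpu"


def allocate(gpus: List[Gpu], min_memory_gb: int,
             allow_full_gpu: bool = True) -> Optional[Allocation]:
    """Satisfy an abstract 'at least min_memory_gb' request."""
    # 1. Prefer creating a right-sized partition on the partitionable GPU
    #    with the least free memory that still fits (best fit).
    fits = [g for g in gpus
            if g.partitionable and g.free_memory_gb >= min_memory_gb]
    if fits:
        best = min(fits, key=lambda g: g.free_memory_gb)
        return Allocation(best.name, min_memory_gb, "partition")

    # 2. Otherwise fall back to the smallest unused, non-partitionable GPU
    #    that is large enough, if the request allows a whole GPU.
    if allow_full_gpu:
        whole = [g for g in gpus
                 if not g.partitionable
                 and g.free_memory_gb == g.total_memory_gb
                 and g.total_memory_gb >= min_memory_gb]
        if whole:
            best = min(whole, key=lambda g: g.total_memory_gb)
            return Allocation(best.name, best.total_memory_gb, "full-gpu")

    return None


gpus = [
    Gpu("gpu0", total_memory_gb=40, free_memory_gb=40, partitionable=True),
    Gpu("gpu1", total_memory_gb=40, free_memory_gb=10, partitionable=True),
    Gpu("gpu2", total_memory_gb=16, free_memory_gb=16, partitionable=False),
]
print(allocate(gpus, 5))
```

With pre-partitioned devices, the user would instead have to name one of a fixed set of device types up front; in this sketch the size is derived from the request, which covers both the "subdivide on demand" and the "right-sized resource" advantages mentioned above.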


 - *Optional allocation*: When deploying a workload I’d like to specify
 soft (optional) device requirements. If a device exists and it’s
@@ -563,6 +577,31 @@ driver vendor. Solutions like Akri which establish their own control plane and
 then communicate with Kubernetes through the device plugin API already need to
 address this.

+#### Compromised resource driver plugin
+
+This is the result of an attack against the resource driver, either from a
+container which uses a resource exposed by the driver, a compromised kubelet
+which interacts with the plugin, or through a successful attack against the
+node which led to root access.
+
+The resource driver plugin only needs read access to objects described in this
+KEP, so compromising it does not interfere with dynamic resource allocation for
+other drivers. It may need write access for [CRDs that communicate or
+coordinate resource
+availability](#implementing-a-plugin-for-node-resources). This could be used to
+attack scheduling involving the driver as outlined in the previous section.
+
+A resource driver may need root access on the node to manage
+hardware. Attacking the driver therefore may lead to root privilege
+escalation. Ideally, driver authors should try to avoid depending on root
+permissions and instead use capabilities or special permissions for the kernel
+APIs that they depend on.
+
+A resource driver may also need privileged access to remote services to manage
+network-attached devices. Resource driver vendors and cluster administrators
+have to consider what the effect of a compromise could be for that and how such
+privileges could get revoked.
+
 #### User permissions and quotas

 Similar to generic ephemeral inline volumes, the [ephemeral resource use
7 changes: 4 additions & 3 deletions keps/sig-node/3063-dynamic-resource-allocation/kep.yaml
@@ -5,7 +5,8 @@ authors:
 owning-sig: sig-node
 participating-sigs:
 - sig-scheduling
-status: provisional
+- sig-autoscaling
+status: implementable
 creation-date: 2021-05-17
 reviewers:
 - "@ahg-g"
ahg-g marked this conversation as resolved.
@@ -28,8 +29,8 @@ latest-milestone: "v1.25"
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
 alpha: "v1.25"
-beta: "v1.27"
-stable: "v1.29"
+beta: "v1.28"
+stable: "v1.30"

 feature-gates:
 - name: DynamicResourceAllocation