
[feature] Dynamic MIG partitioner #361

Open

elgalu opened this issue Jun 24, 2022 · 7 comments

Comments

@elgalu

elgalu commented Jun 24, 2022

Would it be possible for the GPU Operator to make the MIG setup transparent, so that end users can directly request per-pod GPU memory on demand while, under the hood, the MIG configuration is dynamically re-partitioned? I.e. without any intervention from a sysadmin / DevOps team.

# requesting eight 40Gi MIG slices
resources:
    limits:
        nvidia.com/gpu: 8
        nvidia.com/gpu_memory: "40Gi"

https://www.nvidia.com/en-us/technologies/multi-instance-gpu/

MIG instances can also be dynamically reconfigured
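
For contrast, this is roughly how a slice is requested today with a static MIG layout (device plugin in "mixed" strategy, profiles pre-created by an admin; exact profile names vary by GPU model):

# Today's static approach, for contrast: an admin pre-partitions the
# GPUs ahead of time, and each MIG profile is exposed under a fixed
# resource name that pods must request explicitly.
resources:
    limits:
        nvidia.com/mig-3g.40gb: 1  # one pre-created 3g.40gb slice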

@klueska
Contributor

klueska commented Jun 24, 2022

Please see this document on why this is not feasible under the current Kubernetes resource model:
Challenges Supporting Multi-Instance GPUs (MIG) in Kubernetes

Once the following newly accepted Kubernetes Enhancement proposal gets implemented, we will be able to build a device plugin that properly supports what you suggest:
kubernetes/enhancements#3064
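
To give a rough flavor of the direction (everything below is illustrative: gpu.example.com, GpuClaimParameters, and the field names are placeholders for whatever a vendor driver ends up defining once the KEP is implemented), a DRA-based plugin could let a user attach a claim like this to their pod:

# Illustrative sketch only -- gpu.example.com and GpuClaimParameters are
# hypothetical stand-ins for a vendor-defined CRD, referenced from a
# ResourceClaim in the resource.k8s.io API group proposed by KEP-3064.
apiVersion: gpu.example.com/v1
kind: GpuClaimParameters
metadata:
    name: mig-40gi
spec:
    memoryGi: 40          # the driver carves a 40Gi MIG slice on demand
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
    name: gpu-slice
spec:
    resourceClassName: gpu.example.com
    parametersRef:
        apiGroup: gpu.example.com
        kind: GpuClaimParameters
        name: mig-40gi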

@elgalu
Author

elgalu commented Aug 2, 2022

I updated the example to emphasise the capability to perform MIG-sliced multi-GPU training, e.g. by requesting eight 40Gi MIG slices (from different GPU cards on the same DGX). AFAIK this is currently not possible, not even with a static MIG layout.

@klueska
Contributor

klueska commented Aug 3, 2022

Once we have Dynamic Resource Allocation, all of what you propose will be possible. We do not plan to "hack" this support onto the existing plugin; instead, we will be putting all of our effort into supporting an API like this in the new plugin for DRA.

@omer-dayan

I agree with @klueska that DRA is the right way.
However, @elgalu, I do not agree that it's not feasible under the current circumstances.
You're welcome to watch the following video (https://www.youtube.com/watch?v=zk7g3FbW7go), which shows it has been achieved in Kubernetes.

@sshukun

sshukun commented Jan 22, 2023

Hi @klueska, I cannot wait to try this new DRA feature, but after reading the KEP I have some concerns about how the resource driver will be implemented.

What I want is not only dynamic MIG configuration but also dynamically allocating network-attached GPUs.

In my understanding, a Resource Driver needs to define its own ResourceClaimParameter CRD, allocate and configure the devices, and interact with the kubelet to prepare devices for containers. Most of this work is device-specific and should be handled by the device vendor, I believe, but allocation seems different and more complicated when the devices are dynamically attached over the network.
How could the resource driver determine which device it should attach, and how should it interact with the infrastructure to attach the device? My infrastructure is built with Liqid fabric switches connected to NVIDIA GPUs and bare-metal servers. A machine can be created and reprogrammed using the Liqid management software. In this case, do I need to write some component myself to receive requests from the Resource Driver and interact with Liqid?
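
To make the question concrete, here is a purely hypothetical sketch of what I imagine the claim parameters might need to carry for a fabric-attached GPU (all names are made up):

# Hypothetical sketch: a site-specific parameters CRD that a custom
# resource driver could consume, triggering a Liqid fabric attach during
# allocation, before the device is prepared for the container.
apiVersion: fabric.example.com/v1
kind: FabricGpuClaimParameters
metadata:
    name: a100-over-fabric
spec:
    gpuModel: A100
    count: 1
    fabric:
        provider: liqid              # the component I would have to write myself?
        attachTimeoutSeconds: 120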

Could you tell me NVIDIA's thoughts on how to implement a Resource Driver and how to support dynamically attached devices?

@zeryx

zeryx commented May 6, 2024

> Please see this document on why this is not feasible under the current Kubernetes resource model: Challenges Supporting Multi-Instance GPUs (MIG) in Kubernetes
>
> Once the following newly accepted Kubernetes Enhancement proposal gets implemented, we will be able to build a device plugin that properly supports what you suggest: kubernetes/enhancements#3064

Looks like kubernetes/enhancements#3064 has merged! Any thoughts on this ask?

@likku123

Right now I am using https://github.com/nebuly-ai/nos for dynamic GPU partitioning. It solves the purpose for now, but I'm facing issues when using it with Karpenter, and these days the group is not active enough to keep contributing to the solution.
Hoping NVIDIA will come up with a plugin to address this requirement.
