Support in EKS [Help] #145
The GPU Share Scheduler Extender needs to change the scheduler configuration, and the "gpushare-installer*" pods are what apply that change; the schedulers are usually hosted on the master nodes. The reason the pods are Pending is that no master node was found in this cluster. You can use the following command to make sure:
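For example (a minimal check, assuming kubectl is already pointed at the cluster):

kubectl get nodes
# look at the ROLES column for "master" (or "control-plane") entries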
In a self-managed cluster you will see nodes whose role is "master". If you do not find any master nodes, the master nodes of the cluster are probably hosted in another cluster; in Alibaba Cloud we call a cluster whose master nodes are hosted in another cluster a Managed Kubernetes Cluster. You can get help from EKS and ask them how to enable a scheduler extender configuration for the scheduler.
@M-A-N-I-S-H-K, did you manage to solve it? I have the same issue with AKS.
Do the pods really need access to the master nodes? It looks like a scheduler extender should work on EKS without access to a master node; this project does it: https://github.com/marccampbell/graviton-scheduler-extender
The scheduler can be deployed as a separate scheduler instead of modifying the default scheduler as done in https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/config/kube-scheduler.yaml#L18. Instead of adding the config file to the master node, specify the scheduler configuration using a ConfigMap.
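A sketch of what such a ConfigMap could look like on clusters older than v1.23, where the Policy API is still supported (the ConfigMap name and the extender URL are placeholders and assume the stock gpushare-schd-extender service in kube-system on port 12345):

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpushare-scheduler-policy
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "http://gpushare-schd-extender.kube-system:12345/gpushare-scheduler",
          "filterVerb": "filter",
          "bindVerb": "bind",
          "enableHttps": false,
          "nodeCacheCapable": true,
          "managedResources": [
            { "name": "aliyun.com/gpu-mem", "ignoredByScheduler": false }
          ]
        }
      ]
    }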
Mount the config map using volumes and deploy the new scheduler
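A rough sketch of that scheduler Deployment, again for pre-v1.23 clusters (the image tag, scheduler name, and service account are illustrative; the service account needs RBAC equivalent to system:kube-scheduler, and the legacy policy flags used here were removed in v1.23):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpushare-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpushare-scheduler
  template:
    metadata:
      labels:
        app: gpushare-scheduler
    spec:
      serviceAccountName: gpushare-scheduler        # needs permissions equivalent to system:kube-scheduler
      containers:
        - name: kube-scheduler
          image: k8s.gcr.io/kube-scheduler:v1.22.9  # match your (pre-v1.23) cluster version
          command:
            - kube-scheduler
            - --scheduler-name=gpushare-scheduler
            - --policy-config-file=/etc/kubernetes/policy.cfg
            - --use-legacy-policy-config=true
            - --leader-elect=false
          volumeMounts:
            - name: scheduler-policy
              mountPath: /etc/kubernetes
      volumes:
        - name: scheduler-policy
          configMap:
            name: gpushare-scheduler-policy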
Finally, specify the new scheduler in the pod manifest
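For example (the pod name and image are placeholders; only schedulerName and the aliyun.com/gpu-mem limit matter here):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-test
spec:
  schedulerName: gpushare-scheduler   # must match the name of the newly deployed scheduler
  containers:
    - name: cuda
      image: nvidia/cuda:10.0-base
      command: ["sleep", "infinity"]
      resources:
        limits:
          aliyun.com/gpu-mem: 2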
Hi @animesh-agarwal, thank you very much for your reply and suggestion, it helps a lot.
@2811299
Please note that the scheduler will be created inside the kube-system namespace. Please follow this to understand how to use the newly deployed scheduler in your pods.
@fernandocamargoai Hi, does this method work for you on AKS?
I'm not actively working on that project anymore, but I sent them the link to this issue for them to try it in the future. When they try it and let me know, I'll comment here.
Is anyone able to verify that animesh's method works for AKS?
Confirmed this works in EKS.
Hello @mm-e1, I tried using the piece of YAML you mentioned on EKS, with the default plugin deployed, without any success: using the custom scheduler within my pods, they would end up in a Pending state forever. Could you give me a bit more detail on how you proceeded with the installation? Thanks
@mariusehr1 The scheduler extender worked for me. Did you prep your nodes correctly, by labeling them with gpushare=true?
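For reference, that is the node-labeling step from the project's installation guide:

kubectl label node <target_node> gpushare=true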
What is the name of the scheduler created, @animesh-agarwal? Is there anywhere else, besides the pod manifest, where I need to mention it? My pods stay in a Pending state and do not come up when I set schedulerName: gpushare-schd-extender
@animesh-agarwal Since Kubernetes v1.24, scheduling policies have been removed and are no longer supported; scheduler configurations should be used instead. Hence the configuration you provided is not working. Can you please help me out in setting this up on Kubernetes v1.23+? I have tried using the new KubeSchedulerConfiguration by editing the ConfigMap. The image has changed as well, and the pods do not come up. Any help would be appreciated.
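For reference, the old Policy-based extender config maps onto a KubeSchedulerConfiguration roughly like this (a sketch; the extender URL assumes the gpushare-schd-extender service in kube-system on port 12345):

apiVersion: kubescheduler.config.k8s.io/v1beta3   # kubescheduler.config.k8s.io/v1 on v1.25+
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpushare-scheduler
extenders:
  - urlPrefix: "http://gpushare-schd-extender.kube-system:12345/gpushare-scheduler"
    filterVerb: filter
    bindVerb: bind
    enableHTTPS: false
    nodeCacheCapable: true
    managedResources:
      - name: aliyun.com/gpu-mem
        ignoredByScheduler: false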
Hi! I have successfully deployed gpushare-scheduler-extender on Kubernetes v1.23 or above in EKS. I have published the detailed steps here, hoping they will be helpful to you!

1. Deploy the GPU share scheduler extender

kubectl create -f https://gist.githubusercontent.com/YuuinIH/71b025b7e63291e6a7d5f3cc43e76805/raw/a1e530e03cc985891a33e8fc2ed2f26307061b0b/gpushare-schd-extender.yaml

2. Deploy the GPU share scheduler

kubectl create -f https://gist.githubusercontent.com/YuuinIH/71b025b7e63291e6a7d5f3cc43e76805/raw/2c5d874b6061e0497274779ab59ac2c240c4817a/gpushare-scheduler.yaml

3. Update the system:kube-scheduler cluster role to enable scheduler leader election

kubectl edit clusterrole system:kube-scheduler

4. Deploy the device plugins (this part is the same as the official guide)

kubectl delete ds -n kube-system nvidia-device-plugin-daemonset
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml

5. After that, run a pod that specifies the scheduler.

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-share-sample
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: gpu-share-sample
    spec:
      schedulerName: gpushare-scheduler # important!!!!!
      containers:
        - name: gpu-share-sample
          image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
          command:
            - python
            - tensorflow-sample-code/tfjob/docker/mnist/main.py
            - --max_steps=100000
            - --data_dir=tensorflow-sample-code/data
          resources:
            limits:
              aliyun.com/gpu-mem: 3
          workingDir: /root
      restartPolicy: Never

Then, run the inspector to show the GPU memory:

❯ kubectl inspect cgpu
NAME                                                IPADDRESS       GPU0(Allocated/Total)  GPU Memory(GiB)
ip-192-168-80-151.cn-northwest-1.compute.internal   192.168.80.151  0/15                   0/15
ip-192-168-87-86.cn-northwest-1.compute.internal    192.168.87.86   3/15                   3/15
-----------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
3/30 (10%)

❯ kubectl logs gpu-share-sample-vrpsj --tail 1
2023-03-23 09:51:02.301985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
Hi,
I tried to set this up in my EKS cluster, but I am observing that the pods are in a Pending state and not running as expected.

Describing the pod gpushare-installer-5s56q
Describing the pod gpushare-schd-extender-846977f446-s9bxh

As per the documentation, and going through the files ./templates/schd-config-job.yaml and ./templates/gpushare-extender-deployment.yaml, I need to set up a label node-role.kubernetes.io/master: "" for the node selector. Also, this step-by-step guide https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md asks me to update the kube-scheduler configuration.

On EKS, I am not sure where/how I can configure this, or on which nodes I should update this configuration. Guidance will be much appreciated.