cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string #410
Also, there is a related issue in the gpu-feature-discovery-init init container: it requires the deviceListStrategy field to be a string, not an array, and fails with:

unable to load config: unable to finalize config: unable to parse config file: error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal array into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type string
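For illustration, a config of roughly this shape cannot satisfy both components at once: the device plugin parses flags.plugin.deviceListStrategy as []string, while the pre-fix gpu-feature-discovery init container parses the same field as a plain string. This is a sketch; the exact config from the report is not shown in the thread, and the values are examples only:

```yaml
version: v1
flags:
  plugin:
    # Accepted by the device plugin (field type []string), but rejected
    # by the pre-fix gpu-feature-discovery init container (expects string):
    deviceListStrategy:
      - envvar
    # The inverse form trips the error in the issue title instead:
    # deviceListStrategy: envvar
```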
Thanks @xuzimianxzm. It also seems as if we didn't implement a custom unmarshaller for this field. cc @cdesiniotis

Update: I have reproduced the failure in a unit test here: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/294 and we will work on getting a fix released.
As a workaround, could you specify the device list strategy via an environment variable instead of in the config file?
@elezar what do you mean? I am facing the same issue. I am deploying it as a DaemonSet with Flux, not using Helm. Should I create a DEVICE_LIST_STRATEGY environment variable for the container, set its value to envvar, and remove deviceListStrategy: "envvar" from the ConfigMap?
@ndacic this is how I solved it:
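A minimal sketch of that kind of workaround, assuming a hand-rolled (non-Helm) DaemonSet: remove deviceListStrategy from the ConfigMap and pass the strategy through the DEVICE_LIST_STRATEGY environment variable instead, which avoids the config-file unmarshalling entirely. The image tag and surrounding fields here are illustrative:

```yaml
# Hypothetical excerpt of the device-plugin DaemonSet pod spec.
spec:
  containers:
    - name: nvidia-device-plugin-ctr
      image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0  # example tag
      env:
        # Replaces the deviceListStrategy entry removed from the ConfigMap,
        # sidestepping the string-vs-[]string unmarshalling bug.
        - name: DEVICE_LIST_STRATEGY
          value: envvar
```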
This issue should be addressed in a newer release. @ndacic please let me know if bumping the version does not address your issue so that I can better document the workaround.
@elezar I need to set replicas to 1 so that pods get full access to the GPU node's resources. My config looks like this:
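(A sketch of the MPS sharing config being described, following the sharing schema in the release-0.15 README; the resource name and replica count are inferred from the discussion below:)

```yaml
version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 2  # splits the GPU's compute between clients (20 of 40 SMs each)
```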
So my g4dn.2xlarge instance gives 40 SMs, but the pods only get 20 SMs with replicas set to 2. Could you please suggest how and where I can configure this replica count as 1 so that I do not get an error?
It's not clear what you hope to accomplish by enabling MPS but setting its replicas to 1. If we allowed you to set replicas to 1, then you would get an MPS server started for the GPU, but only be able to connect 1 workload/pod to it (i.e. no sharing would be possible).

Can you please elaborate on exactly what your expectations are for using MPS? It sounds like maybe time-slicing is more what you are looking for. Either that, or (as I suggested before), maybe you want a way to limit the memory of each workload, but allow them all to share the same compute. Please clarify what your expectations are. Just saying you want a way to "set replicas to 1" doesn't tell us anything, because that is a disallowed configuration for the reason mentioned above.
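For reference, a sketch of the time-slicing alternative mentioned above, following the sharing schema from the plugin's README; the replica count is illustrative. Unlike MPS, time-slicing does not partition SMs, so each workload sees the full GPU while sharing it in time:

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2  # example: advertise the GPU as 2 schedulable slices
```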
I provisioned an optimised EKS GPU node (g4dn.2xlarge, 1 GPU). In order to have my workloads/pods scheduled onto it, I created the DaemonSet via Helm. I updated my config file in values.yaml in order to get MPS sharing, so that multiple workloads can be scheduled on the GPU node.
Issue: When I set replicas to 2, the multiprocessor count in the above output is 20; however, I need a multiprocessor count of 40 so that the workloads can perform efficiently, otherwise with 20 they get slow.

My expectation: if I can set the replica count to 1, the pods should see all 40 SMs. I followed this doc and came to this expectation: https://github.com/NVIDIA/k8s-device-plugin/tree/release-0.15
If you set replicas to 2 with MPS, each connected client is limited to half of the GPU's compute, which is why you see 20 of the 40 SMs. That partitioning is inherent to how the plugin configures MPS sharing; replicas of 1 would amount to no sharing at all, which is why it is disallowed.
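To make the sharing arithmetic concrete, here is a hypothetical pod spec under an MPS config with replicas: 2 (pod name and image are placeholders): two such pods can bind to the same physical GPU, and each is constrained to roughly half the SMs, i.e. 20 of 40 on a g4dn.2xlarge.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mps-client  # hypothetical name
spec:
  containers:
    - name: cuda-workload
      image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04  # example image
      resources:
        limits:
          nvidia.com/gpu: 1  # consumes one of the 2 MPS replicas on the GPU
```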
@elezar I am stuck with another issue where I am not able to get the GPU metrics.
@PrakChandra looking at your issues here, they are not related to the original post. Could you please open new issues instead of extending this thread?
Sure. Thanks @elezar
I think the following configuration has an issue: the field deviceListStrategy is an array, but you provide a string, so this causes an error when the nvidia-device-plugin-ctr init container starts.