cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string #410

xuzimianxzm · 2023-06-02T02:15:25Z

I think the follow configuration has a issue: the field deviceListStrategy is a array,but you provide a string. so this will be cause a issue when the init container of nvidia-device-plugin-ctr startting.

cat << EOF > /tmp/dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
EOF

xuzimianxzm · 2023-06-02T02:45:51Z

and, the another related issue is in the init container of gpu-feature-discovery-init, it requires the field of deviceListStrategy is a string, not a array.

unable to load config: unable to finalize config: unable to parse config file: error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal array into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type string

elezar · 2023-06-02T11:49:24Z

Thanks @xuzimianxzm the deviceListStrategy config option was updated to be a string late in the Device Plugin's v0.14.0 release cycle and it seems the changes were never propagated to gpu-feature-discovery. This explains the error you're seeing in your second comment.

It does also seem as if we didn't implement a custom unmarshaller for the deviceListStrategy when extending this in the device plugin.

cc @cdesiniotis

Update: I have reproduced the failure in a unit tests here https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/294 and we will work on getting a fix released.

elezar · 2023-06-02T11:51:06Z

As a workaround, could you specify the deviceListStrategy using the DEVICE_LIST_STRATEGY envvar instead?

ndacic · 2023-06-29T11:22:15Z

@elezar what do you mean? I am facing same issue. Deploying it as a daemon set with Flux, not using helm. Should I create DEVICE_LIST_STRATEGY env variable for container, set its value to envvar and exclude deviceListStrategy: "envvar" from config map

alekc · 2023-07-31T15:15:08Z

@ndacic this is how I solved it for me:

nodeSelector:
  nvidia.com/gpu.present: "true"
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: "none"
        failOnInitError: true
        nvidiaDriverRoot: "/"
        plugin:
          passDeviceSpecs: false
          deviceListStrategy:
            - envvar
          deviceIDStrategy: uuid
      sharing:
        timeSlicing:
          renameByDefault: false
          resources:
          - name: nvidia.com/gpu
            replicas: 10

elezar · 2023-08-01T10:46:24Z

This issue should be addressed in the v0.14.1 release.

@ndacic please let me know if bumping the version does not address your issue so that I can better document the workaround.

erikschul · 2023-12-02T05:28:15Z

@elezar This is still a problem with version 0.14.3.

It fails with the official example:

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 10

but it works with the example given above. Thanks @alekc

PrakChandra · 2024-05-21T14:21:11Z

@elezar
I am using version 0.15.0

I need to set replicas to 1 so that I can have full resource access of the GPU node.

My config looks like this

      version: v1
      flags:
        migStrategy: none
      sharing:
        mps:
          default_active_thread_percentage: 10
          resources:
            - name: nvidia.com/gpu
              replicas: 2

So my g4dn.2xlarge instance gives 40 SM, but the pods are to have 20 SM with replicas set to 2.

When I do install version 0.15.0 I get the following error

Could you please suggest to me how and where I can configure this replica count as 1 so that I do not get the error

klueska · 2024-05-21T14:59:27Z

It's not clear what you hope to accomplish by enabling MPS but setting its replicas to 1. If we allowed you to set replicas to 1, then you would get an MPS server started for the GPU, but only be able to connect 1 workload/pod to it (i.e. no sharing would be possible).

Can you please elaborate on exactly what your expectations are for using MPS? It sounds like maybe time-slicing is more what you are looking for. Either that, or (as I suggested before), maybe you want a way to limit the memory of each workload, but allow them all to share the same compute.

Please clarify what your expectations are. Just saying you want a way to "set replicas to 1" doesn't tell us anything, because that is a disallowed configuration for the reason mentioned above.

PrakChandra · 2024-05-21T16:13:33Z

@klueska

I provisioned an Optimised EKS GPU node g4dn.2xlarge with 1 GPU, configuration as follows

In order to have my workloads/pods get scheduled over it, I created the daemonset via helm
helm upgrade -i nvdp nvdp/nvidia-device-plugin --version=0.15.0 --namespace kube-system -f values.yaml

Output:

Logs:

My Config file is like this and I have updated this in the values.yaml file in order to get MPS sharing so that multiple workloads can be scheduled on the GPU node

  map: 
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        mps:
          resources:
            - name: nvidia.com/gpu
              replicas: 2

===========================================================================

Issue:

When I set replicas: 2 I get the following output. This output is from one of the pods which is getting scheduled on the GPU node

In the above output, the multiprocessor count is 20, however, I need multiprocessor count as 40 so that the workloads can perform efficiently, otherwise with 20 it gets slow.

My expectation:

If I can set the replicas: 1 , then the multiprocessor count would become 40 and the workloads can do the processing efficiently.

I followed this doc and came to this expectation:

Ref: https://github.com/NVIDIA/k8s-device-plugin/tree/release-0.15

elezar · 2024-05-21T16:53:38Z

If you set replicas = 1 this is the same as no sharing since you will only expose a single slice that is the same as the entire GPU.

PrakChandra · 2024-05-22T05:05:14Z

@elezar @klueska
Although thing didn't work from Helm configuration I was able to figure out the solution.
I tweaked the value for CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to 100 so that my full GPU is accessible to all the pods. And it is working as expected>

Thanks

PrakChandra · 2024-05-22T05:22:25Z

@elezar I am stuck with another issue where I am not able to get the GPU metrics.

time="2024-05-22T05:10:27Z" level=info msg="Starting dcgm-exporter"
time="2024-05-22T05:10:27Z" level=info msg="DCGM successfully initialized!"
time="2024-05-22T05:10:27Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2024-05-22T05:10:27Z" level=info msg="Pipeline starting"
time="2024-05-22T05:10:27Z" level=info msg="Starting webserver"```

Am I missing something?

```apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/accelerator: gpu
      #   nvidia.com/lama: gpu
      # nodeSelector:
      #   nvidia.com/accelerator: gpu
      #   nvidia.com/lama: gpu
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #       - matchExpressions:
      #         # On discrete-GPU based systems NFD adds the following label where 10de is the NVIDIA PCI vendor ID
      #         - key: nvidia.com/accelerator
      #           operator: In
      #           values:
      #           - "gpu"
      #       - matchExpressions:
      #         # On some Tegra-based systems NFD detects the CPU vendor ID as NVIDIA
      #         - key: app
      #           operator: In
      #           values:
      #           - "AI-GPU"
      #           - "AI-GPU-LAMA"
      #       - matchExpressions:
      #         # We allow a GPU deployment to be forced by setting the following label to "true"
      #         - key: nvidia.com/lama
      #           operator: In
      #           values:
      #           - "gpu"        
      tolerations:
        - key: app
          value: AI-GPU
          effect: NoSchedule
          operator: Equal
        - key: nvidia/gpu
          operator: Exists
          effect: NoSchedule
        # - key: app
        #   value: AI-GPU-LAMA
        #   effect: NoSchedule
        #   operator: Equal          
          
        ## this matter nodeSelector to check your gpu node
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04
        ports:
        - containerPort: 9400
        securityContext:
          capabilities:
            add:
              - SYS_ADMIN```

elezar · 2024-05-22T09:01:58Z

@PrakChandra looking at your issues here, they were not related to the original post. Could you please open new issues instead of extending this thread.

PrakChandra · 2024-05-22T10:12:10Z

Sure. Thanks @elezar

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string #410

cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string #410

xuzimianxzm commented Jun 2, 2023

xuzimianxzm commented Jun 2, 2023

elezar commented Jun 2, 2023 •

edited

Loading

elezar commented Jun 2, 2023

ndacic commented Jun 29, 2023

alekc commented Jul 31, 2023

elezar commented Aug 1, 2023

erikschul commented Dec 2, 2023

PrakChandra commented May 21, 2024

klueska commented May 21, 2024

PrakChandra commented May 21, 2024 •

edited

Loading

elezar commented May 21, 2024

PrakChandra commented May 22, 2024

PrakChandra commented May 22, 2024

elezar commented May 22, 2024

PrakChandra commented May 22, 2024

cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string #410

cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string #410

Comments

xuzimianxzm commented Jun 2, 2023

xuzimianxzm commented Jun 2, 2023

elezar commented Jun 2, 2023 • edited Loading

elezar commented Jun 2, 2023

ndacic commented Jun 29, 2023

alekc commented Jul 31, 2023

elezar commented Aug 1, 2023

erikschul commented Dec 2, 2023

PrakChandra commented May 21, 2024

klueska commented May 21, 2024

PrakChandra commented May 21, 2024 • edited Loading

Issue:

My expectation:

elezar commented May 21, 2024

PrakChandra commented May 22, 2024

PrakChandra commented May 22, 2024

elezar commented May 22, 2024

PrakChandra commented May 22, 2024

elezar commented Jun 2, 2023 •

edited

Loading

PrakChandra commented May 21, 2024 •

edited

Loading