
This plugin does not work when an IB NIC is set to LINK_TYPE_P1=ETH #98

Open
sober-wang opened this issue Jan 25, 2024 · 12 comments

@sober-wang

This plugin does not work after an IB NIC has been switched to Ethernet mode (LINK_TYPE_P1=ETH):

mlxconfig -d <DEVICE_INFO> query | grep LINK_TYPE_P1
mlxconfig -d <DEVICE_INFO> set LINK_TYPE_P1=2
reboot

(LINK_TYPE_P1=2 selects ETH mode, as the query output below confirms.)

After switching the IB NIC to Ethernet mode, I ran rdma-shared-dev-plugin in the k8s cluster. The node's Capacity and Allocatable values for the configured resource names are all 0.

NIC: Mellanox ConnectX-6.

However, a Mellanox ConnectX-6 Dx can share RDMA resources in the k8s cluster.

@adrianchiris
Collaborator

I'm not sure I understand the issue.
Can you share the device plugin's config map?

Are you changing the link type of both ports, or of a single port on the NIC?
Note that when the link type changes, the netdevice name changes as well.

@sober-wang
Author

My Mellanox NIC configuration:

root@gpu-11:~$ mst status -v 

MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf3      ###       mlx5_9          net-ens31np0              1     

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf2      ###       mlx5_8          net-ens30np0              1     

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf1      ###       mlx5_3          net-ens25np0              0     

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf0      ###       mlx5_2          net-ens24np0              0 



root@gpu-11:~# mlxconfig -d /dev/mst/mt4123_pciconf3 query | grep LINK_TYPE_P1
         LINK_TYPE_P1                                ETH(2)

Applying the rdma-shared plugin in the k8s cluster.

The rdma-devices ConfigMap:

Name:         rdma-devices
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Data
====
config.json:
----
{
    "periodicUpdateInterval": 300,
    "configList": [{
         "resourceName": "hca_shared_devices_a",
         "rdmaHcaMax": 1000,
         "devices": ["ens24np0"]
       },
       {
         "resourceName": "hca_shared_devices_b",
         "rdmaHcaMax": 1000,
         "devices": ["ens25np0"] 
       },
       {
         "resourceName": "hca_shared_devices_c",
         "rdmaHcaMax": 1000,
         "devices": ["ens30np0"] 
       },
       {
         "resourceName": "hca_shared_devices_d",
         "rdmaHcaMax": 1000,
         "devices": ["ens31np0"] 
       }
    ]
}
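As a quick sanity check, the ifNames configured above can be compared against the node's interface list (a toy sketch over sample data, not the plugin's code; on a real node the list would come from `ls /sys/class/net`):

```shell
# Toy sketch: compare the configured ifNames with the node's interfaces.
# Sample data mirrors the config.json and the mst status output above.
configured="ens24np0 ens25np0 ens30np0 ens31np0"
present="ens24np0 ens25np0 ens30np0 ens31np0"   # real node: present=$(ls /sys/class/net)
missing=""
for ifn in $configured; do
  case " $present " in
    *" $ifn "*) ;;                              # interface exists on the node
    *) missing="$missing $ifn" ;;
  esac
done
echo "missing interfaces:${missing:- none}"
```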


BinaryData
====

rdma-shared-dp-ds Daemonset

Name:           rdma-shared-dp-ds
Selector:       name=rdma-shared-dp-ds
Node-Selector:  <none>
Labels:         <none>
Annotations:    deprecated.daemonset.template.generation: 4
Desired Number of Nodes Scheduled: 70
Current Number of Nodes Scheduled: 70
Number of Nodes Scheduled with Up-to-date Pods: 70
Number of Nodes Scheduled with Available Pods: 70
Number of Nodes Misscheduled: 0
Pods Status:  70 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       name=rdma-shared-dp-ds
  Annotations:  kubectl.kubernetes.io/restartedAt: 2024-01-24T12:41:15+08:00
  Containers:
   k8s-rdma-shared-dp-ds:
    Image:        ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:1.4.0
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /dev/ from devs (rw)
      /k8s-rdma-shared-dev-plugin from config (rw)
      /var/lib/kubelet/ from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/
    HostPathType:  
   config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rdma-devices
    Optional:  false
   devs:
    Type:               HostPath (bare host directory volume)
    Path:               /dev/
    HostPathType:       
  Priority Class Name:  system-node-critical
Events:                 <none>

Result

[root@master1 ~]# kubectl describe node gpu-11
Name:               gpu-11
Roles:              <none>
Labels:             gpu=A100
                    kubernetes.io/arch=amd64
.....

Capacity:
.....
  nvidia.com/gpu:             8
  pods:                       110
  rdma/hca_shared_devices_a:  0
  rdma/hca_shared_devices_b:  0
  rdma/hca_shared_devices_c:  0
  rdma/hca_shared_devices_d:  0
Allocatable:
.....
  nvidia.com/gpu:             8
  pods:                       110
  rdma/hca_shared_devices_a:  0
  rdma/hca_shared_devices_b:  0
  rdma/hca_shared_devices_c:  0
  rdma/hca_shared_devices_d:  0

@adrianchiris
Collaborator

can you provide device plugin logs and the content of /dev/infiniband folder of the node ?

@sober-wang
Author

sober-wang commented Jan 29, 2024

/dev/infiniband

root@gpu-11:/dev/infiniband# ls -alh
total 0
drwxr-xr-x  2 root root      660 Dec 28 11:15 .
drwxr-xr-x 24 root root     5.6K Jan 18 17:26 ..
crw-------  1 root root 231,  64 Dec 28 14:10 issm0
crw-------  1 root root 231,  65 Dec 28 14:10 issm1
crw-------  1 root root 231,  66 Dec 28 14:10 issm2
crw-------  1 root root 231,  67 Dec 28 14:10 issm3
crw-------  1 root root 231,  68 Dec 28 14:10 issm4
crw-------  1 root root 231,  69 Dec 28 14:10 issm5
crw-------  1 root root 231,  70 Dec 28 14:10 issm6
crw-------  1 root root 231,  71 Dec 28 14:10 issm7
crw-------  1 root root 231,  72 Dec 28 14:10 issm8
crw-------  1 root root 231,  73 Dec 28 14:10 issm9
crw-rw-rw-  1 root root  10,  56 Dec 28 14:10 rdma_cm
crw-------  1 root root 231,   0 Dec 28 14:10 umad0
crw-------  1 root root 231,   1 Dec 28 14:10 umad1
crw-------  1 root root 231,   2 Dec 28 14:10 umad2
crw-------  1 root root 231,   3 Dec 28 14:10 umad3
crw-------  1 root root 231,   4 Dec 28 14:10 umad4
crw-------  1 root root 231,   5 Dec 28 14:10 umad5
crw-------  1 root root 231,   6 Dec 28 14:10 umad6
crw-------  1 root root 231,   7 Dec 28 14:10 umad7
crw-------  1 root root 231,   8 Dec 28 14:10 umad8
crw-------  1 root root 231,   9 Dec 28 14:10 umad9
crw-rw-rw-  1 root root 231, 192 Dec 28 14:10 uverbs0
crw-rw-rw-  1 root root 231, 193 Dec 28 14:10 uverbs1
crw-rw-rw-  1 root root 231, 194 Dec 28 14:10 uverbs2
crw-rw-rw-  1 root root 231, 195 Dec 28 14:10 uverbs3
crw-rw-rw-  1 root root 231, 196 Dec 28 14:10 uverbs4
crw-rw-rw-  1 root root 231, 197 Dec 28 14:10 uverbs5
crw-rw-rw-  1 root root 231, 198 Dec 28 14:10 uverbs6
crw-rw-rw-  1 root root 231, 199 Dec 28 14:10 uverbs7
crw-rw-rw-  1 root root 231, 200 Dec 28 14:10 uverbs8
crw-rw-rw-  1 root root 231, 201 Dec 28 14:10 uverbs9

device plugin log:

[root@master1 k8s]# kubectl -n kube-system logs rdma-shared-dp-ds-6jknp
2024/01/29 02:36:25 Starting K8s RDMA Shared Device Plugin version= master
2024/01/29 02:36:25 resource manager reading configs
Using Kubelet Plugin Registry Mode
2024/01/29 02:36:25 Reading /k8s-rdma-shared-dev-plugin/config.json
2024/01/29 02:36:25 loaded config: [{ResourceName:hca_shared_devices_a ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens24np0] LinkTypes:[]}} {ResourceName:hca_shared_devices_b ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens25np0] LinkTypes:[]}} {ResourceName:hca_shared_devices_c ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens30np0] LinkTypes:[]}} {ResourceName:hca_shared_devices_d ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens31np0] LinkTypes:[]}}]
2024/01/29 02:36:25 periodic update interval: +300
2024/01/29 02:36:25 Discovering host devices
2024/01/29 02:36:25 discovering host network devices
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.2 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.3 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 Initializing resource servers
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_a ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens24np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_b ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens25np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_c ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens30np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_d ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens31np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Starting all servers...
2024/01/29 02:36:25 starting rdma/hca_shared_devices_a device plugin endpoint at: hca_shared_devices_a.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_a device plugin endpoint started serving
2024/01/29 02:36:25 starting rdma/hca_shared_devices_b device plugin endpoint at: hca_shared_devices_b.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_b device plugin endpoint started serving
2024/01/29 02:36:25 starting rdma/hca_shared_devices_c device plugin endpoint at: hca_shared_devices_c.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_c device plugin endpoint started serving
2024/01/29 02:36:25 starting rdma/hca_shared_devices_d device plugin endpoint at: hca_shared_devices_d.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_d device plugin endpoint started serving
2024/01/29 02:36:25 All servers started.
2024/01/29 02:36:25 Listening for term signals
2024/01/29 02:36:25 Starting OS watcher.
2024/01/29 02:41:25 discovering host network devices
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.2 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.3 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_a"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_b"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_c"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_d"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:46:25 discovering host network devices
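For what it's worth, the repeated `"issm" not found` errors cover PCI functions .0 through .3, which matches the four-port Intel I350 discovered above; a NIC without RDMA support has no issm/umad/uverbs character devices, so the plugin skipping it is expected. A toy grep over one sample log line, under that assumption:

```shell
# Sample log line copied from the output above (PCI address truncated as shown).
line='error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"'
case "$line" in
  *'"issm" not found'*) verdict="expected: device has no RDMA char devices" ;;
  *)                    verdict="unexpected error" ;;
esac
echo "$verdict"
```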

@sober-wang
Author

can you provide device plugin logs and the content of /dev/infiniband folder of the node ?

I want to fix the problem. Can you show me how to modify this plugin?

@adrianchiris
Collaborator

adrianchiris commented Jan 30, 2024

From the logs, the device plugin behaves as expected.

I see that the device plugin discovered resources properly.
kubelet is not calling ListAndWatch [1]; otherwise we would have seen a log message (and the resources would then be reported on the node object).

Can you provide the contents of:
/var/lib/kubelet
/var/lib/kubelet/plugins_registry
/var/lib/kubelet/plugins
/var/lib/kubelet/device_plugins

Can you also add the YAML used to deploy the device plugin daemonset? Is it what we have in the master branch?

[1]

log.Printf("ListAndWatch called by kubelet for: %s", rs.resourceName)
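One quick way to check whether kubelet ever reached the plugin is to grep the plugin logs for that message, e.g. `kubectl -n kube-system logs <pod> | grep 'ListAndWatch called'`. Sketched here against a sample log line (as in this issue, the message is absent):

```shell
# Stand-in for real plugin logs; on the cluster, pipe kubectl logs instead.
logs='2024/01/29 02:36:25 All servers started.'
if printf '%s\n' "$logs" | grep -q 'ListAndWatch called by kubelet'; then
  seen=yes
else
  seen=no    # kubelet never dialed the plugin, matching the 0 capacity on the node
fi
echo "ListAndWatch seen: $seen"
```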

@sober-wang
Author

sober-wang commented Feb 2, 2024

From the logs, the device plugin behaves as expected.

I see that the device plugin discovered resources properly. kubelet is not calling ListAndWatch [1]; otherwise we would have seen a log message (and the resources would then be reported on the node object).

Can you provide the contents of: /var/lib/kubelet /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins /var/lib/kubelet/device_plugins

Can you also add the YAML used to deploy the device plugin daemonset? Is it what we have in the master branch?

[1]

log.Printf("ListAndWatch called by kubelet for: %s", rs.resourceName)

The kubelet --root-dir is /data/kubelet. Other plugins, such as the NVIDIA device plugin and csi-nfs, are working.

root@gpu-11:/var/lib/kubelet# tree 
.
├── config.yaml
├── device-plugins
│   ├── DEPRECATION
│   ├── device-plugins
│   ├── kubelet_internal_checkpoint
│   ├── kubelet.sock
│   ├── nvidia.sock
│   └── plugins_registry
├── kubeadm-flags.env
├── pki
│   ├── kubelet-client-2024-01-08-11-59-36.pem
│   ├── kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2024-01-08-11-59-36.pem
│   ├── kubelet.crt
│   └── kubelet.key
├── plugins_registry
│   ├── hca_shared_devices_a.sock
│   ├── hca_shared_devices_b.sock
│   ├── hca_shared_devices_c.sock
│   └── hca_shared_devices_d.sock
└── pod-resources

6 directories, 14 files

@adrianchiris
Collaborator

OK,
so is it /data/kubelet or /var/lib/kubelet?
Your tree output is of the latter, but you say the kubelet root is the former.

Did you deploy rdma-shared-device-plugin with the modified mounts as suggested in #96?
The layout looks OK; if both kubelet and the device plugin use the same paths, it should work.

Please provide some additional information on how to reproduce this (k8s version, OS, NIC hardware and its configuration).

@sober-wang
Copy link
Author

OK, so is it /data/kubelet or /var/lib/kubelet? Your tree output is of the latter, but you say the kubelet root is the former.

Did you deploy rdma-shared-device-plugin with the modified mounts as suggested in #96? The layout looks OK; if both kubelet and the device plugin use the same paths, it should work.

Please provide some additional information on how to reproduce this (k8s version, OS, NIC hardware and its configuration).

I'm showing the /var/lib/kubelet directory tree. The plugin daemonset configuration is earlier in this conversation.

My environment:
OS version: Ubuntu 20.04, kernel 5.4.0-100-generic
Kubernetes version: v1.23.0
NIC (lspci | grep Mell): Mellanox Technologies MT28908 Family [ConnectX-6]
OFED version: MLNX_OFED_LINUX-5.8-3.0.7.0-ubuntu20.04-x86_64.tgz

@adrianchiris
Collaborator

If your kubelet root dir is configured as /data/kubelet, then IMO the device plugin needs to mount that same directory.

Can you try it?

That is: mount /data/kubelet to /var/lib/kubelet in the device plugin daemonset.

Apart from that, everything looks OK to me.
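The suggestion above could look like this in the daemonset spec (a sketch only, assuming kubelet runs with `--root-dir=/data/kubelet`; since the plugin's paths under `/var/lib/kubelet` are hard-coded, the container-side path stays unchanged):

```yaml
# Sketch: back the plugin's hard-coded /var/lib/kubelet with the node's
# actual kubelet root directory.
        volumeMounts:
          - name: kubelet-root
            mountPath: /var/lib/kubelet   # path the plugin expects in-container
      volumes:
        - name: kubelet-root
          hostPath:
            path: /data/kubelet           # kubelet --root-dir on this node
```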

@sober-wang
Author

activeSockDir = "/var/lib/kubelet/plugins_registry"

Why must the default kubelet --root-dir be used?

When I used the default --root-dir it worked at first, but later the plugin was not running or there were no allocatable devices, so I decided to change the directory.

@sober-wang
Author

sober-wang commented Mar 22, 2024

New log output:

2024/03/22 05:58:38 Starting OS watcher.
2024/03/22 05:58:49 hca_3.sock failed to be registered at Kubelet: RegisterPlugin error -- plugin registration failed with err: failed to dial device plugin with socketPath /var/lib/kubelet/plugins_registry/hca_3.sock: failed to dial device plugin: context deadline exceeded; restarting.

Daemonset

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-dp-ds
  namespace: cni-plugin
spec:
  selector:
    matchLabels:
      name: rdma-shared-dp-ds
  template:
    metadata:
      labels:
        name: rdma-shared-dp-ds
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: rdma
                    operator: In
                    values:
                      - sugon
      hostNetwork: true
      priorityClassName: system-node-critical
      containers:
      - image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin
        name: k8s-rdma-shared-dp-ds
        imagePullPolicy: IfNotPresent
        #securityContext:
        #  privileged: true
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: plugins-registry
            mountPath: /var/lib/kubelet/plugins_registry
          - name: config
            mountPath: /k8s-rdma-shared-dev-plugin
          - name: devs
            mountPath: /dev/
      volumes:
        - name: device-plugin
          hostPath:
            path: /data/kubelet/device-plugins
        - name: plugins-registry
          hostPath:
            path: /data/kubelet/plugins_registry
        - name: config
          configMap:
            name: rdma-devices
            items:
            - key: config.json
              path: config.json
        - name: devs
          hostPath:
            path: /dev/
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: cni-plugin
data:
  config.json: |
    {
        "periodicUpdateInterval": 300,
        "configList": [{
             "resourceName": "hca_3",
             "rdmaHcaMax": 1000,
             "selectors": {
                "ifNames": ["ens24np0"]
             }
           }
        ]
    }

kubelet startup arguments:

/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --hostname-override=gpu-186 --network-plugin=cni --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.6 --root-dir=/data/kubelet --node-ip=192.168.1.9 --max-pods=20 -v=4

kubelet version v1.23.0

kubelet log

Mar 22 13:58:38 GPU-186 kubelet[2736666]: I0322 13:58:38.523704 2736666 plugin_watcher.go:203] "Adding socket path or updating timestamp to desired state cache" path="/data/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:39 GPU-186 kubelet[2736666]: I0322 13:58:39.470223 2736666 reconciler.go:160] "OperationExecutor.RegisterPlugin started" plugin={SocketPath:/data/kubelet/plugins_registry/hca_3.sock Timestamp:2024-03-22 13:58:38.523730272 +0800 CST m=+75.689085324 Handler:<nil> Name:}
Mar 22 13:58:39 GPU-186 kubelet[2736666]: I0322 13:58:39.471951 2736666 manager.go:308] "Got Plugin at endpoint with versions" plugin="rdma/hca_3" endpoint="/var/lib/kubelet/plugins_registry/hca_3.sock" versions=[v1alpha1 v1beta1]
Mar 22 13:58:39 GPU-186 kubelet[2736666]: I0322 13:58:39.472004 2736666 manager.go:325] "Registering plugin at endpoint" plugin="rdma/hca_3" endpoint="/var/lib/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:39 GPU-186 kubelet[2736666]: W0322 13:58:39.472247 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:40 GPU-186 kubelet[2736666]: W0322 13:58:40.473408 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:42 GPU-186 kubelet[2736666]: W0322 13:58:42.099541 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:44 GPU-186 kubelet[2736666]: W0322 13:58:44.243570 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:47 GPU-186 kubelet[2736666]: W0322 13:58:47.653129 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:49 GPU-186 kubelet[2736666]: E0322 13:58:49.472945 2736666 endpoint.go:63] "Can't create new endpoint with socket path" err="failed to dial device plugin: context deadline exceeded" path="/var/lib/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:49 GPU-186 kubelet[2736666]: I0322 13:58:49.473931 2736666 plugin_watcher.go:215] "Removing socket path from desired state cache" path="/data/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:49 GPU-186 kubelet[2736666]: E0322 13:58:49.474152 2736666 goroutinemap.go:150] Operation for "/data/kubelet/plugins_registry/hca_3.sock" failed. No retries permitted until 2024-03-22 13:58:49.974116125 +0800 CST m=+87.139471161 (durationBeforeRetry 500ms). Error: RegisterPlugin error -- plugin registration failed with err: failed to dial device plugin with socketPath /var/lib/kubelet/plugins_registry/hca_3.sock: failed to dial device plugin: context deadline exceeded: rpc error: code = Unavailable desc = error reading from server: EOF
Mar 22 13:58:49 GPU-186 kubelet[2736666]: W0322 13:58:49.474224 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/data/kubelet/plugins_registry/hca_3.sock /data/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /data/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:50 GPU-186 kubelet[2736666]: I0322 13:58:50.476259 2736666 reconciler.go:143] "OperationExecutor.UnregisterPlugin started" plugin={SocketPath:/data/kubelet/plugins_registry/hca_3.sock Timestamp:2024-03-22 13:58:38.523730272 +0800 CST m=+75.689085324 Handler:0xc000630000 Name:rdma/hca_3}

The kubelet --root-dir plugins_registry directory (/data/kubelet/plugins_registry):

root@GPU-186:/data/kubelet# tree /data/kubelet/plugins_registry/
/data/kubelet/plugins_registry/
└── nfs.csi.k8s.io-reg.sock

The rdma-shared-dev-plugin did not create its socket file in the /data/kubelet/plugins_registry directory.
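The kubelet log above shows the underlying mismatch: the plugin's socket lands at /data/kubelet/plugins_registry/hca_3.sock on the host (where the plugin watcher finds it), but the plugin advertises its in-container path /var/lib/kubelet/plugins_registry/hca_3.sock, which kubelet, running on the host, cannot dial. A toy sketch of that comparison using the paths from this thread:

```shell
# Paths taken from the kubelet log above.
host_root=/data/kubelet                                   # kubelet --root-dir
advertised=/var/lib/kubelet/plugins_registry/hca_3.sock   # endpoint the plugin reports
on_host=$host_root/plugins_registry/hca_3.sock            # where the socket really is
if [ "$advertised" = "$on_host" ]; then
  status=ok
else
  status=mismatch   # kubelet dials the advertised path and gets 'no such file or directory'
fi
echo "$status: kubelet dials $advertised, socket is at $on_host"
```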
