[Documentation]: clarification on kernel/driver installation #57

rptaylor · 2024-04-04T22:21:46Z

Description of errors

https://github.com/ROCm/k8s-device-plugin says "ROCm kernel (Installation guide) or latest AMD GPU Linux driver (Installation guide)"

These two links seem to be different ways of doing more or less the same thing, except that one only uses the amdgpu-install script while the other documents the script as well as a package-manager-only approach. Are these two sets of documentation redundant or, if not, what is the difference?

Also, the text "or latest AMD GPU Linux driver" makes me think that any Linux system should work out of the box without requiring any extra software installation, because the amdgpu driver is included in the kernel:

$ sudo lspci -v | grep -i amd
00:06.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran/MI200 [Instinct MI210] (rev 02)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0c34
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

$ ls -la /sys/class/kfd
total 0
drwxr-xr-x.  2 root root 0 Apr  4 00:01 .
drwxr-xr-x. 58 root root 0 Apr  4 00:00 ..
lrwxrwxrwx.  1 root root 0 Apr  4 00:03 kfd -> ../../devices/virtual/kfd/kfd

Is that not the case?

Also, since the two linked documents cover the general case and are not k8s-specific, I am not sure which parts are applicable to using a GPU in k8s pods as opposed to general bare metal. For example, wouldn't parts of the ROCm software be installed in an application pod rather than on the node?

Attach any links, screenshots, or additional evidence you think will be helpful.

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

https://amdgpu-install.readthedocs.io/en/latest/

y2kenny · 2024-04-05T20:31:59Z

Unfortunately, this is a consequence of the complexity of the AMD GPU product stack. Because the graphics and compute product lines have diverged (RDNA and CDNA in marketing terms), there are different ways of doing things, and the two links reflect that divergence. Which one to follow depends on the AMD GPU products you have in your servers. In theory, you can have a cluster of consumer-grade AMD GPUs and this k8s device plugin will work.

> Also the text "or latest AMD GPU Linux driver" makes me think that any Linux system should work out of the box without requiring any extra software installation, because the amdgpu driver is included in the kernel

This is correct IF the Linux distribution you choose for your cluster already ships a kernel that supports the AMD GPUs you have. I am not sure which kernel version you have in your example, but it looks like MI210 support has already been upstreamed and included. In these cases, you should be able to simply install Linux, install k8s, deploy the device plugin, and have things working.

However, there are situations where AMD has just launched new hardware and the distribution you picked does not move fast enough to support it yet. In such cases, you will have to follow one of the install links (depending on the SKU).
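As a rough way to tell which of the two situations applies to a node, the check from the original post can be wrapped in a small script. This is only a sketch: the function name `check_amdgpu_node` and its optional sysfs-root argument are made up for illustration, and it assumes amdgpu is loaded as a module (so that `/sys/module/amdgpu` exists) and that a working compute stack shows up as `/sys/class/kfd/kfd`, as in the listing in the question.

```shell
#!/bin/sh
# Sketch only: report whether the in-kernel amdgpu driver already exposes
# the compute (KFD) interface on this node.
# check_amdgpu_node and its SYSFS_ROOT argument are hypothetical helpers,
# not part of any AMD tool.

check_amdgpu_node() {
    sysfs_root="${1:-/sys}"   # override only to test against a fake tree

    # Assumption: amdgpu is built as a module, so this directory exists
    # once the module is loaded.
    if [ ! -d "$sysfs_root/module/amdgpu" ]; then
        echo "amdgpu driver not loaded"
        return 1
    fi

    # Same entry as in the `ls -la /sys/class/kfd` output in the question.
    if [ ! -e "$sysfs_root/class/kfd/kfd" ]; then
        echo "amdgpu loaded, but no kfd device"
        return 1
    fi

    echo "amdgpu and kfd present"
    return 0
}
```

Calling `check_amdgpu_node` with no argument inspects the real `/sys`. If it reports that both are present, the distribution kernel is probably enough; if not, that is the case where one of the install guides is needed.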

> Also since the two linked documents cover the general case and are not k8s-specific I am not sure which parts are applicable to using a GPU in k8s pods as opposed to general bare metal. For example wouldn't parts of the ROCm software be installed in an application pod rather than on the node?

Only the kernel driver installation is needed on the node. Any user-space software should be part of the container images.
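To illustrate that split, here is a minimal sketch of a pod that brings its own ROCm user space in the image and only asks the node for a GPU. The helper name `render_gpu_pod` is hypothetical, and the image is left as an argument because the thread does not name one; `amd.com/gpu` is the resource name the device plugin advertises.

```shell
#!/bin/sh
# Sketch only: print a pod manifest that requests one AMD GPU from the
# device plugin. render_gpu_pod is a hypothetical helper; the image you
# pass in is expected to contain the ROCm user-space libraries.

render_gpu_pod() {
    pod_name="${1:-}"
    image="${2:-}"
    if [ -z "$pod_name" ] || [ -z "$image" ]; then
        echo "usage: render_gpu_pod POD_NAME IMAGE" >&2
        return 2
    fi

    # The node contributes only the kernel driver; everything else
    # comes from the container image.
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $pod_name
spec:
  restartPolicy: Never
  containers:
    - name: $pod_name
      image: $image
      resources:
        limits:
          amd.com/gpu: 1
EOF
}
```

The output can be piped straight to `kubectl apply -f -`, for example `render_gpu_pod rocm-test IMAGE | kubectl apply -f -` with `IMAGE` replaced by whichever ROCm-based image you use.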

rptaylor mentioned this issue Apr 5, 2024

Remove reference to deprecated allow-privileged flag for kubelet #55

Merged

y2kenny-amd closed this as completed Apr 5, 2024
