[Documentation]: clarification on kernel/driver installation #57

Closed
rptaylor opened this issue Apr 4, 2024 · 1 comment

Comments

@rptaylor
Contributor

rptaylor commented Apr 4, 2024

Description of errors

https://github.com/ROCm/k8s-device-plugin says "ROCm kernel (Installation guide) or latest AMD GPU Linux driver (Installation guide)"

These two links seem to be different ways of doing more or less the same thing, except that one only uses the amdgpu-install script while the other documents the script as well as a package-manager-only approach. Are these two sets of documentation redundant, or, if not, what is the difference?

Also, the text "or latest AMD GPU Linux driver" makes me think that any Linux system should work out of the box without requiring any extra software installation, because the amdgpu driver is included in the kernel:

$ sudo lspci -v | grep -i amd
00:06.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran/MI200 [Instinct MI210] (rev 02)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0c34
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

$ ls -la /sys/class/kfd
total 0
drwxr-xr-x.  2 root root 0 Apr  4 00:01 .
drwxr-xr-x. 58 root root 0 Apr  4 00:00 ..
lrwxrwxrwx.  1 root root 0 Apr  4 00:03 kfd -> ../../devices/virtual/kfd/kfd

Is that not the case?

Also since the two linked documents cover the general case and are not k8s-specific I am not sure which parts are applicable to using a GPU in k8s pods as opposed to general bare metal. For example wouldn't parts of the ROCm software be installed in an application pod rather than on the node?

Attach any links, screenshots, or additional evidence you think will be helpful.

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

https://amdgpu-install.readthedocs.io/en/latest/

@y2kenny
Contributor

y2kenny commented Apr 5, 2024

Unfortunately, this is a consequence of the complexity of the AMD GPU product stack. Because the graphics and compute product lines have diverged (RDNA and CDNA in marketing terms), there are different ways of doing things, and the two links reflect that divergence. The choice between them depends on which AMD GPU products you have in your servers. In theory, you can have a cluster of consumer-grade AMD GPUs and this k8s device plugin will work.

> Also, the text "or latest AMD GPU Linux driver" makes me think that any Linux system should work out of the box without requiring any extra software installation, because the amdgpu driver is included in the kernel

This is correct IF the Linux distribution you choose for your cluster already ships a Linux kernel that supports the AMD GPUs you have. I am not sure which kernel version you have in your example, but it looks like MI210 support has already been upstreamed and included. In these cases, you should be able to simply install Linux, install k8s, and deploy the device plugin to have things working.

However, there are situations where AMD has just launched new hardware and the distribution you picked does not move fast enough to include support for it. In such cases, you will have to follow one of the install links (depending on the SKU).
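To tell which of these two cases a given node falls into, a quick check along the lines of the commands in the issue description may help. The following is only a minimal sketch: `check_amdgpu_node` is a hypothetical helper name, and it assumes that a loaded amdgpu module shows up under `/sys/module/amdgpu` and that the compute interface appears as `/sys/class/kfd/kfd`, as in the listing above. The optional argument is an alternative filesystem root, included only so the function can be dry-run against a fake directory tree.

```shell
#!/bin/sh
# Sketch: report whether the in-kernel amdgpu driver is loaded and has
# exposed the KFD (compute) interface on this node.
# check_amdgpu_node is a hypothetical helper, not part of any AMD tooling.
# $1 (optional): alternative filesystem root; defaults to the real root.
check_amdgpu_node() {
    root="${1:-}"
    rc=0
    if [ -d "$root/sys/module/amdgpu" ]; then
        echo "amdgpu module: loaded"
    else
        echo "amdgpu module: not loaded"
        rc=1
    fi
    if [ -e "$root/sys/class/kfd/kfd" ]; then
        echo "kfd interface: present"
    else
        echo "kfd interface: missing"
        rc=1
    fi
    # Non-zero return means at least one check failed.
    return "$rc"
}

# Usage on a node:
#   check_amdgpu_node || echo "driver not ready; see the install links"
```

If either check fails on a node with a supported GPU installed, that suggests the distribution kernel lacks support for the hardware and one of the install guides applies.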

> Also since the two linked documents cover the general case and are not k8s-specific I am not sure which parts are applicable to using a GPU in k8s pods as opposed to general bare metal. For example wouldn't parts of the ROCm software be installed in an application pod rather than on the node?

Only the kernel driver installation is needed on the node. Any user-space software should be part of the container images.
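As an illustration of that split, a pod can request a GPU from the device plugin while carrying the ROCm user-space libraries in its own image. The sketch below assumes the plugin advertises GPUs under the `amd.com/gpu` resource name; `render_gpu_pod` is a hypothetical helper that only prints a manifest, and the pod and image names passed to it are placeholders.

```shell
#!/bin/sh
# Sketch: print a Pod manifest that requests one AMD GPU.
# render_gpu_pod is a hypothetical helper; $1 is the pod name and
# $2 is a container image that already contains the ROCm user-space stack.
render_gpu_pod() {
    name="$1"
    image="$2"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
    - name: ${name}
      image: ${image}
      resources:
        limits:
          amd.com/gpu: 1
EOF
}

# Usage (the image name is a placeholder, not a real image):
#   render_gpu_pod rocm-test example.com/my-rocm-app:latest | kubectl apply -f -
```

Nothing in the manifest installs software on the node; the node only needs the kernel driver and the device plugin.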
