Wait for modules loading completion before switching to userspace #120

Closed
hxss opened this issue Dec 18, 2021 · 12 comments

@hxss

hxss commented Dec 18, 2021

Boot fails when I add amdgpu module in booster.yaml

modules:
  -*,amdgpu,nvme,ext4,serio_raw,atkbd,xhci_pci,i8042,wmi,video,crc32c_generic,crc32c_intel

The last message I see on the boot screen is "Switching to the userspace now.", after which the keyboard doesn't work.
journalctl shows these errors:

$ jrc -kb -1 -o short-precise -p warning
Dec 18 14:09:22.016768 hp kernel: amdgpu 0000:04:00.0: loading /lib/firmware/amdgpu/green_sardine_sdma.bin failed with error -4
Dec 18 14:09:22.016838 hp kernel: amdgpu 0000:04:00.0: Direct firmware load for amdgpu/green_sardine_sdma.bin failed with error -4
Dec 18 14:09:22.016848 hp kernel: [drm:sdma_v4_0_early_init [amdgpu]] *ERROR* sdma_v4_0: Failed to load firmware "amdgpu/green_sardine_sdma.bin"
Dec 18 14:09:22.016856 hp kernel: [drm:sdma_v4_0_early_init [amdgpu]] *ERROR* Failed to load sdma firmware!
Dec 18 14:09:22.016864 hp kernel: [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* early_init of IP block <sdma_v4_0> failed -4
Dec 18 14:09:22.016918 hp kernel: amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init
Dec 18 14:09:22.016974 hp kernel: amdgpu: probe of 0000:04:00.0 failed with error -4

The file /lib/firmware/amdgpu/green_sardine_sdma.bin exists on disk, but I don't see any mention of it in the logs when booting a mkinitcpio image that includes amdgpu.

Kernel: x86_64 Linux 5.15.6-zen2-1-zen
CPU: AMD Ryzen 7 5800U with Radeon Graphics @ 16x 1.9GHz
GPU: [integrated] [AMD/ATI] Cezanne (rev c1)
@anatol
Owner

anatol commented Dec 18, 2021

This amdgpu issue is supposed to be fixed by #104.

What is the content of the mkinitcpio-generated image? Does it even try to load the green_sardine_sdma.bin firmware? Could you please post the full listing of the mkinitcpio image that works for you?

@hxss
Author

hxss commented Dec 18, 2021

mkinitcpio.conf:

MODULES=(amdgpu nvme ext4)

HOOKS=(systemd autodetect)

Mkinitcpio content. I don't see any green_sardine_sdma.bin in the logs when booting with mkinitcpio.

@anatol
Owner

anatol commented Dec 18, 2021

Thanks. So the bin file is in the mkinitcpio image as well. It was probably loaded successfully and silently, so nothing was printed to the logs.

$ jrc -kb -1 -o short-precise -p warning
Dec 18 14:09:22.016768 hp kernel: amdgpu 0000:04:00.0: loading /lib/firmware/amdgpu/green_sardine_sdma.bin failed with error -4
Dec 18 14:09:22.016838 hp kernel: amdgpu 0000:04:00.0: Direct firmware load for amdgpu/green_sardine_sdma.bin failed with error -4
Dec 18 14:09:22.016848 hp kernel: [drm:sdma_v4_0_early_init [amdgpu]] *ERROR* sdma_v4_0: Failed to load firmware "amdgpu/green_sardine_sdma.bin"
Dec 18 14:09:22.016856 hp kernel: [drm:sdma_v4_0_early_init [amdgpu]] *ERROR* Failed to load sdma firmware!
Dec 18 14:09:22.016864 hp kernel: [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* early_init of IP block <sdma_v4_0> failed -4
Dec 18 14:09:22.016918 hp kernel: amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init
Dec 18 14:09:22.016974 hp kernel: amdgpu: probe of 0000:04:00.0 failed with error -4

@hxss do you see any interesting debug messages before or in between these ^^ logs? Anything related to loading files/firmware/booster/...

@anatol
Owner

anatol commented Dec 18, 2021

And the error itself comes from kernel_read_file_from_path_initns() here https://github.com/torvalds/linux/blob/abfecb39092029c42c79bacac3d1c96a133ff231/drivers/base/firmware_loader/main.c#L476

It is not clear why this function returns an EINTR error.

@hxss
Author

hxss commented Dec 18, 2021

I added full logs for booster and mkinitcpio in a gist. Only info messages appear before the sardine error.

@anatol
Owner

anatol commented Dec 18, 2021

@ishitatsuyuki san, maybe you have some ideas about what is going on here?

@ishitatsuyuki

I'm running booster-git 21bafff (one commit behind master, from chaotic-aur) and it has been working fine.

@hxss Are you running the latest version? Try the git version if you have not already. If you have already updated, then try the following config:

modules_force_load: amdgpu

(This is the only line I have in my config, btw.)

@hxss
Author

hxss commented Dec 19, 2021

Yeah, it works with modules_force_load, thanks!
So amdgpu fails only when I list it in modules.
I tested 21bafff and 3d08a89.

@hxss hxss closed this as completed Dec 19, 2021
@anatol
Owner

anatol commented Dec 19, 2021

Thanks for the help, @ishitatsuyuki and @hxss.

The difference between modules and modules_force_load is:

  1. Loading of modules from the modules list is triggered by udev events. Once some hardware (e.g. an AMD GPU) is detected, booster starts loading the corresponding module. modules_force_load starts loading when booster starts, independently of hardware and udev.
  2. modules_force_load waits until its modules are fully loaded before switching to userspace. modules does not: any module from this list is loaded asynchronously and booster does not wait for it to complete.

Item 2 is what triggers this problem. amdgpu is a heavy driver with firmware blobs, so when it is loaded via modules it takes time to complete. With a non-LUKS root partition, detecting and mounting the root happens quickly, before amdgpu is fully loaded. This also explains why the firmware load returns an EINTR error: the load is interrupted because the booster process is killed while moving to the userspace init. The amdgpu driver is then left half-loaded, which causes the GPU issues.

I think that item 2 is too confusing for users. Booster should wait for loading to complete for modules listed in modules as well.

@anatol anatol reopened this Dec 19, 2021
@anatol anatol changed the title from "boot fails with amdgpu kernel module" to "Wait for modules loading completion before switching to userspace" Dec 19, 2021
anatol added a commit that referenced this issue Dec 19, 2021
…to userspace

Currently there are 2 ways to load modules
 - using `modules` and then let udev events match the module and start
   loading it
 - using `modules_force_load` that starts loading the modules at the
   beginning of booster execution.

`modules_force_load` waits for module loading to complete while the
`modules` codepath does not. This causes problems when a heavy module (like
AMD GPU) is added to `modules` rather than to `modules_force_load`.

Add a counter that tracks whether modules are loaded via any of these
codepaths. Wait for the counter to get down to zero before switching to
userspace.

Closes #120
@anatol
Owner

anatol commented Dec 19, 2021

I added a possible fix for the root cause of this problem. With the commit above, the AMD GPU driver should load at boot even if it is added to modules instead of modules_force_load.

@hxss please try the wip branch; the amdgpu driver should load successfully with your previous config (modules: amdgpu).

@hxss
Author

hxss commented Dec 20, 2021

@anatol works great 👍

@anatol
Owner

anatol commented Dec 21, 2021

Thank you for the confirmation, @hxss.

Note that modules_force_load is still the right way to pre-load GPU drivers. If amdgpu is added to modules, there is still a race condition: theoretically, booster might mount the root partition and switch to it before PCI enumeration has detected the GPU device.

@anatol anatol closed this as completed in d3652f3 Dec 21, 2021