GPU mapping issues resizing Steam window #362

Open
asahilina opened this issue Jan 10, 2025 · 1 comment

@asahilina
Member

asahilina commented Jan 10, 2025

Reported here: https://www.reddit.com/r/AsahiLinux/comments/1hy9kym/gpu_timeout_when_resizing_steam_window_macbook/

This one is more interesting than a userspace driver bug. The faults reproduce easily, but after resizing for a while I also managed to get this kernel warning:

[ 7979.479260] ------------[ cut here ]------------
[ 7979.479264] WARNING: CPU: 6 PID: 12736 at drivers/iommu/io-pgtable-arm.c:727 __arm_lpae_unmap+0x36c/0x600
[ 7979.479272] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer snd_seq nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard libcurve25519_generic ip6_udp_tunnel udp_tunnel ip_set nf_tables qrtr snd_usb_audio snd_usbmidi_lib snd_hwdep snd_ump snd_rawmidi snd_seq_device uhid bnep sunrpc cdc_acm cdc_mbim cdc_wdm cdc_ncm cdc_ether usbnet mii brcmfmac_wcc binfmt_misc brcmfmac uas brcmutil usb_storage onboard_usb_dev cfg80211 hci_bcm4377 sdhci_pci cqhci bluetooth sdhci mmc_core rfkill apple_isp snd_soc_macaudio videobuf2_dma_sg videobuf2_memops ofpart videobuf2_v4l2 spi_nor videodev exfat snd_soc_apple_mca mtd snd_soc_cs42l84 snd_soc_tas2764 snd_soc_core videobuf2_common snd_compress ac97_bus mc macsmc_hid leds_pwm apple_soc_cpufreq joydev loop dm_multipath nfnetlink zram hid_apple tps6598x nvmem_spmi_mfd macsmc_hwmon rtc_macsmc
[ 7979.479317]  macsmc_power gpio_macsmc macsmc_reboot dockchannel_hid simple_mfd_spmi appledrm crct10dif_ce polyval_ce polyval_generic asahi ghash_ce apple_dcp sha3_ce dwc3 sha512_ce sha512_arm64 i2c_pasemi_platform ulpi drm_dma_helper apple_sio udc_core spi_apple apple_admac snd_pcm_dmaengine pwm_apple i2c_pasemi_core macsmc_rtkit macsmc apple_dockchannel spmi_apple_controller apple_rtkit_helper snd_pcm nvmem_apple_efuses apple_wdt pinctrl_apple_gpio phy_apple_atc snd_timer snd typec mux_apple_display_crossbar apple_dart clk_apple_nco soundcore mux_core xhci_plat_hcd vfat fat nvme_apple apple_sart nvme_core nvme_auth scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse i2c_dev
[ 7979.479343] CPU: 6 UID: 2001 PID: 12736 Comm: gpu worker Tainted: G S                 6.12.4-400.asahi.fc41.aarch64+16k #1
[ 7979.479345] Tainted: [S]=CPU_OUT_OF_SPEC
[ 7979.479346] Hardware name: Apple MacBook Pro (16-inch, M2 Max, 2023) (DT)
[ 7979.479347] pstate: 21400009 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 7979.479349] pc : __arm_lpae_unmap+0x36c/0x600
[ 7979.479351] lr : __arm_lpae_unmap+0x130/0x600
[ 7979.479352] sp : ffff8000b78d7170
[ 7979.479352] x29: ffff8000b78d7170 x28: ffff207bb05c7f00 x27: ffff207bb05c6c30
[ 7979.479354] x26: 00000077f9f80000 x25: 000000000000025a x24: 0000000000000000
[ 7979.479355] x23: 00000077f9618000 x22: 0000000000004000 x21: 0000000000000000
[ 7979.479356] x20: 000000000000025a x19: ffff207d0a506640 x18: 0000000000000000
[ 7979.479358] x17: 0000000000000000 x16: ffffc679da1724b0 x15: 0000ffff60c5ba00
[ 7979.479359] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 7979.479360] x11: 0000000000000000 x10: 0000008000000000 x9 : ffffc679da115318
[ 7979.479361] x8 : 0000000000000586 x7 : ffffe08640000000 x6 : 000000000000027a
[ 7979.479363] x5 : 0000000000000003 x4 : 00000000000002ab x3 : 0000000000004000
[ 7979.479364] x2 : 00000077f9618000 x1 : 000000000000000a x0 : 0000000000000003
[ 7979.479365] Call trace:
[ 7979.479366]  __arm_lpae_unmap+0x36c/0x600
[ 7979.479367]  __arm_lpae_unmap+0x130/0x600
[ 7979.479369]  __arm_lpae_unmap+0x130/0x600
[ 7979.479370]  arm_lpae_unmap_pages+0xb8/0x108
[ 7979.479371]  _RNvMs2_NtCseJjz0SMgPjl_5asahi3mmuNtB5_7VmInner11unmap_pages+0xc8/0x20c [asahi]
[ 7979.479383]  _RNvXs1_NtCseJjz0SMgPjl_5asahi3mmuNtB5_7VmInnerNtNtNtCsyksr4wmXDQ_6kernel3drm5gpuvm11DriverGpuVm10step_unmap+0xdc/0x434 [asahi]
[ 7979.479389]  _RINvNtNtCsyksr4wmXDQ_6kernel3drm5gpuvm19step_unmap_callbackNtNtCseJjz0SMgPjl_5asahi3mmu7VmInnerEBZ_+0x28/0x3c [asahi]
[ 7979.479396]  drm_gpuvm_bo_unmap+0xa8/0x198
[ 7979.479398]  _RNvMsh_NtCseJjz0SMgPjl_5asahi3mmuNtB5_2Vm13drop_mappings+0xac/0x1f4 [asahi]
[ 7979.479404]  _RNvMs1_NtCseJjz0SMgPjl_5asahi4fileNtB5_4File17unbind_gem_object+0xdc/0x1fc [asahi]
[ 7979.479410]  _RINvNtNtCsyksr4wmXDQ_6kernel3drm3gem14close_callbackNtNtCseJjz0SMgPjl_5asahi3gem12DriverObjectINtNtB2_5shmem6ObjectBO_EEBS_+0x80/0xd0 [asahi]
[ 7979.479417]  drm_gem_handle_delete+0x7c/0xf0
[ 7979.479418]  drm_gem_close_ioctl+0x3c/0x58
[ 7979.479419]  drm_ioctl_kernel+0xc8/0x138
[ 7979.479421]  drm_ioctl+0x230/0x4c0
[ 7979.479422]  __arm64_sys_ioctl+0xb4/0x100
[ 7979.479424]  invoke_syscall+0x6c/0x100
[ 7979.479427]  el0_svc_common.constprop.0+0x48/0xf0
[ 7979.479429]  do_el0_svc+0x24/0x38
[ 7979.479430]  el0_svc+0x38/0x148
[ 7979.479432]  el0t_64_sync_handler+0x120/0x138
[ 7979.479434]  el0t_64_sync+0x194/0x198
[ 7979.479435] ---[ end trace 0000000000000000 ]---
[ 7979.479446] ------------[ cut here ]------------
[...]
[ 8153.834856] asahi 406400000.gpu: unmap_pages 0x77f33c0000:0x2e returned 0
(repeated for each of the 0x2e pages, since the asahi driver retries for each subsequent page)

So there's a kernel driver bug involved. It kind of sounds like a double unmap, or an unmap after a failed map, or something like that. The addresses printed by the kernel match at least one of the faults.
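To make the log pattern above concrete, here is a toy model of what "unmap of something that was never mapped" looks like. It is only a sketch, not the io-pgtable or asahi code: `ToyPageTable`, `driver_unmap`, the 16 KiB page size and the addresses are all made up for illustration, and the retry behaviour is modelled only on the note that the driver retries for each subsequent page.

```python
PAGE = 0x4000  # assumption: 16 KiB pages


class ToyPageTable:
    """Hypothetical stand-in for a page table: a set of populated page addresses."""

    def __init__(self):
        self.populated = set()

    def map_pages(self, iova, count):
        for i in range(count):
            self.populated.add(iova + i * PAGE)

    def unmap_pages(self, iova, count):
        """Unmap up to `count` pages, stopping at the first unpopulated entry.

        Returns the number of bytes actually unmapped (0 if the first entry
        was never populated).
        """
        done = 0
        for i in range(count):
            addr = iova + i * PAGE
            if addr not in self.populated:
                break
            self.populated.remove(addr)
            done += PAGE
        return done


def driver_unmap(pt, iova, count):
    """Hypothetical retry loop: on a 0 return, log it and skip one page."""
    log = []
    while count > 0:
        unmapped = pt.unmap_pages(iova, count)
        if unmapped == 0:
            log.append(f"unmap_pages {iova:#x}:{count:#x} returned 0")
            unmapped = PAGE  # give up on this page, retry from the next one
        iova += unmapped
        count -= unmapped // PAGE
    return log


pt = ToyPageTable()
pt.map_pages(0x10000000, 2)                 # only 2 pages are really mapped
for line in driver_unmap(pt, 0x10000000, 5):  # but 5 pages get unmapped
    print(line)
```

In this model the two populated pages unmap silently, and each of the three remaining pages produces its own "returned 0" line with a shrinking page count, which is the same shape as the repeated messages in the log.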

@asahilina
Member Author

asahilina commented Jan 11, 2025

Fun...

[68152.647640] asahi 406400000.gpu: [File 16]: IOCTL: gem_create size=0xdb4000
[68152.647643] asahi: DriverObject::new id=200471
[68152.647656] asahi: DriverObject new user object: id=200471
[68152.647658] asahi 406400000.gpu: [File 16]: IOCTL: gem_create size=0xdb4000 handle=0x55
[68152.647661] asahi 406400000.gpu: [File 16 VM 1]: IOCTL: gem_bind op=0 handle=0x55 flags=0x3 0x0:0xdb4000 -> 0x77fa8f0000

[68152.659672] asahi 406400000.gpu: [File 302 VM 1]: IOCTL: gem_bind op=0 handle=0x120 flags=0x3 0x0:0x12d0000 -> 0x77e8018000
[68152.659684] asahi 406400000.gpu: [File 302 VM 1]: IOCTL: gem_bind op=0 handle=0x120 flags=0x3 0x0:0xdb4000 -> 0x77f7180000

[...]
[68152.688039] asahi: DriverObject::close id=200471
[...]
[68152.696518] asahi: DriverObject::close id=200471
[...]
[68152.696806] asahi 406400000.gpu: unmap_pages 0x77e8dcc000:0x147 returned 0

An object gets created, exported/imported into another process. The other process binds it twice, once with a size greater than the object.

The kernel bug is that we do not fail bind requests with size > object size, and we consider the tail range mapped (even though the underlying PT mutation operation stops after reaching the end of the object). So on unmap, everything after the object end WARNs, since those PTEs were never populated. This is harmless in principle (no kernel state is dangerously wrong), but noisy.
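The numbers in the log line up with this explanation. A quick arithmetic check, using only values copied from the log lines above; the one assumption is the 16 KiB page size, inferred from the `+16k` kernel build name:

```python
# Values copied from the gem_create / gem_bind / unmap_pages log lines above.
OBJ_SIZE = 0xdb4000        # gem_create size
BIND_SIZE = 0x12d0000      # oversized gem_bind range (0x0:0x12d0000)
BIND_BASE = 0x77e8018000   # GPU VA target of the oversized bind
PAGE_SIZE = 0x4000         # assumption: 16 KiB pages ("+16k" kernel)

# First address past the real end of the object within the oversized mapping.
object_end = BIND_BASE + OBJ_SIZE

# Number of pages in the tail that was considered mapped but never populated.
tail_pages = (BIND_SIZE - OBJ_SIZE) // PAGE_SIZE

print(hex(object_end), hex(tail_pages))  # 0x77e8dcc000 0x147
```

That matches `unmap_pages 0x77e8dcc000:0x147 returned 0` exactly: the unmap starts failing at the first page past the object end, with the whole tail still left to go.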

But there's definitely a userspace bug to go along with this. Mapping the same object twice in a row, once with an excessive size, is definitely wrong. Virtio related perhaps? 0x12d0000 is the size of other GEM objects File 302 is dealing with...

I guess step 1 here is to fix the kernel to fail the bad request, then see where it comes from in userspace. And I really need to get muvm/libkrun to verbosely log virglrenderer errors somehow...
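For reference, the check that step 1 needs is roughly the following. This is a sketch of the logic only, not the actual driver change (the driver is Rust); `validate_bind`, its parameters and the choice of `EINVAL` are assumptions for illustration.

```python
import errno


def validate_bind(obj_size: int, offset: int, size: int) -> int:
    """Return 0 if [offset, offset + size) lies inside the object, else -EINVAL.

    Hypothetical helper: the name, signature and error code are assumptions.
    """
    # Written as a subtraction so that the same comparison cannot overflow
    # when ported to fixed-width integers.
    if offset > obj_size or size > obj_size - offset:
        return -errno.EINVAL
    return 0


# The two binds from the log: the oversized one is rejected, the exact one passes.
print(validate_bind(0xdb4000, 0x0, 0x12d0000))  # -22
print(validate_bind(0xdb4000, 0x0, 0xdb4000))   # 0
```

With a check like this the bad request fails up front, so the VM never records a tail range that has no PTEs behind it.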
