Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vulkan: Fix vkCmdDrawIndexed crash on Adreno 5XX devices #92611

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Alex2782
Copy link
Contributor

@Alex2782 Alex2782 commented Jun 1, 2024

Fixes #94741 only (Editor)


for #79760, #82602, #85097, #86037

previous solution

Workaround for the Adreno 5XX family of devices.

An issue occurs when the vkCmdDrawIndexed function is called multiple times between vkCmdBeginRenderPass
and vkCmdEndRenderPass, using a shader that includes a uniform variable (#define MATERIAL_UNIFORMS_USED enabled). The crash is caused if the subsequent render operations (Nodes) within the same render pass do not use material shaders with a uniform variable.

In other words, if a shader with uniform variables is used within a render pass, all following render operations up to the end of the render pass must also use shaders with uniform variables. Otherwise, this can lead to a crash. This workaround ensures that this does not happen by making sure that shaders with uniform variables are consistently used between vkCmdBeginRenderPass and vkCmdEndRenderPass.

--> https://github.com/Alex2782/godot/tree/fix_uniform_crash_adreno_5xx_back

Test project: ShaderTest.zip

Tests on Firebase Test Lab
  • Mi A2 Lite - Adreno (TM) 506 (reproduced)
  • vivo 1906 - Adreno (TM) 505 (reproduced)
  • F-01L (Fujitsu) - Adreno (TM) 506 (not affected)

crash if:r_device.workarounds.force_material_uniform_set = false;

F libc    : Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0 in tid 19822 (VkThread), pid 19502 (lex.shader_test)
F DEBUG   : backtrace:
F DEBUG   :       #00 pc 00000000000f5004  /vendor/lib64/hw/vulkan.msm8953.so (BuildId: 4059f276877a7a61cc16b085624608be)
F DEBUG   :       #01 pc 00000000000a0468  /vendor/lib64/hw/vulkan.msm8953.so (qglinternal::vkCmdDrawIndexed(VkCommandBuffer_T*, unsigned int, unsigned int, unsigned int, int, unsigned int)+232) (BuildId: 4059f276877a7a61cc16b085624608be)

no crash if: r_device.workarounds.force_material_uniform_set = true;

no_crash_web-build_20240601_3ppp_daisy_sprout-29-en_US-landscape_video.mp4

@Alex2782 Alex2782 requested a review from a team as a code owner June 1, 2024 00:13
@Alex2782 Alex2782 force-pushed the fix_uniform_crash_adreno_5xx branch 3 times, most recently from fe8aa34 to 19639e4 Compare June 1, 2024 00:38
@AThousandShips AThousandShips added this to the 4.3 milestone Jun 1, 2024
@Alex2782 Alex2782 force-pushed the fix_uniform_crash_adreno_5xx branch from 19639e4 to 3c49670 Compare June 2, 2024 01:41
@clayjohn clayjohn modified the milestones: 4.3, 4.4 Jul 24, 2024
@Alex2782 Alex2782 marked this pull request as draft July 29, 2024 20:44
@Alex2782 Alex2782 changed the title Vulkan: Fix uniform crash on Adreno 5XX devices Vulkan: Fix vkCmdDrawIndexed crash on Adreno 5XX devices Jul 29, 2024
@Alex2782
Copy link
Contributor Author

There are more vkCmdDrawIndexed crashes, RENDER_GRAPH_REORDER = 0 fixes the issues.
#94741 (comment)

I do not own an Adreno 5xx. I can only check on Firebase Test Lab if the devices crash and analyze the logs, which takes a lot of time. Adreno 5xx (Android 7 to 10) devices are no longer very widespread, or there are not so many devices in active use. I think we should just disable RENDER_GRAPH_REORDER for all Adreno 5xx devices for now.

In the next few days I will first check what effects RENDER_GRAPH_REORDER = 0 has on performance.

@clayjohn
Copy link
Member

In the next few days I will first check what effects RENDER_GRAPH_REORDER = 0 has on performance.

In theory it should have a big impact on performance. However since Adreno 5XX isn't a very common device anymore, it is probably better to force disable RENDER_GRAPH_REORDER so that the workaround only has a small impact on the complexity of the code.

That being said, some of the original reports were from before the acyclic graph was added, which means that setting RENDER_GRAPH_REORDER=0 may not be enough (for example: #79760)

CC @DarioSamo You may find this interesting. We discussed this workaround last month, now we have a bit more information

@DarioSamo
Copy link
Contributor

In theory it should have a big impact on performance. However since Adreno 5XX isn't a very common device anymore, it is probably better to force disable RENDER_GRAPH_REORDER so that the workaround only has a small impact on the complexity of the code.

The good news is the macro doesn't really control much actual compilation and it just controls a boolean that gets passed here. Making a driver workaround might be as simple as just adding a && to that first argument (e.g. a workaround boolean called avoid_graph_reorder).

draw_graph.end(RENDER_GRAPH_REORDER, RENDER_GRAPH_FULL_BARRIERS, command_buffer, frames[frame].command_buffer_pool);

I'd keep this as an option if you consider there's absolutely no other solution and there's not enough time to figure out every annoying driver bug in time for 4.3. However, it is very likely the issue will pop up in the future in the mobile renderer just because someone adds an operation that causes it and we'll pretty much be back at the same point. Perhaps by that point, with some luck, we won't care that these devices are broken and don't need to investigate it further.

@Alex2782

This comment was marked as outdated.

@Alex2782 Alex2782 force-pushed the fix_uniform_crash_adreno_5xx branch from 3c49670 to 10c4a32 Compare July 29, 2024 23:24
@Alex2782
Copy link
Contributor Author

no crash on Fujitsu F-01L - Adreno (TM) 506, Vulkan 1.0.61, driverVersion = 54185879

Is there a function to format the driverVersion as a string? xxx.xxx.x
I want to output in logcat first, on 1 device out of 6 the 2D UniformShader does not crash.


just adding a && to that first argument (e.g. a workaround boolean called avoid_graph_reorder).

That sounds better, maybe I'll change it later. (Wow ... this sentence rhymes in English) 🎤 drop

I decided on a new private variable that is initialized in the constructor, but I can also change that later.
I wasn't sure whether it would have a negative impact on performance.
Every frame: RENDER_GRAPH_REORDER && get_device_workarounds().disable_render_graph_reorder

@clayjohn
Copy link
Member

Is there a function to format the driverVersion as a string? xxx.xxx.x

Yep, they're defined in vulkan_core.h

#define VK_API_VERSION_VARIANT(version) ((uint32_t)(version) >> 29U)
#define VK_API_VERSION_MAJOR(version) (((uint32_t)(version) >> 22U) & 0x7FU)
#define VK_API_VERSION_MINOR(version) (((uint32_t)(version) >> 12U) & 0x3FFU)
#define VK_API_VERSION_PATCH(version) ((uint32_t)(version) & 0xFFFU)

You have to assemble the string yourself using itos()

itos(VK_API_VERSION_VARIANT(driverVersion)) + "." + itos(VK_API_VERSION_MAJOR(driverVersion)) + "." + itos(VK_API_VERSION_MINOR(driverVersion)) + "." + itos(VK_API_VERSION_PATCH(driverVersion))

@Alex2782 Alex2782 force-pushed the fix_uniform_crash_adreno_5xx branch from 10c4a32 to b87c545 Compare July 31, 2024 00:20
@Alex2782

This comment was marked as outdated.

@Alex2782 Alex2782 force-pushed the fix_uniform_crash_adreno_5xx branch 2 times, most recently from 9f47b1e to e0f18b1 Compare July 31, 2024 18:34
@clayjohn
Copy link
Member

Just a note about performance testing. RENDER_GRAPH_REORDER won't have a big impact on simple 2D scenes. It will have the biggest impact on:

  1. Scenes with many particles (2D or 3D)
  2. Scenes with many skeletons (2D or 3D)
  3. Scenes with many 3D features (opaque + transparent + sky + glow + SSR + SSAO etc.)

@Alex2782
Copy link
Contributor Author

Alex2782 commented Jul 31, 2024

should only fix #94741 --> comment


ShaderTest.zip -> #79760, #82602, #85097, #86037

RENDER_GRAPH_REORDER = 0 does not prevent the uniform crash.

On Firebase Test Lab, unfortunately 3 Adreno 5xx devices have been removed in the last 2 months. There are only 4 devices left and my ShaderTest still crashes there, except on F-01L - Fujitsu. And soon at least 1 more device will be removed.

Samsung Tab S3 (crash)
08-06 07:20:29.335 31659 32118 I godot : name: Adreno (TM) 530
08-06 07:20:29.335 31659 32118 I godot : vendor: 20803
08-06 07:20:29.335 31659 32118 I godot : deviceID: 84082690
08-06 07:20:29.335 31659 32118 I godot : driverVersion: 512.331.0
08-06 07:20:29.385 31659 32118 I godot : Vulkan 1.1.66 - Forward Mobile - Using Device #0: Qualcomm - Adreno (TM) 530

08-06 07:20:53.589 32757 32757 F DEBUG : backtrace:
08-06 07:20:53.590 32757 32757 F DEBUG : #00 pc 00000000000c4ad0 /system/vendor/lib64/hw/vulkan.msm8996.so (A5xCommandBuffer::FinalizeDescriptors(A5xBaseAddressDesc*, unsigned int*, unsigned int*, unsigned int)+536)
08-06 07:20:53.590 32757 32757 F DEBUG : #01 pc 00000000000bd1a4 /system/vendor/lib64/hw/vulkan.msm8996.so (A5xCommandBuffer::HwValidateState()+1156)
08-06 07:20:53.590 32757 32757 F DEBUG : #02 pc 0000000000077a1c /system/vendor/lib64/hw/vulkan.msm8996.so (QglCommandBuffer::DrawIndexed(unsigned int, unsigned int, int, unsigned int, unsigned int)+220)
08-06 07:20:53.590 32757 32757 F DEBUG : #03 pc 000000000001856c /system/lib64/libvulkan.so (vulkan::api::(anonymous namespace)::CmdDrawIndexed(VkCommandBuffer_T*, unsigned int, unsigned int, unsigned int, int, unsigned int)+216)


F-01L - Fujitsu (OK)
08-06 07:20:58.261 25316 25479 I godot : name: Adreno (TM) 506
08-06 07:20:58.261 25316 25479 I godot : vendor: 20803
08-06 07:20:58.261 25316 25479 I godot : deviceID: 83887616
08-06 07:20:58.262 25316 25479 I godot : driverVersion: 12.940.3991
08-06 07:20:58.284 25316 25479 I godot : Vulkan 1.0.61 - Forward Mobile - Using Device #0: Qualcomm - Adreno (TM) 506

BACKUP, previous solution:
https://github.com/Alex2782/godot/tree/fix_uniform_crash_adreno_5xx_back

I could create a new PR, if the commit should have a chance.
Otherwise I will give up trying to fix the 'uniform' crashes because I lack productive testing facilities.

@Alex2782 Alex2782 marked this pull request as ready for review July 31, 2024 19:40
Co-Authored-By: Clay John <[email protected]>
Co-Authored-By: Darío <[email protected]>
Co-Authored-By: MatheusMDX <[email protected]>
@Alex2782 Alex2782 force-pushed the fix_uniform_crash_adreno_5xx branch from e0f18b1 to b0d1493 Compare August 6, 2024 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Android editor crash when interacting with some GUI elements
5 participants