
Ubershaders and pipeline pre-compilation (and dedicated transfer queues). #90400

Merged
merged 1 commit into from
Oct 3, 2024

Conversation

DarioSamo
Contributor

@DarioSamo DarioSamo commented Apr 8, 2024

This is a big PR with quite a bit of history that should be reviewed very thoroughly to determine where we want to make concessions and how to mitigate the side effects as much as possible. However, the benefits are essential to shipping games with the engine and make the final experience for users much better. To read further on @reduz's notes about the topic, you can check out these documents (Part 1 and Part 2).

Due to the complexity of this PR and the fact that 4.3 is currently in feature freeze, I'd definitely not consider this PR until 4.3 is out. If you want the TL;DR, skip ahead to the two videos of the TPS demo to see the immediate difference.

NOTE: These improvements will only affect the Forward+ and Mobile renderers. No changes are expected for Compatibility.

Transfer queues

First of all, this PR supersedes the transfer queues PR and effectively uses it as its base. Both rely on unlocking parts of RenderingDevice's behavior to make it multithread-friendly, which made keeping the two PRs separate impractical. As mentioned in that previous PR, merging it as is will cause a small performance regression unless #86333 is merged first.

Pipeline compilation

Modern APIs like Vulkan and D3D12 have made rendering pipeline management very explicit: their creation is no longer hidden behind the current rendering state and handled on demand by the driver. Instead, the developer must create the entire pipeline ahead of time and wait on a blocking operation that can take a significant amount of time depending on the complexity of the shader and the speed of the hardware. This has improved recently with the introduction of extensions like VK_EXT_graphics_pipeline_library, but as always, Godot must engineer solutions that address the problem on as much hardware as possible and use such features as optional optimizations in the future.

Godot has a responsibility to perform as fast as possible for the end user, which leaves it no choice but to generate pipelines with as little code and as few requirements as possible. The engine achieves this through shader compilation macros (shader variants) and specialization constants that optimize code for a particular pipeline (pipeline variants). While Godot resolves shader variant compilation ahead of time and can even ship the shader cache to skip the step altogether, before this PR it couldn't resolve pipeline variant compilation ahead of time at all.

If you're familiar with the "stutters when playing the game for the first time" phenomenon that has plagued all games shipped with Godot 4's RD-based renderers, this is pretty much the entire root of the problem. This is not a problem exclusive to Godot: it's been very evident in lots of commercial releases that include very extensive shader pre-compilation steps the first time a game starts or a driver update happens. The issue is so prevalent that even Digital Foundry points it out as the #1 problem plaguing PC game releases in this article, and they never fail to mention it in any new game that suffers from it.

Ubershaders for Godot 4

The exciting part about this PR is that an effective solution was developed to address this problem completely, without extensive shader pre-compilation steps or any input from the game developer whatsoever. Instead, pipeline compilation is made a part of asset loading as much as possible. Not only does this mean most pipeline compilation is no longer resolved at drawing time, it can even be done on background threads and presented as part of a regular loading screen. The game is no longer at the mercy of the renderer introducing stutters when it needs to draw; the behavior becomes much more predictable and can be handled as part of the loading process.

The main improvement this PR makes is the reintroduction of ubershaders to the engine, but these are quite different from what was done in Godot 3. Unlike in the previous version of the engine, these shaders are not generated as text with specializations and compiled in the background, which could take a lot of CPU time on weaker systems. Instead, ubershaders are mostly still very similar to the shaders the engine already has, with one key difference: specialization constants are pulled from push constants instead. This means the engine already has a version of the shader that can be used for drawing immediately while the specialized version is generated in the background. Pipeline variants are much faster to generate this way than by relying on runtime shader compilation to insert the constants into the shader text, as they work directly on the SPIR-V and skip the need to compile the shader from text again.

Specialization constants are a big part of how Godot optimizes pipelines, but parts of the design have limited how many can actually be used. Any additional constant implied an explosion of variants that made the pipeline cache structure even bigger (160 KB in just pointers in Forward+ for any single material in master at the moment!), and every new addition meant that, with very dynamic state, stutters would occur due to extra pipeline compilation. This was quite evident in the Mobile renderer, which uses a specialization constant to disable lights if they're not used: as soon as a light popped up, stutters due to pipeline compilation were inevitable.

With this change, a new simple hashing system for pipeline caching is introduced instead:

  • The required pipeline is requested in its specialized form from the cache.
  • If the pipeline is not available yet, compilation starts in the background without stalling the main thread.
  • The ubershader pipeline (which was compiled at loading time) is used instead. The specialization constants are pushed as part of the push constant, along with some other rasterization state parameters (like backface culling).

Pipeline compilation at loading time

The other key part behind the PR is the introduction of pipeline compilation of the ubershaders in two extra steps.

  • During the creation of the surface cache in the renderer (stutter on scene setup).
  • During loading of ArrayMeshes (can be pushed to a background thread).

The difference these two changes make together is pretty evident in the TPS demo when simulating a clean run, as an end user would experience it the first time they run the game. A big chunk of the stutters are gone, especially the one that happens the first time the character shoots, a typical case of a stutter that only happened at drawing time even though the effect was already loaded in the scene tree.

Both of these videos have pipeline caching disabled and the driver cache deleted between each run.

master (dc91479)

Godot.Third-Person.Shooter.Demo.DEBUG.2024-04-08.13-26-46-00.00.03.950-00.00.25.336.mp4

transfer_and_pipelines

2024-04-08.13-28-29-00.00.07.952-00.00.21.936.mp4

It's also worth noting that the loading screen animation actually plays for more of its duration instead of hitting one big stutter at the end due to the initial pipeline compilation at drawing time. These loading times are also significantly shortened by multiple improvements to the behavior of both the shader and pipeline compilers, allowing them to multi-thread more effectively and use more of the system's resources.

The negatives (and how we can mitigate them)

As expected, these benefits don't come for free. But there are multiple ways we can mitigate most of the extra cost; this is an area I'm open to feedback on and one we can further optimize in future versions as well.

  • An extra shader variant had to be introduced. @reduz has always diligently recommended against adding shader variants as they lead to a combinatorial explosion, but this is one sacrifice that had to be made for ubershaders to exist. However, there's potential for converting some of the existing shader variants to dynamic paths in the ubershader where possible. This is also significantly mitigated by improvements to the shader compiler's multithreading, and it should be a non-issue for games that ship the SPIR-V shader cache.
  • Loading times are bound to be longer as pipeline compilation is pushed here instead. This is the intended effect and it's paying a cost upfront that would happen at drawing time otherwise (which is less preferable). That said, pipeline caching always plays a part here and it'll speed up loading times in later runs as it should.
  • Higher memory consumption from extra pipelines and shader variants being compiled that might go unused. This is sadly one cost that must be paid no matter what and can hopefully be mitigated by implementing better detection of features in use.

The biggest reason behind these negatives is the engine's flexibility. Features can be turned on and off without explicit operations from the user at a global level: one scene can be instanced to use VoxelGI while another uses Lightmaps instead. As a matter of fact, this is exactly what the TPS demo does, so any run of the game must pre-compile the Lightmap variants, because the engine can't know ahead of time which method the user has chosen without looking at the scene's contents, which have yet to be instanced during mesh loading.

One of the things I hope to improve while this PR is in progress is reducing the amount of variants that are pre-compiled as much as possible. Therefore it'd be great to gather feedback on which of these methods are most effective and how to implement them:

  • Detecting features that aren't used at the project level would help significantly. If a user never uses VoxelGI, we shouldn't be pre-compiling variants for it. The biggest culprits here that I could identify are features like separate specular and motion vectors. Adding some form of tracking somewhere at a global level so the engine can know ahead of time without having to instance scenes would be very helpful here.
  • Assuming features aren't enabled by default and going back to compile them if they are: this is actually something that's partially implemented with the 'advanced' shader groups already. Upgrading this to a per-feature detection could help a lot towards reducing pre-compilation and delegating it to the surface cache setup. If a developer wishes to properly delegate the pre-compilation during mesh loading, all they need to do is just instance scenes first with the appropriate features.
  • Allowing developers to opt in or out of variants to pre-compile at a global level instead. This is likely a very good solution for more experienced developers to fine-tune their game if it actually has a significant amount of shaders and materials that need it. This would likely be a simple set of toggles indicating which features shouldn't be compiled as the developer knows they'll never make use of them.

It's worth noting that under the current implementation, a false positive in any of these methods won't make the engine misbehave: at worst, it just causes the drawing-time stutters the current version already has.

Testing methodology


  • Delete the driver pipeline cache. This is the driver's last line of defense if the application doesn't implement a pipeline cache of its own. Its location heavily depends on the IHV and the platform (e.g. on Windows with NVIDIA it's located at %LocalAppData%/NVIDIA/GLCache). No test should be considered valid without deleting this cache first.


  • Run the game!

Measuring the results can be tricky, as they depend heavily on the behavior of the project. While the benefits are visually evident in the videos, the cost of pipeline compilation at drawing time is hard to quantify because it presents as stutters scattered all throughout the game rather than in one particular scenario.

New performance monitors

Some new statistics have been added to the performance monitors, which should help verify beyond a shadow of a doubt whether pipeline pre-compilation is working as intended. Four different pipeline compilation sources are identified, and they should help in understanding where an extended loading time or stutter comes from.


Quoted from the documentation added by this PR:

  • RENDERING_INFO_PIPELINE_COMPILATIONS_MESH: Number of pipeline compilations that were triggered by loading meshes. These compilations will show up as longer loading times the first time a user runs the game and the pipeline is required.
  • RENDERING_INFO_PIPELINE_COMPILATIONS_SURFACE: Number of pipeline compilations that were triggered by building the surface cache before rendering the scene. These compilations will show up as a stutter when loading scenes the first time a user runs the game and the pipeline is required.
  • RENDERING_INFO_PIPELINE_COMPILATIONS_DRAW: Number of pipeline compilations that were triggered while drawing the scene. These compilations will show up as stutters during gameplay the first time a user runs the game and the pipeline is required.
  • RENDERING_INFO_PIPELINE_COMPILATIONS_SPECIALIZATION: Number of pipeline compilations that were triggered to optimize the current scene. These compilations are done in the background and should not cause any stutters whatsoever.

bugsquad edit: Fixes #61233

TODO

  • Pass CI and address compatibility breakage (if there's any).
  • Make sure compatibility renderer hasn't had any regressions from modifying the common classes.
  • Address any multi-threading issues that can possibly arise from both transfer queues and this PR.
  • Test D3D12 for any regressions or new issues introduced from multithreading.
  • Evaluate further ways to reduce the amount of pipelines being pre-compiled.
  • Evaluate any possible CPU-time regressions during drawing and how to mitigate them.
  • Evaluate adding this improvement to Canvas Renderer as well.

Contributed by W4 Games. 🍀

@Calinou
Member

Calinou commented Apr 8, 2024

  • Allowing developers to opt in or out of variants to pre-compile at a global level instead. This is likely a very good solution for more experienced developers to fine-tune their game if it actually has a significant amount of shaders and materials that need it. This would likely be a simple set of toggles indicating which features shouldn't be compiled as the developer knows they'll never make use of them.

This resembles godotengine/godot-proposals#5229 and godotengine/godot-proposals#6497 a lot, although I haven't proposed it for VoxelGI and LightmapGI yet as these are not Environment or CameraEffects properties.

If such a setting is disabled, we can assume the user is OK with having runtime shader compilation occur the first time the setting is enabled (since they'll probably be in an options menu while doing so).

@DarioSamo
Contributor Author

DarioSamo commented Apr 11, 2024

Assuming features aren't enabled by default and going back to compile them if they are: this is actually something that's partially implemented with the 'advanced' shader groups already. Upgrading this to a per-feature detection could help a lot towards reducing pre-compilation and delegating it to the surface cache setup. If a developer wishes to properly delegate the pre-compilation during mesh loading, all they need to do is just instance scenes first with the appropriate features.

I gave this a shot and got pretty successful results. The current caveat is that pipeline compilation is less likely to be triggered for resources loaded through a background thread in a loading screen unless the game first instances a scene with the feature used in place. If not, it must defer the compilation to surface cache creation instead.

However the results are pretty good. The pre-compilation on the TPS demo has gone down significantly:


That's around 300 pipelines, down from 650+ pipelines in the OP, pretty much doubling the speed of the initial load in the demo I showcased in the video, and there are still no pipeline stutters during drawing. I haven't detected any regressions from implementing this yet, but looking for edge cases is still worth the effort.

Godot.Third-Person.Shooter.Demo.DEBUG.2024-04-11.13-19-04-00.00.04.468-00.00.11.535.mp4

I still think we could use some global settings to fine-tune the behavior (e.g. automatically detect, always pre-compile, never pre-compile), but this gets us much closer to an ideal level of pre-compilations that I wanted to see from the start.

@DarioSamo
Contributor Author

DarioSamo commented Apr 22, 2024

I investigated Canvas Renderer support and the potential problems we'd have to fix to fully take advantage of it.

First off, the Canvas Renderer does suffer from the exact same problem: pipelines are compiled at drawing time if necessary. However, the total number of pipelines this happens for is fairly small. That said, it's undeniable you can get stutters from behavior such as enabling and disabling lights in proximity to the elements.

I added the entire framework for supporting ubershaders but ultimately left it disabled for now for a few reasons, even though it does work as intended.

  • The number of pipelines that can be pre-compiled from just the shader data is about six, without taking meshes and polygons into account. That's a lot of pipelines to pre-compile when the final amount usually ends up being far less. In one example project, the pre-compiled count was 24 while the specializations were merely 3. That's a lot of added loading time for very little benefit.
  • The checks I added cannot pre-compile pipelines ahead of time for polygons and meshes, as the vertex attribute format needs to be known ahead of time with exact offsets and strides.
  • A real solution would involve some scheme where we detect commands that get added and cached, pass those off to the renderer, and pre-compile pipelines for the cached commands. However, upon experimentation, every hook point I found was called far too often to be beneficial.

For now I'm leaning towards addressing other issues the PR currently has (such as an extra CPU cost due to a mutex I want to avoid), but if anyone has an example of a project that uses lots of different shaders and pipelines, is entirely 2D, and suffers from stutters, that'd provide a good reference to work against.

@DarioSamo DarioSamo force-pushed the transfer_and_pipelines branch 3 times, most recently from 50882fa to c21f062 Compare April 23, 2024 17:58
Member

@Calinou Calinou left a comment


Tested locally with Vulkan Forward+ and Mobile rendering methods, it works as expected. Shader compilation stutter is completely gone in the TPS demo when shooting or destroying an enemy. Runtime performance is identical to master when no shader compilation occurs.

The profilers that track pipeline compilations also work as expected. Docs look good to me as well.

This comes at the cost of slightly longer startup times, but I'd say it's worth it.

Benchmark

PC specifications
  • CPU: Intel Core i9-13900K
  • GPU: NVIDIA GeForce RTX 4090
  • RAM: 64 GB (2×32 GB DDR5-5800 C30)
  • SSD: Solidigm P44 Pro 2 TB
  • OS: Linux (Fedora 39)

Using a Linux x86_64 optimized editor build (with LTO).

Startup + shutdown times when running https://github.com/godotengine/tps-demo's main menu:

Cold driver shader cache

$ hyperfine -iw1 -p "rm -rf ~/.cache/nvidia/GLCache" "bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo --quit" "bin/godot.linuxbsd.editor.x86_64.transfer_and_pipelines --path ~/Documents/Godot/tps-demo --quit"
Benchmark 1: bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo --quit
  Time (mean ± σ):      2.412 s ±  0.029 s    [User: 1.057 s, System: 0.294 s]
  Range (min … max):    2.371 s …  2.463 s    10 runs

Benchmark 2: bin/godot.linuxbsd.editor.x86_64.transfer_and_pipelines --path ~/Documents/Godot/tps-demo --quit
  Time (mean ± σ):      2.555 s ±  0.247 s    [User: 1.418 s, System: 0.318 s]
  Range (min … max):    2.079 s …  2.719 s    10 runs

Warm shader driver cache

$ hyperfine -iw1 "bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo --quit" "bin/godot.linuxbsd.editor.x86_64.transfer_and_pipelines --path ~/Documents/Godot/tps-demo --quit"
Benchmark 1: bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo --quit
  Time (mean ± σ):      2.152 s ±  0.028 s    [User: 0.831 s, System: 0.271 s]
  Range (min … max):    2.126 s …  2.204 s    10 runs

Benchmark 2: bin/godot.linuxbsd.editor.x86_64.transfer_and_pipelines --path ~/Documents/Godot/tps-demo --quit
  Time (mean ± σ):      2.236 s ±  0.039 s    [User: 0.917 s, System: 0.294 s]
  Range (min … max):    2.193 s …  2.320 s    10 runs

Summary
  bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo --quit ran
    1.04 ± 0.02 times faster than bin/godot.linuxbsd.editor.x86_64.transfer_and_pipelines --path ~/Documents/Godot/tps-demo --quit

@Calinou
Member

Calinou commented Apr 30, 2024

PS: I wonder how this will interact with #88199 – does Metal make this approach possible?

@DarioSamo
Contributor Author

DarioSamo commented Apr 30, 2024

PS: I wonder how this will interact with #88199 – does Metal make this approach possible?

Should be completely fine as far as I know, as the PR's approach is completely driver-agnostic. A lot of the changes in this one are basically fixing a lot of code that wasn't thread-safe, so it could expose other bugs if parts of the Metal driver assumed that wouldn't happen (which was a common issue in the D3D12 driver, but easily fixed).

@DarioSamo DarioSamo force-pushed the transfer_and_pipelines branch 3 times, most recently from 755d06e to 7c3a8d1 Compare October 1, 2024 14:52
@DarioSamo
Contributor Author

DarioSamo commented Oct 2, 2024

For folks keeping up to date with this PR, we encountered a few problems that currently make this a bit risky to merge with the Metal backend. I'm unsure at the moment whether the problem originates from the PR itself or just from the fact that the Metal backend has never been put through such heavily multithreaded work in the past. This wouldn't be entirely unexpected, as this PR had to implement multiple fixes to the D3D12 driver to avoid race conditions.

As far as I'm concerned, the PR is done, and it's been stable for us on Windows and Linux so far, but keep it in mind if Mac support is important to you.

I'll attempt to see if the issues on Mac can be identified and solved.

@DarioSamo
Contributor Author

DarioSamo commented Oct 2, 2024

Small update: it seems most of it was related to the default secondary thread stack size being different there and a bit too small for Godot. Increasing it has fixed most of the crashes. I'm still tracking one remaining issue, but the PR seems to be working fine on Mac now.

@DarioSamo DarioSamo force-pushed the transfer_and_pipelines branch 3 times, most recently from 2127d4e to 23468a3 Compare October 2, 2024 17:58
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
Member

@clayjohn clayjohn left a comment


Looks great now! This is the final culmination of a lot of work spread over many months. I am very glad to see it finished.

This is ready to merge, and I suggest we merge it quickly to avoid conflicts.

I have personally tested on many devices including Windows 10, Linux, macOS, and Android. I tested the TPS demo on all platforms, but I also tested the Nuku Warriors demo on Windows and multiple misc demos on Linux. I am confident at this point that this is good enough for merging.

@akien-mga akien-mga merged commit 98deb2a into godotengine:master Oct 3, 2024
19 checks passed
@akien-mga
Member

Amazing work @DarioSamo 🎉
It's absolutely surreal to see demos like the TPS demo finally run stutter free 🤯

@DarioSamo
Contributor Author

DarioSamo commented Oct 3, 2024

Anyone wanting an introduction to this merge can have a look at the tutorial introduced by the PR to the docs here: https://docs.godotengine.org/en/latest/tutorials/performance/pipeline_compilations.html

DarioSamo added a commit to DarioSamo/godot-docs that referenced this pull request Oct 3, 2024
DarioSamo added a commit to DarioSamo/godot-docs that referenced this pull request Oct 3, 2024
@HeadClot

HeadClot commented Oct 4, 2024

Super excited to try this with the XR Editor in Dev 4. Bit of a question, however: does this support the compatibility renderer?

@DarioSamo
Contributor Author

DarioSamo commented Oct 4, 2024

Does this support the compatibility renderer?

I'm afraid it's pretty much not possible by design. Modern APIs like Vulkan are the only ones that provide direct control over creating pipelines, which is what this entire system is designed around.

@Capewearer

This PR could've fixed #95112; it needs further testing.


Successfully merging this pull request may close these issues.

Vulkan: Shader compilation stutter when materials enter the view frustum for the first time (unless cached)