2-3x Performance improvement in screen texture mipmap generation #117339
SoftLattice wants to merge 1 commit into
|
I tried this with Metal, and it currently doesn't compile with the following errors in the logs: I'll see what is going on with SPIRV-Cross |
|
I tried the test project with a godot/core/templates/safe_refcount.h Lines 186 to 189 in c5df0cb |
|
Thanks for checking this! It was really difficult to get stable benchmarks, so the sample project is just a copy / paste of the GLSL files loaded in as compute shaders using the It looks like you tracked down the 4.7 crashing bug, but I rewrote the demo anyway to do the same test using linear recursion. The new attached demo no longer crashes (for me) if you're interested in testing performance. |
|
Thanks @SoftLattice – incidentally, when I fixed SPIRV-Cross, the benchmarks showed no difference on my M4 Max with Metal. |
|
There is a fix for the crash coming in #117053 |
Thank you for testing, @stuartcarnie! Unfortunately I don't have access to a metal device to test myself, but the benchmark uses I was surprised it ran at all, because the |
|
It'd be nice if you also provided the before/after times in milliseconds in the benchmarks instead of just the improvement. |
|
Good idea, @blueskythlikesclouds. Here is a new version. I'll make it a public repo later rather than posting a half dozen copies. |
Does this mean that this optimization is ineffective for compatibility renderer? Is there any way to optimize performance under compatibility renderer? |
Can you update your main post with the results? |
|
This change completely breaks glow. It is not functional at all with this PR. Tested on an Intel ARC A770 with Vulkan |
Unfortunately, this is only for Forward+. Mobile and compatibility use different pipelines that are harder to optimize with subgroups.
Done!
Thanks for testing this @clayjohn ! would you mind cloning this project and providing a screenshot after clicking "Start Test"? I don't have access to an Intel GPU, but I can try to diagnose the problem if I can see the mipmap result. Thanks! |
The results are pretty interesting! My guess is you aren't sharing any of the samples across subgroups, so the result just appears to be a much darker version (since you are still applying weights to each sample). If it helps, here is the device report in the Vulkan hardware database: https://vulkan.gpuinfo.org/displayreport.php?id=46985 edit: same problem with default shader too
|
|
Super helpful @clayjohn ! I pushed some changes to the benchmark project you ran earlier. Would you mind pulling changes and running it again with "glow" and if it's still dark try "safe"? Apparently SIMD lockstep isn't actually in the Khronos spec, so Intel doesn't make the same guarantee as NVIDIA and AMD. So the cache barriers I had might be insufficient on Intel. I increased them to full subgroup execution barriers which should work. The "safe" shader variant in the benchmark puts in full workgroup barriers if the subgroup barriers aren't enough. Thanks again! |
6279ddc to 48e09f7
|
Thank you! You've been so generous with your time. I pushed another change to the benchmark project if you wouldn't mind checking. The motivation and a summary of changes are below. Given it's the same behavior with "safe", my suspicion is that it's a failed loop unroll or an implicit type conversion error (there were statements with [uint] + [int] where the int was negative). That leads to under-accumulation of the Gaussian kernel, which darkens the image, but it would still look same-ish. I now have stricter typing, eliminated [int]s, got rid of index subtractions, and explicitly requested loop unroll depths. I also got rid of alpha in the benchmark (alpha = 1, but RGB are still random) so I can quantitatively compare color in screenshots. |
48e09f7 to aa67887
|
Thank you, thank you! I think I finally figured it out from this output! In your most recent test the output was exactly 1/4th the expected value. This suggests something wrong with the initial texture fetch (which happens in 4 rounds). After careful review, I discovered that In retrospect, this is in the spec, I was just interpreting it wrong. Eventually I found the comments in this example direct from Vulkan and realized it's just a poorly named variable. Anyway, I updated the benchmark. "Glow" and "default" hopefully both work, assuming Hopefully this will be the last time. |
aa67887 to d372e04
|
Incredible, thank you for sticking it out and helping me! If Intel is dispatching subgroup sizes of 8 then it already avoids most of the bank conflicts (current shader has SIMD width 16 conflicts), so the main improvements you'll see are from better texture reads and the skipped execution barrier. It's still a boost though, and it works on the big 3 GPU vendors so I'm satisfied. I updated the PR to use the fixed version, so glow should work as expected once again. |
|
I think the next steps are just to confirm that this works on Apple and Android devices before merging |
|
We'll need to sync the latest SPIRV-Cross, as I submitted a bug report for the incorrect function call (already fixed by Hans-Kristian). I can verify the output matches these; however, I'm not able to verify performance differences, as timestamps aren't implemented for Metal 3. Timestamps work similarly to Vulkan in Metal 4, but I'm still getting incorrect values, so I'll try to fix that to see if I can verify. @SoftLattice perhaps you can adjust the benchmark to not rely on that API, so I can validate? |
|
Great news that SPIRV-Cross was already fixed. I added a toggle (just above the start button) to use CPU timing instead of GPU timestamps. Please pull the latest version of the benchmark repo. I get the same results between the two. I originally favored timestamps because early on I was having frame backup problems on the CPU. Disabling V-Sync cleared that up, but I just never switched back. Let me know if that works for you, @stuartcarnie; hopefully you see an improvement. Lastly, the ~2% error is actually from a clamp miscalculation in the original copy shader code. The out-of-bounds clamp is only done once, meaning OOB pixels actually change value during the 4 texture lookups instead of remaining clamped. Dropping the edge 8 pixels from the comparison gives identical results within floating point precision. |
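The edge-cropping comparison mentioned in the comment above might look like the following Python sketch. It assumes single-channel images stored as lists of rows of floats, and `max_diff_ignoring_border` is a hypothetical helper, not code from the benchmark project; only the 8-pixel border comes from the comment.

```python
def max_diff_ignoring_border(reference, candidate, border=8):
    """Largest absolute difference, skipping `border` pixels on every edge.

    Images are lists of rows of single-channel float values. Returns 0.0
    if the border leaves no interior pixels to compare.
    """
    height = len(reference)
    width = len(reference[0]) if height else 0
    worst = 0.0
    # Only visit pixels that are at least `border` pixels from each edge.
    for y in range(border, height - border):
        for x in range(border, width - border):
            worst = max(worst, abs(reference[y][x] - candidate[y][x]))
    return worst
```

Comparing once with `border=0` and once with `border=8` is one way to check whether a discrepancy is confined to the clamped edge region.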
I presume you are talking about the barriers? I will try your benchmarks to see how it performs and report back. |
I think they mean RenderingDevice::sync() |
Ahh, of course – thanks! |
|
@stuartcarnie , I don't know if you've had luck running the "stress test", but a brief search said Metal forces V-Sync, so that may not even work. In fact, if V-Sync is forced that could explain why the benchmark is failing. When I was running the benchmark with V-Sync enabled, I got frame queue backups because which totally skewed the results. I updated the benchmark project (https://github.com/SoftLattice/optimized-copy-shader-benchmark) repo to save the test-by-test values to a CSV. If you wouldn't mind, I'd be interested in the results there? If there's drift occurring that could explain the problem. It might be that a GPU profiler is the only way to check performance impact on Metal. |
It isn't Metal that forces V-Sync; it's using Core Animation to present a drawable texture to the display. However, on macOS you can disable V-Sync, which we support in Godot. Incidentally, I have added an environment option in Godot (for testing) called GODOT_MTL_OFF_SCREEN=(0|1). If set to 1, Godot renders to an off-screen texture and only displays the output once per second, to completely eliminate V-Sync. So when you disable V-Sync and use this option, it renders as fast as possible.
I'll give it another run |
|
Results (using the Not a significant difference, but it is definitely a bit slower. |
Hmmm, I tested on an M2 macbook and got an improvement
|
|
I'll try on my M1, @clayjohn – how did you resolve the SPIRV-Cross compilation issue, as I haven't opened a PR with an updated version yet. |
I didn't; I just ran https://github.com/SoftLattice/optimized-copy-shader-benchmark using the last build of master that I had lying around |
|
curious how that would work 🤔 |
|
I'll test this PR rather than run the test project and run the benchmark to see if it performs better. |
I agree there is a consistent drop in performance. The only thing that concerns me (which may be moot given the consistency here) is that the first round is clearly faster than the following rounds for both shaders. The first round should be slowest because of warm-up delay (and potentially JIT compilation), and performance should then improve or stay the same in the following rounds. This shows the opposite, which points to RenderingDevice::sync() calls choking the GPU, so timings are unreliable. Given the consistency, a qualitative comparison is likely valid; I'd just like to find a quantitative way to compare them on Metal.
Did SPIRV-Cross only struggle with |
|
We better double check D3D12 too #118113 |
|
Yes! I've verified it works with D3D12, both in 4.6 (as a compute shader) and in the current PR build using glow. Sample benchmark results using timestamps (RTX 4070 laptop GPU running Windows 11):
|












Summary
Improved the copy.glsl shader, which is used in calculating mipmaps of screen textures. Performance increase has been measured at 2x-3x speedup depending on resolution (results in table below).

Motivation
Screen reading textures are the preferred way to perform custom post-processing. The calculation of screen texture mipmaps executes the copy.glsl shader multiple times per frame. As such, optimization to this shader will benefit any project that uses custom post-processing.

The current implementation doesn't take GPU subgroups into consideration. Taking advantage of subgroups, it's possible to avoid execution barriers and eliminate the cache bank collisions found in the current implementation. Making these changes can significantly improve performance (expected 2x, but observed as high as 3x).
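
As a rough illustration of what a cache bank collision is, the following Python sketch models shared memory as 32 banks of 4-byte words (an assumption; bank counts and widths vary by GPU) and counts how many threads of one subgroup land in the same bank for a given access stride. It is a toy model, not the access pattern of copy.glsl, and the names `NUM_BANKS` and `worst_bank_conflict` are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # assumed bank count; real hardware varies


def worst_bank_conflict(subgroup_size, stride, num_banks=NUM_BANKS):
    """Return the largest number of threads that hit the same bank.

    Thread t reads the 4-byte word at index t * stride, and word i lives
    in bank i % num_banks. A result of 1 means the access is conflict-free.
    """
    banks = Counter((t * stride) % num_banks for t in range(subgroup_size))
    return max(banks.values())


if __name__ == "__main__":
    for stride in (1, 2, 16, 17, 32):
        ways = worst_bank_conflict(32, stride)
        print(f"stride {stride:2d}: {ways}-way access")
```

In this model a stride of 16 words serializes into a 16-way conflict, while a stride of 17 is conflict-free, which is why padding a shared array by one element is a common mitigation. Whether the PR uses padding or a different layout is not stated here.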
Changes
Accuracy
Benchmarks were performed executing the existing copy.glsl and the proposed change as standalone compute shaders. Pixel-wise absolute value differences were measured for comparison. Floating point precision differences were observed, but the average per-pixel difference is measured at < 0.01%, and the maximum difference (out of all pixels across all executions in the batch) at < 2%.

Benchmarks were performed on NVIDIA, AMD, and mobile architectures to test different subgroup sizes. Mobile uses a different shader by default, but the shaders can still be benchmarked there by executing them as compute shaders.
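
The pixel-wise comparison described in this section can be sketched in a few lines of Python. This is an assumed reconstruction of the metric (mean and maximum absolute difference over all channels), not the benchmark project's actual code; `compare_images` and the nested-list image layout are hypothetical.

```python
def compare_images(reference, candidate):
    """Return (mean, max) absolute difference between two images.

    Each image is a list of rows, each row a list of pixels, and each
    pixel a tuple of float channel values (assumed to be in [0, 1]).
    Both images must have the same dimensions and at least one pixel.
    """
    total = 0.0
    worst = 0.0
    count = 0
    for ref_row, cand_row in zip(reference, candidate):
        for ref_px, cand_px in zip(ref_row, cand_row):
            for r, c in zip(ref_px, cand_px):
                diff = abs(r - c)
                total += diff
                worst = max(worst, diff)
                count += 1
    return total / count, worst
```

With channel values in [0, 1], multiplying both results by 100 expresses them as a percentage of full scale. Whether the figures quoted above are relative to full scale or to the reference value is not stated.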
Benchmarks
For smaller textures the overhead of kernel execution approaches shader execution time, but even at 8x8 textures a speedup of 1.63x is observed.
Benchmark project can be cloned here.
Notes
servers/rendering/renderer_rd/shaders/effects/copy.glsl