
2-3x Performance improvement in screen texture mipmap generation #117339

Open: SoftLattice wants to merge 1 commit into godotengine:master from SoftLattice:optimized_copy_shader

@SoftLattice SoftLattice commented Mar 11, 2026

Summary

Improved the copy.glsl shader, which is used to compute mipmaps of screen textures. The measured speedup is 2x-3x depending on resolution (results in the table below).

Motivation

Screen-reading textures are the preferred way to perform custom post-processing. Computing screen texture mipmaps executes the copy.glsl shader multiple times per frame, so optimizing this shader benefits any project that uses custom post-processing.

The current implementation doesn't take GPU subgroups into consideration. By taking advantage of subgroups, it is possible to avoid execution barriers and eliminate the cache bank conflicts found in the current implementation. Making these changes can significantly improve performance (2x expected, but as high as 3x observed).

Changes

  • Changed the initial texture read to use a linear index, to encourage coalesced texture memory access
  • Replaced the cache read/write pattern with a shuffled index, which avoids bank conflicts
  • Changed the first Gaussian pass to compute samples fetched by the same subgroup (avoiding an execution barrier)
  • Combined the "first pass" glow read with the initial write, eliminating a secondary cache read/write
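To illustrate why a shuffled index can remove bank conflicts, here is a minimal CPU-side sketch in Python rather than the shader itself. It assumes 32 four-byte banks and a 16-wide tile, and the per-row rotation used as the "shuffle" is a hypothetical stand-in; the index mapping actually used in copy.glsl may differ.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 four-byte banks in shared memory


def worst_conflict(addresses):
    """Largest number of lanes that hit the same bank in one access."""
    banks = Counter(addr % NUM_BANKS for addr in addresses)
    return max(banks.values())


def column_access(lanes, row_stride, swizzle=False):
    """Word address each lane touches when reading down one column of a tile."""
    addresses = []
    for row in range(lanes):
        col = 0
        if swizzle:
            # Hypothetical shuffle: rotate each row by its own index so that
            # consecutive rows land in different banks.
            col = (col + row) % row_stride
        addresses.append(row * row_stride + col)
    return addresses


naive = column_access(lanes=16, row_stride=16)
shuffled = column_access(lanes=16, row_stride=16, swizzle=True)
print(worst_conflict(naive), worst_conflict(shuffled))
```

In this toy model the unshuffled column read serializes eight lanes on one bank, while the rotated layout gives every lane its own bank.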

Accuracy

Benchmarks ran the existing copy.glsl and the proposed change as standalone compute shaders, and compared their outputs by pixel-wise absolute difference. Floating-point precision differences were observed, but the average per-pixel difference is < 0.01% and the maximum difference (over all pixels across all executions in a batch) is < 2%.
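The comparison described above can be sketched as follows. This is a plain-Python illustration on flat lists of floats, with the function name `compare_images` and the sample values invented for the example; the benchmark project compares GPU textures and may compute the metric differently.

```python
def compare_images(reference, candidate):
    """Return (mean, max) absolute per-sample difference between two images.

    Both images are flat lists of floats in [0, 1].
    """
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of samples")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return sum(diffs) / len(diffs), max(diffs)


# Synthetic data: one sample out of four differs by 0.01.
old_output = [0.25, 0.50, 0.75, 1.00]
new_output = [0.25, 0.50, 0.75, 0.99]
mean_diff, max_diff = compare_images(old_output, new_output)
print(f"mean {mean_diff:.4%}, max {max_diff:.4%}")
```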

Benchmarks were performed on NVIDIA, AMD, and mobile architectures to test different subgroup sizes. Mobile uses a different shader by default, but both shaders can still be measured there by executing them as compute shaders.

Benchmarks

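| GPU Family | Texture Size | Mode | Old Time (µs) | New Time (µs) | Speedup |
|------------|--------------|------|---------------|---------------|---------|
| NVIDIA     | 2048         | BLUR | 26.5          | 9.4           | 2.82x   |
| NVIDIA     | 1024         | BLUR | 26.2          | 9.3           | 2.81x   |
| NVIDIA     | 512          | BLUR | 15.2          | 7.6           | 2.00x   |
| NVIDIA     | 1024         | GLOW | 35.0          | 12.8          | 2.73x   |
| AMD        | 1024         | BLUR | 13.2          | 4.3           | 3.07x   |
| AMD        | 1024         | GLOW | 16.0          | 5.2           | 3.07x   |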

For smaller textures the overhead of kernel execution approaches shader execution time, but even at 8x8 textures a speedup of 1.63x is observed.
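As a quick arithmetic check, the speedup column is simply old time divided by new time. The sketch below recomputes it from the rounded times in the table; because those times are rounded to one decimal place, the last digit can differ slightly from the reported value (the AMD GLOW row, for instance).

```python
# (GPU family, texture size, mode, old time in µs, new time in µs), copied from the table.
RESULTS = [
    ("NVIDIA", 2048, "BLUR", 26.5, 9.4),
    ("NVIDIA", 1024, "BLUR", 26.2, 9.3),
    ("NVIDIA", 512, "BLUR", 15.2, 7.6),
    ("NVIDIA", 1024, "GLOW", 35.0, 12.8),
    ("AMD", 1024, "BLUR", 13.2, 4.3),
    ("AMD", 1024, "GLOW", 16.0, 5.2),
]


def speedup(old_us, new_us):
    """How many times faster the new shader is."""
    return old_us / new_us


for gpu, size, mode, old_us, new_us in RESULTS:
    print(f"{gpu:6} {size:4} {mode}: {speedup(old_us, new_us):.2f}x")
```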

Benchmark project can be cloned here.

Notes

  • The change is isolated to servers/rendering/renderer_rd/shaders/effects/copy.glsl
  • No AI was used to develop this code

@SoftLattice SoftLattice requested a review from a team as a code owner March 11, 2026 21:48
@stuartcarnie
Contributor

I tried this with Metal, and it currently doesn't compile with the following errors in the logs:

2026-03-12 09:17:20.931269+1100 Godot[2029:2969149] [ERROR] /Volumes/Data/projects/games/godot/drivers/metal/metal_objects_shared.cpp:849:operator()(): Error compiling shader : program_source:107:79: error: use of undeclared identifier 'thread_scope_subgroup'; did you mean 'thread_scope_simdgroup'?
        atomic_thread_fence(mem_flags::mem_threadgroup, memory_order_seq_cst, thread_scope_subgroup);
                                                                              ^~~~~~~~~~~~~~~~~~~~~
                                                                              thread_scope_simdgroup

I'll see what is going on with SPIRV-Cross

@stuartcarnie
Contributor

stuartcarnie commented Mar 11, 2026

I tried the test project with a DEV_ENABLED build and it is crashing, likely due to threading issues:

CRASH_COND_MSG(count.get() == 0,
"Trying to unreference a SafeRefCount which is already zero is wrong and a symptom of it being misused.\n"
"Upon a SafeRefCount reaching zero any object whose lifetime is tied to it, as well as the ref count itself, must be destroyed.\n"
"Moreover, to guarantee that, no multiple threads should be racing to do the final unreferencing to zero.");

@SoftLattice
Author

Thanks for checking this!

It was really difficult to get stable benchmarks, so the sample project is just a copy/paste of the GLSL files, loaded as compute shaders through the RenderingDevice interface; it was actually done in 4.6.1. If you have a better idea for how to benchmark the live shader, I'd love to hear it.

It looks like you tracked down the 4.7 crashing bug, but I rewrote the demo anyway to do the same test using linear recursion. The new attached demo no longer crashes (for me) if you're interested in testing performance.

@stuartcarnie
Contributor

Thanks @SoftLattice – incidentally, when I fixed SPIRV-Cross, the benchmarks showed no difference on my M4 Max with Metal.

@stuartcarnie
Contributor

There is a fix for the crash coming in #117053

@SoftLattice
Author

SoftLattice commented Mar 12, 2026

> Thanks @SoftLattice – incidentally, when I fixed SPIRV-Cross, the benchmarks showed no difference on my M4 Max with Metal.

Thank you for testing, @stuartcarnie!

Unfortunately, I don't have access to a Metal device to test myself, but the benchmark uses RenderingDevice::capture_timestamp to compute timing. Unless I'm mistaken, that's implemented for Vulkan and D3D12, but is just a no-op for Metal.

I was surprised it ran at all, because RenderingDevice::get_captured_timestamp_gpu_time wouldn't make sense there, but it looks like Metal just reports the timestamp index, so the value computed in the benchmark on Metal would appear to be meaningless.

@blueskythlikesclouds
Member

blueskythlikesclouds commented Mar 12, 2026

It'd be nice if you also provided the before/after times in milliseconds in the benchmarks instead of just the improvement.

@SoftLattice
Author

Good idea, @blueskythlikesclouds. Here is a new version.

I'll make it a public repo later rather than posting a half dozen copies.

@scgm0
Contributor

scgm0 commented Mar 12, 2026

>   • The change is isolated to servers/rendering/renderer_rd/shaders/effects/copy.glsl

Does this mean that this optimization is ineffective for compatibility renderer? Is there any way to optimize performance under compatibility renderer?

@blueskythlikesclouds
Member

> Good idea, @blueskythlikesclouds. Here is a new version.

Can you update your main post with the results?

@clayjohn
Member

This change completely breaks glow. It is not functional at all with this PR.

Tested on an Intel ARC A770 with Vulkan

@SoftLattice
Author

>   • The change is isolated to servers/rendering/renderer_rd/shaders/effects/copy.glsl
>
> Does this mean that this optimization is ineffective for compatibility renderer? Is there any way to optimize performance under compatibility renderer?

Unfortunately, this is only for Forward+. Mobile and Compatibility use different pipelines that are harder to optimize with subgroups.

> > Good idea, @blueskythlikesclouds. Here is a new version.
>
> Can you update your main post with the results?

Done!

> This change completely breaks glow. It is not functional at all with this PR.
>
> Tested on an Intel ARC A770 with Vulkan

Thanks for testing this, @clayjohn! Would you mind cloning this project and providing a screenshot after clicking "Start Test"?

I don't have access to an Intel GPU, but I can try to diagnose the problem if I can see the mipmap result. Thanks!

@clayjohn
Member

clayjohn commented Mar 13, 2026

image

The results are pretty interesting! My guess is that you aren't sharing any of the samples across subgroups, so the result just appears to be a much darker version (since you are still applying weights to each sample).

If it helps, here is the device report in the Vulkan Hardware Database: https://vulkan.gpuinfo.org/displayreport.php?id=46985

edit: same problem with default shader too

image

@SoftLattice
Author

Super helpful, @clayjohn!

I pushed some changes to the benchmark project you ran earlier. Would you mind pulling changes and running it again with "glow" and if it's still dark try "safe"?

Apparently SIMD lockstep isn't actually in the Khronos spec, so Intel doesn't make the same guarantee as NVIDIA and AMD. So the cache barriers I had might be insufficient on Intel.

I increased them to full subgroup execution barriers which should work. The "safe" shader variant in the benchmark puts in full workgroup barriers if the subgroup barriers aren't enough.

Thanks again!

@SoftLattice SoftLattice force-pushed the optimized_copy_shader branch from 6279ddc to 48e09f7 Compare March 14, 2026 18:36
@clayjohn
Member

Running with the new version

image

With safe mode:

image

@SoftLattice
Author

Thank you! You've been so generous with your time.

I pushed another change to the benchmark project, if you wouldn't mind checking. The motivation and a summary of the changes are below.

Given it's the same behavior with "safe", my suspicion is that it's a failed loop unroll or an implicit type conversion error (there were statements with [uint] + [int] where the int was negative). That leads to under-accumulation of the Gaussian kernel, which darkens the image, but it would still look same-ish.

I now have stricter typing, eliminated [int]s, got rid of index subtractions, and explicitly requested loop unroll depths.
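The suspected failure mode can be illustrated with a small Python model of 32-bit unsigned arithmetic. This is only a sketch of the hypothesis stated above, not a confirmed cause: the 3-tap kernel, the bounds check, and the helper names are all hypothetical and do not come from copy.glsl.

```python
UINT_MASK = 0xFFFFFFFF  # 32-bit unsigned wrap-around


def uint_add(a, b):
    """Add the way a 32-bit unsigned expression would: negative operands wrap."""
    return (a + b) & UINT_MASK


def accumulate(pixels, weights, center, wrap_unsigned):
    """Blur one output pixel; taps that fail the bounds check contribute nothing."""
    radius = len(weights) // 2
    total = 0.0
    for tap, weight in enumerate(weights):
        offset = tap - radius  # negative for the left-hand taps
        if wrap_unsigned:
            index = uint_add(center, offset)
        else:
            index = min(max(center + offset, 0), len(pixels) - 1)  # clamp to edge
        if index < len(pixels):
            total += weight * pixels[index]
    return total


WEIGHTS = [0.25, 0.5, 0.25]  # hypothetical 3-tap kernel, sums to 1
WHITE = [1.0] * 8            # uniform image: a correct blur must return 1.0

print(accumulate(WHITE, WEIGHTS, 0, wrap_unsigned=False))
print(accumulate(WHITE, WEIGHTS, 0, wrap_unsigned=True))
```

With the wrapped index, the left-hand tap at the edge lands far outside the array and is dropped, so the kernel sums to less than 1 and the pixel comes out darker while keeping roughly the right shape.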

I also got rid of alpha on the benchmark (alpha = 1, but RGB are still random) so I can quantitatively compare color in screenshots.

@SoftLattice SoftLattice force-pushed the optimized_copy_shader branch from 48e09f7 to aa67887 Compare March 16, 2026 02:29
@clayjohn
Member

clayjohn commented Mar 16, 2026

No change with the new update

image

I have also tested on the same device using a Radeon™ 780M Graphics iGPU and everything looks okay. So the issue is specifically with the Intel dGPU

@SoftLattice
Author

Thank you, thank you! I think I finally figured it out from this output!

In your most recent test the output was exactly 1/4th the expected value. This suggests something wrong with the initial texture fetch (which happens in 4 rounds).

After careful review, I discovered that gl_SubgroupSize is not actually the subgroup size (despite the name). In fact, it is typically the max possible subgroup size which is only different from actual subgroup size on Intel GPUs.

In retrospect, this is in the spec; I was just interpreting it wrong. Eventually I found the comments in this example direct from Vulkan and realized it's just a poorly named variable.

Anyway, I updated the benchmark. "Glow" and "default" hopefully both work, assuming gl_NumSubgroups is what it sounds like. If not, "trust_ballot" uses ballot voting to identify the subgroup size (at a ~5% performance hit) and "trust_nothing" does ballot voting + full barriers.
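A toy Python model shows how an overestimated subgroup size could produce exactly one quarter of the expected result. It assumes a hypothetical fetch scheme in which each lane loads one texel per round and strides by the assumed subgroup size; the real indexing in the shader is not shown in this thread and may differ. The ballot helper likewise only mimics counting set bits in a lane mask.

```python
def texels_loaded(actual_lanes, assumed_lanes, rounds=4):
    """Texels one subgroup fetches when it strides by `assumed_lanes` per round."""
    loaded = set()
    for round_index in range(rounds):
        for lane in range(actual_lanes):  # only lanes that really exist do work
            loaded.add(round_index * assumed_lanes + lane)
    return loaded


def coverage(actual_lanes, assumed_lanes, rounds=4):
    """Fraction of the tile (rounds * assumed_lanes texels) that gets written."""
    tile = rounds * assumed_lanes
    return len(texels_loaded(actual_lanes, assumed_lanes, rounds)) / tile


def subgroup_size_from_ballot(ballot_mask):
    """Count set bits in a mask with one bit per active lane."""
    return bin(ballot_mask).count("1")


print(coverage(actual_lanes=32, assumed_lanes=32))
print(coverage(actual_lanes=8, assumed_lanes=32))
print(subgroup_size_from_ballot(0xFF))
```

When the assumed size matches the real one the tile is fully covered; with 8 real lanes and an assumed size of 32, only a quarter of the texels are fetched, and counting ballot bits recovers the real size of 8.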

Hopefully this will be the last time.

@clayjohn
Member

Success! Great work. There still appears to be a nice performance improvement.

[Four screenshots, taken 2026-03-16]

@SoftLattice SoftLattice force-pushed the optimized_copy_shader branch from aa67887 to d372e04 Compare March 16, 2026 23:45
@SoftLattice
Author

Incredible, thank you for sticking it out and helping me!

If Intel is dispatching subgroup sizes of 8 then it already avoids most of the bank conflicts (current shader has SIMD width 16 conflicts), so the main improvements you'll see are from better texture reads and the skipped execution barrier.

It's still a boost though, and it works on the big 3 GPU vendors so I'm satisfied.

I updated the PR to use the fixed version, so glow should work as expected once again.

@clayjohn
Member

I think the next steps are just to confirm that this works on Apple and Android devices before merging

@stuartcarnie
Contributor

We'll need to sync the latest SPIRV-Cross, as I submitted a bug report for the incorrect function call (already fixed by Hans-Kristian)

I can verify the output matches these; however, I'm not able to verify performance differences, as timestamps aren't implemented for Metal 3. Timestamps work similarly to Vulkan in Metal 4, but I'm still getting incorrect values, so I'll try to fix that to see if I can verify.

@SoftLattice perhaps you can adjust the benchmark to not rely on that API, so I can validate?

@SoftLattice
Author

Great news that SPIRV-Cross was already fixed.

I added a toggle (just above the start button) to use CPU timing instead of GPU timestamps. Please pull the latest version of the benchmark repo.

I get the same results between the two. I originally favored timestamps because early on I was having frame backup problems on CPU. Disabling V-Sync cleared that up, but I just never switched back.

Let me know if that works for you, @stuartcarnie; hopefully you see an improvement.

Lastly, the ~2% error is actually from a clamp mis-calculation in the original copy shader code. The out of bounds clamp is only done once, meaning OOB pixels actually change value during the 4 texture lookups instead of remaining clamped. Dropping the edge 8 pixels from the comparison gives identical results within floating point precision.
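Excluding a border from the comparison can be sketched like this. The function name, the 32x32 synthetic images, and the injected edge mismatches are all made up for illustration; only the idea of skipping the outer 8 pixels comes from the comment above.

```python
def max_abs_diff(reference, candidate, border=0):
    """Largest per-pixel difference, ignoring `border` pixels on every edge."""
    height, width = len(reference), len(reference[0])
    worst = 0.0
    for y in range(border, height - border):
        for x in range(border, width - border):
            worst = max(worst, abs(reference[y][x] - candidate[y][x]))
    return worst


SIZE = 32
old_image = [[0.5] * SIZE for _ in range(SIZE)]
new_image = [row[:] for row in old_image]
new_image[0][3] = 0.51    # synthetic mismatch on the top edge
new_image[20][31] = 0.49  # synthetic mismatch on the right edge

print(max_abs_diff(old_image, new_image))
print(max_abs_diff(old_image, new_image, border=8))
```

Here the full comparison reports the edge mismatch, while the comparison with `border=8` sees only the interior, where the two images agree.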

@stuartcarnie
Contributor

[Screenshot, 2026-03-19]

It is working, but I am not seeing an improvement. I'll take a look at how the code is generated in MSL.

@SoftLattice
Author

Maybe the sync() is handled differently on Metal? I did have a dip in timing using CPU timing compared to timestamps (2.8x --> 2.6x) but it was still faster. It is concerning it's slower, because the benchmark alternates which shader runs first so both are given equal chance.
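For reference, alternating which candidate runs first with CPU-side timing can be sketched as below. This is a generic Python illustration, not the benchmark project's code: the two lambdas are placeholder workloads standing in for "dispatch the shader and wait for the GPU", and all names are hypothetical.

```python
import time


def time_call(func):
    """Wall-clock duration of one call, in microseconds."""
    start = time.perf_counter()
    func()
    return (time.perf_counter() - start) * 1e6


def alternating_benchmark(old_func, new_func, rounds):
    """Time both candidates each round, swapping which one runs first."""
    totals = {"old": 0.0, "new": 0.0}
    order_log = []
    for round_index in range(rounds):
        pair = [("old", old_func), ("new", new_func)]
        if round_index % 2 == 1:
            pair.reverse()  # odd rounds start with the new candidate
        order_log.append(pair[0][0])
        for name, func in pair:
            totals[name] += time_call(func)
    averages = {name: total / rounds for name, total in totals.items()}
    return averages, order_log


# Placeholder workloads; a real benchmark would submit GPU work and sync here.
averages, order = alternating_benchmark(
    lambda: sum(range(20_000)), lambda: sum(range(10_000)), rounds=4
)
print(averages, order)
```

Swapping the order each round means any warm-up or queueing cost that falls on whichever candidate runs first is shared evenly between the two.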

Maybe try this: here's a project that simulates (via SubViewports) post-processing a 4K game.

stress_test.tar.gz

VSync disabled, and FPS uncapped

I get ~1070 FPS in the trunk build

Trunk_Result

I get ~1310 FPS in this PR build

PR_Result

This is on an NVIDIA RTX 3080 Ti
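Converting the two FPS figures above to frame times makes the saving per frame easier to read. The snippet is plain arithmetic on the approximate values quoted in this comment.

```python
def frame_time_ms(fps):
    """Average time per frame in milliseconds."""
    return 1000.0 / fps


trunk_fps, pr_fps = 1070, 1310
saved_ms = frame_time_ms(trunk_fps) - frame_time_ms(pr_fps)

print(f"trunk {frame_time_ms(trunk_fps):.3f} ms, PR {frame_time_ms(pr_fps):.3f} ms")
print(f"saved {saved_ms:.3f} ms per frame, throughput gain {pr_fps / trunk_fps - 1:.1%}")
```

That works out to roughly 0.17 ms saved per frame, or about 22% more frames per second in this stress test.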

@stuartcarnie
Contributor

> Maybe the sync() is handled differently on Metal?

I presume you are talking about the barriers?

I will try your benchmarks to see how it performs and report back.

@clayjohn
Member

> > Maybe the sync() is handled differently on Metal?
>
> I presume you are talking about the barriers?
>
> I will try your benchmarks to see how it performs and report back.

I think they mean RenderingDevice::sync()

@stuartcarnie
Contributor

> > > Maybe the sync() is handled differently on Metal?
> >
> > I presume you are talking about the barriers?
> > I will try your benchmarks to see how it performs and report back.
>
> I think they mean RenderingDevice::sync()

Ahh, of course – thanks!

@SoftLattice
Author

@stuartcarnie, I don't know if you've had luck running the "stress test", but a brief search said Metal forces V-Sync, so that may not even work.

In fact, if V-Sync is forced, that could explain why the benchmark is failing. When I was running the benchmark with V-Sync enabled, I got frame queue backups, which totally skewed the results.

I updated the benchmark project (https://github.com/SoftLattice/optimized-copy-shader-benchmark) repo to save the test-by-test values to a CSV.

If you wouldn't mind, I'd be interested in the results there. If drift is occurring, that could explain the problem. It might be that a GPU profiler is the only way to check the performance impact on Metal.

@stuartcarnie
Contributor

> but a brief search said Metal forces V-Sync, so that may not even work.

It isn't Metal that forces V-Sync; that happens when you use Core Animation to present a drawable texture to the display. However, on macOS you can disable V-Sync, which we support in Godot.

Incidentally, I have added an environment option in Godot (for testing) called GODOT_MTL_OFF_SCREEN=(0|1). If set to 1, Godot renders to an off-screen texture, and only displays the output every 1 second, to completely eliminate V-Sync. So when you disable V-Sync and use this option, it renders as fast as possible.

> I updated the benchmark project (https://github.com/SoftLattice/optimized-copy-shader-benchmark) repo to save the test-by-test values to a CSV.

I'll give it another run

@stuartcarnie
Contributor

Results (using the GODOT_MTL_OFF_SCREEN=1 feature):

╭───┬──────┬───────┬───────╮
│ # │ test │  old  │  new  │
├───┼──────┼───────┼───────┤
│ 0 │    0 │ 30.80 │ 47.95 │
│ 1 │    1 │ 53.10 │ 90.50 │
│ 2 │    2 │ 47.00 │ 75.00 │
│ 3 │    3 │ 53.05 │ 88.15 │
│ 4 │    4 │ 52.00 │ 88.30 │
│ 5 │    5 │ 54.55 │ 92.05 │
│ 6 │    6 │ 47.10 │ 73.75 │
│ 7 │    7 │ 43.85 │ 78.20 │
│ 8 │    8 │ 47.60 │ 73.20 │
╰───┴──────┴───────┴───────╯

Not a significant difference, but it is definitely a bit slower.

@clayjohn
Member

> Results (using the GODOT_MTL_OFF_SCREEN=1 feature):
>
> ╭───┬──────┬───────┬───────╮
> │ # │ test │  old  │  new  │
> ├───┼──────┼───────┼───────┤
> │ 0 │    0 │ 30.80 │ 47.95 │
> │ 1 │    1 │ 53.10 │ 90.50 │
> │ 2 │    2 │ 47.00 │ 75.00 │
> │ 3 │    3 │ 53.05 │ 88.15 │
> │ 4 │    4 │ 52.00 │ 88.30 │
> │ 5 │    5 │ 54.55 │ 92.05 │
> │ 6 │    6 │ 47.10 │ 73.75 │
> │ 7 │    7 │ 43.85 │ 78.20 │
> │ 8 │    8 │ 47.60 │ 73.20 │
> ╰───┴──────┴───────┴───────╯
>
> Not a significant difference, but it is definitely a bit slower.

Hmmm, I tested on an M2 MacBook and got an improvement:

test old new
0 600.05 454.65
1 426.15 328.45
2 375.85 380.50
3 467.30 323.05
4 373.85 402.75
5 0.00 0.00
6 0.00 0.00
7 0.00 0.00
8 0.00 0.00

@stuartcarnie
Contributor

I'll try on my M1. @clayjohn, how did you resolve the SPIRV-Cross compilation issue? I haven't opened a PR with an updated version yet.

@clayjohn
Member

> I'll try on my M1. @clayjohn, how did you resolve the SPIRV-Cross compilation issue? I haven't opened a PR with an updated version yet.

I didn't; I just ran https://github.com/SoftLattice/optimized-copy-shader-benchmark using the last build of master that I had lying around.

@stuartcarnie
Contributor

curious how that would work 🤔

@stuartcarnie
Contributor

I'll test this PR itself, rather than the test project, and run the benchmark to see if it performs better.

@SoftLattice
Author

> Results (using the GODOT_MTL_OFF_SCREEN=1 feature):
>
> ╭───┬──────┬───────┬───────╮
> │ # │ test │  old  │  new  │
> ├───┼──────┼───────┼───────┤
> │ 0 │    0 │ 30.80 │ 47.95 │
> │ 1 │    1 │ 53.10 │ 90.50 │
> │ 2 │    2 │ 47.00 │ 75.00 │
> │ 3 │    3 │ 53.05 │ 88.15 │
> │ 4 │    4 │ 52.00 │ 88.30 │
> │ 5 │    5 │ 54.55 │ 92.05 │
> │ 6 │    6 │ 47.10 │ 73.75 │
> │ 7 │    7 │ 43.85 │ 78.20 │
> │ 8 │    8 │ 47.60 │ 73.20 │
> ╰───┴──────┴───────┴───────╯
>
> Not a significant difference, but it is definitely a bit slower.

I agree there is a consistent drop in performance. The only thing that concerns me (which may be moot given the consistency here) is that the first round is clearly faster than the following rounds for both shaders.

The first round should be the slowest, given warm-up delay (and potentially JIT compilation); performance should then improve or stay the same in the following rounds. This shows the opposite, which points to RenderingDevice::sync() calls choking the GPU, so the timings are unreliable.

Given the consistency, a qualitative comparison is likely valid; I'd just like to find a quantitative way to compare them on Metal.
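Both observations can be checked directly from the quoted numbers. The sketch below is plain Python on the values from the table above, in whatever unit the benchmark CSV reports; the helper names are made up for this example.

```python
from statistics import mean

# Per-round values copied from the quoted table.
OLD = [30.80, 53.10, 47.00, 53.05, 52.00, 54.55, 47.10, 43.85, 47.60]
NEW = [47.95, 90.50, 75.00, 88.15, 88.30, 92.05, 73.75, 78.20, 73.20]


def first_round_vs_rest(samples):
    """Ratio of round 0 to the mean of the later rounds (< 1: round 0 was faster)."""
    return samples[0] / mean(samples[1:])


def per_round_ratio(old, new):
    """new / old for each round (> 1: the new shader took longer)."""
    return [n / o for o, n in zip(old, new)]


print(f"old: first/rest = {first_round_vs_rest(OLD):.2f}")
print(f"new: first/rest = {first_round_vs_rest(NEW):.2f}")
print([round(r, 2) for r in per_round_ratio(OLD, NEW)])
```

For both shaders the first round is well below the average of the later rounds, and the new/old ratio stays between roughly 1.5 and 1.8 in every round.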

> curious how that would work 🤔

Did SPIRV-Cross only struggle with subgroupMemoryBarrierShared()? I changed these to full subgroupBarrier() in debugging for Intel, so SPIRV-Cross might work now.

@clayjohn
Member

clayjohn commented Apr 3, 2026

We'd better double-check D3D12 too: #118113

Cc @blueskythlikesclouds

@SoftLattice
Author

Yes! I've verified it works with D3D12, both in 4.6 (as a compute shader) and in the current PR build using glow.

Sample benchmark results using timestamps (RTX 4070 laptop GPU running Windows 11):

test old new
0 120.422400 38.297600
1 118.579200 36.352000
2 118.630400 36.300800
3 118.630400 36.249600
4 118.528000 36.249600
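A short summary of these five rounds, as a sketch in Python using only the values listed above (the unit is whatever the benchmark reports for timestamps):

```python
from statistics import mean, pstdev

# Per-round values copied from the table above.
OLD = [120.4224, 118.5792, 118.6304, 118.6304, 118.5280]
NEW = [38.2976, 36.3520, 36.3008, 36.2496, 36.2496]


def summarize(samples):
    """Mean and population standard deviation of the rounds."""
    return mean(samples), pstdev(samples)


old_mean, old_spread = summarize(OLD)
new_mean, new_spread = summarize(NEW)
print(f"old {old_mean:.2f} +/- {old_spread:.2f}, new {new_mean:.2f} +/- {new_spread:.2f}")
print(f"speedup {old_mean / new_mean:.2f}x")
```

The spread across rounds is under one unit for both shaders, and the mean ratio comes out a little above 3.2x.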
