2-3x Performance improvement in screen texture mipmap generation by SoftLattice · Pull Request #117339 · godotengine/godot

SoftLattice · 2026-03-11T21:48:56Z

Summary

Improved the copy.glsl shader which is used in calculating mipmaps of screen textures. Performance increase has been measured at 2x-3x speedup depending on resolution (results in table below).

Motivation

Screen reading textures are the preferred way to perform custom post-processing. The calculation of screen texture mipmaps executes the copy.glsl shader multiple times per frame. As such, optimization to this shader will benefit any project that uses custom post-processing.

The current implementation doesn't take GPU subgroups into consideration. Taking advantage of subgroups, it's possible to avoid execution barriers, and eliminate cache bank collisions found in the current implementation. Making these changes can significantly improve performance (expected 2x, but observed as high as 3x).

Changes

Changed initial texture read to linear index to encourage coalesced texture memory access
Replaced cache read/write pattern to use a shuffled index which avoids bank conflicts
Changed first Gaussian pass to compute samples fetched by the same subgroup (avoiding execution barrier)
Combined "first pass" glow read to initial write, eliminating a secondary cache read/write
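As a rough illustration of the bank-conflict point above, here is a toy Python model of shared-memory banking. It is only a sketch under stated assumptions: the 32-bank layout, the 32x32 tile, and the XOR swizzle are hypothetical stand-ins chosen for clarity, and are not taken from the actual index shuffle used in copy.glsl.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, one 4-byte word wide each
TILE = 32       # assumption: a 32x32 tile held in shared memory


def linear_index(row, col):
    # Plain row-major layout: every element of a column maps to the same bank.
    return row * TILE + col


def swizzled_index(row, col):
    # Hypothetical XOR swizzle: permutes columns within each row so that
    # a column read touches a different bank on every row.
    return row * TILE + (col ^ row)


def worst_conflict(indices):
    # Highest number of simultaneous accesses that land in one bank.
    banks = Counter(i % NUM_BANKS for i in indices)
    return max(banks.values())


column = 5
col_linear = [linear_index(r, column) for r in range(TILE)]
col_swizzled = [swizzled_index(r, column) for r in range(TILE)]

print(worst_conflict(col_linear))    # 32: fully serialized column read
print(worst_conflict(col_swizzled))  # 1: conflict-free column read
```

In this model the swizzle is a bijection on the tile, so no storage is wasted and row reads stay conflict-free as well.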

Accuracy

Benchmarks were performed executing the existing copy.glsl and proposed change as stand alone compute shaders. Pixel-wise absolute value differences were measured for comparison. Floating point precision differences were observed, but average per-pixel difference is measured at < 0.01%, and maximum difference (out of all pixels across all executions in batch) of < 2%.
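The comparison metric described above can be sketched as follows. This is an illustrative Python version, not the benchmark's actual code: the function name `diff_stats`, the sample values, and the choice to express differences as a percentage of a full-scale value of 1.0 are all assumptions.

```python
def diff_stats(reference, candidate, full_scale=1.0):
    """Return (mean, max) absolute difference as a percentage of full_scale."""
    if len(reference) != len(candidate):
        raise ValueError("images must contain the same number of samples")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    mean_pct = 100.0 * sum(diffs) / len(diffs) / full_scale
    max_pct = 100.0 * max(diffs) / full_scale
    return mean_pct, max_pct


# Hypothetical flattened pixel values from the old and new shader outputs.
reference = [0.50, 0.25, 1.00, 0.00]
candidate = [0.50, 0.26, 1.00, 0.00]

mean_pct, max_pct = diff_stats(reference, candidate)
print(f"mean {mean_pct:.2f}%, max {max_pct:.2f}%")
```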

Benchmarks were performed on NVIDIA, AMD, and mobile architectures to test different subgroup sizes. The Mobile renderer uses a different shader by default, but both shaders can still be measured there by executing them as compute shaders.

Benchmarks

GPU Family	Texture Size	Mode	Old Time (us)	New Time (us)	Speedup
NVIDIA	2048	BLUR	26.5	9.4	2.82x
NVIDIA	1024	BLUR	26.2	9.3	2.81x
NVIDIA	512	BLUR	15.2	7.6	2.00x
NVIDIA	1024	GLOW	35.0	12.8	2.73x
AMD	1024	BLUR	13.2	4.3	3.07x
AMD	1024	GLOW	16.0	5.2	3.07x

For smaller textures the overhead of kernel execution approaches shader execution time, but even at 8x8 textures a speedup of 1.63x is observed.

Benchmark project can be cloned here.

Notes

The change is isolated to servers/rendering/renderer_rd/shaders/effects/copy.glsl
No AI was used to develop this code

stuartcarnie · 2026-03-11T22:24:46Z

I tried this with Metal, and it currently doesn't compile with the following errors in the logs:

2026-03-12 09:17:20.931269+1100 Godot[2029:2969149] [ERROR] /Volumes/Data/projects/games/godot/drivers/metal/metal_objects_shared.cpp:849:operator()(): Error compiling shader : program_source:107:79: error: use of undeclared identifier 'thread_scope_subgroup'; did you mean 'thread_scope_simdgroup'?
        atomic_thread_fence(mem_flags::mem_threadgroup, memory_order_seq_cst, thread_scope_subgroup);
                                                                              ^~~~~~~~~~~~~~~~~~~~~
                                                                              thread_scope_simdgroup

I'll see what is going on with SPIRV-Cross

stuartcarnie · 2026-03-11T22:34:52Z

I tried the test project with a DEV_ENABLED build and it is crashing ~~, likely due to threading issues~~:

godot/core/templates/safe_refcount.h

Lines 186 to 189 in c5df0cb

    
    CRASH_COND_MSG(count.get() == 0,
    		"Trying to unreference a SafeRefCount which is already zero is wrong and a symptom of it being misused.\n"
    		"Upon a SafeRefCount reaching zero any object whose lifetime is tied to it, as well as the ref count itself, must be destroyed.\n"
    		"Moreover, to guarantee that, no multiple threads should be racing to do the final unreferencing to zero.");

SoftLattice · 2026-03-12T02:04:24Z

Thanks for checking this!

It was really difficult to get stable benchmarks, so the sample project is just a copy / paste of the GLSL files loaded in as compute shaders using the RenderingDevice interface, so it was actually done in 4.6.1. If you have a better idea on how to benchmark the live shader I'd love to hear it.

It looks like you tracked down the 4.7 crashing bug, but I rewrote the demo anyway to do the same test using linear recursion. The new attached demo no longer crashes (for me) if you're interested in testing performance.

stuartcarnie · 2026-03-12T02:59:18Z

Thanks @SoftLattice – incidentally, when I fixed SPIRV-Cross, the benchmarks showed no difference on my M4 Max with Metal.

stuartcarnie · 2026-03-12T05:30:15Z

There is a fix for the crash coming in #117053

SoftLattice · 2026-03-12T09:28:34Z

Thanks @SoftLattice – incidentally, when I fixed SPIRV-Cross, the benchmarks showed no difference on my M4 Max with Metal.

Thank you for testing, @stuartcarnie!

Unfortunately, I don't have access to a Metal device to test myself, but the benchmark uses RenderingDevice::capture_timestamp to compute timing. Unless I'm mistaken, that's implemented for Vulkan and D3D12, but is just a no-op for Metal.

I was surprised it ran at all, because the RenderingDevice::get_captured_timestamp_gpu_time wouldn't make sense, but it looks like Metal just reports the timestamp index, so the value computed in the benchmark on Metal would appear to be meaningless.

blueskythlikesclouds · 2026-03-12T09:59:49Z

It'd be nice if you also provided the before/after times in milliseconds in the benchmarks instead of just the improvement.

SoftLattice · 2026-03-12T10:35:15Z

Good idea, @blueskythlikesclouds. Here is a new version.

I'll make it a public repo later rather than posting a half dozen copies.

scgm0 · 2026-03-12T10:56:59Z

The change is isolated to servers/rendering/renderer_rd/shaders/effects/copy.glsl

Does this mean that this optimization is ineffective for compatibility renderer? Is there any way to optimize performance under compatibility renderer?

blueskythlikesclouds · 2026-03-12T12:04:22Z

Good idea, @blueskythlikesclouds. Here is a new version.

Can you update your main post with the results?

clayjohn · 2026-03-12T18:14:24Z

This change completely breaks glow. It is not functional at all with this PR.

Tested on an Intel ARC A770 with Vulkan

SoftLattice · 2026-03-13T02:31:30Z

The change is isolated to servers/rendering/renderer_rd/shaders/effects/copy.glsl

Does this mean that this optimization is ineffective for compatibility renderer? Is there any way to optimize performance under compatibility renderer?

Unfortunately, this is only for Forward+. Mobile and compatibility use different pipelines that are harder to optimize with subgroups.

Good idea, @blueskythlikesclouds. Here is a new version.

Can you update your main post with the results?

Done!

This change completely breaks glow. It is not functional at all with this PR.

Tested on an Intel ARC A770 with Vulkan

Thanks for testing this, @clayjohn! Would you mind cloning this project and providing a screenshot after clicking "Start Test"?

I don't have access to an Intel GPU, but I can try to diagnose the problem if I can see the mipmap result. Thanks!

clayjohn · 2026-03-13T19:57:01Z

The results are pretty interesting! My guess is you aren't sharing any of the samples across subgroups and so the result just appears to be a much darker version (since you are still applying weights to each sample)

If it helps, here is the device report in the vulkan hardware database https://vulkan.gpuinfo.org/displayreport.php?id=46985

edit: same problem with default shader too

SoftLattice · 2026-03-14T03:24:16Z

Super helpful, @clayjohn!

I pushed some changes to the benchmark project you ran earlier. Would you mind pulling changes and running it again with "glow" and if it's still dark try "safe"?

Apparently SIMD lockstep isn't actually in the Khronos spec, so Intel doesn't make the same guarantee as NVIDIA and AMD. So the cache barriers I had might be insufficient on Intel.

I increased them to full subgroup execution barriers which should work. The "safe" shader variant in the benchmark puts in full workgroup barriers if the subgroup barriers aren't enough.

Thanks again!

clayjohn · 2026-03-14T23:56:56Z

Running with the new version

With safe mode:

SoftLattice · 2026-03-15T14:14:33Z

Thank you! You've been so generous with your time.

I pushed another change to the benchmark project if you wouldn't mind checking. A motivation / summary of changes below.

Given it's the same behavior with "safe", my suspicion is that it's a failed loop unroll or an implicit type conversion error (there were statements with [uint] + [int] where the int was negative). That leads to under-accumulation of the Gaussian kernel, which darkens the image, but it would still look same-ish.

I now have stricter typing, eliminated [int]s, got rid of index subtractions, and explicitly requested loop unroll depths.
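To make the suspected [uint] + [int] hazard concrete, here is a small Python sketch that emulates 32-bit unsigned arithmetic. The helper names and the 16-wide row are hypothetical, and this only illustrates the general pitfall; the thread later identifies a different root cause for the Intel result.

```python
MASK32 = 0xFFFFFFFF


def to_uint32(value):
    # Emulate converting a signed int to a 32-bit unsigned int.
    return value & MASK32


def clamp_signed(base, offset, size):
    # Signed math: a negative index clamps to the first element.
    return max(0, min(base + offset, size - 1))


def clamp_unsigned(base, offset, size):
    # Unsigned math: the negative offset wraps to a huge value, so the
    # upper clamp picks the last element instead of the first.
    index = (to_uint32(base) + to_uint32(offset)) & MASK32
    return min(index, size - 1)


SIZE = 16  # hypothetical row width
print(clamp_signed(1, -3, SIZE))    # 0
print(clamp_unsigned(1, -3, SIZE))  # 15
print(clamp_unsigned(5, -2, SIZE))  # 3: wraps back into range, so no error
```

The wraparound is harmless whenever the true sum is non-negative, which is why a bug of this kind would only show up near the edges.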

I also got rid of alpha on the benchmark (alpha = 1, but RGB are still random) so I can quantitatively compare color in screenshots.

clayjohn · 2026-03-16T03:43:07Z

No change with the new update

I have also tested on the same device using a Radeon™ 780M Graphics iGPU and everything looks okay. So the issue is specifically with the Intel dGPU

SoftLattice · 2026-03-16T20:58:58Z

Thank you, thank you! I think I finally figured it out from this output!

In your most recent test the output was exactly 1/4th the expected value. This suggests something wrong with the initial texture fetch (which happens in 4 rounds).

After careful review, I discovered that gl_SubgroupSize is not actually the subgroup size (despite the name). In fact, it is typically the max possible subgroup size which is only different from actual subgroup size on Intel GPUs.

In retrospect, this is in the spec, I was just interpreting it wrong. Eventually I found the comments in this example direct from Vulkan and realized it's just a poorly named variable.

Anyway, I updated the benchmark. "Glow" and "default" hopefully both work, assuming gl_NumSubgroups is what it sounds like. If not "trust_ballot" uses ballot voting to identify the subgroup size (at ~5% performance hit) and "trust_nothing" does ballot voting + full barriers.
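The ballot idea can be modelled in a few lines of Python. This is a simulation only, assuming a hypothetical 64-invocation workgroup, a reported size of 32, and 8 lanes actually running together; in GLSL the equivalent measurement would be something like subgroupBallotBitCount(subgroupBallot(true)).

```python
def subgroup_ballot(active_lanes):
    # Build a bitmask with one bit set per participating lane.
    mask = 0
    for lane in active_lanes:
        mask |= 1 << lane
    return mask


def ballot_bit_count(mask):
    # Number of lanes that voted, i.e. the size actually dispatched.
    return bin(mask).count("1")


WORKGROUP_SIZE = 64          # hypothetical workgroup size
REPORTED_SUBGROUP_SIZE = 32  # hypothetical value read from gl_SubgroupSize
ACTUAL_LANES = range(8)      # hypothetical lanes really executing together

measured_size = ballot_bit_count(subgroup_ballot(ACTUAL_LANES))
assumed_subgroups = WORKGROUP_SIZE // REPORTED_SUBGROUP_SIZE
actual_subgroups = WORKGROUP_SIZE // measured_size

print(measured_size, assumed_subgroups, actual_subgroups)  # 8 2 8
```

Any indexing derived from the reported size would then cover the workgroup incorrectly, which is the kind of mismatch described above.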

Hopefully this will be the last time.

clayjohn · 2026-03-16T22:46:33Z

Success! great work. There still appears to be a nice performance improvement

SoftLattice · 2026-03-17T02:12:14Z

Incredible, thank you for sticking it out and helping me!

If Intel is dispatching subgroup sizes of 8 then it already avoids most of the bank conflicts (current shader has SIMD width 16 conflicts), so the main improvements you'll see are from better texture reads and the skipped execution barrier.

It's still a boost though, and it works on the big 3 GPU vendors so I'm satisfied.

I updated the PR to use the fixed version, so glow should work as expected once again.

clayjohn · 2026-03-17T15:53:33Z

I think the next steps are just to confirm that this works on Apple and Android devices before merging

stuartcarnie · 2026-03-17T20:04:47Z

We'll need to sync the latest SPIRV-Cross, as I submitted a bug report for the incorrect function call (already fixed by Hans-Kristian)

MSL: A shader using subgroupMemoryBarrierShared() generates invalid MSL KhronosGroup/SPIRV-Cross#2607

I can verify the output matches these; however, I'm not able to verify performance differences, as timestamps aren't implemented for Metal 3. Timestamps work similarly to Vulkan in Metal 4, but I'm still getting incorrect values, so I'll try to fix that to see if I can verify.

@SoftLattice perhaps you can adjust the benchmark to not rely on that API, so I can validate?

SoftLattice · 2026-03-18T02:21:42Z

Great news that SPIRV-Cross was already fixed.

I added a toggle to use CPU timing instead (just above start button) of GPU timestamps. Please pull the latest version of the benchmark repo.

I get the same results between the two. I originally favored timestamps because early on I was having frame backup problems on CPU. Disabling V-Sync cleared that up, but I just never switched back.

Let me know if that works for you, @stuartcarnie hopefully you see improvement.

Lastly, the ~2% error is actually from a clamp mis-calculation in the original copy shader code. The out of bounds clamp is only done once, meaning OOB pixels actually change value during the 4 texture lookups instead of remaining clamped. Dropping the edge 8 pixels from the comparison gives identical results within floating point precision.

stuartcarnie · 2026-03-18T21:34:58Z

It is working, but I am not seeing an improvement. I'll take a look at how the code is generated in MSL

SoftLattice · 2026-03-21T00:03:40Z

Maybe the sync() is handled differently on Metal? I did have a dip in timing using CPU timing compared to timestamps (2.8x --> 2.6x) but it was still faster. It is concerning it's slower, because the benchmark alternates which shader runs first so both are given equal chance.
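For readers unfamiliar with the alternating-order approach mentioned above, a minimal CPU-side sketch in Python might look like this. The function names and the stand-in workloads are hypothetical; the real benchmark dispatches compute shaders through RenderingDevice rather than timing Python callables.

```python
import time


def alternating_benchmark(run_old, run_new, rounds):
    """Time two callables, swapping which one runs first on every round."""
    old_times, new_times, order_log = [], [], []
    for i in range(rounds):
        pair = [("old", run_old), ("new", run_new)]
        if i % 2 == 1:
            pair.reverse()  # odd rounds run the new variant first
        for name, fn in pair:
            start = time.perf_counter()
            fn()
            elapsed = time.perf_counter() - start
            (old_times if name == "old" else new_times).append(elapsed)
            order_log.append(name)
    return old_times, new_times, order_log


def work_old():
    sum(range(20000))  # stand-in for the existing shader


def work_new():
    sum(range(10000))  # stand-in for the proposed shader


old_t, new_t, order = alternating_benchmark(work_old, work_new, rounds=4)
print(order)
```

Swapping the order each round keeps warm-up and cache effects from consistently favoring one variant.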

Maybe try this, here's a project which simulates (via SubViewports) post processing a 4K game.

stress_test.tar.gz

VSync disabled, and FPS uncapped

I get ~1070 FPS in the trunk build

I get ~1310 FPS in this PR build

This is on an NVIDIA RTX 3080 Ti
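FPS figures at this scale are easier to compare as frame times. The short Python sketch below converts the two numbers quoted above; only the arithmetic is added here, and the variable names are illustrative.

```python
def frame_time_ms(fps):
    # Convert frames per second to milliseconds per frame.
    return 1000.0 / fps


trunk_ms = frame_time_ms(1070)  # trunk build
pr_ms = frame_time_ms(1310)     # PR build
saved_ms = trunk_ms - pr_ms
fps_gain = 1310 / 1070 - 1

print(f"trunk {trunk_ms:.3f} ms, PR {pr_ms:.3f} ms")
print(f"saved {saved_ms:.3f} ms per frame, {fps_gain:.1%} more FPS")
```

That works out to roughly 0.17 ms saved per frame in this stress test.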

stuartcarnie · 2026-03-22T01:58:36Z

Maybe the sync() is handled differently on Metal?

I presume you are talking about the barriers?

I will try your benchmarks to see how it performs and report back.

clayjohn · 2026-03-22T03:11:58Z

Maybe the sync() is handled differently on Metal?

I presume you are talking about the barriers?

I will try your benchmarks to see how it performs and report back.

I think they mean RenderingDevice::sync()

stuartcarnie · 2026-03-22T19:34:02Z

Maybe the sync() is handled differently on Metal?

I presume you are talking about the barriers?
I will try your benchmarks to see how it performs and report back.

I think they mean RenderingDevice::sync()

Ahh, of course – thanks!

SoftLattice · 2026-03-29T14:21:23Z

@stuartcarnie, I don't know if you've had luck running the "stress test", but a brief search said Metal forces V-Sync, so that may not even work.

In fact, if V-Sync is forced, that could explain why the benchmark is failing. When I was running the benchmark with V-Sync enabled, I got frame queue backups, which totally skewed the results.

I updated the benchmark project repo (https://github.com/SoftLattice/optimized-copy-shader-benchmark) to save the test-by-test values to a CSV.

If you wouldn't mind, I'd be interested in the results there? If there's drift occurring that could explain the problem. It might be that a GPU profiler is the only way to check performance impact on Metal.

stuartcarnie · 2026-03-30T19:39:52Z

but a brief search said Metal forces V-Sync, so that may not even work.

It isn't Metal that forces V-Sync; that happens when you use Core Animation to present a drawable texture to the display. However, on macOS you can disable V-Sync, which we support in Godot.

Incidentally, I have added an environment option in Godot (for testing) called GODOT_MTL_OFF_SCREEN=(0|1). If set to 1, Godot renders to an off-screen texture, and only displays the output every 1 second, to completely eliminate V-Sync. So when you disable V-Sync and use this option, it renders as fast as possible.

I updated the benchmark project repo (https://github.com/SoftLattice/optimized-copy-shader-benchmark) to save the test-by-test values to a CSV.

I'll give it another run

stuartcarnie · 2026-03-30T19:46:13Z

Results (using the GODOT_MTL_OFF_SCREEN=1 feature):

╭───┬──────┬───────┬───────╮
│ # │ test │  old  │  new  │
├───┼──────┼───────┼───────┤
│ 0 │    0 │ 30.80 │ 47.95 │
│ 1 │    1 │ 53.10 │ 90.50 │
│ 2 │    2 │ 47.00 │ 75.00 │
│ 3 │    3 │ 53.05 │ 88.15 │
│ 4 │    4 │ 52.00 │ 88.30 │
│ 5 │    5 │ 54.55 │ 92.05 │
│ 6 │    6 │ 47.10 │ 73.75 │
│ 7 │    7 │ 43.85 │ 78.20 │
│ 8 │    8 │ 47.60 │ 73.20 │
╰───┴──────┴───────┴───────╯

Not a significant difference, but it is definitely a bit slower.
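To put a number on the table above, the per-test ratio can be computed with a few lines of Python. This sketch assumes the old and new columns are timings where lower is better, which the comment implies but the table does not state; the unit is not given in the thread.

```python
# Values copied from the table above (unit not stated in the thread).
old = [30.80, 53.10, 47.00, 53.05, 52.00, 54.55, 47.10, 43.85, 47.60]
new = [47.95, 90.50, 75.00, 88.15, 88.30, 92.05, 73.75, 78.20, 73.20]

# Ratio above 1.0 means the new shader took longer than the old one.
ratios = [n / o for o, n in zip(old, new)]
mean_ratio = sum(ratios) / len(ratios)

for test, ratio in enumerate(ratios):
    print(f"test {test}: new/old = {ratio:.2f}")
print(f"mean new/old = {mean_ratio:.2f}")
```

Under that assumption every test lands between roughly 1.5x and 1.8x of the old value.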

clayjohn · 2026-03-30T21:47:52Z

Not a significant difference, but it is definitely a bit slower.

Hmmm, I tested on an M2 macbook and got an improvement

test	old	new
0	600.05	454.65
1	426.15	328.45
2	375.85	380.50
3	467.30	323.05
4	373.85	402.75
5	0.00	0.00
6	0.00	0.00
7	0.00	0.00
8	0.00	0.00

stuartcarnie · 2026-03-30T21:53:00Z

I'll try on my M1, @clayjohn – how did you resolve the SPIRV-Cross compilation issue, as I haven't opened a PR with an updated version yet.

clayjohn · 2026-03-30T21:56:30Z

I'll try on my M1, @clayjohn – how did you resolve the SPIRV-Cross compilation issue, as I haven't opened a PR with an updated version yet.

I didn't, I just ran https://github.com/SoftLattice/optimized-copy-shader-benchmark using the last build of master that I had lying around

stuartcarnie · 2026-03-30T22:00:35Z

curious how that would work 🤔

stuartcarnie · 2026-03-30T22:02:15Z

I'll test this PR rather than run the test project and run the benchmark to see if it performs better.

SoftLattice · 2026-03-31T12:04:09Z

Not a significant difference, but it is definitely a bit slower.

I agree there is a consistent drop in performance. The only thing that concerns me (which may be moot given the consistency here) is that the first round is clearly faster than the following rounds for both shaders.

The first round should be slowest with warmup delay (and potentially JIT compilation). Then performance should improve or stay the same for following rounds. This shows the opposite which points to RenderingDevice::sync() calls choking the GPU, so timings are unreliable.

Given the consistency, qualitative comparison is likely valid, I'd just like to find a quantitative way to compare them on Metal.
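The warm-up observation can be checked mechanically. The Python sketch below uses the Metal results posted earlier in the thread and assumes they are timings (lower is better); the helper name is hypothetical.

```python
def first_round_is_fastest(times):
    # True when round 0 beats every later round, the opposite of what
    # warm-up or JIT compilation overhead would normally produce.
    return times[0] < min(times[1:])


old = [30.80, 53.10, 47.00, 53.05, 52.00, 54.55, 47.10, 43.85, 47.60]
new = [47.95, 90.50, 75.00, 88.15, 88.30, 92.05, 73.75, 78.20, 73.20]

for name, series in (("old", old), ("new", new)):
    later_mean = sum(series[1:]) / len(series[1:])
    print(name, first_round_is_fastest(series), round(later_mean / series[0], 2))
```

Both series report the first round as fastest, which is the pattern that makes the absolute timings suspect.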

curious how that would work 🤔

Did SPIRV-Cross only struggle with subgroupMemoryBarrierShared()? I changed these to full subgroupBarrier() in debugging for Intel, so SPIRV-Cross might work now.

clayjohn · 2026-04-03T20:09:26Z

We'd better double-check D3D12 too #118113

Cc @blueskythlikesclouds

SoftLattice · 2026-04-04T03:06:29Z

Yes! I've verified it works with D3D12, both in 4.6 (as a compute shader) and in the current PR build using glow.

Sample benchmark results using timestamps (RTX 4070 laptop GPU running Windows 11):

test	old	new
0	120.422400	38.297600
1	118.579200	36.352000
2	118.630400	36.300800
3	118.630400	36.249600
4	118.528000	36.249600

SoftLattice requested a review from a team as a code owner March 11, 2026 21:48

This was referenced Mar 11, 2026

MSL: A shader using subgroupMemoryBarrierShared() generates invalid MSL KhronosGroup/SPIRV-Cross#2607

Closed

GDScript: Fix crash when DEV_ENABLED #117346

Closed

Nintorch added enhancement topic:rendering performance labels Mar 12, 2026

Nintorch added this to the 4.x milestone Mar 12, 2026

stuartcarnie mentioned this pull request Mar 12, 2026

GDScript: Fix and simplify coroutine stack clearing #117053

Merged

SoftLattice force-pushed the optimized_copy_shader branch from 6279ddc to 48e09f7 Compare March 14, 2026 18:36

SoftLattice force-pushed the optimized_copy_shader branch from 48e09f7 to aa67887 Compare March 16, 2026 02:29

Optimized copy shader bank read/writes

d372e04

SoftLattice force-pushed the optimized_copy_shader branch from aa67887 to d372e04 Compare March 16, 2026 23:45

clayjohn mentioned this pull request Mar 30, 2026

2x Performance Improvement to Forward+ Auto Exposure #117963

Open
