Add Persistent Buffers utilizing UMA #111183

stuartcarnie · 2025-10-02T22:13:31Z

Important

This is a rebase of @darksylinc's #103779 to resolve all conflicts. I've re-run the TPS demo and all is working under Metal. As there were no conflicts with Vulkan and D3D12, I'd presume those still work, but they will need to be tested.

All push constant usage has been removed and can be revisited in a future PR.

This work is heavily refactored and rewritten from TheForge's initial code.

TheForge's original code had too many race conditions and was fundamentally flawed, as it was too easy to run into those data races by accident.

However, they identified the proper places that needed changes, and the idea was sound. I used their work as a blueprint to design this work.

This PR implements:

Introduces UMA buffers, used for a few buffers, most notably the ones filled by _fill_instance_data
The 2D instance buffer uses UMA, pulled from "2d: Use dynamic buffers for instance data" #104566
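
As an illustration of the general idea behind persistent buffers, here is a minimal sketch, not the code in this PR. It assumes a CPU-visible buffer that stays mapped for its whole lifetime and is split into one slice per frame in flight, so the CPU never writes into a slice the GPU may still be reading. The class name `PersistentRingBuffer` and its methods are hypothetical, and a `std::vector` stands in for the mapped GPU memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a persistently mapped, CPU-visible GPU buffer.
// A plain std::vector plays the role of the mapped memory here.
class PersistentRingBuffer {
public:
	PersistentRingBuffer(size_t p_slice_size, uint32_t p_frames_in_flight) :
			slice_size(p_slice_size),
			frame_count(p_frames_in_flight),
			storage(p_slice_size * p_frames_in_flight) {
		assert(p_slice_size > 0 && p_frames_in_flight > 0);
	}

	// Advance to the next slice. The caller must guarantee (e.g. with a fence)
	// that the GPU has finished reading the slice that is about to be reused.
	void begin_frame() {
		current = (current + 1) % frame_count;
		write_offset = 0;
	}

	// Copy data straight into the current frame's slice and return the byte
	// offset (from the start of the buffer) that the GPU should read from.
	// Returns SIZE_MAX if the slice has no room left.
	size_t write(const void *p_data, size_t p_size) {
		if (write_offset + p_size > slice_size) {
			return SIZE_MAX;
		}
		const size_t offset = size_t(current) * slice_size + write_offset;
		std::memcpy(storage.data() + offset, p_data, p_size);
		write_offset += p_size;
		return offset;
	}

	const uint8_t *data() const { return storage.data(); }

private:
	size_t slice_size;
	uint32_t frame_count;
	std::vector<uint8_t> storage;
	uint32_t current = 0;
	size_t write_offset = 0;
};
```

With three frames in flight and 64-byte slices, the first `begin_frame()` selects slice 1, so the first write lands at offset 64. The write is a single `memcpy` into memory the GPU can read, with no separate staging copy.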

Ironically, this change seems to benefit PC more than it does Mobile:
Bugsquad edit: These numbers reflect a much earlier draft of this PR

Device	Before (ms)	After (ms)	Improvement
Workstation	0.622	0.606	2.60%
Laptop	7.326	7.220	1.44%
Adreno 640	70.833	70.819	0.02%
Mali-G68	48.894	49.138	-0.50%

Notes:

This is the modified TPS Demo
This is the Mobile backend. The improvements also apply to Clustered, but they are harder to see because Clustered is much more GPU-bound than Mobile.
Some of these numbers are unimpressive because, as a result of this PR, "Fix inefficient upload in Mobile Shadows" #103531 was submitted. This made "before" faster, as the bulk of the performance difference is in _fill_instance_data. The more objects there are in the scene, the bigger the impact.
For some reason, before rebasing, Adreno 640 "after" got 67.931 ms, making it 4.09% faster, but I can no longer reproduce those results. Maybe it was a fluke?
I have no idea why Mali-G68 resulted in a performance degradation. It makes no sense. The best explanation is that the Mali GPU does not like the memory type/heap the data is being placed in.

Specs:

Workstation: Ryzen 5900X 2x16GB AMD Radeon 6800 XT 16GB. Ubuntu 24.04 LTS, RADV Mesa 24.3.4
Laptop: Ryzen 5700U 4+8GB AMD Radeon Vega 8. Ubuntu 24.04 LTS, RADV Mesa 24.2.8
Adreno 640: POCO F2 Pro, stock Android 13
Mali-G68: Samsung A54 5G, stock Android 14

Calinou

Tested locally on Vulkan, it works as expected. Code looks good to me.

Benchmark

PC specifications

CPU: AMD Ryzen 9 9950X3D
GPU: NVIDIA GeForce RTX 5090
RAM: 64 GB (2×32 GB DDR5-6000 CL30)
SSD: Solidigm P44 Pro 2 TB
OS: Linux (Fedora 42)

Using release export template builds with production=yes lto=full, with https://github.com/Calinou/godot-reflection run using --disable-vsync -- --benchmark.

1920×1080

Before

	Average	1% low	0.1% low
Frametime	898 FPS (1.11 mspf)	729 FPS (1.37 mspf)	490 FPS (2.04 mspf)
CPU Time	11654 FPS (0.09 mspf)	8620 FPS (0.12 mspf)	6410 FPS (0.16 mspf)
GPU Time	945 FPS (1.06 mspf)	768 FPS (1.30 mspf)	505 FPS (1.98 mspf)

After

	Average	1% low	0.1% low
Frametime	890 FPS (1.12 mspf)	734 FPS (1.36 mspf)	495 FPS (2.02 mspf)
CPU Time	10872 FPS (0.09 mspf)	7751 FPS (0.13 mspf)	5524 FPS (0.18 mspf)
GPU Time	947 FPS (1.06 mspf)	773 FPS (1.29 mspf)	510 FPS (1.96 mspf)

3840×2160

Before

	Average	1% low	0.1% low
Frametime	398 FPS (2.51 mspf)	291 FPS (3.43 mspf)	237 FPS (4.22 mspf)
CPU Time	9295 FPS (0.11 mspf)	6493 FPS (0.15 mspf)	5154 FPS (0.19 mspf)
GPU Time	413 FPS (2.42 mspf)	300 FPS (3.33 mspf)	245 FPS (4.07 mspf)

After

	Average	1% low	0.1% low
Frametime	399 FPS (2.50 mspf)	292 FPS (3.42 mspf)	248 FPS (4.02 mspf)
CPU Time	8638 FPS (0.12 mspf)	5988 FPS (0.17 mspf)	4854 FPS (0.21 mspf)
GPU Time	414 FPS (2.41 mspf)	301 FPS (3.31 mspf)	255 FPS (3.91 mspf)

64×64

Just to check for CPU overhead.

Before

	Average	1% low	0.1% low
Frametime	1736 FPS (0.58 mspf)	1533 FPS (0.65 mspf)	690 FPS (1.45 mspf)
CPU Time	12349 FPS (0.08 mspf)	10416 FPS (0.10 mspf)	7194 FPS (0.14 mspf)
GPU Time	1852 FPS (0.54 mspf)	1683 FPS (0.59 mspf)	709 FPS (1.41 mspf)

After

	Average	1% low	0.1% low
Frametime	1756 FPS (0.57 mspf)	1545 FPS (0.65 mspf)	698 FPS (1.43 mspf)
CPU Time	11554 FPS (0.09 mspf)	9615 FPS (0.10 mspf)	6666 FPS (0.15 mspf)
GPU Time	1872 FPS (0.53 mspf)	1694 FPS (0.59 mspf)	719 FPS (1.39 mspf)

64×64 on llvmpipe

Before

	Average	1% low	0.1% low
Frametime	14 FPS (69.07 mspf)	5 FPS (172.01 mspf)	0 FPS (6839.52 mspf)
CPU Time	2917 FPS (0.34 mspf)	2364 FPS (0.42 mspf)	699 FPS (1.43 mspf)
GPU Time	18 FPS (53.60 mspf)	6 FPS (161.56 mspf)	3 FPS (286.58 mspf)

After

	Average	1% low	0.1% low
Frametime	17 FPS (56.11 mspf)	7 FPS (140.12 mspf)	0 FPS (1047.59 mspf)
CPU Time	3132 FPS (0.32 mspf)	2457 FPS (0.41 mspf)	611 FPS (1.64 mspf)
GPU Time	18 FPS (53.08 mspf)	7 FPS (125.18 mspf)	5 FPS (194.14 mspf)

The figures are pretty similar overall, but there is a significant improvement in llvmpipe at least, which shows the PR is having some positive effect.
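
For readers who want to check the tables, the following sketch shows how the FPS and mspf columns relate and how a relative frametime change can be computed. The function names are illustrative and are not part of the benchmark project. Because the mspf values in the tables are rounded, FPS derived from them only approximately matches the listed FPS.

```cpp
#include <cassert>
#include <cmath>

// Convert milliseconds per frame to frames per second.
double mspf_to_fps(double p_mspf) {
	assert(p_mspf > 0.0);
	return 1000.0 / p_mspf;
}

// Relative reduction in frametime, in percent (positive means faster).
double percent_reduction(double p_before_ms, double p_after_ms) {
	assert(p_before_ms > 0.0);
	return (p_before_ms - p_after_ms) / p_before_ms * 100.0;
}
```

For example, `percent_reduction(69.07, 56.11)` for the average llvmpipe frametime evaluates to roughly 18.8%.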

clayjohn · 2025-10-17T16:17:00Z

@Calinou This PR should only have an impact on devices that support UMA. Your RTX 5090 doesn't support UMA, so we don't expect it to make a difference. The changes on llvmpipe are within a margin of error too. So at most we can conclude that this doesn't cause regressions on that device.

darksylinc · 2025-10-17T18:07:11Z

Actually, Calinou's performance results on the RTX 5090 are within the expected outcome (though the difference is so small it could be noise).

Even though the RTX 5090 isn't UMA, this PR takes advantage of CPU-visible, device-local VRAM (which is plentiful in GPUs with ReBAR enabled). In other words, the RTX 5090 will pretend it is UMA and we will use that.

This means:

The CPU may execute slower, because it "pushes" data through the slower, uncached PCIe bus (instead of having the GPU "pull" that data later through the bus). This could theoretically be improved if the copy itself is offloaded to a worker thread when possible (it is not always possible).
The GPU might execute faster.
Overall, it may or may not execute faster, because some of the code now pushes data directly to GPU memory instead of doing two copies (first to a staging area, then to the GPU).
The 1% / 0.1% lows should improve because there are fewer GPU copy commands being scheduled. This is most likely the most noticeable difference: it was the only metric that consistently improved for framerate and GPU time, and consistently deteriorated for CPU time. This is exactly as expected.
On a true UMA architecture, the difference is that there is no slow PCIe bus. The CPU writes directly to GPU memory without downsides(*), making the PR (theoretically) more impactful on such hardware.

llvmpipe is an outlier in too many ways (it's not an actual GPU, it's so slow that noise is huge, and rendering threads compete with Godot's threads for execution time).

(*) On some SoCs, like those in the PS4 and PS5, the CPU bandwidth to RAM is artificially throttled (to keep the CPU from starving the GPU of bandwidth), so the analysis is more nuanced in such cases, as it will resemble the RTX 5090 case.
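
To make the "CPU-visible, device-local VRAM" idea above concrete, here is a minimal sketch of how a memory type could be chosen. It is an assumption-laden illustration, not the selection logic used by this PR or by any allocator library. The constants mirror the values of Vulkan's `VkMemoryPropertyFlagBits` so the sketch compiles without the Vulkan headers, and `pick_memory_type` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local constants mirroring Vulkan's VkMemoryPropertyFlagBits values.
constexpr uint32_t MEM_DEVICE_LOCAL = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t MEM_HOST_VISIBLE = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr uint32_t MEM_HOST_COHERENT = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

// Hypothetical helper: returns the index of the first memory type that has all
// required flags and, if possible, all preferred flags too. Returns -1 if no
// memory type satisfies the required flags.
int pick_memory_type(const std::vector<uint32_t> &p_type_flags, uint32_t p_required, uint32_t p_preferred) {
	assert(p_type_flags.size() <= 32); // Vulkan exposes at most 32 memory types.

	// First pass: required and preferred flags together.
	const uint32_t wanted = p_required | p_preferred;
	for (size_t i = 0; i < p_type_flags.size(); i++) {
		if ((p_type_flags[i] & wanted) == wanted) {
			return int(i);
		}
	}
	// Second pass: fall back to the required flags only.
	for (size_t i = 0; i < p_type_flags.size(); i++) {
		if ((p_type_flags[i] & p_required) == p_required) {
			return int(i);
		}
	}
	return -1;
}
```

Requiring `MEM_HOST_VISIBLE | MEM_HOST_COHERENT` and preferring `MEM_DEVICE_LOCAL` picks the CPU-visible VRAM type on a discrete GPU that exposes one (for example with ReBAR enabled), and falls back to ordinary host-visible memory otherwise.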

stuartcarnie · 2025-10-17T20:07:40Z

@Calinou thanks for testing. As discussed with @clayjohn in this thread on RocketChat, we're removing the push constant changes, which significantly reduces the blast radius of this PR. Note that no .glsl shaders are modified after that change.

UMA buffers are now only used for instance buffers, and I have incorporated #104566 into this PR, as it was quite small and showed a measurable improvement in the 2D batching demo I tested (10% on my M1 Pro Max; I'll retest on my M4 Pro Max).

Calinou · 2025-10-17T23:12:51Z

> device-local VRAM (which is plentiful in GPUs with ReBAR enabled). In other words, the RTX 5090 will pretend it is UMA and we will use that.

Note that the NVIDIA Windows driver only enables ReBAR by default in specific games that are known to benefit from it (according to the per-game profile), but on Linux, it seems to be enabled for everything by default as long as it's enabled in the UEFI. I can confirm it's enabled on my end.

clayjohn

Reviewed in rendering meeting and discussed several times with Stuart and Matias. Let's go ahead with this now!

stuartcarnie · 2025-10-23T02:39:24Z

@AThousandShips thanks – I've incorporated your changes!

This work is heavily refactored and rewritten from TheForge's initial code.

TheForge's original code had too many race conditions and was fundamentally flawed, as it was too easy to run into those data races by accident.

However, they identified the proper places that needed changes, and the idea was sound. I used their work as a blueprint to design this work.

This PR implements:

- Introduction of UMA buffers used by a few buffers (most notably the ones filled by _fill_instance_data).

Ironically, this change seems to positively affect PC more than it does Mobile.

Updates D3D12 Memory Allocator to get GPU_UPLOAD heap support.

Metal implementation by Stuart Carnie.

Co-authored-by: Stuart Carnie <[email protected]>
Co-authored-by: TheForge team

Repiteo · 2025-10-24T16:30:13Z

Thanks!

AThousandShips · 2025-10-26T08:46:10Z

I'm working on pinning down the specifics, but this caused a regression on (at least) Windows, seemingly with combinations of parallax layers and UI. It seems to be compiler-specific.

Will make a report on Monday or when I can pin down the specifics

Edit: Seems to be MinGW-specific, though I can't test much further. It happens on MinGW in debug and release builds, with and without production build mode for release builds, but not in debug builds on MSVC. I haven't tested production or release builds with MSVC, but it seems to be conclusive.

Will make an issue with an MRP later today or tomorrow

AThousandShips · 2025-10-28T12:25:16Z

Made an issue report for this:

Regression with rendering UI #112121

Regression from godotengine#111183 Closes godotengine#112121

blueskythlikesclouds · 2025-10-29T09:52:14Z

This PR seems to have caused a big performance regression in D3D12. In the commit right before this PR, I get about 145 FPS in the editor, but with this PR, it drops to about 60 FPS. I'll investigate what caused it.

darksylinc · 2025-10-29T15:06:36Z

> This PR seems to have caused a big performance regression in D3D12. In the commit right before this PR, I get about 145 FPS in the editor, but with this PR, it drops to about 60 FPS. I'll investigate what caused it.

What GPU and driver? What's the log info? (All the info you'll be asked for when creating a ticket.) And of course the MRP.

I'm specifically interested in knowing which of these messages you get:

D3D12: Device supports GPU UPLOAD heap. (only printed in verbose logging mode).
D3D12: Device does NOT support GPU UPLOAD heap. ReBAR must be enabled for this feature. Regular UPLOAD heaps will be used as fallback. (printed as a warning).

Godot will try to use D3D12_HEAP_TYPE_GPU_UPLOAD when supported, else it will use D3D12_HEAP_TYPE_UPLOAD. These two can have different performance profiles. If your system is using D3D12_HEAP_TYPE_GPU_UPLOAD, try forcing it off by modifying the source code (look for dynamic_persistent_upload_heap = D3D12_HEAP_TYPE_GPU_UPLOAD; in the code).

In both cases, the way we upload data to the GPU changed compared to how it worked before the PR, though D3D12_HEAP_TYPE_UPLOAD more closely resembles how it was before.
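
The fallback described above can be sketched as follows. This is a simplified illustration and not the D3D12 driver code from the PR: `UploadHeap`, `HeapChoice`, `choose_dynamic_upload_heap`, and the `p_force_fallback` parameter are hypothetical names standing in for the real heap types and for the "force it off in source" experiment. Only the two log messages are taken from the thread.

```cpp
#include <cassert>
#include <string>

// Local stand-ins for the two D3D12 heap types discussed above.
enum class UploadHeap {
	UPLOAD, // Stands in for D3D12_HEAP_TYPE_UPLOAD.
	GPU_UPLOAD, // Stands in for D3D12_HEAP_TYPE_GPU_UPLOAD.
};

struct HeapChoice {
	UploadHeap heap;
	std::string message; // Log text, empty if nothing should be printed.
	bool is_warning; // True if the message should be printed as a warning.
};

// Pick the heap for dynamic persistent buffers. p_force_fallback models the
// experiment of forcing the GPU UPLOAD heap off even when it is supported.
HeapChoice choose_dynamic_upload_heap(bool p_gpu_upload_supported, bool p_force_fallback = false) {
	if (p_gpu_upload_supported && !p_force_fallback) {
		return { UploadHeap::GPU_UPLOAD, "D3D12: Device supports GPU UPLOAD heap.", false };
	}
	if (!p_gpu_upload_supported) {
		return { UploadHeap::UPLOAD,
			"D3D12: Device does NOT support GPU UPLOAD heap. ReBAR must be enabled for this feature. Regular UPLOAD heaps will be used as fallback.",
			true };
	}
	return { UploadHeap::UPLOAD, "", false };
}

// Name of the D3D12 constant each stand-in value represents.
const char *heap_type_name(UploadHeap p_heap) {
	switch (p_heap) {
		case UploadHeap::UPLOAD:
			return "D3D12_HEAP_TYPE_UPLOAD";
		case UploadHeap::GPU_UPLOAD:
			return "D3D12_HEAP_TYPE_GPU_UPLOAD";
	}
	assert(false && "unhandled heap type");
	return "";
}
```

Comparing frame times with `p_force_fallback` set to `true` and `false` on the same machine mirrors the experiment suggested above: it isolates whether the regression comes from the GPU UPLOAD heap itself or from the new upload path in general.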

darksylinc · 2025-10-29T15:07:09Z

Oh I just saw you submitted a PR with a fix. Cool.

stuartcarnie requested review from a team as code owners October 2, 2025 22:13

stuartcarnie marked this pull request as draft October 2, 2025 22:13

Calinou added enhancement topic:rendering labels Oct 2, 2025

Calinou added this to the 4.x milestone Oct 2, 2025

stuartcarnie force-pushed the matias-uma-pc-pr branch from 3204200 to a2ccf52 Compare October 16, 2025 00:40

stuartcarnie marked this pull request as ready for review October 16, 2025 00:41

stuartcarnie force-pushed the matias-uma-pc-pr branch 3 times, most recently from c636f7e to 199c304 Compare October 16, 2025 01:29

Calinou approved these changes Oct 17, 2025

Calinou added the performance label Oct 17, 2025

stuartcarnie force-pushed the matias-uma-pc-pr branch from 199c304 to 06354a9 Compare October 17, 2025 20:01

AThousandShips reviewed Oct 17, 2025

servers/rendering/multi_uma_buffer.h (outdated, resolved)

drivers/vulkan/rendering_device_driver_vulkan.cpp (outdated, resolved)

drivers/metal/metal_objects.mm (outdated, resolved)

stuartcarnie mentioned this pull request Oct 19, 2025

2d: Use dynamic buffers for instance data #104566

Closed

clayjohn approved these changes Oct 23, 2025

clayjohn modified the milestones: 4.x, 4.6 Oct 23, 2025

stuartcarnie force-pushed the matias-uma-pc-pr branch 2 times, most recently from 06484e8 to 6053381 Compare October 23, 2025 02:39

AThousandShips approved these changes Oct 23, 2025

servers/rendering/renderer_rd/forward_mobile/render_forward_mobile.cpp (outdated, resolved)

stuartcarnie force-pushed the matias-uma-pc-pr branch from 6053381 to 230adb7 Compare October 23, 2025 21:16

clayjohn changed the title ~~Add Persistent Buffers and Push Constants~~ Add Persistent Buffers utilizing UMA Oct 23, 2025

Repiteo merged commit edbfb7a into godotengine:master Oct 24, 2025
20 checks passed

Repiteo mentioned this pull request Oct 24, 2025

Add Persistent Buffers and Push Constants #103779

Closed

stuartcarnie mentioned this pull request Oct 26, 2025

Metal: Stable argument buffers; GPU rendering crashes; visionOS exports #111976

Merged

stuartcarnie deleted the matias-uma-pc-pr branch October 28, 2025 05:41

AThousandShips mentioned this pull request Oct 28, 2025

Regression with rendering UI #112121

Closed

stuartcarnie added a commit to stuartcarnie/godot that referenced this pull request Oct 28, 2025

2D: Fix incorrect 2D rendering

7db9be5

Regression from godotengine#111183 Closes godotengine#112121

stuartcarnie mentioned this pull request Oct 28, 2025

2D: Fix incorrect 2D rendering #112131

Merged

blueskythlikesclouds mentioned this pull request Oct 29, 2025

Set DONT_PREFER_SMALL_BUFFERS_COMMITTED when initializing D3D12MA. #112152

Merged

blueskythlikesclouds mentioned this pull request Nov 18, 2025

Fix buffer creation on old D3D12 runtimes. #112914

Merged
