@stuartcarnie
Contributor

@stuartcarnie stuartcarnie commented Oct 2, 2025

Important

This is a rebase of @darksylinc's #103779 to resolve all conflicts. I've re-run the TPS demo and all is working under Metal. As there were no conflicts with Vulkan and D3D12, I'd presume those still work, but they will need to be tested.

All push constant usage has been removed and can be revisited in a future PR.

This work is a heavily refactored and rewritten version of TheForge's initial code.

TheForge's original code had too many race conditions and was fundamentally flawed, as it was too easy to run into those data races by accident.

However, they identified the proper places that needed changes, and the idea was sound. I used their work as a blueprint to design this work.

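This PR implements:

  • Introduction of UMA buffers used by a few buffers (most notably the ones filled by _fill_instance_data).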

Ironically this change seems to positively affect PC more than it does on Mobile:
Bugsquad edit: These numbers reflect a much earlier draft of this PR

| Device | Before (ms) | After (ms) | Improvement |
|---|---|---|---|
| Workstation | 0.622 | 0.606 | 2.60 % |
| Laptop | 7.326 | 7.220 | 1.44 % |
| Adreno 640 | 70.833 | 70.819 | 0.02 % |
| Mali-G68 | 48.894 | 49.138 | -0.50 % |

Notes:

  1. This is the modified TPS Demo
  2. This is the Mobile backend. The improvements also apply to Clustered, but they are harder to see because Clustered is much more GPU bound than Mobile.
  3. Some of these numbers are unimpressive because, as a result of this PR, "Fix inefficient upload in Mobile Shadows" #103531 was submitted. That made "before" faster, as the bulk of the performance difference is in _fill_instance_data. The more objects there are in the scene, the bigger the impact.
  4. For some reason before rebasing Adreno 640 "after" got 67.931ms, making it 4.09% faster. But I can no longer reproduce those results. Maybe it was a fluke?
  5. I have no idea why Mali-G68 resulted in a performance degradation. It makes no sense. The best explanation is that Mali GPU does not like the Memory Type/Heap the data is being placed on.

Specs:

Workstation: Ryzen 5900X 2x16GB AMD Radeon 6800 XT 16GB. Ubuntu 24.04 LTS, RADV Mesa 24.3.4
Laptop: Ryzen 5700U 4+8GB AMD Radeon Vega 8. Ubuntu 24.04 LTS, RADV Mesa 24.2.8
Adreno 640: POCO F2 Pro, stock Android 13
Mali-G68: Samsung A54 5G, stock Android 14

@stuartcarnie stuartcarnie requested review from a team as code owners October 2, 2025 22:13
@stuartcarnie stuartcarnie marked this pull request as draft October 2, 2025 22:13
@Calinou Calinou added this to the 4.x milestone Oct 2, 2025
@stuartcarnie stuartcarnie marked this pull request as ready for review October 16, 2025 00:41
@stuartcarnie stuartcarnie force-pushed the matias-uma-pc-pr branch 3 times, most recently from c636f7e to 199c304 Compare October 16, 2025 01:29
Member

@Calinou Calinou left a comment

Tested locally on Vulkan, it works as expected. Code looks good to me.

Benchmark

PC specifications
  • CPU: AMD Ryzen 9 9950X3D
  • GPU: NVIDIA GeForce RTX 5090
  • RAM: 64 GB (2×32 GB DDR5-6000 CL30)
  • SSD: Solidigm P44 Pro 2 TB
  • OS: Linux (Fedora 42)

Using release export template builds with production=yes lto=full, with https://github.com/Calinou/godot-reflection run using --disable-vsync -- --benchmark.

1920×1080

Before

| | Average | 1% low | 0.1% low |
|---|---|---|---|
| Frametime | 898 FPS (1.11 mspf) | 729 FPS (1.37 mspf) | 490 FPS (2.04 mspf) |
| CPU Time | 11654 FPS (0.09 mspf) | 8620 FPS (0.12 mspf) | 6410 FPS (0.16 mspf) |
| GPU Time | 945 FPS (1.06 mspf) | 768 FPS (1.30 mspf) | 505 FPS (1.98 mspf) |

After

| | Average | 1% low | 0.1% low |
|---|---|---|---|
| Frametime | 890 FPS (1.12 mspf) | 734 FPS (1.36 mspf) | 495 FPS (2.02 mspf) |
| CPU Time | 10872 FPS (0.09 mspf) | 7751 FPS (0.13 mspf) | 5524 FPS (0.18 mspf) |
| GPU Time | 947 FPS (1.06 mspf) | 773 FPS (1.29 mspf) | 510 FPS (1.96 mspf) |

3840×2160

Before

| | Average | 1% low | 0.1% low |
|---|---|---|---|
| Frametime | 398 FPS (2.51 mspf) | 291 FPS (3.43 mspf) | 237 FPS (4.22 mspf) |
| CPU Time | 9295 FPS (0.11 mspf) | 6493 FPS (0.15 mspf) | 5154 FPS (0.19 mspf) |
| GPU Time | 413 FPS (2.42 mspf) | 300 FPS (3.33 mspf) | 245 FPS (4.07 mspf) |

After

| | Average | 1% low | 0.1% low |
|---|---|---|---|
| Frametime | 399 FPS (2.50 mspf) | 292 FPS (3.42 mspf) | 248 FPS (4.02 mspf) |
| CPU Time | 8638 FPS (0.12 mspf) | 5988 FPS (0.17 mspf) | 4854 FPS (0.21 mspf) |
| GPU Time | 414 FPS (2.41 mspf) | 301 FPS (3.31 mspf) | 255 FPS (3.91 mspf) |

64×64

Just to check for CPU overhead.

Before

| | Average | 1% low | 0.1% low |
|---|---|---|---|
| Frametime | 1736 FPS (0.58 mspf) | 1533 FPS (0.65 mspf) | 690 FPS (1.45 mspf) |
| CPU Time | 12349 FPS (0.08 mspf) | 10416 FPS (0.10 mspf) | 7194 FPS (0.14 mspf) |
| GPU Time | 1852 FPS (0.54 mspf) | 1683 FPS (0.59 mspf) | 709 FPS (1.41 mspf) |

After

| | Average | 1% low | 0.1% low |
|---|---|---|---|
| Frametime | 1756 FPS (0.57 mspf) | 1545 FPS (0.65 mspf) | 698 FPS (1.43 mspf) |
| CPU Time | 11554 FPS (0.09 mspf) | 9615 FPS (0.10 mspf) | 6666 FPS (0.15 mspf) |
| GPU Time | 1872 FPS (0.53 mspf) | 1694 FPS (0.59 mspf) | 719 FPS (1.39 mspf) |

64×64 on llvmpipe

Before

| | Average | 1% low | 0.1% low |
|---|---|---|---|
| Frametime | 14 FPS (69.07 mspf) | 5 FPS (172.01 mspf) | 0 FPS (6839.52 mspf) |
| CPU Time | 2917 FPS (0.34 mspf) | 2364 FPS (0.42 mspf) | 699 FPS (1.43 mspf) |
| GPU Time | 18 FPS (53.60 mspf) | 6 FPS (161.56 mspf) | 3 FPS (286.58 mspf) |

After

| | Average | 1% low | 0.1% low |
|---|---|---|---|
| Frametime | 17 FPS (56.11 mspf) | 7 FPS (140.12 mspf) | 0 FPS (1047.59 mspf) |
| CPU Time | 3132 FPS (0.32 mspf) | 2457 FPS (0.41 mspf) | 611 FPS (1.64 mspf) |
| GPU Time | 18 FPS (53.08 mspf) | 7 FPS (125.18 mspf) | 5 FPS (194.14 mspf) |

The figures are pretty similar overall, but there is a significant improvement in llvmpipe at least, which shows the PR is having some positive effect.
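For readers unfamiliar with the "1% low" and "0.1% low" columns, here is a minimal C++ sketch of how such figures can be derived from a list of frame times. It assumes that a "low" is the mean of the slowest fraction of frames; the benchmark project used above may define it differently (for example as a percentile), and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Convert a frame time in milliseconds per frame (mspf) to frames per second.
double mspf_to_fps(double mspf) {
    return 1000.0 / mspf;
}

// Mean frame time (in ms) of the slowest `fraction` of frames,
// e.g. fraction = 0.01 for "1% low" and 0.001 for "0.1% low".
// Assumption: "low" means the average of the worst frames; at least one frame is used.
double low_mspf(std::vector<double> frame_times_ms, double fraction) {
    if (frame_times_ms.empty()) {
        return 0.0;
    }
    // Sort slowest (largest frame time) first.
    std::sort(frame_times_ms.begin(), frame_times_ms.end(), std::greater<double>());
    std::size_t count = static_cast<std::size_t>(frame_times_ms.size() * fraction);
    count = std::max<std::size_t>(count, 1);
    double sum = std::accumulate(frame_times_ms.begin(), frame_times_ms.begin() + count, 0.0);
    return sum / static_cast<double>(count);
}
```

Under this definition a single long stall dominates the 0.1% low, which would be consistent with the llvmpipe frametime rows showing 0 FPS in that column.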

@clayjohn
Member

@Calinou This PR should only have an impact on devices that support UMA. Your RTX 5090 doesn't support UMA, so we don't expect it to make a difference. The changes on llvmpipe are within the margin of error too. So at most we can conclude that this doesn't cause regressions on that device.

@darksylinc
Contributor

darksylinc commented Oct 17, 2025

Actually, Calinou's results on the RTX 5090 are within the expected outcome (though the difference is so small it could be noise).

Even though the RTX 5090 isn't UMA, this PR takes advantage of CPU-visible, device-local VRAM (which is plentiful in GPUs with ReBAR enabled). In other words, the RTX 5090 will pretend it is UMA and we will use that.

This means:

  1. CPU may execute slower, because CPU "pushes" data through the slower uncached PCIe bus (instead of having the GPU "pull" that data later through the bus). This could be theoretically improved if the copy itself is offloaded to a worker thread when possible (not always possible).
  2. GPU might execute faster.
  3. Overall may or may not execute faster, because some of the code now pushes data directly to GPU memory instead of doing two copies (first to staging area, then to GPU).
  4. Low 1% / 0.1% should improve because there are fewer GPU copy commands being scheduled. This is most likely the most noticeable difference: it was the only metric that consistently improved for Frametime and GPU time, and consistently deteriorated for CPU time. This is exactly what is expected.
  5. On a true UMA architecture, the difference is that there is no slow PCIe bus. The CPU writes directly to GPU memory without downsides(*), making the PR (theoretically) more impactful on such hardware.

llvmpipe is an outlier in too many ways (it's not an actual GPU, it's so slow that noise is huge, and rendering threads compete with Godot's threads for execution time).

(*) On some SoCs like the PS4 and PS5, the CPU bandwidth to RAM is artificially throttled (to prevent the CPU from starving the GPU of bandwidth), so the analysis is more nuanced in such cases, as it will resemble the RTX 5090 case.
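The "pretend it is UMA" idea can be sketched as a memory-type search. This is a minimal, self-contained C++ illustration and not the code in this PR: the flag names only mimic Vulkan's memory property bits, and `MemoryType`, `find_memory_type`, `UploadPlan` and `plan_upload` are hypothetical names. It prefers memory that is both device-local and CPU-visible (one copy, written directly by the CPU) and otherwise falls back to a host-visible staging area (two copies).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical property bits, modelled on Vulkan's memory property flags.
enum MemoryProperty : std::uint32_t {
    DEVICE_LOCAL = 1u << 0,
    HOST_VISIBLE = 1u << 1,
    HOST_COHERENT = 1u << 2,
};

struct MemoryType {
    std::uint32_t properties;
};

// Returns the index of the first memory type with all `required` bits, or -1 if none matches.
int find_memory_type(const std::vector<MemoryType> &types, std::uint32_t required) {
    for (std::size_t i = 0; i < types.size(); i++) {
        if ((types[i].properties & required) == required) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

struct UploadPlan {
    int memory_type; // Index into the memory type list, or -1 if nothing is usable.
    int copies; // 1 = CPU writes straight into GPU memory; 2 = staging write + GPU copy.
};

UploadPlan plan_upload(const std::vector<MemoryType> &types) {
    // Preferred: CPU-visible, device-local memory (true UMA, or a discrete GPU with ReBAR).
    int idx = find_memory_type(types, DEVICE_LOCAL | HOST_VISIBLE);
    if (idx >= 0) {
        return UploadPlan{ idx, 1 };
    }
    // Fallback: host-visible staging memory, later copied to device-local memory by the GPU.
    idx = find_memory_type(types, HOST_VISIBLE);
    if (idx >= 0) {
        return UploadPlan{ idx, 2 };
    }
    return UploadPlan{ -1, 0 };
}
```

On a discrete GPU the single-copy path is where the CPU pays for pushing data over PCIe (point 1), while the GPU no longer has to schedule a copy command (point 4).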

@stuartcarnie
Contributor Author

@Calinou thanks for testing. As discussed with @clayjohn in this thread on RocketChat, we're removing the push constant changes, which significantly reduces the blast radius of this PR. Note that no .glsl shaders are modified after that change.

UMA buffers are only used for instance buffers now, and I have incorporated #104566 into this PR, as it was quite small and showed a measurable improvement in the 2D batching demo I tested (10% on my M1 Pro Max; I'll retest on my M4 Pro Max).

@Calinou
Member

Calinou commented Oct 17, 2025

> device-local VRAM (which is plentiful in GPUs with ReBAR enabled). In other words, the RTX 5090 will pretend it is UMA and we will use that.

Note that the NVIDIA Windows driver only enables ReBAR by default in specific games that are known to benefit from it (according to the per-game profile), but on Linux, it seems to be enabled for everything by default as long as it's enabled in the UEFI. I can confirm it's enabled on my end.

Member

@clayjohn clayjohn left a comment

Reviewed in rendering meeting and discussed several times with Stuart and Matias. Let's go ahead with this now!

@clayjohn clayjohn modified the milestones: 4.x, 4.6 Oct 23, 2025
@stuartcarnie stuartcarnie force-pushed the matias-uma-pc-pr branch 2 times, most recently from 06484e8 to 6053381 Compare October 23, 2025 02:39
@stuartcarnie
Contributor Author

@AThousandShips thanks – I've incorporated your changes!

This work is a heavily refactored and rewritten version of TheForge's
initial code.

TheForge's original code had too many race conditions and was
fundamentally flawed, as it was too easy to run into those data races
by accident.

However, they identified the proper places that needed changes, and the
idea was sound. I used their work as a blueprint to design this work.

This PR implements:

 - Introduction of UMA buffers used by a few buffers
(most notably the ones filled by _fill_instance_data).

Ironically this change seems to positively affect PC more than it does
on Mobile.

Updates D3D12 Memory Allocator to get GPU_UPLOAD heap support.

Metal implementation by Stuart Carnie.

Co-authored-by: Stuart Carnie <[email protected]>
Co-authored-by: TheForge team
@clayjohn clayjohn changed the title from "Add Persistent Buffers and Push Constants" to "Add Persistent Buffers utilizing UMA" Oct 23, 2025
@Repiteo Repiteo merged commit edbfb7a into godotengine:master Oct 24, 2025
20 checks passed
@Repiteo
Contributor

Repiteo commented Oct 24, 2025

Thanks!

@AThousandShips
Member

AThousandShips commented Oct 26, 2025

I'm still working on pinning down the specifics, but this caused a regression on (at least) Windows, seemingly with combinations of parallax layers and UI. It seems to be compiler-specific.

Will make a report on Monday or when I can pin down the specifics.

Edit: It seems to be MinGW-specific, though I can't test much further. It happens on MinGW with debug and release builds, with and without production build mode for the release builds, but not on debug builds with MSVC. I haven't tested production or release builds with MSVC, but this seems to be conclusive.

Will make an issue with an MRP later today or tomorrow

@AThousandShips
Member

Made an issue report for this:

stuartcarnie added a commit to stuartcarnie/godot that referenced this pull request Oct 28, 2025
@blueskythlikesclouds
Contributor

This PR seems to have caused a big performance regression in D3D12. In the commit right before this PR, I get about 145~ FPS in the editor, but with this PR, it drops to 60~ FPS. I'll investigate what caused it.

@darksylinc
Contributor

> This PR seems to have caused a big performance regression in D3D12. In the commit right before this PR, I get about 145~ FPS in the editor, but with this PR, it drops to 60~ FPS. I'll investigate what caused it.

What GPU and driver? What's the log info? (This is all information you'll be asked for when creating a ticket.) And of course the MRP.

I'm specifically interested in knowing which of these messages you get:

  1. D3D12: Device supports GPU UPLOAD heap. (only printed in verbose logging mode).
  2. D3D12: Device does NOT support GPU UPLOAD heap. ReBAR must be enabled for this feature. Regular UPLOAD heaps will be used as fallback. (printed as a warning).

Godot will try to use D3D12_HEAP_TYPE_GPU_UPLOAD when supported, else it will use D3D12_HEAP_TYPE_UPLOAD. These two can have different performance profiles. If your system is using D3D12_HEAP_TYPE_GPU_UPLOAD, try forcing it off by modifying the source code (look for dynamic_persistent_upload_heap = D3D12_HEAP_TYPE_GPU_UPLOAD; in the code).

In both cases, the way we upload data to the GPU changed compared to how it worked before the PR, though D3D12_HEAP_TYPE_UPLOAD more closely resembles how it worked before.
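The fallback described above can be sketched as follows. This is a hedged, self-contained C++ illustration rather than Godot's actual driver code: `HeapType` stands in for the two D3D12 heap type constants, and `HeapChoice` and `choose_dynamic_persistent_upload_heap` are hypothetical names. Only the two log messages are taken from this comment. In a real driver the `gpu_upload_heap_supported` flag would presumably come from a device feature query.

```cpp
#include <string>

// Hypothetical stand-ins for D3D12_HEAP_TYPE_UPLOAD and D3D12_HEAP_TYPE_GPU_UPLOAD.
enum class HeapType { Upload, GpuUpload };

struct HeapChoice {
    HeapType heap;
    bool is_warning; // false: message is only printed in verbose logging mode.
    std::string message;
};

// Prefer the GPU UPLOAD heap when the device supports it, else fall back to a regular UPLOAD heap.
HeapChoice choose_dynamic_persistent_upload_heap(bool gpu_upload_heap_supported) {
    if (gpu_upload_heap_supported) {
        return { HeapType::GpuUpload, false, "D3D12: Device supports GPU UPLOAD heap." };
    }
    return { HeapType::Upload, true,
        "D3D12: Device does NOT support GPU UPLOAD heap. ReBAR must be enabled for this feature. "
        "Regular UPLOAD heaps will be used as fallback." };
}
```

In this sketch, the experiment suggested above (forcing the GPU UPLOAD path off) corresponds to calling the function with `false` regardless of what the device reports, then comparing frame rates between the two heap types.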

@darksylinc
Contributor

Oh I just saw you submitted a PR with a fix. Cool.

jss2a98aj pushed a commit to jss2a98aj/blazium that referenced this pull request Nov 1, 2025
jss2a98aj pushed a commit to jss2a98aj/blazium that referenced this pull request Nov 1, 2025
xls pushed a commit to xls/godot that referenced this pull request Nov 5, 2025
Yanxiyimengya pushed a commit to Yanxiyimengya/godot that referenced this pull request Nov 15, 2025