-
-
Notifications
You must be signed in to change notification settings - Fork 23.7k
Add Persistent Buffers utilizing UMA #111183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3204200 to
a2ccf52
Compare
c636f7e to
199c304
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested locally on Vulkan, it works as expected. Code looks good to me.
Benchmark
PC specifications
- CPU: AMD Ryzen 9 9950X3D
- GPU: NVIDIA GeForce RTX 5090
- RAM: 64 GB (2×32 GB DDR5-6000 CL30)
- SSD: Solidigm P44 Pro 2 TB
- OS: Linux (Fedora 42)
Using release export template builds with production=yes lto=full, with https://github.com/Calinou/godot-reflection run using --disable-vsync -- --benchmark.
1920×1080
Before
| Average | 1% low | 0.1% low | |
|---|---|---|---|
| Frametime | 898 FPS (1.11 mspf) | 729 FPS (1.37 mspf) | 490 FPS (2.04 mspf) |
| CPU Time | 11654 FPS (0.09 mspf) | 8620 FPS (0.12 mspf) | 6410 FPS (0.16 mspf) |
| GPU Time | 945 FPS (1.06 mspf) | 768 FPS (1.30 mspf) | 505 FPS (1.98 mspf) |
After
| Average | 1% low | 0.1% low | |
|---|---|---|---|
| Frametime | 890 FPS (1.12 mspf) | 734 FPS (1.36 mspf) | 495 FPS (2.02 mspf) |
| CPU Time | 10872 FPS (0.09 mspf) | 7751 FPS (0.13 mspf) | 5524 FPS (0.18 mspf) |
| GPU Time | 947 FPS (1.06 mspf) | 773 FPS (1.29 mspf) | 510 FPS (1.96 mspf) |
3840×2160
Before
| Average | 1% low | 0.1% low | |
|---|---|---|---|
| Frametime | 398 FPS (2.51 mspf) | 291 FPS (3.43 mspf) | 237 FPS (4.22 mspf) |
| CPU Time | 9295 FPS (0.11 mspf) | 6493 FPS (0.15 mspf) | 5154 FPS (0.19 mspf) |
| GPU Time | 413 FPS (2.42 mspf) | 300 FPS (3.33 mspf) | 245 FPS (4.07 mspf) |
After
| Average | 1% low | 0.1% low | |
|---|---|---|---|
| Frametime | 399 FPS (2.50 mspf) | 292 FPS (3.42 mspf) | 248 FPS (4.02 mspf) |
| CPU Time | 8638 FPS (0.12 mspf) | 5988 FPS (0.17 mspf) | 4854 FPS (0.21 mspf) |
| GPU Time | 414 FPS (2.41 mspf) | 301 FPS (3.31 mspf) | 255 FPS (3.91 mspf) |
64×64
Just to check for CPU overhead.
Before
| Average | 1% low | 0.1% low | |
|---|---|---|---|
| Frametime | 1736 FPS (0.58 mspf) | 1533 FPS (0.65 mspf) | 690 FPS (1.45 mspf) |
| CPU Time | 12349 FPS (0.08 mspf) | 10416 FPS (0.10 mspf) | 7194 FPS (0.14 mspf) |
| GPU Time | 1852 FPS (0.54 mspf) | 1683 FPS (0.59 mspf) | 709 FPS (1.41 mspf) |
After
| Average | 1% low | 0.1% low | |
|---|---|---|---|
| Frametime | 1756 FPS (0.57 mspf) | 1545 FPS (0.65 mspf) | 698 FPS (1.43 mspf) |
| CPU Time | 11554 FPS (0.09 mspf) | 9615 FPS (0.10 mspf) | 6666 FPS (0.15 mspf) |
| GPU Time | 1872 FPS (0.53 mspf) | 1694 FPS (0.59 mspf) | 719 FPS (1.39 mspf) |
64×64 on llvmpipe
Before
| Average | 1% low | 0.1% low | |
|---|---|---|---|
| Frametime | 14 FPS (69.07 mspf) | 5 FPS (172.01 mspf) | 0 FPS (6839.52 mspf) |
| CPU Time | 2917 FPS (0.34 mspf) | 2364 FPS (0.42 mspf) | 699 FPS (1.43 mspf) |
| GPU Time | 18 FPS (53.60 mspf) | 6 FPS (161.56 mspf) | 3 FPS (286.58 mspf) |
After
| Average | 1% low | 0.1% low | |
|---|---|---|---|
| Frametime | 17 FPS (56.11 mspf) | 7 FPS (140.12 mspf) | 0 FPS (1047.59 mspf) |
| CPU Time | 3132 FPS (0.32 mspf) | 2457 FPS (0.41 mspf) | 611 FPS (1.64 mspf) |
| GPU Time | 18 FPS (53.08 mspf) | 7 FPS (125.18 mspf) | 5 FPS (194.14 mspf) |
The figures are pretty similar overall, but there is a significant improvement in llvmpipe at least, which shows the PR is having some positive effect.
|
@Calinou This PR should only have an impact on devices that support UMA. Your RTX 5090 doesn't support UMA, so we don't expect it to make a difference. The changes on LLVMpipe are within a margin of error too. So at most we can conclude that this doesn't cause regressions on that device |
|
Actually Calinou's performance on RTX 5090 are within the expected outcome (though the difference is so small it could be noise). Even though the RTX 5090 isn't UMA, this PR takes advantage of CPU-visible, device-local VRAM (which is plentiful in GPUs with ReBAR enabled). In other words, the RTX 5090 will pretend it is UMA and we will use that. This means:
llvmpipe is an outlier in too many ways (it's not an actual GPU, it's so slow that noise is huge, and rendering threads compete with Godot's threads for execution time). (*) On some SoCs like the PS4 and PS5, the CPU bandwidth to RAM is artificially throttled though (to avoid CPU from starving the GPU's bandwidth) so the analysis is more nuanced in such cases as it will resemble the RTX 5090 case. |
199c304 to
06354a9
Compare
|
@Calinou thanks for testing. As discussed with @clayjohn in this thread on RocketChat, we're removing the push constant changes, which significantly reduces the blast radius of this PR. Note that no UMA are only used for instance buffers now, and I have incorporated #104566 into this PR, as it was quite small and showed a measurable improvement in the 2D batching demo I tested (10% on my M1 Pro Max; I'll retest on my M4 Pro Max) |
clayjohn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed in rendering meeting and discussed several times with Stuart and Matias. Let's go ahead with this now!
06484e8 to
6053381
Compare
|
@AThousandShips thanks – I've incorporated your changes! |
servers/rendering/renderer_rd/forward_mobile/render_forward_mobile.cpp
Outdated
Show resolved
Hide resolved
This work is a heavily refactored and rewritten from TheForge's initial code. TheForge's original code had too many race conditions and was fundamentally flawed as it was too easy to incur into those data races by accident. However they identified the proper places that needed changes, and the idea was sound. I used their work as a blueprint to design this work. This PR implements: - Introduction of UMA buffers used by a few buffers (most notably the ones filled by _fill_instance_data). Ironically this change seems to positively affect PC more than it does on Mobile. Updates D3D12 Memory Allocator to get GPU_UPLOAD heap support. Metal implementation by Stuart Carnie. Co-authored-by: Stuart Carnie <[email protected]> Co-authored-by: TheForge team
6053381 to
230adb7
Compare
|
Thanks! |
|
Working on pinning down the specifics but this caused a regression on (at least) Windows, seemingly with combinations of parallax layers and ui, seems to be compiler specific Will make a report on Monday or when I can pin down the specifics Edit: Seems to be mingw specific, though can't test much further, but it happens on debug and release builds with and without production build mode for release builds on mingw, but not on debug builds on msvc, haven't tested production or release builds with msvc but seems to be conclusive Will make an issue with an MRP later today or tomorrow |
|
Made an issue report for this: |
Regression from godotengine#111183 Closes godotengine#112121
|
This PR seems to have caused a big performance regression in D3D12. In the commit right before this PR, I get about 145~ FPS in the editor, but with this PR, it drops to 60~ FPS. I'll investigate what caused it. |
What GPU and driver? What's the log info? (all info you'll be asked when creating a ticket). And of course the MRP. I'm specifically interested in knowing which one you get of these msgs:
Godot will try to use In both cases, the way we upload data to GPU changed compared to how it worked before the PR, though |
|
Oh I just saw you submitted a PR with a fix. Cool. |
Regression from godotengine#111183 Closes godotengine#112121
Regression from godotengine#111183 Closes godotengine#112121
Regression from godotengine#111183 Closes godotengine#112121
Regression from godotengine#111183 Closes godotengine#112121

Important
This is a rebase of @darksylinc's #103779 to resolve all conflicts. I've re-run the TPS demo and all is working under Metal. As there were no conflicts with Vulkan and D3D12, I'd presume those still work, but they will need to be tested.
All push constant usage has been removed and can be revisited in a future PR.
This work is a heavily refactored and rewritten from TheForge's initial code.
TheForge's original code had too many race conditions and was fundamentally flawed as it was too easy to incur into those data races by accident.
However they identified the proper places that needed changes, and the idea was sound. I used their work as a blueprint to design this work.
This PR implements:
Ironically this change seems to positively affect PC more than it does on Mobile:
Bugsquad edit: These numbers reflect a much earlier draft of this PR
Notes:
_fill_instance_data. The more objects are on scene, the bigger the impact.Specs:
Workstation: Ryzen 5900X 2x16GB AMD Radeon 6800 XT 16GB. Ubuntu 24.04 LTS, RADV Mesa 24.3.4
Laptop: Ryzen 5700U 4+8GB AMD Radeon Vega 8. Ubuntu 24.04 LTS, RADV Mesa 24.2.8
Adreno 640: POCO F2 Pro, stock Android 13
Mali-G68: Samsung A54 5G, stock Android 14