Massively optimize canvas 2D rendering by using vertex buffers#112481

Merged
Repiteo merged 1 commit into godotengine:master from stuartcarnie:2d_canvas_vbos
Nov 14, 2025

@stuartcarnie
Contributor

@stuartcarnie stuartcarnie commented Nov 6, 2025

Summary

Closes #104194

This PR is a performance optimisation for the Canvas RD renderer, following the introduction of batching. @clayjohn observed a performance regression under certain combinations and on older hardware, which this PR is intended to address. It should not regress the existing performance gains. The core change is to switch instance data from a uniform buffer (shader storage buffer object) to a vertex buffer object (VBO).

Caution

D3D12 still needs vertex_format_create updated to use the new VertexAttribute::binding member, and command_render_bind_vertex_buffers updated to handle UMA buffers.

TODOs

  • Remove all the USE_VAO stuff, as that was just me trying to work with both
  • Fix the PushConstant data so that its size is reduced from 144 bytes to 84 (+ padding)
  • Add the dynamic_offset index to the vertex buffer binding. Update the drivers to use the offset (already similar for uniforms):
    • Metal
    • Vulkan
    • D3D12
  • Clean up the new dynamic vertex binding API and streamline it (switch to Span in the RenderDeviceGraph, so we don't have to allocate)
  • Extend VertexAttribute change, so a set of attributes can bind to the same buffer
    • This will make the API more efficient as drivers will only need to create a single buffer binding and consume a single slot.

Testing

We must verify all rendered command types:

  • TYPE_RECT
  • TYPE_NINEPATCH
  • TYPE_POLYGON
  • TYPE_PRIMITIVE
  • TYPE_MESH
  • TYPE_MULTIMESH
  • TYPE_PARTICLES

Verify:

  • INSTANCE_FLAGS_USE_MSDF
  • INSTANCE_FLAGS_USE_LCD

Benchmarks

See below for more detail

              Adreno 530   Adreno 640   Mali G715   Intel Xe   RX 6900XT   Mali G68 MP5
improvement   4-5x         1.5-2x       1-1.25x     2-3x       1x          1.1x

@stuartcarnie stuartcarnie force-pushed the 2d_canvas_vbos branch 2 times, most recently from 83ceba7 to 0251525 Compare November 11, 2025 01:14
Contributor Author

@stuartcarnie stuartcarnie left a comment

@clayjohn et al., some notes for your information:

}
}

HashMap &operator=(HashMap &&p_other) noexcept {
Contributor Author

Move assignment lets us std::move a HashMap into another object instead of copying it. I'm using it in vertex_format_create for the bindings member:

	VertexDescriptionCache &ce = vertex_formats.insert(id, VertexDescriptionCache())->value;
	ce.vertex_formats = vertex_descriptions;
	ce.bindings = std::move(bindings);
	ce.driver_id = driver_id;

Member

This is worth a separate core PR. Move semantics are desirable for all our containers, and that would make this PR completely non-core.

Member

@Ivorforce Are you okay with merging this change in this PR? I don't really want to block this PR while we wait for this optimization to be applied to other containers.

My preference in general is for core optimizations to be in the same PR where they are used as well so the git history shows why certain changes were needed

Member

@Ivorforce Ivorforce Nov 14, 2025

Generally agree that optimizations should be introduced only when needed, but move semantics are needed for all our containers (and most of them do have it already). Looks like I just forgot to add them for hash maps.
Core changes in particular have a habit of being 'snuck in' with bigger PRs, which can make their repercussions very hard to spot and estimate. Generally I would expect core changes to be beneficial not only to the use case of a PR, but to the codebase in general. For example, I would normally expect benchmarks for common cases if the change is optimization related. That's why I prefer them in separate PRs when possible.

Anyway, they look fine to me (provided noexcept is removed), so I'm OK with merging them in here.

Contributor Author

noexcept has been removed - thanks for the feedback!

Comment on lines -304 to +305
- virtual VertexFormatID vertex_format_create(VectorView<VertexAttribute> p_vertex_attribs) = 0;
+ virtual VertexFormatID vertex_format_create(Span<VertexAttribute> p_vertex_attribs, const VertexAttributeBindingsMap &p_vertex_bindings) = 0;
Contributor Author

Switching to Span means we can avoid allocations at call sites. We should evaluate all calls and use FixedVector, as the sizes in the renderer_rd are all known at compile time.
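
As a rough illustration of the point about call sites, the sketch below passes a stack array to a function that takes a non-owning view. `SpanSketch` and `AttribSketch` are hypothetical, simplified stand-ins and not Godot's `Span` or `VertexAttribute`; the assumption is only that a span is a pointer plus a length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified attribute; the real VertexAttribute has more fields.
struct AttribSketch {
    uint32_t location;
    uint32_t offset;
};

// Minimal non-owning view (pointer + length). Illustrative only.
template <typename T>
struct SpanSketch {
    const T *ptr = nullptr;
    size_t len = 0;

    SpanSketch(const T *p_ptr, size_t p_len) :
            ptr(p_ptr), len(p_len) {}

    // Allow construction straight from a fixed-size array.
    template <size_t N>
    SpanSketch(const T (&p_array)[N]) :
            ptr(p_array), len(N) {}

    const T *begin() const { return ptr; }
    const T *end() const { return ptr + len; }
    size_t size() const { return len; }
};

// Takes a view, so the callee never forces the caller to heap-allocate.
inline uint32_t highest_location(SpanSketch<AttribSketch> p_attribs) {
    uint32_t highest = 0;
    for (const AttribSketch &a : p_attribs) {
        if (a.location > highest) {
            highest = a.location;
        }
    }
    return highest;
}

// Call site: the attribute count is known at compile time,
// so a stack array is enough and no container is constructed.
inline uint32_t example_call_site() {
    const AttribSketch attribs[3] = { { 0, 0 }, { 1, 8 }, { 4, 16 } };
    return highest_location(attribs);
}
```

A fixed-capacity container would work the same way at the call site, as long as it can expose its contents as a pointer and a length.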

@stuartcarnie stuartcarnie force-pushed the 2d_canvas_vbos branch 2 times, most recently from 18e4eea to f1ba020 Compare November 11, 2025 19:52
@stuartcarnie stuartcarnie changed the title from "spike: VBOs for Canvas 2D" to "2D: Use Vertex Buffer Objects for Canvas 2D instance data" Nov 11, 2025
static constexpr uint32_t MIN_CAPACITY_INDEX = 2; // Use a prime.
static constexpr float MAX_OCCUPANCY = 0.75;
static constexpr uint32_t EMPTY_HASH = 0;
using KV = KeyValue<TKey, TValue>; // Type alias for easier access to KeyValue.
Contributor Author

Particularly useful when you have a typedef such as:

typedef HashMap<uint32_t, VertexAttributeBinding> VertexAttributeBindingsMap;

as you can then use:

for (const VertexAttributeBindingsMap::KV &kv : p_vertex_bindings) {
  // ...
}

vs

for (const KeyValue<uint32_t, VertexAttributeBinding> &kv : p_vertex_bindings) {
  // ...
}

Also, if you change the Key or Value type, these call sites don't need to be updated.

@stuartcarnie stuartcarnie force-pushed the 2d_canvas_vbos branch 5 times, most recently from 6cfae71 to 329eec7 Compare November 11, 2025 23:43
@clayjohn
Member

clayjohn commented Nov 13, 2025

Some preliminary testing using a modified MRP from #104194

android-4.4-perf-clay.zip

The original MRP has a loop that adds SubViewports that have a single Sprite rendered to them, then are rendered to the screen.

That means that in a loop of N, we are rendering N * 2 draw calls using N + 1 render passes (i.e. N render passes with 1 draw call each and 1 render pass with N draw calls). None of the draw calls are batched, so this MRP exposes the worst-case performance. It can be CPU bottlenecked on some hardware and GPU bottlenecked on others.

The second test case renders N sprites with the same texture in the same location. This is the best case for batching. The number of draw calls depends on the hardware's capabilities, but it is often only 1 or a handful. Drawing that many Sprites in one location is a weird edge case for TBDR GPUs, so Mali GPUs and Apple silicon GPUs may not reflect typical performance scenarios.

# 2000 Sprites Adreno 530
4.3: 55 FPS
4.5.1: 10 FPS
PR: 43 FPS

# 100 viewports Adreno 530
4.3: 36 FPS
4.5.1: 6 FPS
PR: 30 FPS

# 10000 Sprites Adreno 640
4.3: 25 FPS
4.5.1: 19 FPS
PR: 37 FPS

# 500 viewports Adreno 640
4.3: 18 FPS
4.5.1: 12 FPS
PR: 18 FPS

# 10000 Sprites Mali G715
4.3: 52 FPS
4.5.1: 45 FPS
PR: 57 FPS

# 500 viewports Mali G715
4.3: 32 FPS
4.5.1: 24 FPS
PR: 25 FPS (This is likely caused by an unrelated bug and should be investigated)

# 10000 Sprites Intel Xe
4.3: 80 FPS
4.5.1: 23 FPS
PR: 102 FPS

# 500 viewports Intel Xe
4.3: 28 FPS
4.5.1: 16 FPS
PR: 39 FPS

# 10000 Sprites Windows - Ryzen 5 9600X - Radeon RX 6900 XT
4.3: 675 FPS
4.5.1: 752 FPS
PR: 856 FPS
Master: 839 FPS

# 500 viewports Windows - Ryzen 5 9600X - Radeon RX 6900 XT
4.3: 1475 FPS
4.5.1: 1924 FPS
PR: 2307 FPS
Master: 2206 FPS

# 50000 Sprites Windows - Ryzen 5 9600X - Radeon RX 6900 XT
4.3: 102 FPS
4.5.1: 170 FPS
PR: 145 FPS
Master: 145 FPS

# 10000 viewports Windows - Ryzen 5 9600X - Radeon RX 6900 XT
4.3: 97 FPS (GPU Bound)
4.5.1: 13 FPS
PR: 98 FPS
Master: CRASH
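
One way to read these results is as the ratio of PR FPS to 4.5.1 FPS. The sketch below computes that ratio for a handful of the rows above; the struct and function names are made up for illustration, and this is not necessarily how the summary table in the PR description was derived.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// One benchmark row copied from the results above.
struct Result {
    const char *name;
    double fps_451; // FPS on 4.5.1.
    double fps_pr; // FPS with this PR.
};

// Speedup of the PR relative to 4.5.1 (above 1.0 means the PR is faster).
inline double speedup(const Result &p_result) {
    return p_result.fps_pr / p_result.fps_451;
}

// A subset of the rows listed above; values are taken verbatim from the post.
static const Result RESULTS[] = {
    { "2000 sprites, Adreno 530", 10.0, 43.0 },
    { "100 viewports, Adreno 530", 6.0, 30.0 },
    { "10000 sprites, Adreno 640", 19.0, 37.0 },
    { "500 viewports, Adreno 640", 12.0, 18.0 },
    { "10000 sprites, Intel Xe", 23.0, 102.0 },
    { "50000 sprites, RX 6900 XT", 170.0, 145.0 },
};

// Prints one line per row with the ratio rounded to two decimals.
inline void print_speedups() {
    for (const Result &r : RESULTS) {
        std::printf("%-28s %.2fx\n", r.name, speedup(r));
    }
}
```

Note that the 50000 sprites row on the RX 6900 XT comes out below 1.0, since 4.5.1 reported 170 FPS against 145 FPS for the PR.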

@blueskythlikesclouds
Contributor

I can implement the D3D12 changes. Should I make a PR to your fork?

}
}

HashMap(HashMap &&p_other) noexcept {
Member

Do these noexcept have any effect? We don't use exceptions and as far as I know we don't use this directive elsewhere

Contributor Author

True – I'll remove them

Member

@clayjohn clayjohn left a comment

Amazing work! I have tested extensively on Linux, Windows, and Android, in addition to your testing on macOS, so I think we have covered our bases.

From testing, it appears we have resolved the performance regression introduced by batching in almost all cases, and even improved performance in many cases. At this point I think this PR is ready to go and get wider testing!

While discussing this with Stuart, we identified some further optimizations we could make. But the current state is really good and gives us the majority of the possible gains with the least intrusive changes.

@clayjohn clayjohn changed the title from "2D: Use Vertex Buffer Objects for Canvas 2D instance data" to "Massively optimize canvas 2D rendering by using vertex buffers" Nov 14, 2025
<members>
<member name="binding" type="int" setter="set_binding" getter="get_binding" default="4294967295">
The index of the buffer in the vertex buffer array to bind this vertex attribute. When set to -1, it defaults to the index of the attribute.
[b]Note:[/b] You cannot mix binding explicitly assigned attributes with implicitly assigned ones (i.e. -1). Either all attributes must have their binding set to -1, or all must have explicit bindings.
Member

Suggested change
- [b]Note:[/b] You cannot mix binding explicitly assigned attributes with implicitly assigned ones (i.e. -1). Either all attributes must have their binding set to -1, or all must have explicit bindings.
+ [b]Note:[/b] You cannot mix binding explicitly assigned attributes with implicitly assigned ones (i.e. [code]-1[/code]). Either all attributes must have their binding set to [code]-1[/code], or all must have explicit bindings.

</tutorials>
<members>
<member name="binding" type="int" setter="set_binding" getter="get_binding" default="4294967295">
The index of the buffer in the vertex buffer array to bind this vertex attribute. When set to -1, it defaults to the index of the attribute.
Member

Suggested change
- The index of the buffer in the vertex buffer array to bind this vertex attribute. When set to -1, it defaults to the index of the attribute.
+ The index of the buffer in the vertex buffer array to bind this vertex attribute. When set to [code]-1[/code], it defaults to the index of the attribute.
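
The binding rule described in this documentation can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`AttribSketch`, `resolve_bindings`, `BINDING_UNSET`), not the engine's implementation, and it assumes "the index of the attribute" means the attribute's position in the array passed in.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel for "no explicit binding": -1 stored as an unsigned 32-bit value,
// i.e. 4294967295, matching the default shown in the XML above.
static const uint32_t BINDING_UNSET = UINT32_MAX;

// Hypothetical, trimmed-down attribute; only the fields needed for the rule.
struct AttribSketch {
    uint32_t location;
    uint32_t binding;
};

// Resolves bindings following the documented rule. Returns false when explicit
// and implicit bindings are mixed; otherwise fills r_bindings with one buffer
// index per attribute.
inline bool resolve_bindings(const std::vector<AttribSketch> &p_attribs, std::vector<uint32_t> &r_bindings) {
    size_t implicit_count = 0;
    for (const AttribSketch &a : p_attribs) {
        if (a.binding == BINDING_UNSET) {
            implicit_count++;
        }
    }
    if (implicit_count != 0 && implicit_count != p_attribs.size()) {
        return false; // Mixed explicit and implicit bindings are rejected.
    }
    r_bindings.clear();
    for (size_t i = 0; i < p_attribs.size(); i++) {
        // Implicit (assumption): the binding defaults to the attribute's array index.
        r_bindings.push_back(implicit_count ? uint32_t(i) : p_attribs[i].binding);
    }
    return true;
}
```

With explicit bindings, several attributes can name the same buffer index, which is what lets a set of attributes share one vertex buffer.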

@software-2

I ran some tests on my Samsung Galaxy Tab S9 FE+, which seems to be a worst-case device, using my original test program (here). @clayjohn should I run your modified version? (I didn't see it until after I finished)

I couldn't get this PR branch to run (the Android app would hang at the splash screen). Master at bd2ca13 ran just fine, so I pulled in this PR's commits on top of that.

4.5-dev3: 68-72fps lows, sometimes hitting 80
master: 72-80 lows, but frequently hitting the device max of 90
master + PR: 86-90 consistently, but I'm seeing dips sometimes to 74 briefly about every 10 seconds. (Those spikes seem to disappear when I have the Visual Profiler running, keeping a consistent 86+ when the profiler is on.)

For comparison, 4.3-stable (prior to the batching changes) gives lower overall frames (~82 on average), but does not have the occasional framerate dip.

This is absolutely a major improvement!

@clayjohn
Member

> I ran some tests on my Samsung Galaxy Tab S9 FE+, which seems to be a worst-case device, using my original test program (here). @clayjohn should I run your modified version? (I didn't see it until after I finished)

No need. My modified version just added a couple of lines of code to also test rendering a high number of sprites in a single batch.

> I couldn't get this PR branch to run (the Android app would hang at the splash screen). Master at bd2ca13 ran just fine, so I pulled in this PR's commits on top of that.

Sorry about that, there was an Android regression two days ago that was fixed yesterday in #112716. Pulling in this change on top of master was the right thing to do!

> For comparison, 4.3-stable (prior to the batching changes) gives lower overall frames (~82 on average), but does not have the occasional framerate dip.

Depending on your build settings, the dip may go away with official builds. We enable Swappy by default on official builds, which helps reduce frame dips. By default, Swappy is disabled for custom builds.

@stuartcarnie
Contributor Author

@AThousandShips I've removed the noexcept, fixed the documentation, and added you as a co-author.

@clayjohn shall I wait for @Ivorforce's response before removing the move semantics from HashMap?

@Ivorforce
Member

I've already replied; in short: The HashMap changes look good to me.

- Add support for vertex bindings and UMA vertex buffers in D3D12.
- Simplify 2D instance params and move more into per-batch data to save
  bandwidth

Co-authored-by: Skyth <19259897+blueskythlikesclouds@users.noreply.github.com>
Co-authored-by: Clay John <claynjohn@gmail.com>
Co-authored-by: A Thousand Ships <96648715+athousandships@users.noreply.github.com>
@Repiteo Repiteo merged commit 235d112 into godotengine:master Nov 14, 2025
20 checks passed
@Repiteo
Contributor

Repiteo commented Nov 14, 2025

Thanks!

@YeldhamDev
Member

This PR still has regressions, see #112938.

Development

Successfully merging this pull request may close these issues.

Performance regression in 4.4 on Android after introducing batching (GPU bottleneck)

9 participants