Massively optimize canvas 2D rendering by using vertex buffers#112481
Massively optimize canvas 2D rendering by using vertex buffers#112481Repiteo merged 1 commit intogodotengine:masterfrom
Conversation
83ceba7 to
0251525
Compare
stuartcarnie
left a comment
There was a problem hiding this comment.
@clayjohn, et al some notes for your information
core/templates/hash_map.h
Outdated
| } | ||
| } | ||
|
|
||
| HashMap &operator=(HashMap &&p_other) noexcept { |
There was a problem hiding this comment.
We can std::move so we can transfer to another type. I'm using it in vertex_format_create for the bindings member:
VertexDescriptionCache &ce = vertex_formats.insert(id, VertexDescriptionCache())->value;
ce.vertex_formats = vertex_descriptions;
ce.bindings = std::move(bindings);
ce.driver_id = driver_id;There was a problem hiding this comment.
This is worth a separate core PR. Move semantics are desirable for all our containers, and that would make this PR completely non-core.
There was a problem hiding this comment.
@Ivorforce Are you okay with merging this change in this PR? I don't really want to block this PR while we wait for this optimization to be applied to other containers.
My preference in general is for core optimizations to be in the same PR where they are used as well so the git history shows why certain changes were needed
There was a problem hiding this comment.
Generally agree that optimizations should be introduced only when needed, but move semantics are needed for all our containers (and most of them do have it already). Looks like I just forgot to add them for hash maps.
Core changes particularly have a habit of being 'snuck in' in bigger PRs, which can make it very hard to spot and estimate the repercussions of. Generally I would expect core changes to be beneficial to not only the use-case of a PR, but the codebase in general. For example, I would normally expect benchmarks for common cases, if it's optimization related. That's why I prefer them in separate PRs when possible.
Anyway, they look fine to me (granted noexcept is removed), so I'm OK with merging them in here.
There was a problem hiding this comment.
noexcept has been removed - thanks for the feedback!
servers/rendering/renderer_rd/forward_clustered/render_forward_clustered.cpp
Show resolved
Hide resolved
| virtual VertexFormatID vertex_format_create(VectorView<VertexAttribute> p_vertex_attribs) = 0; | ||
| virtual VertexFormatID vertex_format_create(Span<VertexAttribute> p_vertex_attribs, const VertexAttributeBindingsMap &p_vertex_bindings) = 0; |
There was a problem hiding this comment.
Switching to Span means we can avoid allocations at call sites. We should evaluate all calls and use FixedVector, as the sizes in the renderer_rd are all known at compile time.
18e4eea to
f1ba020
Compare
f1ba020 to
a40aca9
Compare
| static constexpr uint32_t MIN_CAPACITY_INDEX = 2; // Use a prime. | ||
| static constexpr float MAX_OCCUPANCY = 0.75; | ||
| static constexpr uint32_t EMPTY_HASH = 0; | ||
| using KV = KeyValue<TKey, TValue>; // Type alias for easier access to KeyValue. |
There was a problem hiding this comment.
Particularly useful when you have a typedef such as:
typedef HashMap<uint32_t, VertexAttributeBinding> VertexAttributeBindingsMap;as you can then use:
for (const VertexAttributeBindingsMap::KV &ky : p_vertex_bindings) {
// ...
}
vs
for (const KeyValue<uint32_t, VertexAttributeBinding> &kv : p_vertex_bindings) {
// ...
}Also, if you change the Key or Value type, these call sites don't need to be updated.
6cfae71 to
329eec7
Compare
329eec7 to
4c65fbd
Compare
|
Some preliminary testing using a modified MRP from #104194 The original MRP has a loop that adds SubViewports that have a single Sprite rendered to them, then are rendered to the screen. That means that in a loop of N, we rendering N * 2 draw calls using N + 1 render passes (i.e. N render passes with 1 draw call and 1 render pass with N draw calls). None of the draw calls are batched, so this MRP exposes the worst case performance. It can be CPU bottlenecked on some hardware and GPU bottlenecked on others. The second test case renders N sprites with the same texture in the same location. This is the best case for batching. The number of draw calls depends on the hardware's capabilities, but it is often only 1 or a handful. Drawing that many Sprites in one location is a weird edge case for TBDR GPUs, so Mali GPUs and Apple silicon GPUs may not reflect typical performance scenarios. |
|
I can implement the D3D12 changes. Should I make a PR to your fork? |
core/templates/hash_map.h
Outdated
| } | ||
| } | ||
|
|
||
| HashMap(HashMap &&p_other) noexcept { |
There was a problem hiding this comment.
Do these noexcept have any effect? We don't use exceptions and as far as I know we don't use this directive elsewhere
There was a problem hiding this comment.
True – I'll remove them
b0722cf to
1f88bc0
Compare
There was a problem hiding this comment.
Amazing work! I have tested extensively on Linux, Windows, and Android in addition to your testing on MacOS. So I think we have covered our bases.
From testing it appears we have resolved the performance regression introduced from batching in almost all cases and even improved performance in many cases. At this point I think this PR is ready to go and get wider testing!
While discussing this with Stuart, we identified some further optimizations we could make. But the current state is really good and gives us the majority of possible gains with the least intrusive changes
doc/classes/RDVertexAttribute.xml
Outdated
| <members> | ||
| <member name="binding" type="int" setter="set_binding" getter="get_binding" default="4294967295"> | ||
| The index of the buffer in the vertex buffer array to bind this vertex attribute. When set to -1, it defaults to the index of the attribute. | ||
| [b]Note:[/b] You cannot mix binding explicitly assigned attributes with implicitly assigned ones (i.e. -1). Either all attributes must have their binding set to -1, or all must have explicit bindings. |
There was a problem hiding this comment.
| [b]Note:[/b] You cannot mix binding explicitly assigned attributes with implicitly assigned ones (i.e. -1). Either all attributes must have their binding set to -1, or all must have explicit bindings. | |
| [b]Note:[/b] You cannot mix binding explicitly assigned attributes with implicitly assigned ones (i.e. [code]-1[/code]). Either all attributes must have their binding set to [code]-1[/code], or all must have explicit bindings. |
doc/classes/RDVertexAttribute.xml
Outdated
| </tutorials> | ||
| <members> | ||
| <member name="binding" type="int" setter="set_binding" getter="get_binding" default="4294967295"> | ||
| The index of the buffer in the vertex buffer array to bind this vertex attribute. When set to -1, it defaults to the index of the attribute. |
There was a problem hiding this comment.
| The index of the buffer in the vertex buffer array to bind this vertex attribute. When set to -1, it defaults to the index of the attribute. | |
| The index of the buffer in the vertex buffer array to bind this vertex attribute. When set to [code]-1[/code], it defaults to the index of the attribute. |
|
I ran some tests on my Samsung Galaxy Tab S9 FE+, which seems to be a worst-case device, using my original test program (here). @clayjohn should I run your modified version? (I didn't see it until after I finished) I couldn't get this PR branch to run (the android app would hang at the splash screen). Master at bd2ca13 ran just fine, so I pulled in this PR's commits on top of that. 4.5-dev3: 68-72fps lows, sometimes hitting 80 For comparison, 4.3-stable (prior to the batching changes) gives lower overall frames (~82 on average), but does not have the occasional framerate dip. This is absolutely a major improvement! |
1f88bc0 to
c954518
Compare
No need. My modified version just added a couple lines of code to also test rendering a high number of sprites in a single batch
Sorry about that, there was an android regression two days ago that was fixed yesterday #112716. Pulling in this change on top of master was the right thing to do!
Depending on your build settings, the dip may go away with official builds. We enable swappy by default on official builds which helps reduce frame dips. By default, swappy is disabled for custom build |
c954518 to
2b264fe
Compare
|
@AThousandShips I've removed the @clayjohn shall I'll wait for @Ivorforce's response before removing the move semantics from |
|
I've already replied; in short: The |
2b264fe to
cfa9bab
Compare
- Add support for vertex bindings and UMA vertex buffers in D3D12. - Simplify 2D instance params and move more into per-batch data to save bandwidth Co-authored-by: Skyth <19259897+blueskythlikesclouds@users.noreply.github.com> Co-authored-by: Clay John <claynjohn@gmail.com> Co-authored-by: A Thousand Ships <96648715+athousandships@users.noreply.github.com>
cfa9bab to
90c0e6a
Compare
|
Thanks! |
|
This PR is still has regressions, see #112938. |
Summary
Closes #104194
This PR is a performance optimisation for the Canvas RD renderer after introducing batching. @clayjohn has observed a regression in performance under certain combinations and older hardware that this PR is intended to address. It should not regress existing performance gains. The core change is to switch instance data from a uniform buffer (shader storage buffer object) to a vertex buffer object (VBO).
Caution
D3D12 needs a validate
vertex_format_createto use the newVertexAttribute::bindingmember and to updatecommand_render_bind_vertex_buffersto handle UMA buffers.TODOs
dynamic_offsetindex to the vertex buffer binding. Update the drivers to use the offset (already similar for uniforms):VertexAttributechange, so a set of attributes can bind to the same bufferTesting
We must verify all rendered command types:
Verify:
INSTANCE_FLAGS_USE_MSDFINSTANCE_FLAGS_USE_LCDBenchmarks
See below for more detail