[3.x] Vertex cache optimizer #86339

lawnjelly · 2023-12-19T20:04:02Z

Optimizes indices to make good use of vertex cache on GPU.

This is a modified version of Tom Forsyth's original code (which is quite old) for vertex cache optimization. There are probably newer versions, and I suspect mesh_optimizer does vertex cache optimization in 4.x (although currently it may not be used for the highest detail / non-LOD meshes).
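
For readers unfamiliar with the approach: the core of Forsyth's algorithm is a per-vertex score, and the triangle whose three vertices have the highest summed score is emitted next. Below is a minimal sketch of that scoring function. The constants are the ones from Forsyth's published description as far as I recall them, the function name and signature are illustrative only, and the modified code in this PR may well differ.

```cpp
#include <cassert>
#include <cmath>

// Constants as recalled from Forsyth's published description (assumption:
// the modified version in this PR may use different values).
constexpr int kCacheSize = 32;
constexpr float kCacheDecayPower = 1.5f;
constexpr float kLastTriScore = 0.75f;
constexpr float kValenceBoostScale = 2.0f;
constexpr float kValenceBoostPower = 0.5f;

// cache_position: slot in the simulated LRU cache (0 = most recently used),
//                 or -1 if the vertex is not in the cache.
// remaining_tris: number of not-yet-emitted triangles still using the vertex.
float forsyth_vertex_score(int cache_position, int remaining_tris) {
	assert(cache_position < kCacheSize);
	if (remaining_tris <= 0) {
		return -1.0f; // No triangle needs this vertex any more.
	}
	float score = 0.0f;
	if (cache_position >= 0) {
		if (cache_position < 3) {
			// Used by the most recent triangle: fixed score, so that the
			// order within that triangle does not matter.
			score = kLastTriScore;
		} else {
			// Score decays from 1 towards 0 as the vertex ages in the cache.
			const float scaler = 1.0f / (kCacheSize - 3);
			score = std::pow(1.0f - (cache_position - 3) * scaler, kCacheDecayPower);
		}
	}
	// Boost vertices with few triangles left, so isolated triangles are
	// finished off early rather than left stranded.
	score += kValenceBoostScale * std::pow(static_cast<float>(remaining_tris), -kValenceBoostPower);
	return score;
}
```

The valence boost is what keeps the greedy choice from leaving lone triangles behind, which would otherwise each cost three cache misses later.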

The benefit is larger for higher-poly meshes where vertex throughput is the bottleneck. However, this is essentially a free speedup (usually done at import time), so why not.

Discussion and trade offs in index order

There is some discussion as to whether modern GPUs still use ye olde style vertex caches in the same way, but they still seem to benefit from the same local use of indices:
https://www.reddit.com/r/opengl/comments/js9a9t/is_vertex_cache_optimization_still_a_thing/
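
One hardware-independent way to put a number on "local use of indices" is the average cache miss ratio (ACMR): simulated vertex cache misses per triangle. The sketch below assumes a simple FIFO cache of a chosen size, which is a modelling assumption rather than a description of any particular GPU. The function name is mine and is not part of this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Simulates a FIFO vertex cache and returns misses per triangle.
// Lower is better; 3.0 means every index missed the cache.
double average_cache_miss_ratio(const std::vector<uint32_t> &indices, std::size_t cache_size) {
	assert(indices.size() % 3 == 0); // Expecting triangles.
	assert(cache_size > 0);
	if (indices.empty()) {
		return 0.0;
	}
	std::deque<uint32_t> cache;
	std::size_t misses = 0;
	for (uint32_t index : indices) {
		if (std::find(cache.begin(), cache.end(), index) != cache.end()) {
			continue; // Hit: a FIFO cache does not reorder on a hit.
		}
		++misses;
		cache.push_back(index);
		if (cache.size() > cache_size) {
			cache.pop_front(); // Evict the oldest entry.
		}
	}
	return static_cast<double>(misses) / static_cast<double>(indices.size() / 3);
}
```

Comparing this value for an index buffer before and after reordering gives a rough idea of how much there is to gain, independent of whether a given driver or GPU shows it in FPS.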

Although this optimizes for the GPU vertex cache to increase vertex throughput, there are other considerations for triangle order. In particular, it can be useful to draw large triangles front to back (this is a similar principle to a depth prepass: the GPU can often reject later tris more efficiently if they are hidden by an earlier triangle, particularly in a tiled renderer).
mesh_optimizer may make some attempt to order outer large tris first. This may sometimes be a win (if viewed from the outside) but may be subject to random effects because it is viewpoint dependent.

We could alternatively use mesh_optimizer, however this may be a bit more involved as we may also want to use the library for generating LODs. So deferring this may be a better option for now (we can easily slot it in to replace the Forsyth code if desired).

Additionally, I'm soon going to be looking at progressive meshes, and this may work better with a bespoke decimator (for a progressive mesh we may want each vertex to be collapsed only once per LOD).

Demo

Load in the editor and run, note the FPS.
Select the santa obj model and click the import tab, turn on vertex_cache_optimization then click reimport.
Run again, note change in FPS.
VertexCacheTest.zip

Notes

Increases fps by 30-40% in a high poly test project.
Benefits depend on your GPU and drivers, older / slower hardware seems to benefit more. (Some drivers may even do this step automatically when you upload an index buffer, in which case you would see no difference in benchmarks.)
Adds extra import option vertex_cache_optimization for obj and other meshes, defaults to true.
Turning this off is useful for importing "exact" meshes, and pre-processed meshes, however for the general case defaulting to true seems sensible as many models are not vertex cache optimized.
Probably needs double checking I haven't broken anything by adding the extra flag. 😁
Existing projects can be "upgraded" simply by deleting the .import folder and letting the editor regenerate the optimized meshes. These will be backward compatible if reloaded in an earlier version of the editor, as the only change is to the indexing.
The flag ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION is optional and is not part of ARRAY_COMPRESS_DEFAULT. It is only called on import, and users must set it explicitly if they want to apply it to their user generated content. This is to prevent over application.
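
To illustrate the opt-in behaviour described in the last note, here is a small standalone model of the gating. Everything in it is hypothetical: the bit values, the `sketch` namespace, and `prepare_indices` are made up for illustration and are not Godot's real constants or API. The only things taken from the PR are the idea that the flag sits outside the default mask and the "expecting triangles" check.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

namespace sketch {

// Hypothetical bit values, NOT the engine's real ones.
constexpr uint32_t COMPRESS_DEFAULT = 0x0000FF00u;
constexpr uint32_t FLAG_USE_VERTEX_CACHE_OPTIMIZATION = 1u << 20;

// The optimization flag is opt-in: it must not be part of the default mask.
static_assert((COMPRESS_DEFAULT & FLAG_USE_VERTEX_CACHE_OPTIMIZATION) == 0,
		"optimization flag must not be included in the default flags");

using ReorderFn = std::function<void(std::vector<uint32_t> &)>;

// Reorders the indices only when the caller asked for it.
// Returns false if optimization was requested on a non-triangle index list.
bool prepare_indices(std::vector<uint32_t> &indices, uint32_t flags, const ReorderFn &reorder) {
	assert(static_cast<bool>(reorder));
	if ((flags & FLAG_USE_VERTEX_CACHE_OPTIMIZATION) == 0) {
		return true; // Default path: indices are left exactly as supplied.
	}
	if (indices.size() % 3 != 0) {
		return false; // Expecting triangles.
	}
	reorder(indices);
	return true;
}

} // namespace sketch
```

In this model an importer would pass `COMPRESS_DEFAULT | FLAG_USE_VERTEX_CACHE_OPTIMIZATION`, while runtime loading and user generated content that pass only the defaults never trigger a reorder.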

core/math/vertex_cache_optimizer.h

Ansraer · 2024-01-03T19:35:40Z

Could you give us a bit more information on how exactly you were benchmarking this PR? I played around with the vertex cache optimizations in the 4.x forward+ renderer last week and couldn't detect any significant difference (more than 4%) in performance on my 6900xt using the radeon profiler, no matter how many vertices I forced godot to render.

At least in my 4.x fork I decided to skip vco in favor of more radical optimisation experiments, but I would love to know in which scenarios this is still a useful optimization. I only had time to test on my AMD GPU with Vulkan, but vco might still make sense for other platforms/drivers.

lawnjelly · 2024-01-03T20:05:07Z

Sure. The demo project I included in the original post should show some of these differences.

Some things to bear in mind:

Changes in performance will depend on how well your original model is optimized in the first place. If you have an already optimized mesh, then running this will not be able to improve it. Conversely, indexing that jumps all over the place should offer the most potential for improvement. You can artificially create this by e.g. randomizing your index buffer triangles.
Whether you see a difference will depend on the extent to which you are bottlenecked by the vertex shader. You can artificially increase this by using a small screen size (eliminating fill rate as a factor) and high poly models.
This will also likely depend on your hardware. I test mainly using low end hardware (mainly a low end PC with integrated GPU, and Android), for a number of reasons, but in this case it shows up bottlenecks nicely, whereas a high end GPU will often eat up such bottlenecks.
Results may be different on OpenGL versus Vulkan (not sure on this).
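
As a rough sketch of the "randomize your index buffer triangles" idea from the first point: shuffle whole triangles rather than individual indices, so the mesh renders identically but with poor index locality. The helper name and the fixed-seed parameter are my own choices for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Returns a copy of the index buffer with its triangles in random order.
// The three indices of each triangle stay together and keep their winding.
std::vector<uint32_t> shuffle_triangles(const std::vector<uint32_t> &indices, uint32_t seed) {
	assert(indices.size() % 3 == 0); // Expecting triangles.
	std::vector<std::array<uint32_t, 3>> tris(indices.size() / 3);
	for (std::size_t t = 0; t < tris.size(); ++t) {
		tris[t] = { indices[t * 3], indices[t * 3 + 1], indices[t * 3 + 2] };
	}
	std::mt19937 rng(seed); // Fixed seed keeps benchmark runs repeatable.
	std::shuffle(tris.begin(), tris.end(), rng);

	std::vector<uint32_t> shuffled;
	shuffled.reserve(indices.size());
	for (const auto &tri : tris) {
		shuffled.insert(shuffled.end(), tri.begin(), tri.end());
	}
	return shuffled;
}
```

Feeding a deliberately shuffled mesh through the importer with the option off and then on gives something close to a worst case to compare against.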

My policy is to always have at least a second low end PC available for testing, and to test often. I have 2 desktops (1 new, one 10 years old), 3 laptops (10-15 years old), 2 android tv boxes, a tablet and a phone I use for testing. 😁

Some hardware may not show these index buffer locality effects so much, depending on how it works internally.

If you can't see changes on high end hardware it can still be worth doing purely for low end users (as it should not be detrimental, unless you are getting fill rate + ordering effects).
Also bear in mind 4.x already uses mesh_optimizer for LODs which may well already be doing vertex cache optimization (among other things).

Ansraer · 2024-01-06T11:48:55Z

I was playing around with the meshoptimizer code we have in master. And I did my very best to create a scene that would have the highest possible benefit from vco.
The thing I didn't do was test this on older HW or with opengl, that is probably why I barely saw any performance changes.
I wouldn't be surprised if AMD already does something similar in the driver when you use a modern GPU and Vulkan. And the fact that my 6900xt has a fairly big cache would probably also help to hide any performance benefits, since the odds that a vertex has already been cached are simply higher out of the box.

clayjohn · 2024-01-29T20:09:29Z

servers/visual_server.cpp

+					// Expecting triangles.
+					ERR_FAIL_COND_V((indices.size() % 3) != 0, ERR_INVALID_PARAMETER);
+					VertexCacheOptimizer opt;
+					opt.reorder_indices_pool(indices, indices.size() / 3, p_vertex_array_len);


Would it be possible to do this earlier in the pipeline?

One side effect of doing it here is that the CPU and GPU representations of the mesh will get out of sync. For generated meshes, that would happen regardless, but for imported meshes, it might be an issue. When reading back a mesh from the RenderingServer, it will have these changes, but the version the user sent to the RenderingServer won't.

That being said, unless something changed recently, we used to do a round trip to the GPU during mesh import, which means that this code will run during import anyway. The only problem is that it would then needlessly run again at run time.

lawnjelly

Good point, will have a look tomorrow.

Given this is controlled by a flag ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION it might be possible to do something like remove the flag when loading from a file (rather than the import, or creating geometry at runtime).

Ok reminding myself of how this works, the flag ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION is optional (it is not part of ARRAY_COMPRESS_DEFAULT) and thus is only currently called during the import.

So due to the round trip on importing, the data saved after import is vertex cache optimized, but the optimization isn't run again at runtime while loading the files (I've just tested and confirmed this).

This does mean for user generated content, the user would need to explicitly set the flag if they want vertex cache optimization, but this seems reasonable as it may not be wanted in all cases (and particularly not for dynamic content).

I seem to remember considering these problems when writing the PR and stepped back from making it "fully automatic in all cases", to having to explicitly choose the optimization flag, to avoid applying it multiple times as you point out.

clayjohn

I see. That makes sense!

clayjohn

Looks good to me!

core/math/vertex_cache_optimizer.cpp

lawnjelly · 2024-01-31T10:44:35Z

Ah yes I can do those formatting changes.

The reason they are like that BTW is that I didn't write that code; see the copyright notice in the header. I only made the minimum changes required to get it to work with Godot. Sometimes afaik (e.g. in the case of the third party folder) we don't alter the code at all and turn off formatting for this reason, especially where the license may not allow alteration, and for easy updating from upstream.

But in this case I don't think there's a problem with changing the formatting to match our standard.

akien-mga · 2024-02-07T08:51:47Z

Needs rebasing (probably after the MergeGroup merge).

lawnjelly · 2024-02-07T09:43:12Z

Rebased. 👍

akien-mga · 2024-02-07T10:11:13Z

Thanks!

lawnjelly added enhancement topic:rendering topic:3d labels Dec 19, 2023

lawnjelly added this to the 3.6 milestone Dec 19, 2023

lawnjelly force-pushed the vertex_cache_optimizer branch from 1a70da4 to 0d21752 Compare December 19, 2023 20:22

lawnjelly mentioned this pull request Dec 20, 2023

Crash with cpuparticles2D?. #86323

Closed

lawnjelly force-pushed the vertex_cache_optimizer branch from 0d21752 to 2414a9c Compare December 20, 2023 11:07

lawnjelly marked this pull request as ready for review December 20, 2023 11:34

lawnjelly requested review from a team as code owners December 20, 2023 11:34

AThousandShips reviewed Dec 20, 2023

core/math/vertex_cache_optimizer.h (outdated)

lawnjelly force-pushed the vertex_cache_optimizer branch from 2414a9c to 435262d Compare December 20, 2023 14:09

clayjohn reviewed Jan 29, 2024

clayjohn approved these changes Jan 30, 2024

AThousandShips reviewed Jan 31, 2024

lawnjelly force-pushed the vertex_cache_optimizer branch 2 times, most recently from 3422f25 to 4bc9e81 Compare January 31, 2024 10:57

lawnjelly mentioned this pull request Feb 1, 2024

The number of vertices imported from obj file is not correct #75279

Open

Vertex cache optimizer

0aa22b8

Optimizes indices to make good use of vertex cache on GPU.

lawnjelly force-pushed the vertex_cache_optimizer branch from 4bc9e81 to 0aa22b8 Compare February 7, 2024 09:38

akien-mga merged commit dbe3eca into godotengine:3.x Feb 7, 2024
13 checks passed

lawnjelly deleted the vertex_cache_optimizer branch February 7, 2024 10:12
