[3.x] Vertex cache optimizer #86339

Merged (1 commit) on Feb 7, 2024

@lawnjelly lawnjelly commented Dec 19, 2023

Optimizes indices to make good use of vertex cache on GPU.

This is a modified version of Tom Forsyth's original code (which is quite old) for vertex cache optimization. There are probably newer versions, and I suspect mesh_optimizer does vertex cache optimization in 4.x (although currently it may not be used for the highest detail / non-LOD meshes).

The benefit is larger in higher poly meshes where vertex throughput is a bottleneck. However, this is essentially a free speedup (usually done at import time), so why not.

Discussion and trade offs in index order

There is some discussion as to whether modern GPUs still use ye olde style vertex caches in the same way, but they still seem to benefit from the same local use of indices:
https://www.reddit.com/r/opengl/comments/js9a9t/is_vertex_cache_optimization_still_a_thing/
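
To make "local use of indices" concrete, here is a minimal sketch that estimates the average cache miss ratio (ACMR, vertex shader invocations per triangle) of an index buffer under a simple FIFO cache model. The function name `fifo_acmr` and the FIFO model itself are assumptions for illustration; they are not part of this PR, and real GPUs may cache vertices quite differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical helper (not from the PR): estimates vertex shader invocations
// per triangle for a triangle list, assuming a FIFO post-transform cache.
double fifo_acmr(const std::vector<uint32_t> &indices, std::size_t cache_size) {
	if (indices.size() < 3) {
		return 0.0; // No complete triangle to measure.
	}
	std::deque<uint32_t> cache;
	std::size_t misses = 0;
	for (uint32_t index : indices) {
		// A hit means the vertex was transformed recently and can be reused.
		if (std::find(cache.begin(), cache.end(), index) == cache.end()) {
			++misses;
			cache.push_back(index);
			if (cache.size() > cache_size) {
				cache.pop_front(); // Evict the oldest entry.
			}
		}
	}
	return static_cast<double>(misses) / static_cast<double>(indices.size() / 3);
}
```

Under this model, two triangles sharing an edge (`0,1,2, 2,1,3`) with a large enough cache give 4 misses over 2 triangles, so an ACMR of 2.0; a well ordered large mesh should get much lower, and a badly ordered one approaches 3.0.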

Although this optimizes for the GPU vertex cache to increase vertex throughput, there are other considerations for triangle order. In particular it can be useful to draw large triangles front to back (this is a similar principle to a depth prepass: the GPU can often reject later tris more efficiently if they are hidden by an earlier triangle, particularly in a tiled renderer).
mesh_optimizer may make some attempt to order by outer large tris first. This may sometimes be a win (if viewed from the outside) but may be subject to random effects because it is viewpoint dependent.

We could alternatively use mesh_optimizer, however this may be a bit more involved as we may also want to use the library for generating LODs. So deferring this may be a better option for now (we can easily slot it in to replace the Forsyth code if desired).

Additionally I'm soon going to be looking at progressive meshes, and this may work better with a bespoke decimator (for a progressive mesh we may want each vertex to be collapsed only once per LOD).

Demo

Load in the editor and run, note the FPS.
Select the santa obj model and click the import tab, turn on vertex_cache_optimization then click reimport.
Run again, note change in FPS.
VertexCacheTest.zip

Notes

  • Increases fps by 30-40% in a high poly test project.
  • Benefits depend on your GPU and drivers, older / slower hardware seems to benefit more. (Some drivers may even do this step automatically when you upload an index buffer, in which case you would see no difference in benchmarks.)
  • Adds extra import option vertex_cache_optimization for obj and other meshes, defaults to true.
  • Turning this off is useful for importing "exact" meshes, and pre-processed meshes, however for the general case defaulting to true seems sensible as many models are not vertex cache optimized.
  • Probably needs double checking I haven't broken anything by adding the extra flag. 😁
  • Existing projects can be "upgraded" simply by deleting the .import folder and letting the editor regenerate the optimized meshes. These will be backward compatible if reloaded in an earlier version of the editor, as the only change is to the indexing.
  • The flag ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION is optional and is not part of ARRAY_COMPRESS_DEFAULT. It is only called on import, and users must set it explicitly if they want to apply it to their user generated content. This is to prevent over application.


Ansraer commented Jan 3, 2024

Could you give us a bit more information on how exactly you were benchmarking this PR? I played around with the vertex cache optimizations in the 4.x forward+ renderer last week and couldn't detect any significant difference (more than 4%) in performance on my 6900xt using the radeon profiler, no matter how many vertices I forced godot to render.

At least in my 4.x fork I decided to skip vco in favor of more radical optimisation experiments, but I would love to know in which scenarios this is still a useful optimization. I only had time to test on my amd gpu with vulkan, but vco might still make sense for other platforms/drivers.

@lawnjelly

Sure. The demo project I included in the original post should show some of these differences.

Some things to bear in mind:

  • Changes in performance will depend on how well your original model is optimized in the first place. If you have an already optimized mesh, then running this will not be able to improve it. Conversely, indexing that jumps all over the place should offer the most potential for improvement. You can artificially create this by e.g. randomizing your index buffer triangles.
  • Whether you see a difference will depend on the extent to which you are bottlenecked by the vertex shader. You can artificially increase this by using a small screen size (eliminating fill rate) and high poly models.
  • This will also likely depend on your hardware. I test mainly using low end hardware (mainly a low end PC with integrated GPU, and Android), for a number of reasons, but in this case it shows up bottlenecks nicely, whereas a high end GPU will often eat up such bottlenecks.
  • Results may differ between OpenGL and Vulkan (not sure on this).

My policy is to always have at least a second low end PC available for testing, and to test often. I have 2 desktops (1 new, one 10 years old), 3 laptops (10-15 years old), 2 android tv boxes, a tablet and a phone I use for testing. 😁

  • Some hardware may not show these index buffer locality effects so much, depending on how it works internally.

If you can't see changes on high end hardware it can still be worth doing purely for low end users (as it should not be detrimental, unless you are getting fill rate + ordering effects).
Also bear in mind 4.x already uses mesh_optimizer for LODs which may well already be doing vertex cache optimization (among other things).
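
For anyone wanting to build the worst case mentioned above, here is a rough sketch of randomizing the triangle order of an index buffer while keeping each triangle's three indices together. The name `shuffle_triangles` and the seed parameter are made up for illustration and are not part of this PR.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not from the PR): returns the same triangles in a
// random order, which destroys index locality for benchmarking purposes.
// Any trailing indices that do not form a full triangle are dropped.
std::vector<uint32_t> shuffle_triangles(const std::vector<uint32_t> &indices, uint32_t seed) {
	const std::size_t tri_count = indices.size() / 3;
	std::vector<std::array<uint32_t, 3>> tris(tri_count);
	for (std::size_t i = 0; i < tri_count; i++) {
		tris[i] = { indices[3 * i], indices[3 * i + 1], indices[3 * i + 2] };
	}

	// Fixed seed so that benchmark runs are repeatable.
	std::mt19937 rng(seed);
	std::shuffle(tris.begin(), tris.end(), rng);

	std::vector<uint32_t> out;
	out.reserve(tri_count * 3);
	for (const std::array<uint32_t, 3> &tri : tris) {
		out.insert(out.end(), tri.begin(), tri.end());
	}
	return out;
}
```

Winding order within each triangle is preserved, so the mesh should render identically; only the order in which the GPU sees the triangles changes.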


Ansraer commented Jan 6, 2024

I was playing around with the meshoptimizer code we have in master, and I did my very best to create a scene that would have the highest possible benefit from vco.
The thing I didn't do was test this on older HW or with opengl, which is probably why I barely saw any performance changes.
I wouldn't be surprised if AMD already does something similar in the driver when you use a modern gpu and vulkan. And the fact that my 6900xt has a fairly big cache would probably also help to hide any performance benefits, since the odds that a vertex has been cached already are simply higher out of the box.

// Expecting triangles.
ERR_FAIL_COND_V((indices.size() % 3) != 0, ERR_INVALID_PARAMETER);
VertexCacheOptimizer opt;
opt.reorder_indices_pool(indices, indices.size() / 3, p_vertex_array_len);
Member

Would it be possible to do this earlier in the pipeline?

One side effect of doing it here is that the CPU and GPU representations of the mesh will get out of sync. For generated meshes, that would happen regardless, but for imported meshes, it might be an issue. When reading back a mesh from the RenderingServer, it will have these changes, but the version the user sent to the RenderingServer won't.

That being said, unless something changed recently, we used to do a round trip to the GPU during mesh import, which means that this code will run during import anyway. The only problem is that it would then needlessly run again at run time.

Member Author

Good point, will have a look tomorrow.

Given this is controlled by a flag ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION it might be possible to do something like remove the flag when loading from a file (rather than the import, or creating geometry at runtime).

Member Author

Ok reminding myself of how this works, the flag ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION is optional (it is not part of ARRAY_COMPRESS_DEFAULT) and thus is only currently called during the import.

So due to the round trip on importing, the data saved after import is vertex cache optimized, but the optimization isn't run again at runtime while loading the files (I've just tested and confirmed this).

This does mean for user generated content, the user would need to explicitly set the flag if they want vertex cache optimization, but this seems reasonable as it may not be wanted in all cases (and particularly not for dynamic content).

I seem to remember considering these problems when writing the PR and stepped back from making it "fully automatic in all cases", to having to explicitly choose the optimization flag, to avoid applying it multiple times as you point out.
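
As a rough illustration of the opt-in behaviour described here: the flag is a separate bit that is not contained in the default mask, so it has to be OR'ed in explicitly, and it could be masked off again when loading already optimized data. The bit values and helper names below are made up for the sketch (hence the `sketch` namespace); only the two flag names come from this PR, and the real engine values differ.

```cpp
#include <cstdint>

namespace sketch {

// Hypothetical bit values for illustration only, NOT the engine's real values.
constexpr uint32_t ARRAY_COMPRESS_DEFAULT = 0x00FFu;
constexpr uint32_t ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION = 0x0100u;

// True only if the caller explicitly asked for the optimization.
constexpr bool wants_vertex_cache_optimization(uint32_t p_flags) {
	return (p_flags & ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION) != 0;
}

// Flags an importer might pass when the import option is enabled.
constexpr uint32_t import_flags() {
	return ARRAY_COMPRESS_DEFAULT | ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION;
}

// Clearing the bit avoids optimizing data that is already optimized.
constexpr uint32_t strip_optimization_flag(uint32_t p_flags) {
	return p_flags & ~ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION;
}

} // namespace sketch
```

With this layout, passing only the default flags never triggers the optimizer, which matches the "explicit for user generated content" behaviour above.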

Member

I see. That makes sense!

@clayjohn clayjohn left a comment

Looks good to me!

Nine review threads on core/math/vertex_cache_optimizer.cpp (outdated, resolved).
@lawnjelly

Ah yes I can do those formatting changes.

The reason they are like that BTW is that I didn't write that code; see the copyright notice in the header. I only made the minimum changes required to get it to work with Godot. Sometimes afaik (e.g. in the case of the third party folder) we don't alter the code at all and turn off formatting for this reason, especially where the license may not allow alteration, and for easy updating from upstream.

But in this case I don't think there's a problem with changing the formatting to match our standard.

@akien-mga

Needs rebasing (probably after the MergeGroup merge).

@lawnjelly

Rebased. 👍

@akien-mga akien-mga merged commit dbe3eca into godotengine:3.x Feb 7, 2024
13 checks passed
@akien-mga

Thanks!

@lawnjelly lawnjelly deleted the vertex_cache_optimizer branch February 7, 2024 10:12