-
-
Notifications
You must be signed in to change notification settings - Fork 21.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[3.x] Vertex cache optimizer #86339
[3.x] Vertex cache optimizer #86339
Conversation
1a70da4
to
0d21752
Compare
0d21752
to
2414a9c
Compare
2414a9c
to
435262d
Compare
Could you give us a bit more information on how exactly you were benchmarking this PR? I played around with the vertex cache optimizations in the 4.x forward+ renderer last week and couldn't detect any significant difference (more than 4%) in performance on my 6900xt using the radeon profiler, no matter how many vertices I forced godot to render. At least in my 4.x fork I decided to skip vco in favor of a more radical optimisation experiments, but I would love to know in which scenarios this is still a useful optimization. I only had time to test on my amd gpu with vulkan, but vco might still make sense for other platforms/drivers. |
Sure. The demo project I included in the original post should show some of these differences. Some things to bear in mind:
My policy is always have at least a second low end PC available for testing and test often. I have 2 desktops (1 new, one 10 years old), 3 laptops (10-15 years old), 2 android tv boxes, tablet and phone I use for testing. 😁
If you can't see changes on high end hardware it can still be worth doing purely for low end users (as it should not be detrimental, unless you are getting fill rate + ordering effects). |
I was playing around with the meshoptimizer code we have in master. And I did my very best to create a scene that would have the highest possible benefit from vco. |
// Expecting triangles. | ||
ERR_FAIL_COND_V((indices.size() % 3) != 0, ERR_INVALID_PARAMETER); | ||
VertexCacheOptimizer opt; | ||
opt.reorder_indices_pool(indices, indices.size() / 3, p_vertex_array_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to do this earlier in the pipeline?
One side effect of doing it here is that the CPU and GPU representations of the mesh will get out of sync. For generated meshes, that would happen regardless, but for imported meshes, it might be an issue. When reading back a mesh from the RenderingServer, it will have these changes, but the version the user sent to the RenderingServer won't
That being said, unless something changed recently, we used to do a round trip to the GPU during mesh import, which means that this code will run during import anyway. The only problem is that it would then needlessly run again at run time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will have a look tomorrow.
Given this is controlled by a flag ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION
it might be possible to do something like remove the flag when loading from a file (rather than the import, or creating geometry at runtime).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok reminding myself of how this works, the flag ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION
is optional (it is not part of ARRAY_COMPRESS_DEFAULT
) and thus is only currently called during the import.
So due to the round trip on importing, the data saved after import is vertex cache optimized, but the optimization isn't run again at runtime while loading the files (I've just tested and confirmed this).
This does mean for user generated content, the user would need to explicitly set the flag if they want vertex cache optimization, but this seems reasonable as it may not be wanted in all cases (and particularly not for dynamic content).
I seem to remember considering these problems when writing the PR and stepped back from making it "fully automatic in all cases", to having to explicitly choose the optimization flag, to avoid applying it multiple times as you point out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. That makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Ah yes I can do those formatting changes. The reason they are like that BTW is that I didn't write that code .. see the copyright notice in the header. I only made the minimum changes required to get it to work with Godot. Sometimes afaik (e.g. in the case of third party folder) we don't alter the code at all and turn off formatting for this reason, especially where the license may not allow alteration, and for easy updating from upstream. But in this case I don't think there's a problem with changing the formatting to match our standard. |
3422f25
to
4bc9e81
Compare
Needs rebasing (probably after the MergeGroup merge). |
Optimizes indices to make good use of vertex cache on GPU.
4bc9e81
to
0aa22b8
Compare
Rebased. 👍 |
Thanks! |
Optimizes indices to make good use of vertex cache on GPU.
This is a modified version of Tom Forsyth's original code (which is quite old) for vertex cache optimization. There's probably some newer versions and I suspect
mesh_optimizer
does vertex cache optimization in 4.x (although currently it may not be used for the highest detail / non LOD meshes).Benefit is larger in higher poly meshes where vertex throughput is a bottleneck, however this is essentially a free speedup (done usually at import time) so why not.
Discussion and trade offs in index order
There is some discussion as to whether modern GPUs still use ye olde style vertex caches in the same way, but they still seem to benefit from the same local use of indices:
https://www.reddit.com/r/opengl/comments/js9a9t/is_vertex_cache_optimization_still_a_thing/
Although this optimizes for GPU vertex cache to increase vertex throughput, there are other considerations for triangle order. In particular it can be useful to draw large triangles front to back (this is similar principle to depth prepass, GPU can often reject later tris more efficiently if hidden by an earlier triangle, particularly in tiled renderer).
mesh_optimizer
may make some attempt to order by outer large tris first. This may sometimes be a win (if viewed from the outside) but maybe be subject to random effects because it is viewpoint dependent.We could alternatively use
mesh_optimizer
however this may be a bit more involved as we may want to also use the library for generating LODs. So deferring this may be a better option for now (we can easily slot in to replace the Forsyth code if desired).Additionally I'm soon going to be looking at progressive meshes and this may work better with bespoke decimator (for progressive mesh we may want vertex to be collapsed only once per LOD).
Demo
Load in the editor and run, note the FPS.
Select the santa obj model and click the import tab, turn on
vertex_cache_optimization
then clickreimport
.Run again, note change in FPS.
VertexCacheTest.zip
Notes
vertex_cache_optimization
for obj and other meshes, defaults to true..import
folder and letting the editor regenerate the optimized meshes. These will be backward compatible if reloaded in earlier version of editor, as the only change is to the indexing.ARRAY_FLAG_USE_VERTEX_CACHE_OPTIMIZATION
is optional and is not part ofARRAY_COMPRESS_DEFAULT
. It is only called on import, and users must set it explicitly if they want to apply it to their user generated content. This is to prevent over application.