-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Pack multiple vertex and index arrays together into growable buffers. #14257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pack multiple vertex and index arrays together into growable buffers. #14257
Conversation
067cd9e to
6403a60
Compare
This commit uses the [`offset-allocator`] crate to combine vertex and index arrays from different meshes into single buffers. Since the primary source of `wgpu` overhead is from validation and synchronization when switching buffers, this significantly improves Bevy's rendering performance on many scenes. This patch is a more flexible version of bevyengine#13218, which also used slabs. Unlike bevyengine#13218, which used slabs of a fixed size, this commit implements slabs that start small and can grow. In addition to reducing memory usage, supporting slab growth reduces the number of vertex and index buffer switches that need to happen during rendering, leading to improved performance. To prevent pathological fragmentation behavior, slabs are capped to a maximum size, and mesh arrays that are too large get their own dedicated slabs. As an additional improvement over bevyengine#13218, this commit allows the application to customize all allocator heuristics. The `MeshAllocatorSettings` resource contains values that adjust the minimum and maximum slab sizes, the cutoff point at which meshes get their own dedicated slabs, and the rate at which slabs grow. Hopefully-sensible defaults have been chosen for each value. Unfortunately, WebGL 2 doesn't support the *base vertex* feature, which is necessary to pack vertex arrays from different meshes into the same buffer. `wgpu` represents this restriction as the downlevel flag `BASE_VERTEX`. This patch detects that bit and ensures that all vertex buffers get dedicated slabs on that platform. Even on WebGL 2, though, we can combine all *index* arrays into single buffers to reduce buffer changes, and we do so.
6403a60 to
6c17a07
Compare
JMS55
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great docs as always. Looks generally good. If there any bugs or perf overhead from the allocator itself, we can iron them out later. I'm not confident that separate slabs for large buffers are needed, but we can try it out, I don't feel strongly about removing them either. Bevy really needs an example/benchmark for loading and unloading multiple assets in the background during gameplay.
The one thing I'm concerned about is that GpuMesh seems to be in an akward spot now, given that it doesn't actually hold any mesh data anymore. Not sure what we want to do with it now, I'll let @superdump weigh in.
I'd like to see:
- Perf test on WebGL2 to see how much of a hit the allocator logic is vs main
- A bit more detailed migration guide, specifically mentioning the GpuMesh change and how it now lives in MeshAllocator. Please mention the full types, not just the concepts.
| Render, | ||
| allocate_and_free_meshes | ||
| .in_set(RenderSet::PrepareAssets) | ||
| .before(prepare_assets::<GpuMesh>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why before here, and not after? This stands out as very weird to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because prepare_assets clears the contents of the ExtractedAssets<RenderMesh> resource after processing it. Because allocate_and_free_meshes needs to use that data, it has to run first. (We don't want to merge allocate_and_free_meshes with RenderMesh::prepare_asset because, among other reasons, that would result in a lot of useless copying when a bunch of meshes load all at the same time and the buffer grows multiple times in a single frame, which happens a lot.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I correct that this this system runs on a 1-frame delay then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The order in a single frame is this:
extract_render_assetpopulatesExtractedAssets<RenderMesh>during the extract phase.allocate_and_free_meshesexamines theExtractedAssets<RenderMesh>to perform the allocations during the asset prep phase.prepare_assetprocesses theExtractedAssets<RenderMesh>again, and clears it out.
So extract_render_asset produces the data, allocate_and_free_meshes inspects the data, and finally prepare_asset consumes the data. This pipeline happens entirely during the frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, thanks for the explanation. This makes sense to me, and mimics how I have it setup in my own code.
|
The original PR was marked for release notes, so I'm going to add it here too! |
|
I've addressed all the review comments as much as I feel is appropriate. From running the examples on WebGL the performance seems fine. I can't do very in-depth performance analysis there without better diagnostics, and I don't think we should block this PR on better WebGL diagnostics. |
tychedelia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for the excellent documentation. Two small comments:
- It might be nice to have some logging to warn when a mesh spills over into the large object slab if it's near the boundary.
- We should follow up to rename the other
RenderAssete.gGpuImage-> `RenderImage for consistency per the changes here.
|
The issue with the bistro interior seems to have been resolved. I can replicate the opaque pass improvements. FPS did not improve in tests of low-end hardware (I am totally gpu bottlenecked on most scenes), but it also doesn't regress which is the main thing I was concerned about. |
|
The merge fallout is fixed. This is ready to be merged again. |
The "uberbuffers" PR bevyengine#14257 caused some examples to fail intermittently for different reasons: 1. `morph_targets` could fail because vertex displacements for morph targets are keyed off the vertex index. With buffer packing, the vertex index can vary based on the position in the buffer, which caused the morph targets to be potentially incorrect. The solution is to include the first vertex index with the `MeshUniform` (and `MeshInputUniform` if GPU preprocessing is in use), so that the shader can calculate the true vertex index before performing the morph operation. This results in wasted space in `MeshUniform`, which is unfortunate, but we'll soon be filling in the padding with the ID of the material when bindless textures land, so this had to happen sooner or later anyhow. Including the vertex index in the `MeshInputUniform` caused an ordering problem. The `MeshInputUniform` was created during the extraction phase, before the allocations occurred, so the extraction logic didn't know where the mesh vertex data was going to end up. The solution is to move the `MeshInputUniform` creation (the `collect_meshes_for_gpu_building` system) to after the allocations phase. This should be better for parallelism anyhow, because it allows the extraction phase to finish quicker. It's also something we'll have to do for bindless in any event. 2. The `lines` and `fog_volumes` examples could fail because their custom drawing nodes weren't updated to supply the vertex and index offsets in their `draw_indexed` and `draw` calls. This commit fixes this oversight. Fixes bevyengine#14366.
The "uberbuffers" PR #14257 caused some examples to fail intermittently for different reasons: 1. `morph_targets` could fail because vertex displacements for morph targets are keyed off the vertex index. With buffer packing, the vertex index can vary based on the position in the buffer, which caused the morph targets to be potentially incorrect. The solution is to include the first vertex index with the `MeshUniform` (and `MeshInputUniform` if GPU preprocessing is in use), so that the shader can calculate the true vertex index before performing the morph operation. This results in wasted space in `MeshUniform`, which is unfortunate, but we'll soon be filling in the padding with the ID of the material when bindless textures land, so this had to happen sooner or later anyhow. Including the vertex index in the `MeshInputUniform` caused an ordering problem. The `MeshInputUniform` was created during the extraction phase, before the allocations occurred, so the extraction logic didn't know where the mesh vertex data was going to end up. The solution is to move the `MeshInputUniform` creation (the `collect_meshes_for_gpu_building` system) to after the allocations phase. This should be better for parallelism anyhow, because it allows the extraction phase to finish quicker. It's also something we'll have to do for bindless in any event. 2. The `lines` and `fog_volumes` examples could fail because their custom drawing nodes weren't updated to supply the vertex and index offsets in their `draw_indexed` and `draw` calls. This commit fixes this oversight. Fixes #14366.
|
Thank you to everyone involved with the authoring or reviewing of this PR! This work is relatively important and needs release notes! Head over to bevyengine/bevy-website#1662 if you'd like to help out. |
This commit uses the
offset-allocatorcrate to combine vertex and index arrays from different meshes into single buffers. Since the primary source ofwgpuoverhead is from validation and synchronization when switching buffers, this significantly improves Bevy's rendering performance on many scenes.This patch is a more flexible version of #13218, which also used slabs. Unlike #13218, which used slabs of a fixed size, this commit implements slabs that start small and can grow. In addition to reducing memory usage, supporting slab growth reduces the number of vertex and index buffer switches that need to happen during rendering, leading to improved performance. To prevent pathological fragmentation behavior, slabs are capped to a maximum size, and mesh arrays that are too large get their own dedicated slabs.
As an additional improvement over #13218, this commit allows the application to customize all allocator heuristics. The
MeshAllocatorSettingsresource contains values that adjust the minimum and maximum slab sizes, the cutoff point at which meshes get their own dedicated slabs, and the rate at which slabs grow. Hopefully-sensible defaults have been chosen for each value.Unfortunately, WebGL 2 doesn't support the base vertex feature, which is necessary to pack vertex arrays from different meshes into the same buffer.
wgpurepresents this restriction as the downlevel flagBASE_VERTEX. This patch detects that bit and ensures that all vertex buffers get dedicated slabs on that platform. Even on WebGL 2, though, we can combine all index arrays into single buffers to reduce buffer changes, and we do so.The following measurements are on Bistro:
Overall frame time improves from 8.74 ms to 5.53 ms (1.58x speedup):

Render system time improves from 6.57 ms to 3.54 ms (1.86x speedup):

Opaque pass time improves from 4.64 ms to 2.33 ms (1.99x speedup):

Migration Guide
Changed
GpuMeshhas been renamed toRenderMesh, to reflect the fact that it no longer directly stores handles to GPU objects.GpuMesh(now calledRenderMesh) to theMeshAllocatorresource. To access the vertex data for a mesh, useMeshAllocator::mesh_vertex_slice. To access the index data for a mesh, useMeshAllocator::mesh_index_slice.