Skip to content

Conversation

@pcwalton
Copy link
Contributor

@pcwalton pcwalton commented Jul 9, 2024

This commit uses the offset-allocator crate to combine vertex and index arrays from different meshes into single buffers. Since the primary source of wgpu overhead is from validation and synchronization when switching buffers, this significantly improves Bevy's rendering performance on many scenes.

This patch is a more flexible version of #13218, which also used slabs. Unlike #13218, which used slabs of a fixed size, this commit implements slabs that start small and can grow. In addition to reducing memory usage, supporting slab growth reduces the number of vertex and index buffer switches that need to happen during rendering, leading to improved performance. To prevent pathological fragmentation behavior, slabs are capped to a maximum size, and mesh arrays that are too large get their own dedicated slabs.

As an additional improvement over #13218, this commit allows the application to customize all allocator heuristics. The MeshAllocatorSettings resource contains values that adjust the minimum and maximum slab sizes, the cutoff point at which meshes get their own dedicated slabs, and the rate at which slabs grow. Hopefully-sensible defaults have been chosen for each value.

Unfortunately, WebGL 2 doesn't support the base vertex feature, which is necessary to pack vertex arrays from different meshes into the same buffer. wgpu represents this restriction as the downlevel flag BASE_VERTEX. This patch detects that bit and ensures that all vertex buffers get dedicated slabs on that platform. Even on WebGL 2, though, we can combine all index arrays into single buffers to reduce buffer changes, and we do so.

The following measurements are on Bistro:

Overall frame time improves from 8.74 ms to 5.53 ms (1.58x speedup):
Screenshot 2024-07-09 163521

Render system time improves from 6.57 ms to 3.54 ms (1.86x speedup):
Screenshot 2024-07-09 163559

Opaque pass time improves from 4.64 ms to 2.33 ms (1.99x speedup):
Screenshot 2024-07-09 163536

Migration Guide

Changed

  • Vertex and index buffers for meshes may now be packed alongside other buffers, for performance.
  • GpuMesh has been renamed to RenderMesh, to reflect the fact that it no longer directly stores handles to GPU objects.
  • Because meshes no longer have their own vertex and index buffers, the responsibility for the buffers has moved from GpuMesh (now called RenderMesh) to the MeshAllocator resource. To access the vertex data for a mesh, use MeshAllocator::mesh_vertex_slice. To access the index data for a mesh, use MeshAllocator::mesh_index_slice.

@pcwalton pcwalton requested review from Elabajaba and IceSentry July 9, 2024 23:41
@pcwalton pcwalton added this to the 0.15 milestone Jul 9, 2024
@pcwalton pcwalton added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jul 9, 2024
@pcwalton pcwalton force-pushed the growable-uberbuffers branch from 067cd9e to 6403a60 Compare July 9, 2024 23:43
This commit uses the [`offset-allocator`] crate to combine vertex and
index arrays from different meshes into single buffers. Since the
primary source of `wgpu` overhead is from validation and synchronization
when switching buffers, this significantly improves Bevy's rendering
performance on many scenes.

This patch is a more flexible version of bevyengine#13218, which also used slabs.
Unlike bevyengine#13218, which used slabs of a fixed size, this commit implements
slabs that start small and can grow. In addition to reducing memory
usage, supporting slab growth reduces the number of vertex and index
buffer switches that need to happen during rendering, leading to
improved performance. To prevent pathological fragmentation behavior,
slabs are capped to a maximum size, and mesh arrays that are too large
get their own dedicated slabs.

As an additional improvement over bevyengine#13218, this commit allows the
application to customize all allocator heuristics. The
`MeshAllocatorSettings` resource contains values that adjust the minimum
and maximum slab sizes, the cutoff point at which meshes get their own
dedicated slabs, and the rate at which slabs grow. Hopefully-sensible
defaults have been chosen for each value.

Unfortunately, WebGL 2 doesn't support the *base vertex* feature, which
is necessary to pack vertex arrays from different meshes into the same
buffer. `wgpu` represents this restriction as the downlevel flag
`BASE_VERTEX`. This patch detects that bit and ensures that all vertex
buffers get dedicated slabs on that platform. Even on WebGL 2, though,
we can combine all *index* arrays into single buffers to reduce buffer
changes, and we do so.
@pcwalton pcwalton force-pushed the growable-uberbuffers branch from 6403a60 to 6c17a07 Compare July 9, 2024 23:48
Copy link
Contributor

@JMS55 JMS55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great docs as always. Looks generally good. If there any bugs or perf overhead from the allocator itself, we can iron them out later. I'm not confident that separate slabs for large buffers are needed, but we can try it out, I don't feel strongly about removing them either. Bevy really needs an example/benchmark for loading and unloading multiple assets in the background during gameplay.

The one thing I'm concerned about is that GpuMesh seems to be in an akward spot now, given that it doesn't actually hold any mesh data anymore. Not sure what we want to do with it now, I'll let @superdump weigh in.

I'd like to see:

  • Perf test on WebGL2 to see how much of a hit the allocator logic is vs main
  • A bit more detailed migration guide, specifically mentioning the GpuMesh change and how it now lives in MeshAllocator. Please mention the full types, not just the concepts.

Render,
allocate_and_free_meshes
.in_set(RenderSet::PrepareAssets)
.before(prepare_assets::<GpuMesh>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why before here, and not after? This stands out as very weird to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because prepare_assets clears the contents of the ExtractedAssets<RenderMesh> resource after processing it. Because allocate_and_free_meshes needs to use that data, it has to run first. (We don't want to merge allocate_and_free_meshes with RenderMesh::prepare_asset because, among other reasons, that would result in a lot of useless copying when a bunch of meshes load all at the same time and the buffer grows multiple times in a single frame, which happens a lot.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct that this this system runs on a 1-frame delay then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The order in a single frame is this:

  1. extract_render_asset populates ExtractedAssets<RenderMesh> during the extract phase.
  2. allocate_and_free_meshes examines the ExtractedAssets<RenderMesh> to perform the allocations during the asset prep phase.
  3. prepare_asset processes the ExtractedAssets<RenderMesh> again, and clears it out.

So extract_render_asset produces the data, allocate_and_free_meshes inspects the data, and finally prepare_asset consumes the data. This pipeline happens entirely during the frame.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, thanks for the explanation. This makes sense to me, and mimics how I have it setup in my own code.

@BD103 BD103 added the M-Release-Note Work that should be called out in the blog due to impact label Jul 11, 2024
@BD103
Copy link
Member

BD103 commented Jul 11, 2024

The original PR was marked for release notes, so I'm going to add it here too!

@pcwalton
Copy link
Contributor Author

I've addressed all the review comments as much as I feel is appropriate. From running the examples on WebGL the performance seems fine. I can't do very in-depth performance analysis there without better diagnostics, and I don't think we should block this PR on better WebGL diagnostics.

@pcwalton pcwalton requested a review from JMS55 July 14, 2024 22:53
Copy link
Member

@tychedelia tychedelia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for the excellent documentation. Two small comments:

  • It might be nice to have some logging to warn when a mesh spills over into the large object slab if it's near the boundary.
  • We should follow up to rename the other RenderAsset e.g GpuImage -> `RenderImage for consistency per the changes here.

@NthTensor
Copy link
Contributor

The issue with the bistro interior seems to have been resolved. I can replicate the opaque pass improvements. FPS did not improve in tests of low-end hardware (I am totally gpu bottlenecked on most scenes), but it also doesn't regress which is the main thing I was concerned about.

@NthTensor NthTensor added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jul 16, 2024
@alice-i-cecile alice-i-cecile added this pull request to the merge queue Jul 16, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 16, 2024
@pcwalton
Copy link
Contributor Author

The merge fallout is fixed. This is ready to be merged again.

@alice-i-cecile alice-i-cecile added this pull request to the merge queue Jul 16, 2024
Merged via the queue into bevyengine:main with commit bc34216 Jul 16, 2024
pcwalton added a commit to pcwalton/bevy that referenced this pull request Jul 18, 2024
The "uberbuffers" PR bevyengine#14257 caused some examples to fail intermittently
for different reasons:

1. `morph_targets` could fail because vertex displacements for morph
   targets are keyed off the vertex index. With buffer packing, the
   vertex index can vary based on the position in the buffer, which
   caused the morph targets to be potentially incorrect. The solution is
   to include the first vertex index with the `MeshUniform` (and
   `MeshInputUniform` if GPU preprocessing is in use), so that the
   shader can calculate the true vertex index before performing the
   morph operation. This results in wasted space in `MeshUniform`, which
   is unfortunate, but we'll soon be filling in the padding with the ID
   of the material when bindless textures land, so this had to happen
   sooner or later anyhow.

   Including the vertex index in the `MeshInputUniform` caused an
   ordering problem. The `MeshInputUniform` was created during the
   extraction phase, before the allocations occurred, so the extraction
   logic didn't know where the mesh vertex data was going to end up. The
   solution is to move the `MeshInputUniform` creation (the
   `collect_meshes_for_gpu_building` system) to after the allocations
   phase. This should be better for parallelism anyhow, because it
   allows the extraction phase to finish quicker. It's also something
   we'll have to do for bindless in any event.

2. The `lines` and `fog_volumes` examples could fail because their
   custom drawing nodes weren't updated to supply the vertex and index
   offsets in their `draw_indexed` and `draw` calls. This commit fixes
   this oversight.

Fixes bevyengine#14366.
github-merge-queue bot pushed a commit that referenced this pull request Jul 22, 2024
The "uberbuffers" PR #14257 caused some examples to fail intermittently
for different reasons:

1. `morph_targets` could fail because vertex displacements for morph
targets are keyed off the vertex index. With buffer packing, the vertex
index can vary based on the position in the buffer, which caused the
morph targets to be potentially incorrect. The solution is to include
the first vertex index with the `MeshUniform` (and `MeshInputUniform` if
GPU preprocessing is in use), so that the shader can calculate the true
vertex index before performing the morph operation. This results in
wasted space in `MeshUniform`, which is unfortunate, but we'll soon be
filling in the padding with the ID of the material when bindless
textures land, so this had to happen sooner or later anyhow.

Including the vertex index in the `MeshInputUniform` caused an ordering
problem. The `MeshInputUniform` was created during the extraction phase,
before the allocations occurred, so the extraction logic didn't know
where the mesh vertex data was going to end up. The solution is to move
the `MeshInputUniform` creation (the `collect_meshes_for_gpu_building`
system) to after the allocations phase. This should be better for
parallelism anyhow, because it allows the extraction phase to finish
quicker. It's also something we'll have to do for bindless in any event.

2. The `lines` and `fog_volumes` examples could fail because their
custom drawing nodes weren't updated to supply the vertex and index
offsets in their `draw_indexed` and `draw` calls. This commit fixes this
oversight.

Fixes #14366.
@alice-i-cecile
Copy link
Member

Thank you to everyone involved with the authoring or reviewing of this PR! This work is relatively important and needs release notes! Head over to bevyengine/bevy-website#1662 if you'd like to help out.

bas-ie added a commit to bas-ie/bevy_ecs_tilemap that referenced this pull request Oct 24, 2024
bas-ie added a commit to bas-ie/bevy_ecs_tilemap that referenced this pull request Oct 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times M-Release-Note Work that should be called out in the blog due to impact S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants