Renderer optimization tracking issue #12590
Labels: A-Rendering (Drawing game state to the screen), C-Feature (A new feature, making something new possible), C-Performance (A change motivated by improving speed, memory usage or compile times), C-Tracking-Issue (An issue that collects information about a broad development initiative)
Comments
superdump added the C-Feature, A-Rendering, and C-Performance labels on Mar 20, 2024.
SolarLiner added the C-Tracking-Issue label on Mar 20, 2024.
Other things to add:
pcwalton added a commit to pcwalton/bevy that referenced this issue on Mar 23, 2024:
This commit introduces a new component, `GpuCulling`, which, when present on a camera, skips the CPU visibility check in favor of doing the frustum culling on the GPU. This trades off potentially-increased CPU work and drawcalls in favor of cheaper culling and doesn't improve the performance of any workloads that I know of today. However, it opens the door to significant optimizations in the future by taking the necessary first step toward *GPU-driven rendering*.

Enabling GPU culling for a view puts the rendering for that view into *indirect mode*. In indirect mode, CPU-level visibility checks are skipped, and all visible entities are considered potentially visible. Bevy's batching logic still runs as usual, but it doesn't directly generate mesh instance indices. Instead, it generates *instance handles*, which are indices into an array of real instance indices. Before any rendering is done, for each view, a compute shader, `cull.wgsl`, maps instance handles to instance indices, discarding any instance handles that represent meshes that are outside the visible frustum. Draws are then done using the *indirect draw* feature of `wgpu`, which instructs the GPU to read the number of actual instances from the output of that compute shader.

Essentially, GPU culling works by adding a new level of indirection between the CPU's notion of instances (known as instance handles) and the GPU's notion of instances.

A new `--gpu-culling` flag has been added to the `many_foxes`, `many_cubes`, and `3d_shapes` examples.

Potential follow-ups include:

* Split up `RenderMeshInstances` into CPU-driven and GPU-driven parts. The former, which contain fields like the transform, won't be initialized at all when GPU culling is enabled. Instead, the transform will be directly written to the GPU in `extract_meshes`, like `extract_skins` does for joint matrices.
* Implement GPU culling for shadow maps. Following that, we can treat all cascades as one as far as the CPU is concerned, simply replaying the final draw commands with different view uniforms, which should reduce the CPU overhead considerably.
* Retain bins from frame to frame so that they don't have to be rebuilt. This is a longer-term project that will build on top of bevyengine#12453 and several of the tasks in bevyengine#12590, such as main-world pipeline specialization.
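The handle-to-index compaction that `cull.wgsl` performs can be sketched on the CPU. This is a minimal illustration, not Bevy's actual code: the `Plane` and `Instance` types are assumptions, and culling is simplified to a single sphere-vs-plane test rather than a full frustum.

```rust
// Sketch of the compaction done on the GPU by `cull.wgsl`: instance handles
// index an array of candidate instances; only the in-frustum ones are written
// to a compacted index buffer, and the resulting count is what the indirect
// draw reads as its instance count.
struct Plane {
    normal: [f32; 3],
    d: f32,
}

struct Instance {
    center: [f32; 3],
    radius: f32,
}

// Conservative sphere test against one plane (a real frustum uses six planes).
fn sphere_in_front(p: &Plane, i: &Instance) -> bool {
    let dist = p.normal[0] * i.center[0]
        + p.normal[1] * i.center[1]
        + p.normal[2] * i.center[2]
        + p.d;
    dist > -i.radius
}

/// Returns (compacted instance indices, instance count for the indirect draw).
fn cull(handles: &[u32], instances: &[Instance], plane: &Plane) -> (Vec<u32>, u32) {
    let mut indices = Vec::new();
    for &handle in handles {
        let inst = &instances[handle as usize];
        if sphere_in_front(plane, inst) {
            // In this simplified sketch the handle doubles as the instance index.
            indices.push(handle);
        }
    }
    let count = indices.len() as u32;
    (indices, count)
}

fn main() {
    let plane = Plane { normal: [0.0, 0.0, 1.0], d: 0.0 }; // keep z > 0
    let instances = vec![
        Instance { center: [0.0, 0.0, 5.0], radius: 1.0 },  // visible
        Instance { center: [0.0, 0.0, -5.0], radius: 1.0 }, // culled
        Instance { center: [0.0, 0.0, 0.5], radius: 1.0 },  // straddling: visible
    ];
    let (indices, count) = cull(&[0, 1, 2], &instances, &plane);
    assert_eq!(indices, vec![0, 2]);
    assert_eq!(count, 2);
}
```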
This may be supplanted on platforms where compute shaders are present by #12773.
teoxoy/encase#65 should make …
#12773 also effectively bypasses …
There are many ideas, branches, proofs of concept, PRs, and discussions around improving the performance of the main code paths, systems, and data structures for rendering entities with meshes and materials.
This tracking issue gives an overview of what has been considered, what is known or has been tried, what is almost ready but needs finishing off, what needs reviewing, and what has been merged. It should help to sequence work and avoid forgetting things.
Optimizations for general usage
`RenderAssets` and `RenderMaterials`
- `RenderMaterials` exists to work around limitations of the `RenderAssets` API
- `RenderMaterials` is duplicated across 3D, 2D, UI, gizmos, anywhere where there is a duplicate of the `Material` API abstraction
- This applies not only to `RenderAssets` but also `RenderMaterials` and all its duplicates
- Make `RenderAsset` be the target type (e.g. `GpuMesh` instead of `Mesh`). This removes the root cause that prevented reuse of `RenderAssets` for materials.
`PhaseItem`
- Each view has a `RenderPhase` which contains a `Vec<PhaseItem>`
- Each `PhaseItem` contains a `Range<u32>` (8 bytes) for the instance/batch range, and an `Option<NonMaxU32>` (4 bytes) for the dynamic offset in case of using a `BatchedUniformBuffer`. These take space in caches, and are more data to move around when sorting.
- `batch_and_prepare_render_phase` processes the `PhaseItem`s
- Use a `Vec` or `EntityHashMap` in `RenderPhase`, or separate components with similar data structures on the view, generic over the `PhaseItem` to allow there to be one per phase per view. The latter enables easier parallelism through ECS queries (`Arc<Mutex<T>>` members in `RenderPhase` would solve this too), but is perhaps a bit more awkward.
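The per-item cost noted above can be checked with a quick std-only snippet. `NonMaxU32` comes from the `nonmax` crate; `NonZeroU32` is used here as a stand-in because it benefits from the same niche optimization that keeps `Option<_>` at 4 bytes.

```rust
// Verify the sizes quoted for the PhaseItem fields.
use std::mem::size_of;
use std::num::NonZeroU32;
use std::ops::Range;

fn main() {
    assert_eq!(size_of::<Range<u32>>(), 8);         // instance/batch range
    assert_eq!(size_of::<Option<NonZeroU32>>(), 4); // dynamic offset, niche-packed
    assert_eq!(size_of::<Option<u32>>(), 8);        // what it would cost without the niche
    println!("12 bytes of hot per-item data just for these two fields");
}
```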
`MeshUniform` `GpuArrayBuffer` preparation from `batch_and_prepare_render_phase`
- `batch_and_prepare_render_phase` is a bottleneck
- `batch_and_prepare_render_phase` prepares the per-instance data buffer because, when using `BatchedUniformBuffer` on WebGL2 or wherever storage buffers are not available, batches can be broken by filling up a uniform buffer binding (16kB minimum guaranteed, 64kB on NVIDIA, can be more on others) such that a new dynamic offset binding has to be started.
- Move the `BatchedUniformBuffer` dynamic offset calculator and `GpuArrayBuffer` writing into a `prepare_render_phase` system that is separate from `batch_and_prepare_render_phase` and can be run in parallel with it. Rename `batch_and_prepare_render_phase` to `batch_render_phase`.
- `PostUpdate` / `PrepareAssets`: use `Asset` events to additionally identify entities that need re-specialization (if they use that asset)
- Use binned `RenderPhase`s for opaque passes (note that alpha mask is also opaque), including opaque pre and main passes, and shadow passes, e.g. a `HashMap<BinKey, Entity>`
`MeshUniform` inverse matrix calculation performance
- Use `ultraviolet` or another similar 'wide' SIMD crate to enable calculating many matrix inverses in parallel, instead of using 'vertical' SIMD like `glam` does and calculating for one at a time.
`TrackedRenderPass`
- `TrackedRenderPass` keeps track of draw state so that when draw commands (binding pipelines and buffers, updating dynamic offsets, etc) are issued to it, it can compare and see if anything changed, and if not, it can skip passing the call on to `wgpu`. This means information is being compared twice, both in batching and rendering.
- Instead, encode draws into a `Vec<u32>` with a protocol. The first `u32` is a bit field for a single draw command that contains bits indicating, for example, whether a pipeline needs to be rebound, a bind group needs rebinding, if there is an index/vertex buffer to be rebound, and the type of draw (indexed vs not, direct vs indirect, etc). Then the `u32`s that follow contain the ids or information needed to encode that draw.
- `TrackedRenderPass` is no longer needed in terms of checking whether something actually needs rebinding, because the draw stream only contains exactly what needs to be done.
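The `Vec<u32>` protocol can be sketched as follows. The flag values and fields are illustrative assumptions, not the actual Bevy encoding; the point is that only state that actually changed is emitted, so replay needs no comparisons.

```rust
// Encode draw commands into a flat u32 stream: one bit-field word, then only
// the ids whose bits are set, then the instance range.
const REBIND_PIPELINE: u32 = 1 << 0;
const REBIND_BIND_GROUP: u32 = 1 << 1;
const REBIND_VERTEX_BUFFER: u32 = 1 << 2;

fn encode_draw(
    stream: &mut Vec<u32>,
    pipeline: Option<u32>,
    bind_group: Option<u32>,
    vertex_buffer: Option<u32>,
    instance_range: (u32, u32),
) {
    let mut flags = 0;
    if pipeline.is_some() { flags |= REBIND_PIPELINE; }
    if bind_group.is_some() { flags |= REBIND_BIND_GROUP; }
    if vertex_buffer.is_some() { flags |= REBIND_VERTEX_BUFFER; }
    stream.push(flags);
    // Option<u32> iterates 0 or 1 items, so ids appear only when the bit is set.
    stream.extend(pipeline);
    stream.extend(bind_group);
    stream.extend(vertex_buffer);
    stream.push(instance_range.0);
    stream.push(instance_range.1);
}

fn main() {
    let mut stream = Vec::new();
    // First draw: bind everything.
    encode_draw(&mut stream, Some(42), Some(7), Some(3), (0, 64));
    // Second draw: same pipeline and buffers, only a new bind group.
    encode_draw(&mut stream, None, Some(8), None, (64, 80));
    assert_eq!(stream, vec![0b111, 42, 7, 3, 0, 64, 0b010, 8, 64, 80]);
}
```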
`MeshUniform` serialization
- `MeshUniform` serialization is a bottleneck
- `encase` performance is part of the problem
- Use `bytemuck` to bypass `encase`
- Write directly into `wgpu` staging buffers, avoiding memory copies
- Data is currently prepared into a `Vec<T: ShaderType>`, that is then serialised into a `Vec<u8>` using `encase`, that is then given to `wgpu`'s `Queue::write_buffer()` API. This results in making multiple copies, which costs performance.
- Use `wgpu`'s `write_buffer_with()` APIs. This allows requesting a mutable slice into an internal `wgpu` staging buffer. Serializing data directly into this mutable slice then avoids lots of unnecessary copies.
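The copy being avoided can be sketched without `wgpu` at all: the code below writes uniform data straight into a caller-provided byte slice, standing in for the mutable staging-buffer view that `write_buffer_with()` returns. The `MeshUniformLike` type is an illustrative stand-in, not the real `MeshUniform`.

```rust
// Serialize per-instance uniforms directly into a staging slice, skipping the
// intermediate Vec<T> -> Vec<u8> -> write_buffer round trip.
#[derive(Clone, Copy)]
struct MeshUniformLike {
    transform: [f32; 16], // stand-in for the real MeshUniform fields
}

fn write_into(staging: &mut [u8], uniforms: &[MeshUniformLike]) {
    for (i, u) in uniforms.iter().enumerate() {
        let offset = i * 64; // 16 f32s = 64 bytes per instance
        for (j, f) in u.transform.iter().enumerate() {
            let b = f.to_le_bytes();
            staging[offset + j * 4..offset + j * 4 + 4].copy_from_slice(&b);
        }
    }
}

fn main() {
    let uniforms = vec![MeshUniformLike { transform: [1.0; 16] }; 2];
    // Pretend this slice is the mutable view handed back by write_buffer_with().
    let mut staging = vec![0u8; 128];
    write_into(&mut staging, &uniforms);
    assert_eq!(&staging[0..4], &1.0f32.to_le_bytes());
    assert_eq!(&staging[64..68], &1.0f32.to_le_bytes());
}
```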
`MeshUniform` buffer preparation across `RenderPhase`s
- `batch_and_prepare_render_phase` uses `ResMut<GpuArrayBuffer<MeshUniform>>`, which means that these systems run serially as there is only one `GpuArrayBuffer<MeshUniform>` resource and only one system can mutate it at a time
- Use one `GpuArrayBuffer<MeshUniform>` per phase per view: have a `GpuArrayBuffer` per phase per view, and then prepare each phase for each view in parallel
`AsBindGroup`
- Use `AsBindGroup` to write material data into a `GpuArrayBuffer` per material type
`AssetId`
- `AssetId` is quite large due to having a UUID variant (16 bytes), which means slower hashing and worse cache hit rates (larger data uses more space in caches)
- Lookups of `AssetId`s are done in all those bottleneck systems, either directly, or just because of being part of `PhaseItem`s
- Represent `Asset`s as `Entity`s once `Uuid` is removed!
- `RenderAssets` could instead use a `SlotMap<PreparedAsset>` and a `HashMap<AssetId, SlotMapKey>`. A `SlotMapKey` is a `u32` + `u32` generational index, so 8 bytes.
- At extraction, the `SlotMapKey` is looked up for the `AssetId`, or the `AssetId` is extracted to a separate queue of assets to be prepared and the `SlotMapKey` is later backfilled into the extracted data type after preparation is complete
- Use the `SlotMapKey` to look up the `PreparedAsset`, which avoids hashing entirely, is more cache friendly, and means less data to be sorted
- ~~Split `Handle` and `AssetId` into structs containing `AssetIndex` and `Option<Uuid>`~~
- Turn `const Handle` into `const Uuid` and maintain a `HashMap<Uuid, Handle<T>>`; create the handles at runtime and insert into the map; look up from the map using the `const Uuid`
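The generational-index idea can be sketched minimally (this is not the `slotmap` crate's actual API): an 8-byte key replaces `AssetId` hashing on the hot path, and a bumped generation invalidates stale keys after removal.

```rust
// Minimal generational slot map: u32 index + u32 generation = 8-byte key.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct SlotMapKey {
    index: u32,
    generation: u32,
}

struct SlotMap<T> {
    slots: Vec<(u32, Option<T>)>, // (generation, value)
}

impl<T> SlotMap<T> {
    fn new() -> Self {
        Self { slots: Vec::new() }
    }

    fn insert(&mut self, value: T) -> SlotMapKey {
        // Reuse a vacated slot if one exists, otherwise grow.
        if let Some(i) = self.slots.iter().position(|(_, v)| v.is_none()) {
            self.slots[i].1 = Some(value);
            SlotMapKey { index: i as u32, generation: self.slots[i].0 }
        } else {
            self.slots.push((0, Some(value)));
            SlotMapKey { index: (self.slots.len() - 1) as u32, generation: 0 }
        }
    }

    fn remove(&mut self, key: SlotMapKey) {
        let slot = &mut self.slots[key.index as usize];
        if slot.0 == key.generation {
            slot.1 = None;
            slot.0 += 1; // bump generation so stale keys miss
        }
    }

    fn get(&self, key: SlotMapKey) -> Option<&T> {
        let slot = self.slots.get(key.index as usize)?;
        (slot.0 == key.generation).then(|| slot.1.as_ref()).flatten()
    }
}

fn main() {
    assert_eq!(std::mem::size_of::<SlotMapKey>(), 8);
    let mut assets = SlotMap::new();
    let key = assets.insert("prepared mesh");
    assert_eq!(assets.get(key), Some(&"prepared mesh"));
    assets.remove(key);
    assert_eq!(assets.get(key), None); // stale key: generation mismatch
}
```

Lookups by `SlotMapKey` are a bounds check plus a generation compare, with no hashing at all, which is the cache-friendliness win described above.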
Optimizations for specific use cases