
V0.19 seems taking toooo long on preparing BindGroupLayout #5196

Closed
cryscan opened this issue Feb 5, 2024 · 9 comments
Labels
area: performance How fast things go

Comments

cryscan commented Feb 5, 2024

Description
I tried to upgrade web-rwkv, an LLM inference backend using compute shaders, to v0.19. However, after upgrading, I found that when running the model it gets slower and slower, and most of the time the GPU is idle. I suspect that internally the CPU side is waiting on something.

This does not happen in v0.18.

Repro steps
Upgrade web-rwkv to v0.19 without touching anything other than adding .into_iter() when selecting adapters (a sketch of that change is shown after the command below), then run the model via

$ cargo run -r --example chat -- -m .\assets\models\rwkv-x060-3b-world-v2-28%trained-20231208-ctx4k.st -t
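
For reference, a minimal sketch of the adapter-selection change mentioned above, assuming the wgpu v0.19 API where Instance::enumerate_adapters returns a Vec<Adapter> instead of an iterator (the pick_adapter helper is illustrative, not web-rwkv's actual code):

```rust
// Sketch only: wgpu 0.19's enumerate_adapters returns Vec<Adapter>,
// so iterator adapters now need an explicit .into_iter().
fn pick_adapter(instance: &wgpu::Instance) -> Option<wgpu::Adapter> {
    instance
        .enumerate_adapters(wgpu::Backends::all())
        .into_iter() // the only change needed for the upgrade here
        .find(|a| a.get_info().device_type == wgpu::DeviceType::DiscreteGpu)
}

fn main() {
    let instance = wgpu::Instance::default();
    match pick_adapter(&instance) {
        Some(adapter) => println!("picked: {:?}", adapter.get_info()),
        None => eprintln!("no discrete GPU adapter found"),
    }
}
```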

Expected vs observed behavior

  • Expected: the inference goes at a constant speed, since the model is an RNN.
  • Observed: it goes slower and slower, with GPU usage going down.

Extra materials
I captured a flamegraph (attached) and found that, compared to v0.18, v0.19 spends a lot of time in wgpu::ComputePipeline::get_bind_group_layout.

[flamegraph attached: v0.18]

[flamegraph attached: v0.19]

Platform

  • OS: Windows 10
  • WGPU: v0.19.1
  • GPU: NVIDIA RTX 3080, RTX 4090
nical (Contributor) commented Feb 5, 2024

@cwfitzgerald looks like it might have been caused by the bindgroup layout dedup refactor?

cwfitzgerald (Member) commented:

Double checked the trace - I think this is actually arcanization.

Based on the trace provided, wgpu_core::identity::IdentityValues::alloc seems to be calling max_by on a slice. This is what is taking the time. I think this is one of a few leaks combined with linear behavior during identity value allocation.
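
An illustrative sketch (not wgpu-core's actual code) of why a linear scan per ID allocation over an ever-growing, leaked list adds up to quadratic total cost:

```rust
// Toy model of an ID allocator that picks the next ID by scanning everything
// allocated so far (max), and never reclaims leaked IDs.
struct IdAllocator {
    live: Vec<u64>, // grows without bound if IDs leak
}

impl IdAllocator {
    fn alloc(&mut self) -> u64 {
        // O(len) scan on every call; with a growing `live`, N calls cost O(N^2) total.
        let next = self.live.iter().copied().max().map_or(0, |m| m + 1);
        self.live.push(next);
        next
    }
}

fn main() {
    let mut ids = IdAllocator { live: Vec::new() };
    // 10_000 allocations perform roughly 10_000^2 / 2 comparisons in total,
    // which is why each successive call gets slower than the last.
    for _ in 0..10_000 {
        ids.alloc();
    }
    println!("allocated {} ids", ids.live.len());
}
```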

cwfitzgerald (Member) commented:

For those hitting this problem, are you calling get_bind_group_layout every frame? To be clear, this is a bug on our side, but reducing calls to get_bind_group_layout should reduce the problem.

nathanielsimard commented:

@cwfitzgerald, we are indeed calling get_bind_group_layout not just at every frame, but for every compute kernel that we execute (can be thousands or more each second). See here: https://github.com/tracel-ai/burn/blob/3eab14160875ddaa1d0527247c09d6f37f8c75c7/burn-wgpu/src/compute/server.rs#L333

We still wait until we have many ComputePipeline instances before creating the compute pass and submitting work to the GPU. We are actually caching the ComputePipeline based on kernel id (compute shader id). Do you think there are any obvious improvements we should make to reduce CPU overhead and better utilize the GPU?
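
A rough sketch of that batching pattern, under assumed names (flush, the work list) rather than burn's actual server code: record every queued dispatch into a single compute pass and submit one command buffer, assuming wgpu 0.19's compute-pass API.

```rust
// Sketch: batch many dispatches into one compute pass and one submit,
// instead of creating a pass and submitting per kernel.
fn flush(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    work: &[(wgpu::ComputePipeline, wgpu::BindGroup, u32)],
) {
    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    {
        let mut pass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor {
            label: None,
            timestamp_writes: None,
        });
        for (pipeline, bind_group, workgroups) in work {
            pass.set_pipeline(pipeline);
            pass.set_bind_group(0, bind_group, &[]);
            pass.dispatch_workgroups(*workgroups, 1, 1);
        }
    }
    queue.submit(Some(encoder.finish()));
}
```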

cwfitzgerald (Member) commented:

If you cache the bind group layouts alongside the compute pipeline, the problem should mostly go away. Currently (with the bugs) I believe performance is ~O(n^2) where n = calls to get_bind_group_layout.
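
A minimal sketch of that workaround, assuming a kernel-id-keyed cache like the one described above (KernelId and PipelineCache are hypothetical names, not burn's actual types): call get_bind_group_layout once at pipeline creation and reuse the stored layout on every dispatch.

```rust
use std::collections::HashMap;

type KernelId = u64; // hypothetical key; burn keys on its own kernel/shader id

struct PipelineCache {
    entries: HashMap<KernelId, (wgpu::ComputePipeline, wgpu::BindGroupLayout)>,
}

impl PipelineCache {
    fn get_or_create(
        &mut self,
        device: &wgpu::Device,
        id: KernelId,
        module: &wgpu::ShaderModule,
    ) -> &(wgpu::ComputePipeline, wgpu::BindGroupLayout) {
        self.entries.entry(id).or_insert_with(|| {
            let pipeline = device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
                label: None,
                layout: None, // implicit layout, the case where get_bind_group_layout is needed
                module,
                entry_point: "main", // assumed entry point name
            });
            // Query the layout exactly once, here, instead of on every dispatch.
            let layout = pipeline.get_bind_group_layout(0);
            (pipeline, layout)
        })
    }
}
```

With the layout stored next to the pipeline, the per-dispatch path never calls get_bind_group_layout, so it no longer exercises the identity allocation discussed above.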

nathanielsimard commented:

Just to clarify: instead of calling get_bind_group_layout on the cached ComputePipeline each time we want to execute it with different buffers, we should cache the BindGroupLayout along with the ComputePipeline when we first create it, so the subsequent get_bind_group_layout calls are no longer needed. Is that correct?

I actually tried it, and it didn't impact the performance significantly: https://github.com/tracel-ai/burn/blob/57cc3ffe60f8526a218404d433373128c3b24f17/burn-wgpu/src/compute/server.rs#L344

cwfitzgerald (Member) commented:

Yeah, that is what I meant - that's a bit unexpected. Could you try using cargo flamegraph and uploading the generated flamegraph (like OP did)?

nathanielsimard commented:

[flamegraph attached: my_flamegraph]

This is profiled in the middle of a training run, since it needs to run for a few seconds before becoming slow.

cwfitzgerald (Member) commented:

This should be worked around in 0.19.2, and a full fix should land in 0.20. The leak still exists, but you shouldn't notice.
