[Web] generative decoders are slower than they should be #18754

Open
guschmue opened this issue Dec 8, 2023 · 2 comments
Labels
ep:WebGPU ort-web webgpu provider model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. platform:web issues related to ONNX Runtime web; typically submitted using template

Comments

@guschmue
Contributor

guschmue commented Dec 8, 2023

Describe the issue

Running generative decoders (e.g. t5-small, whisper) via webgpu is slower than wasm, even though plenty of GPU cycles are available (the GPU is 15% busy).
We know kernel times look good and cross-device copies look good.
Even with io-binding it is still slower than wasm.

To reproduce

https://github.com/guschmue/ort-web-perf/blob/master/ort-t5.html

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

main

Execution Provider

'webgpu' (WebGPU)

@guschmue guschmue added platform:web issues related to ONNX Runtime web; typically submitted using template ep:WebGPU ort-web webgpu provider labels Dec 8, 2023
@github-actions github-actions bot added the model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. label Dec 8, 2023
@guschmue guschmue self-assigned this Dec 8, 2023
@lxfater

lxfater commented Dec 14, 2023

What causes this problem?

@qjia7
Contributor

qjia7 commented Dec 18, 2023

I think it is likely that GPU buffers are not being reused efficiently. For each decoder run, many buffers are allocated dynamically instead of reusing existing buffers. I see https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/webgpu/gpu-data-manager.ts#L267 is called many times for each inference. The current GPU buffer reuse strategy is not friendly to dynamic models: the input shapes change on every run, so the required buffer sizes change too, and buffers from the previous inference can't be reused because the reuse strategy requires an exact size match. We may need to change the reuse strategy to reduce dynamic buffer allocation and see whether perf improves.
Another issue is that I still see data downloaded from GPU to CPU several times during each inference, even when I choose webgpu + io binding. We need to make sure there is no unnecessary data readback during inference.
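
To make the reuse problem concrete, here is a minimal simulation sketch. It is not the onnxruntime code: `BufferPool`, `exactKey`, `bucketKey` and `simulateDecode` are hypothetical names, no real WebGPU calls are made, and the power-of-two bucketing with a 256-byte floor is just one assumed alternative strategy. It counts how many fresh allocations a pool needs when the requested size grows by one token per decoding step.

```typescript
// Hypothetical simulation only: sizes are plain numbers (bytes), no GPU involved.
type SizeKey = (bytes: number) => number;

// Exact-match strategy: a freed buffer is reused only for an identical size.
const exactKey: SizeKey = (bytes) => bytes;

// Assumed alternative: round each request up to the next power of two
// (with an arbitrary 256-byte floor), so nearby sizes share a bucket.
const bucketKey: SizeKey = (bytes) => {
  let size = 256;
  while (size < bytes) size *= 2;
  return size;
};

class BufferPool {
  private free = new Map<number, number>(); // bucket key -> free buffer count
  private keyOf: SizeKey;
  allocations = 0;

  constructor(keyOf: SizeKey) {
    this.keyOf = keyOf;
  }

  // Returns the key of the buffer handed out; allocates if none is free.
  acquire(bytes: number): number {
    const key = this.keyOf(bytes);
    const available = this.free.get(key) ?? 0;
    if (available > 0) {
      this.free.set(key, available - 1);
    } else {
      this.allocations++; // stands in for a real buffer creation
    }
    return key;
  }

  // Hands a buffer back to the pool so a later request may reuse it.
  release(key: number): void {
    this.free.set(key, (this.free.get(key) ?? 0) + 1);
  }
}

// One buffer per step whose size grows with the sequence length,
// released at the end of the step (as after an inference run).
function simulateDecode(pool: BufferPool, steps: number, bytesPerToken: number): number {
  for (let step = 1; step <= steps; step++) {
    const key = pool.acquire(step * bytesPerToken);
    pool.release(key);
  }
  return pool.allocations;
}

const exactAllocations = simulateDecode(new BufferPool(exactKey), 64, 512);
const bucketAllocations = simulateDecode(new BufferPool(bucketKey), 64, 512);
console.log({ exactAllocations, bucketAllocations });
```

Under these assumptions (64 steps, 512 bytes per token), the exact-match pool allocates on every step (64 allocations) because no two steps request the same size, while the bucketed pool allocates 7 times. The trade-off is memory: a bucketed buffer can be up to roughly twice as large as the request, so whether this helps in practice would need to be measured.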
