
[Feature request] WebGPU support #73

Closed · loretoparisi opened this issue Apr 7, 2023 · 15 comments · Fixed by #545
Labels: enhancement (New feature or request)

Comments

@loretoparisi commented Apr 7, 2023

WebGPU

Chrome shipped WebGPU today in Chrome 113 Beta.

Reason for request

WebGPU is currently a work in progress in Firefox and Safari, in addition to the Chrome beta. TensorFlow.js also already supports WebGPU for several operators.

Additional context
It's worth noting that Google's project Dawn, a native C++ WebGPU implementation, will support Node.js soon. WIP Node bindings are here.

@loretoparisi added the enhancement (New feature or request) label Apr 7, 2023
@xenova (Collaborator) commented Apr 7, 2023

Thanks for the resources :) For the most part, we are waiting for onnxruntime-web to add webgpu as a supported backend.

Here is the associated PR to track its progress:

However, we do plan to support other model formats/backends (in a similar way to how the Python library supports PyTorch, TensorFlow, and ONNX). I don't want to spoil anything... but things are in the works 😉

@gabrielgrant

AFAIU ORT has merged WebGPU support: microsoft/onnxruntime#11695

What's needed to take advantage of this on the transformers.js side?

@sroussey

For reference, here are the WebGPU operators implemented so far:

https://github.com/microsoft/onnxruntime/blob/main/js/web/docs/webgpu-operators.md

@gabrielgrant

Unfortunately the WebGPU implementation is currently slower than the WASM version, though: microsoft/onnxruntime#18754 (comment)

It would be great to know what's needed to support WebGPU in transformers.js, assuming that perf issue gets resolved at some point, but it's not super urgent/important at the moment.

@DavidGOrtega (Contributor)

Unfortunately the WebGPU implementation is currently slower than the WASM version,

I have some models running on JSEP WebGPU and they are 10 times faster than WASM, e.g. CLIP.

To me, the main problem is the current backend design: it's global (as far as I know). We should be able to set the preferred backend per model.
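For illustration, here is a minimal sketch of the contrast being described: today, backend settings live on the global `env` object, while a per-model choice would look something like passing a device/backend option when constructing each pipeline. The `device` option below is an assumption reflecting the direction discussed here, not the released v2 API.

```js
import { pipeline, env } from '@xenova/transformers';

// Today (v2): backend settings are global, e.g. this WASM thread count applies to every model.
env.backends.onnx.wasm.numThreads = 4;

// Hypothetical per-model selection (assumed option, sketch only):
// each pipeline picks its own execution backend instead of a single global default.
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  device: 'webgpu', // assumption: per-model device/backend option
});
const classifier = await pipeline('text-classification', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english', {
  device: 'wasm',   // another model stays on the WASM backend
});
```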

@gabrielgrant

@DavidGOrtega that's great news! To be clear, are you running your models directly on ORT, or using JSEP through transformers.js somehow? I'd love to hear more details about exactly what your setup looks like, and which other models you've found this perf improvement on!

@DavidGOrtega (Contributor) commented Jan 16, 2024

I'm running them with vanilla ONNX Runtime.

I can do a PR to support WebGPU here (I did the Node one); it's trivial. However, I think we should rethink the backend a bit to be more flexible, so we can choose the backend and options per model. Also, the ONNX fallback is not perfect, i.e. I have models where the session can be loaded but inference does not work, and that's a step after the ONNX fallback...

@xenova can also do WebGPU and is testing it among other backends like Candle. Probably not done yet just because not all the models support WebGPU?
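For reference, a minimal sketch of the "vanilla ONNX Runtime" setup described above, using onnxruntime-web's WebGPU (JSEP) execution provider. The model URL, input name, and shape are placeholders (e.g. a CLIP image encoder), and depending on the onnxruntime-web version the WebGPU-enabled build may need to be imported from a dedicated bundle.

```js
// Sketch: run a model directly with onnxruntime-web on the WebGPU (JSEP) backend.
// 'model.onnx' and the 'pixel_values' input name are placeholders.
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webgpu'],
});

// Dummy input tensor; real code would feed preprocessed pixel values or token ids.
const input = new ort.Tensor('float32', new Float32Array(1 * 3 * 224 * 224), [1, 3, 224, 224]);
const outputs = await session.run({ pixel_values: input });
console.log(outputs);
```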

@xenova linked a pull request on Jan 27, 2024 that will close this issue
@luweigen commented Feb 5, 2024

Unfortunately the WebGPU implementation is currently slower than the WASM version,

I have some models running on JSEP WebGPU and they are 10 times faster than WASM, e.g. CLIP.

To me, the main problem is the current backend design: it's global (as far as I know). We should be able to set the preferred backend per model.

@DavidGOrtega What model can you run?

I tried some BERT models and got "cannot resolve operator 'Erf' with opsets: ai.onnx v11" when calling https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/esm/ort.min.js directly with the model weights cached by transformers.js 2.14.2.

@luweigen commented Feb 5, 2024

I also tried the v3 branch of transformers.js and got a syntax error. It seems that commit 66da130 was overwritten by 8c465a9. A simple fix (28f666d) leads to other errors. It seems there's still a long way to go?

@xenova (Collaborator) commented Feb 5, 2024

@luweigen the v3 branch is still a work in progress, and will be marked as non-draft when ready for testing 👍

@beaufortfrancois

@xenova could you share with us what is currently blocking transformers.js from taking advantage of WebGPU?
I think we're all pretty excited to be able to try it and compare performance (CPU vs GPU). Thank you! ❤️

@luweigen commented Feb 19, 2024

@xenova could you share with us what is currently blocking transformers.js from taking advantage of WebGPU?
I think we're all pretty excited to be able to try it and compare performance (CPU vs GPU). Thank you! ❤️

I wrote a blog post on remixing transformers.js and ONNX Runtime WebGPU: https://medium.com/@GenerationAI/transformers-js-onnx-runtime-webgpu-46c3e58d547c
and a short comparison of CPU vs GPU: https://medium.com/@GenerationAI/performance-of-onnxruntime-webgpu-44a25d9897a9
Some functions are adapted from transformers.js to make it work, as mentioned in the code comments.
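For anyone curious, here is a rough sketch (under assumptions, not the blog's exact code) of that kind of remix: tokenize with transformers.js, then run the ONNX model directly with onnxruntime-web on WebGPU. The model file name and output name are placeholders, and some models also require token_type_ids.

```js
import { AutoTokenizer } from '@xenova/transformers';
import * as ort from 'onnxruntime-web/webgpu';

// Tokenize with transformers.js...
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/all-MiniLM-L6-v2');
const enc = await tokenizer('WebGPU makes in-browser inference fast.');

// ...then run the raw ONNX model with onnxruntime-web on the WebGPU backend.
const session = await ort.InferenceSession.create('model_quantized.onnx', {
  executionProviders: ['webgpu'],
});

// Convert transformers.js tensors to onnxruntime-web tensors
// (some models also need token_type_ids).
const feeds = {
  input_ids: new ort.Tensor('int64', enc.input_ids.data, enc.input_ids.dims),
  attention_mask: new ort.Tensor('int64', enc.attention_mask.data, enc.attention_mask.dims),
};
const outputs = await session.run(feeds);
console.log(outputs); // e.g. outputs.last_hidden_state, depending on the export
```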

@loretoparisi (Author)

@luweigen thanks for this post. The CPU to WebGPU comparison is fair, but not all the results are obvious. In the tests you report:

Execution time: 6169.100000ms
Batch Execution time: 23191.899999ms

WebGPU Execution time: 20445.0999994ms
WebGPU Batch Execution time: 2231 ms

Hence, for processing a batch of size ~100 you get a CPU/WebGPU ratio of ~10x, i.e. a clear WebGPU speedup.

But when the inference is just one sequence, the CPU/WebGPU ratio is ~0.3, i.e. CPU is ~3.3x faster than WebGPU, so it seems that offloading to the GPU is not that efficient with batch size = 1. So according to your tests with MiniLM, when does WebGPU become useful; in other words, for which batch size is the CPU/WebGPU ratio > 1?

@luweigen commented Feb 21, 2024

@luweigen thanks for this post. The CPU to WebGPU comparison is fair, but not all the results are obvious. In the tests you report:

Execution time: 6169.100000ms
Batch Execution time: 23191.899999ms

WebGPU Execution time: 20445.0999994ms
WebGPU Batch Execution time: 2231 ms

Hence, for processing a batch of size ~100 you get a CPU/WebGPU ratio of ~10x, i.e. a clear WebGPU speedup.

But when the inference is just one sequence, the CPU/WebGPU ratio is ~0.3, i.e. CPU is ~3.3x faster than WebGPU, so it seems that offloading to the GPU is not that efficient with batch size = 1. So according to your tests with MiniLM, when does WebGPU become useful; in other words, for which batch size is the CPU/WebGPU ratio > 1?

all-MiniLM-L6-v2 is very small, so the CPU can handle it well enough if the batch size is also small.
I guess with a larger model we would see the advantage of the GPU at small batch sizes too.
This was a very preliminary version of the code, so it's not shared on GitHub yet, but it will be, with more test results on other models and hyperparameters.
I/O binding to the GPU is not implemented yet, but I guess the overall improvement won't be very large.
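As an aside, a minimal sketch of how a single-sequence vs batched timing comparison like the one above could be measured, assuming `session` is an already-created onnxruntime-web session and `makeFeeds(batchSize)` is a hypothetical helper that builds the input tensors for a given batch size.

```js
// Sketch of a timing harness: average session.run() latency after a few warm-up runs
// (warm-up matters for WebGPU, where shader compilation skews the first iterations).
async function timeRun(session, feeds, warmup = 3, iters = 10) {
  for (let i = 0; i < warmup; i++) await session.run(feeds);
  const start = performance.now();
  for (let i = 0; i < iters; i++) await session.run(feeds);
  return (performance.now() - start) / iters;
}

const single = await timeRun(session, makeFeeds(1));    // batch size 1
const batched = await timeRun(session, makeFeeds(100)); // batch size ~100
console.log(`batch=1: ${single.toFixed(1)} ms, batch=100: ${batched.toFixed(1)} ms`);
```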

@josephrocca (Contributor) commented May 11, 2024

But when the inference is just one sequence, the CPU/WebGPU ratio is ~0.3, i.e. CPU is ~3.3x faster than WebGPU, so it seems that offloading to the GPU is not that efficient with batch size = 1.

FWIW, even with batch size = 1, I get a 5x speedup for the WebGPU backend on bge-base-en-v1.5 according to Xenova's excellent webgpu-embedding-benchmark. Note that this model is 109M params, i.e. about 5x larger than all-MiniLM-L6-v2, but it can still embed a couple of passages per second on my Android phone even with the Wasm backend, and is "only" ~100 MB 8-bit quantized (fine for my use case).

5x is certainly worth it for me! Really looking forward to the WebGPU backend stabilizing (and hoping the Chrome team gets Linux WebGPU sorted soon 🤞 - also, it looks like Safari isn't too far from a decent/stable WebGPU release, surprisingly).
