[js/web] WebGPU backend via JSEP #14579

Merged
merged 90 commits into from
Apr 24, 2023

@fs-eire fs-eire commented Feb 4, 2023

Description

This change introduced the following new components into ONNX Runtime Web:

  • JavaScript Execution Provider (JSEP)
    • Asynchronous inference execution powered by Emscripten's Asyncify
  • WebGPU backend implemented in TypeScript
    • initial implementation of kernels:
      • elementwise operators (22)
      • binary operators (5)
      • tensor: Shape, Reshape, Transpose, Gemm
      • nn: Conv, {Global}MaxPool, {Global}AveragePool

The code still needs to be polished; work on it is ongoing.

Q&A

What is JSEP?

JSEP (JavaScript Execution Provider) is a new ONNX Runtime execution provider that works specifically in the web environment (browsers). JSEP allows JavaScript code to hook in at various points while ONNX Runtime runs inference on a model.

Why JSEP?

JSEP is a hybrid-mode EP that contains both C/C++ and TypeScript/JavaScript implementations. There are two strong reasons why we introduce JSEP:

  1. The C/C++ part lets JSEP leverage as many of ONNX Runtime's capabilities as possible, including graph transformers, optimizers, and the ability to fall back to the CPU EP. The TypeScript/JavaScript part makes kernel implementations much easier to develop and debug in the browser.
  2. Asynchronous JavaScript APIs (e.g. buffer.mapAsync()) make it impossible to run OrtRun() in a synchronous context (see the "async problem" section below). JSEP works around this by using Emscripten's Asyncify.

What is WebGPU?

WebGPU is the new GPU API available in browsers. It is one of only two APIs currently available for accessing the GPU from a browser (the other is WebGL).
WebGPU is designed with more advanced and more powerful features than WebGL, and is potentially the solution that offers the best GPU performance currently available for model inference.

What is the async problem, and why do we have it?

The "async problem" is that you cannot call an async function from a synchronous context. Consider the following C/C++ code:

// C-style declarations (API)
typedef void (*ON_COMPLETE)(PVOID state, DATA *data);
void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete);

// implementation
DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) {
  // how to implement?
}

The answer is that it is impossible to implement this function. Usually we would look for a synchronous version of the API, or launch a thread to call the async function and wait synchronously on the main thread. Unfortunately, neither is possible in a browser environment.

WebGPU does not offer any synchronous API for downloading data (GPU to CPU); this is the only operation that MUST be async. Because OrtRun() is a synchronous function that eventually calls into DataTransfer to copy data from GPU to CPU, this cannot be done in the normal way.
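
The same dead end can be reproduced in TypeScript. The sketch below is only an illustration: readDataAsync and readDataSyncAttempt are hypothetical names, not part of ONNX Runtime. A Promise callback never runs before the current synchronous code has finished, so the "sync" wrapper returns before the data exists:

```typescript
// Hypothetical async API: the result is only delivered through a callback,
// similar to read_data_from_file() above.
function readDataAsync(onComplete: (data: string) => void): void {
  // Promise callbacks run as microtasks, after the current call stack unwinds.
  Promise.resolve("payload").then(onComplete);
}

// Naive attempt to build a synchronous function on top of the async API.
function readDataSyncAttempt(): string | undefined {
  let result: string | undefined;
  readDataAsync((data) => {
    result = data;
  });
  // The callback has not run yet, so there is nothing to return.
  return result;
}

const attempt = readDataSyncAttempt(); // undefined
```

Busy-waiting inside readDataSyncAttempt() would not help either, because the callback can only run once the function has returned control to the event loop.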

What is Emscripten? How does the Asyncify feature resolve the problem?

Emscripten is a C/C++ compiler toolchain for WebAssembly. We use it to compile ORT and generate the WebAssembly artifacts that run in browsers.

Asyncify is a compiler feature (https://emscripten.org/docs/porting/asyncify.html) that allows calling async functions from a synchronous context. In short, it generates code that unwinds and rewinds the call stack to emulate async execution. With this feature, we are able to call the async function inside the OrtRun() call.
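
As a rough analogy only (this is not how Asyncify is implemented, and runLike and the "download" request are hypothetical names), a generator shows the same idea of suspending synchronous-looking code at one point and resuming it later with the awaited value:

```typescript
// Analogy only: the generator plays the role of the suspended call stack.
function* runLike(): Generator<string, number, number[]> {
  // Synchronous-looking code that needs GPU data in the middle of its work.
  const data: number[] = yield "download"; // suspension point ("unwind")
  // Execution continues here once the data is available ("rewind").
  return data.reduce((sum, x) => sum + x, 0);
}

const run = runLike();
const request = run.next();            // runs until the suspension point
const downloaded = [1, 2, 3];          // stands in for the result of an async download
const finished = run.next(downloaded); // resumes exactly where the function stopped
```

The difference is that Asyncify applies this transformation to compiled C/C++ code, so OrtRun() itself does not have to be rewritten as a generator or a coroutine.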

Design Overview

Inter-op

JSEP does pretty much the same things as any other EP. It exposes an interface for inter-op with JavaScript, which is defined in onnxruntime/wasm/js_internal_api.js:

// init JSEP
Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) {
    Module.jsepBackend = backend;
    Module.jsepAlloc = alloc;
    Module.jsepFree = free;
    Module.jsepCopy = copy;
    Module.jsepCopyAsync = copyAsync;
    Module.jsepCreateKernel = createKernel;
    Module.jsepReleaseKernel = releaseKernel;
    Module.jsepRun = run;
};

This simple JavaScript snippet defines all the language-boundary functions that JSEP requires in order to implement kernels and data transfers in JavaScript inside ONNX Runtime:

  • jsepBackend: assigns the singleton backend object to the WebAssembly module
  • jsepAlloc and jsepFree: implementations of the data transfer's Alloc() and Free()
  • jsepCopy: synchronous copy (GPU to GPU, CPU to GPU)
  • jsepCopyAsync: asynchronous copy (GPU to CPU)
  • jsepCreateKernel and jsepReleaseKernel: maintain a corresponding object in JS that matches the lifecycle of the Kernel in ORT
  • jsepRun: OpKernel::Compute() should call into this

This abstraction keeps the connections and dependencies between C/C++ and TypeScript/JavaScript to a minimum.
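
To make the division of labor concrete, here is a minimal sketch of a toy backend behind such callbacks. Only the callback names come from the snippet above; every signature, the JsepCallbacks interface, and simulateSession are assumptions made for illustration, and the copy callbacks are left out to keep it short.

```typescript
// Hypothetical, simplified signatures (assumptions, not the real ORT ones).
interface JsepCallbacks {
  alloc(size: number): number; // returns an opaque id for the new buffer
  free(dataId: number): void;
  createKernel(opType: string, kernelId: number): void;
  releaseKernel(kernelId: number): void;
  run(kernelId: number): number; // 0 means success in this sketch
}

// JavaScript side: a toy backend that only records what it is asked to do.
const callLog: string[] = [];
let nextDataId = 1;
const toyBackend: JsepCallbacks = {
  alloc: (size) => { callLog.push(`alloc(${size})`); return nextDataId++; },
  free: (dataId) => { callLog.push(`free(${dataId})`); },
  createKernel: (opType, kernelId) => { callLog.push(`createKernel(${opType}, ${kernelId})`); },
  releaseKernel: (kernelId) => { callLog.push(`releaseKernel(${kernelId})`); },
  run: (kernelId) => { callLog.push(`run(${kernelId})`); return 0; },
};

// Stand-in for the C/C++ side, which decides when each callback is invoked.
function simulateSession(jsep: JsepCallbacks): number {
  jsep.createKernel("Relu", 100);  // kernel constructed
  const outputId = jsep.alloc(16); // output tensor allocated
  const result = jsep.run(100);    // OpKernel::Compute()
  jsep.free(outputId);             // tensor released
  jsep.releaseKernel(100);         // kernel destroyed
  return result;
}

const sessionStatus = simulateSession(toyBackend);
```

The JavaScript side never decides when a buffer or kernel is created or destroyed; it only reacts to calls coming across the boundary.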

Resource Management

The lifecycle of tensor data and kernels is managed by ORT (C/C++), but the implementation is left to JavaScript. The JavaScript code is responsible for implementing the callbacks correctly.

For WebGPU, GPU data is managed by JavaScript using a singleton map (tensor_data_id => GPUBuffer). The GPU pipeline is managed as a singleton. Shaders are managed using a singleton map (shader_key => gpu_program), where shader_key is generated from cache_key (op-specific, including attributes) and the input shapes.
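
A minimal sketch of the shader map idea follows. The key format, the GpuProgram type, and the ProgramCache class are assumptions for illustration; the actual backend may build its keys differently.

```typescript
// Hypothetical stand-in for a compiled GPU program.
interface GpuProgram { source: string; }

// Assumed key format: the op-specific cache key followed by all input shapes.
function makeShaderKey(cacheKey: string, inputShapes: readonly (readonly number[])[]): string {
  const shapes = inputShapes.map((dims) => dims.join(",")).join(";");
  return `${cacheKey}|${shapes}`;
}

class ProgramCache {
  private readonly programs = new Map<string, GpuProgram>();
  compileCount = 0;

  getOrCreate(
    cacheKey: string,
    inputShapes: readonly (readonly number[])[],
    build: () => GpuProgram,
  ): GpuProgram {
    const key = makeShaderKey(cacheKey, inputShapes);
    let program = this.programs.get(key);
    if (program === undefined) {
      program = build(); // stands in for shader generation and compilation
      this.compileCount++;
      this.programs.set(key, program);
    }
    return program;
  }
}

const programCache = new ProgramCache();
const buildProgram = (): GpuProgram => ({ source: "/* shader source would go here */" });
const first = programCache.getOrCreate("LeakyRelu;alpha=0.1", [[2, 3]], buildProgram);
const second = programCache.getOrCreate("LeakyRelu;alpha=0.1", [[2, 3]], buildProgram); // same key: reused
const third = programCache.getOrCreate("LeakyRelu;alpha=0.1", [[4, 3]], buildProgram);  // new shape: rebuilt
```

Including the input shapes in the key means a program specialized for one shape is never reused for another, at the cost of one compilation per distinct shape.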

About data transfer

js::DataTransfer::CopyTensor is implemented to call either the synchronous or the asynchronous copy callback, depending on whether the destination is the GPU. Emscripten's macro EM_ASYNC_JS is used to wrap the async function so that it can be called from the synchronous context.
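
The dispatch rule can be written down in a few lines. This is a sketch in TypeScript rather than the C++ implementation; the Device and CopyPath types and the selectCopyPath function are hypothetical.

```typescript
type Device = "cpu" | "gpu";
type CopyPath = "jsepCopy" | "jsepCopyAsync";

// Mirrors the rule stated above: a GPU destination is handled by the
// synchronous callback, while a download to the CPU needs the async one.
function selectCopyPath(source: Device, destination: Device): CopyPath {
  if (source === "cpu" && destination === "cpu") {
    // Assumption: CPU-to-CPU copies are not routed through this data transfer.
    throw new Error("CPU to CPU copy is not handled here");
  }
  return destination === "gpu" ? "jsepCopy" : "jsepCopyAsync";
}

const upload = selectCopyPath("cpu", "gpu");   // "jsepCopy"
const onDevice = selectCopyPath("gpu", "gpu"); // "jsepCopy"
const download = selectCopyPath("gpu", "cpu"); // "jsepCopyAsync"
```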

Run kernel in JS

The Kernel class constructor calls jsepCreateKernel() once, with an optional per-kernel serialization that passes attributes into JavaScript.

Compute() is implemented so that metadata serialization is performed in a base class, and JavaScript code can access the data through the Emscripten-specific built-in macros EM_ASM_*.
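
On the JavaScript side, this create-once, run-many pattern could look roughly like the sketch below. The registry, the function signatures, and the plain number arrays used as tensors are all assumptions for illustration; the real backend operates on GPU buffers and serialized metadata.

```typescript
type KernelFn = (input: number[]) => number[];
type KernelFactory = (attributes: Record<string, number>) => KernelFn;

// Operator implementations known to this toy backend.
const factories = new Map<string, KernelFactory>();
factories.set("LeakyRelu", (attributes) => {
  const alpha = attributes["alpha"] ?? 0.01; // 0.01 is the ONNX default for LeakyRelu
  return (input) => input.map((x) => (x < 0 ? alpha * x : x));
});

// Live kernels, keyed by the id chosen on the C/C++ side.
const kernels = new Map<number, KernelFn>();

function createKernel(opType: string, kernelId: number, attributes: Record<string, number>): void {
  const factory = factories.get(opType);
  if (factory === undefined) throw new Error(`unsupported operator: ${opType}`);
  kernels.set(kernelId, factory(attributes)); // attributes are consumed once, at creation
}

function releaseKernel(kernelId: number): void {
  kernels.delete(kernelId);
}

function runKernel(kernelId: number, input: number[]): number[] {
  const kernel = kernels.get(kernelId);
  if (kernel === undefined) throw new Error(`unknown kernel id: ${kernelId}`);
  return kernel(input);
}

createKernel("LeakyRelu", 7, { alpha: 0.5 });
const output = runKernel(7, [-2, 3]); // [-1, 3]
releaseKernel(7);
```

Because attributes are passed only at creation time, each run call needs to carry nothing more than the kernel id and the per-run input and output information.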

Disabled features

  • Memory pattern is force-disabled, because WebGPU data is not represented by a general memory model (one in which a buffer can be described by offset + size).
  • Concurrent run support is disabled. WebGPU is stateful and also has async function calls. Supporting concurrent runs would significantly increase complexity without any real benefit.

Prefer channels last

JSEP prefers channels last and returns DataLayout::NHWC from GetPreferredLayout(). This lets the graph transformers preprocess the graph into channels-last form so that a more optimized WebGPU shader can be used.
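
For readers unfamiliar with the two layouts, the following sketch shows what "channels last" means for the flat data of a 4-D tensor. The helper nchwToNhwc is hypothetical and only illustrates the index mapping; in ORT the conversion is done by graph transformers, not by a function like this.

```typescript
// Converts a flat NCHW tensor into a flat NHWC tensor.
function nchwToNhwc(
  data: readonly number[],
  dims: readonly [number, number, number, number],
): number[] {
  const [n, c, h, w] = dims;
  const out = new Array<number>(data.length);
  for (let b = 0; b < n; b++) {
    for (let ch = 0; ch < c; ch++) {
      for (let y = 0; y < h; y++) {
        for (let x = 0; x < w; x++) {
          const src = ((b * c + ch) * h + y) * w + x; // offset in NCHW
          const dst = ((b * h + y) * w + x) * c + ch; // offset in NHWC
          out[dst] = data[src];
        }
      }
    }
  }
  return out;
}

// One image, two channels, 2x2 pixels: channel 0 is 0..3, channel 1 is 4..7.
const nhwc = nchwToNhwc([0, 1, 2, 3, 4, 5, 6, 7], [1, 2, 2, 2]);
// nhwc is [0, 4, 1, 5, 2, 6, 3, 7]: the channel values of each pixel are adjacent.
```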

Testing code

It is impossible to test JSEP directly, because JSEP itself does not contain any kernel implementation. However, it does contain the kernel registration, which needs to work together with the corresponding JavaScript code. There are unit tests that run ONNX models through the JavaScript API.

commit 340c88b
Author: Yulong Wang
Date:   Thu Sep 8 13:40:31 2022 -0700

    batch mode

commit b160840
Author: Yulong Wang
Date:   Tue Jul 26 17:00:39 2022 -0700

    sum

commit 306a19b
Author: Yulong Wang
Date:   Mon Jul 25 19:04:48 2022 -0700

    squeeze + transpose

commit 86d8d3a
Author: Yulong Wang
Date:   Mon Jul 18 16:31:59 2022 -0700

    fix webgpu test launch

commit e104d17
Author: Yulong Wang
Date:   Tue Jul 12 16:52:54 2022 -0700

    shape

commit a2197f0
Author: Yulong Wang
Date:   Tue Jul 12 13:49:15 2022 -0700

    pool

commit 59b10fb
Author: Yulong Wang
Date:   Thu Jul 7 17:32:56 2022 -0700

    upgrade to latest webgpu spec

commit 4ed1bfb
Author: Yulong Wang
Date:   Tue Jun 28 14:23:08 2022 -0700

    naive conv

commit 7c5e446
Author: Yulong Wang
Date:   Wed Jun 8 15:37:12 2022 -0700

    check webgpu backend in execution loop

commit b0d7dfa
Author: Yulong Wang
Date:   Wed Jun 8 15:31:19 2022 -0700

    dump shader source only in debug mode

commit 7fca0ea
Author: Yulong Wang
Date:   Wed Jun 8 15:17:27 2022 -0700

    add verbose log for buffer upload/download

commit 179712b
Author: Yulong Wang
Date:   Wed Jun 8 15:06:03 2022 -0700

    fix program key

commit 67ea4cb
Author: Yulong Wang
Date:   Wed Jun 8 15:05:20 2022 -0700

    concat: fix 1 input

commit 21b5dfe
Author: Yulong Wang
Date:   Tue Jun 7 16:13:12 2022 -0700

    matmul (no-broadcast)

commit a8def8e
Author: Yulong Wang
Date:   Thu Jun 2 17:56:15 2022 -0700

    ...

commit e871138
Author: Yulong Wang
Date:   Fri May 27 16:12:56 2022 -0700

    slice (scalar)

commit 75c7941
Author: Yulong Wang
Date:   Thu May 26 16:54:53 2022 -0700

    slice (...)

commit 40b15e4
Author: Yulong Wang
Date:   Thu May 26 12:45:16 2022 -0700

    slice

commit 9d92513
Author: Yulong Wang
Date:   Wed May 25 22:37:48 2022 -0700

    gemm (scalar)

commit c1185b4
Author: Yulong Wang
Date:   Tue May 24 16:54:43 2022 -0700

    gemm...

commit 99653f5
Author: Yulong Wang
Date:   Tue May 24 16:54:20 2022 -0700

    format code

commit 86c75bb
Author: Yulong Wang
Date:   Tue May 24 11:39:35 2022 -0700

    gemm

commit 79dd539
Author: Yulong Wang
Date:   Fri Apr 8 04:46:03 2022 -0700

    concat

commit 25c9d2a
Author: Yulong Wang
Date:   Thu Apr 7 19:32:48 2022 -0700

    gather

commit 6627349
Author: Yulong Wang
Date:   Thu Apr 7 18:46:53 2022 -0700

    binary ops

commit fb81d7f
Author: Yulong Wang
Date:   Wed Apr 6 17:55:07 2022 -0700

    binary - add

commit 073695f
Author: Yulong Wang
Date:   Wed Apr 6 17:54:24 2022 -0700

    optimize types

commit e9775fe
Author: Yulong Wang
Date:   Tue Apr 5 16:45:27 2022 -0700

    working

commit cba119c
Author: Yulong Wang
Date:   Tue Apr 5 15:10:26 2022 -0700

    upgrade @webgpu/[email protected]

commit ed17c57
Author: Yulong Wang
Date:   Tue Apr 5 03:37:29 2022 -0700

    neg

commit e8e4d88
Author: Yulong Wang
Date:   Mon Apr 4 16:28:52 2022 -0700

    other f32 unary operators

commit a1fbcfd
Author: Yulong Wang
Date:   Fri Apr 1 17:24:10 2022 -0700

    leaky relu

commit dbe57fe
Author: Yulong Wang
Date:   Fri Apr 1 17:09:27 2022 -0700

    exp, floor

commit 3b883b9
Author: Yulong Wang
Date:   Fri Apr 1 16:43:15 2022 -0700

    elu

commit aac2fc6
Author: Yulong Wang
Date:   Thu Mar 24 20:30:54 2022 -0700

    always create storage buffer with 16 bytes alignment

commit ad6bd01
Author: Yulong Wang
Date:   Thu Mar 24 20:30:07 2022 -0700

    fix unary funcs async signature

commit a782667
Author: Yulong Wang
Date:   Wed Mar 23 19:57:38 2022 -0700

    fix upload

commit b6e7fba
Author: Yulong Wang
Date:   Wed Mar 23 15:36:58 2022 -0700

    reshape

commit dfbf6f3
Author: Yulong Wang
Date:   Thu Mar 24 16:11:31 2022 -0700

    clip and ceil

commit 55af08e
Author: Yulong Wang
Date:   Thu Mar 24 15:57:58 2022 -0700

    fix clip

commit 41274ba
Author: Yulong Wang
Date:   Thu Mar 24 14:58:23 2022 -0700

    try more unary ops

commit fe850d1
Author: Yulong Wang
Date:   Mon Mar 14 16:15:58 2022 -0700

    first operator (correctness validated)

commit ba09337
Author: Yulong Wang
Date:   Fri Jan 28 17:50:56 2022 -0800

    enable initialization of webgpu

commit 3fb2712
Author: Yulong Wang
Date:   Fri Jan 28 17:50:24 2022 -0800

    install webgpu typescript type declaration

commit ed35262
Author: Yulong Wang
Date:   Fri Jan 28 14:53:50 2022 -0800

    [POC] __blank ( npm test -- -b=webgpu )
@fs-eire fs-eire merged commit 14cc02c into main Apr 24, 2023
@fs-eire fs-eire deleted the fs-eire/js-ep-pr branch April 24, 2023 22:21
fs-eire added a commit that referenced this pull request Apr 26, 2023
### Description
This PR resolves some of the non-critical code review comments on #14579.

- use `USE_JSEP` instead of `USE_JS` in build definition to make it less
ambiguous
- remove unused util functions from util.ts
- fix transpose.h
- other misc fixes
@redthing1

Can this be used to execute models with WebGPU on desktop?

@fs-eire
Contributor Author

fs-eire commented Jul 26, 2023

> Can this be used to execute models with WebGPU on desktop?

Not now, but it can probably be done via the Dawn Node.js binding in the future.

@guschmue
Contributor

To set expectations: I don't think there will be WebGPU support for native desktop apps anytime soon.
Electron apps might work.

@loretoparisi

> > Can this be used to execute models with WebGPU on desktop?
>
> not now, but probably can do via dawn nodejs binding in future

Hopefully something is moving on Google's Dawn side of the moon:
https://dawn.googlesource.com/dawn/+/refs/heads/main/src/dawn/node/

throw new Error('WebGpuBackend: WebGPU is not available.');
}

const adapter = await navigator.gpu.requestAdapter();

May I recommend passing powerPreference when requesting a GPU adapter so that developers can request which type of GPU they're looking for.

In huggingface/transformers.js#545 for instance, it would be preferable to test the "high-performance" GPU.

Suggested change
- const adapter = await navigator.gpu.requestAdapter();
+ const adapter = await navigator.gpu.requestAdapter({
+   powerPreference: 'high-performance'
+ });

That would be great! 🔥 It will also be useful if we can get the selected adapter, without having to re-request an adapter. cc @guschmue

Contributor Author

Thanks for the suggestion. Will think about this.

Contributor Author

#19857 was created to address this. Please take a look.

Per my understanding, on some OSes such as Windows, if the machine has multiple GPUs and the integrated GPU is the one chosen by Chrome at startup, simply setting powerPreference to high-performance will not force WebGPU to use the discrete GPU.

FYI for Windows, here's the bug: https://issues.chromium.org/issues/329211593

fs-eire added a commit that referenced this pull request Mar 13, 2024
### Description
This change exposes a few properties in `ort.env.webgpu` to address the
feature request mentioned in #14579 (comment).

- Add `powerPreference` and `forceFallbackAdapter` in `ort.env.webgpu`,
to allow users to set the value of the properties before the first
inference session is created.
- Add readonly property `adapter` in `ort.env.webgpu` to allow users to
get the adapter instance. Now users can access `ort.env.webgpu.device`
and `ort.env.webgpu.adapter`.
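
A minimal usage sketch, under stated assumptions: the object below is a local stand-in that mirrors only the property names listed above (WebGpuEnvFlags and webgpuFlags are hypothetical), so the snippet runs without the library. In an application the same assignments would be made on `ort.env.webgpu`.

```typescript
// Local stand-in mirroring the property names listed above; not the real
// onnxruntime-web type definitions.
interface WebGpuEnvFlags {
  powerPreference?: "low-power" | "high-performance";
  forceFallbackAdapter?: boolean;
  readonly adapter?: unknown; // provided by the runtime, not set by the user
}

const webgpuFlags: WebGpuEnvFlags = {};

// These must be set before the first inference session is created.
webgpuFlags.powerPreference = "high-performance";
webgpuFlags.forceFallbackAdapter = false;
```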

@xenova @beaufortfrancois