
Added DML and CUDA provider support in onnxruntime-node #16050

Merged: 20 commits into microsoft:main on Aug 25, 2023

Conversation

@dakenf (Contributor) commented May 23, 2023

Description

I've added changes to support CUDA and DML (DML only on Windows; on other platforms requesting it will throw an error).

Motivation and Context

It fixes feature request #14127, which is tracked in #14529.

I was working on a Stable Diffusion implementation for Node.js, and it is very slow on CPU, so GPU support is essential.

Here is a working demo with a patched and precompiled version https://github.com/dakenf/stable-diffusion-nodejs
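For context on what the binding-level change boils down to, here is a minimal, hedged sketch of routing an execution-provider name from JS session options into ONNX Runtime. The wrapper function is hypothetical; the OrtSessionOptionsAppendExecutionProvider_CUDA/_DML calls are the provider-append entry points exposed by the ORT C API and DML provider-factory headers (include paths depend on the build layout).

```cpp
#include <stdexcept>
#include <string>
#include "onnxruntime_cxx_api.h"
#ifdef _WIN32
#include "dml_provider_factory.h"  // DML provider-factory header; path depends on the build layout
#endif

// Hypothetical helper: append the requested execution provider; "cpu" needs no append.
void AppendExecutionProvider(Ort::SessionOptions& options, const std::string& name) {
  if (name == "cuda") {
    // Requires the CUDA provider library plus CUDA/cuDNN to be present at runtime.
    Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_CUDA(options, /*device_id*/ 0));
  } else if (name == "dml") {
#ifdef _WIN32
    Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(options, /*device_id*/ 0));
#else
    throw std::runtime_error("The DirectML execution provider is only available on Windows.");
#endif
  }
}
```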

@dakenf (Contributor, Author) commented May 23, 2023

@microsoft-github-policy-service agree

@fs-eire (Contributor) commented May 23, 2023

@snnn do the existing tasks in Zip-*-Packaging-Pipeline generate DLLs for CUDA and DML?

@dakenf (Contributor, Author) commented May 23, 2023

@fs-eire it does not require the CUDA libs. It will just throw a JS error ("onnxruntime_providers_cuda.dll(so) failed to load") if they are not installed on the user's system when the "cuda" execution provider is used. But it does need a bundled DirectML.dll, because the one included with Windows is outdated. That DLL is quite small compared to the CUDA provider (11 MB vs. about 400 MB for CUDA), so I think having it work out of the box is worthwhile.

And to use CUDA the user will need to install CUDA and cuDNN anyway.

@snnn (Member) commented May 23, 2023

> @snnn do the existing tasks in Zip-*-Packaging-Pipeline generate DLLs for CUDA and DML?

Yes

@fs-eire (Contributor) commented May 25, 2023

I think the code part should be good. My concern is that we should also update our CI so that the published npm package onnxruntime-node contains the necessary DLLs for the CUDA/DML EPs on Windows.

In the current onnxruntime-node package, we have these files under /bin/napi-v3/win32/x64/:

  • onnxruntime.dll
  • onnxruntime_binding.node
  • onnxruntime_providers_shared.dll

In my understanding, we need to add the following files as well:

  • onnxruntime_providers_cuda.dll
  • DirectML.dll

And I am not sure whether we need a different version of onnxruntime.dll/onnxruntime_providers_shared.dll (is it the same for CPU/GPU?).

@snnn (Member) commented May 25, 2023

The DML EP is not a pluggable EP; it is built into onnxruntime.dll. And it is not compatible with the CUDA EP.

@fs-eire (Contributor) commented May 25, 2023

I didn't know that the DML EP is not compatible with the CUDA EP. Does this mean there is no way to build with both DML and CUDA support? If so, do we need to prepare two different onnxruntime.dll files under different folders to support DML and CUDA?

@dakenf (Contributor, Author) commented May 26, 2023

After your conversation I've run some more builds/tests on Windows and WSL2 (and realized that on Windows it did not even build correctly because of include paths, sorry).

So here are the key things:

  1. onnxruntime.dll(.so) must be built with CUDA on Linux and with CUDA+DML on Windows before bundling with the NPM package.
  2. Only DirectML.dll needs to be bundled with the npm package (libonnxruntime_providers_shared.dll/.so and libonnxruntime_providers_cuda.dll/.so can safely be omitted); the CPU provider will work without any errors.
  3. onnxruntime must be linked statically on Windows unless a workaround is found (explained below). Update: found a workaround for it.

Now the explanation:

  1. If you don't build with CUDA, it won't support it. Same with DML. It can be built with both of them on Windows.
  2. Lib dependencies table:

| Use case | Requirements |
| --- | --- |
| Windows/Linux CPU provider | Nothing changed; works as expected |
| Windows DML provider | A fresh version of DirectML.dll is required (which will be bundled). It is very hard for the user to update it themselves (you cannot write into System32) |
| Linux CUDA provider | By default it will throw "libonnxruntime_providers_shared.so: cannot open shared object file: No such file or directory", so the user must download the onnxruntime-linux-gpu libraries and put them somewhere like /usr/local/. The user needs to install about 2 GB of CUDA/cuDNN libs anyway, so that should not be a big deal. The CUDA provider lib is about 400 MB per platform, and bundling two versions (Windows and Linux) would cost 800 MB |
| Windows CUDA provider | I think this will be a very rare case, but same as above: CUDA/cuDNN must be installed first. Also, the onnxruntime-windows-gpu libs should be put in PATH or in node_modules\onnxruntime-node\bin\napi-v3\win32\x64 |

There are options to improve the user experience:
a. Check whether the CUDA libs are missing and show a user-friendly message with instructions (see the sketch after this comment)
b. Create a script like npx onnxruntime-node download-cuda-provider that downloads the provider for the user's architecture
c. Keep the script from point b (e.g. for Docker builds, so nothing needs to be downloaded at runtime) and also check/download when an InferenceSession with the 'cuda' provider is created

  3. The biggest issue, which I had hoped was fixed in the latest code: for some reason, dynamically linked onnxruntime always loads DirectML.dll from System32 (see "GPU - DirectML" setting is broken in the latest release locaal-ai/obs-backgroundremoval#272 (comment)). In summary, we cannot ask the user to create a .manifest file for node.js to override the DLL search paths, and I've tried SetDllDirectory and loading the lib directly from the Node.js module dir before initialization, but it did not help. So I propose a rather fragile method of linking it statically (you can see the manual linking in CMakeLists.txt lines 56-91).
     It could be more robust if the runtime were built without the "--build_shared_lib" flag (so the extra dependencies would not need to be linked), but that would require a separate build pipeline for the Windows node.js binding. However, if it is already being built in a separate one, then only a flag and the linked libs would need to change.

Let me know if you have a workaround for point 3, if you want me to remove the static linking for some testing, or if you have any other feedback. Thanks!

P.S. In the worst case this can ship with only the CUDA provider until the DirectML.dll loading issue is resolved, but having DML would simplify the user experience on Windows by an order of magnitude.
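As an illustration of option (a) above, here is a minimal, hedged sketch of probing for the CUDA provider library up front and printing actionable instructions instead of a raw loader error. The function names and message text are hypothetical, not what this PR implements.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#else
#include <dlfcn.h>
#endif

// Returns true if the dynamic loader can locate the CUDA provider library.
bool CudaProviderAvailable() {
#ifdef _WIN32
  if (HMODULE handle = LoadLibraryA("onnxruntime_providers_cuda.dll")) {
    FreeLibrary(handle);
    return true;
  }
#else
  if (void* handle = dlopen("libonnxruntime_providers_cuda.so", RTLD_LAZY)) {
    dlclose(handle);
    return true;
  }
#endif
  return false;
}

// Hypothetical check run before appending the 'cuda' execution provider.
void WarnIfCudaMissing() {
  if (!CudaProviderAvailable()) {
    std::fprintf(stderr,
                 "The 'cuda' execution provider was requested, but the CUDA provider library was not found.\n"
                 "Install CUDA/cuDNN and put the onnxruntime GPU libraries on the library search path.\n");
  }
}
```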

@fdwr (Contributor) commented May 26, 2023

@dakenf:

> For some reason, dynamically linked onnxruntime always loads DirectML.dll from system32

🤨 As a quick sanity check, I tried this minimal example with ORT 1.15.0 and DML 1.12.0, and it's definitely loading both onnxruntime.dll and the redist directml.dll (from the build folder rather than the older system32 version).

Though, I have all my DLLs and the .exe in the same directory, and maybe the issue here is that the .exe and plugin DLLs exist in different directories? Maybe LoadLibrary appears to favor the System32 path only because it fails to find DirectML.dll in the .exe path, and the system DirectML.dll is later in the search path. This makes me wonder what most other current customers using DML are doing, because this is the first I've heard of this challenge (unless everybody else just puts all the DLLs into the .exe path too, because they have complete control over their distribution and binary paths) 🤔.
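As a side note (not something in this PR), a quick way to confirm which DirectML.dll actually got mapped into the process is to ask the loader directly; a minimal sketch:

```cpp
#include <windows.h>
#include <cstdio>

// Prints the full path of the DirectML.dll that is currently loaded, if any,
// e.g. the System32 copy vs. the redistributable next to the executable.
void PrintLoadedDirectMLPath() {
  if (HMODULE module = GetModuleHandleW(L"DirectML.dll")) {
    wchar_t path[MAX_PATH];
    GetModuleFileNameW(module, path, MAX_PATH);
    std::wprintf(L"DirectML.dll loaded from: %ls\n", path);
  }
}
```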


@dakenf (Contributor, Author) commented May 26, 2023

> 🤨 As a quick sanity check, I tried this minimal example with ORT 1.15.0 and DML 1.12.0, and it's definitely loading both onnxruntime.dll and the redist directml.dll (from the build folder rather than the older system32 version).

Yeah, I guess the problem is that the Node.js interpreter lives somewhere in Program Files while the node binding lib is loaded from "node_modules/onnxruntime-node/.." in the target JS project directory. Same issue as with the OBS plugin I've linked. I've tried LoadLibrary and LoadLibraryEx with the exact path, but it did not help :(

Update: the node binding loads all libraries itself, as they are linked during the build, but I thought that if I loaded it manually before making any calls to ONNX Runtime, it would force the runtime to use the already loaded one.

@snnn (Member) commented May 26, 2023

When Node.js calls LoadLibraryEx, does it specify LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR?

@dakenf (Contributor, Author) commented May 26, 2023

@snnn

> When Node.js calls LoadLibraryEx, does it specify LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR?

It does not use LoadLibraryEx to load; I just tried loading manually to see whether it helps.

@fdwr I've found a workaround: link DirectML with the node binding AND call it with some dummy data before initializing ONNX Runtime:

```cpp
#ifdef _WIN32
  // This loads and calls DirectML.dll to force using the version from the binding directory.
  const IID MY_IID = { 0x12345678, 0x1234, 0x1234, { 0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0 } };
  DMLCreateDevice1(nullptr, DML_CREATE_DEVICE_FLAG_NONE, DML_FEATURE_LEVEL_4_0, MY_IID, nullptr);
#endif
  Ort::InitApi();
```

That way it loads DirectML.dll from the node binding folder, and the runtime uses the already loaded library even when linked dynamically. I had tried just linking it before, but since no functions were actually called it did not work (presumably the unused import gets dropped). Thanks for your time :)

@fdwr (Contributor) commented May 26, 2023

> @fdwr I've found a workaround: link DirectML with the node binding AND call it with some dummy data before initializing ONNX Runtime

@dakenf: Does calling LoadLibrary beforehand and holding the module handle not achieve the same? I can see why that work-around works (well, not entirely, because I haven't looked at how DELAYLOAD works under the hood), but passing a bogus GUID 🤨 and a nullptr D3D device will cause debug spew (not that an end user will see it, but any developer will see in the debug output: "A null D3D12 device was provided to DMLCreateDevice, which is invalid").
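A hedged sketch of that suggestion, with illustrative path handling and names (not code from this PR): pre-load the redistributable DirectML.dll by full path and keep the module handle alive, so that later resolution of "DirectML.dll" reuses the already-mapped module instead of the System32 copy.

```cpp
#include <windows.h>
#include <string>

static HMODULE g_directml = nullptr;

// bindingDir is assumed to be the directory that contains the .node binding
// and the bundled DirectML.dll.
void PreloadDirectML(const std::wstring& bindingDir) {
  const std::wstring path = bindingDir + L"\\DirectML.dll";
  // A full path plus LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR also lets the DLL's own
  // dependencies resolve from the same directory.
  g_directml = LoadLibraryExW(path.c_str(), nullptr, LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR);
  // Intentionally never freed: the handle pins the module for the process lifetime.
}
```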

@dakenf (Contributor, Author) commented May 26, 2023

@fdwr

Yes, you are right. It did not work before because I assumed

```cpp
// Returns the handle/path of the host executable (node.exe), not of this binding.
HMODULE hModule = GetModuleHandle(NULL);
GetModuleFileName(hModule, buffer, MAX_PATH);
```

would return the current module's path. But it was returning the path to the Node.js interpreter, as it should by design, so it was loading the library from System32.

Instead I should have used something like

```cpp
// Resolves the module that contains ExportFunction, i.e. the binding itself.
GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS |
                      GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
                  (LPCSTR)&ExportFunction, &hm);
```

I guess sometimes it's better to stop and come back with a fresh look.
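For reference, a minimal sketch of this approach (in the spirit of js/node/src/directml_load_helper.cc, with illustrative names rather than the exact file contents): resolve the directory of the binding module itself, then load the DirectML.dll sitting next to it before ONNX Runtime is initialized.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

void LoadDirectMLDll() {
  HMODULE module = nullptr;
  // Passing an address inside this function makes Windows return the module that
  // contains it (the .node binding), not the node.exe host executable.
  GetModuleHandleExA(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS |
                         GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
                     reinterpret_cast<LPCSTR>(&LoadDirectMLDll), &module);

  char path[MAX_PATH];
  const DWORD length = GetModuleFileNameA(module, path, MAX_PATH);

  // Replace "<binding dir>\onnxruntime_binding.node" with "<binding dir>\DirectML.dll".
  std::string dmlPath(path, length);
  dmlPath = dmlPath.substr(0, dmlPath.find_last_of('\\') + 1) + "DirectML.dll";

  // Once this copy is mapped, onnxruntime.dll reuses it even though it links
  // against DirectML.dll dynamically.
  LoadLibraryA(dmlPath.c_str());
}
#endif
```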

@fdwr (Contributor) left a review comment:
That sounds like a more robust approach. Added some comments, but still deferring to Yulong and Guenther for actual approval, since I do not own/know this code.

Review threads on js/node/src/directml_load_helper.cc (outdated, resolved)
@fdwr (Contributor) left a review comment:
Thanks Arthur. Minor comments - easier than my previous one. (still deferring to owners like Yulong and Guenther for actual sign-off)

Review threads on js/node/src/directml_load_helper.cc (resolved)
@dakenf (Contributor, Author) commented Jun 1, 2023

@fdwr can you also help me resolve this warning? It does not affect anything, but it is printed to the console when the DML provider is used:

```
[W:onnxruntime:, session_state.cc:1169 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
[W:onnxruntime:, session_state.cc:1171 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
```

Should I change the ORT memory allocator options when using the DML provider, or just ignore it?

@fdwr (Contributor) commented Jun 1, 2023

> @fdwr can you also help me resolve this warning? It does not affect anything, but it is printed to the console when the DML provider is used

What opset is your .onnx model? That warning means there is some operator which the DML EP doesn't support, or a newer version of an existing operator. It's a perf concern, because some operators are falling back to the CPU (incurring GPU<->CPU synchronization), but the model will still run. So I wouldn't worry about it for your change, but it would be useful to run `onnxruntime_perf_test.exe -v -I -e dml -r 1 yourmodel.onnx` to see which ops they are (it's in the console output spew).

@dakenf (Contributor, Author) commented Jun 1, 2023

> What opset is your .onnx model?

I've used the Python converter for Stable Diffusion: https://huggingface.co/aislamov/stable-diffusion-2-1-base-onnx

perf_test with the -v flag returned quite a lot of nodes (I haven't included the full output; it runs to many pages).

Review thread on js/node/CMakeLists.txt (outdated, resolved)
dakenf added 2 commits on August 25, 2023, including:

  • Fixed parsing sessionOptions when model passed as a Buffer
@dakenf (Contributor, Author) commented Aug 25, 2023

I've also fixed an error that prevented sessionOptions from being parsed when a session was created with the 4-argument overload.
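For context, the 4-argument case is presumably the create(buffer, byteOffset, byteLength, sessionOptions) overload from onnxruntime-common, where the options arrive as the fourth argument rather than the second. A hedged N-API sketch of the distinction (function and structure are illustrative, not the binding's actual code):

```cpp
#include <napi.h>

// Illustrative only: pick the right argument slot for sessionOptions depending on
// whether the call is (model, options) or (buffer, byteOffset, byteLength, options).
void ParseSessionOptionsArg(const Napi::CallbackInfo& info) {
  const size_t optionsIndex = (info.Length() == 4) ? 3 : 1;
  if (info.Length() > optionsIndex && info[optionsIndex].IsObject()) {
    Napi::Object options = info[optionsIndex].As<Napi::Object>();
    // ... parse executionProviders and other fields from `options`
    (void)options;
  }
}
```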

@snnn (Member) commented Aug 25, 2023

/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@snnn (Member) commented Aug 25, 2023

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 7 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@snnn merged commit c262879 into microsoft:main on Aug 25, 2023
@dakenf (Contributor, Author) commented Aug 26, 2023

Thanks everyone!

BTW @fdwr, DML gives quite meaningful error messages:

```
Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention' Status Message: C:\Users\me\projects\my\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2448)\onnxruntime.dll!00007FFFA64C38FC: (caller: 00007FFFA64C4A44) Exception(3) tid(8164) 80070057 The parameter is incorrect.
```

Anyway, still better than CUDA:

```
Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention' Status Message: packed QKV format is not implemented for current GPU. Please disable it in fusion options.
```

I'm actually joking, since these come from different test cases and most likely my test cases are not correct.

@fdwr (Contributor) commented Aug 26, 2023

> DML gives quite meaningful error messages

@dakenf: 😉 Not really, I agree, but the error messages are much more informative in the ORT debug build if you enable the Direct3D debug layer: Start / Run / dxcpl.exe. e.g.:


I'm not sure how it works within ORT Node, but for a C++ process, you can see the diagnostic output in Visual Studio's output subwindow.

@dakenf (Contributor, Author) commented Aug 26, 2023

Well, I guess there's a reasonable explanation. When I got the "Windows Internals" book about 10 years ago with an MS Action Pack (or some other free promotion, I don't exactly remember), I had quite a lot of "ah, that's why" moments.

In ORT Node you open a cmd and call node something.js, and something.js imports onnxruntime-node/dist/index.js, which loads the DLL and creates a JS bridge between the C++ code of the library and your JS code. So most likely there is a way to debug it that I'm not going to explore :D I've had quite enough with Xcode, figuring out why 64-bit WASM with threads failed to run.

snnn added a commit that referenced this pull request on Sep 7, 2023 (#17441):

### Description
The yaml file changes made in #16050 do not really work. Currently the
pipeline is failing with error:
```
Error: Not found SourceFolder: C:\a\_work\5\b\RelWithDebInfo\RelWithDebInfo\nuget-artifacts\onnxruntime-win-x64\lib
```

So, I will revert the yaml changes first to bring the pipeline back.
Some people are waiting for our nightly packages.

Test run:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=351104&view=results

@MountainAndMorning

Is CUDA support added to onnxruntime-node? I am using onnxruntime-node v1.6.0 in Electron. It seems that the CUDA and DirectML providers don't work.

@TareHimself

Same here

@MountainAndMorning

Since the npm package has not been updated yet, onnxruntime-node still only supports CPU. Is there any progress on the npm package update? @dakenf

[Screenshot 2023-10-09 16:33:00]

@gokaybiz

Still waiting for release!!

@wesbos commented Jan 25, 2024

Looks like none of the platforms have had a release yet. Any idea when this will happen?

@fs-eire (Contributor) commented Jan 25, 2024

> Looks like none of the platforms have had a release yet. Any idea when this will happen?

I have finished testing DML on Windows and am still pending on testing CUDA on Linux.

Once I get a dev package published I will update it here.

kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024
@xenova commented May 6, 2024

Any updates? :) @fs-eire

@fs-eire (Contributor) commented May 6, 2024

> Any updates? :) @fs-eire

Oh, I missed this thread. It should be working now as of version 1.17.3-rev.1. Please give it a try and let me know if you run into any issues.

siweic0 pushed a commit to siweic0/onnxruntime-web that referenced this pull request May 9, 2024