
Support CUDA on Julia 1.9+ via a package extension. #32

Merged (9 commits) on Oct 17, 2023

Conversation

GunnarFarneback
Collaborator

@GunnarFarneback GunnarFarneback commented Oct 11, 2023

This PR adds a package extension for CUDA on Julia 1.9+, with support for CUDA.jl versions 4 and 5. It does not change anything for older Julia, where CUDA support remains implemented through the Requires package. It drops support for Julia < 1.9 and CUDA.jl < 4.

The file test/test_cuda_extension.jl, not included from runtests.jl, tries to verify that the CUDA extension works for a variety of Julia/CUDA.jl/CUDA runtime versions. Possibly it should use a more demanding ONNX file, in case that affects the set of needed libraries.

Fixes #27.
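For readers unfamiliar with the mechanism, a Julia 1.9+ package extension of this kind is declared in Project.toml roughly as follows. This is a sketch assembled from the package names and UUIDs appearing later in this thread, not the repository's actual file:

```toml
# Weak dependencies: recorded here but only loaded if the user's
# environment loads them itself.
[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd"

# The extension module (ext/CUDAExt.jl) is loaded automatically
# once both weak dependencies are present.
[extensions]
CUDAExt = ["CUDA", "cuDNN"]

[compat]
CUDA = "4, 5"
julia = "1.9"
```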

@codecov-commenter

codecov-commenter commented Oct 12, 2023

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (dfb67d6) 89.60% compared to head (bd6a35b) 88.18%.


Additional details and impacted files
@@            Coverage Diff             @@
##             main      #32      +/-   ##
==========================================
- Coverage   89.60%   88.18%   -1.42%     
==========================================
  Files           3        4       +1     
  Lines         375      381       +6     
==========================================
  Hits          336      336              
- Misses         39       45       +6     
Files               Coverage Δ
src/ONNXRunTime.jl  100.00% <ø> (ø)
ext/CUDAExt.jl      0.00% <0.00%> (ø)
src/highlevel.jl    84.52% <0.00%> (-6.51%) ⬇️

... and 1 file with indirect coverage changes


@GunnarFarneback
Collaborator Author

GunnarFarneback commented Oct 12, 2023

The tests fail on Julia 1.9 and later because the test environment specifies CUDA 3 and the package extension has a CUDA 4, 5 compat. There are some different ways forward but it depends on the support ambition for Julia and CUDA versions.

  • Most radical is to make a breaking release and drop support for CUDA < 4 and Julia < 1.9. This allows all the Requires stuff to be removed.
  • Adding support for CUDA 3 in the package extension seems difficult since the cuDNN package isn't compatible with any CUDA version < 4. Possibly a second extension that only depends on CUDA could be implemented, with the logic for handling different CUDA versions sorted out there.
  • Adding support for CUDA 4+ in the Requires approach doesn't seem impossible but on the other hand nothing much has happened in CUDA 4.0 compatibility #27 over time. It should be easier if CUDA 3 support is dropped at the same time.
  • To keep using Requires in Julia 1.9+ is not advisable since it doesn't play nicely at all with PrecompileTools.
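For contrast with the extension approach, the Requires-based conditional loading that the last bullet advises against looks roughly like this (a sketch of the general pattern, not this package's actual code; the included file name is hypothetical):

```julia
# Pre-1.9 style conditional loading with Requires.jl. The @require body is
# evaluated at runtime when CUDA is first loaded, so none of it can be
# precompiled — which is why it interacts badly with PrecompileTools.
using Requires

function __init__()
    @require CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba" begin
        include("cuda_support.jl")  # hypothetical file with the CUDA-specific code
    end
end
```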

@jw3126
Owner

jw3126 commented Oct 12, 2023

I think we should just drop pre 1.9 julia support and also drop old CUDA.jl.

@GunnarFarneback
Collaborator Author

Done in second commit. I need to do some tests on my real use case and it needs README updates and a breaking version bump, either to 0.4 or 1.0. I would recommend taking the opportunity to jump to 1.0, regardless of package maturity, but that's your call.

Review comments on src/highlevel.jl and test/test_cuda_extension.jl (outdated, resolved)
Comment on lines 1 to 6
# This file is not included from `runtests.jl` nor run in CI.
#
# Run it with `julia tests/test_cuda_extension.jl`. This requires that
# Julia is installed with juliaup and will involve downloading of a
# lot of big artifacts. The output will contain lots of error messages
# from caught errors; what matters is that all testsets pass.
Owner


This file looks pretty useful, thanks!

@jw3126
Owner

jw3126 commented Oct 12, 2023

Done in second commit. I need to do some tests on my real use case and it needs README updates and a breaking version bump, either to 0.4 or 1.0. I would recommend taking the opportunity to jump to 1.0, regardless of package maturity, but that's your call.

Thanks, 1.0 is a good idea!

@GunnarFarneback
Collaborator Author

From my point of view this is good to go. The CI failure on nightly is a CUDA.jl issue; it doesn't load in isolation either.

Oh, there's a docs directory as well. Clearly index.md needs some of the updates from README.md. Should some of the new material only go into one of the files?

@jw3126
Owner

jw3126 commented Oct 12, 2023

From my point of view this is good to go. The CI failure on nightly is a CUDA.jl issue; it doesn't load in isolation either.

Oh, there's a docs directory as well. Clearly index.md needs some of the updates from README.md. Should some of the new material only go into one of the files?

In theory that should be automatic

cp(joinpath(@__DIR__, "..", "README.md"), joinpath(@__DIR__, "src", "index.md"), force=true)

Of course feel free to do any other additions to the docs, but it is not required from my side.

@GunnarFarneback
Collaborator Author

Ok, then I'm happy with it as it is.

@jw3126
Owner

jw3126 commented Oct 12, 2023

I tried to play with your branch locally. I set up an environment with

]dev ONNXRunTime
]add CUDA cuDNN
import CUDA; CUDA.set_runtime_version!(v"11.8")
exit()

However, when I do import CUDA I get

 Downloading artifact: CUDA_Runtime
    Downloading [==================================>      ]  82.6 %

And after that hits 99.1% it just starts again at 0% in an infinite loop. Any ideas?

@jw3126
Owner

jw3126 commented Oct 12, 2023

With CUDA runtime version v12.2 I don't get stuck in an infinite loop, but I get the following:

julia> include("test/runtests.jl")
[ Info: Precompiling ONNXRunTime [e034b28e-924e-41b2-b98f-d2bbeb830c6a]
Test Summary: | Pass  Total  Time
high level    |  110    110  0.8s
Test Summary: | Pass  Total  Time
Session       |   25     25  0.3s
Test Summary:    | Pass  Total  Time
tensor roundtrip |    9      9  0.1s
[ Info: Precompiling cuDNN [02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd]
┌ Error: cuDNN is not available for your platform (x86_64-linux-gnu-libgfortran5-cxx11-libstdcxx30-cuda+none-julia_version+1.9.3)
└ @ cuDNN ~/.julia/packages/cuDNN/rSQYc/src/cuDNN.jl:172
[ Info: Precompiling CUDAExt [d886cdb6-706a-5e22-bdf1-b04a849cf272]
┌ Warning: The :cuda execution provider requires a CUDA runtime version of at least 11.8 but less than 12. See `CUDA.set_runtime_version!`.
└ @ ONNXRunTime ~/.julia/dev/ONNXRunTime/src/highlevel.jl:86
increment2x3.onnx: Error During Test at /home/jan/.julia/dev/ONNXRunTime/test/test_cuda.jl:13
  Got exception outside of a @test
  /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1131 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory

@GunnarFarneback
Collaborator Author

GunnarFarneback commented Oct 12, 2023

And after that hits 99.1% it just starts again at 0% in an infinite loop. Any ideas?

It's nothing I've run into but I'm trying it out in a clean depot. If it's reproducible I'm sure the Pkg developers would love a bug report.

Update: Failed to reproduce the problem.

With CUDA runtime version v12.2 I don't get stuck in an infinite loop, but I get the following:

This is expected since you don't get v11 .so files with CUDA runtime v12, and ONNXRunTime tries to load v11.

@GunnarFarneback
Collaborator Author

And after that hits 99.1% it just starts again at 0% in an infinite loop. Any ideas?

Could it be a network or package server issue? Kind of sounds like the artifact transmission gets interrupted and something decides to start over. Might be worth trying to switch to another package server or disable it, to see if something changes. Or just try again and hope it was a transient problem.

@GunnarFarneback
Collaborator Author

With CUDA runtime version v12.2 I don't get stuck in an infinite loop, but I get the following:

This is expected since you don't get v11 .so files with CUDA runtime v12, and ONNXRunTime tries to load v11.

Ok, let's make this stricter. The latest commit gives an error outside the officially supported range of CUDA runtime versions. It also centralizes the version data and adds consistency checks within the package.
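The centralized version check described here can be sketched as a small pure function. The bounds below are assumptions taken from the warning message quoted earlier in the thread (at least 11.8, less than 12), not the package's actual version table:

```julia
# Sketch of a centralized supported-range check for the CUDA runtime.
# Bounds are illustrative, taken from the warning quoted above.
const CUDA_RUNTIME_LO = v"11.8"
const CUDA_RUNTIME_HI = v"12"   # exclusive upper bound

is_supported_cuda_runtime(v::VersionNumber) = CUDA_RUNTIME_LO <= v < CUDA_RUNTIME_HI
```

An out-of-range runtime (e.g. v"12.2") can then raise a hard error instead of the earlier warning.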

@stemann

stemann commented Oct 13, 2023

What’s the rush with 1.0?

What would be reasonable, achievable goals for a first not-supposed-to-need-breaking changes version?

@stemann

stemann commented Oct 13, 2023

Will restricting CUDA support to v11+ (via restrictions on CUDA.jl) rule out support for ONNXRuntime on Jetson Nano devices, which only support CUDA 10.2?

It might be that Nvidia has released a newer version with Jetson Orin Nano (at 5x the price), but I think dropping support for a quite popular runtime platform from a quite popular runtime package sounds like a bad idea.

@GunnarFarneback
Collaborator Author

The main point of going to 1.0 is to get all three digits of semantic versioning and to be able to distinguish between new features and bugfixes.

Does the current version work with Jetson Nano?

@jw3126
Owner

jw3126 commented Oct 14, 2023

And after that hits 99.1% it just starts again at 0% in an infinite loop. Any ideas?

Could it be a network or package server issue? Kind of sounds like the artifact transmission gets interrupted and something decides to start over. Might be worth trying to switch to another package server or disable it, to see if something changes. Or just try again and hope it was a transient problem.

The issue disappeared, but now I get with CUDA_Runtime_jll v11.8:

  Got exception outside of a @test
  /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1131 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcudnn.so.8: cannot open shared object file: No such file or directory

@jw3126
Owner

jw3126 commented Oct 14, 2023

If I try to import it directly, I get:

julia> import cuDNN
┌ Error: cuDNN is not available for your platform (x86_64-linux-gnu-libgfortran5-cxx11-libstdcxx30-cuda+none-julia_version+1.9.3)

CUDA itself seems to work with v11.8:

julia> import CUDA; CUDA.randn(3)
3-element CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
  0.34151646
 -1.027569
  0.46423724

@GunnarFarneback
Collaborator Author

GunnarFarneback commented Oct 14, 2023

My best guess is that your artifacts are messed up. I've seen similar things at work occasionally and then it has turned out that there is some empty artifact directory, possibly a result of some Pkg operation being interrupted at a bad time. Just removing the empty directory has solved the problem since Pkg in that case automatically downloads the missing artifact. A more brutal approach is to just delete all of .julia/artifacts and reinstantiate everything.

A more sophisticated approach is to try to find the exact artifact where libcudnn.so.8 should be located. With Julia 1.9.3 on Linux, doing

$ mkdir /tmp/testenv
$ julia --project=/tmp/testenv
pkg> add CUDA cuDNN
pkg> status
julia> import CUDA
julia> CUDA.set_runtime_version!(v"11.8")
julia> exit()
$ julia --project=/tmp/testenv
julia> import CUDA, cuDNN
julia> cuDNN.libcudnn
pkg> status

I get

julia> import CUDA, cuDNN

julia> cuDNN.libcudnn
"/home/gunnar/.julia/artifacts/330d3b99898bb6899242acecc7fefe96ce279acc/lib/libcudnn.so"

(testenv) pkg> status
Status `/tmp/testenv/Project.toml`
  [052768ef] CUDA v5.0.0
  [02a925ec] cuDNN v1.2.0

@jw3126
Owner

jw3126 commented Oct 15, 2023

@GunnarFarneback Thanks a lot, my artifacts are now fixed and this works for me locally.
@stemann I have no idea whether the current version runs on Jetson Nano; I would guess you are the only person who has even dared to run it in that setup. Would this PR make life harder for you? Are you opposed to parts of this PR, or only to the 1.0 tag? It would be super cool if someday we could use proper artifacts and no longer depend on CUDA.jl.
In my view, tagging 1.0 does not mean we are opposed to supporting Jetson Nano or any other platform.
This package has not had breaking changes for a long time and I don't foresee any after this PR (except bumping the CUDA.jl compat from time to time), so in my view 1.0 is appropriate.

@stemann

stemann commented Oct 15, 2023

The current version of ONNXRunTime.jl does not (I believe) run on the Jetson Nano, but ONNXRuntime does and the ONNXRuntime JLL should - that was a primary goal of JuliaPackaging/Yggdrasil#4386 (#12) and #19.

All that has been left (for ages, I agree!) is, I believe, to implement a similar approach to the CUDARuntime JLL - using e.g. Preferences.jl to select among the execution providers (and hence the artifacts) at Pkg.add-time. Please feel free to comment e.g. in #19 or #12.

I agree that compatibility with recent CUDA.jl versions should be an important goal for this package (I believe that is what this PR achieves…?).

But it would be nice if there were still room to get the Jetson Nano etc. support completed. Preferably without having to go to a 2.x version of this package, I would say.

So yes, I guess my main concern with this is bumping the version to 1.0.

P.S. Inspired by this PR, I raced to the office to get the old Jetson Nano fired up and ready for testing - hoping to find some time soon for pitching in…

@GunnarFarneback
Collaborator Author

The main point of this PR is to make the package play well with Julia 1.9+ precompilation and PrecompileTools, which is accomplished by replacing Requires with a package extension. Making it compatible with recent CUDA.jl is more of a bonus since it was easier to do that than to stay with the old version.

There's no harm done in going to major version 2, or more, if it turns out that further breaking changes are needed.

@jw3126
Owner

jw3126 commented Oct 16, 2023

So yes, I guess my main concern with this is bumping the version to 1.0.

P.S. Inspired by this PR, I raced to the office to get the old Jetson Nano fired up and ready for testing - hoping to find some time soon for pitching in…

Cool! How about this? We merge this PR with v0.4 and release v1 on December 1st. If you find the time to add artifact support before then, @stemann, that is awesome.
If not, it's also no big deal IMO to release a v2 later.

@stemann

stemann commented Oct 16, 2023

The no-later-than-December plan sounds great.

@stemann

stemann commented Oct 16, 2023

Have you considered using https://github.com/cjdoris/PackageExtensionCompat.jl to keep 1.6 compatibility?

Review comment on Project.toml (outdated, resolved)
@jw3126
Owner

jw3126 commented Oct 16, 2023

Have you considered using https://github.com/cjdoris/PackageExtensionCompat.jl to keep 1.6 compatibility?

CUDA v4 would be compatible with Julia 1.6; CUDA v5 needs Julia 1.8. Personally I would keep it simple and just drop 1.6 compat, but if anyone wants to add it using PackageExtensionCompat, I am also fine with that.

@GunnarFarneback
Collaborator Author

I did some git commit rewriting to change the version bump to 0.4.

@GunnarFarneback
Collaborator Author

Have you considered using https://github.com/cjdoris/PackageExtensionCompat.jl to keep 1.6 compatibility?

It's not entirely that simple since it doesn't help with the Base.get_extension-based diagnostics.
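For context, the Base.get_extension-based diagnostics mentioned here query at runtime whether the extension module actually loaded, roughly like this (a hypothetical sketch; the function name and error text are illustrative, not the package's actual code):

```julia
# Base.get_extension(pkg, :ExtName) returns the extension module, or `nothing`
# if the weak dependencies have not been loaded. The function only exists on
# Julia 1.9+, so a pre-1.9 compatibility shim does not cover this part.
function cuda_extension_or_error(pkg::Module)
    ext = Base.get_extension(pkg, :CUDAExt)
    ext === nothing && error("The :cuda execution provider requires `import CUDA, cuDNN`.")
    return ext
end
```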

Review comment on README.md (outdated, resolved)
Co-authored-by: Jan Weidner <[email protected]>
Owner

@jw3126 jw3126 left a comment


LGTM, thanks a lot this is a great PR! If there are no further objections, I will merge today.

@GunnarFarneback
Collaborator Author

Remember to register as well. I'd love to get back to a registered version instead of depending on this branch.

@jw3126 jw3126 merged commit 4688a1a into jw3126:main Oct 17, 2023
7 of 12 checks passed
@GunnarFarneback GunnarFarneback deleted the cuda_package_extension branch October 18, 2023 14:06
@jw3126 jw3126 mentioned this pull request Dec 28, 2023

Successfully merging this pull request may close these issues.

CUDA 4.0 compatibility
4 participants