Fixed compat to enable tests on CI #60
ToucheSir merged 3 commits into FluxML:master from stemann:feature/buildkite_cuda_10.2
Conversation
The CI seems like it couldn't find the head of this PR; could we push to the branch and try to force it with a valid commit?
Right - strange... Where does one query/report Buildkite CI issues? On Slack in #ci-failures?
@ToucheSir I got a little persistent with getting the tests running (I had a false memory of there being more tests)... Anyway, the tests are being run now (with CUDA 10.2) - they are running using Flux v0.12 on Julia 1.9 (and using Flux v0.11 on Julia 1.6). I have almost zero experience with Flux: are there any obvious quick fixes to just get the tests passing? I changed the Flux testset to:

```julia
@testset "Flux" begin
    resnet = ResNet(18)
    tresnet = Flux.fmap(Torch.to_tensor, resnet.layers)

    ip = rand(Float32, 224, 224, 3, 1) # An RGB Image
    tip = tensor(ip, dev = 0)          # 0 => GPU:0 in Torch

    top = tresnet(tip)
    op = resnet.layers(ip)
    gs = gradient(() -> sum(tresnet(tip)), Flux.params(tresnet))

    @test top isa Tensor
    @test size(top) == size(op)
    @test gs isa Flux.Zygote.Grads
end
```
I'm not sure, but I suspect the very outdated compat for NNlib in Project.toml is holding Metalhead back, giving an older version which doesn't behave as expected.
LocalPreferences.toml (Outdated)
If I understand Preferences.jl correctly, we should not have this checked in. If we need to set the preference at a package level, it should be set in Project.toml.
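As a sketch of what this review comment suggests (assuming standard Preferences.jl semantics; the dependency name and key below are hypothetical, not from this PR), a package-level preference would move from a checked-in LocalPreferences.toml into a `[preferences]` table in the package's own Project.toml:

```toml
# Hypothetical example: instead of committing LocalPreferences.toml with
#   [SomeDep]
#   some_flag = true
# the same preference is recorded in the package's Project.toml:
[preferences.SomeDep]
some_flag = true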
Err... that thing was easily fixed ... eventually - by just not doing it completely wrong... (using …). With the current code, it now stumbles on a more challenging error in

```julia
@testset "Flux" begin
    resnet = ResNet18()
    tresnet = Flux.fmap(Torch.to_tensor, resnet.layers)

    ip = rand(Float32, 224, 224, 3, 1) # An RGB Image
    tip = tensor(ip, dev = 0)          # 0 => GPU:0 in Torch

    top = tresnet(tip)
    op = resnet.layers(ip)
    gs = gradient(() -> sum(tresnet(tip)), Flux.params(tresnet))

    @test top isa Tensor
    @test size(top) == size(op)
    @test gs isa Flux.Zygote.Grads
end
```

The call to

```julia
function NNlib.conv(x::Tensor{xT,N}, w::Tensor, b::Tensor{T},
                    cdims::DenseConvDims{M,K,C_in,C_out,S,P,D,F}) where {T,N,xT,M,K,C_in,C_out,S,P,D,F}
    op = conv2d(x, w, b, stride = collect(S), padding = [P[1]; P[3]], dilation = collect(D))
    op
end
```

hits a bounds error:

```
ERROR: BoundsError: attempt to access Tuple{Int64, Int64} at index [3]
Stacktrace:
 [1] getindex(t::Tuple, i::Int64)
   @ Base ./tuple.jl:29
 [2] macro expansion
   @ ./show.jl:1128 [inlined]
 [3] conv(x::Tensor{Float32, 4}, w::Tensor{Float32, 4}, b::Tensor{Float32, 1}, cdims::DenseConvDims{2, (7, 7), 3, 64, 1, (2, 2), (3, 3, 3, 3), (1, 1), false})
   @ Torch ~/jsa/Torch.jl/src/nnlib.jl:9
 [4] conv(x::Tensor{Float32, 4}, w::Tensor{Float32, 4}, cdims::DenseConvDims{2, (7, 7), 3, 64, 1, (2, 2), (3, 3, 3, 3), (1, 1), false})
   @ Torch ~/jsa/Torch.jl/src/nnlib.jl:15
 [5] (::Conv{2, 2, typeof(identity), Tensor{Float32, 4}, Tensor{Float32, 1}})(x::Tensor{Float32, 4})
   @ Flux ~/.julia/packages/Flux/BPPNj/src/layers/conv.jl:166
```

where:

```julia
typeof(x), size(x) # = (Tensor{Float32, 4}, (224, 224, 3, 1))
typeof(w), size(w) # = (Tensor{Float32, 4}, (7, 7, 3, 64))
typeof(b), size(b) # = (Tensor{Float32, 1}, (64,))
b # = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
cdims # = DenseConvDims: (224, 224, 3) * (7, 7) -> (112, 112, 64), stride: (2, 2), pad: (3, 3, 3, 3), dil: (1, 1), flip: false, groups: 1
S # = 1
P # = (2, 2)
D # = (3, 3, 3, 3)
```

@ToucheSir Any suggestion for what is failing here?
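For context, the mismatched values (`S = 1`, `P = (2, 2)`, `D = (3, 3, 3, 3)` where a stride of `(2, 2)`, padding of `(3, 3, 3, 3)` and dilation of `(1, 1)` were expected) are what destructuring `DenseConvDims` type parameters yields once their order changes between NNlib versions. A version-agnostic sketch (not this PR's fix; the PR pins NNlib instead) would read the geometry through NNlib's accessor functions rather than type parameters:

```julia
# Sketch: read conv geometry via NNlib's accessors, which are stable
# across the v0.7.25 reshuffle of DenseConvDims' type parameters.
using NNlib

x = rand(Float32, 224, 224, 3, 1)
w = rand(Float32, 7, 7, 3, 64)
cdims = DenseConvDims(x, w; stride = (2, 2), padding = (3, 3, 3, 3), dilation = (1, 1))

s = NNlib.stride(cdims)   # (2, 2)
p = NNlib.padding(cdims)  # (3, 3, 3, 3)
d = NNlib.dilation(cdims) # (1, 1)
```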
Just changing …
@DhairyaLGandhi Can you remember why

```julia
function Base.getindex(t::Tensor{T,N}, I::Vararg{Int,N}) where {T,N}
    # @show reverse!(collect(I)) .- 1, size(t)
    # at_double_value_at_indexes(t.ptr, reverse!(collect(I)) .- 1, N)
    zero(T)
end
```

It is needed for display of …
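For background (an illustration, not Torch.jl's actual code): Julia displays an `AbstractArray` by calling scalar `getindex` for each element, so a wrapper type without it cannot be printed at all. A minimal sketch of why a stub like the one above makes display work, at the cost of showing zeros:

```julia
# Minimal sketch: an AbstractArray wrapper whose scalar getindex returns
# zero(T), mirroring the Torch.jl stub quoted above.
struct FakeTensor{T,N} <: AbstractArray{T,N}
    dims::NTuple{N,Int}
end

Base.size(t::FakeTensor) = t.dims
Base.getindex(t::FakeTensor{T,N}, I::Vararg{Int,N}) where {T,N} = zero(T)

t = FakeTensor{Float32,2}((2, 3))
show(stdout, MIME"text/plain"(), t) # prints a 2×3 matrix of 0.0s
```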
It was because the indexing had a bug that I was trying to figure out. And …
We don't need to fall back to NNlib's …
It is hitting the Torch conv, but likely with the wrong input.
It seems like rolling back to Julia v1.6-compatible versions of Flux and Metalhead can avoid the … Edit/err: fixed by limiting compatibility for NNlib.
We can probably drop support for intermediate versions of Julia and lower-bound it to 1.6.
Yes - I was just trying out whether the old version would still pass on Julia v1.5, as v1.6 is still causing trouble wrt. the Flux integration.
Alright: the older Julia 1.5 Manifest was using NNlib 0.7.10, and indeed, if limiting NNlib to <= 0.7.24, the tests on Julia 1.6 get as far as the tests on Julia 1.5 currently get: Success! ✅ Edit: Fixed by running on a pre-Ampere / CUDNN 7-compatible GPU.
```julia
test_output = NNlib.conv(x, w, cdims)
test_output = Array(test_output)
@test maximum(abs.(test_output - expected_output)) < 10 * eps(Float32)
```
@bjosv79, @DhairyaLGandhi: Did you ever experience problems with the numerical accuracy of these tests? (in relation to #38)
It seems they were never included in runtests.jl, so I suggest leaving them either skipped/marked broken or with the current relaxed constraint on their numerical accuracy.
Notably:

* Limited NNlib compat to <= 0.7.24: DenseConvDims was changed (breaking) in v0.7.25 (in FluxML/NNlib.jl@5ffabbc).

Also:

* Limited test-compat for Flux to v0.11.
* Limited test-compat for Zygote to v0.5.
* Removed Manifest.toml.
* Buildkite: Updated cuda definition.
* Buildkite: Set cap to sm_75 to limit to pre-Ampere GPUs (compatible with Torch_jll v1.4 CUDNN 7).
* Buildkite: Dropped testing on Julia > v1.6. Julia v1.7+ needs a newer version of Flux.jl (> v0.11) to support a newer version of CUDA.jl (> v2).
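As a sketch, the NNlib pin described above would look roughly like this in Project.toml (the lower bound shown is illustrative, not taken from this PR; Pkg accepts hyphenated version ranges in `[compat]`):

```toml
[compat]
# Keep NNlib below v0.7.25, where DenseConvDims' type parameters
# were reordered (FluxML/NNlib.jl@5ffabbc).
NNlib = "0.6 - 0.7.24"
```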
On CI:

* max abs difference was up to 6.1035156f-5.
* max abs difference for L61 was as high as 0.017906189f0.

Also:

* Included test_nnlib.jl in runtests.jl.
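For illustration (the arrays and tolerances below are placeholders, not this PR's final values), a relaxed absolute check like the diff above can be contrasted with a relative check, which scales with the magnitude of the outputs:

```julia
using Test

# Hypothetical stand-ins for the conv outputs compared in test_nnlib.jl.
test_output     = Float32[1.0000, 2.0001, 3.0000]
expected_output = Float32[1.0,    2.0,    3.0]

# Absolute-difference check, widened to cover GPU float accumulation
# (up to ~0.018 was observed on CI for one case):
@test maximum(abs.(test_output - expected_output)) < 2f-2

# Relative check: tolerance scales with the magnitude of the values.
@test isapprox(test_output, expected_output; rtol = 1f-3)
```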
LGTM :-)
ToucheSir left a comment
Do you mind reminding me what the order of PRs and dependencies going forward is?
I would suggest the following: …