Fixed compat to enable tests on CI #60
ToucheSir merged 3 commits into FluxML:master from stemann:feature/buildkite_cuda_10.2
Conversation
The CI seems like it couldn't find the head of this PR; could we push to the branch and try to force it with a valid commit?
Right - strange... Where does one query/report Buildkite CI issues? On Slack in #ci-failures?
@ToucheSir I got a little persistent with getting the tests running (I had a false memory of there being more tests)... Anyway, the tests are being run now (with CUDA 10.2) - they are running using Flux v0.12 on Julia 1.9 (and using Flux v0.11 on Julia 1.6). I have almost zero experience with Flux: are there any obvious quick fixes to just get the tests passing? I changed the Flux testset to:

```julia
@testset "Flux" begin
    resnet = ResNet(18)
    tresnet = Flux.fmap(Torch.to_tensor, resnet.layers)

    ip = rand(Float32, 224, 224, 3, 1) # An RGB Image
    tip = tensor(ip, dev = 0)          # 0 => GPU:0 in Torch

    top = tresnet(tip)
    op = resnet.layers(ip)
    gs = gradient(() -> sum(tresnet(tip)), Flux.params(tresnet))

    @test top isa Tensor
    @test size(top) == size(op)
    @test gs isa Flux.Zygote.Grads
end
```
I'm not sure, but I suspect the very outdated compat for NNlib in Project.toml is holding Metalhead back, giving an older version which doesn't behave as expected.
LocalPreferences.toml (Outdated)
If I understand Preferences.jl correctly, we should not have this checked in. If we need to set the preference at a package level, it should be set in Project.toml.
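As a sketch of what this review comment suggests (assuming standard Preferences.jl semantics; the dependency name and key below are hypothetical, not from this PR), a package-level preference would move from a checked-in LocalPreferences.toml into a `[preferences]` table in the package's own Project.toml:

```toml
# Hypothetical example: instead of committing LocalPreferences.toml with
#   [SomeDep]
#   some_flag = true
# the same preference is recorded in the package's Project.toml:
[preferences.SomeDep]
some_flag = true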
Err... that thing was easily fixed ... eventually - by just not doing it completely wrong... (using …). With the current code, it now stumbles on a more challenging error in

```julia
@testset "Flux" begin
    resnet = ResNet18()
    tresnet = Flux.fmap(Torch.to_tensor, resnet.layers)

    ip = rand(Float32, 224, 224, 3, 1) # An RGB Image
    tip = tensor(ip, dev = 0)          # 0 => GPU:0 in Torch

    top = tresnet(tip)
    op = resnet.layers(ip)
    gs = gradient(() -> sum(tresnet(tip)), Flux.params(tresnet))

    @test top isa Tensor
    @test size(top) == size(op)
    @test gs isa Flux.Zygote.Grads
end
```

The call to

```julia
function NNlib.conv(x::Tensor{xT,N}, w::Tensor, b::Tensor{T},
                    cdims::DenseConvDims{M,K,C_in,C_out,S,P,D,F}) where {T,N,xT,M,K,C_in,C_out,S,P,D,F}
    op = conv2d(x, w, b, stride = collect(S), padding = [P[1]; P[3]], dilation = collect(D))
    op
end
```

hits a bounds error:

```
ERROR: BoundsError: attempt to access Tuple{Int64, Int64} at index [3]
Stacktrace:
 [1] getindex(t::Tuple, i::Int64)
   @ Base ./tuple.jl:29
 [2] macro expansion
   @ ./show.jl:1128 [inlined]
 [3] conv(x::Tensor{Float32, 4}, w::Tensor{Float32, 4}, b::Tensor{Float32, 1}, cdims::DenseConvDims{2, (7, 7), 3, 64, 1, (2, 2), (3, 3, 3, 3), (1, 1), false})
   @ Torch ~/jsa/Torch.jl/src/nnlib.jl:9
 [4] conv(x::Tensor{Float32, 4}, w::Tensor{Float32, 4}, cdims::DenseConvDims{2, (7, 7), 3, 64, 1, (2, 2), (3, 3, 3, 3), (1, 1), false})
   @ Torch ~/jsa/Torch.jl/src/nnlib.jl:15
 [5] (::Conv{2, 2, typeof(identity), Tensor{Float32, 4}, Tensor{Float32, 1}})(x::Tensor{Float32, 4})
   @ Flux ~/.julia/packages/Flux/BPPNj/src/layers/conv.jl:166
```

where:

```julia
typeof(x), size(x) # = (Tensor{Float32, 4}, (224, 224, 3, 1))
typeof(w), size(w) # = (Tensor{Float32, 4}, (7, 7, 3, 64))
typeof(b), size(b) # = (Tensor{Float32, 1}, (64,))
b # = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
cdims # = DenseConvDims: (224, 224, 3) * (7, 7) -> (112, 112, 64), stride: (2, 2), pad: (3, 3, 3, 3), dil: (1, 1), flip: false, groups: 1
S # = 1
P # = (2, 2)
D # = (3, 3, 3, 3)
```

@ToucheSir Any suggestion for what is failing here?
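For context, the mismatched values (`S = 1`, `P = (2, 2)`, `D = (3, 3, 3, 3)` where a stride of `(2, 2)`, padding of `(3, 3, 3, 3)` and dilation of `(1, 1)` were expected) are what destructuring `DenseConvDims` type parameters yields once their order changes between NNlib versions. A version-agnostic sketch (not this PR's fix; the PR pins NNlib instead) would read the geometry through NNlib's accessor functions rather than type parameters:

```julia
# Sketch: read conv geometry via NNlib's accessors, which are stable
# across the v0.7.25 reshuffle of DenseConvDims' type parameters.
using NNlib

x = rand(Float32, 224, 224, 3, 1)
w = rand(Float32, 7, 7, 3, 64)
cdims = DenseConvDims(x, w; stride = (2, 2), padding = (3, 3, 3, 3), dilation = (1, 1))

s = NNlib.stride(cdims)   # (2, 2)
p = NNlib.padding(cdims)  # (3, 3, 3, 3)
d = NNlib.dilation(cdims) # (1, 1)
```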
Just changing …
@DhairyaLGandhi Can you remember why

```julia
function Base.getindex(t::Tensor{T,N}, I::Vararg{Int,N}) where {T,N}
    # @show reverse!(collect(I)) .- 1, size(t)
    # at_double_value_at_indexes(t.ptr, reverse!(collect(I)) .- 1, N)
    zero(T)
end
```

It is needed for display of …
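For background (an illustration, not Torch.jl's actual code): Julia displays an `AbstractArray` by calling scalar `getindex` for each element, so a wrapper type without it cannot be printed at all. A minimal sketch of why a stub like the one above makes display work, at the cost of showing zeros:

```julia
# Minimal sketch: an AbstractArray wrapper whose scalar getindex returns
# zero(T), mirroring the Torch.jl stub quoted above.
struct FakeTensor{T,N} <: AbstractArray{T,N}
    dims::NTuple{N,Int}
end

Base.size(t::FakeTensor) = t.dims
Base.getindex(t::FakeTensor{T,N}, I::Vararg{Int,N}) where {T,N} = zero(T)

t = FakeTensor{Float32,2}((2, 3))
show(stdout, MIME"text/plain"(), t) # prints a 2×3 matrix of 0.0s
```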
It was because the indexing had a bug that I was trying to figure out. And …
We don't need to fall back to NNlib's …
It is hitting the Torch conv, but likely with the wrong input.
It seems like rolling back to Julia v1.6-compatible versions of Flux and Metalhead can avoid the … Edit/err: fixed by limiting compatibility for NNlib.
We can probably drop support for intermediate versions of Julia and lower-bound it to 1.6.
Yes - I was just trying out whether the old version would still pass on Julia v1.5, as v1.6 is still causing trouble wrt. the Flux integration.
Alright: the older Julia 1.5 Manifest was using NNlib 0.7.10, and indeed, if limiting NNlib to <= 0.7.24, the tests on Julia 1.6 get as far as the tests on Julia 1.5 currently get: Success! ✅ Edit: Fixed by running on a pre-Ampere / CUDNN 7-compatible GPU.
```julia
test_output = NNlib.conv(x, w, cdims)
test_output = Array(test_output)
@test maximum(abs.(test_output - expected_output)) < 10 * eps(Float32)
```
@bjosv79, @DhairyaLGandhi: Did you ever experience problems with the numerical accuracy of these tests? (in relation to #38)
It seems they were never included in runtests.jl, so I suggest leaving them either skipped/marked broken or with the current relaxed constraint on their numerical accuracy.
Notably:

* Limited NNlib compat to <= 0.7.24: DenseConvDims was changed (breaking) in v0.7.25 (in FluxML/NNlib.jl@5ffabbc).

Also:

* Limited test-compat for Flux to v0.11.
* Limited test-compat for Zygote to v0.5.
* Removed Manifest.toml.
* Buildkite: Updated cuda definition.
* Buildkite: Set cap to sm_75 to limit to pre-Ampere GPUs (compatible with Torch_jll v1.4 CUDNN 7).
* Buildkite: Dropped testing on Julia > v1.6. Julia v1.7+ needs a newer version of Flux.jl (> v0.11) to support a newer version of CUDA.jl (> v2).
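As a sketch, the NNlib pin described above would look roughly like this in Project.toml (the lower bound shown is illustrative, not taken from this PR; Pkg accepts hyphenated version ranges in `[compat]`):

```toml
[compat]
# Keep NNlib below v0.7.25, where DenseConvDims' type parameters
# were reordered (FluxML/NNlib.jl@5ffabbc).
NNlib = "0.6 - 0.7.24"
```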
On CI:

* max abs difference was up to 6.1035156f-5.
* max abs difference for L61 was as high as 0.017906189f0.

Also:

* Included test_nnlib.jl in runtests.jl.
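For illustration (the arrays and tolerances below are placeholders, not this PR's final values), a relaxed absolute check like the diff above can be contrasted with a relative check, which scales with the magnitude of the outputs:

```julia
using Test

# Hypothetical stand-ins for the conv outputs compared in test_nnlib.jl.
test_output     = Float32[1.0000, 2.0001, 3.0000]
expected_output = Float32[1.0,    2.0,    3.0]

# Absolute-difference check, widened to cover GPU float accumulation
# (up to ~0.018 was observed on CI for one case):
@test maximum(abs.(test_output - expected_output)) < 2f-2

# Relative check: tolerance scales with the magnitude of the values.
@test isapprox(test_output, expected_output; rtol = 1f-3)
```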
LGTM :-)
ToucheSir left a comment
Do you mind reminding me what the order of PRs and dependencies going forward is?
I would suggest the following: …