Make seed! and extract_jacobian! gpu-friendly #472
Conversation
Interestingly, when trying

```julia
ip = Iterators.product(ydual, 1:n)
out_reshaped .= map(rc -> partials(T, rc[1], rc[2]), ip)
```

I can see in the GPU stack trace:

```
ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceArray{Float32,2,1}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(identity),Tuple{Base.Broadcast.Extruded{Array{Float32,2},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}, Int64) failed
KernelError: passing and using non-bitstype argument
Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(identity),Tuple{Base.Broadcast.Extruded{Array{Float32,2},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}, which is not isbits:
  .args is of type Tuple{Base.Broadcast.Extruded{Array{Float32,2},Tuple{Bool,Bool},Tuple{Int64,Int64}}} which is not isbits.
    .1 is of type Base.Broadcast.Extruded{Array{Float32,2},Tuple{Bool,Bool},Tuple{Int64,Int64}} which is not isbits.
      .x is of type Array{Float32,2} which is not isbits.
```

that the array in the broadcast is a plain CPU `Array{Float32,2}` rather than a `CuArray`. Not sure who to ping here.
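For context, a minimal sketch of this failure mode and a GPU-friendly alternative (not the PR's actual code; `y`, `out`, and `n` are illustrative stand-ins for `ydual`, `out_reshaped`, and the chunk size, and running it requires a CUDA-capable GPU):

```julia
using CUDA
CUDA.allowscalar(false)

y = CUDA.rand(Float32, 4)        # stand-in for ydual
out = CUDA.zeros(Float32, 4, 3)  # stand-in for out_reshaped
n = 3

# Failure mode (as in the stack trace above): mapping over the lazy product
# falls back to a CPU Array{Float32,2}, which then cannot be passed to the
# GPU broadcast kernel because it is not isbits:
# out .= map(t -> t[1] * t[2], Iterators.product(y, 1:n))

# GPU-friendly alternative: broadcast directly against an isbits index range
# reshaped to a row, so the whole computation stays on the device:
out .= y .* reshape(1:n, 1, n)
```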
(force-pushed from a08d02d to b6e6984)
Alright, I worked out the issue.

```julia
using ForwardDiff, BenchmarkTools
x = rand(1000);
cfg = ForwardDiff.GradientConfig(nothing, x);
@btime ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds));
```

results in 0 allocations, and

```julia
using CUDA, ForwardDiff, Test;
CUDA.allowscalar(false);
function f!(F, x)
    t = Tuple(x)
    F_nt = ntuple(Val(length(F))) do i
        i == 1 && (F_i = (t[1] + 3) * (t[2]^3 - 7) + 18)
        i == 2 && (F_i = sin(t[2] * exp(t[1]) - 1))
        F_i
    end
    F .= F_nt
end;
x_initial = CUDA.CuArray(Float32[0.1; 1.2]);
# x_initial = Float32[0.1; 1.2];
J = similar(x_initial, (length(x_initial), length(x_initial)));
F = similar(x_initial);
Ja = [-5.272    13.392
       1.25627   1.04689]
ForwardDiff.jacobian!(J, f!, F, x_initial);
@test all(Array(J) .≈ Ja)
```

passes.
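For reference, the entries of `Ja` match the analytic Jacobian of `f!` evaluated at `x_initial`; a quick hand check (not part of the PR):

```julia
# F1 = (x1 + 3)*(x2^3 - 7) + 18 and F2 = sin(x2*exp(x1) - 1), at (x1, x2) = (0.1, 1.2):
x1, x2 = 0.1, 1.2
J11 = x2^3 - 7                              # ∂F1/∂x1 ≈ -5.272
J12 = 3 * x2^2 * (x1 + 3)                   # ∂F1/∂x2 ≈ 13.392
J21 = cos(x2 * exp(x1) - 1) * x2 * exp(x1)  # ∂F2/∂x1 ≈ 1.25627
J22 = cos(x2 * exp(x1) - 1) * exp(x1)       # ∂F2/∂x2 ≈ 1.04689
```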
(force-pushed from b6e6984 to f39a36b)
(Title changed from "seeds! and extract_jacobian! gpu-friendly" to "seed! and extract_jacobian! gpu-friendly")
Ok, looks like Travis is passing. Some of AppVeyor is, too, but it looks like Windows has not consistently passed in CI.
This would be a good candidate for the JuliaLang GPU CI.
bump
LGTM, though I have some small comments about naming things and about using `vec` for reshaping to a vector.
This might be a tiny bit slower than it is now for `extract_jacobian_chunk!`, but I think it's probably optimized out, at least in 1.5, and that is not performance-critical anyway, so if this gets us GPU support then that is a win.
Please bump the version number as a non-breaking change.
Strong agreement re: setting up JuliaGPU tests, but I won't block the PR over that.
Once these things are addressed I will merge.
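For context, a tiny illustration of the `vec` suggestion from the review above (illustrative array, not PR code):

```julia
A = rand(3, 4)
v1 = reshape(A, length(A))  # reshape to a vector via an explicit length
v2 = vec(A)                 # equivalent but clearer; both share A's memory
v1 == v2                    # true
```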
(force-pushed from f39a36b to b4cb1ac)
I changed to `vec`, and I bumped the version number.
How did the benchmark show 0 allocations when views and broadcasting were missed in some places? Were those not benchmarked? Would be good to ensure all the stuff changed here has been benchmarked and confirmed not to allocate.
Where were views/broadcasts missing? I agree that adding a test would be nice. I'll see if I can add that somewhere. The only piece I checked locally was

```julia
julia> using ForwardDiff, BenchmarkTools

julia> x = rand(1000);

julia> cfg = ForwardDiff.GradientConfig(nothing, x);

julia> @btime ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds));
  54.821 ns (0 allocations: 0 bytes)
```
I thought e.g. #472 (comment) would have caused allocations.
I must have had the view in the version I benchmarked. For good measure, I added allocation tests.
Ah, glad I added the tests; I just realized that I missed one method.
Alright, now all of the `seed!` methods are covered by the allocation tests.
(force-pushed from f303aea to a5b9e7f)
Ok, I also squashed. I think this should be ready to go.
Allocation tests are failing on Julia 1.4 and 1.0. The failure on nightly is unrelated and is fixed by a separate PR.
test/AllocationsTest.jl (Outdated)

```julia
x = rand(1000)
cfg = ForwardDiff.GradientConfig(nothing, x)

balloc = @ballocated ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds))
```
Doesn't just `@allocated` work?
I tried, but I wasn't able to interpolate with `@allocated`, so it results in non-zero allocations.
Actually, it looks like we can use `@allocated`; we'll just need to duplicate the calls and remove the `getproperty`/`getindex` in the calls:

```julia
@testset "Test seed! allocations" begin
    x = rand(1000)
    cfg = ForwardDiff.GradientConfig(nothing, x)
    duals = cfg.duals
    seeds = cfg.seeds
    seed = cfg.seeds[1]

    # Each call is made twice: the first run includes compilation
    # allocations, so only the second measurement is tested.
    alloc = @allocated ForwardDiff.seed!(duals, x, seeds)
    alloc = @allocated ForwardDiff.seed!(duals, x, seeds)
    @test alloc == 0

    alloc = @allocated ForwardDiff.seed!(duals, x, seed)
    alloc = @allocated ForwardDiff.seed!(duals, x, seed)
    @test alloc == 0

    index = 1
    alloc = @allocated ForwardDiff.seed!(duals, x, index, seeds)
    alloc = @allocated ForwardDiff.seed!(duals, x, index, seeds)
    @test alloc == 0

    alloc = @allocated ForwardDiff.seed!(duals, x, index, seed)
    alloc = @allocated ForwardDiff.seed!(duals, x, index, seed)
    @test alloc == 0
end
```
Let me know which one is preferred
I think that originally (in e.g. 1.0) Julia tried to exclude compilation from the output of `@allocated`, but that proved unreliable, and so that was removed in 1.4 (ish?).
In response to that I added `@ballocated` to BenchmarkTools, which just pulls the allocations out of the fastest run according to `@benchmark` (which will naturally not be the one that included the compilation).
Personally I think `@ballocated` is nicer than duplicating, but I will defer to others.
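For illustration, a minimal sketch of the `@ballocated` approach (the function `h` is a hypothetical non-allocating example, not PR code):

```julia
using BenchmarkTools, Test

h(v) = sum(v)  # hypothetical non-allocating function
v = rand(100)

# @ballocated supports $-interpolation like @benchmark and reports the
# allocations of the fastest of many runs, so compilation and global-scope
# overhead never contaminate the result:
@test (@ballocated h($v)) == 0
```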
I agree that `@ballocated` seems nicer, but I'll also defer to others as this is my first PR to this repo.
`@ballocated` doesn't really make sense to me. You run the function thousands of times and then pick out the lowest one. For functions that don't allocate, that will be the second run. So why run the function 998 more times? Seems like it will just slow down the test suite.
Ok, I'll switch to the code snippet above.
Updated.
Ah, good point. I'll update to test on 1.5 and above.
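A minimal sketch of how that version gate could look in the test runner (the exact mechanism used in the PR is an assumption; the file name follows the test file shown above):

```julia
# Skip the allocation checks on Julia < 1.5, where @allocated's handling
# of compilation allocations was inconsistent (see discussion above):
if VERSION >= v"1.5"
    include("AllocationsTest.jl")
end
```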
(force-pushed from 23ed88e to 46bd294)
Bump
(force-pushed from 46bd294 to 3b6f62b)
Commits: "Use broadcast/macro consistently", "Fix jac", "Add AllocationsTest.jl"
(force-pushed from 3b6f62b to e0be439)
Rebased and re-bumped the version.
Per @vchuravy's suggestion to open a new PR: Now that JuliaLang/julia#14955 is resolved, this PR revives the work of #353 and #406.