
Make seed! and extract_jacobian! gpu-friendly #472

Merged Nov 30, 2020 (1 commit)

Conversation

@charleskawczynski (Contributor) commented Nov 9, 2020

Per @vchuravy's suggestion to open a new PR: Now that JuliaLang/julia#14955 is resolved, this PR revives the work of #353 and #406.

@charleskawczynski (Contributor, Author) commented Nov 9, 2020

Interestingly, when trying

    ip = Iterators.product(ydual, 1:n)
    out_reshaped .= map(rc->partials(T, rc[1], rc[2]), ip)

I can see in the GPU stack trace:

ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceArray{Float32,2,1}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(identity),Tuple{Base.Broadcast.Extruded{Array{Float32,2},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(identity),Tuple{Base.Broadcast.Extruded{Array{Float32,2},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}, which is not isbits:
  .args is of type Tuple{Base.Broadcast.Extruded{Array{Float32,2},Tuple{Bool,Bool},Tuple{Int64,Int64}}} which is not isbits.
    .1 is of type Base.Broadcast.Extruded{Array{Float32,2},Tuple{Bool,Bool},Tuple{Int64,Int64}} which is not isbits.
      .x is of type Array{Float32,2} which is not isbits.

that the array inside the Extruded wrapper is not similar to the device array. I'm not familiar enough with Extruded, but maybe we can change the initialization to use similar so that this solution is isbits?
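A minimal CPU-only sketch of the isbits constraint (Wrapper below is a hypothetical stand-in for Base.Broadcast.Extruded, not ForwardDiff or CUDA API):

```julia
# Hypothetical Wrapper standing in for Broadcast.Extruded: a GPU kernel
# argument must be isbits, and any struct that captures a CPU Array is not,
# because Array is a heap reference rather than inline data.
struct Wrapper{A}
    x::A
end

isbitstype(typeof(Wrapper(zeros(Float32, 2, 2))))  # false: captures an Array
isbitstype(typeof(Wrapper((1.0f0, 2.0f0))))        # true: a Tuple of Float32 is isbits
```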

Not sure who to ping here

Review comment on src/apiutils.jl (outdated, resolved)
@charleskawczynski (Contributor, Author) commented Nov 9, 2020

Alright, I worked out the issue with extract_jacobian!, thanks @vchuravy! With this PR:

using ForwardDiff, BenchmarkTools
x = rand(1000);
cfg = ForwardDiff.GradientConfig(nothing, x);
@btime ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds));

Results in 0 allocations, and

using CUDA, ForwardDiff, Test;
CUDA.allowscalar(false);

function f!(F, x)
  t = Tuple(x)
  F_nt = ntuple(Val(length(F))) do i
      i==1 && (F_i = (t[1] + 3) * (t[2]^3 - 7) + 18)
      i==2 && (F_i = sin(t[2] * exp(t[1]) - 1))
      F_i
  end
  F .= F_nt
end;

x_initial = CUDA.CuArray(Float32[0.1; 1.2]);
# x_initial = Float32[0.1; 1.2];
J = similar(x_initial, (length(x_initial),length(x_initial)));
F = similar(x_initial);
Ja =[-5.272    13.392
  1.25627   1.04689]
ForwardDiff.jacobian!(J, f!, F, x_initial);
@test all(Array(J) .≈ Ja)

passes.

Review comment on src/apiutils.jl (outdated, resolved)
@charleskawczynski charleskawczynski changed the title Make seeds! and extract_jacobian! gpu-friendly Make seed! and extract_jacobian! gpu-friendly Nov 10, 2020
@charleskawczynski (Contributor, Author) commented Nov 10, 2020

Ok, looks like Travis is passing. Some of the AppVeyor jobs are passing too, but Windows has not been passing consistently in CI.

@simonbyrne (Contributor) commented:

This would be a good candidate for the julialang GPU CI.

@charleskawczynski (Contributor, Author) commented:
bump

@oxinabox (Member) left a comment:

LGTM, though some small comments about naming things and using vec for reshaping to a vector.

This might be a tiny bit slower than it is now for extract_jacobian_chunk!, but I think it is probably optimized out, at least on 1.5, and that path is not performance-critical anyway, so if this gets us GPU support then that is a win.

Please bump the version number as a non-breaking change.

Strong agreement re: setting up JuliaGPU tests, but I won't block the PR over that.

Once these things are addressed I will merge.

Review comments on src/apiutils.jl and src/jacobian.jl (outdated, resolved)
@charleskawczynski (Contributor, Author) commented:

LGTM, though some small comments about naming things and using vec for reshaping to a vector.

I changed to vec, looks like it still runs fine on the GPU.
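For reference, a small sketch of why vec is a safe swap here (plain Julia, no GPU required): vec returns a vector that shares the array's memory rather than copying it.

```julia
# vec gives a one-dimensional view of the same underlying memory, no copy.
A = rand(Float32, 2, 3)
v = vec(A)        # 6-element Vector{Float32} backed by A's data
v[1] = 0.0f0
A[1, 1] == 0.0f0  # true: the mutation is visible through A
```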

Please bump the version number as a non-breaking change.

I bumped version = "0.10.12" ➡️ version = "0.10.13"; is that the version bump you mean?

@KristofferC (Collaborator) commented:

How did the benchmark show 0 allocations when views and broadcasting were missed in some places? Were those not benchmarked? It would be good to ensure that everything changed here has been benchmarked and confirmed not to allocate.

@charleskawczynski (Contributor, Author) commented Nov 17, 2020

How did the benchmark show 0 allocations when views and broadcasting were missed in some places? Were those not benchmarked? It would be good to ensure that everything changed here has been benchmarked and confirmed not to allocate.

Where were views/broadcasts missing? I agree that adding a test would be nice. I'll see if I can add that somewhere. The only piece I checked locally was

julia> using ForwardDiff, BenchmarkTools


julia> x = rand(1000);

julia> cfg = ForwardDiff.GradientConfig(nothing, x);

julia> @btime ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds));
  54.821 ns (0 allocations: 0 bytes)

@KristofferC (Collaborator) commented:

I thought e.g. #472 (comment) would have caused allocations.

@charleskawczynski (Contributor, Author) commented:

I thought e.g. #472 (comment) would have caused allocations.

I must have had the view in the seed!(duals::AbstractArray{Dual{T,V,N}}, x, seeds::NTuple{N,Partials{N,V}}) where {T,V,N} signature but not in seed!(duals::AbstractArray{Dual{T,V,N}}, x, index, seed::Partials{N,V} = zero(Partials{N,V})) where {T,V,N}, where you commented. I'm not sure, since I squashed that commit.

For good measure, I added BenchmarkTools to the test environment and added tests to ensure that we're getting zero allocations for all three seed! definitions.

@charleskawczynski (Contributor, Author) commented Nov 17, 2020

Ah, glad I added the tests: I just realized that I missed one method (the seed! with NTuple seeds and an index), and that one happens to be failing. Will update soon. It turns out I had to change the chunk seed! to use getindex.(Ref(seeds), 1:chunksize) to ensure zero allocations.
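The Ref trick works because broadcast treats a Ref as a scalar, so getindex is broadcast only over the index range. A standalone sketch (plain tuples rather than ForwardDiff's Partials):

```julia
# Ref(seeds) marks the tuple as a broadcast scalar; getindex then varies
# only over 1:chunksize, selecting the first `chunksize` entries without
# broadcasting over the tuple itself.
seeds = (1.0, 2.0, 3.0, 4.0)
chunksize = 2
getindex.(Ref(seeds), 1:chunksize)  # == [1.0, 2.0]
```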

@charleskawczynski (Contributor, Author) commented Nov 17, 2020

Alright, now all of the seed! methods are non-allocating and are tested to be non-allocating.

@charleskawczynski (Contributor, Author) commented:

Ok, I also squashed. I think this should be ready to go.

@oxinabox (Member) commented Nov 18, 2020

Allocation tests are failing on Julia 1.4 and 1.0. You probably want to run that file only on 1.5 and above.

The failure on nightly is unrelated and is fixed by
#476

x = rand(1000)
cfg = ForwardDiff.GradientConfig(nothing, x)

balloc = @ballocated ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds))
@KristofferC (Collaborator):
Doesn't just @allocated work?

@charleskawczynski (Contributor, Author):

I tried, but I wasn't able to interpolate with @allocated, so it results in non-zero allocations.

@charleskawczynski (Contributor, Author) commented Nov 18, 2020:

Actually, it looks like we can use @allocated, we'll just need to duplicate the calls and remove the getproperty/getindex in the calls:

@testset "Test seed! allocations" begin
    x = rand(1000)
    cfg = ForwardDiff.GradientConfig(nothing, x)
    duals = cfg.duals
    seeds = cfg.seeds
    seed = cfg.seeds[1]

    alloc = @allocated ForwardDiff.seed!(duals, x, seeds)
    alloc = @allocated ForwardDiff.seed!(duals, x, seeds)
    @test alloc == 0

    alloc = @allocated ForwardDiff.seed!(duals, x, seed)
    alloc = @allocated ForwardDiff.seed!(duals, x, seed)
    @test alloc == 0

    index = 1
    alloc = @allocated ForwardDiff.seed!(duals, x, index, seeds)
    alloc = @allocated ForwardDiff.seed!(duals, x, index, seeds)
    @test alloc == 0

    index = 1
    alloc = @allocated ForwardDiff.seed!(duals, x, index, seed)
    alloc = @allocated ForwardDiff.seed!(duals, x, index, seed)
    @test alloc == 0
end

Let me know which one is preferred

@oxinabox (Member):

I think that originally (in e.g. 1.0) Julia tried to exclude compilation from the output of @allocated, but that proved unreliable, so it was removed in 1.4 (ish?). In response, I added @ballocated to BenchmarkTools, which just pulls the allocations out of the fastest run according to @benchmark (which will naturally not be the run that included compilation).

I personally think @ballocated is nicer than duplicating, but I will defer to others.
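A usage sketch (assumes a BenchmarkTools version that provides @ballocated):

```julia
using BenchmarkTools

x = rand(1000)
# @ballocated reports the allocations of the minimum-time sample, so
# one-time compilation allocations are excluded automatically.
@ballocated sum($x)  # 0 for a non-allocating call
```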

@charleskawczynski (Contributor, Author):

I agree that @ballocated seems nicer, but I'll also defer to others as this is my first PR to this repo.

@KristofferC (Collaborator) commented Nov 23, 2020:

@ballocated doesn't really make sense to me. You run the function thousands of times and then pick out the lowest one. For functions that don't allocate, that will be the second run, so why run the function 998 more times? It seems like it will just slow down the test suite.

@charleskawczynski (Contributor, Author):

Ok, I'll switch to the code snippet above

@charleskawczynski (Contributor, Author):

Updated.

@charleskawczynski (Contributor, Author) commented:

Allocation tests are failing on Julia 1.4 and 1.0. You probably want to run that file only on 1.5 and above.

The failure on nightly is unrelated and is fixed by
#476

Ah, good point. I'll update to test on 1.5 and above.

@charleskawczynski (Contributor, Author) commented:

Bump

Commits:
- Use broadcast/macro consistently
- Fix jac
- Add AllocationsTest.jl
@charleskawczynski (Contributor, Author) commented:

Rebased and re-bumped the version.
