scalar getindex when shared index changes order #28

Closed
MacKenzieHnC opened this issue Oct 13, 2020 · 12 comments
Labels: gpu (anything involving a CuArray or similar)

MacKenzieHnC commented Oct 13, 2020

I was asked to make this an issue, but I'm not sure there's anything to be done about it.

TensorCast falls back to scalar indexing when a shared index changes position, which is nightmarishly slow for big calculations on the GPU.

using TensorCast
using CUDA
CUDA.allowscalar(false)

a, b, c, d = 1, 2, 3, 4
A = cu(rand(a,b,c))
B = cu(rand(a,c,d))

# Fast
@reduce C[a,b,d] := sum(c) A[a,b,c] * B[a,c,d]

# Fast
@reduce C[a,d,b] := sum(c) A[a,b,c] * B[a,c,d]

# Not fast
@reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d]

# Fast
@reduce C[b,c,d] := sum(a) A[a,b,c] * B[a,c,d]

# Not fast
@reduce C[c,b,d] := sum(a) A[a,b,c] * B[a,c,d]

mcabbott (Owner) commented

I think this is similar to #25, which is JuliaGPU/CUDA.jl#228. When the broadcast contains at least one naked CuArray, with the others reshaped / permuted, it still hits CUDA's broadcasting. But when all of the arguments are too deeply wrapped, the CuArray BroadcastStyle doesn't get passed outwards, and you get the default CPU broadcasting, hence scalar indexing. Two of these examples are:

julia> @pretty @reduce C[a,b,d] := sum(c) A[a,b,c] * B[a,c,d] # Fast
begin
    local raven = orient(B, (:, *, :, :))
    C = dropdims(sum(@__dot__(A * raven), dims = 3), dims = 3)
end

julia> @pretty @reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d] # Not fast
begin
    local raven = orient(PermutedDimsArray(A, (2, 1, 3)), (:, :, *, :))
    local deer = orient(PermutedDimsArray(B, (1, 3, 2)), (*, :, :, :))
    C = dropdims(sum(@__dot__(raven * deer), dims = 4), dims = 4)
end
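
For reference, a minimal sketch of where the style gets lost (assuming CUDA.jl of that era, per JuliaGPU/CUDA.jl#228; newer versions may handle more wrappers):

using CUDA

A = cu(rand(2, 3, 4))
Base.BroadcastStyle(typeof(A))                                # CUDA.CuArrayStyle{3}, GPU broadcasting
Base.BroadcastStyle(typeof(PermutedDimsArray(A, (2, 1, 3))))  # DefaultArrayStyle{3} on affected versions, CPU fallback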

The workaround from #25 (comment) might be useful? Other things which might work (but I haven't tried) are:

julia> @pretty @reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d]  nolazy 
begin
    local raven = orient(permutedims(A, (2, 1, 3)), (:, :, *, :))
    local deer = orient(permutedims(B, (1, 3, 2)), (*, :, :, :))
    C = dropdims(sum(@__dot__(raven * deer), dims = 4), dims = 4)
end

julia> @pretty @matmul C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d]
begin
    local (sz_a, sz_b, sz_c, sz_d) = (size(A, 1), size(A, 2), size(A, 3), size(B, 3))
    local raven = permutedims(B, (2, 1, 3))
    local deer = reshape(A, (star(sz_a, sz_b), sz_c))
    local seal = reshape(raven, (sz_c, star(sz_a, sz_d)))
    local goat = reshape(deer * seal, (sz_a, sz_b, sz_a, sz_d))
    local skunk = orient(permutedims(goat, (2, 1, 3, 4)), (:, :, :))
    C = skunk
end
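
In case it helps, a sketch of trying the nolazy variant above on the GPU (untested in this thread; with nolazy, permutedims makes eager copies, so plain CuArrays should be all that reach the broadcast):

using TensorCast, CUDA
CUDA.allowscalar(false)
A = cu(rand(1, 2, 3)); B = cu(rand(1, 3, 4))
@reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d]  nolazy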

mcabbott (Owner) commented

On the CPU, orient(...) usually reshapes, but sometimes makes a copy, since things like ReshapedArray(Transpose(...)) are very slow. That copying was disabled on the GPU in #10, and one possibility would be to change that.

Another plan I've had is to replace things like orient(PermutedDimsArray(A, (2, 1, 3)), (:, :, *, :)) with transmute(A, (2,1,0,3)) from https://github.com/mcabbott/TransmuteDims.jl . Since that is just one wrapper, it would avoid this issue.
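
A CPU sketch of that equivalence (assuming TransmuteDims.jl; in its API at the time, 0 marked an inserted trivial dimension, later spelled nothing):

using TransmuteDims
using TensorCast: orient

A = rand(2, 3, 4)
W = orient(PermutedDimsArray(A, (2, 1, 3)), (:, :, *, :))  # two nested wrappers
T = transmute(A, (2, 1, 0, 3))                             # one wrapper, same layout
size(T) == size(W) == (3, 2, 1, 4) && T == W               # true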

MacKenzieHnC commented Oct 15, 2020

FWIW, my project worked just fine once I data-wrangled things so they were no longer in a (b,a,d) ordering.

This doesn't seem like a huge problem to me now that I understand how it works. It could be enough just to add something to the docs warning about it (at least as a quick patch).

Let me take a day to recover from my project and I'll start looking at this in-depth. I seriously owe you for making this library.

MacKenzieHnC commented Oct 23, 2020

Sorry, it turns out there was much more work to be done.

Have you tested the transmute() solution? That seems really simple and maintainable, if it works.

Edit: Oh, of course, you wrote that library too, haha. So are there any issues with that method that you know of?

mcabbott (Owner) commented

I haven't tried it out; it means diving deep into the belly of this package's logic, as it still needs to call reshape in other contexts. And TransmuteDims needs a bit of tidying up: GPUArrays + Adapt have had a bunch of changes. But eventually I'll get there!

MacKenzieHnC commented Oct 23, 2020

Gotcha. I thought it would be easier than that, but if I'm being honest, your code's a little above my pay grade.

I ran some benchmarks on the easy cases, though.

For small arrays, the trick and the solution have a pretty bad worst case, but otherwise they are pretty comparable to the case where the index doesn't change order. And the difference in the worst case basically disappears as the arrays get large.

Either seems viable.

EDIT: Corrected the tests and reran them.

using TensorCast
using CUDA
CUDA.allowscalar(true)

function easy(A, B)
    CUDA.@sync @reduce C[a,b,d] := sum(c) A[a,b,c] * B[a,c,d]
    return C
end

function simple(A, B)
    CUDA.@sync @reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d]
    return C
end

function trick(A, B)
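    # 0-dim CuArray of false (i.e. zero): putting one naked CuArray in the broadcast restores the GPU broadcast style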
    trick = cu(fill(false))
    CUDA.@sync @reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d] + trick
    return C
end


using TensorCast: orient, star
function solution(A, B) 
    local raven = orient(permutedims(A, (2, 1, 3)), (:, :, *, :))
    local deer = orient(permutedims(B, (1, 3, 2)), (*, :, :, :))
    CUDA.@sync C = dropdims(sum(@__dot__(raven * deer), dims = 4), dims = 4)
end

Benchmarks for small arrays

julia> @benchmark easy($(cu(rand(1,2,3))), $(cu(rand(1,3,4))))
BenchmarkTools.Trial: 
  memory estimate:  2.53 KiB
  allocs estimate:  97
  --------------
  minimum time:     57.300 μs (0.00% GC)
  median time:      75.201 μs (0.00% GC)
  mean time:        80.069 μs (0.00% GC)
  maximum time:     4.932 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 

julia> @benchmark simple($(cu(rand(1,2,3))), $(cu(rand(1,3,4))))
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays C:\Users\Lepre\.julia\packages\GPUArrays\uaFZh\src\host\indexing.jl:43
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  191
  --------------
  minimum time:     1.756 ms (0.00% GC)
  median time:      1.849 ms (0.00% GC)
  mean time:        1.905 ms (0.00% GC)
  maximum time:     3.293 ms (0.00% GC)
  --------------
  samples:          2629
  evals/sample:     1

julia> 

julia> @benchmark trick($(cu(rand(1,2,3))), $(cu(rand(1,3,4))))
BenchmarkTools.Trial: 
  memory estimate:  12.84 KiB
  allocs estimate:  201
  --------------
  minimum time:     88.200 μs (0.00% GC)
  median time:      108.900 μs (0.00% GC)
  mean time:        116.488 μs (1.44% GC)
  maximum time:     19.184 ms (45.69% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 

julia> @benchmark solution($(cu(rand(1,2,3))), $(cu(rand(1,3,4))))
BenchmarkTools.Trial: 
  memory estimate:  4.42 KiB
  allocs estimate:  166
  --------------
  minimum time:     66.600 μs (0.00% GC)
  median time:      85.000 μs (0.00% GC)
  mean time:        87.695 μs (0.00% GC)
  maximum time:     7.036 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Benchmarks for big arrays

julia> @benchmark easy($(cu(rand(50,100,150))), $(cu(rand(50,150,200))))
BenchmarkTools.Trial:
  memory estimate:  2.72 KiB
  allocs estimate:  106
  --------------
  minimum time:     105.321 ms (0.00% GC)
  median time:      106.172 ms (0.00% GC)
  mean time:        110.391 ms (0.07% GC)
  maximum time:     172.219 ms (0.65% GC)
  --------------
  samples:          46
  evals/sample:     1

julia> # Skipped the allow scalar one because I got impatient.

julia> @benchmark trick($(cu(rand(50,100,150))), $(cu(rand(50,150,200))))
BenchmarkTools.Trial: 
  memory estimate:  13.03 KiB
  allocs estimate:  210
  --------------
  minimum time:     131.732 ms (0.00% GC)
  median time:      132.681 ms (0.00% GC)
  mean time:        137.899 ms (0.07% GC)
  maximum time:     198.738 ms (0.65% GC)
  --------------
  samples:          37
  evals/sample:     1

julia> 

julia> @benchmark solution($(cu(rand(50,100,150))), $(cu(rand(50,150,200))))
BenchmarkTools.Trial:
  memory estimate:  4.61 KiB
  allocs estimate:  175
  --------------
  minimum time:     104.896 ms (0.00% GC)
  median time:      106.823 ms (0.00% GC)
  mean time:        111.285 ms (0.08% GC)
  maximum time:     176.250 ms (0.72% GC)
  --------------
  samples:          45
  evals/sample:     1

mcabbott (Owner) commented

Thanks for testing. I think these might need CUDA.@sync in there, and a few more $ signs?

But if these are accurate, then I'm a little surprised that "trick" costs as much as 50 μs. The variance of the large cases also seems huge; I guess they all allocate memory.

Note that if index a were last, then C[b,d,a] := sum(c) A[b,c,a] * B[c,d,a] would be exactly NNlib.batched_mul, which ought to be faster, as it's some CuBLAS call. That function in fact allows the batch index to be any place except first, but you need FluxML/NNlib.jl#191 and its now-outdated CuArrays.jl cousin... but this still won't do C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d], sadly.
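
A CPU sketch of that correspondence (assuming NNlib.jl; batched_mul multiplies matching matrix slices along a trailing batch dimension):

using NNlib, TensorCast

A = rand(2, 3, 5); B = rand(3, 4, 5)             # batch index a is last
@reduce C1[b,d,a] := sum(c) A[b,c,a] * B[c,d,a]
C1 ≈ NNlib.batched_mul(A, B)                      # true, via batched BLAS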

MacKenzieHnC commented Oct 23, 2020

Whoops. I went ahead and made those corrections, but they did not make a huge difference, other than the minimum times increasing to be closer to the medians, and the maximum times increasing slightly.

The ratios between techniques stayed about the same.

Do the results seem more in line with your thinking now?

P.S. I attempted to implement the second solution you posted (@matmul), but I'm getting errors from the final orient(). I'll keep trying to figure it out, but this metaprogramming is pretty far over my head.

mcabbott (Owner) commented

Oh, that's embarrassing: my @matmul example seems to hit a bug. And it ought not to work anyway, as @matmul is meant to support only single * calls with no batch indices, which I should have remembered... I will take a look.

The times don't seem crazy to me.

Another thing you might try is OMEinsum.jl; for this contraction it will call permutedims three times to line up the CuBLAS call... On my laptop this is still quicker, but I don't have a working GPU at the moment.

julia> using OMEinsum
julia> ENV["JULIA_DEBUG"] = OMEinsum;
julia> A = rand(1:99.0,2,2,2); B = rand(1:99.0, 2,2,2);

julia> @ein C3[b,a,d] := A[a,b,c] * B[a,c,d]
┌ Debug: BatchedContract
│   ixs => iy = ((1, 2, 3), (1, 3, 4)) => (2, 1, 4)
┌ Debug: conditioned_permutedims
┌ Debug: conditioned_permutedims
┌ Debug: conditioned_permutedims
2×2×2 Array{Float64, 3}:
[:, :, 1] =
 3313.0  6593.0
 4202.0  4184.0
...
julia> ENV["JULIA_DEBUG"] = "none";

MacKenzieHnC commented Oct 26, 2020

Just tried OMEinsum. It runs into scalar operations. I didn't finish the big test because it wasn't even close.

Small test:

BenchmarkTools.Trial:
  memory estimate:  4.84 KiB
  allocs estimate:  56
  --------------
  minimum time:     1.720 ms (0.00% GC)
  median time:      1.739 ms (0.00% GC)
  mean time:        1.774 ms (0.00% GC)
  maximum time:     2.989 ms (0.00% GC)
  --------------
  samples:          2818
  evals/sample:     1

I don't know what would be necessary to implement this solution in general:

begin
    local raven = orient(permutedims(A, (2, 1, 3)), (:, :, *, :))
    local deer = orient(permutedims(B, (1, 3, 2)), (*, :, :, :))
    C = dropdims(sum(@__dot__(raven * deer), dims = 4), dims = 4)
end

But it's looking like a clear winner. Are there other things you want me to run? I'm happy to test any code you send my way.

mcabbott (Owner) commented

It might be worth opening an issue with OMEinsum.jl; I see Leo was working to update CUDA things recently.

I'm out of ideas for now; it just seems a bit crazy that the best way to do this is broadcasting. For matrix multiplication on the CPU, this broadcast-and-sum approach is 50 times slower than BLAS. And many operations involving permutedims are dominated by that cost.
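
As a rough illustration of that gap (a sketch; the exact ratio varies with size and machine):

using BenchmarkTools

A = rand(100, 100); B = rand(100, 100)
@btime $A * $B                              # BLAS matmul
@btime dropdims(sum(reshape($A, 100, 100, 1) .* reshape($B, 1, 100, 100); dims = 2); dims = 2)  # same result via broadcast-and-sum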

mcabbott added the gpu label on Sep 3, 2022

mcabbott commented Sep 3, 2022

All the examples from the first message run without scalar indexing now, as of TensorCast v0.4.5, which uses TransmuteDims (#31). The expansions shown above have become:

julia> @pretty @reduce C[a,b,d] := sum(c) A[a,b,c] * B[a,c,d] # Fast
begin
    local spider = transmute(B, Val((1, nothing, 2, 3)))
    C = dropdims(sum(@__dot__(A * spider), dims = 3), dims = 3)
end

julia> @pretty @reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d] # Not fast
begin
    local spider = transmute(A, Val((2, 1, nothing, 3)))
    local tapir = transmute(B, Val((nothing, 1, 3, 2)))
    C = dropdims(sum(@__dot__(spider * tapir), dims = 4), dims = 4)
end
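
A quick check of the originally failing case (TensorCast ≥ 0.4.5 with CUDA.jl):

using TensorCast, CUDA
CUDA.allowscalar(false)
A = cu(rand(1, 2, 3)); B = cu(rand(1, 3, 4))
@reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d]   # no scalar indexing error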

So I think this can be closed.
