
Application for transposed matrix-matrix product: wrong answer #10

Closed
PetrKryslUCSD opened this issue Jan 4, 2020 · 1 comment

@PetrKryslUCSD
This function doesn't give the correct answer (but works when @avx is removed):

function mulCAtB!(C, A, B)
    M, N = size(C); K = size(B,1)
    @assert size(C, 1) == size(A, 2)
    @assert size(C, 2) == size(B, 2)
    @assert size(A, 1) == size(B, 1)
    if mod(M, 2) == 0 && mod(N, 2) == 0
        for m ∈ 1:2:M
            m1 = m + 1
            for n ∈ 1:2:N
                n1 = n + 1
                C11, C21, C12, C22 = 0.0, 0.0, 0.0, 0.0
                @avx for k ∈ 1:K
                    C11 += A[k,m] * B[k,n]
                    C21 += A[k,m1] * B[k,n]
                    C12 += A[k,m] * B[k,n1]
                    C22 += A[k,m1] * B[k,n1]
                end
                C[m,n] = C11
                C[m1,n] = C21
                C[m,n1] = C12
                C[m1,n1] = C22
            end
        end
    else
        @inbounds for n ∈ 1:N, m ∈ 1:M
            Cmn = 0.0
            @inbounds for k ∈ 1:K
                Cmn += A[k,m] * B[k,n]
            end
            C[m,n] = Cmn
        end
    end
    return C
end
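For reference when checking the answer, a plain triple-loop version without `@avx` gives the expected result unambiguously. This is a sketch with a hypothetical name (`mulCAtB_ref!`), not part of the original report:

```julia
# Plain reference loop for C = A' * B, with A stored so that k is the
# first index (as in mulCAtB! above). No @avx, so it's slow but correct.
function mulCAtB_ref!(C, A, B)
    M, N = size(C); K = size(A, 1)
    fill!(C, zero(eltype(C)))
    @inbounds for n ∈ 1:N, m ∈ 1:M, k ∈ 1:K
        C[m, n] += A[k, m] * B[k, n]
    end
    return C
end
```

Comparing its output against the `@avx` version isolates whether the macro, rather than the loop logic, produces the wrong answer.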
@chriselrod (Member) commented Jan 4, 2020

Thanks, I fixed the problem, added it to the tests, and created a new release.
It should work starting with version 0.2.3.

For what it's worth, what you're doing there is similar to what @avx does when applied to those three loops, but @avx should be faster.

using LoopVectorization, Test, BenchmarkTools
M = K = N = 100
T = Float64
C1 = Matrix{T}(undef, M, N); C2 = similar(C1); C3 = similar(C1);
A = rand(M, K); At = copy(A');
B = rand(K, N);

function mulCAtB_2x2block!(C, A, B)
    M, N = size(C); K = size(B,1)
    @assert size(C, 1) == size(A, 2)
    @assert size(C, 2) == size(B, 2)
    @assert size(A, 1) == size(B, 1)
    T = eltype(C)
    if mod(M, 2) == 0 && mod(N, 2) == 0
        for m ∈ 1:2:M
            m1 = m + 1
            for n ∈ 1:2:N
                n1 = n + 1
                C11, C21, C12, C22 = zero(T), zero(T), zero(T), zero(T)
                @avx for k ∈ 1:K
                    C11 += A[k,m] * B[k,n] 
                    C21 += A[k,m1] * B[k,n] 
                    C12 += A[k,m] * B[k,n1] 
                    C22 += A[k,m1] * B[k,n1]
                end
                C[m,n] = C11
                C[m1,n] = C21
                C[m,n1] = C12
                C[m1,n1] = C22
            end
        end
    else
        @inbounds for n ∈ 1:N, m ∈ 1:M
            Cmn = 0.0
            @inbounds for k ∈ 1:K
                Cmn += A[k,m] * B[k,n]
            end
            C[m,n] = Cmn
        end
    end
    return C
end
function AtmulBavx!(C, A, B)
    @avx for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,1)
            Cᵢⱼ += A[k,m] * B[k,n]
        end
        C[m,n] = Cᵢⱼ
    end
end
function AmulBavx!(C, A, B)
    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[m,k] * B[k,n]
        end
        C[m,n] = Cᵢⱼ
    end
end

mulCAtB_2x2block!(C1, At, B);
AtmulBavx!(C2, At, B);
AmulBavx!(C3, A, B);
C4 = At' * B;
@test C1 ≈ C4
@test C2 ≈ C4
@test C3 ≈ C4

I'm running these benchmarks on the latest (not yet released) master, where I did some ad hoc fiddling with how it chooses block sizes to get AtmulBavx! to be faster than mulCAtB_2x2block!.

Please let me know how the two perform on your computer. I may need to adjust it for AVX (instruction set) computers as well.
Performance is still poor compared to MKL and the A-not-transposed case, as you can see below:

julia> using LinearAlgebra; BLAS.set_num_threads(1)

julia> @benchmark mulCAtB_2x2block!($C1, $At, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     63.729 μs (0.00% GC)
  median time:      65.949 μs (0.00% GC)
  mean time:        67.540 μs (0.00% GC)
  maximum time:     110.972 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark AtmulBavx!($C2, $At, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     40.664 μs (0.00% GC)
  median time:      42.035 μs (0.00% GC)
  mean time:        42.846 μs (0.00% GC)
  maximum time:     74.588 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark AmulBavx!($C2, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.863 μs (0.00% GC)
  median time:      26.176 μs (0.00% GC)
  mean time:        26.578 μs (0.00% GC)
  maximum time:     66.092 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C3, $At', $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.489 μs (0.00% GC)
  median time:      22.154 μs (0.00% GC)
  mean time:        22.415 μs (0.00% GC)
  maximum time:     70.543 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C3, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.007 μs (0.00% GC)
  median time:      20.642 μs (0.00% GC)
  mean time:        20.932 μs (0.00% GC)
  maximum time:     51.647 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

It's twice as slow as MKL, and 60% slower than @avx when A isn't transposed.

To figure out what @avx is doing (aside from the obvious macroexpand), it can also help to:

julia> AtmulBq = :(for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
           Cᵢⱼ = zero(eltype(C))
           for k ∈ 1:size(A,1)
               Cᵢⱼ += A[k,m] * B[k,n]
           end
           C[m,n] = Cᵢⱼ
       end);

julia> lsAtmulB = LoopVectorization.LoopSet(AtmulBq);

julia> LoopVectorization.choose_order(lsAtmulB)
([:m, :n, :k], :k, 4, 4)

This essentially means that the loop order is:

for mouter in 1:4:M # the 4 here is the second 4
    for nouter in 1:4:N # this 4 is the first 4
        # unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
        for k in 1:K
            # unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
        end
        # unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
    end
end
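Written out by hand, that order corresponds to something like the following. This is a sketch with a hypothetical name (`mulCAtB_4x4block!`): it assumes M and N are multiples of 4, and it uses a small accumulator array where `@avx` would instead keep all 16 accumulators in registers and also vectorize the k loop:

```julia
# Hand-written 4x4-tiled C = A' * B, mirroring the loop order @avx chose.
# Assumes size(C) == (M, N) with M and N both multiples of 4.
function mulCAtB_4x4block!(C, A, B)
    M, N = size(C); K = size(A, 1)
    @inbounds for m ∈ 1:4:M, n ∈ 1:4:N
        acc = zeros(eltype(C), 4, 4)   # one accumulator per (m, n) pair in the tile
        for k ∈ 1:K, j ∈ 0:3, i ∈ 0:3
            acc[i+1, j+1] += A[k, m+i] * B[k, n+j]
        end
        for j ∈ 0:3, i ∈ 0:3
            C[m+i, n+j] = acc[i+1, j+1]
        end
    end
    return C
end
```

The per-tile allocation here is only for readability; keeping the accumulators in registers is a large part of why the real generated code is faster.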

So the difference between your code in this issue and applying @avx to the outer loops is that the former uses a 2x2 tile, while @avx chose a 4x4 tile.
The second `:k` (the one outside the vector) indicates the vectorized loop.

On your computer, @avx will probably choose to use a 3x4 tile.
Feel free to try out different tile sizes with AtmulBavx! by manually redefining:

LoopVectorization.choose_order(::LoopSet) = ([:m,:n,:k], :k, 2, 4)

and then redefining AtmulBavx! (without redefining AtmulBavx! it won't update, like we're back in Julia 0.5).

Note that after replacing `choose_order` with the above, you'd have to restart your Julia session (or manually revert to the old definition) before using any function with @avx that doesn't have exactly three loops indexed by `:m`, `:n`, and `:k`.

This issue was closed.