
Application for transposed matrix-matrix product: wrong answer #10

Closed
PetrKryslUCSD opened this issue Jan 4, 2020 · 1 comment

@PetrKryslUCSD
This function doesn't give the correct answer (but works when @avx is removed):

function mulCAtB!(C, A, B)
    M, N = size(C); K = size(B,1)
    @assert size(C, 1) == size(A, 2)
    @assert size(C, 2) == size(B, 2)
    @assert size(A, 1) == size(B, 1)
    if mod(M, 2) == 0 && mod(N, 2) == 0
        for m ∈ 1:2:M
            m1 = m + 1
            for n ∈ 1:2:N
                n1 = n + 1
                C11, C21, C12, C22 = 0.0, 0.0, 0.0, 0.0
                @avx for k ∈ 1:K
                    C11 += A[k,m] * B[k,n]
                    C21 += A[k,m1] * B[k,n]
                    C12 += A[k,m] * B[k,n1]
                    C22 += A[k,m1] * B[k,n1]
                end
                C[m,n] = C11
                C[m1,n] = C21
                C[m,n1] = C12
                C[m1,n1] = C22
            end
        end
    else
        @inbounds for n ∈ 1:N, m ∈ 1:M
            Cmn = 0.0
            @inbounds for k ∈ 1:K
                Cmn += A[k,m] * B[k,n]
            end
            C[m,n] = Cmn
        end
    end
    return C
end
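For reference when checking the answer, a plain triple-loop version without `@avx` gives the expected result unambiguously. This is a sketch with a hypothetical name (`mulCAtB_ref!`), not part of the original report:

```julia
# Plain reference loop for C = A' * B, with A stored so that k is the
# first index (as in mulCAtB! above). No @avx, so it's slow but correct.
function mulCAtB_ref!(C, A, B)
    M, N = size(C); K = size(A, 1)
    fill!(C, zero(eltype(C)))
    @inbounds for n ∈ 1:N, m ∈ 1:M, k ∈ 1:K
        C[m, n] += A[k, m] * B[k, n]
    end
    return C
end
```

Comparing its output against the `@avx` version isolates whether the macro, rather than the loop logic, produces the wrong answer.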
@chriselrod (Member) commented Jan 4, 2020

Thanks, I fixed the problem, added it to the tests, and created a new release.
It should work starting with version 0.2.3.

For what it's worth, what you're doing there is similar to what @avx does when applied to those three loops, but @avx should be faster.

using LoopVectorization, Test, BenchmarkTools
M = K = N = 100
T = Float64
C1 = Matrix{T}(undef, M, N); C2 = similar(C1); C3 = similar(C1);
A = rand(M, K); At = copy(A');
B = rand(K, N);

function mulCAtB_2x2block!(C, A, B)
    M, N = size(C); K = size(B,1)
    @assert size(C, 1) == size(A, 2)
    @assert size(C, 2) == size(B, 2)
    @assert size(A, 1) == size(B, 1)
    T = eltype(C)
    if mod(M, 2) == 0 && mod(N, 2) == 0
        for m ∈ 1:2:M
            m1 = m + 1
            for n ∈ 1:2:N
                n1 = n + 1
                C11, C21, C12, C22 = zero(T), zero(T), zero(T), zero(T)
                @avx for k ∈ 1:K
                    C11 += A[k,m] * B[k,n] 
                    C21 += A[k,m1] * B[k,n] 
                    C12 += A[k,m] * B[k,n1] 
                    C22 += A[k,m1] * B[k,n1]
                end
                C[m,n] = C11
                C[m1,n] = C21
                C[m,n1] = C12
                C[m1,n1] = C22
            end
        end
    else
        @inbounds for n ∈ 1:N, m ∈ 1:M
            Cmn = 0.0
            @inbounds for k ∈ 1:K
                Cmn += A[k,m] * B[k,n]
            end
            C[m,n] = Cmn
        end
    end
    return C
end
function AtmulBavx!(C, A, B)
    @avx for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,1)
            Cᵢⱼ += A[k,m] * B[k,n]
        end
        C[m,n] = Cᵢⱼ
    end
end
function AmulBavx!(C, A, B)
    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[m,k] * B[k,n]
        end
        C[m,n] = Cᵢⱼ
    end
end

mulCAtB_2x2block!(C1, At, B);
AtmulBavx!(C2, At, B);
AmulBavx!(C3, A, B);
C4 = At' * B;
@test C1 ≈ C4
@test C2 ≈ C4
@test C3 ≈ C4

I'm running these benchmarks on the latest (not yet released) master, where I did some ad hoc fiddling with how it chooses block sizes to get AtmulBavx! to be faster than mulCAtB_2x2block!.

Please let me know how the two perform on your computer. I may need to adjust it for AVX (instruction set) computers as well.
Performance is still poor compared to MKL and the A-not-transposed case, as you can see below:

julia> using LinearAlgebra; BLAS.set_num_threads(1)

julia> @benchmark mulCAtB_2x2block!($C1, $At, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     63.729 μs (0.00% GC)
  median time:      65.949 μs (0.00% GC)
  mean time:        67.540 μs (0.00% GC)
  maximum time:     110.972 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark AtmulBavx!($C2, $At, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     40.664 μs (0.00% GC)
  median time:      42.035 μs (0.00% GC)
  mean time:        42.846 μs (0.00% GC)
  maximum time:     74.588 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark AmulBavx!($C2, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.863 μs (0.00% GC)
  median time:      26.176 μs (0.00% GC)
  mean time:        26.578 μs (0.00% GC)
  maximum time:     66.092 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C3, $At', $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.489 μs (0.00% GC)
  median time:      22.154 μs (0.00% GC)
  mean time:        22.415 μs (0.00% GC)
  maximum time:     70.543 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C3, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.007 μs (0.00% GC)
  median time:      20.642 μs (0.00% GC)
  mean time:        20.932 μs (0.00% GC)
  maximum time:     51.647 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

It's twice as slow as MKL, and 60% slower than @avx when A isn't transposed.

To figure out what @avx is doing (aside from the obvious macroexpand), it can also help to:

julia> AtmulBq = :(for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
           Cᵢⱼ = zero(eltype(C))
           for k ∈ 1:size(A,1)
               Cᵢⱼ += A[k,m] * B[k,n]
           end
           C[m,n] = Cᵢⱼ
       end);

julia> lsAtmulB = LoopVectorization.LoopSet(AtmulBq);

julia> LoopVectorization.choose_order(lsAtmulB)
([:m, :n, :k], :k, 4, 4)

This essentially means that the loop order is:

for mouter in 1:4:M # the 4 here is the second 4
    for nouter in 1:4:N # this 4 is the first 4
        # unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
        for k in 1:K
            # unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
        end
        # unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
    end
end
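Written out by hand, that order corresponds to something like the following. This is a sketch with a hypothetical name (`mulCAtB_4x4block!`): it assumes M and N are multiples of 4, and it uses a small accumulator array where `@avx` would instead keep all 16 accumulators in registers and also vectorize the k loop:

```julia
# Hand-written 4x4-tiled C = A' * B, mirroring the loop order @avx chose.
# Assumes size(C) == (M, N) with M and N both multiples of 4.
function mulCAtB_4x4block!(C, A, B)
    M, N = size(C); K = size(A, 1)
    @inbounds for m ∈ 1:4:M, n ∈ 1:4:N
        acc = zeros(eltype(C), 4, 4)   # one accumulator per (m, n) pair in the tile
        for k ∈ 1:K, j ∈ 0:3, i ∈ 0:3
            acc[i+1, j+1] += A[k, m+i] * B[k, n+j]
        end
        for j ∈ 0:3, i ∈ 0:3
            C[m+i, n+j] = acc[i+1, j+1]
        end
    end
    return C
end
```

The per-tile allocation here is only for readability; keeping the accumulators in registers is a large part of why the real generated code is faster.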

So the difference between your code in this issue and applying @avx to the outer loops is that the former uses a 2x2 tile, while @avx chose a 4x4 tile.
The second `:k` (the one outside the vector) indicates the vectorized loop.

On your computer, @avx will probably choose to use a 3x4 tile.
Feel free to try out different tile sizes with AtmulBavx! by manually redefining:

LoopVectorization.choose_order(::LoopSet) = ([:m,:n,:k], :k, 2, 4)

and then redefining AtmulBavx! (without redefining AtmulBavx! it won't update, like we're back in Julia 0.5).

Note that after replacing `choose_order` with the above, you'd have to restart your Julia session (or manually revert to the old definition) before using any function with @avx that doesn't have exactly three loops indexed by `:m`, `:n`, and `:k`.

This issue was closed.