-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Application for transposed matrix-matrix product: wrong answer #10
Comments
Thanks, I fixed the problem, added it to the tests, and created a new release. For what it's worth, what you're doing there is similar to what using LoopVectorization, Test, BenchmarkTools
M = K = N = 100
T = Float64
C1 = Matrix{T}(undef, M, N); C2 = similar(C1); C3 = similar(C1);
A = rand(M, K); At = copy(A');
B = rand(K, N);
function mulCAtB_2x2block!(C, A, B)
M, N = size(C); K = size(B,1)
@assert size(C, 1) == size(A, 2)
@assert size(C, 2) == size(B, 2)
@assert size(A, 1) == size(B, 1)
T = eltype(C)
if mod(M, 2) == 0 && mod(N, 2) == 0
for m ∈ 1:2:M
m1 = m + 1
for n ∈ 1:2:N
n1 = n + 1
C11, C21, C12, C22 = zero(T), zero(T), zero(T), zero(T)
@avx for k ∈ 1:K
C11 += A[k,m] * B[k,n]
C21 += A[k,m1] * B[k,n]
C12 += A[k,m] * B[k,n1]
C22 += A[k,m1] * B[k,n1]
end
C[m,n] = C11
C[m1,n] = C21
C[m,n1] = C12
C[m1,n1] = C22
end
end
else
@inbounds for n ∈ 1:N, m ∈ 1:M
Cmn = 0.0
@inbounds for k ∈ 1:K
Cmn += A[k,m] * B[k,n]
end
C[m,n] = Cmn
end
end
return C
end
function AtmulBavx!(C, A, B)
@avx for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
Cᵢⱼ = zero(eltype(C))
for k ∈ 1:size(A,1)
Cᵢⱼ += A[k,m] * B[k,n]
end
C[m,n] = Cᵢⱼ
end
end
function AmulBavx!(C, A, B)
@avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
Cᵢⱼ = zero(eltype(C))
for k ∈ 1:size(A,2)
Cᵢⱼ += A[m,k] * B[k,n]
end
C[m,n] = Cᵢⱼ
end
end
mulCAtB_2x2block!(C1, At, B);
AtmulBavx!(C2, At, B);
AmulBavx!(C3, A, B);
C4 = At' * B;
@test C1 ≈ C4
@test C2 ≈ C4
@test C3 ≈ C4 I'm running these benchmarks on the latest (not yet released) master, where I did some ad hoc fiddling with how it chooses block sizes to get Please let me know how the two perform on your computer. I may need to adjust it for julia> using LinearAlgebra; BLAS.set_num_threads(1)
julia> @benchmark mulCAtB_2x2block!($C1, $At, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 63.729 μs (0.00% GC)
median time: 65.949 μs (0.00% GC)
mean time: 67.540 μs (0.00% GC)
maximum time: 110.972 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark AtmulBavx!($C2, $At, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 40.664 μs (0.00% GC)
median time: 42.035 μs (0.00% GC)
mean time: 42.846 μs (0.00% GC)
maximum time: 74.588 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark AmulBavx!($C2, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 25.863 μs (0.00% GC)
median time: 26.176 μs (0.00% GC)
mean time: 26.578 μs (0.00% GC)
maximum time: 66.092 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($C3, $At', $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 21.489 μs (0.00% GC)
median time: 22.154 μs (0.00% GC)
mean time: 22.415 μs (0.00% GC)
maximum time: 70.543 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($C3, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.007 μs (0.00% GC)
median time: 20.642 μs (0.00% GC)
mean time: 20.932 μs (0.00% GC)
maximum time: 51.647 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1 It's twice as slow as MKL, and 60% slower than To figure out what julia> AtmulBq = :(for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
Cᵢⱼ = zero(eltype(C))
for k ∈ 1:size(A,1)
Cᵢⱼ += A[k,m] * B[k,n]
end
C[m,n] = Cᵢⱼ
end);
julia> lsAtmulB = LoopVectorization.LoopSet(AtmulBq);
julia> LoopVectorization.choose_order(lsAtmulB)
([:m, :n, :k], :k, 4, 4) This essentially means that the loop order is: for mouter in 1:4:M # the 4 here is the second 4
for nouter in 1:4:N # this 4 is the first 4
# unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
for k in 1:K
# unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
end
# unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
end
end So the difference between your code in this issue and On your computer, LoopVectorization.choose_order(::LoopSet) = ([:m,:n,:k], :k, 2, 4) and then redefining Note that you'd have to restart your Julia session (or manually revert to the old definition) if you're using any function with |
This function doesn't give the correct answer (but works when
@avx
is removed):The text was updated successfully, but these errors were encountered: