Reduce compile time for generic matmatmul #52038
Conversation
stdlib/LinearAlgebra/src/matmul.jl (Outdated)

```julia
@inbounds for i in AxM, j in BxN
    z2 = zero(A[i, a1]*B[b1, j] + A[i, a1]*B[b1, j])
    Ctmp = convert(promote_type(R, typeof(z2)), z2)
    for k in AxK
        Ctmp += A[i, k]*B[k, j]
```
LLVM's default loop vectorizer isn't smart enough to optimize this effectively, so the `Float64` performance of this will probably look worse than what I shared.
On the other hand, if `sizeof(eltype(C))` is large and the eltype is itself SIMD-able, then this order will probably perform better than re-loading and re-storing on every iteration of the innermost loop, so I think this is fine.
I chose the order I did to ensure a decisive win in the `Float64` benchmark (to say "the tiling is really bad"), but obviously this code isn't going to run with `Float64` often; `ForwardDiff.Dual`, however, may be common and is the eltype I'm actually interested in.
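For concreteness, a minimal sketch contrasting the two loop orderings being discussed (illustrative only, not the PR's code; the array setup is made up):

```julia
# Illustration only: two loop orders for C = A*B with a generic eltype.
A = rand(4, 4); B = rand(4, 4); C = zeros(4, 4);

# Accumulator order: the reduction index k is innermost and C[i, j] is built up in a
# local accumulator, so C is not re-loaded and re-stored on every innermost iteration.
# This tends to win when eltype(C) is wide but still SIMD-able (e.g. ForwardDiff.Dual).
for i in axes(A, 1), j in axes(B, 2)
    acc = zero(eltype(C))
    for k in axes(A, 2)
        acc = muladd(A[i, k], B[k, j], acc)
    end
    C[i, j] = acc
end

# Update order: k is an outer loop and the innermost loop walks down columns of A and C.
# LLVM's loop vectorizer handles this pattern well for Float64 (assumes C starts zeroed).
fill!(C, 0.0)
for n in axes(B, 2), k in axes(B, 1), m in axes(A, 1)
    C[m, n] = muladd(A[m, k], B[k, n], C[m, n])
end
```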
Yes, these are details that we can figure out in the process ("under the hood"). For now I basically kept the old order, as in the `'N'`-`'N'` case.
This looks great! Are all the constprop annotations afterwards still needed, now that the characters are introduced lower down?
Essentially, this removes the tiling stuff, and (in the generic case only) the redirections to the 2x2 and 3x3 functions. I assume that compiling those added a lot to the compile times, with all the explicit wrapper branches. It would be interesting to see whether the aggressive constant propagation is really helpful, or whether it just forces the compiler to work harder for little gain. At runtime, those checks should not dominate timings, so I guess it's not so important for runtime performance to "compile branches away".
Thanks! Without the tiling, the 2x2 and 3x3 special cases are also less important. I'll try this PR out tonight.
The characters are actually introduced higher up. They allow us to split potential wrappers from the underlying storage arrays. That is where array packages hook in and dispatch on the storage array, just like the BLAS path dispatches on strided arrays with `BlasFloat` eltypes. Now, for the really generic case we rewrap immediately, once no other method has caught the call. So I guess we would like the compiler to understand that after unwrapping, one call away, we are rewrapping again (and in which wrapper exactly!) in the generic case. But I have no idea whether that can be achieved by constant propagation, so we may need to turn things on and off and see what happens.
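A rough sketch of that wrapper-splitting pattern, with hypothetical names (`unwrap`, `rewrap`, `generic_mul!`, and `my_mul!` are illustrative, not the actual matmul.jl internals): the wrapper is split into a character plus the storage array, specialized methods can dispatch on the storage type, and the fully generic fallback rewraps before running the naive loops.

```julia
using LinearAlgebra

# Hypothetical sketch of splitting a lazy wrapper into (character, storage array).
unwrap(A::Transpose) = ('T', parent(A))
unwrap(A::Adjoint)   = ('C', parent(A))
unwrap(A)            = ('N', A)

# The generic fallback rewraps once no more specialized method has caught the call.
rewrap(tA::Char, A) = tA == 'T' ? transpose(A) : tA == 'C' ? adjoint(A) : A

function generic_mul!(C, tA, A, tB, B)
    A2, B2 = rewrap(tA, A), rewrap(tB, B)
    fill!(C, zero(eltype(C)))
    for n in axes(B2, 2), k in axes(B2, 1), m in axes(A2, 1)
        C[m, n] = muladd(A2[m, k], B2[k, n], C[m, n])
    end
    return C
end

function my_mul!(C, A, B)
    tA, Astor = unwrap(A)
    tB, Bstor = unwrap(B)
    # A package (or a BLAS path) could intercept here by dispatching on typeof(Astor)/typeof(Bstor).
    return generic_mul!(C, tA, Astor, tB, Bstor)
end
```

The point is that a package can add methods for its own storage type at the unwrapped level, while the fully generic fallback rewraps and runs the naive loops.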
I think this looks pretty good with respect to compile time when using

```julia
using LinearAlgebra, BenchmarkTools
using LinearAlgebra: @lazy_str
function _generic_matmatmuladd!(C, A, B)
AxM = axes(A, 1)
AxK = axes(A, 2) # we use two `axes` calls in case of `AbstractVector`
BxK = axes(B, 1)
BxN = axes(B, 2)
CxM = axes(C, 1)
CxN = axes(C, 2)
if AxM != CxM
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix C has axes ($CxM,$CxN)"))
end
if AxK != BxK
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix B has axes ($BxK,$CxN)"))
end
if BxN != CxN
throw(DimensionMismatch(lazy"matrix B has axes ($BxK,$BxN), matrix C has axes ($CxM,$CxN)"))
end
for n = BxN, k = BxK, m = AxM
C[m,n] = muladd(A[m,k], B[k,n], C[m,n])
end
return C
end
function _generic_matmatmul!(C, A, B)
_generic_matmatmuladd!(fill!(C, zero(eltype(C))), A, B)
end
```

in the REPL. Running loops like

```julia
d(x, n) = ForwardDiff.Dual(x, ntuple(_ -> randn(), n))
function dualify(A, n, j)
if n > 0
A = d.(A, n)
if (j > 0)
A = d.(A, j)
end
end
A
end
@time for n = 0:8, j = (n!=0):4
A = dualify.(randn(5,5), n, j);
B = dualify.(randn(5,5), n, j);
C = similar(A);
mul!(C, A, B);
mul!(C, A', B);
mul!(C, A, B');
mul!(C, A', B');
mul!(C, transpose(A), B);
mul!(C, A, transpose(B));
mul!(C, transpose(A), transpose(B));
end
# or (not in the same Julia session!)
@time for n = 0:8, j = (n!=0):4
A = dualify.(randn(5,5), n, j);
B = dualify.(randn(5,5), n, j);
C = similar(A);
mul!(C, A, B);
end
```

I get, for

All the permutations:

So this is comparable to my PR and a substantial improvement over master.
Runtime performance, however, is worse for this example:

```julia
julia> B = dualify.(randn(5,5), 8, 2);
julia> A = dualify.(randn(5,5), 8, 2);
julia> C = similar(A);
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.632 μs … 6.157 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.647 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.654 μs ± 46.285 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▇██▇▇▅▂ ▁ ▂
█████████▆▅▃▁▁▁▁▃▃▁▁▁▁▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▃▁▄▆▆▇█████▇▆▆▇ █
4.63 μs Histogram: log(frequency) by time 4.83 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark _generic_matmatmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.248 μs … 3.293 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.258 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.261 μs ± 32.042 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▇█
▂▃▆███▆▃▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂ ▂
2.25 μs Histogram: frequency by time 2.37 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
```

Still much better than master, of course:

```julia
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
Range (min … max): 5.692 μs … 2.369 ms ┊ GC (min … max): 0.00% … 97.50%
Time (median): 6.190 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.318 μs ± 46.745 μs ┊ GC (mean ± σ): 13.28% ± 2.38%
▅█▆▄▃▂▂▁ ▁▂▄▄▄▃▂ ▂
▅▇█████████▇▆▆▅▄▁▁▁▁▄▁▄▁▃▃▄▄▃▁▁▃▁▃█████████▇▅▄▄▃▁▁▁▅▃▅▇▇██ █
5.69 μs Histogram: log(frequency) by time 13 μs <
Memory estimate: 20.92 KiB, allocs estimate: 7.
```
Here, making the reduction the inner loop actually helps (as I said earlier):

```julia
julia> function mulreduceinnerloop!(C, A, B)
AxM = axes(A, 1)
AxK = axes(A, 2) # we use two `axes` calls in case of `AbstractVector`
BxK = axes(B, 1)
BxN = axes(B, 2)
CxM = axes(C, 1)
CxN = axes(C, 2)
if AxM != CxM
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix C has axes ($CxM,$CxN)"))
end
if AxK != BxK
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix B has axes ($BxK,$CxN)"))
end
if BxN != CxN
throw(DimensionMismatch(lazy"matrix B has axes ($BxK,$BxN), matrix C has axes ($CxM,$CxN)"))
end
@inbounds for n = BxN, m = AxM
Cmn = zero(eltype(C))
for k = BxK
Cmn = muladd(A[m,k], B[k,n], Cmn)
end
C[m,n] = Cmn
end
return C
end
mulreduceinnerloop! (generic function with 1 method)
julia> B = dualify.(randn(5,5), 8, 2);
julia> A = dualify.(randn(5,5), 8, 2);
julia> C = similar(A);
julia> @benchmark _generic_matmatmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.243 μs … 2.569 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.249 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.252 μs ± 17.621 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▅▇█▇▃ ▂
██████▇▆▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▇██▆ █
2.24 μs Histogram: log(frequency) by time 2.36 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.657 μs … 5.177 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.682 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.688 μs ± 36.267 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▅▆▇███▆▄▂ ▁ ▂
▅███████████▇▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▄▄▅▆▇▇█████▇▆▇ █
4.66 μs Histogram: log(frequency) by time 4.86 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mulreduceinnerloop!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.131 μs … 1.454 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.138 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.141 μs ± 14.890 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█
▂▂▄██▆▄▄▅▃▂▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
1.13 μs Histogram: frequency by time 1.23 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
```

So the overhead is coming from elsewhere. These benchmarks were done on the PR branch, of course.
It comes from not using `muladd`:

```julia
julia> @noinline function LinearAlgebra._generic_matmatmul!(C::AbstractVecOrMat{R}, A::AbstractVecOrMat{T}, B::AbstractVecOrMat{S},
_add::LinearAlgebra.MulAddMul) where {T,S,R}
AxM = axes(A, 1)
AxK = axes(A, 2) # we use two `axes` calls in case of `AbstractVector`
BxK = axes(B, 1)
BxN = axes(B, 2)
CxM = axes(C, 1)
CxN = axes(C, 2)
if AxM != CxM
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix C has axes ($CxM,$CxN)"))
end
if AxK != BxK
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix B has axes ($BxK,$CxN)"))
end
if BxN != CxN
throw(DimensionMismatch(lazy"matrix B has axes ($BxK,$BxN), matrix C has axes ($CxM,$CxN)"))
end
if iszero(_add.alpha) || isempty(A) || isempty(B)
return LinearAlgebra._rmul_or_fill!(C, _add.beta)
end
a1 = first(AxK)
b1 = first(BxK)
@inbounds for i in AxM, j in BxN
z2 = zero(A[i, a1]*B[b1, j] + A[i, a1]*B[b1, j])
Ctmp = convert(promote_type(R, typeof(z2)), z2)
for k in AxK
Ctmp = muladd(A[i, k], B[k, j], Ctmp)
end
LinearAlgebra._modify!(_add, Ctmp, C, (i,j))
end
return C
end
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.092 μs … 1.376 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.098 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.099 μs ± 11.850 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂
▂▆▄██▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
1.09 μs Histogram: frequency by time 1.19 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
```
Co-authored-by: Chris Elrod <[email protected]>
Thanks @chriselrod! I'll need to investigate those two super specific errors and fix them, and then we should perhaps launch a pkgeval run. What do you think about the necessity of the aggressive constant propagation? It seems like even with it, compile times were pretty good.
Hm,
Do we want
Part of the hope is that it could help compile times by cutting out dead code that doesn't need to be compiled. But I haven't looked into it/checked if it's even working. I'll do that tonight.
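For context, a minimal sketch of the kind of annotation being discussed (a hypothetical function, not the actual matmul code): with aggressive constant propagation, a branch on a compile-time-constant argument such as `beta = false` can be folded away, so the dead branch never has to be compiled into the caller.

```julia
# Hypothetical example: when scale_or_zero!(C, false) is called with a literal `false`,
# aggressive constprop lets the compiler fold the `iszero(beta)` branch and drop the `else` arm.
Base.@constprop :aggressive function scale_or_zero!(C, beta)
    if iszero(beta)
        fill!(C, zero(eltype(C)))
    else
        C .*= beta
    end
    return C
end
```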
Sounds like we're missing some
IMO the consistency isn't a problem. matmul generally isn't consistent.
@dkarrasch, re unitful

```julia
julia> using BenchmarkTools
julia> A = rand(5,4); x = rand(4);
julia> @btime muladd($A, $x, 3.4)'
99.130 ns (1 allocation: 96 bytes)
1×5 adjoint(::Vector{Float64}) with eltype Float64:
 3.7513  3.91565  4.04009  3.66507  3.7944
```

Doing it the naive way gives us the expected behavior of only allocating the result, without a promotion. But how can

```julia
julia> y' * x + 3.4
4.004396984735307
julia> muladd(y', x, 3.4)
4.004396984735307
```

also work as expected? But I do see there are
```julia
julia> @code_typed mul!(Cdd, Add, Bdd)
CodeInfo(
1 ─ %1 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, A::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, B::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└── return %1
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
julia> @code_typed mul!(Cdd, Add', Bdd)
CodeInfo(
1 ─ %1 = Base.getfield(A, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
│ %2 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %1)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│ %3 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, %2::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, B::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└── return %3
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
julia> @code_typed mul!(Cdd, Add', Bdd')
CodeInfo(
1 ─ %1 = Base.getfield(A, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
│ %2 = Base.getfield(B, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
│ %3 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %1)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│ %4 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %2)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│ %5 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, %3::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %4::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└── return %5
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
julia> @code_typed mul!(Cdd, Add, Bdd')
CodeInfo(
1 ─ %1 = Base.getfield(B, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
│ %2 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %1)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│ %3 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, A::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, %2::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└── return %3
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
```

Looks like everything is compiling away successfully.
Here is a reproducer of those test failures:

```julia
julia> B = Bidiagonal([-8041718734995066674, -7402188931680778461], [-4293547541337790375], :L)
2×2 Bidiagonal{Int64, Vector{Int64}}:
-8041718734995066674 ⋅
-4293547541337790375 -7402188931680778461
julia> A=Bidiagonal([-9007632524281690832, -8423219277671072315], [6624006889975070404], :L)
2×2 Bidiagonal{Int64, Vector{Int64}}:
-9007632524281690832 ⋅
6624006889975070404 -8423219277671072315
julia> C = randn(2,2);
julia> mul!(C, A, B);
julia> (Array(A) * Array(B) .- C) ./ norm(C)
2×2 Matrix{Float64}:
-0.746052 0.0
0.176149 -0.642167
julia> (Float64.(Array(A)) * Float64.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{Float64}:
0.0 -0.0
0.0 0.0
julia> (big.(Array(A)) * big.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{BigFloat}:
-7.93613e-17 0.0
 -6.38913e-18   2.01747e-17
```

while on Julia master, I get

```julia
julia> (Array(A) * Array(B) .- C) ./ norm(C)
2×2 Matrix{Float64}:
0.0 0.0
0.0 0.0
julia> (Float64.(Array(A)) * Float64.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{Float64}:
7.78132e18 -0.0
-1.83723e18 6.6978e18
julia> (big.(Array(A)) * big.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{BigFloat}:
7.78132e+18 0.0
 -1.83723e+18   6.6978e+18
```

I'd suggest using

```julia
julia> Base.mul_with_overflow(A[1,1], B[1,1])
(5263619963498179744, true)
```

sounds like a bonus. If someone wants modular integer arithmetic, shouldn't they be writing to an integer-valued destination?
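To make the arithmetic concrete, a small worked example using the entries from the reproducer above (an illustration, not part of the PR): the `Int64` product wraps around to exactly the value `mul_with_overflow` reports, while promoting to `Float64` before multiplying preserves the true magnitude.

```julia
# Worked example with the entries from the reproducer above.
a = -9007632524281690832   # A[1,1]
b = -8041718734995066674   # B[1,1]

a * b                    # 5263619963498179744: wrapped Int64 product, matching mul_with_overflow above
Float64(a) * Float64(b)  # ≈ 7.24e37: what a promoting computation accumulates into the Float64 destination
```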
Thanks again @chriselrod! I agree that promoting the comparison values (not the input to

@nanosoldier
@nanosoldier
@nanosoldier
@nanosoldier
The package evaluation job you requested has completed - possible new issues were detected.
@nanosoldier @nanosoldier
We may need to consider backporting this to v1.10. #51961 seems to be requested for SparseArrays-related reasons, but that would make compile times for generic matmatmul even worse, without any (runtime) performance benefit.
We may rebase the other PR after this is merged, and only add constprop to methods where there's a distinct improvement.
Your benchmark job has completed - no performance regressions were detected. A full report can be found here.
The package evaluation job you requested has completed - possible new issues were detected.
So far I have copied all instances of aggressive constant propagation from your PR. I thought we may wish to have them in v1.10 because of the SparseArrays regression. Or is it still unclear what's causing it?
This PR is about matmatmul, whereas the
Yes, I see I skipped the matvec-mul-related annotations. However, it's not constant propagation that's causing the issue there. The lack of constant propagation has little to no effect on runtime.
Looks great to me.
@nanosoldier
The package evaluation job you requested has completed - possible new issues were detected.
That is interesting! I tested a few more non-BLAS combinations.
@chriselrod Have you benchmarked your own use cases with the latest status of this PR? Should we change anything in view of the complex case discussed above? For quick reference, this is some benchmark code:

```julia
using LinearAlgebra, BenchmarkTools

n = 256;
C = zeros(ComplexF32, n, n);
A = randn(ComplexF64, n, n);
B = randn(ComplexF64, n, n);
for A in (A, transpose(A), adjoint(A)), B in (B, transpose(B), adjoint(B))
    @show typeof(A), typeof(B)
    @btime mul!($C, $A, $B)
end
```
I filed a few PRs to some failing packages. I have no idea what to do about Cthulhu.jl (it fails to precompile) or how it could be affected, and InteractiveErrors.jl is failing due to Cthulhu.
This is another attempt at improving the compile time issue with generic matmatmul, hopefully improving runtime performance also.
@chriselrod @jishnub
There seems to be a little typo/oversight somewhere, but it shows how it could work. Locally, this reduces benchmark times from #51812 (comment) by more than 50%.