Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce compile time for generic matmatmul #52038

Merged
merged 16 commits into from
Nov 14, 2023
Merged

Reduce compile time for generic matmatmul #52038

merged 16 commits into from
Nov 14, 2023

Conversation

dkarrasch
Copy link
Member

This is another attempt at improving the compile time issue with generic matmatmul, hopefully improving runtime performance also.

@chriselrod @jishnub

There seems to be a little typo/oversight somewhere, but it shows how it could work. Locally, this reduces benchmark times from #51812 (comment) by more than 50%.

Comment on lines 792 to 796
@inbounds for i in AxM, j in BxN
z2 = zero(A[i, a1]*B[b1, j] + A[i, a1]*B[b1, j])
Ctmp = convert(promote_type(R, typeof(z2)), z2)
for k in AxK
Ctmp += A[i, k]*B[k, j]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LLVM's default loop vectorize isn't smart enough to optimize this effectively, so the Float64 performance of this will probably look worse than what I shared.

On the other hand, if sizeof(eltype(C)) is large and itself SIMD-able, then this order will probably perform better than re-loading and re-storing on every iteration of the inner most loop, so I think this is fine.
I chose the order I did to ensure a decisive win in the Float64 benchmark (to say "the tiling is really bad"), but obviously, this code isn't going to run with Float64 often, but ForwardDiff.Dual may be common and is the eltype I'm actually interested in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these are details that we can figure out in the process ("under the hood"). For now I basically kept the old order, as in the 'N'-'N' case.

@oscardssmith
Copy link
Member

this looks great! are all the constprop afterwards still needed now that the characters are introduced lower down?

@dkarrasch
Copy link
Member Author

Essentially, this removes the tiling stuff, and (in the generic case only) redirections to the 2x2 and 3x3 functions. I assume that compiling those added a lot to the compile times, with all the explicit wrapper branches. It would be interesting to see whether the aggresive constant propagation is really helpful, or just forces the compiler to work harder for little gain. At runtime, those checks should not dominate timings, so I guess it's not so important for runtime performance to "compile branches away".

@chriselrod
Copy link
Contributor

Thanks!
Most of the runtime performance problem came from the unhelpful tiling.
Optimized BLAS routines do of course tile, but the old implementation did such a bad job it actually hurt performance, even for very large arrays (as the benchmark showed).
So the simplest improvement is of course to just cut out that tiling.

Without the tiling, the 2x2 and 3x3 special cases are also less important.

I'll try this PR out tonight.

@dkarrasch
Copy link
Member Author

this looks great! are all the constprop afterwards still needed now that the characters are introduced lower down?

The characters are actually introduced higher up. They allow to split potential wrappers from the underlying storage arrays. This is where array packages hook into and dispatch on the storage array, just like the BLAS one dispatches on strided arrays with BlasFloat eltypes. Now, for the really generic case we rewrap immediately, once no other method caught the call. So, I guess we would like the compiler to understand that after unwrapping, one call away we are rewrapping again (in which wrapper exactly!) in the generic case. But I have no idea if that can be achieved by constant propagation, so we may need to turn things on and off and see what happens.

@chriselrod
Copy link
Contributor

I think this looks pretty good w/ respect to compile time when using ForwardDiff.Dual.
I'm comparing the latest master, this PR, and defining

using LinearAlgebra, BenchmarkTools
using LinearAlgebra: @lazy_str
function _generic_matmatmuladd!(C, A, B)
    AxM = axes(A, 1)
    AxK = axes(A, 2) # we use two `axes` calls in case of `AbstractVector`
    BxK = axes(B, 1)
    BxN = axes(B, 2)
    CxM = axes(C, 1)
    CxN = axes(C, 2)
    if AxM != CxM
        throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix C has axes ($CxM,$CxN)"))
    end
    if AxK != BxK
        throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix B has axes ($BxK,$CxN)"))
    end
    if BxN != CxN
        throw(DimensionMismatch(lazy"matrix B has axes ($BxK,$BxN), matrix C has axes ($CxM,$CxN)"))
    end
    for n = BxN, k = BxK, m = AxM
        C[m,n] = muladd(A[m,k], B[k,n], C[m,n])
    end
    return C
end
function _generic_matmatmul!(C, A, B)
    _generic_matmatmuladd!(fill!(C, zero(eltype(C))), A, B)
end

in the REPL.

Running loops like

d(x, n) = ForwardDiff.Dual(x, ntuple(_ -> randn(), n))

function dualify(A, n, j)
  if n > 0
    A = d.(A, n)
    if (j > 0)
      A = d.(A, j)
    end
  end
  A
end

@time for n = 0:8, j = (n!=0):4
  A = dualify.(randn(5,5), n, j);
  B = dualify.(randn(5,5), n, j);
  C = similar(A);
  mul!(C, A, B);
  mul!(C, A', B);
  mul!(C, A, B');
  mul!(C, A', B');
  mul!(C, transpose(A), B);
  mul!(C, A, transpose(B));
  mul!(C, transpose(A), transpose(B));
end
# or (not in the same Julia session!)
@time for n = 0:8, j = (n!=0):4
  A = dualify.(randn(5,5), n, j);
  B = dualify.(randn(5,5), n, j);
  C = similar(A);
  mul!(C, A, B);
end

I get, for mul! only:

`_generic_matmatmul!`: 4.654779 seconds (6.47 M allocations: 342.166 MiB, 0.76% gc time, 99.69% compilation time)
PR: 6.865700 seconds (21.86 M allocations: 1.167 GiB, 1.94% gc time, 99.76% compilation time)
master: 36.816275 seconds (99.52 M allocations: 5.147 GiB, 1.76% gc time, 99.95% compilation time)

All the permutations:

`_generic_matmatmul!`: 14.575288 seconds (14.71 M allocations: 799.559 MiB, 0.57% gc time, 99.86% compilation time)
PR: 13.251875 seconds (31.26 M allocations: 1.693 GiB, 1.68% gc time, 99.83% compilation time)
master: 37.544679 seconds (99.53 M allocations: 5.147 GiB, 1.68% gc time, 99.95% compilation time)

So this is comparable to my PR and a substantial improvement over master.

@chriselrod
Copy link
Contributor

Runtime performance, however, is worse for this example:
PR:

julia> B = dualify.(randn(5,5), 8, 2);

julia> A = dualify.(randn(5,5), 8, 2);

julia> C = similar(A);

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min  max):  4.632 μs   6.157 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     4.647 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.654 μs ± 46.285 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▆▇██▇▇▅▂                                          ▁       ▂
  █████████▆▅▃▁▁▁▁▃▃▁▁▁▁▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▃▁▄▆▆▇█████▇▆▆▇ █
  4.63 μs      Histogram: log(frequency) by time     4.83 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark _generic_matmatmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min  max):  2.248 μs   3.293 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.258 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.261 μs ± 32.042 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂▇█                                                      
  ▂▃▆███▆▃▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂ ▂
  2.25 μs        Histogram: frequency by time        2.37 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Still much better than master, of course

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min  max):  5.692 μs   2.369 ms  ┊ GC (min  max):  0.00%  97.50%
 Time  (median):     6.190 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   8.318 μs ± 46.745 μs  ┊ GC (mean ± σ):  13.28% ±  2.38%

    ▅█▆▄▃▂▂▁                        ▁▂▄▄▄▃▂                  ▂
  ▅▇█████████▇▆▆▅▄▁▁▁▁▄▁▄▁▃▃▄▄▃▁▁▃▁▃█████████▇▅▄▄▃▁▁▁▅▃▅▇▇██ █
  5.69 μs      Histogram: log(frequency) by time       13 μs <

 Memory estimate: 20.92 KiB, allocs estimate: 7.

@chriselrod
Copy link
Contributor

chriselrod commented Nov 6, 2023

Here, making the reduction the inner loop actually helps (as I said earlier for Dual numbers):

julia> function mulreduceinnerloop!(C, A, B)
           AxM = axes(A, 1)
           AxK = axes(A, 2) # we use two `axes` calls in case of `AbstractVector`
           BxK = axes(B, 1)
           BxN = axes(B, 2)
           CxM = axes(C, 1)
           CxN = axes(C, 2)
           if AxM != CxM
               throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix C has axes ($CxM,$CxN)"))
           end
           if AxK != BxK
               throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix B has axes ($BxK,$CxN)"))
           end
           if BxN != CxN
               throw(DimensionMismatch(lazy"matrix B has axes ($BxK,$BxN), matrix C has axes ($CxM,$CxN)"))
           end
           @inbounds for n = BxN, m = AxM
               Cmn = zero(eltype(C))
               for k = BxK
                   Cmn = muladd(A[m,k], B[k,n], Cmn)
               end
               C[m,n] = Cmn
           end
           return C
       end
mulreduceinnerloop! (generic function with 1 method)

julia> B = dualify.(randn(5,5), 8, 2);

julia> A = dualify.(randn(5,5), 8, 2);

julia> C = similar(A);

julia> @benchmark _generic_matmatmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min  max):  2.243 μs   2.569 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.249 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.252 μs ± 17.621 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▇▃                                                     ▂
  ██████▇▆▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▇██▆ █
  2.24 μs      Histogram: log(frequency) by time     2.36 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min  max):  4.657 μs   5.177 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     4.682 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.688 μs ± 36.267 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▅▆▇███▆▄▂                                       ▁       ▂
  ▅███████████▇▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▄▄▅▆▇▇█████▇▆▇ █
  4.66 μs      Histogram: log(frequency) by time     4.86 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mulreduceinnerloop!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.131 μs   1.454 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.138 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.141 μs ± 14.890 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▇█                                                       
  ▂▂▄██▆▄▄▅▃▂▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
  1.13 μs        Histogram: frequency by time        1.23 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

so the overhead is coming from elsewhere. These benchmarks done on the PR branch, of course.

@chriselrod
Copy link
Contributor

It comes from not using muladd:

julia> @noinline function LinearAlgebra._generic_matmatmul!(C::AbstractVecOrMat{R}, A::AbstractVecOrMat{T}, B::AbstractVecOrMat{S},
                                    _add::LinearAlgebra.MulAddMul) where {T,S,R}
           AxM = axes(A, 1)
           AxK = axes(A, 2) # we use two `axes` calls in case of `AbstractVector`
           BxK = axes(B, 1)
           BxN = axes(B, 2)
           CxM = axes(C, 1)
           CxN = axes(C, 2)
           if AxM != CxM
               throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix C has axes ($CxM,$CxN)"))
           end
           if AxK != BxK
               throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix B has axes ($BxK,$CxN)"))
           end
           if BxN != CxN
               throw(DimensionMismatch(lazy"matrix B has axes ($BxK,$BxN), matrix C has axes ($CxM,$CxN)"))
           end
           if iszero(_add.alpha) || isempty(A) || isempty(B)
               return LinearAlgebra._rmul_or_fill!(C, _add.beta)
           end
           a1 = first(AxK)
           b1 = first(BxK)
           @inbounds for i in AxM, j in BxN
               z2 = zero(A[i, a1]*B[b1, j] + A[i, a1]*B[b1, j])
               Ctmp = convert(promote_type(R, typeof(z2)), z2)
               for k in AxK
                   Ctmp = muladd(A[i, k], B[k, j], Ctmp)
               end
               LinearAlgebra._modify!(_add, Ctmp, C, (i,j))
           end
           return C
       end

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.092 μs   1.376 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.098 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.099 μs ± 11.850 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █▂                                                       
  ▂▆▄██▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
  1.09 μs        Histogram: frequency by time        1.19 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@dkarrasch
Copy link
Member Author

Thanks @chriselrod! I'll need to investigate those two super specific errors and fix them, and then we should perhaps launch a pkgeval run. What do you think about the necessity for the aggressive constant propagation? Seems like even with it compile times were pretty good.

@dkarrasch
Copy link
Member Author

Hm, muladd is adding issues, one of which is the fact that it implements A*y .+ z, not A*y + z in the array-of-arrays case. So, when z is a Number, it is promoted to one-element vectors/matrices in some of the tests. Another issue is that muladd(::Number, ::Number, ::Number) promotes its arguments, which is what we don't want for unitful arguments.

@jishnub
Copy link
Contributor

jishnub commented Nov 6, 2023

Do we want fma in general in such an operation, as the docstring for muladd states that "The result can be different on different machines and can also be different on the same machine due to constant propagation or other optimizations"?

@chriselrod
Copy link
Contributor

chriselrod commented Nov 6, 2023

Do we want fma in general in such an operation, as the docstring for muladd states that "The result can be different on different machines and can also be different on the same machine due to constant propagation or other optimizations"?

mul! for BlasFloat will be different on different machines already. Any thoughts @oscardssmith about the consistency?
I think the performance for the ForwardDiff.Dual case is well worth it on systems with fma.
The potential performance hit of using fma instead of muladd on systems without it though is extreme, even if those systems should be ashamed of themselves ;) .

What do you think about the necessity for the aggressive constant propagation? Seems like even with it compile times were pretty good.

Part of the hope is that it could help compile times by cutting out dead code that doesn't need to be compiled. But I haven't looked into it/checked if it's even working. I'll do that tonight.

Hm, muladd is adding issues, one of which is the fact that it implements Ay .+ z, not Ay + z in the array-of-arrays case. So, when z is a Number, it is promoted to one-element vectors/matrices in some of the tests. Another issue is that muladd(::Number, ::Number, ::Number) promotes its arguments, which is what we don't want for unitful arguments.

Sounds like we're missing some muladd overloads?
That definitely sounds like a Unitful bug.

@oscardssmith
Copy link
Member

IMO the consistency isn't a problem. matmul generally isn't consistent.

@chriselrod
Copy link
Contributor

@chriselrod
Copy link
Contributor

chriselrod commented Nov 6, 2023

Hm, muladd is adding issues, one of which is the fact that it implements Ay .+ z, not Ay + z in the array-of-arrays case. So, when z is a Number, it is promoted to one-element vectors/matrices in some of the tests.

julia> using BenchmarkTools

julia> A = rand(5,4); x = rand(4);

julia> @btime muladd($A, $x, 3.4)'
  99.130 ns (1 allocation: 96 bytes)
1×5 adjoint(::Vector{Float64}) with eltype Float64:
 3.7513  3.91565  4.04009  3.66507  3.7944

Doing it the naive way gives us the expected behavior of only allocating the result, without a promotion.

But, how can z be a number while x and y are not?

julia> y' * x + 3.4
4.004396984735307

julia> muladd(y', x, 3.4)
4.004396984735307

also work as expected.

But I do see there are matmul test failures, so I'll take a look at what is going on tonight if no one beats me too it (please do!).

@chriselrod
Copy link
Contributor

julia> @code_typed mul!(Cdd, Add, Bdd)
CodeInfo(
1%1 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, A::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, B::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└──      return %1
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}

julia> @code_typed mul!(Cdd, Add', Bdd)
CodeInfo(
1%1 = Base.getfield(A, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}%2 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %1)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│   %3 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, %2::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, B::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└──      return %3
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}

julia> @code_typed mul!(Cdd, Add', Bdd')
CodeInfo(
1%1 = Base.getfield(A, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}%2 = Base.getfield(B, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}%3 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %1)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│   %4 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %2)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│   %5 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, %3::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %4::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└──      return %5
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}

julia> @code_typed mul!(Cdd, Add, Bdd')
CodeInfo(
1%1 = Base.getfield(B, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}%2 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %1)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│   %3 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, A::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, %2::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└──      return %3
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}

Looks like everything is compiling away successfully.

@chriselrod
Copy link
Contributor

chriselrod commented Nov 7, 2023

Here is a reproducer of those test failures:

julia> B=Bidiagonal([-8041718734995066674, -7402188931680778461], [-4293547541337790375], :L)
2×2 Bidiagonal{Int64, Vector{Int64}}:
 -8041718734995066674                     
 -4293547541337790375  -7402188931680778461

julia> A=Bidiagonal([-9007632524281690832, -8423219277671072315], [6624006889975070404], :L)
2×2 Bidiagonal{Int64, Vector{Int64}}:
 -9007632524281690832                     
  6624006889975070404  -8423219277671072315

julia> C = randn(2,2);

julia> mul!(C, A, B);

julia> (Array(A) * Array(B) .- C) ./ norm(C)
2×2 Matrix{Float64}:
 -0.746052   0.0
  0.176149  -0.642167

julia> (Float64.(Array(A)) * Float64.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{Float64}:
 0.0  -0.0
 0.0   0.0

julia> (big.(Array(A)) * big.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{BigFloat}:
 -7.93613e-17  0.0
 -6.38913e-18  2.01747e-17

while on Julia master, I get

julia> (Array(A) * Array(B) .- C) ./ norm(C)
2×2 Matrix{Float64}:
 0.0  0.0
 0.0  0.0

julia> (Float64.(Array(A)) * Float64.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{Float64}:
  7.78132e18  -0.0
 -1.83723e18   6.6978e18

julia> (big.(Array(A)) * big.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{BigFloat}:
  7.78132e+18  0.0
 -1.83723e+18  6.6978e+18

I'd suggest using Float64.(Array(x)) instead?
That mul!(::Matrix{Float64}, A, B) may promote intermediates to Float64 (the destination type) seems like it should be reasonably expected.
That we avoid integer overflows in the intermediate

julia> Base.mul_with_overflow(A[1,1], B[1,1])
(5263619963498179744, true)

sounds like a bonus. If someone wants modular integer arithmetic, shouldn't they be writing to an integer valued destination?

@dkarrasch
Copy link
Member Author

Thanks again @chriselrod! I agree that promoting the comparison values (not the input to mul!) to Float64 is nothing but reasonable.

@nanosoldier runtests()

@dkarrasch
Copy link
Member Author

@nanosoldier runbenchmarks("linalg"; vs = :master)

@dkarrasch
Copy link
Member Author

@nanosoldier runbenchmarks("linalg"; vs = ":master")

@dkarrasch
Copy link
Member Author

@nanosoldier runtests(["ModiaPlot_CairoMakie", "GridVisualize", "ABCDMatrixOptics", "DataFlowTasks", "QuantumControlBase", "PowerAnalytics", "CopEnt", "AiidaDFTK", "ModiaPlot_WGLMakie", "DiffEqFinancial", "MathepiaModels", "PortfolioAnalytics", "Porta", "MonolithicFEMVLFS", "FrequencySweep", "PowerModels", "OPFLearn", "TrajectoryGamesBase", "Bagyo", "Thermodynamics", "Maxvol", "GeoStatsBase", "CellListMap", "NMRTools", "Nonconvex", "UnitfulBuckinghamPi", "Distances", "Manopt", "MCPTrajectoryGameSolver", "QuantumPropagators", "DifferentiableFrankWolfe", "HashCode2014", "TwoStageOptimalControl", "EditorsRepo", "GeneralizedSasakiNakamura", "DelayEmbeddings", "GraphViz", "Controlz", "GpABC", "DifferentiableCollisions", "MultivariateStats", "HarmonicPowerFlow", "LiquidElectrolytes", "LazyAlgebra", "MomentClosure", "MPSKit", "SemiLagrangian"])

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@dkarrasch
Copy link
Member Author

@nanosoldier runbenchmarks("linalg"; vs = ":master")

@nanosoldier runtests()

@dkarrasch
Copy link
Member Author

We may need to consider backporting this to v1.10. #51961 seems requested for SparseArrays's reasons, but that would make compile times for generic matmatmul even worse, without any (runtime) performance benefit.

@jishnub
Copy link
Contributor

jishnub commented Nov 10, 2023

We may rebase the other PR after this is merged, and only add constprop to methods where there's a distinct improvement

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@dkarrasch
Copy link
Member Author

We may rebase the other PR after this is merged, and only add constprop to methods where there's a distinct improvement

So far I have copied all instance of aggressive constant propagation from your PR. I thought we may wish to have them in v1.10 because of the SparseArrays regression? Or is it yet unclear what's causing it?

@jishnub
Copy link
Contributor

jishnub commented Nov 13, 2023

This PR is about matmatmul, whereas the SparseArrays issue is about matvecmul. We may need to add some annotations to those methods as well

@dkarrasch
Copy link
Member Author

Yes, I see I skipped the matvec mul related annotations. However, it's not constant propagation that's causing the issue there. The lack of constant propagation has little to any effect on runtime.

Copy link
Contributor

@chriselrod chriselrod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me.

@dkarrasch
Copy link
Member Author

@nanosoldier runtests(["Maxvol", "MLJText", "InteractiveErrors", "Dolo", "SumOfSquares", "OptimKit", "NRRD", "SOCRATESSingleColumnForcings", "Cthulhu", "MetidaBioeq", "MPIMeasurements", "DiffEqFinancial", "DifferentiableFrankWolfe", "MaxPlus", "IMFData", "Mellan", "EditorsRepo", "GenieSession", "MultivariateStats", "Kronecker", "BenchmarkingEconomicEfficiency", "NetworkJumpProcesses", "Consensus", "GeoParams", "CellListMap", "GeoIO", "TORA", "ODEInterfaceDiffEq", "ChaosTools", "NeuroAnalysis", "SignalTablesInterface_CairoMakie", "PolynomialGTM", "DeconvOptim", "FractalDimensions", "ImplicitDifferentiation", "AntennaPattern", "MINDFulMakie", "SphericalFunctions", "CropRootBox", "ABCDMatrixOptics", "CopEnt", "MathML", "JumpProcesses", "SMLMFrameConnection", "DiffEqProblemLibrary", "AbstractAlgebra", "TrajOptPlots", "Chamber", "DynamicPolynomials"])

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@dkarrasch
Copy link
Member Author

That is interesting! I tested a few more non-BLAS combinations.

  1. For the case both inputs Float64 and output Float32 (the benchmark case) the improvement is massive, like factor 4 across all combinations of plain and adjortrans.
  2. For the case both inputs ComplexF64 and output ComplexF32 there is a regression, which might be fixed if we reduce the number in the sizeof-check.
  3. For the case both inputs Int64 and output Float64, 5 combinations are improving and 4 combinations show a slight regression of 10%, so I'd say acceptable.

@chriselrod Have you benchmarked your own use cases with the latest status of this PR? Should we change anything in view of the complex case discussed above? For quick reference, this is some benchmark code:

using LinearAlgebra, BenchmarkTools
n = 256;
C = zeros(ComplexF32, n, n);
A = randn(ComplexF64, n, n);
B = randn(ComplexF64, n, n);
for A in (A, transpose(A), adjoint(A)), B in (B, transpose(B), adjoint(B))
    @show typeof(A), typeof(B)
    @btime mul!($C, $A, $B)
end

@dkarrasch
Copy link
Member Author

I filed a few PRs to some failing packages. I have no idea what to about Cthulhu.jl (it fails to precompile) or how it could be affected, and InteractiveErrors.jl is failing due to Cthulhu.

@oscardssmith oscardssmith merged commit 0cf2bf1 into master Nov 14, 2023
2 checks passed
@oscardssmith oscardssmith deleted the dk/fastmul branch November 14, 2023 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
linear algebra Linear algebra performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants