Reduce compile time for generic matmatmul #52038
Conversation
stdlib/LinearAlgebra/src/matmul.jl (Outdated)

```julia
@inbounds for i in AxM, j in BxN
    z2 = zero(A[i, a1]*B[b1, j] + A[i, a1]*B[b1, j])
    Ctmp = convert(promote_type(R, typeof(z2)), z2)
    for k in AxK
        Ctmp += A[i, k]*B[k, j]
```
LLVM's default loop vectorizer isn't smart enough to optimize this effectively, so the `Float64` performance of this will probably look worse than what I shared.
On the other hand, if `sizeof(eltype(C))` is large and the eltype is itself SIMD-able, then this order will probably perform better than re-loading and re-storing on every iteration of the innermost loop, so I think this is fine.
I chose the order I did to ensure a decisive win in the `Float64` benchmark (to say "the tiling is really bad"), but obviously this code isn't going to run with `Float64` often; `ForwardDiff.Dual`, however, may be common and is the eltype I'm actually interested in.
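For concreteness, a minimal sketch contrasting the two loop orderings being discussed (illustrative only, not the PR's code; the array setup is made up):

```julia
# Illustration only: two loop orders for C = A*B with a generic eltype.
A = rand(4, 4); B = rand(4, 4); C = zeros(4, 4);

# Accumulator order: the reduction index k is innermost and C[i, j] is built up in a
# local accumulator, so C is not re-loaded and re-stored on every innermost iteration.
# This tends to win when eltype(C) is wide but still SIMD-able (e.g. ForwardDiff.Dual).
for i in axes(A, 1), j in axes(B, 2)
    acc = zero(eltype(C))
    for k in axes(A, 2)
        acc = muladd(A[i, k], B[k, j], acc)
    end
    C[i, j] = acc
end

# Update order: k is an outer loop and the innermost loop walks down columns of A and C.
# LLVM's loop vectorizer handles this pattern well for Float64 (assumes C starts zeroed).
fill!(C, 0.0)
for n in axes(B, 2), k in axes(B, 1), m in axes(A, 1)
    C[m, n] = muladd(A[m, k], B[k, n], C[m, n])
end
```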
Yes, these are details that we can figure out in the process ("under the hood"). For now I basically kept the old order, as in the `'N'`-`'N'` case.
This looks great! Are all the constprop annotations afterwards still needed, now that the characters are introduced lower down?
Essentially, this removes the tiling stuff, and (in the generic case only) the redirections to the 2x2 and 3x3 functions. I assume that compiling those added a lot to the compile times, with all the explicit wrapper branches. It would be interesting to see whether the aggressive constant propagation is really helpful, or whether it just forces the compiler to work harder for little gain. At runtime, those checks should not dominate timings, so I guess it's not so important for runtime performance to "compile branches away".
Thanks! Without the tiling, the 2x2 and 3x3 special cases are also less important. I'll try this PR out tonight.
The characters are actually introduced higher up. They allow us to split potential wrappers from the underlying storage arrays. That is where array packages hook in and dispatch on the storage array, just like the BLAS path dispatches on strided arrays with `BlasFloat` eltypes. Now, for the really generic case we rewrap immediately, once no other method has caught the call. So I guess we would like the compiler to understand that after unwrapping, one call away, we are rewrapping again (and in which wrapper exactly!) in the generic case. But I have no idea whether that can be achieved by constant propagation, so we may need to turn things on and off and see what happens.
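A rough sketch of that wrapper-splitting pattern, with hypothetical names (`unwrap`, `rewrap`, `generic_mul!`, and `my_mul!` are illustrative, not the actual matmul.jl internals): the wrapper is split into a character plus the storage array, specialized methods can dispatch on the storage type, and the fully generic fallback rewraps before running the naive loops.

```julia
using LinearAlgebra

# Hypothetical sketch of splitting a lazy wrapper into (character, storage array).
unwrap(A::Transpose) = ('T', parent(A))
unwrap(A::Adjoint)   = ('C', parent(A))
unwrap(A)            = ('N', A)

# The generic fallback rewraps once no more specialized method has caught the call.
rewrap(tA::Char, A) = tA == 'T' ? transpose(A) : tA == 'C' ? adjoint(A) : A

function generic_mul!(C, tA, A, tB, B)
    A2, B2 = rewrap(tA, A), rewrap(tB, B)
    fill!(C, zero(eltype(C)))
    for n in axes(B2, 2), k in axes(B2, 1), m in axes(A2, 1)
        C[m, n] = muladd(A2[m, k], B2[k, n], C[m, n])
    end
    return C
end

function my_mul!(C, A, B)
    tA, Astor = unwrap(A)
    tB, Bstor = unwrap(B)
    # A package (or a BLAS path) could intercept here by dispatching on typeof(Astor)/typeof(Bstor).
    return generic_mul!(C, tA, Astor, tB, Bstor)
end
```

The point is that a package can add methods for its own storage type at the unwrapped level, while the fully generic fallback rewraps and runs the naive loops.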
I think this looks pretty good with respect to compile time when using

```julia
using LinearAlgebra, BenchmarkTools
using LinearAlgebra: @lazy_str
function _generic_matmatmuladd!(C, A, B)
AxM = axes(A, 1)
AxK = axes(A, 2) # we use two `axes` calls in case of `AbstractVector`
BxK = axes(B, 1)
BxN = axes(B, 2)
CxM = axes(C, 1)
CxN = axes(C, 2)
if AxM != CxM
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix C has axes ($CxM,$CxN)"))
end
if AxK != BxK
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix B has axes ($BxK,$CxN)"))
end
if BxN != CxN
throw(DimensionMismatch(lazy"matrix B has axes ($BxK,$BxN), matrix C has axes ($CxM,$CxN)"))
end
for n = BxN, k = BxK, m = AxM
C[m,n] = muladd(A[m,k], B[k,n], C[m,n])
end
return C
end
function _generic_matmatmul!(C, A, B)
_generic_matmatmuladd!(fill!(C, zero(eltype(C))), A, B)
end
```

in the REPL. Running loops like

```julia
d(x, n) = ForwardDiff.Dual(x, ntuple(_ -> randn(), n))
function dualify(A, n, j)
if n > 0
A = d.(A, n)
if (j > 0)
A = d.(A, j)
end
end
A
end
@time for n = 0:8, j = (n!=0):4
A = dualify.(randn(5,5), n, j);
B = dualify.(randn(5,5), n, j);
C = similar(A);
mul!(C, A, B);
mul!(C, A', B);
mul!(C, A, B');
mul!(C, A', B');
mul!(C, transpose(A), B);
mul!(C, A, transpose(B));
mul!(C, transpose(A), transpose(B));
end
# or (not in the same Julia session!)
@time for n = 0:8, j = (n!=0):4
A = dualify.(randn(5,5), n, j);
B = dualify.(randn(5,5), n, j);
C = similar(A);
mul!(C, A, B);
end
```

I get, for

All the permutations:

So this is comparable to my PR and a substantial improvement over master.
Runtime performance, however, is worse for this example:

```julia
julia> B = dualify.(randn(5,5), 8, 2);
julia> A = dualify.(randn(5,5), 8, 2);
julia> C = similar(A);
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.632 μs … 6.157 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.647 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.654 μs ± 46.285 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▇██▇▇▅▂ ▁ ▂
█████████▆▅▃▁▁▁▁▃▃▁▁▁▁▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▃▁▄▆▆▇█████▇▆▆▇ █
4.63 μs Histogram: log(frequency) by time 4.83 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark _generic_matmatmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.248 μs … 3.293 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.258 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.261 μs ± 32.042 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▇█
▂▃▆███▆▃▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂ ▂
2.25 μs Histogram: frequency by time 2.37 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
```

Still much better than master, of course:

```julia
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
Range (min … max): 5.692 μs … 2.369 ms ┊ GC (min … max): 0.00% … 97.50%
Time (median): 6.190 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.318 μs ± 46.745 μs ┊ GC (mean ± σ): 13.28% ± 2.38%
▅█▆▄▃▂▂▁ ▁▂▄▄▄▃▂ ▂
▅▇█████████▇▆▆▅▄▁▁▁▁▄▁▄▁▃▃▄▄▃▁▁▃▁▃█████████▇▅▄▄▃▁▁▁▅▃▅▇▇██ █
5.69 μs Histogram: log(frequency) by time 13 μs <
Memory estimate: 20.92 KiB, allocs estimate: 7.
```
Here, making the reduction the inner loop actually helps (as I said earlier):

```julia
julia> function mulreduceinnerloop!(C, A, B)
AxM = axes(A, 1)
AxK = axes(A, 2) # we use two `axes` calls in case of `AbstractVector`
BxK = axes(B, 1)
BxN = axes(B, 2)
CxM = axes(C, 1)
CxN = axes(C, 2)
if AxM != CxM
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix C has axes ($CxM,$CxN)"))
end
if AxK != BxK
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix B has axes ($BxK,$CxN)"))
end
if BxN != CxN
throw(DimensionMismatch(lazy"matrix B has axes ($BxK,$BxN), matrix C has axes ($CxM,$CxN)"))
end
@inbounds for n = BxN, m = AxM
Cmn = zero(eltype(C))
for k = BxK
Cmn = muladd(A[m,k], B[k,n], Cmn)
end
C[m,n] = Cmn
end
return C
end
mulreduceinnerloop! (generic function with 1 method)
julia> B = dualify.(randn(5,5), 8, 2);
julia> A = dualify.(randn(5,5), 8, 2);
julia> C = similar(A);
julia> @benchmark _generic_matmatmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.243 μs … 2.569 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.249 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.252 μs ± 17.621 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▅▇█▇▃ ▂
██████▇▆▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▇██▆ █
2.24 μs Histogram: log(frequency) by time 2.36 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.657 μs … 5.177 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.682 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.688 μs ± 36.267 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▅▆▇███▆▄▂ ▁ ▂
▅███████████▇▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▄▄▅▆▇▇█████▇▆▇ █
4.66 μs Histogram: log(frequency) by time 4.86 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mulreduceinnerloop!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.131 μs … 1.454 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.138 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.141 μs ± 14.890 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█
▂▂▄██▆▄▄▅▃▂▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
1.13 μs Histogram: frequency by time 1.23 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
```

So the overhead is coming from elsewhere. These benchmarks were done on the PR branch, of course.
It comes from not using `muladd`:

```julia
julia> @noinline function LinearAlgebra._generic_matmatmul!(C::AbstractVecOrMat{R}, A::AbstractVecOrMat{T}, B::AbstractVecOrMat{S},
_add::LinearAlgebra.MulAddMul) where {T,S,R}
AxM = axes(A, 1)
AxK = axes(A, 2) # we use two `axes` calls in case of `AbstractVector`
BxK = axes(B, 1)
BxN = axes(B, 2)
CxM = axes(C, 1)
CxN = axes(C, 2)
if AxM != CxM
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix C has axes ($CxM,$CxN)"))
end
if AxK != BxK
throw(DimensionMismatch(lazy"matrix A has axes ($AxM,$AxK), matrix B has axes ($BxK,$CxN)"))
end
if BxN != CxN
throw(DimensionMismatch(lazy"matrix B has axes ($BxK,$BxN), matrix C has axes ($CxM,$CxN)"))
end
if iszero(_add.alpha) || isempty(A) || isempty(B)
return LinearAlgebra._rmul_or_fill!(C, _add.beta)
end
a1 = first(AxK)
b1 = first(BxK)
@inbounds for i in AxM, j in BxN
z2 = zero(A[i, a1]*B[b1, j] + A[i, a1]*B[b1, j])
Ctmp = convert(promote_type(R, typeof(z2)), z2)
for k in AxK
Ctmp = muladd(A[i, k], B[k, j], Ctmp)
end
LinearAlgebra._modify!(_add, Ctmp, C, (i,j))
end
return C
end
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.092 μs … 1.376 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.098 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.099 μs ± 11.850 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂
▂▆▄██▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
1.09 μs Histogram: frequency by time 1.19 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
```
Co-authored-by: Chris Elrod <[email protected]>
Thanks @chriselrod! I'll need to investigate those two super specific errors and fix them, and then we should perhaps launch a pkgeval run. What do you think about the necessity of the aggressive constant propagation? It seems like even with it, compile times were pretty good.
Hm,
Do we want
Part of the hope is that it could help compile times by cutting out dead code that doesn't need to be compiled. But I haven't looked into it/checked if it's even working. I'll do that tonight.
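For context, a minimal sketch of the kind of annotation being discussed (a hypothetical function, not the actual matmul code): with aggressive constant propagation, a branch on a compile-time-constant argument such as `beta = false` can be folded away, so the dead branch never has to be compiled into the caller.

```julia
# Hypothetical example: when scale_or_zero!(C, false) is called with a literal `false`,
# aggressive constprop lets the compiler fold the `iszero(beta)` branch and drop the `else` arm.
Base.@constprop :aggressive function scale_or_zero!(C, beta)
    if iszero(beta)
        fill!(C, zero(eltype(C)))
    else
        C .*= beta
    end
    return C
end
```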
Sounds like we're missing some
IMO the consistency isn't a problem. matmul generally isn't consistent.
@dkarrasch, re unitful

```julia
julia> using BenchmarkTools
julia> A = rand(5,4); x = rand(4);
julia> @btime muladd($A, $x, 3.4)'
99.130 ns (1 allocation: 96 bytes)
1×5 adjoint(::Vector{Float64}) with eltype Float64:
 3.7513  3.91565  4.04009  3.66507  3.7944
```

Doing it the naive way gives us the expected behavior of only allocating the result, without a promotion. But how can

```julia
julia> y' * x + 3.4
4.004396984735307
julia> muladd(y', x, 3.4)
4.004396984735307
```

also work as expected? But I do see there are
```julia
julia> @code_typed mul!(Cdd, Add, Bdd)
CodeInfo(
1 ─ %1 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, A::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, B::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└── return %1
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
julia> @code_typed mul!(Cdd, Add', Bdd)
CodeInfo(
1 ─ %1 = Base.getfield(A, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
│ %2 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %1)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│ %3 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, %2::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, B::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└── return %3
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
julia> @code_typed mul!(Cdd, Add', Bdd')
CodeInfo(
1 ─ %1 = Base.getfield(A, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
│ %2 = Base.getfield(B, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
│ %3 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %1)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│ %4 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %2)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│ %5 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, %3::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %4::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└── return %5
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
julia> @code_typed mul!(Cdd, Add, Bdd')
CodeInfo(
1 ─ %1 = Base.getfield(B, :parent)::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
│ %2 = %new(Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, %1)::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}
│ %3 = invoke LinearAlgebra._generic_matmatmul!(C::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, A::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}, %2::Transpose{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}, Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}}, $(QuoteNode(LinearAlgebra.MulAddMul{true, true, Bool, Bool}(true, false)))::LinearAlgebra.MulAddMul{true, true, Bool, Bool})::Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
└── return %3
) => Matrix{ForwardDiff.Dual{Nothing, ForwardDiff.Dual{Nothing, Float64, 4}, 2}}
```

Looks like everything is compiling away successfully.
Here is a reproducer of those test failures:

```julia
julia> B = Bidiagonal([-8041718734995066674, -7402188931680778461], [-4293547541337790375], :L)
2×2 Bidiagonal{Int64, Vector{Int64}}:
-8041718734995066674 ⋅
-4293547541337790375 -7402188931680778461
julia> A=Bidiagonal([-9007632524281690832, -8423219277671072315], [6624006889975070404], :L)
2×2 Bidiagonal{Int64, Vector{Int64}}:
-9007632524281690832 ⋅
6624006889975070404 -8423219277671072315
julia> C = randn(2,2);
julia> mul!(C, A, B);
julia> (Array(A) * Array(B) .- C) ./ norm(C)
2×2 Matrix{Float64}:
-0.746052 0.0
0.176149 -0.642167
julia> (Float64.(Array(A)) * Float64.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{Float64}:
0.0 -0.0
0.0 0.0
julia> (big.(Array(A)) * big.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{BigFloat}:
-7.93613e-17 0.0
 -6.38913e-18   2.01747e-17
```

while on Julia master, I get

```julia
julia> (Array(A) * Array(B) .- C) ./ norm(C)
2×2 Matrix{Float64}:
0.0 0.0
0.0 0.0
julia> (Float64.(Array(A)) * Float64.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{Float64}:
7.78132e18 -0.0
-1.83723e18 6.6978e18
julia> (big.(Array(A)) * big.(Array(B)) .- C) ./ norm(C)
2×2 Matrix{BigFloat}:
7.78132e+18 0.0
 -1.83723e+18   6.6978e+18
```

I'd suggest using

```julia
julia> Base.mul_with_overflow(A[1,1], B[1,1])
(5263619963498179744, true)
```

sounds like a bonus. If someone wants modular integer arithmetic, shouldn't they be writing to an integer-valued destination?
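To make the arithmetic concrete, a small worked example using the entries from the reproducer above (an illustration, not part of the PR): the `Int64` product wraps around to exactly the value `mul_with_overflow` reports, while promoting to `Float64` before multiplying preserves the true magnitude.

```julia
# Worked example with the entries from the reproducer above.
a = -9007632524281690832   # A[1,1]
b = -8041718734995066674   # B[1,1]

a * b                    # 5263619963498179744: wrapped Int64 product, matching mul_with_overflow above
Float64(a) * Float64(b)  # ≈ 7.24e37: what a promoting computation accumulates into the Float64 destination
```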
Thanks again @chriselrod! I agree that promoting the comparison values (not the input to

@nanosoldier
@nanosoldier
@nanosoldier
@nanosoldier
The package evaluation job you requested has completed - possible new issues were detected.
@nanosoldier @nanosoldier
We may need to consider backporting this to v1.10. #51961 seems to be requested for SparseArrays-related reasons, but that would make compile times for generic matmatmul even worse, without any (runtime) performance benefit.
We may rebase the other PR after this is merged, and only add constprop to methods where there's a distinct improvement.
Your benchmark job has completed - no performance regressions were detected. A full report can be found here.
The package evaluation job you requested has completed - possible new issues were detected.
So far I have copied all instances of aggressive constant propagation from your PR. I thought we may wish to have them in v1.10 because of the SparseArrays regression. Or is it still unclear what's causing it?
This PR is about matmatmul, whereas the
Yes, I see I skipped the matvec-mul-related annotations. However, it's not constant propagation that's causing the issue there. The lack of constant propagation has little to no effect on runtime.
Looks great to me.
@nanosoldier
The package evaluation job you requested has completed - possible new issues were detected.
That is interesting! I tested a few more non-BLAS combinations.
@chriselrod Have you benchmarked your own use cases with the latest status of this PR? Should we change anything in view of the complex case discussed above? For quick reference, this is some benchmark code:

```julia
using LinearAlgebra, BenchmarkTools

n = 256;
C = zeros(ComplexF32, n, n);
A = randn(ComplexF64, n, n);
B = randn(ComplexF64, n, n);
for A in (A, transpose(A), adjoint(A)), B in (B, transpose(B), adjoint(B))
    @show typeof(A), typeof(B)
    @btime mul!($C, $A, $B)
end
```
I filed a few PRs to some failing packages. I have no idea what to do about Cthulhu.jl (it fails to precompile) or how it could be affected, and InteractiveErrors.jl is failing due to Cthulhu.
This is another attempt at improving the compile time issue with generic matmatmul, hopefully improving runtime performance also.
@chriselrod @jishnub
There seems to be a little typo/oversight somewhere, but it shows how it could work. Locally, this reduces benchmark times from #51812 (comment) by more than 50%.