-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix cuarray dispatch & reduce TTFX #22
Conversation
Since this is an easy fix and the tests pass. I will merge this PR directly. |
I forgot to remove it in JuliaLinearAlgebra/Octavian.jl#181 |
I can not reproduce the performance issue. I did not see dynamic dispatch in the benchmark so I will keep the precompilation for now. The bug in Octavian is really wield, I can benchmark it on my machine for you to see if it is CPU architecture dependent. Beforejulia> using BenchmarkTools
julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); @benchmark $x * $y
BenchmarkTools.Trial: 10000 samples with 900 evaluations.
Range (min … max): 124.147 ns … 1.115 μs ┊ GC (min … max): 0.00% … 68.38%
Time (median): 132.959 ns ┊ GC (median): 0.00%
Time (mean ± σ): 150.138 ns ± 99.103 ns ┊ GC (mean ± σ): 7.41% ± 9.60%
█▄▂▂▁ ▁
██████▇▇▆▆▆▆▆▄▁▁▁█▆▄▃▃▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▃▅▆▄▅▅▄▆▆▅▆ █
124 ns Histogram: log(frequency) by time 858 ns <
Memory estimate: 896 bytes, allocs estimate: 1.
julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 1000, 1000)); y = Tropical.(rand(T, 1000, 1000)); @benchmark $x * $y
BenchmarkTools.Trial: 70 samples with 1 evaluation.
Range (min … max): 69.017 ms … 85.843 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 70.654 ms ┊ GC (median): 0.00%
Time (mean ± σ): 71.431 ms ± 2.633 ms ┊ GC (mean ± σ): 0.17% ± 0.44%
█
▄▆▄█▇█▃▇▄▆▁▇▄▆▃▄▁▆▆▁▄▃▁▄▁▁▄▁▃▃▁▁▃▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
69 ms Histogram: frequency by time 79.8 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2. After Precompilingjulia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); @benchmark $x * $y
[ Info: Precompiling TropicalGEMM [a4ad3063-64a7-4bad-8738-34ed09bc0236]
BenchmarkTools.Trial: 10000 samples with 888 evaluations.
Range (min … max): 127.241 ns … 786.229 ns ┊ GC (min … max): 0.00% … 52.64%
Time (median): 134.123 ns ┊ GC (median): 0.00%
Time (mean ± σ): 148.204 ns ± 61.213 ns ┊ GC (mean ± σ): 3.81% ± 7.61%
█▇▄▃▂▂▂▂▂ ▂
██████████▇▆▇▆▆▅▅▆▅▅▅▆▅▃▃▄▁▁▅█▆▆▃▄▃▁▃▁▄▃▃▁▃▄▃▄▁▅▅▅▄▅▄▅▆▅▄▅▅▄▅ █
127 ns Histogram: log(frequency) by time 557 ns <
Memory estimate: 896 bytes, allocs estimate: 1.
julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 1000, 1000)); y = Tropical.(rand(T, 1000, 1000)); @benchmark $x * $y
BenchmarkTools.Trial: 72 samples with 1 evaluation.
Range (min … max): 68.306 ms … 73.864 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 68.889 ms ┊ GC (median): 0.00%
Time (mean ± σ): 69.528 ms ± 1.179 ms ┊ GC (mean ± σ): 0.17% ± 0.44%
▃ █▅ ▂
▄█▄████▇▄▇▄▁▁▄▁▄▄▄▁▄▄▄▁▁▅▄▅▁▄▄▅▄▅▄▄▅▅▄▄▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▄ ▁
68.3 ms Histogram: frequency by time 72.7 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2. |
I used your script for benchmarking Octavian, but did not see any dispatch issue. The Julia version is 1.9.1, system version is Ubuntu 22.04, CPU is Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz Without precompile,julia> using BenchmarkTools
julia> using Octavian
[ Info: Precompiling Octavian [6fd5a793-0b7e-452c-907f-f8bfe9c57db4]
A^[[A
julia> A = rand(-1_000:1_000, 200, 200);
julia> B = rand(-1_000:1_000, 200, 200);
julia> C = similar(A);
julia> Af64 = Float64.(A); Bf64 = Float64.(B); Cf64 = similar(Af64);
julia> @benchmark matmul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 278.507 μs … 524.333 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 286.815 μs ┊ GC (median): 0.00%
Time (mean ± σ): 289.723 μs ± 14.144 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▃ ▅█▅▄▃▃▂▂▁▁ ▂
██▆▃▄███████████▇█▇▇▇▆▆█████▇▆▆▅▆▆▆▆▆▆▆▆▆▆▆▇▅▅▆▆▅▅▆▆▄▄▄▃▅▅▁▃▅ █
279 μs Histogram: log(frequency) by time 353 μs <
Memory estimate: 0 bytes, allocs estimate: 0. With precompilejulia> using Octavian
[ Info: Precompiling Octavian [6fd5a793-0b7e-452c-907f-f8bfe9c57db4]
julia> A = rand(-1_000:1_000, 200, 200);
julia> B = rand(-1_000:1_000, 200, 200);
julia> C = similar(A);
julia> using BenchmarkTools
julia> Af64 = Float64.(A); Bf64 = Float64.(B); Cf64 = similar(Af64);
julia> @benchmark matmul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 283.227 μs … 553.582 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 285.139 μs ┊ GC (median): 0.00%
Time (mean ± σ): 286.645 μs ± 12.436 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▅█▇█▆▅▄▄▃▃▃▃▂▂▂▁▁▁ ▁▁ ▂
███████████████████████▇██▇▆▇▇▇▇▆▆▅▅▆▅▄▄▅▄▃▄▅▃▃▄▅▅▃▅▅▃▄▃▁▄▄▅▃ █
283 μs Histogram: log(frequency) by time 309 μs <
Memory estimate: 0 bytes, allocs estimate: 0. Precompiling script@setup_workload begin
# Putting some things in `@setup_workload` instead of `@compile_workload` can reduce the size of the
# precompile file and potentially make loading faster.
@compile_workload begin
for T in (Float32, Float64)
x = rand(T, 10, 10)
y = rand(T, 10, 10)
Octavian.matmul(x, y)
end
end
end |
I was on Julia master. I hadn't seen the problem before in Octavian, but have seen it often enough elsewhere that I didn't want to spend any time on it, and just disabled the precompilation. |
Before
After
This is amazing speed up, which greatly improves the usability of
Octavian
andTropicalGEMM
.@chriselrod I notice that in
Octavian
, PrecompileTools is imported but not used. Do you have special concerns like compatibility or binary size? Otherwise, I just confirmed the following speed up of TTFX after the precompilation:Before this change, the TTFX is 6s on my machine.
Update
The binary size is also "amazing":