Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cuarray dispatch & reduce TTFX #22

Merged
merged 5 commits into from
Jun 30, 2023
Merged

Conversation

GiggleLiu
Copy link
Member

@GiggleLiu GiggleLiu commented Jun 30, 2023

Before

julia> @time (using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); x * y)
 10.316412 seconds (18.54 M allocations: 1.056 GiB, 3.12% gc time, 92.85% compilation time)

After

julia> @time (using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); x * y)
  1.196553 seconds (2.16 M allocations: 182.519 MiB, 5.15% gc time, 1.17% compilation time)

This is amazing speed up, which greatly improves the usability of Octavian and TropicalGEMM.

@chriselrod I notice that in Octavian, PrecompileTools is imported but not used. Do you have special concerns like compatibility or binary size? Otherwise, I just confirmed the following speed up of TTFX after the precompilation:

julia> @time (using Octavian; T = Float64; x = rand(T, 10, 10); y = rand(T, 10, 10); Octavian.matmul(x, y))
  0.796285 seconds (1.38 M allocations: 93.804 MiB, 4.00% gc time, 1.78% compilation time)

Before this change, the TTFX is 6s on my machine.

Update

The binary size is also "amazing":

-rw-rw-r-- 1 leo leo     19466 Jun 30 14:24 6vpGR_SKm48.ji
-rwxrwxrwx 1 leo leo 486264528 Jun 30 14:24 6vpGR_SKm48.so

@GiggleLiu
Copy link
Member Author

Since this is an easy fix and the tests pass. I will merge this PR directly.

@GiggleLiu GiggleLiu merged commit 71c1e01 into master Jun 30, 2023
2 of 4 checks passed
@chriselrod
Copy link
Collaborator

I forgot to remove it in JuliaLinearAlgebra/Octavian.jl#181
Precompilation often results in dynamic dispatches in the precompiled code that goes away if you don't precompile it.

@GiggleLiu
Copy link
Member Author

GiggleLiu commented Jul 1, 2023

I can not reproduce the performance issue. I did not see dynamic dispatch in the benchmark so I will keep the precompilation for now. The bug in Octavian is really wield, I can benchmark it on my machine for you to see if it is CPU architecture dependent.

Before

julia> using BenchmarkTools

julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); @benchmark $x * $y
BenchmarkTools.Trial: 10000 samples with 900 evaluations.
 Range (min  max):  124.147 ns   1.115 μs  ┊ GC (min  max): 0.00%  68.38%
 Time  (median):     132.959 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   150.138 ns ± 99.103 ns  ┊ GC (mean ± σ):  7.41% ±  9.60%

  █▄▂▂▁                                                        ▁
  ██████▇▇▆▆▆▆▆▄▁▁▁█▆▄▃▃▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▃▅▆▄▅▅▄▆▆▅▆ █
  124 ns        Histogram: log(frequency) by time       858 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 1000, 1000)); y = Tropical.(rand(T, 1000, 1000)); @benchmark $x * $y
BenchmarkTools.Trial: 70 samples with 1 evaluation.
 Range (min  max):  69.017 ms  85.843 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     70.654 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   71.431 ms ±  2.633 ms  ┊ GC (mean ± σ):  0.17% ± 0.44%

     █                                                         
  ▄▆▄█▇█▃▇▄▆▁▇▄▆▃▄▁▆▆▁▄▃▁▄▁▁▄▁▃▃▁▁▃▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  69 ms           Histogram: frequency by time        79.8 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

After Precompiling

julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 10, 10)); y = Tropical.(rand(T, 10, 10)); @benchmark $x * $y
[ Info: Precompiling TropicalGEMM [a4ad3063-64a7-4bad-8738-34ed09bc0236]
BenchmarkTools.Trial: 10000 samples with 888 evaluations.
 Range (min  max):  127.241 ns  786.229 ns  ┊ GC (min  max): 0.00%  52.64%
 Time  (median):     134.123 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   148.204 ns ±  61.213 ns  ┊ GC (mean ± σ):  3.81% ±  7.61%

  █▇▄▃▂▂▂▂▂                                                     ▂
  ██████████▇▆▇▆▆▅▅▆▅▅▅▆▅▃▃▄▁▁▅█▆▆▃▄▃▁▃▁▄▃▃▁▃▄▃▄▁▅▅▅▄▅▄▅▆▅▄▅▅▄▅ █
  127 ns        Histogram: log(frequency) by time        557 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> using TropicalGEMM; T = Float64; x = Tropical.(rand(T, 1000, 1000)); y = Tropical.(rand(T, 1000, 1000)); @benchmark $x * $y
BenchmarkTools.Trial: 72 samples with 1 evaluation.
 Range (min  max):  68.306 ms  73.864 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     68.889 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   69.528 ms ±  1.179 ms  ┊ GC (mean ± σ):  0.17% ± 0.44%

   ▃ █▅ ▂                                                      
  ▄█▄████▇▄▇▄▁▁▄▁▄▄▄▁▄▄▄▁▁▅▄▅▁▄▄▅▄▅▄▄▅▅▄▄▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▄ ▁
  68.3 ms         Histogram: frequency by time        72.7 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

@GiggleLiu
Copy link
Member Author

GiggleLiu commented Jul 1, 2023

I used your script for benchmarking Octavian, but did not see any dispatch issue. The Julia version is 1.9.1, system version is Ubuntu 22.04, CPU is Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz

Without precompile,

julia> using BenchmarkTools

julia> using Octavian
[ Info: Precompiling Octavian [6fd5a793-0b7e-452c-907f-f8bfe9c57db4]
A^[[A
julia> A = rand(-1_000:1_000, 200, 200);

julia> B = rand(-1_000:1_000, 200, 200);

julia> C = similar(A);

julia> Af64 = Float64.(A); Bf64 = Float64.(B); Cf64 = similar(Af64);

julia> @benchmark matmul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  278.507 μs  524.333 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     286.815 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   289.723 μs ±  14.144 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▃   ▅█▅▄▃▃▂▂▁▁                                               ▂
  ██▆▃▄███████████▇█▇▇▇▆▆█████▇▆▆▅▆▆▆▆▆▆▆▆▆▆▆▇▅▅▆▆▅▅▆▆▄▄▄▃▅▅▁▃▅ █
  279 μs        Histogram: log(frequency) by time        353 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

With precompile

julia> using Octavian
[ Info: Precompiling Octavian [6fd5a793-0b7e-452c-907f-f8bfe9c57db4]

julia> A = rand(-1_000:1_000, 200, 200);

julia> B = rand(-1_000:1_000, 200, 200);

julia> C = similar(A);

julia> using BenchmarkTools

julia> Af64 = Float64.(A); Bf64 = Float64.(B); Cf64 = similar(Af64);

julia> @benchmark matmul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  283.227 μs  553.582 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     285.139 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   286.645 μs ±  12.436 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▅█▇█▆▅▄▄▃▃▃▃▂▂▂▁▁▁ ▁▁                                        ▂
  ███████████████████████▇██▇▆▇▇▇▇▆▆▅▅▆▅▄▄▅▄▃▄▅▃▃▄▅▅▃▅▅▃▄▃▁▄▄▅▃ █
  283 μs        Histogram: log(frequency) by time        309 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Precompiling script

@setup_workload begin
    # Putting some things in `@setup_workload` instead of `@compile_workload` can reduce the size of the
    # precompile file and potentially make loading faster.
    @compile_workload begin
        for T in (Float32, Float64)
            x = rand(T, 10, 10)
            y = rand(T, 10, 10)
            Octavian.matmul(x, y)
        end
    end
end

@chriselrod
Copy link
Collaborator

I was on Julia master. I hadn't seen the problem before in Octavian, but have seen it often enough elsewhere that I didn't want to spend any time on it, and just disabled the precompilation.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants