Performance regression up to 20% when updating from Julia v1.10.4 to v1.11.0-rc1 #55009

Open
benegee opened this issue Jul 3, 2024 · 17 comments
Labels: kind:regression Regression in behavior compared to a previous version, performance Must go faster

@benegee

benegee commented Jul 3, 2024

We have recently started testing Julia 1.11 in our projects Trixi.jl and TrixiParticles.jl. We observed performance regressions of up to 20%. This has been confirmed on laptops (Apple M2, M3), a workstation (AMD Ryzen Threadripper 3990X), and the JUWELS HPC system at the Jülich Supercomputing Centre (AMD EPYC Rome 7402 CPU).

@sloede has put together instructions and a MWE here: https://github.com/trixi-framework/performance-regression-julia-v1.10-vs-v1.11
It boils down to adding Trixi.jl and running BenchmarkTools on an isolated function in Trixi.jl.

using Trixi
using BenchmarkTools

# 3D compressible Euler equations with ratio of specific heats gamma = 1.4
equations = CompressibleEulerEquations3D(1.4)

# DGSEM solver with a flux-differencing volume integral
volume_flux = flux_ranocha
solver = DGSEM(polydeg=3, surface_flux=flux_ranocha,
               volume_integral=VolumeIntegralFluxDifferencing(volume_flux))

coordinates_min = (0.0, 0.0, 0.0)
coordinates_max = (2.0, 2.0, 2.0)

trees_per_dimension = (4, 4, 4)

# Periodic P4estMesh of the cube [0, 2]^3 with 4^3 trees, refined once
mesh = P4estMesh(trees_per_dimension, polydeg=1,
                 coordinates_min=coordinates_min, coordinates_max=coordinates_max,
                 periodicity=true, initial_refinement_level=1)

semi = SemidiscretizationHyperbolic(mesh, equations, initial_condition_convergence_test,
                                    solver)

# Initial coefficients and scratch space for the right-hand side evaluation
u_ode = compute_coefficients(0.0, semi)
du_ode = similar(u_ode)

@benchmark Trixi.rhs!($du_ode, $u_ode, $semi, 0.0)

For this code, we get a median time of 6.079 ms on our workstation with Julia v1.10.4 vs. 7.200 ms with v1.11.0-rc1.

@sloede @ranocha @vchuravy @efaulhaber @LasNikas

@jishnub
Contributor

jishnub commented Jul 3, 2024

Would it be possible to profile this to identify the point of regression?

@benegee benegee changed the title Performance regression when updating from Julia v1.10.4 to v1.11.0-rc1 Performance regression up to 20% when updating from Julia v1.10.4 to v1.11.0-rc1 Jul 3, 2024
@benegee
Author

benegee commented Jul 3, 2024

Thanks for picking this up!

For a start, running @profile on 1000 iterations of rhs! and using PProf gives these flamegraphs (top: 1.10.4, bottom: 1.11-rc1):
[Flamegraph comparison: v1.10.4 (top) vs. v1.11.0-rc1 (bottom)]

We can of course dig deeper.
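
For reference, a minimal sketch of how such a profile can be collected (assuming the MWE from the first comment has already been run and PProf.jl is installed):

using Profile

Profile.clear()
@profile for _ in 1:1000
    Trixi.rhs!(du_ode, u_ode, semi, 0.0)
end

# PProf renders the collected samples as an interactive flamegraph in the browser.
using PProf
pprof()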

@gbaraldi
Member

gbaraldi commented Jul 3, 2024

The difference looks to be some getindex calls that now appear in the flamegraph.

@benegee
Author

benegee commented Jul 3, 2024

Yes! For 1.11 we see checkbounds beneath these calls, and it accounts for a fair share of the samples. For 1.10 they are not captured in the profile.

@KristofferC
Sponsor Member

Would be good to get a MWE that gets a bit closer to those suspected functions.

@gbaraldi
Member

gbaraldi commented Jul 3, 2024

The issue is not the checkbounds, btw. There's a whole getindex call that appears here; it might be Memory{T}. (Bottom screenshot is 1.11.)
[Profile screenshots comparing the two versions]

@ranocha ranocha added performance Must go faster kind:regression Regression in behavior compared to a previous version labels Jul 3, 2024
@ranocha
Member

ranocha commented Jul 3, 2024

Yeah, it indeed looks like the calls to get_contravariant_vector (get_contrav... in the flamegraph) before get_node_vars and to getindex after get_node_vars are much more expensive on Julia v1.11...

For reference: this code is called from

https://github.com/trixi-framework/Trixi.jl/blob/b4eef6d87d1e5a737ab31ad9c1e5bad293a19b4d/src/solvers/dgsem_structured/dg_3d.jl#L143-L156

get_contravariant_vector is defined in

https://github.com/trixi-framework/Trixi.jl/blob/b4eef6d87d1e5a737ab31ad9c1e5bad293a19b4d/src/solvers/dgsem_structured/dg.jl#L26-L29
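
For readers without the repository at hand, the linked definition is roughly of the following shape (a paraphrase pieced together from the snippets quoted later in this thread, not a verbatim copy):

# Paraphrased shape of the linked definition: build an SVector holding the
# `index`-th contravariant vector at one node, reading one entry per spatial
# dimension from the multi-dimensional array.
@inline function get_contravariant_vector(index, contravariant_vectors, indices...)
    SVector(ntuple(@inline(dim -> contravariant_vectors[dim, index, indices...]),
                   Val(ndims(contravariant_vectors) - 3)))
end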

@gbaraldi
Member

gbaraldi commented Jul 3, 2024

So there are two changes that might deserve testing here: 9aa7980, as reported in #53158 (comment), and #51720.

@oscardssmith
Member

I do think we might need to actually do the initialized memory zeroing so we can revert that change. It has caused a lot of regressions.

@sloede
Contributor

sloede commented Jul 4, 2024

Would be good to get a MWE that gets a bit closer to those suspected functions.

After running the code posted by @benegee above, you can directly benchmark get_contravariant_vector by executing the following lines:

@unpack contravariant_vectors = semi.cache.elements

ii = j = k = element = 1

@benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)

On our Rocinante (AMD Ryzen), this gives me the following:

v1.10.4

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 32.081 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.050 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.061 ns ±  0.309 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

            █ ▂
  ▂▂▁▂▂▁▁▁▁▃█▄█▇▃▅▅▂▂▂▂▁▂▁▁▂▁▁▂▁▁▁▁▂▂▂▂▂▁▁▁▂▂▁▂▂▁▂▂▂▂▁▂▂▂▂▁▂ ▂
  3.93 ns        Histogram: frequency by time        4.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

v1.11.0-rc1

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.147 ns … 15.917 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.167 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.186 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▃
  ▂██▂▂▅▃▆▂▂▂▁▁▁▂▂▁▂▂▁▁▁▂▁▁▁▁▁▂▁▂▁▁▁▁▁▁▂▁▂▁▂▁▂▁▁▂▂▁▂▂▁▁▁▁▂▁▂ ▂
  7.15 ns        Histogram: frequency by time        7.64 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@sloede
Contributor

sloede commented Jul 5, 2024

@gbaraldi @oscardssmith thanks for your input! It would be great if you could continue to help analyze the root cause and figure out what needs to be fixed on the Julia side, as we do not have the expertise for something this close to the metal.

I know everyone has a full plate, but with this already being v1.11 RC1, on our side multiple research groups are slowly getting into panic mode 😅. Losing the performance boost of the now-defunct --check-bounds=no was already hard to swallow, but a 20-30% drop in serial performance would be near catastrophic for us in the HPC world.

@vchuravy
Sponsor Member

vchuravy commented Jul 5, 2024

Julia 1.10 will likely become an LTS release once 1.11 has been released. We will of course see if we can fix this in a patch release, but performance regressions are not release blockers.

@sgaure

sgaure commented Jul 5, 2024

I've bisected this to 909bcea, the big Memory{T} PR.
There seems to be a problem with optimizing getindex with the new Memory{T}-based arrays. Some of the performance difference, but not all, seems to go away by rewriting get_contravariant_vector slightly.

@inline function get_contravariant_vector(index, contravariant_vectors, indices...)
    cv = @view(contravariant_vectors[:, index, indices...])
    SVector(ntuple(dim->cv[dim], Val(ndims(contravariant_vectors) - 3)))
end
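
To benchmark this rewrite side by side with the original, without patching Trixi.jl, it can be defined under a standalone name (the hypothetical sgaure_get_contravariant_vector used in the comparisons below):

# Standalone copy of the rewrite for benchmarking (hypothetical name);
# SVector is assumed to be in scope, e.g. via `using StaticArrays`.
@inline function sgaure_get_contravariant_vector(index, contravariant_vectors, indices...)
    cv = @view(contravariant_vectors[:, index, indices...])
    SVector(ntuple(dim -> cv[dim], Val(ndims(contravariant_vectors) - 3)))
end

@benchmark sgaure_get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)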

@vtjnash
Sponsor Member

vtjnash commented Jul 5, 2024

Seems mostly like a typo there too: since the original code is @inline(dim->contravariant_vectors[dim, index, indices...]), the @inbounds does not apply to any call expression. Though it was good to see that the code previously worked well without needing to manually override the correctness of the bounds checking.
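
A sketch of what this would look like if one did want the @inbounds to cover the indexing call (hypothetical name, untested; the @inbounds is placed directly on the indexing expression inside the closure):

# Sketch only: @inbounds placed directly on the indexing expression so that it
# actually applies to the getindex call (hypothetical name, untested).
@inline function get_contravariant_vector_inbounds(index, contravariant_vectors, indices...)
    SVector(ntuple(@inline(dim -> @inbounds(contravariant_vectors[dim, index, indices...])),
                   Val(ndims(contravariant_vectors) - 3)))
end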

@sloede
Contributor

sloede commented Jul 7, 2024

I've bisected this to 909bcea, the big Memory{T} PR. There seems to be a problem with optimizing getindex with the new Memory{T}-based arrays. Some of the performance difference, but not all, seems to go away by rewriting get_contravariant_vector slightly.

@inline function get_contravariant_vector(index, contravariant_vectors, indices...)
    cv = @view(contravariant_vectors[:, index, indices...])
    SVector(ntuple(dim->cv[dim], Val(ndims(contravariant_vectors) - 3)))
end

@sgaure Thanks a lot for taking a look. Your implementation does indeed recover some of the performance in v1.11.0-rc1. Interestingly, with your version v1.10.4 and v1.11.0-rc1 are equally fast, i.e., while it makes v1.11.0-rc1 faster, it actually slows down execution on v1.10.4.

On our Rocinante (AMD Ryzen Threadripper):

v1.10.4

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 20.820 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.030 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.054 ns ±  0.217 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                        █ ▆ ▅ ▃    ▅ ▂                   ▁ ▁ ▁
  ▃▁▅▁▆▁▃▁▁▁▁▁▁▁▁▁▁▁▁█▄▁█▁█▁█▁█▇▁█▁█▁█▁▄▁▄▁▁▇▁▃▁▄▁▃▁▁▄▁▄▁█▁█ █
  3.93 ns      Histogram: log(frequency) by time     4.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.780 ns … 16.460 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.780 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.795 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▄▁▁▁▁▁▁▁▁▁▁▁▁█▃▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  5.78 ns        Histogram: frequency by time        5.82 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

v1.11.0-rc1

julia> @benchmark Trixi.get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.147 ns … 17.998 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.157 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.172 ns ±  0.195 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█                                                         ▁
  ██▁█▁▃▁▃▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▃▁▁▁▃▁▃▁▁▄▁▄▁▃▁▄▁▄▁▁▅ █
  7.15 ns      Histogram: log(frequency) by time     7.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_contravariant_vector(1, $contravariant_vectors, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.770 ns … 12.080 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.800 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.819 ns ±  0.158 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁  █    ▁
  ▃█▁▄█▁▇▁▄█▁▅▂▂▁▁▂▂▁▁▁▁▂▁▂▂▁▂▁▂▁▁▁▂▁▁▁▂▁▁▁▂▁▂▁▂▁▁▂▁▁▂▁▂▂▁▁▂ ▂
  5.77 ns        Histogram: frequency by time        6.13 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

It thus seems that v1.10.4 is still able to optimize more of the code away in the original implementation, and v1.11.0-rc1 is not able to fully catch up. It's a start, but it does not explain why v1.11.0-rc1 cannot fully optimize the original code yet, and it would be nice to also get the missing ~50% of the performance back.

Furthermore, it would be nice to also understand why getindex starts showing up in the profile for v1.11.0-rc1.
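
One way to dig into that (a sketch; running the same calls on both Julia versions and comparing the output should show where the extra getindex / bounds-check work comes from):

# Sketch: inspect the generated code on v1.10.4 and v1.11.0-rc1, after the setup above.
using InteractiveUtils  # already loaded in the REPL

@code_typed Trixi.get_contravariant_vector(1, contravariant_vectors, ii, j, k, element)
@code_llvm debuginfo=:none Trixi.get_contravariant_vector(1, contravariant_vectors, ii, j, k, element)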

@sloede
Contributor

sloede commented Jul 7, 2024

Another interesting observation:

@sgaure I took your implementation for get_contravariant_vector and applied it also to get_node_vars:

@inline function sgaure_get_node_vars(u, equations, solver::DG, indices...)
    uv = @view(u[:, indices...])
    SVector(ntuple(idx->uv[idx], Val(nvariables(equations))))
end

with

u = Trixi.wrap_array(u_ode, semi)

ii = j = k = element = 1

@benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)

@benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)

it turns out that v1.10.4 and v1.11.0-rc1 are again equally fast and, to my surprise, both are faster than our original implementation on v1.10.4.

On our Rocinante (AMD Ryzen Threadripper):

v1.10.4

julia> @benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 14.730 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.930 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.943 ns ±  0.188 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▆▂▁▁▁▁▂▂▁▁▁▁▁▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  3.93 ns        Histogram: frequency by time        4.02 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.240 ns … 22.801 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.240 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.251 ns ±  0.220 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  3.24 ns        Histogram: frequency by time        3.26 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

v1.11.0-rc1

julia> @benchmark Trixi.get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.930 ns … 48.041 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.930 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.946 ns ±  0.457 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                   ▂
  █▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▃ ▂
  3.93 ns        Histogram: frequency by time        3.94 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sgaure_get_node_vars($u, $equations, $solver, $ii, $j, $k, $element)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.240 ns … 21.401 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.240 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.252 ns ±  0.214 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▄▁▂▁▂▁▁▁▁▁▁▂▁▂▁▂▁▁▂▁▂▁▂▁▂▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▂▁▂▁▂▁▁▂▁▂▁▂▁▂▁▂ ▂
  3.24 ns        Histogram: frequency by time         3.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, what's up with that? Seeing this, maybe you could take another look at get_contravariant_vector to massage it into recovering the original performance, @sgaure?

@gbaraldi
Member

gbaraldi commented Jul 7, 2024

The proper fix for this is to revive https://reviews.llvm.org/D136218 and make it correct.
