
codegen_level="min" and broadcasting with function arguments result in runtime instability check #57

Open · danielwe opened this issue Jul 22, 2024 · 17 comments

@danielwe (Contributor) commented Jul 22, 2024

Consider the following:

using DispatchDoctor: @stable

@stable default_codegen_level="min" function mymap!(f, y, x)
    for i in eachindex(y, x)
        y[i] = f(x[i])
    end
    return nothing
end

mymap! should be a fast, stable, and allocation-free function. Not so:

julia> using BenchmarkTools: @benchmark

julia> include("mymap.jl")

julia> @benchmark mymap!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.670 μs …  24.436 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.278 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.451 μs ± 885.102 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▄▁▃█▇▄▇█▆▃▄▄▃               ▁                              ▂
  ███████████████▇▇▆▅▅▄▄▄▄▅▇██████▇▇▇▆▇▆▆▇██████▆▅▇▆▅▅▅▄▄▄▂▄▅ █
  3.67 μs      Histogram: log(frequency) by time      8.25 μs <

 Memory estimate: 1.27 KiB, allocs estimate: 16.

The allocations disappear and performance improves by a factor of 40 in either of the following conditions:

  • Removing @stable
  • Using default_codegen_level="debug"
  • Adding a type variable for f, as in function mymap!(f::F, y, x) where {F}

The culprit is likely that the generated mymap! implementation does not specialize on the type of f, since f is never called in mymap!'s body but only passed to other functions (`specializing_typeof`, `mymap!_simulator`). Hence the stability check must be performed at runtime, leading to the observed performance and allocation penalty.
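For illustration, here is a plain-Julia sketch (hypothetical names, no DispatchDoctor involved) of the heuristic at play: a method that only passes `f` along to other functions may not specialize on it, while a type parameter forces specialization, which is what the third workaround exploits.

```julia
# `passes` only forwards `f` to another function, so Julia's heuristics
# may compile a single non-specialized method instance for it.
passes(f, x) = sum(f, x)

# A type parameter forces specialization on the function argument,
# mirroring the `f::F where {F}` workaround listed above.
forced(f::F, x) where {F} = sum(f, x)

passes(abs, [-1, 2])   # == 3
forced(abs, [-1, 2])   # == 3

# On recent Julia versions, Base.specializations(@which ...) reveals
# whether the compiled instance is for ::typeof(abs) or just ::Function.
```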

@danielwe commented Jul 22, 2024

If the function argument is only used in a broadcasting expression, the problem persists even with default_codegen_level="debug": broadcast syntax lowers to various higher-order calls with no direct call to f, so the explicit inlining of the function body at this codegen level doesn't help:

using DispatchDoctor: @stable

@stable default_codegen_level="debug" function mymap!(f, y, x)
    y .= f.(x)
    return y
end
julia> using BenchmarkTools: @benchmark

julia> include("mymap.jl")

julia> @benchmark mymap!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.436 μs … 27.096 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.610 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.052 μs ±  1.282 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▃▅▄▂▂▄▁         ▁                                        ▁
  ███████████▇▅▅▅▄▅████▇▆▆▇▇▇▇▆██▆▇▆▇▇▆▆▆▆▅▅▆▅▅▅▄▅▄▅▄▃▅▅▂▅▄▅ █
  4.44 μs      Histogram: log(frequency) by time       11 μs <

 Memory estimate: 1.41 KiB, allocs estimate: 20.

Once again, removing @stable or adding a type variable f::F fixes the problem. (The four extra allocations compared to the previous case remain when @stable is removed, as they are due to the dynamic dispatch incurred by the calls to the broadcasting machinery. They disappear when adding f::F, as they are ultimately caused by the same lack of specialization.)
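The claim that broadcasting only lowers to higher-order calls, never calling `f` directly, can be checked by inspecting the lowered code; a minimal sketch:

```julia
# Lowering `y .= f.(x)` produces calls to Base.broadcasted and
# Base.materialize!, each of which receives `f` as an argument; there is
# no direct call `f(...)` anywhere in the lowered code.
lowered = Meta.lower(Main, :(y .= f.(x)))
print(lowered)
```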

@MilesCranmer (Owner) commented Jul 22, 2024

I think this is a real issue, not the fault of DispatchDoctor, no? Julia does not specialize on function arguments unless you declare a type parameter like f::F: https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing.

These will not specialize:

f_func(f, num) = ntuple(f, div(num, 2))
g_func(g::Function, num) = ntuple(g, div(num, 2))

but this will:

h_func(h::H, num) where {H} = ntuple(h, div(num, 2))

If the function is getting inlined by the compiler, then performance will be saved (since it will know what f is), but otherwise, I think it's a real instability, and the runtime dispatch is inevitable here. If you expect inlining to always be performed, then you could add an @unstable to it?

@danielwe commented:

The problem is the runtime instability check, which tanks performance and adds allocations to a function that would otherwise be fast and allocation-free. It's not an instability: @code_warntype, DispatchDoctor, and every other mechanism agree that these functions are type-stable.

@danielwe commented:

Inlining can fix it, but you have to be explicit about it; the compiler doesn't inline even this tiny example by default:

@stable default_codegen_level="min" function mymap!(f, y, x)
    for i in eachindex(y, x)
        y[i] = f(x[i])
    end
    return y
end

f!(f::F, y, x) where {F} = mymap!(f, y, x)
g!(f::F, y, x) where {F} = @inline mymap!(f, y, x)
julia> @benchmark f!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.325 μs … 53.729 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.999 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.530 μs ±  1.657 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▂▆▅█▆▇▄▂▂    ▁ ▁▁▁▁▁▁▁▁▁                                  ▂
  ██████████▇▇▇▇███████████████▇▇██▇▇▆▆▆▆▅▇▇▇▆▅▆▅▅▆▆▆▆▆▄▅▅▅▆ █
  3.33 μs      Histogram: log(frequency) by time     11.5 μs <

 Memory estimate: 1.27 KiB, allocs estimate: 16.

julia> @benchmark g!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 972 evaluations.
 Range (min … max):  74.403 ns … 186.966 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     76.914 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   82.468 ns ±  15.298 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▅▆▅▁     ▂▂ ▁ ▁▁         ▁ ▁                               ▁
  ██████▇▆▄▃▅█████████▆▇▇██▇▆████▇▆▆▆▃▆▆▆▅▄▆▅▅▃▇▇▇▇▇▆▅▅▄▃▄▇▄▆▄ █
  74.4 ns       Histogram: log(frequency) by time       149 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@MilesCranmer commented Jul 22, 2024

I still think this is a real type instability, it's just that it occurs outside the purview of the analysis tools. Basically, you are calling a function that is unknown to the compiler, so the return value of the function could be a Tuple{} (indicating the function f threw an error) or Nothing. But each time DispatchDoctor or @code_warntype are given a particular instance of a function f, they can validate that the return value is indeed Nothing.

At runtime, DispatchDoctor and @code_warntype are given the precise value of f – and therefore are able to eliminate the Tuple{}. But since the value of f isn't known to the compiler, it has to do runtime dispatch.

@danielwe commented Jul 22, 2024

In the non-broadcasting case, the instability is caused by the extra layer of indirection that DispatchDoctor inserts when codegen_level="min". The function mymap! would otherwise specialize, since f is called within its body.

The broadcasting case is perhaps a bit more nuanced, but DispatchDoctor certainly adds overhead due to the runtime stability check.

@danielwe commented:

From the docs you linked: "Julia will always specialize when the argument is used within the method, but not if the argument is just passed through to another function."

@MilesCranmer commented:

On my machine, before adding the f::F to mymap!:

julia> @benchmark f!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.088 μs …  7.153 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.245 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.276 μs ± 177.738 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        █▇▅▇▁▁▂█▂▁▃  ▂  ▂
  ▁▁▁▁▂▄██████████████▇▇█▆▆▅▆▄▃▄▂▂▂▂▂▁▂▂▁▂▁▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  2.09 μs         Histogram: frequency by time        2.72 μs <

 Memory estimate: 1.27 KiB, allocs estimate: 16.

and after adding it:

julia> @benchmark f!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 943 evaluations.
 Range (min … max):  100.256 ns … 316.631 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     112.583 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   114.974 ns ±  20.494 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▆▇█
  ▃▃████▄▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  100 ns           Histogram: frequency by time          297 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@danielwe commented:

Yes. And if you remove both @stable and f::F, or remove just f::F and set codegen level to debug, those also look like the second case.

@MilesCranmer commented:

Oh I see what you mean. Yeah that's weird. Any idea why?

@danielwe commented:

I think so... Even without f::F, the function does specialize if f is actually called within the body of the method. That happens in all these cases except the original one, with codegen level "min" and no f::F.

@MilesCranmer commented Jul 22, 2024

I mean, I guess I wouldn't expect mymap!(f, x, y) = ... to be fast in the first place, due to the specialization rules. I feel like we are getting lucky that it's fast, but usually you would have to do the f::F where {F} trick to not incur runtime dispatch.

Sometimes the compiler is so smart it doesn't matter though. I guess on DispatchDoctor.jl's side it would be good to not get in the way of the compiler here though (hence this issue needing a solution).

Also, the quote you noted above doesn't refer to function objects as far as I am aware (otherwise we wouldn't need to use f::F ever). But I'm still a bit confused about this... maybe I just need sleep. Note that @code_warntype is known to lie sometimes: JuliaLang/julia#32834

@danielwe commented Jul 22, 2024

Also, the quote you noted above doesn't refer to function objects as far as I am aware

The way I read that paragraph I think it does refer to function arguments, which is why I assume mymap!(f, y, x) to be reliably performant. The lack of specialization is relevant when mymap! only passes f through to another function, for example, mymap!(f, y, x) = myactualmap!(f, y, x), and that function is not inlined. This is effectively what's happening in the broadcasting example.

The crucial difference is whether the body contains f(x) vs. otherfunction(f).

@MilesCranmer commented Jul 22, 2024

Very weird. That does make sense to me though, thanks for explaining it.

So I guess maybe Expr(:call, :map, :f, :x) doesn't count as "using" it whereas, e.g., Expr(:call, :f, :x) does? It feels like this shouldn't be true though, because sometimes functions are types... and the type might just get passed between functions rather than called explicitly. Weird indeed.

It does look like Julia is actually specializing here:

julia> function mymap!(f, y, x)
           for i in eachindex(y, x)
               y[i] = f(x[i])
           end
           return y
       end
mymap! (generic function with 1 method)

julia> Base.specializations(@which mymap!(identity, zeros(1000), zeros(1000)))
Base.MethodSpecializations(MethodInstance for mymap!(::typeof(identity), ::Vector{Float64}, ::Vector{Float64}))

(which doesn't lie, unlike @code_warntype).

Ok, so, I guess this is the issue? When using codegen_level="min", Julia thinks the function object is not "used" and therefore the wrapped function should not specialize to it. Unless you write f::F which forces things.

So really what we need is a way to have the caller function have the exact same specializations as the simulator function...

@MilesCranmer commented Jul 22, 2024

One idea:

The first iteration of DispatchDoctor.jl used closures, like:

function f(a, b, c)
    closure() = $body
    type_instability(Base.promote_op(closure)) && ....
    return closure()
end

The problem with this is that due to JuliaLang/julia#15276, if the closure modifies a, b, or c, you will create a type instability.
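For context, a minimal illustration of that boxing issue (hypothetical names): when a closure assigns to a captured variable, Julia stores it in a `Core.Box` and the enclosing function becomes type-unstable; a read-only capture is fine.

```julia
# The closure writes to the captured `a`, so `a` is boxed and its type
# can no longer be inferred precisely inside `boxed_sum`.
function boxed_sum(a, b)
    bump() = (a = a + 1; a)
    return bump() + b
end

# Read-only capture: no box, inference stays precise.
function unboxed_sum(a, b)
    bump() = a + 1
    return bump() + b
end

boxed_sum(1, 2)    # == 4, but with a Core.Box hidden in the closure
unboxed_sum(1, 2)  # == 4, fully inferred
```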

However, maybe the simulator function could just get stuck inside the main function, arguments and all. It's just tricky because we need to sanitize all arguments so there are no shared arguments between the main function and the closure – it should be like

function f(a, b, c)
    function closure(_a, _b, _c)
        let (a, b, c) = (_a, _b, _c)
            $body
        end
    end
    type_instability(Base.promote_op(closure, typeof.((a, b, c))...)) && ...
    return closure(a, b, c)
end

which might work. But it would require extreme care with the arguments and also type parameters. Maybe doable...

We'd also need to verify that this actually eliminates the problems discussed in the issue, or maybe specialization of the closure wouldn't affect specialization of the wrapper, or we'd be back to square one.
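As a concrete (hypothetical) instance of the sketch above for a specific two-argument function, with `Base.promote_op` and `isconcretetype` standing in for DispatchDoctor's internals:

```julia
# `closure` re-binds the arguments under fresh names so nothing is
# captured by reference (avoiding the boxing issue), and Base.promote_op
# simulates the body's return type so the check can run before the call.
function checked_add(a, b)
    function closure(_a, _b)
        let (a, b) = (_a, _b)
            a + b
        end
    end
    T = Base.promote_op(closure, typeof(a), typeof(b))
    # stand-in for DispatchDoctor's type_instability check:
    isconcretetype(T) || error("unstable return type: $T")
    return closure(a, b)
end

checked_add(1, 2.0)  # == 3.0
```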

@danielwe commented:

Ok, so, I guess this is the issue?

Thinking a bit more, an ideal DispatchDoctor would never alter anything about specialization behavior or inlining, and never add extra runtime work. However, as a matter of priority, maybe the most important thing is to preserve inlining heuristics as much as possible? In practice, that might take care of the most common cases where this issue is encountered.

function f(a, b, c)
    function closure(_a, _b, _c)
        let (a, b, c) = (_a, _b, _c)
            $body
        end
    end
    type_instability(Base.promote_op(closure, typeof.((a, b, c))...)) && ...
    return closure(a, b, c)
end

I don't think it's necessary to generate new names here. closure only shares names with the outer function if they aren't defined locally through the argument list or an assignment, so reusing names is not a problem (other than for readability, but macro-generated code can be excused for lacking in that department).

Technically speaking, this isn't a closure, since it doesn't capture any variables. It's just a regular function whose name is only defined in a local scope.

However, I think I tried doing exactly this in a potential DispatchDoctor contribution that never went anywhere, and I think the problem was that the instability check could never be compiled away. Perhaps a world-age issue of some sort?

@danielwe commented:

My first thought was to inject a type variable for every argument to the outer function that doesn't already have one, to force specialization there. That should solve static inference of the instability check, and as long as f_simulator preserves the original signature, the overall specialization behavior should be unchanged for codegen level "min". However, this would sometimes change specialization behavior for codegen level "debug".

Another idea is to factor out the instability check and error path into a separate function where specialization for every argument is always enforced. Perhaps this could increase the chance of having the stability check compiled away, and it might also reduce the effect of @stable on inlining heuristics since the net change is now only a single extra line in the method body (just spitballing here, I have no idea how the inlining heuristics actually work). I don't know how this would affect overall specialization behavior.
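The factored-out check might look something like this sketch (hypothetical names; `_check_stability` and the use of `Base.promote_op` are stand-ins for DispatchDoctor's actual machinery):

```julia
# A separate checker whose signature forces specialization on both the
# simulator and every argument, so the wrapped method body only needs to
# gain a single extra line.
@inline function _check_stability(simulator::F, args::Vararg{Any,N}) where {F,N}
    T = Base.promote_op(simulator, map(typeof, args)...)
    isconcretetype(T) || error("unstable return type: $T")
    return nothing
end

mymap_simulator(f, y, x) = (y .= f.(x); nothing)

function mymap_checked!(f, y, x)
    _check_stability(mymap_simulator, f, y, x)  # the single extra line
    y .= f.(x)
    return nothing
end
```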

One thing that does not work is adding @inline to the simulator call site for codegen level "min". Somehow that makes everything worse.

danielwe added a commit to danielwe/DispatchDoctor.jl that referenced this issue Jul 22, 2024
danielwe added a commit to danielwe/DispatchDoctor.jl that referenced this issue Jul 22, 2024