
codegen_level="min" and broadcasting with function arguments result in runtime instability check #57

Open · danielwe opened this issue Jul 22, 2024 · 17 comments

@danielwe (Contributor) commented Jul 22, 2024

Consider the following:

using DispatchDoctor: @stable

@stable default_codegen_level="min" function mymap!(f, y, x)
    for i in eachindex(y, x)
        y[i] = f(x[i])
    end
    return nothing
end

mymap! should be a fast, stable, and allocation-free function. Not so:

julia> using BenchmarkTools: @benchmark

julia> include("mymap.jl")

julia> @benchmark mymap!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.670 μs …  24.436 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.278 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.451 μs ± 885.102 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▄▁▃█▇▄▇█▆▃▄▄▃               ▁                              ▂
  ███████████████▇▇▆▅▅▄▄▄▄▅▇██████▇▇▇▆▇▆▆▇██████▆▅▇▆▅▅▅▄▄▄▂▄▅ █
  3.67 μs      Histogram: log(frequency) by time      8.25 μs <

 Memory estimate: 1.27 KiB, allocs estimate: 16.

The allocations disappear and performance improves by a factor of 40 in either of the following conditions:

  • Removing @stable
  • Using default_codegen_level="debug"
  • Adding a type variable for f, as in function mymap!(f::F, y, x) where {F}

The culprit is likely that the generated mymap! implementation does not specialize on the type of f, since f is never called in mymap!'s body but only passed to other functions (`specializing_typeof`, `mymap!_simulator`). Hence the stability check must be performed at runtime, leading to the observed performance and allocation penalty.
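For illustration, here is a plain-Julia sketch (hypothetical names, no DispatchDoctor involved) of the heuristic at play: a method that only passes `f` along to other functions may not specialize on it, while a type parameter forces specialization, which is what the third workaround exploits.

```julia
# `passes` only forwards `f` to another function, so Julia's heuristics
# may compile a single non-specialized method instance for it.
passes(f, x) = sum(f, x)

# A type parameter forces specialization on the function argument,
# mirroring the `f::F where {F}` workaround listed above.
forced(f::F, x) where {F} = sum(f, x)

passes(abs, [-1, 2])   # == 3
forced(abs, [-1, 2])   # == 3

# On recent Julia versions, Base.specializations(@which ...) reveals
# whether the compiled instance is for ::typeof(abs) or just ::Function.
```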

@danielwe commented Jul 22, 2024

If the function argument is only used in a broadcasting expression, the problem persists even with default_codegen_level="debug": broadcast syntax lowers to various higher-order calls with no direct call to f, so the explicit inlining of the function body at this codegen level doesn't help:

using DispatchDoctor: @stable

@stable default_codegen_level="debug" function mymap!(f, y, x)
    y .= f.(x)
    return y
end
julia> using BenchmarkTools: @benchmark

julia> include("mymap.jl")

julia> @benchmark mymap!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.436 μs … 27.096 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.610 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.052 μs ±  1.282 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▃▅▄▂▂▄▁         ▁                                        ▁
  ███████████▇▅▅▅▄▅████▇▆▆▇▇▇▇▆██▆▇▆▇▇▆▆▆▆▅▅▆▅▅▅▄▅▄▅▄▃▅▅▂▅▄▅ █
  4.44 μs      Histogram: log(frequency) by time       11 μs <

 Memory estimate: 1.41 KiB, allocs estimate: 20.

Once again, removing @stable or adding a type variable f::F fixes the problem. (The four extra allocations compared to the previous case remain when @stable is removed, as they are due to the dynamic dispatch incurred by the calls to the broadcasting machinery. They disappear when adding f::F, as they are ultimately caused by the same lack of specialization.)
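The claim that broadcasting only lowers to higher-order calls, never calling `f` directly, can be checked by inspecting the lowered code; a minimal sketch:

```julia
# Lowering `y .= f.(x)` produces calls to Base.broadcasted and
# Base.materialize!, each of which receives `f` as an argument; there is
# no direct call `f(...)` anywhere in the lowered code.
lowered = Meta.lower(Main, :(y .= f.(x)))
print(lowered)
```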

@MilesCranmer (Owner) commented Jul 22, 2024

I think this is a real issue, not the fault of DispatchDoctor, no? Julia does not specialize on function arguments unless you declare a type parameter like f::F: https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing.

These will not specialize:

f_func(f, num) = ntuple(f, div(num, 2))
g_func(g::Function, num) = ntuple(g, div(num, 2))

but this will:

h_func(h::H, num) where {H} = ntuple(h, div(num, 2))

If the function is getting inlined by the compiler, then performance will be saved (since it will know what f is), but otherwise, I think it's a real instability, and the runtime dispatch is inevitable here. If you expect inlining to always be performed, then you could add an @unstable to it?

@danielwe commented:

The problem is the runtime instability check, which tanks performance and adds allocations to a function that would otherwise be fast and allocation-free. It's not an instability: @code_warntype, DispatchDoctor, and every other mechanism agree that these functions are type-stable.

@danielwe commented:

Inlining can fix it, but you have to be explicit about it; the compiler doesn't inline even this tiny example by default:

@stable default_codegen_level="min" function mymap!(f, y, x)
    for i in eachindex(y, x)
        y[i] = f(x[i])
    end
    return y
end

f!(f::F, y, x) where {F} = mymap!(f, y, x)
g!(f::F, y, x) where {F} = @inline mymap!(f, y, x)
julia> @benchmark f!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.325 μs … 53.729 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.999 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.530 μs ±  1.657 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▂▆▅█▆▇▄▂▂    ▁ ▁▁▁▁▁▁▁▁▁                                  ▂
  ██████████▇▇▇▇███████████████▇▇██▇▇▆▆▆▆▅▇▇▇▆▅▆▅▅▆▆▆▆▆▄▅▅▅▆ █
  3.33 μs      Histogram: log(frequency) by time     11.5 μs <

 Memory estimate: 1.27 KiB, allocs estimate: 16.

julia> @benchmark g!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 972 evaluations.
 Range (min … max):  74.403 ns … 186.966 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     76.914 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   82.468 ns ±  15.298 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▅▆▅▁     ▂▂ ▁ ▁▁         ▁ ▁                               ▁
  ██████▇▆▄▃▅█████████▆▇▇██▇▆████▇▆▆▆▃▆▆▆▅▄▆▅▅▃▇▇▇▇▇▆▅▅▄▃▄▇▄▆▄ █
  74.4 ns       Histogram: log(frequency) by time       149 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@MilesCranmer commented Jul 22, 2024

I still think this is a real type instability, it's just that it occurs outside the purview of the analysis tools. Basically, you are calling a function that is unknown to the compiler, so the return value of the function could be a Tuple{} (indicating the function f threw an error) or Nothing. But each time DispatchDoctor or @code_warntype are given a particular instance of a function f, they can validate that the return value is indeed Nothing.

At runtime, DispatchDoctor and @code_warntype are given the precise value of f – and therefore are able to eliminate the Tuple{}. But since the value of f isn't known to the compiler, it has to do runtime dispatch.

@danielwe commented Jul 22, 2024

In the non-broadcasting case, the instability is caused by the extra layer of indirection that DispatchDoctor inserts when codegen_level="min". The function mymap! would otherwise specialize, since f is called within its body.

The broadcasting case is perhaps a bit more nuanced, but DispatchDoctor certainly adds overhead due to the runtime stability check.

@danielwe commented:

From the docs you linked: "Julia will always specialize when the argument is used within the method, but not if the argument is just passed through to another function."

@MilesCranmer commented:

On my machine, before adding the f::F to mymap!:

julia> @benchmark f!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.088 μs …  7.153 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.245 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.276 μs ± 177.738 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        █▇▅▇▁▁▂█▂▁▃  ▂  ▂
  ▁▁▁▁▂▄██████████████▇▇█▆▆▅▆▄▃▄▂▂▂▂▂▁▂▂▁▂▁▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  2.09 μs         Histogram: frequency by time        2.72 μs <

 Memory estimate: 1.27 KiB, allocs estimate: 16.

and after adding it:

julia> @benchmark f!(identity, $(zeros(1000)), $(zeros(1000)))
BenchmarkTools.Trial: 10000 samples with 943 evaluations.
 Range (min … max):  100.256 ns … 316.631 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     112.583 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   114.974 ns ±  20.494 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▆▇█
  ▃▃████▄▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  100 ns           Histogram: frequency by time          297 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@danielwe commented:

Yes. And if you remove both @stable and f::F, or remove just f::F and set codegen level to debug, those also look like the second case.

@MilesCranmer commented:

Oh I see what you mean. Yeah that's weird. Any idea why?

@danielwe commented:

I think so... Even without f::F, the function does specialize if f is actually called within the body of the method. That happens in all these cases except the original one, with codegen level "min" and no f::F.

@MilesCranmer commented Jul 22, 2024

I mean, I guess I wouldn't expect mymap!(f, x, y) = ... to be fast in the first place, due to the specialization rules. I feel like we are getting lucky that it's fast, but usually you would have to do the f::F where {F} trick to not incur runtime dispatch.

Sometimes the compiler is so smart it doesn't matter though. I guess on DispatchDoctor.jl's side it would be good to not get in the way of the compiler here though (hence this issue needing a solution).

Also, the quote you noted above doesn't refer to function objects as far as I am aware (otherwise we wouldn't need to use f::F ever). But I'm still a bit confused about this... maybe I just need sleep. Note that @code_warntype is known to lie sometimes: JuliaLang/julia#32834

@danielwe commented Jul 22, 2024

Also, the quote you noted above doesn't refer to function objects as far as I am aware

The way I read that paragraph I think it does refer to function arguments, which is why I assume mymap!(f, y, x) to be reliably performant. The lack of specialization is relevant when mymap! only passes f through to another function, for example, mymap!(f, y, x) = myactualmap!(f, y, x), and that function is not inlined. This is effectively what's happening in the broadcasting example.

The crucial difference is whether the body contains f(x) vs. otherfunction(f).

@MilesCranmer commented Jul 22, 2024

Very weird. That does make sense to me though, thanks for explaining it.

So I guess maybe Expr(:call, :map, :f, :x) doesn't count as "using" it whereas, e.g., Expr(:call, :f, :x) does? It feels like this shouldn't be true though, because sometimes functions are types... and the type might just get passed between functions rather than called explicitly. Weird indeed.

It does look like Julia is actually specializing here:

julia> function mymap!(f, y, x)
           for i in eachindex(y, x)
               y[i] = f(x[i])
           end
           return y
       end
mymap! (generic function with 1 method)

julia> Base.specializations(@which mymap!(identity, zeros(1000), zeros(1000)))
Base.MethodSpecializations(MethodInstance for mymap!(::typeof(identity), ::Vector{Float64}, ::Vector{Float64}))

(which doesn't lie, unlike @code_warntype).

Ok, so, I guess this is the issue? When using codegen_level="min", Julia thinks the function object is not "used" and therefore the wrapped function should not specialize to it. Unless you write f::F which forces things.

So really what we need is a way to have the caller function have the exact same specializations as the simulator function...

@MilesCranmer commented Jul 22, 2024

One idea:

The first iteration of DispatchDoctor.jl used closures, like:

function f(a, b, c)
    closure() = $body
    type_instability(Base.promote_op(closure)) && ....
    return closure()
end

The problem with this is that due to JuliaLang/julia#15276, if the closure modifies a, b, or c, you will create a type instability.
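For context, a minimal illustration of that boxing issue (hypothetical names): when a closure assigns to a captured variable, Julia stores it in a `Core.Box` and the enclosing function becomes type-unstable; a read-only capture is fine.

```julia
# The closure writes to the captured `a`, so `a` is boxed and its type
# can no longer be inferred precisely inside `boxed_sum`.
function boxed_sum(a, b)
    bump() = (a = a + 1; a)
    return bump() + b
end

# Read-only capture: no box, inference stays precise.
function unboxed_sum(a, b)
    bump() = a + 1
    return bump() + b
end

boxed_sum(1, 2)    # == 4, but with a Core.Box hidden in the closure
unboxed_sum(1, 2)  # == 4, fully inferred
```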

However, maybe the simulator function could just get stuck inside the main function, arguments and all. It's just tricky because we need to sanitize all arguments so there are no shared arguments between the main function and the closure – it should be like

function f(a, b, c)
    function closure(_a, _b, _c)
        let (a, b, c) = (_a, _b, _c)
            $body
        end
    end
    type_instability(Base.promote_op(closure, typeof.((a, b, c))...)) && ...
    return closure(a, b, c)
end

which might work. But it would require extreme care with the arguments and also type parameters. Maybe doable...

We'd also need to verify that this actually eliminates the problems discussed in the issue, or maybe specialization of the closure wouldn't affect specialization of the wrapper, or we'd be back to square one.
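As a concrete (hypothetical) instance of the sketch above for a specific two-argument function, with `Base.promote_op` and `isconcretetype` standing in for DispatchDoctor's internals:

```julia
# `closure` re-binds the arguments under fresh names so nothing is
# captured by reference (avoiding the boxing issue), and Base.promote_op
# simulates the body's return type so the check can run before the call.
function checked_add(a, b)
    function closure(_a, _b)
        let (a, b) = (_a, _b)
            a + b
        end
    end
    T = Base.promote_op(closure, typeof(a), typeof(b))
    # stand-in for DispatchDoctor's type_instability check:
    isconcretetype(T) || error("unstable return type: $T")
    return closure(a, b)
end

checked_add(1, 2.0)  # == 3.0
```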

@danielwe commented:

Ok, so, I guess this is the issue?

Thinking a bit more, an ideal DispatchDoctor would never alter anything about specialization behavior or inlining, and never add extra runtime work. However, as a matter of priority, maybe the most important thing is to preserve inlining heuristics as much as possible? In practice, that might take care of the most common cases where this issue is encountered.

function f(a, b, c)
    function closure(_a, _b, _c)
        let (a, b, c) = (_a, _b, _c)
            $body
        end
    end
    type_instability(Base.promote_op(closure, typeof.((a, b, c))...)) && ...
    return closure(a, b, c)
end

I don't think it's necessary to generate new names here. closure only shares names with the outer function if they aren't defined locally through the argument list or an assignment, so reusing names is not a problem (other than for readability, but macro-generated code can be excused for lacking in that department).

Technically speaking, this isn't a closure, since it doesn't capture any variables. It's just a regular function whose name is only defined in a local scope.

However, I think I tried doing exactly this in a potential DispatchDoctor contribution that never went anywhere, and I think the problem was that the instability check could never be compiled away. Perhaps a world-age issue of some sort?

@danielwe commented:

My first thought was to inject a type variable for every argument to the outer function that doesn't already have one, to force specialization there. That should solve static inference of the instability check, and as long as f_simulator preserves the original signature, the overall specialization behavior should be unchanged for codegen level "min". However, this would sometimes change specialization behavior for codegen level "debug".

Another idea is to factor out the instability check and error path into a separate function where specialization for every argument is always enforced. Perhaps this could increase the chance of having the stability check compiled away, and it might also reduce the effect of @stable on inlining heuristics since the net change is now only a single extra line in the method body (just spitballing here, I have no idea how the inlining heuristics actually work). I don't know how this would affect overall specialization behavior.
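The factored-out check might look something like this sketch (hypothetical names; `_check_stability` and the use of `Base.promote_op` are stand-ins for DispatchDoctor's actual machinery):

```julia
# A separate checker whose signature forces specialization on both the
# simulator and every argument, so the wrapped method body only needs to
# gain a single extra line.
@inline function _check_stability(simulator::F, args::Vararg{Any,N}) where {F,N}
    T = Base.promote_op(simulator, map(typeof, args)...)
    isconcretetype(T) || error("unstable return type: $T")
    return nothing
end

mymap_simulator(f, y, x) = (y .= f.(x); nothing)

function mymap_checked!(f, y, x)
    _check_stability(mymap_simulator, f, y, x)  # the single extra line
    y .= f.(x)
    return nothing
end
```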

One thing that does not work is adding @inline to the simulator call site for codegen level "min". Somehow that makes everything worse.

danielwe added a commit to danielwe/DispatchDoctor.jl that referenced this issue Jul 22, 2024
danielwe added a commit to danielwe/DispatchDoctor.jl that referenced this issue Jul 22, 2024