new call site inference algorithm #34742
This table compares three algorithms within Julia 1.4-rc1. The first 4 rows are the standard algorithm. [table and column-meaning legend omitted] Note: I excluded the patch that fixes #34098.
Patch for algorithm A:

--- a/base/compiler/abstractinterpretation.jl
+++ b/base/compiler/abstractinterpretation.jl
@@ -98,6 +98,10 @@ function abstract_call_gf_by_type(@nospecialize(f), argtypes::Vector{Any}, @nosp
this_rt === Any && break
end
else
+ if max_methods != -1 && i > 1 && (!isdispatchtuple(sig) && !isdispatchelem(rettype))
+ rettype = Any
+ break
+ end
this_rt, edgecycle1, edge = abstract_call_method(method, sig, match[2]::SimpleVector, multiple_matches, sv)
edgecycle |= edgecycle1::Bool
if edge !== nothing

Patch for algorithm B:

--- a/base/compiler/abstractinterpretation.jl
+++ b/base/compiler/abstractinterpretation.jl
@@ -98,7 +98,17 @@ function abstract_call_gf_by_type(@nospecialize(f), argtypes::Vector{Any}, @nosp
this_rt === Any && break
end
else
- this_rt, edgecycle1, edge = abstract_call_method(method, sig, match[2]::SimpleVector, multiple_matches, sv)
+ sparams = match[2]::SimpleVector
+ if max_methods != -1 && multiple_matches && !isdispatchtuple(sig)
+ _sig = sig = method.sig
+ _sp = []
+ while _sig isa UnionAll
+ push!(_sp, _sig.var)
+ _sig = _sig.body
+ end
+ sparams = svec(_sp...)::SimpleVector
+ end
+ this_rt, edgecycle1, edge = abstract_call_method(method, sig, sparams, multiple_matches, sv)
edgecycle |= edgecycle1::Bool
if edge !== nothing
push!(edges, edge)

Other suggestions welcome!
Is there a way to quantify the performance impact of incompletely-inferred chunks of code? For instance, if I have a large chunk of code with many highly-overridden methods, I imagine that I will run afoul of the max_methods limit.

The columns you've listed so far seem like they would capture that time difference, but I'm not sure how much of the variation is due to inference taking longer versus runtime being slower. Perhaps what we can do is hook up the timing profiler to a known workload, then find all backtraces that point to the dynamic dispatch locations, to get a measurement of the % of time spent in dispatch overhead versus actually running the user workload?
Yes, good point, these numbers don't really say much about run-time performance impact. The cases measured here are basically all compile time. However, these changes only affect code where we were already unable to infer concrete types, so dynamic dispatch was already happening. "Fast"/fully-statically-typed code should not be affected. The bad case would be where inferring a more accurate abstract type somewhere allows much more accurate results elsewhere. That does happen, but it's fairly rare. I doubt we have many (or any) benchmarks that capture that kind of case, since we only tend to benchmark things that we feel are or should be really highly optimized.
I thought that the rows in your table where max_methods is reduced would result in many more cases of dynamic dispatch.
It could cause more dynamic dispatch, but why many more cases? Maybe give an example of a change in how code might be optimized, and we can discuss which rows of the table could cause it? If a call has multiple targets (at inference time), it would already ~always have been a dynamic dispatch. So performance will only change if (1) there were 4 or fewer possible targets, (2) inferring all of those and unioning their return types gave useful type information, and (3) subsequent calls using the result could be narrowed down to 1 target based on that information. While that is definitely possible, I don't think it's common. Put differently, if this change slows down your code, then a package somewhere adding a 5th matching method would also have slowed down your code, so you're in a bit of a fragile position anyway. Is this the sort of situation you're thinking of?
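To make those three conditions concrete, here is a minimal hedged sketch (all function names here are hypothetical, not from the thread):

```julia
# Two matching methods whose unioned return type is small and useful:
getval(x::Int)     = x
getval(x::Nothing) = nothing

handle(x::Int)     = x + 1
handle(x::Nothing) = 0

# `v[1]` is only known as `Any`, so `getval(v[1])` has multiple possible
# targets and is a dynamic dispatch either way. But inferring both targets
# and unioning their return types gives Union{Int, Nothing}, which lets the
# *next* call, `handle(...)`, be union-split into two static calls.
process(v::Vector{Any}) = handle(getval(v[1]))

# If inference stops short of that union (e.g. max_methods is lowered to 1,
# or enough unrelated `getval` methods are added to exceed the limit), the
# intermediate result degrades to `Any` and `handle` becomes a dynamic
# dispatch as well.
```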
Yes, that's precisely what I'm thinking. I don't have an intuition as to where this is or is not the case; perhaps most functions have either 1 method or many more than 4; I have no idea. I don't want to drag this issue off-topic, since I think this is more focused on compile-time performance than runtime performance, but I did my best to use the toy example from above to construct a pathological case. Here we have a function that is expected to do roughly nothing, so that we can attempt to see the effect of dynamic dispatch on the runtime performance:

using InteractiveUtils, BenchmarkTools, Profile, Printf
f(::Int) = 0
f(::String) = 0
f(::Dict) = 0
f(::Array) = 0
function f_loop(N)
x = Any[1][1]
accum = 0
for idx in 1:N
accum += f(x)
end
return accum
end

Benchmarking this with the following harness:

function do_benchmark(N)
ci, rettype = @code_typed f_loop(N)
println(" -> Inferred as returning: $(rettype)")
# Benchmark it
perf = @benchmark f_loop(N)
min_time = minimum(perf).time
println(@sprintf(" -> benchmark minimum: %.2f ms", min_time/1e6))
# Also do a Profile-based estimation of the amount of time we spend in `jl_apply_generic`:
Profile.clear()
@profile f_loop(N)
prof_data, lidict = Profile.retrieve()
lilist, inclusive, self, num_snapshots = Profile.parse_flat(Profile.StackFrame, prof_data, lidict, true)
# The following line is probably not 100% correct; but it gives us a reasonable number, so I'm keeping it.
num_gf_snapshots = sum([self[idx] for idx in 1:length(lilist) if basename(string(lilist[idx].file)) == "gf.c"])
gf_proportion = num_gf_snapshots/num_snapshots
println(@sprintf(" -> estimated gf overhead percentage: %.2f%%", 100*gf_proportion))
return min_time, gf_proportion
end

And invoking it with:

N = 10000000
println("Four methods run of f_loop():")
fast_time, fast_gf_proportion = do_benchmark(N)
f(::Tuple) = 0
println("Five methods run of f_loop():")
slow_time, slow_gf_proportion = do_benchmark(N)
println(@sprintf("Estimated dynamic dispatch overhead per call: %.2f ns", (slow_time - fast_time)/N)) This results in the following timings:
[timing output omitted]

Now I don't think that this per-call dispatch overhead is huge in absolute terms, but it gives a rough sense of the runtime cost a call site pays when it falls back to dynamic dispatch.
Would it make sense to keep a table that tracks some kind of "widest possible return type" for each function, and return that (instead of Any) when the max_methods limit is exceeded? In the example from the top post, the widest possible return type would remain a useful bound even after the fifth method is added. A lot of functions probably already have well-defined output types. The drawback would be that adding a method to a function potentially invalidates the widest possible type not only for that function, but also recursively for any function that calls it. However, each function's entry would be updated a maximum of 2-3 times (e.g. concrete type -> small type union -> Any), so the total amount of work would still be linear in the number of functions.
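A very rough sketch of what such a per-function table and its recursive invalidation might look like (hypothetical names and data structures, purely to illustrate the suggestion; nothing here reflects the actual compiler):

```julia
const widest_rt  = IdDict{Any,Any}()           # function => widest known return type
const rt_callers = IdDict{Any,Vector{Any}}()   # function => callers whose entries used it

# Widen `f`'s entry with a newly observed return type.
function widen_rt!(f, @nospecialize(rt))
    old = get(widest_rt, f, Union{})
    widest_rt[f] = Union{old, rt}
    return widest_rt[f]
end

# Adding a method may widen `f`, which recursively invalidates any caller
# whose cached widest type was computed from `f`'s old entry.
function invalidate_widest!(f)
    haskey(widest_rt, f) || return nothing
    delete!(widest_rt, f)
    foreach(invalidate_widest!, get(rt_callers, f, Any[]))
    return nothing
end
```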
Perhaps I'm misunderstanding the suggestion, but the trouble is that you have to analyze all of the methods that could possibly be called in order to know the widest possible return type. It cannot return anything narrower than Any until all of those methods have been analyzed.
That idea does make sense. Actually it's similar to my "algorithm B" above, which infers methods on their defined signatures. Those are all cached, so once they have been inferred, each one is just a lookup. Of course that is still O(n) in the number of methods, so additionally caching the overall type might be an improvement.
Let me see if I'm understanding algorithm B. Currently, inference is run on each method for the tightest type that is known for the possible arguments, and the return type and everything else is inferred based on that. A call's argument types must have non-empty intersection with a method's signature in order for that method to be considered—since otherwise it's not applicable—but you can't just reuse the inference because it is, in general, bespoke to a given call site. The idea behind algorithm B is to run inference for the types of the method's signature, which are more general than the types at the call site intersected with the method's signature—and therefore potentially less informative—with the upside being that it can be cached and reused per method instead of per call site.
Yes, exactly.
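To make the distinction concrete, here is a small hypothetical example (the method and the call-site types are made up for illustration):

```julia
# A method with an abstract declared signature:
describe(x::Number, y::AbstractString) = string(x) * ": " * y

# Suppose a call site knows its arguments only as (Real, Any).
# Current approach: infer the intersection of call-site types and signature,
#     Tuple{typeof(describe), Real, AbstractString}
# which is specific to this call site.
#
# Algorithm B: infer the method's own declared signature,
#     Tuple{typeof(describe), Number, AbstractString}
# which may be less precise, but is the same for every call site that can
# reach this method, so the inference result can be cached once per method.
```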
I guess that what I suggested is very similar to algorithm B, but doing the extra work of computing the widest type per function (as if m_m were ∞). In the useful cases, having a ceiling on the return type would speed up inference. Whether the net result is positive or negative is hard to guess. Widest types could also be computed eagerly during module precompilation and/or in a background thread. (As can algorithm-B return types.)
We're doing some early experiments trying to compute and store the widest possible return type for each function. So far it seems to be quite expensive --- inferring all methods (and often on abstract argument types) does more work in many cases than we do now. Currently, we're usually able to avoid inferring many methods entirely, and when we do infer them it's often on concrete types, which is more efficient. So this might not be viable.
Currently one of the rules for programming in Julia is that type annotations in method signatures are only for dispatch, and beginners are regularly corrected in their belief that they have any impact on performance. Wouldn't algorithm B invalidate this rule, and "force" us (when maximal performance is sought) to write concretely-typed signatures for some functions when abstractly-typed signatures would lead to non-concretely inferred code? (Sorry if I totally misunderstood this topic.)
What about computing the widest return type for each concrete method signature and doing so lazily? (Perhaps that’s not any different from what is currently being done.)
That can already happen. Method signatures can affect inference precision. Say you have a call whose argument types aren't known concretely and the matching method has an abstract signature: the result is then inferred from that abstract signature, so declaring a tighter signature could give a more precise result. But in that scenario, the performance issue is the call site itself, which was already not concretely inferred. Anyway, that rule of thumb only exists to counter the first impression people sometimes get that "my code will be really slow unless I declare types on all arguments"; it's not a formal property.
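A hedged illustration of the point that signature tightness can affect inference precision, while the call site itself is the real performance issue (hypothetical functions):

```julia
addone_abs(x::Integer) = x + 1   # abstract signature
addone_con(x::Int)     = x + 1   # concrete signature

caller_abs(v::Vector{Any}) = addone_abs(v[1])
caller_con(v::Vector{Any}) = addone_con(v[1])

# Inferring `addone_con` on its concrete signature yields exactly Int.
# Inferring `addone_abs` on `Integer` may not yield a concrete type, since
# `x + 1` has many applicable methods when `x` is only known as `Integer`.
# Either way, the call sites above are already dynamic dispatches, because
# `v[1]` is only known to be `Any` -- which is the point made above.
```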
Yes; when we're lucky enough that a method signature is concrete we will infer it as soon as we see a possible call to it, and cache the result. Unfortunately most functions have some methods with abstract signatures.
To elaborate on my suggestion, since it's kind of vague, the idea here is that instead of intersecting type information from the call site and the method that might get called, you broaden the type information from the call site to the signature of the method and then infer and cache that, the premise being that different call sites might project onto the same type information for the same method. In writing that, I guess this can be generalized: any way of projecting call site information onto a function or method in a way that is guaranteed to contain the actual return type could be used. The question is what projection would work best in the sense of being often reusable and giving usable type information.
I think that's basically algorithm B.
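One way to picture that generalization (a toy sketch; `project_b` is hypothetical and only restates algorithm B's choice of projection):

```julia
# A "projection" maps (call-site signature, method) to a wider signature to
# infer and cache. It is sound as long as the projected signature is at least
# as wide as the actual argument types, so the inferred return type
# over-approximates the true one.

# Algorithm B's projection: ignore the call-site types and use the method's
# own declared signature.
project_b(@nospecialize(callsite_sig), m::Method) = m.sig
```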
A few things that could make the widest-type computation faster (I don't know if any of these have already been tried): [list omitted]
Closes #23210
I added algorithm C to the table above. This sets max_methods to 1, unless the function has fewer than 5 methods, in which case we use max_methods == 4. The intuition is that it's more valuable to infer multiple methods if that means inferring all methods. (My implementation of this isn't perfect but should be enough to estimate performance.) This could be a cheap way to lessen the impact of lost inference precision. Otherwise I would almost recommend just setting max_methods to 1; it's definitely the fastest and is easy to explain and non-arbitrary. I also explored trying to do even less inference than max_methods==1, by recursively inferring only if the known argument types are concrete. That made compilation significantly slower, which contributes to my sense that max_methods==1 is fairly optimal.
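A sketch of that heuristic as described (simplified; not the actual compiler code, which works on internal data structures):

```julia
# Use max_methods == 4 only when that means inferring *all* of a function's
# methods; otherwise fall back to inferring a single target.
function effective_max_methods(f; default_max::Int = 4)
    return length(methods(f)) < 5 ? default_max : 1
end
```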
Here are some new numbers and a new candidate algorithm. I had to take a new baseline because unfortunately most of the times have regressed over the past couple months. Algorithm D uses max_methods==4 like we do now, except it bails out immediately if more than one intersected signature is abstract. The idea is that it's basically always useful and OK to infer concrete signatures, since you will probably need them eventually, but the number of abstract signatures we infer per call site should be limited.
[updated results table omitted]

TTFP = time to first plot
What's "pre" in the chart?
Time to generate precompile statements (see above)
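A simplified sketch of the algorithm D bail-out described above (`sigs` stands in for the intersected signatures at a call site; this is not the real compiler code):

```julia
function bail_out_algorithm_D(sigs::Vector{Any}; max_methods::Int = 4)
    length(sigs) > max_methods && return true
    # Concrete (dispatch-tuple) signatures are cheap and almost always worth
    # inferring; give up only when more than one signature is abstract.
    nabstract = count(sig -> !Base.isdispatchtuple(sig), sigs)
    return nabstract > 1
end
```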
I'd love it if we could get away from a max # of methods altogether, since that would make the compiler completely predictable. To me it seems that the most important thing about inferring abstract calls is the predictability that the result type gives for future calls, i.e., in h(c) = g(f(c[1])) the benefit of successfully deducing the return type of all methods of f is that the subsequent call to g can be resolved.

So here's a more radical proposal: what if we reserve union-splitting, as it currently exists, for cases where inference returns an explicit Union (with all concrete types) for the intermediate result, and otherwise defer the choice to runtime, where that runtime lookup determines which compiled specialization gets used? Then, as soon as you match the signature triggering the dispatch, you'd know exactly which code to run.

Not saying this is easy to implement; when you have multiple objects that end up determining how calls dispatch, I wonder if this too might blow up. But such strategies seem to deserve at least a moment's reflection. A feature of this proposal is that we'd never run inference with "intermediate" abstract types: all argument types would either be concrete or would act like Any.
An alternative implementation of this general idea might be to support "method fragments." Using the example above, the compiled version of h might essentially be

function h(c::Vector{Any})
x = c[1]
_h_fragment(x)
end
_h_fragment(x) = g(f(x))

You dynamically look up which instance of _h_fragment to call based on the runtime type of x. If h were more complicated, e.g.

function h(c)
s = 0
for x in c
y = f(x)
z = g(y)
s += k(y, z)
end
return sin(s) + 2*length(c)
end

then the compiled version would look like

function h(c::Vector{Any})
s = 0
for x in c
s = _h_fragment1(s, x)
end
return _h_fragment2(s, 2*length(c)) # 2*length(c) could either stay in the "parent" or move to fragment (its result is inferrable)
end
function _h_fragment1(s, x)
y = f(x)
z = g(y)
return s + k(y, z)
end
_h_fragment2(s, m) = sin(s) + m

On balance I suspect method fragments would be more performant than "multiple lookup," while perhaps also being easier to implement. You basically terminate inference as soon as you get to a non-concrete type and split out the remainder of the lowered code, respecting scope and block boundaries. Inference still needs to handle abstract types to compute return types, but most type-inferred method bodies with non-concrete ssavalues become pretty short. The concern is this might generate too many methods, but since they would be shorter I wonder if that would be a smaller problem than it is now?
This wouldn't have to affect the size of the global method table, since each function could have its own "fragment table", right? (Fragments cannot be called from anywhere else.) If this is implemented, then the user would no longer have to worry about inserting function barriers as the compiler would do it automatically!
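For reference, this is the kind of manual function barrier the comment says the compiler could insert automatically (a standard pattern, shown here with made-up names):

```julia
function process(v::Vector{Any})
    x = v[1]          # x is only known as Any here
    return kernel(x)  # one dynamic dispatch; `kernel` then runs code
end                   # specialized for the runtime type of x

kernel(x) = x + 1
```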
This issue elaborates on parts of #33326.
One of the most crucial choices in our type inference algorithm is what to do when a call site has multiple possible targets (i.e. the inferred argument types aren't specific enough to know exactly which method would be called). Currently, we simply infer up to 4 possible targets, and if more than 4 match we give up and say Any. So, for example, you can get the kind of behavior sketched below.

This is clearly non-ideal, since the inferred type changes abruptly when a fifth unrelated method is added. To some extent this is unavoidable --- any inference of call sites with inexact argument types might change if a method is added or its signature changes. But a new algorithm could still do better in three ways: (1) be less arbitrary and more intuitive, (2) make inference generally faster, (3) reduce the likelihood of inference "blowups" as in issues like #34098, #34681.
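A minimal sketch of the kind of behavior described above (the original example was not preserved in this copy; these definitions are hypothetical):

```julia
f(::Int)     = 0
f(::Float64) = 0
f(::String)  = 0
f(::Symbol)  = 0

g(v::Vector{Any}) = f(v[1])
# With 4 matching methods, inference unions their return types:
#   Base.return_types(g, (Vector{Any},))   # [Int]

f(::Tuple) = 0   # a fifth, unrelated method
#   Base.return_types(g, (Vector{Any},))   # now [Any]
```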
I have tried a couple alternate algorithms; data incoming!