Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide inline-worthiness based on a more nuanced cost model #22210

Merged
merged 5 commits into from
Jul 8, 2017

Conversation

timholy
Copy link
Member

@timholy timholy commented Jun 4, 2017

In #22202 I noted a disadvantage of our current inlining heuristics: that we penalize verbosity even when two bits of code do the same thing. I decided to explore the feasibility of replacing our cost model with something based a little more closely on CPU cycles, with the idea that we should inline things that are obviously cheaper than a function call. I decided to be a bit generous and set a default threshold of 200 "cycles", and assigned costs to operations loosely inspired by this chart.

This is a horribly rough approximation: for example, there's a very large difference between success and failure when it comes to branch-prediction, and it might be reasonable to guess that the branch in

x < zero(x) && throw(DomainError())

is much less likely to be hit than

if x < rand()
    # option 1
else
    # option 2
end

Such subtleties (not to mention cache prefetch, etc) are far beyond the simple approach here. In general I tried to aim for an optimistic model, but worked in some amount of penalty for the possibility that this wouldn't hold, with the amount depending entirely on how I was feeling at that particular moment 😄.

It's also worth pointing out that I've done nothing whatsoever to test how well this works. My entire goal was to get julia to build.

This still has debugging statements in the submitted code because I can't imagine that this isn't going to need at least a teensy bit of revision 😉.

@timholy
Copy link
Member Author

timholy commented Jun 4, 2017

Just realized there's a really big hole in this: a for or while loop will only get counted as if its body is executed once. Probably it would be best to just not inline functions with such loops. I don't see a "loop detector" presently, but perhaps I'm just missing it. Or can one detect a loop simply by a goto to an earlier line?

@yuyichao
Copy link
Contributor

yuyichao commented Jun 6, 2017

Not an expert on inlining heuristic but I think using the number of operation in a function to decide inline-worthiness should be OK. It's perfectly fine to inline a small loop. As long as the inlining doesn't blow up the size of the parent function it shouldn't hurt much by inlining it. Obviously it might not help much but that's true in general and you probably need a much more involved cost model of the inlining or actually trying to inline it to tell that.

yuyichao added a commit that referenced this pull request Jun 6, 2017
The change to `isassigned(::SimpleVector, ...)` and `isassigned(::BitArray, ...)`
are currently no op since they will unlikely be inlined.
The `SimpleVector` version is also currently unsafe to inline.
These might change with a better inlining heuristic (#22210) and a smarter GC frame
allocation pass (#21888).
@timholy timholy changed the title WIP/RFC: Decide inline-worthiness based on a more nuanced cost model Decide inline-worthiness based on a more nuanced cost model Jun 6, 2017
@timholy timholy changed the title Decide inline-worthiness based on a more nuanced cost model (Do not merge) Decide inline-worthiness based on a more nuanced cost model Jun 6, 2017
@timholy
Copy link
Member Author

timholy commented Jun 6, 2017

OK, I've taken this further; a modified version of this (that had some debugging code) locally passed all tests but one. For demonstration purposes, I've also pushed a branch, teh/disable_inline_macros, that gets rid of all forced-inlining in Base; the point of this is to force the inliner to "stand on its own" and see what happens. I then rebased this PR on that branch. My original intention was to then run nanosoldier to compare the differences, but from my local testing there are so many differences that it's a bit overwhelming.

The bottom line seems pretty simple, though. Behold:
inline_perf

So while there are a few regressions, they are both outnumbered and outweighed by the improvements. (EDIT: each dot is one test in BaseBenchmarks.jl, >2000 tests overall.)

As one example, map(abs, t) where t = ntuple(identity, n) inlines (on its own) up to n=3 on master but n=14 on this branch; a more expensive function, map(sqrt, t), inlines up to n=5.

Now, one concern might be that this simply inlines everything, and that's cheating. What's the best way to assess such concerns? I checked the "resident size" in top, and compared to 0.6 (I didn't feel like rebuilding master) the size was pretty comparable.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 6, 2017

Am I reading it correctly that there's one case that shows 10^6 speed up?

Now, one concern might be that this simply inlines everything, and that's cheating.

Well, then all optimizing compilers have been cheating all along.... I'm pretty sure most C/C++ compiler do much more aggressive inlining than we do so I wouldn't call it cheating. The important thing is to measure the overhead. Maybe measuring the time to compile sysimg, or some other standard precompilation test cases. You can also measure the total size in inferenced IR by record a statistic of non-void expression in inference after finishing each functions and if you want more low level info you can also check size -A of sysimg and record the JIT code size by adding a counter in RTDyldMemoryManagerJL::allocateCodeSection in cgmemmgr.cpp.

@JeffBezanson
Copy link
Member

Wow, that's really cool. If you re-enable manual inlining declarations, do the regressions go away?

@JeffBezanson JeffBezanson added the performance Must go faster label Jun 6, 2017
@timholy
Copy link
Member Author

timholy commented Jun 6, 2017

Am I reading it correctly that there's one case that shows 10^6 speed up?

Yes, that's basically right; several others cases get better than 10^4 speed up. Or rather, "not slowed down"; I am sure that our performance annotations take care of all the egregious cases. I very much doubt we'll see much improvement compared to "real" master from this, but of course it's worth testing.

Maybe measuring the time to compile sysimg

Great idea.

task master "master" (with disabled @inline annotations) PR
build from scratch (incl. inference.jl) 2min 48s 2min 46s 3min 6s
build base/precompile.jl 1min 38s 1min 36s 1min 50s

So there's definitely a slowdown with this. Would be interested in people's thoughts on how bad this is.

If you re-enable manual inlining declarations, do the regressions go away?

Will test. But right now I'm testing how this fares against "real" master, with all of its performance annotations; I'm curious to know if we could scale back on our usage of @inline.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 6, 2017

I believe we also have a ENABLE_TIMING option in options.h which can give you a breakdown of the timing. That timing different doesn't sound too bad to me though compile time is also not the thing I care about the most....

@timholy
Copy link
Member Author

timholy commented Jun 7, 2017

Here's a comparison against master with all of its @inline annotations:

inline_vs_master

Definitely worse on balance, but better than I expected: to my eye, this branch without performance annotations is closer to master with annotations than it is to master with the annotations stripped out.

Obviously some of the regressions are pretty severe, but I've done literally no performance tuning here. Probably worth looking into some of the regressions by hand and seeing what I can come up with...in particular, the ones that have ~10-100ns of runtime on master seem likely to be just me assigning the wrong cost and it therefore failing to inline a pretty simple function.

@quinnj
Copy link
Member

quinnj commented Jun 7, 2017

I think it'd be really cool to expose the internal cost model here via profiling; I think it'd be a neat alternative to time/allocations/samples and another good signal in terms of identifying code performance bottlenecks and the relative "cost" of various code paths. Sometimes you don't realize the relative costs of various parts of a workflow or have some kind of type instability, all of which seems like would stick out with seeing this kind of data.

@timholy
Copy link
Member Author

timholy commented Jun 8, 2017

@quinnj, the cost model is almost certainly not accurate enough to expose to users; it doesn't take into account looping, and the cost of non-inferrability is handled crudely.

However, it can be useful to understand why a particular function didn't get inlined. I'll document how to call statement_cost for a function body.

@timholy
Copy link
Member Author

timholy commented Jun 9, 2017

So, I'm thinking of putting this on the back burner until #21888 merges; I'm seeing quite a few issues that I hope get resolved by better GC marking. The last commit here (c981dd1) is just one example; I'm also seeing a few big performance regressions that seem to occur because we allocate when we don't need to. A good example is mean(rand(10^3)), where I'm getting allocations due to unnecessarily boxing the blksize argument. I'm pretty sure this is not caused by my changes here, but it does get triggered by these changes.

Of course the question is whether its benefits on LLVM 3.9 will fix these issues, or whether it will have to wait until we move to LLVM 5.

@timholy
Copy link
Member Author

timholy commented Jun 9, 2017

Speaking of bugs triggered by this, https://travis-ci.org/JuliaLang/julia/builds/240808331 hit an assertion in gf.c. (It was the Int version that was a subtype of the Int... version of that function.) I worked around it by declaring that argument as Vararg{Int,N}, but it suggests a bug elsewhere.

@timholy timholy changed the title (Do not merge) Decide inline-worthiness based on a more nuanced cost model Decide inline-worthiness based on a more nuanced cost model Jun 18, 2017
@timholy
Copy link
Member Author

timholy commented Jun 18, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@KristofferC
Copy link
Member

Cool stuff!

This greatly reduces the cost of failure-to-inline
This switches to a model in which we estimate the runtime of the
function; if it's fast, then we should inline it. The estimate of
runtime is extremely crude, and doesn't even take loops into account.
@timholy
Copy link
Member Author

timholy commented Jul 7, 2017

If this makes it through CI, I'll merge in the morning.

@vchuravy
Copy link
Member

vchuravy commented Jul 8, 2017

Appveyor is failing due to libgit2-online. Do all the intermediate commits pass tests as well?

@timholy
Copy link
Member Author

timholy commented Jul 8, 2017

Yes, they do. (EDIT: only the last commit addresses the real topic of this PR. The earlier ones compensate for issues I detected at various points along the way while developing this. As with the @noinline around the throws, it's quite likely that they aren't strictly necessary because the final version of the inference change solved the problems more systematically. But neither are they problematic either, and in some ways represent improvements on their own.)

@timholy timholy merged commit df63db6 into master Jul 8, 2017
@timholy timholy deleted the teh/inlining_cost_model branch July 8, 2017 08:18
@KristofferC
Copy link
Member

Yay!

@martinholters
Copy link
Member

Hm, looks like this makes the ACME.jl tests crash. Not sure if the problem was introduced here or just uncovered by different inlining. I haven't really gotten anywhere investigating the issue, but somewhat up the call stack of a sightly reduced case, I found this:

#16 0x00007ffff7574ed4 in jl_call_fptr_internal (fptr=0x7fffffffb940, meth=0x7fffe999ec90, args=0x7fffffffbb48, nargs=2)
    at ./src/julia_internal.h:369
369	        return fptr->fptr1(args[0], &args[1], nargs-1);
(gdb) call jl_(args[0])
getfield(Base, Symbol("##494#495")){Base.MethodError}(ex=Base.MethodError(f=typeof(Base.:(*))(), args=<?#0x7fffe906f230::Core.Inference.InferenceParams(world=0x0000000000005677, inlining=true, inline_cost_threshold=100, inline_nonleaf_penalty=1000, inline_tupleret_bonus=400, MAX_METHODS=4, MAX_TUPLETYPE_LEN=15, MAX_TUPLE_DEPTH=4, MAX_TUPLE_SPLAT=16, MAX_UNION_SPLITTING=4, MAX_APPLY_UNION_ENUM=8)>, world=0x0000000000005677))

Note that the type of (inner) args is an instance of InferenceParams. Of course, this could also just be a random artifact of some memory corruption. Does anyone have any ideas or has seen anything similar?

@timholy
Copy link
Member Author

timholy commented Jul 11, 2017

As long as it doesn't crash in inline_worthy or statement_cost, there's a sense in which this can't introduce new bugs: the only thing it changes is the value of a single boolean flag, the decision to inline or not. It doesn't change anything about how that inlining is performed.

However, it's perfectly capable of exposing old bugs. Indeed it's already done so: at least #22516, and (at an intermediate stage) what is probably the same bug as #22592.

I've not seen that one before, though.

@martinholters
Copy link
Member

Seems indeed to be a missing GC root again, ref. #22770 (comment).

@tkelman
Copy link
Contributor

tkelman commented Jul 12, 2017

I'll write up a devdocs on this once it's been reviewed.

Sometime later then if you have time?

vtjnash added a commit that referenced this pull request Jul 9, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
vtjnash added a commit that referenced this pull request Jul 13, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
vtjnash added a commit that referenced this pull request Jul 20, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
vtjnash added a commit that referenced this pull request Sep 1, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
vtjnash added a commit that referenced this pull request Sep 3, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
c42f pushed a commit that referenced this pull request Sep 23, 2020
Added in #22210 (and earlier begun in #20853), this is no longer
necessary to avoid heap allocations, and thus serves little purpose now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.