improve optimization passes to produce more compact IR #20853
Conversation
base/inference.jl
Outdated
@@ -305,7 +305,8 @@ const workq = Vector{InferenceState}() # set of InferenceState objects that can

 #### helper functions ####

-@inline slot_id(s) = isa(s, SlotNumber) ? (s::SlotNumber).id : (s::TypedSlot).id # using a function to ensure we can infer this
+@inline slot_id(s::Slot) =
type-inference should be good enough now that this won't cause a regression, but I'm not sure it's ideal to require it
Why? How could adding this declaration cause a regression?
it blocks inlining, turning the `isa` tests into a dynamic dispatch, unless we've correctly inferred that the value is `<: Slot`
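For context, here is a sketch of the two `slot_id` variants being compared. It assumes the `Slot`, `SlotNumber`, and `TypedSlot` types from `base/inference.jl` at the time, each carrying an `id` field; this is an illustration of the trade-off, not the exact committed code.

```julia
# Untyped variant: whether the isa branch folds away depends on
# inference proving the argument's concrete type after inlining;
# if that fails, the isa test becomes a runtime check.
@inline slot_id(s) = isa(s, SlotNumber) ? (s::SlotNumber).id : (s::TypedSlot).id

# Typed variant: the ::Slot declaration restricts the argument up front,
# so the isa test can always be resolved once the call is inlined --
# but callers must already have inferred the value <: Slot, or the
# declaration itself forces a dynamic check at the call site.
@inline slot_id(s::Slot) = isa(s, SlotNumber) ? (s::SlotNumber).id : (s::TypedSlot).id
```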
@@ -3359,7 +3360,8 @@ function is_pure_builtin(f::ANY)
     f === Intrinsics.checked_srem_int ||
     f === Intrinsics.checked_urem_int ||
     f === Intrinsics.check_top_bit ||
-    f === Intrinsics.sqrt_llvm)
+    f === Intrinsics.sqrt_llvm ||
+    f === Intrinsics.cglobal) # cglobal throws an error for symbol-not-found
all of them can throw errors
True... it's just that the BLAS code relies on a `cglobal` call in void context throwing an error, which is a very atypical use of an intrinsic.
I think we should just fix that code. I've never been super happy with it throwing an exception during every normal startup anyways.
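The pattern being discussed can be illustrated with a hedged sketch: probing for a symbol by relying on `cglobal`'s symbol-not-found error. The library and symbol names below are made-up examples, not the actual check in Base.

```julia
# Illustrative only: use cglobal's symbol-not-found error as an
# existence probe -- the "atypical" void-context use discussed above.
# The symbol and library names here are hypothetical.
function has_symbol()
    try
        cglobal((:openblas_set_num_threads64_, "libopenblas"))
        return true
    catch
        return false   # cglobal threw: symbol could not be resolved
    end
end
```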
base/inference.jl
Outdated
end

function var_matches(a::Union{Slot,SSAValue}, b::Union{Slot,SSAValue})
    return ((isa(a,SSAValue) && isa(b,SSAValue)) || (isa(a,Slot) && isa(b,Slot))) && a.id == b.id
might be more compact and readable just to write this using dispatch
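A possible dispatch-based rewrite along those lines (a sketch, assuming `Slot` and `SSAValue` both carry an `id` field as in `base/inference.jl` at the time):

```julia
# Matching kinds compare by id; mixed Slot/SSAValue pairs never match.
var_matches(a::Slot, b::Slot) = a.id == b.id
var_matches(a::SSAValue, b::SSAValue) = a.id == b.id
var_matches(a::Union{Slot,SSAValue}, b::Union{Slot,SSAValue}) = false
```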
base/inference.jl
Outdated
@@ -3657,7 +3657,7 @@ function inlineable(f::ANY, ft::ANY, e::Expr, atypes::Vector{Any}, sv::Inference
     if sv.params.inlining
         if isa(e.typ, Const) # || isconstType(e.typ)
             if (f === apply_type || f === fieldtype || f === typeof ||
-                istopfunction(topmod, f, :typejoin) ||
+                istopfunction(topmod, f, :typejoin) || f === (===) ||
should have its own line to be consistent with the other items here
note that `===` should often show up here as a `Conditional` object, not a `Const`
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
A very frustrating inlining-related regression here: the code for a
It seems to me that for one-liners (or two-liners, etc.) this should be relatively safe: only code which would have been partially inlined anyway will get inlined just one more level (per …). Thus, I think we could relatively safely add … (Maybe small functions need a …)
It seems like maybe the inlining threshold should be (partially) based on the pre-inlined method body, or some ratio? But this is probably a discussion for a different place.
Force-pushed from f121363 to c4dec1a
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
this was affected by using `Const` in more cases instead of `Type{}`
don't inline into a function `f` if doing so would put it over the inlining threshold, and if inlining `f` itself would help avoid tuple allocations. so far this is only used on `promote`, to limit the effects as much as possible.
Force-pushed from c4dec1a to 763f36c
@nanosoldier
Ok, what I did here was adjust the inlining pass to accumulate added statements into a single buffer, which I think makes the code a bit simpler and allows us to easily observe how big the enclosing function is getting. Then I use this to avoid inlining into
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
Interesting: factorizations got worse, but almost everything else got better.
@nanosoldier Double check.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
svd and
@nanosoldier
I think I figured this one out. Replacing certain slots with equivalent ssavalues was causing LLVM to emit excessive numbers of memcpys to move tuples around. This can probably be considered a quasi-bug in LLVM (the SROA pass, I believe), since it should have been able to figure out that the tuple should be stack allocated to begin with and left in place. Here's a sample of the IR:
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
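The IR sample itself did not survive extraction above. As a hedged sketch, one way to look for this kind of excessive tuple `memcpy` is to dump the LLVM IR for a tuple-shuffling function; the function below is a made-up illustration, not the PR's actual reproducer.

```julia
# Hypothetical tuple-shuffling function: rearranging a homogeneous tuple
# tempts LLVM to move it between stack slots. Dump its LLVM IR and look
# for llvm.memcpy calls that copy the tuple rather than leaving it in place.
shuffle(t::NTuple{8,Int}) = (t[8], t[1], t[2], t[3], t[4], t[5], t[6], t[7])

@code_llvm shuffle((1, 2, 3, 4, 5, 6, 7, 8))
```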
Much better, but still a couple regressions. I'll try tweaking the new rule here.
Ok, the first 2 regressions seem to be noise; I don't see any code differences. The sparse indexing regression is real, but a bit perverse: the simpler code this PR generates allows LLVM to vectorize over some of the integer vectors, which seems to cause a slight slowdown. A vectorization cost-model issue, I suppose. I'm not sure if we can or should do anything about it; that would amount to copying values from arrays into unnecessary temporary variables in the hope of blocking vectorization, and we don't know when that would be profitable any better than LLVM does. To reproduce the LLVM IR:
Sounds best to ignore the single, minor sparse indexing performance regression then? The widespread improvements this change provides are fantastic.
I strongly agree.
Seems like the best course of action is to merge and then open an issue about the regression.
Probably report the IR example upstream, if vectorization is causing a slowdown.
I've noticed a few functions with pretty bloated IR (redundant variables, useless statements, etc.). This improves the optimization passes to address some of these cases. An extreme example is vector+vector broadcast, which with these changes goes from 248 statements to 127.
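The statement counts quoted above can be checked with `code_typed` (a sketch assuming a Julia where `code_typed` returns `CodeInfo` => return-type pairs; exact counts vary by version and will not reproduce the 248/127 figures):

```julia
# Count optimized IR statements for vector+vector broadcast of `+`.
ci, rettype = first(code_typed(broadcast,
                               Tuple{typeof(+), Vector{Float64}, Vector{Float64}}))
println(length(ci.code))   # number of statements in the optimized method body
```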