Prevent query cycles in the MIR inliner #68828
> This is a pessimistic approach to inlining, since we strictly have more calls in the `validated_mir` than we have in `optimized_mir`

Except for `box_free` calls, which drop elaboration adds.
I think I get it now: this is similar to computing SCCs (of the callgraph) ahead of time, but both better (incremental) and worse (neither cached nor using an SCC algorithm) at the same time. We discussed something like this ages ago, maybe @nikomatsakis can remember where.

I think your query helps both with caching and incremental red-green, but it's not necessary to get the correctness, you could "lookahead" by using

One of the problems with precomputing SCCs in an on-demand/incremental fashion is deciding on which (callgraph) node "defines" the SCCs, as it needs to be the same no matter which node in the SCC the information is queried for. And... I think we've had an answer for a while now! We sort several things by stable hashes. So we could roughly have this query (where

    struct Cycle {
        root: LocalDefId,
        members: FxHashSet<LocalDefId>,
    }
    fn mir_inliner_cycle(tcx: TyCtxt<'tcx>, root: LocalDefId) -> Option<&'tcx Cycle> {
        let root_hash = tcx.def_path_hash(root);
        let mut cycle = None;
        let mut queue = VecDeque::new();
        queue.push_back(root);
        while let Some(caller) = queue.pop_front() {
            for callee in tcx.mir_inliner_callees(caller) {
                // Reaching `root` again means `root` is on a cycle.
                if callee == root {
                    cycle = Some(Cycle {
                        root,
                        // TODO: accumulate cycle members.
                        // Can probably use a stack instead of a queue, emulating
                        // the runtime callstack, but this example is large as is.
                        members: FxHashSet::new(),
                    });
                    continue;
                }
                if tcx.def_path_hash(callee) < root_hash {
                    // We may share a cycle with `callee`, in which case
                    // it's "closer" to the cycle's real root (lower hash).
                    if let Some(callee_cycle) = tcx.mir_inliner_cycle(callee) {
                        if callee_cycle.members.contains(&root) {
                            return Some(callee_cycle);
                        }
                    }
                } else {
                    // Keep exploring areas of the callgraph that could
                    // contain a cycle that we're the root of
                    // (i.e. all other nodes have higher hash).
                    queue.push_back(callee);
                }
            }
        }
        cycle
    }

Alright, I got carried away sketching that, but I think we've made progress here!

EDIT: to be clear, I don't think we can use a real whole-graph SCC algorithm: everything has to be one SCC at a time (i.e. "find SCC this node is in") for it to work with incremental queries. |
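To make the "same root no matter which member is queried" property concrete, here is a minimal, self-contained sketch on a toy callgraph. It is not the query sketched above: it is a brute-force stand-in that finds the SCC of a single node by two-way reachability. The `u64` ids stand in for def-path hashes, and `CallGraph`, `reachable`, `cycle_members` and `cycle_root` are hypothetical names, not rustc APIs.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy callgraph: each node is a `u64` standing in for a def-path hash
/// (an assumption for illustration; rustc uses `DefPathHash`).
type CallGraph = HashMap<u64, Vec<u64>>;

/// All nodes reachable from `start` by following at least one call edge.
fn reachable(graph: &CallGraph, start: u64) -> BTreeSet<u64> {
    let mut seen = BTreeSet::new();
    let mut stack: Vec<u64> = graph.get(&start).cloned().unwrap_or_default();
    while let Some(node) = stack.pop() {
        if seen.insert(node) {
            if let Some(callees) = graph.get(&node) {
                stack.extend(callees.iter().copied());
            }
        }
    }
    seen
}

/// The members of the cycle (SCC) containing `node`, or `None` if `node`
/// cannot reach itself.
fn cycle_members(graph: &CallGraph, node: u64) -> Option<BTreeSet<u64>> {
    let from_node = reachable(graph, node);
    if !from_node.contains(&node) {
        return None;
    }
    // `other` shares an SCC with `node` iff each can reach the other.
    Some(
        from_node
            .into_iter()
            .filter(|&other| reachable(graph, other).contains(&node))
            .collect(),
    )
}

/// Canonical root of the cycle: the member with the lowest "hash".
/// Every member of the same SCC gets the same answer.
fn cycle_root(graph: &CallGraph, node: u64) -> Option<u64> {
    cycle_members(graph, node).and_then(|members| members.into_iter().next())
}
```

With edges `30 -> 10`, `10 -> 20`, `20 -> 30`, querying any of `10`, `20` or `30` yields root `10`, while a node that merely calls into the cycle is not a member. This version redoes the reachability work for every entry point, which is the cost concern raised later in the thread.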
Is this true? Couldn't optimizations expose calls which were previously e.g. indirect? |
I'm not sure how that could work, since |
Yea, I guess const prop could turn
I had a more complex approach before the one shown here. That approach also returned the substs of the function calls from the query so we can adjust them to the call site's substs. That's basically what inlining does right now. I scrapped it because I couldn't wrap my head around all the edge cases and wanted to publish something. I plan to deduplicate that logic with the inliner's logic so we have a single source of truth instead of having two schemes that cause cycles if they diverge. |
I believe my plan was to abandon this constraint from @eddyb
Basically, we would have a single "call graph" query that walks the crate and constructs the call graph, presumably based on reading the "validated MIR". This query would be wrapped with other queries that extract the individual SCCs and return their contents -- with a bit of careful work on the keys, we could make it work so that if the contents of an SCC don't change, the query yields up green, and thus avoid the need to re-optimize things that contain inlining. (I am imagining something like: the key that identifies an SCC is a hash of its contents, perhaps? Or it is named by some def-id within which has the lowest hash. Something like that, so that it is not dependent on the order in which we traversed the call graph.) Assumptions:
This strikes me as a fairly simple design that would likely work reasonably well? But I'm open to other alternatives for sure. I'm not sure how far off that is from the pseudo-code that @eddyb sketched. |
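As a rough illustration of the single "call graph" query idea, the sketch below runs Tarjan's algorithm over a toy graph and names each SCC by its lowest member, so the keys do not depend on traversal order. This is a sketch under assumptions: `u64` ids stand in for def-path hashes, and `CallGraph`, `Tarjan` and `sccs_in_order` are hypothetical names rather than anything in rustc.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy callgraph keyed by `u64` ids standing in for def-path hashes.
type CallGraph = BTreeMap<u64, Vec<u64>>;

/// State for a recursive run of Tarjan's SCC algorithm.
struct Tarjan<'a> {
    graph: &'a CallGraph,
    next_index: usize,
    index: BTreeMap<u64, usize>,
    lowlink: BTreeMap<u64, usize>,
    stack: Vec<u64>,
    on_stack: BTreeSet<u64>,
    /// Finished SCCs, keyed by their lowest member.
    sccs: BTreeMap<u64, BTreeSet<u64>>,
}

impl<'a> Tarjan<'a> {
    fn visit(&mut self, v: u64) {
        self.index.insert(v, self.next_index);
        self.lowlink.insert(v, self.next_index);
        self.next_index += 1;
        self.stack.push(v);
        self.on_stack.insert(v);

        // Copy the shared reference so iterating does not borrow `self`.
        let graph = self.graph;
        for &w in graph.get(&v).into_iter().flatten() {
            if !self.index.contains_key(&w) {
                self.visit(w);
                let low = self.lowlink[&v].min(self.lowlink[&w]);
                self.lowlink.insert(v, low);
            } else if self.on_stack.contains(&w) {
                let low = self.lowlink[&v].min(self.index[&w]);
                self.lowlink.insert(v, low);
            }
        }

        // `v` is the first-visited node of its SCC: pop the whole SCC.
        if self.lowlink[&v] == self.index[&v] {
            let mut members = BTreeSet::new();
            loop {
                let w = self.stack.pop().expect("v is still on the stack");
                self.on_stack.remove(&w);
                members.insert(w);
                if w == v {
                    break;
                }
            }
            // Order-independent name: the lowest id in the SCC.
            let key = *members.iter().next().expect("an SCC is never empty");
            self.sccs.insert(key, members);
        }
    }
}

/// Computes all SCCs, starting traversals in the given `order`.
fn sccs_in_order(graph: &CallGraph, order: &[u64]) -> BTreeMap<u64, BTreeSet<u64>> {
    let mut tarjan = Tarjan {
        graph,
        next_index: 0,
        index: BTreeMap::new(),
        lowlink: BTreeMap::new(),
        stack: Vec::new(),
        on_stack: BTreeSet::new(),
        sccs: BTreeMap::new(),
    };
    for &v in order {
        if !tarjan.index.contains_key(&v) {
            tarjan.visit(v);
        }
    }
    tarjan.sccs
}
```

Running `sccs_in_order` with two different starting orders over the same graph produces the same map, which is the property a wrapping per-SCC query would need in order to come back green when an SCC's contents are unchanged.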
@nikomatsakis I thought we decided long ago that it's a non-starter, otherwise... we would've done it already, wouldn't we? We even had the SCC computation code and everything, I think we just removed it. The problem with a monolith query is that it depends on every function in the crate. Sure, individual My alternative would be more expensive the first time, probably, but it should explore smaller parts of the callgraph later. @oli-obk Can you make a crate-wide query that just queries your |
I don't recall any such decision, but I could believe that there was a consensus reached, either without me, or that I've forgotten about.
Yes. I am not convinced this would be a problem in practice, but it is definitely a downside. My hope was that this specific query would be "small enough" in scope not to be a big deal.
I guess I need to look more closely. It seems to me that one challenge here is that, if the query is "figure out which SCC this def-id is a part of", then we'd probably want to extend the query system, because otherwise you'll have to recompute that computation for each member of the SCC, right? (i.e., you could start from many different entry points) |
Would cycle error recovery be sound by discarding everything up to the oldest try_query call on a cycle, even when there is a newer try_query call? That way the query will be recomputed for every possible cycle entry point. eg The returning back to the oldest try_query could be implemented using:
|
I do think we could extend the query system to permit cycles, potentially, and avoid caching for intermediate nodes, but it raises some interesting questions. There are various models for this. It also interacts with parallel rustc. I guess I just need to read this PR and @eddyb's comment in more detail so I actually understand what's being proposed here and stop talking about hypotheticals. |
src/librustc_mir/transform/inline.rs
    // and is already optimized.
    self.tcx.optimized_mir(callsite.callee)
    };
    let mut seen: FxHashSet<_> = Default::default();
I think this cycle check should be extracted to a function.
It bails out as soon as it finds another cycle member with a lower hash (i.e. "closer to the root"), but you're right that it might be too expensive (I think we should benchmark both your and my version).
Heh, that's what I don't think we can do that for |
If we opt to have cycles discard caching for the contents of the cycle, it is reasonable, but it does mean that

EDIT: Just to clarify -- it's exactly the possibility of 'inconsistent' results that causes us problems using it for mir-optimize, since it means that the final results depend on query order, which we do not want.

EDIT: The Prolog "tabling" solution would be to remember the result you got at the end and then re-run the query, but this time you use that result for any recursive results, and you require that it reaches a fixed point. I had hoped we could get away with the approach that I was advocating for earlier in these sorts of scenarios, but I am open to the idea that it won't work. I feel though that we should at least try it before we reach that conclusion. As @eddyb said, building a few options and benchmarking would be ideal. |
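A small sketch of the fixed-point idea, in the spirit of tabling rather than an implementation of it: every "query" is re-run using the previous round's answers for its callees until no answer changes. The property computed here ("may this function unwind?") and the names `Body` and `may_unwind` are hypothetical, chosen only because the answer is monotone and so the loop is guaranteed to terminate.

```rust
use std::collections::BTreeMap;

/// Toy function summary (hypothetical): whether the body unwinds on its own,
/// and which functions it calls.
struct Body {
    unwinds_directly: bool,
    callees: Vec<&'static str>,
}

/// Re-runs every per-function "query" with the previous round's table as the
/// answer for recursive lookups, until a fixed point is reached.
fn may_unwind(bodies: &BTreeMap<&'static str, Body>) -> BTreeMap<&'static str, bool> {
    // Start from the optimistic answer for everything, cycle members included.
    let mut table: BTreeMap<&'static str, bool> =
        bodies.keys().map(|&name| (name, false)).collect();
    loop {
        let mut changed = false;
        for (&name, body) in bodies {
            // Callees missing from the table are treated pessimistically.
            let answer = body.unwinds_directly
                || body
                    .callees
                    .iter()
                    .any(|callee| table.get(callee).copied().unwrap_or(true));
            if table[&name] != answer {
                table.insert(name, answer);
                changed = true;
            }
        }
        if !changed {
            // Fixed point: another round would give the same answers.
            return table;
        }
    }
}
```

Because answers only ever flip from `false` to `true`, the final table is the same regardless of the order in which functions are visited, which is the order-independence that a plain "discard the cache on cycles" scheme does not give.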
What I meant was that at a cycle error everything after the first try_query is discarded, so for every possible entry point of the cycle a cycle error will be reported on the next query of the cycle.
|
Haha, I've been meaning to look into tabling for a while now. |
☔ The latest upstream changes (presumably #69592) made this pull request unmergeable. Please resolve the merge conflicts. |
  StorageLive(_3); // scope 0 at $DIR/cycle.rs:4:5: 4:6
  _3 = &_1; // scope 0 at $DIR/cycle.rs:4:5: 4:6
  StorageLive(_4); // scope 0 at $DIR/cycle.rs:4:5: 4:8
- _2 = <impl Fn() as Fn<()>>::call(move _3, move _4) -> [return: bb1, unwind: bb3]; // scope 0 at $DIR/cycle.rs:4:5: 4:8
+ _2 = <impl Fn() as Fn<()>>::call(move _3, move _4) -> bb1; // scope 0 at $DIR/cycle.rs:4:5: 4:8
  // mir::Constant
  // + span: $DIR/cycle.rs:4:5: 4:6
  // + literal: Const { ty: for<'r> extern "rust-call" fn(&'r impl Fn(), ()) -> <impl Fn() as std::ops::FnOnce<()>>::Output {<impl Fn() as std::ops::Fn<()>>::call}, val: Value(Scalar(<ZST>)) }
  StorageDead(_3); // scope 0 at $DIR/cycle.rs:4:7: 4:8
  StorageDead(_2); // scope 0 at $DIR/cycle.rs:4:8: 4:9
  _0 = const (); // scope 0 at $DIR/cycle.rs:3:20: 5:2
- drop(_1) -> [return: bb2, unwind: bb4]; // scope 0 at $DIR/cycle.rs:5:1: 5:2
+ drop(_1) -> bb2; // scope 0 at $DIR/cycle.rs:5:1: 5:2

  bb2: {
  return; // scope 0 at $DIR/cycle.rs:5:2: 5:2
-
-
- bb3 (cleanup): {
- drop(_1) -> bb4; // scope 0 at $DIR/cycle.rs:5:1: 5:2
-
-
- bb4 (cleanup): {
- resume; // scope 0 at $DIR/cycle.rs:3:1: 5:2
  }
|
@bors r=wesleywiser wasm generates different MIR |
📌 Commit d38553c has been approved by |
⌛ Testing commit d38553c with merge 12c5b6d144fd893cecdab3539aa2723d0c860909... |
The job Click to see the possible cause of the failure (guessed by this bot)
|
💔 Test failed - checks-actions |
@bors retry |
☀️ Test successful - checks-actions |
It looks like this was a moderate regression in instruction counts on several benchmarks (https://perf.rust-lang.org/compare.html?start=7fba12bb1d3877870758a7a53e2fe766bb19bd60&end=f4eb5d9f719cd3c849befc8914ad8ce0ddcf34ed&stat=instructions:u), though there was also some improvement. @oli-obk - was this expected? I guess there were some earlier perf runs that were much worse, so we've improved since then at least... |
The previous runs here were with inlining activated. I should have re-run without inlining activated, but I forgot. The results are really odd. When looking at the detailed perf (e.g. for keccak), it seems like typeck has gotten slower, which makes absolutely no sense considering this PR doesn't actually change anything without mir-opt-level=2... I'll start looking into this. |
r? @eddyb @wesleywiser
cc @rust-lang/wg-mir-opt
The general design is that we have a new query that is run on the `validated_mir` instead of on the `optimized_mir`. That query is forced before going into the optimization pipeline, so as to not try to read from a stolen MIR. The query should not be cached cross crate, as you should never call it for items from other crates. By its very design calls into other crates can never cause query cycles.

This is a pessimistic approach to inlining, since we strictly have more calls in the `validated_mir` than we have in `optimized_mir`, but that's not a problem imo.
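The check this design implies can be sketched on a toy model. The sketch assumes a map from each local function to the calls found in its unoptimized body, and asks whether inlining a callee could lead back to the caller. `Callee`, `PreOptCalls` and `inlining_would_cycle` are hypothetical names for illustration, not the names used in the PR.

```rust
use std::collections::{HashMap, HashSet};

/// A call target: either defined in the current crate or in another one.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Callee {
    Local(u32),
    External(u32),
}

/// Hypothetical stand-in for the new query's results: the calls found in the
/// unoptimized (validated) body of each local function.
type PreOptCalls = HashMap<u32, Vec<Callee>>;

/// Returns `true` if asking for the optimized body of `callee` while
/// optimizing `caller` could come back around to `caller`.
fn inlining_would_cycle(calls: &PreOptCalls, caller: u32, callee: Callee) -> bool {
    // Calls into other crates can never lead back into this crate's queries.
    let start = match callee {
        Callee::Local(id) => id,
        Callee::External(_) => return false,
    };
    let mut seen = HashSet::new();
    let mut stack = vec![start];
    while let Some(function) = stack.pop() {
        if function == caller {
            return true;
        }
        if !seen.insert(function) {
            continue;
        }
        for target in calls.get(&function).into_iter().flatten() {
            // Only local callees can take part in a cycle.
            if let Callee::Local(id) = *target {
                stack.push(id);
            }
        }
    }
    false
}
```

Since the pre-optimization bodies contain at least as many calls as the optimized ones (apart from the `box_free` exception noted in review), this check can report a cycle that optimization would have removed, but under that assumption it should not miss one.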