-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial implementation of regalloc2. #1
Conversation
6988a3f
to
8e923b0
Compare
We currently use a heuristic that our scan for an available PReg starts at an index into the register list that rotates with the bundle index. This is a simple way to distribute contention across the whole register file more evenly and avoid repeating less-likely-to-succeed reg-map probes to lower-numbered registers for every bundle. After some experimentation with different options (queue that dynamically puts registers at end after allocating, various ways of mixing/hashing indices, etc.), adding the *instruction offset* (of the start of the first range in the bundle) as well gave the best results. This is very simple and gives us a likely better-than-random conflict avoidance because ranges tend to be local, so rotating through registers as we scan down the list of instructions seems like a very natural strategy. On the tests used by our `cargo bench` benchmark, this reduces regfile probes for the largest (459 instruction) benchmark from 1538 to 829, i.e., approximately by half, and results in an 11% allocation speedup.
b35acdb
to
33ac6cb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new interface is much nicer than the old one in regalloc.rs!
src/lib.rs
Outdated
/// Operand must be in a fixed register. | ||
FixedReg(PReg), | ||
/// On defs only: reuse a use's register. Which use is given by `preg` field. | ||
Reuse(usize), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful to have an ExplicitStack(slot)
policy which indicates an incoming/outgoing stack argument. The register allocator should be aware of this for proper placement of spills & reloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about this some more and I suspect it may actually be better to (eventually) handle this requirement with a sort of simple rematerialization. Basically if we indicate (somehow) that a vreg's one def is [arbitrary idempotent/repeatable instruction], we can do many things with this, including a load from a particular arg slot. (I have more thoughts on how this would interact with heuristics, and whether it should allow or not allow other vregs as input, but for another time...)
The alternative is to merge the SpillSlot
and user-defined arg-area stack addressing, which is certainly possible (and I guess more conventional) but is a bit more involved.
Anyway, I'll think more about this; we can track it in an issue perhaps; but for the initial Cranelift glue I plan to just use explicit loads at the top of the function, as we do today.
…I changes. The main enhancement in this commit is support for reference types and stackmaps. This requires tracking whether each VReg is a "reference" or "pointer". At certain instructions designated as "safepoints", the regalloc will (i) ensure that all references are in spillslots rather than in registers, and (ii) provide a list of exactly which spillslots have live references at that program point. This can be used by, e.g., a GC to trace and possibly modify pointers. The stackmap of spillslots is precise: it includes all live references, and *only* live references. This commit also brings in some API tweaks as part of the in-progress Cranelift glue. In particular, it makes Allocations and Operands mutually disjoint by using the same bitfield for the type-tag in both and choosing non-overlapping tags. This will allow instructions to carry an Operand for each register slot and then overwrite these in place with Allocations. The `OperandOrAllocation` type does the necessary magic to make this look like an enum, but staying in 32 bits.
757042c
to
a08b012
Compare
bb6f829
to
940c1b7
Compare
36f0e4f
to
520d7d7
Compare
…epends on smallvec and log.
520d7d7
to
414f3f8
Compare
- Support preferred and non-preferred subsets of a register class. This allows allocating, e.g., caller-saved registers before callee-saved registers. - Allow branch blockparam args to start an a certain offset in branch operands; this allows branches to have other operands too (e.g., conditional-branch inputs). - Allow `OperandOrAllocation` to be constructed from an `Allocation` and `OperandKind` as well (i.e., an allocation with an use/def bit).
Just to keep folks up to speed on progress here: the work to adapt Cranelift to regalloc2 is coming along in my |
This generalizes the allocator to accept multiple defs by making defs just another type of "use" (uses are now perhaps more properly called "mentions", but for now we abuse the terminology slightly). It turns out that this actually was not terribly hard, because we don't rely on the properties that a strict SSA requirement otherwise might allow us to: e.g., defs always at exactly the start of a vreg's ranges. Because we already accepted arbitrary block order and irreducible CFGs, and approximated live-ranges with the single-pass algorithm, we are robust in our "stitching" (move insertion) and so all we really care about is computing some superset of the actual live-ranges and then a non-interfering coloring of (split pieces of) those ranges. Multiple defs don't change that, as long as we compute the ranges properly. We still have blockparams in this design, so the client *can* provide SSA directly, and everything will work as before. But a client that produces non-SSA need not use them at all; it can just happily reassign to vregs and everything will Just Work. This is part of the effort to port Cranelift over to regalloc2; I have decided that it may be easier to build a compatibility shim that matches regalloc.rs's interface than to continue boiling the ocean and converting all of the lowering sequences to SSA. It then becomes a separable piece of work (and simply further performance improvements and simplifications) to remove the need for this shim.
…em more hints. Also fix spill-weight caching.
Thanks for the detailed review, @Amanieu! There are a few unresolved comments above still, one for which I'll be curious to hear Julian's API-design thoughts as well, but everything else is, I think, addressed.
Yes, this is a good vein of future optimization, I agree! I'm not sure when I'll be able to go into a dedicated "RA2 optimization" push again, but I've added this to the |
You missed two instances of |
…ertions`. Also fix some code that did not build in debug mode anymore (d'oh!) in `src/ion/merges.rs`, as exposed by this change.
Fixed, thank you! |
8ed83e3 seems to have regressed something, I'm getting assert failures:
|
@julian-seward1 thanks very much for all of your comments! I hope I've addressed everything; still open to discussions or other suggestions on some of the naming points; this has all been very good feedback. Let me know if there's anything else! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth doing a clippy run and addressing the code style issues it raises.
Interesting; I haven't been in the habit of doing this (we don't do it in the Cranelift/Wasmtime repo for example) and it seems to raise a bunch of lints (62 warnings) many of which seem pretty opinionated to me; to take a random example, it wants me to merge these two if-statements but idiomatically the outer is the main action (merging sets, if changed...) and the inner is part of a "if not in workqueue, add to workqueue" idiom. I may be weird and old-school but putting those two together and relying on the short-circuiting of I'd be happy to create an issue to track something like "look into which Clippy lints are appropriate" so that someone can eventually spend more time determining which lints make sense, but I think it's a wider issue and if we are serious about being lint-clean, we should look across our projects, define a common policy, and then adjust to it... if you or anyone else strongly disagrees, of course, I'm happy to reconsider :-) |
Alright, I think I have addressed everything in the comments. Thank you again to @julian-seward1 and @Amanieu especially for the reviews, and everyone else who commented as well! I've written down a list of issues to track followup work that I'll create once I merge this, so the loose ends here shouldn't be forgotten; if I missed any then please feel free to create more. And with that, I think it is time to merge this PR, based on Julian's r+! |
This was excluded by mistake in bytecodealliance#1; without a configuration, cargo-deny runs with a default one, and rejects a lot of things (largely due to open-source-but-not-allowlisted licenses). This `deny.toml` comes from the regalloc.rs repo. It results in one warning currently that will be resolved once bytecodealliance#7 is.
regalloc2: another register allocator
This is a register allocator that started life as, and is about 75% still, a port of IonMonkey's backtracking register allocator to Rust. The data structures and invariants have been simplified a little bit, and the interfaces made a little more generic and reusable. In addition, it contains substantial amounts of testing infrastructure (fuzzing harnesses and checkers) that does not exist in the original IonMonkey allocator.
More details on the differences between this allocator and the IonMonkey one on which it is modeled are given in the README. In brief, it is an SSA-based allocator that strives, as much as possible, to use efficient data structures and scans over code (rather than quadratic or worse operations) in order to implement a backtracking register allocation algorithm.
Rough Performance Comparison with Regalloc.rs
The allocator has not yet been wired up to a suitable compiler backend
(such as Cranelift) to perform a true apples-to-apples compile-time
and runtime comparison. However, we can get some idea of compile speed
by running suitable test cases through the allocator and measuring
throughput: that is, instructions per second for which registers are
allocated.
To do so, I measured the
qsort2
benchmark inregalloc.rs,
register-allocated with default options in that crate's backtracking
allocator, using the Criterion benchmark framework to measure ~620K
instructions per second:
I then measured three different fuzztest-SSA-generator test cases in
this allocator,
regalloc2
, measuring between 1.1M and 2.3Minstructions per second (closer to the former for larger functions):
Though not apples-to-apples (SSA vs. non-SSA, completely different
code only with similar length), this is at least some evidence that
regalloc2
is likely to lead to at least a compile-time improvementwhen used in e.g. Cranelift.