Initial implementation of regalloc2. #1

cfallin · 2021-04-14T00:38:40Z

regalloc2: another register allocator

This is a register allocator that started life as, and is about 75% still, a port of IonMonkey's backtracking register allocator to Rust. The data structures and invariants have been simplified a little bit, and the interfaces made a little more generic and reusable. In addition, it contains substantial amounts of testing infrastructure (fuzzing harnesses and checkers) that does not exist in the original IonMonkey allocator.

More details on the differences between this allocator and the IonMonkey one on which it is modeled are given in the README. In brief, it is an SSA-based allocator that strives, as much as possible, to use efficient data structures and scans over code (rather than quadratic or worse operations) in order to implement a backtracking register allocation algorithm.

Rough Performance Comparison with Regalloc.rs

The allocator has not yet been wired up to a suitable compiler backend
(such as Cranelift) to perform a true apples-to-apples compile-time
and runtime comparison. However, we can get some idea of compile speed
by running suitable test cases through the allocator and measuring
throughput: that is, instructions per second for which registers are
allocated.

To do so, I measured the qsort2 benchmark in
regalloc.rs,
register-allocated with default options in that crate's backtracking
allocator, using the Criterion benchmark framework to measure ~620K
instructions per second:

benches/0               time:   [365.68 us 367.36 us 369.04 us]
                        thrpt:  [617.82 Kelem/s 620.65 Kelem/s 623.49 Kelem/s]

I then measured three different fuzztest-SSA-generator test cases in
this allocator, regalloc2, measuring between 1.1M and 2.3M
instructions per second (closer to the former for larger functions):

==== 459 instructions
benches/0               time:   [377.91 us 378.09 us 378.27 us]
                        thrpt:  [1.2134 Melem/s 1.2140 Melem/s 1.2146 Melem/s]

==== 225 instructions
benches/1               time:   [202.03 us 202.14 us 202.27 us]
                        thrpt:  [1.1124 Melem/s 1.1131 Melem/s 1.1137 Melem/s]

==== 21 instructions
benches/2               time:   [9.5605 us 9.5655 us 9.5702 us]
                        thrpt:  [2.1943 Melem/s 2.1954 Melem/s 2.1965 Melem/s]

Though not apples-to-apples (SSA vs. non-SSA, completely different
code only with similar length), this is at least some evidence that
regalloc2 is likely to lead to at least a compile-time improvement
when used in e.g. Cranelift.

We currently use a heuristic that our scan for an available PReg starts at an index into the register list that rotates with the bundle index. This is a simple way to distribute contention across the whole register file more evenly and avoid repeating less-likely-to-succeed reg-map probes to lower-numbered registers for every bundle. After some experimentation with different options (queue that dynamically puts registers at end after allocating, various ways of mixing/hashing indices, etc.), adding the *instruction offset* (of the start of the first range in the bundle) as well gave the best results. This is very simple and gives us a likely better-than-random conflict avoidance because ranges tend to be local, so rotating through registers as we scan down the list of instructions seems like a very natural strategy. On the tests used by our `cargo bench` benchmark, this reduces regfile probes for the largest (459 instruction) benchmark from 1538 to 829, i.e., approximately by half, and results in an 11% allocation speedup.

benches/regalloc.rs

fuzz/.gitignore

fuzz/Cargo.toml

src/postorder.rs

Amanieu

The new interface is much nicer than the old one in regalloc.rs!

src/lib.rs

Amanieu · 2021-04-14T12:22:34Z

src/lib.rs

+    /// Operand must be in a fixed register.
+    FixedReg(PReg),
+    /// On defs only: reuse a use's register. Which use is given by `preg` field.
+    Reuse(usize),


It would be useful to have an ExplicitStack(slot) policy which indicates an incoming/outgoing stack argument. The register allocator should be aware of this for proper placement of spills & reloads.

I've been thinking about this some more and I suspect it may actually be better to (eventually) handle this requirement with a sort of simple rematerialization. Basically if we indicate (somehow) that a vreg's one def is [arbitrary idempotent/repeatable instruction], we can do many things with this, including a load from a particular arg slot. (I have more thoughts on how this would interact with heuristics, and whether it should allow or not allow other vregs as input, but for another time...)

The alternative is to merge the SpillSlot and user-defined arg-area stack addressing, which is certainly possible (and I guess more conventional) but is a bit more involved.

Anyway, I'll think more about this; we can track it in an issue perhaps; but for the initial Cranelift glue I plan to just use explicit loads at the top of the function, as we do today.

src/lib.rs

…I changes. The main enhancement in this commit is support for reference types and stackmaps. This requires tracking whether each VReg is a "reference" or "pointer". At certain instructions designated as "safepoints", the regalloc will (i) ensure that all references are in spillslots rather than in registers, and (ii) provide a list of exactly which spillslots have live references at that program point. This can be used by, e.g., a GC to trace and possibly modify pointers. The stackmap of spillslots is precise: it includes all live references, and *only* live references. This commit also brings in some API tweaks as part of the in-progress Cranelift glue. In particular, it makes Allocations and Operands mutually disjoint by using the same bitfield for the type-tag in both and choosing non-overlapping tags. This will allow instructions to carry an Operand for each register slot and then overwrite these in place with Allocations. The `OperandOrAllocation` type does the necessary magic to make this look like an enum, but staying in 32 bits.

src/lib.rs

…epends on smallvec and log.

- Support preferred and non-preferred subsets of a register class. This allows allocating, e.g., caller-saved registers before callee-saved registers. - Allow branch blockparam args to start an a certain offset in branch operands; this allows branches to have other operands too (e.g., conditional-branch inputs). - Allow `OperandOrAllocation` to be constructed from an `Allocation` and `OperandKind` as well (i.e., an allocation with an use/def bit).

cfallin · 2021-05-04T08:14:39Z

Just to keep folks up to speed on progress here: the work to adapt Cranelift to regalloc2 is coming along in my regalloc2 branch; the diff is growing to be a little bit larger than I had initially hoped but the end result will be cleaner, I think. I'm planning to drive that work to the point that we can take some performance measurements, then we can decide what to do with this (hopefully continue with refining it, assuming that we see improvements over status quo!).

This generalizes the allocator to accept multiple defs by making defs just another type of "use" (uses are now perhaps more properly called "mentions", but for now we abuse the terminology slightly). It turns out that this actually was not terribly hard, because we don't rely on the properties that a strict SSA requirement otherwise might allow us to: e.g., defs always at exactly the start of a vreg's ranges. Because we already accepted arbitrary block order and irreducible CFGs, and approximated live-ranges with the single-pass algorithm, we are robust in our "stitching" (move insertion) and so all we really care about is computing some superset of the actual live-ranges and then a non-interfering coloring of (split pieces of) those ranges. Multiple defs don't change that, as long as we compute the ranges properly. We still have blockparams in this design, so the client *can* provide SSA directly, and everything will work as before. But a client that produces non-SSA need not use them at all; it can just happily reassign to vregs and everything will Just Work. This is part of the effort to port Cranelift over to regalloc2; I have decided that it may be easier to build a compatibility shim that matches regalloc.rs's interface than to continue boiling the ocean and converting all of the lowering sequences to SSA. It then becomes a separable piece of work (and simply further performance improvements and simplifications) to remove the need for this shim.

…in one inst.

…em more hints. Also fix spill-weight caching.

… operand/inst handling

cfallin · 2021-08-12T21:54:07Z

Thanks for the detailed review, @Amanieu! There are a few unresolved comments above still, one for which I'll be curious to hear Julian's API-design thoughts as well, but everything else is, I think, addressed.

In my profiles, 5-7% of the time is spent in malloc/realloc/free. Perhaps this is worth optimizing? Perhaps the various Vecs can be preserved across multiple invocations of the register allocator like Cranelift's Context.

Yes, this is a good vein of future optimization, I agree! I'm not sure when I'll be able to go into a dedicated "RA2 optimization" push again, but I've added this to the TODO list.

Amanieu · 2021-08-13T00:13:10Z

You missed two instances of cfg(debug) and cfg!(debug). (should be debug_assertions)

…ertions`. Also fix some code that did not build in debug mode anymore (d'oh!) in `src/ion/merges.rs`, as exposed by this change.

cfallin · 2021-08-13T00:37:11Z

Fixed, thank you!

Amanieu · 2021-08-13T16:34:56Z

8ed83e3 seems to have regressed something, I'm getting assert failures:

thread 'main' panicked at 'assertion failed: self.vregs[vreg.index()].ranges.is_empty() ||\n    range.to <=\n        self.ranges[self.vregs[vreg.index()].ranges.last().unwrap().index.index()].range.from', /home/amanieu/code/regalloc2/src/ion/liveranges.rs:147:9

src/bitvec.rs

src/lib.rs

src/bitvec.rs

src/checker.rs

src/moves.rs

src/ion/liveranges.rs

cfallin · 2021-08-31T00:56:26Z

@julian-seward1 thanks very much for all of your comments!

I hope I've addressed everything; still open to discussions or other suggestions on some of the naming points; this has all been very good feedback. Let me know if there's anything else!

src/lib.rs

Amanieu

It might be worth doing a clippy run and addressing the code style issues it raises.

src/lib.rs

src/ion/merge.rs

src/ion/moves.rs

…values.

cfallin · 2021-09-01T01:15:34Z

It might be worth doing a clippy run and addressing the code style issues it raises.

Interesting; I haven't been in the habit of doing this (we don't do it in the Cranelift/Wasmtime repo for example) and it seems to raise a bunch of lints (62 warnings) many of which seem pretty opinionated to me; to take a random example, it wants me to merge these two if-statements but idiomatically the outer is the main action (merging sets, if changed...) and the inner is part of a "if not in workqueue, add to workqueue" idiom. I may be weird and old-school but putting those two together and relying on the short-circuiting of && for implicit control flow is too subtle for me. To take another example, it's complaining that two branches of a conditional in the checker both assign CheckerValue::Conflicted; that's intentional, in a sort of explicit break-down-the-cases chain of conditionals that define a meet-function. Merging them would be less readable IMHO.

I'd be happy to create an issue to track something like "look into which Clippy lints are appropriate" so that someone can eventually spend more time determining which lints make sense, but I think it's a wider issue and if we are serious about being lint-clean, we should look across our projects, define a common policy, and then adjust to it... if you or anyone else strongly disagrees, of course, I'm happy to reconsider :-)

cfallin · 2021-09-01T01:19:48Z

Alright, I think I have addressed everything in the comments. Thank you again to @julian-seward1 and @Amanieu especially for the reviews, and everyone else who commented as well! I've written down a list of issues to track followup work that I'll create once I merge this, so the loose ends here shouldn't be forgotten; if I missed any then please feel free to create more.

And with that, I think it is time to merge this PR, based on Julian's r+!

This was excluded by mistake in bytecodealliance#1; without a configuration, cargo-deny runs with a default one, and rejects a lot of things (largely due to open-source-but-not-allowlisted licenses). This `deny.toml` comes from the regalloc.rs repo. It results in one warning currently that will be resolved once bytecodealliance#7 is.

cfallin requested review from julian-seward1 and bnjbvr April 14, 2021 00:38

Initial public commit of regalloc2.

8e923b0

cfallin force-pushed the initial-regalloc branch from 6988a3f to 8e923b0 Compare April 14, 2021 00:40

cfallin force-pushed the initial-regalloc branch from b35acdb to 33ac6cb Compare April 14, 2021 06:31

bjorn3 reviewed Apr 14, 2021

View reviewed changes

benches/regalloc.rs Outdated Show resolved Hide resolved

fuzz/.gitignore Outdated Show resolved Hide resolved

fuzz/Cargo.toml Outdated Show resolved Hide resolved

src/postorder.rs Outdated Show resolved Hide resolved

Amanieu reviewed Apr 14, 2021

View reviewed changes

src/lib.rs Outdated Show resolved Hide resolved

cfallin force-pushed the initial-regalloc branch from 757042c to a08b012 Compare April 18, 2021 04:30

Amanieu reviewed Apr 18, 2021

View reviewed changes

src/lib.rs Outdated Show resolved Hide resolved

src/lib.rs Outdated Show resolved Hide resolved

src/lib.rs Outdated Show resolved Hide resolved

src/lib.rs Outdated Show resolved Hide resolved

src/lib.rs Show resolved Hide resolved

src/lib.rs Show resolved Hide resolved

Changes from review comments.

940c1b7

cfallin force-pushed the initial-regalloc branch from bb6f829 to 940c1b7 Compare April 18, 2021 20:12

Add GitHub CI config.

34ab744

cfallin force-pushed the initial-regalloc branch from 36f0e4f to 520d7d7 Compare April 18, 2021 21:11

Factored out test program and fuzzing features; core crate now only d…

414f3f8

…epends on smallvec and log.

cfallin force-pushed the initial-regalloc branch from 520d7d7 to 414f3f8 Compare April 18, 2021 21:19

cfallin added 2 commits April 30, 2021 21:14

Derive Ord/hash on OperandOrAllocation.

9e7021c

cfallin force-pushed the initial-regalloc branch from 0db1666 to 15ed2d6 Compare May 6, 2021 05:49

cfallin added 4 commits May 5, 2021 23:08

BitVec::get() takes immutable self

48fbc23

MachineEnv fields are public

ab828b6

Properly handle multiple same-fixed-reg constraints to the same vreg …

80cdd0c

…in one inst.

Some performance tweaks -- try to reduce register probe count with so…

1a7a0c5

…em more hints. Also fix spill-weight caching.

cfallin force-pushed the initial-regalloc branch from dea3902 to 1a7a0c5 Compare May 6, 2021 23:10

cfallin added 2 commits May 6, 2021 16:19

Some micro-optimizations in BitVec.

747c56c

Handle moves specially with move-insertion logic rather than ordinary…

e2beb47

… operand/inst handling

cfallin added 2 commits August 12, 2021 14:49

Debug output for Operands: omit default/most common positions.

ffc06b2

Add malloc/free optimization to TODO

f1a989f

Fix BitVec::get_or_insert to scan only once.

8ed83e3

Replace remaining instances of use of debug feature with `debug_ass…

69ad31f

…ertions`. Also fix some code that did not build in debug mode anymore (d'oh!) in `src/ion/merges.rs`, as exposed by this change.

Amanieu reviewed Aug 13, 2021

View reviewed changes

src/bitvec.rs Outdated Show resolved Hide resolved

Fix bug in refactored BitVec (found by @Amanieu).

e10bffb

julian-seward1 reviewed Aug 23, 2021

View reviewed changes

julian-seward1 reviewed Aug 30, 2021

View reviewed changes

src/moves.rs Outdated Show resolved Hide resolved

julian-seward1 reviewed Aug 30, 2021

View reviewed changes

src/ion/liveranges.rs Show resolved Hide resolved

cfallin added 2 commits August 30, 2021 17:15

Address review comments: more doc comments and some minor refactorings.

6d313f2

Addressed more review comments.

3a18564

julian-seward1 reviewed Aug 31, 2021

View reviewed changes

src/lib.rs Outdated Show resolved Hide resolved

julian-seward1 approved these changes Aug 31, 2021

View reviewed changes

Amanieu reviewed Aug 31, 2021

View reviewed changes

src/lib.rs Show resolved Hide resolved

src/lib.rs Show resolved Hide resolved

src/lib.rs Outdated Show resolved Hide resolved

src/ion/merge.rs Outdated Show resolved Hide resolved

src/ion/moves.rs Outdated Show resolved Hide resolved

cfallin added 3 commits August 31, 2021 17:31

Rename operand positions to Early and Late, and make weights f16/f32 …

b19fa48

…values.

Address review comments.

6389071

Address review comments.

6f0893d

cfallin merged commit 57ccaef into bytecodealliance:main Sep 1, 2021

cfallin deleted the initial-regalloc branch September 1, 2021 01:20

cfallin mentioned this pull request Sep 1, 2021

Investigate using Clippy and following lints' recommendations #6

Open

This was referenced Sep 3, 2021

Add cargo-deny config. #10

Merged

Investigate better bitpacking for Operand and Use #5

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial implementation of regalloc2. #1

Initial implementation of regalloc2. #1

cfallin commented Apr 14, 2021 •

edited

Loading

Amanieu left a comment

Amanieu Apr 14, 2021

cfallin Apr 18, 2021

cfallin commented May 4, 2021

cfallin commented Aug 12, 2021 •

edited

Loading

Amanieu commented Aug 13, 2021

cfallin commented Aug 13, 2021

Amanieu commented Aug 13, 2021

cfallin commented Aug 31, 2021

Amanieu left a comment

cfallin commented Sep 1, 2021

cfallin commented Sep 1, 2021

Initial implementation of regalloc2. #1

Initial implementation of regalloc2. #1

Conversation

cfallin commented Apr 14, 2021 • edited Loading

regalloc2: another register allocator

Rough Performance Comparison with Regalloc.rs

Amanieu left a comment

Choose a reason for hiding this comment

Amanieu Apr 14, 2021

Choose a reason for hiding this comment

cfallin Apr 18, 2021

Choose a reason for hiding this comment

cfallin commented May 4, 2021

cfallin commented Aug 12, 2021 • edited Loading

Amanieu commented Aug 13, 2021

cfallin commented Aug 13, 2021

Amanieu commented Aug 13, 2021

cfallin commented Aug 31, 2021

Amanieu left a comment

Choose a reason for hiding this comment

cfallin commented Sep 1, 2021

cfallin commented Sep 1, 2021

cfallin commented Apr 14, 2021 •

edited

Loading

cfallin commented Aug 12, 2021 •

edited

Loading