Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of regalloc2. #1

Merged
merged 155 commits into from
Sep 1, 2021

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Apr 14, 2021

regalloc2: another register allocator

This is a register allocator that started life as, and is about 75% still, a port of IonMonkey's backtracking register allocator to Rust. The data structures and invariants have been simplified a little bit, and the interfaces made a little more generic and reusable. In addition, it contains substantial amounts of testing infrastructure (fuzzing harnesses and checkers) that does not exist in the original IonMonkey allocator.

More details on the differences between this allocator and the IonMonkey one on which it is modeled are given in the README. In brief, it is an SSA-based allocator that strives, as much as possible, to use efficient data structures and scans over code (rather than quadratic or worse operations) in order to implement a backtracking register allocation algorithm.

Rough Performance Comparison with Regalloc.rs

The allocator has not yet been wired up to a suitable compiler backend
(such as Cranelift) to perform a true apples-to-apples compile-time
and runtime comparison. However, we can get some idea of compile speed
by running suitable test cases through the allocator and measuring
throughput: that is, instructions per second for which registers are
allocated.

To do so, I measured the qsort2 benchmark in
regalloc.rs,
register-allocated with default options in that crate's backtracking
allocator, using the Criterion benchmark framework to measure ~620K
instructions per second:

benches/0               time:   [365.68 us 367.36 us 369.04 us]
                        thrpt:  [617.82 Kelem/s 620.65 Kelem/s 623.49 Kelem/s]

I then measured three different fuzztest-SSA-generator test cases in
this allocator, regalloc2, measuring between 1.1M and 2.3M
instructions per second (closer to the former for larger functions):

==== 459 instructions
benches/0               time:   [377.91 us 378.09 us 378.27 us]
                        thrpt:  [1.2134 Melem/s 1.2140 Melem/s 1.2146 Melem/s]

==== 225 instructions
benches/1               time:   [202.03 us 202.14 us 202.27 us]
                        thrpt:  [1.1124 Melem/s 1.1131 Melem/s 1.1137 Melem/s]

==== 21 instructions
benches/2               time:   [9.5605 us 9.5655 us 9.5702 us]
                        thrpt:  [2.1943 Melem/s 2.1954 Melem/s 2.1965 Melem/s]

Though not apples-to-apples (SSA vs. non-SSA, completely different
code only with similar length), this is at least some evidence that
regalloc2 is likely to lead to at least a compile-time improvement
when used in e.g. Cranelift.

We currently use a heuristic that our scan for an available PReg
starts at an index into the register list that rotates with the bundle
index. This is a simple way to distribute contention across the whole
register file more evenly and avoid repeating less-likely-to-succeed
reg-map probes to lower-numbered registers for every bundle.

After some experimentation with different options (queue that
dynamically puts registers at end after allocating, various
ways of mixing/hashing indices, etc.), adding the *instruction offset*
(of the start of the first range in the bundle) as well gave the best
results. This is very simple and gives us a likely better-than-random
conflict avoidance because ranges tend to be local, so rotating
through registers as we scan down the list of instructions seems like
a very natural strategy.

On the tests used by our `cargo bench` benchmark, this reduces regfile
probes for the largest (459 instruction) benchmark from 1538 to 829,
i.e., approximately by half, and results in an 11% allocation speedup.
benches/regalloc.rs Outdated Show resolved Hide resolved
fuzz/.gitignore Outdated Show resolved Hide resolved
fuzz/Cargo.toml Outdated Show resolved Hide resolved
src/postorder.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@Amanieu Amanieu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new interface is much nicer than the old one in regalloc.rs!

src/lib.rs Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated
/// Operand must be in a fixed register.
FixedReg(PReg),
/// On defs only: reuse a use's register. Which use is given by `preg` field.
Reuse(usize),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful to have an ExplicitStack(slot) policy which indicates an incoming/outgoing stack argument. The register allocator should be aware of this for proper placement of spills & reloads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this some more and I suspect it may actually be better to (eventually) handle this requirement with a sort of simple rematerialization. Basically if we indicate (somehow) that a vreg's one def is [arbitrary idempotent/repeatable instruction], we can do many things with this, including a load from a particular arg slot. (I have more thoughts on how this would interact with heuristics, and whether it should allow or not allow other vregs as input, but for another time...)

The alternative is to merge the SpillSlot and user-defined arg-area stack addressing, which is certainly possible (and I guess more conventional) but is a bit more involved.

Anyway, I'll think more about this; we can track it in an issue perhaps; but for the initial Cranelift glue I plan to just use explicit loads at the top of the function, as we do today.

src/lib.rs Outdated Show resolved Hide resolved
…I changes.

The main enhancement in this commit is support for reference types and
stackmaps. This requires tracking whether each VReg is a "reference" or
"pointer". At certain instructions designated as "safepoints", the
regalloc will (i) ensure that all references are in spillslots rather
than in registers, and (ii) provide a list of exactly which spillslots
have live references at that program point. This can be used by, e.g., a
GC to trace and possibly modify pointers. The stackmap of spillslots is
precise: it includes all live references, and *only* live references.

This commit also brings in some API tweaks as part of the in-progress
Cranelift glue. In particular, it makes Allocations and Operands
mutually disjoint by using the same bitfield for the type-tag in both
and choosing non-overlapping tags. This will allow instructions to carry
an Operand for each register slot and then overwrite these in place with
Allocations. The `OperandOrAllocation` type does the necessary magic to
make this look like an enum, but staying in 32 bits.
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
- Support preferred and non-preferred subsets of a register class. This
  allows allocating, e.g., caller-saved registers before callee-saved
  registers.
- Allow branch blockparam args to start an a certain offset in branch
  operands; this allows branches to have other operands too (e.g.,
  conditional-branch inputs).
- Allow `OperandOrAllocation` to be constructed from an `Allocation` and
  `OperandKind` as well (i.e., an allocation with an use/def bit).
@cfallin
Copy link
Member Author

cfallin commented May 4, 2021

Just to keep folks up to speed on progress here: the work to adapt Cranelift to regalloc2 is coming along in my regalloc2 branch; the diff is growing to be a little bit larger than I had initially hoped but the end result will be cleaner, I think. I'm planning to drive that work to the point that we can take some performance measurements, then we can decide what to do with this (hopefully continue with refining it, assuming that we see improvements over status quo!).

This generalizes the allocator to accept multiple defs by making defs
just another type of "use" (uses are now perhaps more properly called
"mentions", but for now we abuse the terminology slightly).

It turns out that this actually was not terribly hard, because we don't
rely on the properties that a strict SSA requirement otherwise might
allow us to: e.g., defs always at exactly the start of a vreg's ranges.
Because we already accepted arbitrary block order and irreducible CFGs,
and approximated live-ranges with the single-pass algorithm, we are
robust in our "stitching" (move insertion) and so all we really care
about is computing some superset of the actual live-ranges and then a
non-interfering coloring of (split pieces of) those ranges. Multiple
defs don't change that, as long as we compute the ranges properly.

We still have blockparams in this design, so the client *can* provide
SSA directly, and everything will work as before. But a client that
produces non-SSA need not use them at all; it can just happily reassign
to vregs and everything will Just Work.

This is part of the effort to port Cranelift over to regalloc2; I have
decided that it may be easier to build a compatibility shim that matches
regalloc.rs's interface than to continue boiling the ocean and
converting all of the lowering sequences to SSA. It then becomes a
separable piece of work (and simply further performance improvements and
simplifications) to remove the need for this shim.
@cfallin
Copy link
Member Author

cfallin commented Aug 12, 2021

Thanks for the detailed review, @Amanieu! There are a few unresolved comments above still, one for which I'll be curious to hear Julian's API-design thoughts as well, but everything else is, I think, addressed.

In my profiles, 5-7% of the time is spent in malloc/realloc/free. Perhaps this is worth optimizing? Perhaps the various Vecs can be preserved across multiple invocations of the register allocator like Cranelift's Context.

Yes, this is a good vein of future optimization, I agree! I'm not sure when I'll be able to go into a dedicated "RA2 optimization" push again, but I've added this to the TODO list.

@Amanieu
Copy link
Contributor

Amanieu commented Aug 13, 2021

You missed two instances of cfg(debug) and cfg!(debug). (should be debug_assertions)

…ertions`.

Also fix some code that did not build in debug mode anymore (d'oh!) in
`src/ion/merges.rs`, as exposed by this change.
@cfallin
Copy link
Member Author

cfallin commented Aug 13, 2021

Fixed, thank you!

@Amanieu
Copy link
Contributor

Amanieu commented Aug 13, 2021

8ed83e3 seems to have regressed something, I'm getting assert failures:

thread 'main' panicked at 'assertion failed: self.vregs[vreg.index()].ranges.is_empty() ||\n    range.to <=\n        self.ranges[self.vregs[vreg.index()].ranges.last().unwrap().index.index()].range.from', /home/amanieu/code/regalloc2/src/ion/liveranges.rs:147:9

src/bitvec.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
src/bitvec.rs Outdated Show resolved Hide resolved
src/checker.rs Show resolved Hide resolved
src/checker.rs Show resolved Hide resolved
src/moves.rs Outdated Show resolved Hide resolved
@cfallin
Copy link
Member Author

cfallin commented Aug 31, 2021

@julian-seward1 thanks very much for all of your comments!

I hope I've addressed everything; still open to discussions or other suggestions on some of the naming points; this has all been very good feedback. Let me know if there's anything else!

src/lib.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@Amanieu Amanieu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth doing a clippy run and addressing the code style issues it raises.

src/lib.rs Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/ion/merge.rs Outdated Show resolved Hide resolved
src/ion/moves.rs Outdated Show resolved Hide resolved
@cfallin
Copy link
Member Author

cfallin commented Sep 1, 2021

It might be worth doing a clippy run and addressing the code style issues it raises.

Interesting; I haven't been in the habit of doing this (we don't do it in the Cranelift/Wasmtime repo for example) and it seems to raise a bunch of lints (62 warnings) many of which seem pretty opinionated to me; to take a random example, it wants me to merge these two if-statements but idiomatically the outer is the main action (merging sets, if changed...) and the inner is part of a "if not in workqueue, add to workqueue" idiom. I may be weird and old-school but putting those two together and relying on the short-circuiting of && for implicit control flow is too subtle for me. To take another example, it's complaining that two branches of a conditional in the checker both assign CheckerValue::Conflicted; that's intentional, in a sort of explicit break-down-the-cases chain of conditionals that define a meet-function. Merging them would be less readable IMHO.

I'd be happy to create an issue to track something like "look into which Clippy lints are appropriate" so that someone can eventually spend more time determining which lints make sense, but I think it's a wider issue and if we are serious about being lint-clean, we should look across our projects, define a common policy, and then adjust to it... if you or anyone else strongly disagrees, of course, I'm happy to reconsider :-)

@cfallin
Copy link
Member Author

cfallin commented Sep 1, 2021

Alright, I think I have addressed everything in the comments. Thank you again to @julian-seward1 and @Amanieu especially for the reviews, and everyone else who commented as well! I've written down a list of issues to track followup work that I'll create once I merge this, so the loose ends here shouldn't be forgotten; if I missed any then please feel free to create more.

And with that, I think it is time to merge this PR, based on Julian's r+!

@cfallin cfallin merged commit 57ccaef into bytecodealliance:main Sep 1, 2021
@cfallin cfallin deleted the initial-regalloc branch September 1, 2021 01:20
cfallin added a commit to cfallin/regalloc2 that referenced this pull request Sep 3, 2021
This was excluded by mistake in bytecodealliance#1; without a configuration, cargo-deny
runs with a default one, and rejects a lot of things (largely due to
open-source-but-not-allowlisted licenses).

This `deny.toml` comes from the regalloc.rs repo. It results in one
warning currently that will be resolved once bytecodealliance#7 is.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants