Use optimized assembly for hardware division #299
Merged
This implements the hardware divider usage directly in assembler. The result is a significant speedup in the common (no state save required) case as well as a small one when a state save is required.
In truth, I'm not sure this specific variant is the best option (see below), because blocks of assembler are always hard to maintain. I ultimately decided on this variant simply because division is common enough that having it be as fast as possible seemed worth the tradeoff of some gnarly code. If I could find a way to convince the compiler to eliminate the shims/prologues from the pure Rust implementation, that would likely be superior: it would only sacrifice some performance in the less common save/restore case.
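For context, here is a minimal sketch of what the pure Rust flavour looks like, assuming the SIO divider register layout from the RP2040 datasheet. The tuple-returning signature and the READY polling are simplifications of mine, not the crate's actual code, which also has to handle the save/restore case:

```rust
use core::ptr::{read_volatile, write_volatile};

// RP2040 SIO divider registers (offsets from the datasheet); this sketch
// only covers the unsigned, no-state-save path.
const SIO_BASE: usize = 0xd000_0000;
const DIV_UDIVIDEND: *mut u32 = (SIO_BASE + 0x060) as *mut u32;
const DIV_UDIVISOR: *mut u32 = (SIO_BASE + 0x064) as *mut u32;
const DIV_QUOTIENT: *mut u32 = (SIO_BASE + 0x070) as *mut u32;
const DIV_REMAINDER: *mut u32 = (SIO_BASE + 0x074) as *mut u32;
const DIV_CSR: *mut u32 = (SIO_BASE + 0x078) as *mut u32;

/// Returns (quotient, remainder). The real implementation also has to
/// handle an in-progress division (the DIRTY bit) and the signed case.
fn divider_unsigned(dividend: u32, divisor: u32) -> (u32, u32) {
    unsafe {
        write_volatile(DIV_UDIVIDEND, dividend);
        write_volatile(DIV_UDIVISOR, divisor);
        // The divider takes 8 cycles; polling READY (bit 0 of CSR) keeps
        // the sketch simple.
        while read_volatile(DIV_CSR) & 1 == 0 {}
        // Read the remainder before the quotient: reading the quotient
        // marks the result as consumed.
        let remainder = read_volatile(DIV_REMAINDER);
        let quotient = read_volatile(DIV_QUOTIENT);
        (quotient, remainder)
    }
}
```

The shims and prologues discussed below are the extra code the compiler generates around a function like this one.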
Requirements
This PR currently requires a nightly compiler. This is because it makes use of global asm and because it depends on a change to compiler-builtins (rust-lang/rust#93696) to allow the `__aeabi_[u]idivmod` intrinsics to be defined. I believe both of these are expected in 1.60, so it probably can't be merged until then.

`__aeabi_[u]idivmod` definitions

The `__aeabi_[u]idivmod` functions are defined with a modified AAPCS ABI: they return results in both `r0` and `r1`. However, this can be emulated with plain AAPCS because of how it defines returning a 64-bit result. Basically: just return a 64-bit value with the remainder in the high-order 32 bits, and AAPCS guarantees the correct register assignment.
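A rough sketch of that trick (not this PR's exact code, and reusing the `divider_unsigned` sketch from above): on little-endian AAPCS a `u64` return value comes back in `r0` (low word) and `r1` (high word), which is exactly where `__aeabi_uidivmod` is expected to leave the quotient and remainder.

```rust
// Sketch only: pack the quotient into the low 32 bits (returned in r0)
// and the remainder into the high 32 bits (returned in r1), so a plain
// AAPCS u64 return satisfies the modified __aeabi_uidivmod ABI.
// On arm-none-eabi targets, extern "C" is AAPCS.
#[no_mangle]
pub extern "C" fn __aeabi_uidivmod(dividend: u32, divisor: u32) -> u64 {
    let (quotient, remainder) = divider_unsigned(dividend, divisor);
    (quotient as u64) | ((remainder as u64) << 32)
}
```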
Alternatives

I also experimented with several alternatives to improve performance.
Some benchmarks, taken from a simple test program, comparing each variant at `opt-level = 's'`, `'z'`, and `'2'`.
Fighting the optimizer
I was not able to figure out a way to make the `s` and `z` optimization levels eliminate the shims generated for the intrinsics without also duplicating the bodies when any modulus operators are used (which defeats the point of optimizing for size). At `opt-level` ≥ 2, with a minor change to the intrinsics alias generation, the shims are correctly eliminated and reduced to a direct call to the divider function. The global asm version avoids this by declaring the intrinsics in the assembler block (see the sketch below).

This is the main source of the problem for the pure Rust variant when optimizing for size. The other half of the overhead for the pure Rust variant is the equivalent prologue on the actual division function, but that's unavoidable as long as there's no way to stop Rust from generating LLVM frame pointers for ARM Thumb.
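As a hedged illustration (the assembly body is simplified by me and is not this PR's actual code: it polls READY instead of counting the fixed 8 cycles, and skips the save/restore handling), defining the intrinsic symbol inside a `global_asm!` block looks roughly like this. Because the symbol is produced by the assembler rather than by rustc, there is no Rust-generated shim or prologue to pay for:

```rust
use core::arch::global_asm;

// Simplified sketch: define __aeabi_uidivmod directly in assembly.
// Offsets are the RP2040 SIO divider registers (UDIVIDEND 0x060,
// UDIVISOR 0x064, QUOTIENT 0x070, REMAINDER 0x074, CSR 0x078).
global_asm!(
    ".global __aeabi_uidivmod",
    ".type __aeabi_uidivmod, %function",
    ".thumb_func",
    "__aeabi_uidivmod:",
    "    movs r2, #0xd0",
    "    lsls r2, r2, #24",     // r2 = 0xd000_0000 (SIO base)
    "    str  r0, [r2, #0x60]", // DIV_UDIVIDEND
    "    str  r1, [r2, #0x64]", // DIV_UDIVISOR
    "1:  ldr  r3, [r2, #0x78]", // DIV_CSR
    "    lsrs r3, r3, #1",      // READY (bit 0) -> carry
    "    bcc  1b",
    "    ldr  r1, [r2, #0x74]", // DIV_REMAINDER -> r1
    "    ldr  r0, [r2, #0x70]", // DIV_QUOTIENT  -> r0
    "    bx   lr",
);
```

Since the intrinsic never exists as a Rust function, there is no shim for the `s` and `z` levels to (fail to) eliminate.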
The generated shims add 11 cycles and 10 bytes each, compared to 24 cycles (no state save) for the actual interior division, so this is unfortunately not a trivial overhead.
Also note that the `s` and `z` levels seem to be kind of erratic in producing these shims: sometimes I was getting call sequences that look like `__aeabi_uidiv` -> `__aeabi_uidivmod` -> `rp2040_hal::sio::divider_unsigned` on `z` but not on `s` (with the prologue/epilogue generated for each one).