Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use optimized assembly for hardware division #299

Merged
merged 3 commits into from
Apr 30, 2022

Conversation

Sizurka
Copy link
Contributor

@Sizurka Sizurka commented Feb 23, 2022

This implements the hardware divider usage directly in assembler. The result is a significant speedup in the common (no state save required) case as well as a small one when a state save is required.

In truth, I'm not sure this specific variant is the best option (see below), because blocks of assembler are always hard to maintain. The reason I ultimately decided on this variant was simply because division is common enough that having it be as fast as possible seemed worth the tradeoff of some gnarly code. If I could have found a way to convince the compiler to eliminate the shims/prologues from the pure Rust implementation, then that would likely be superior: it would only sacrifice some performance in the less common save/restore case.

Requirements

This PR currently requires a nightly compiler. This is because is makes use of global asm and because it depends on a change to compiler-builtins (rust-lang/rust#93696) to allow for the __aeabi_[u]idivmod intrinsics to be defined. I believe both of these are expected in 1.60, so it probably can't be merged until then.

__aeabi_[u]idivmod definitions

The __aeabi_[u]idivmod functions are defined with a modified AAPCS ABI: they return results in both r0 and r1. However, this can be emulated with plain AAPCS because of how it defines returning a 64-bit result. Basically: just return a 64-bit result with the remainder in the high order 32 bits and AAPCS guarantees the correct register assignment.

Alternatives

I also experimented with several alternatives to improve performance.

  1. Restructuring the Rust code to make it interact with the optimizer better. The compiler generates nearly the same code as the optimized assembly for the actual division operation. The main problem it exhibits is how it shuffles around the prologue/epilogue and wastes a bunch of time saving registers it doesn't need to in the common case. So this approach tries to mitigate that.
  2. Inline assembly this is basically a prototype of the global asm version with the only advantage being that it doesn't require a global symbol for the intrinsics disabled case. It also interacts slightly better with the optimizer at times.

Some benchmarks, taken from a simple test program.

Size Change (bytes, negative is smaller) opt-level = 's' opt-level = 'z' opt-level = '2'
Baseline 0 0 0
Restructured Rust +16 +24 -64
Inline asm -28 -24 -200
Global asm -8 -4 -168
Single division (cycles, includes all overhead) opt-level = 's' opt-level = 'z' opt-level = '2'
Baseline 51 63 51
Restructured Rust 45 57 35
Inline asm 38 38 38
Global asm 26 26 26

Fighting the optimizer

I was not able to figure out a way to make the s and z optimization levels eliminate the shims generated for the intrinsics without also duplicating the bodies when any modulus operators are used (which defeats the point of optimizing for size). At opt-level ≥ 2, with a minor change to the intrinsics alias generation, the shims are correctly eliminated and reduced to a direct call to the divider function. The global asm version avoids this by declaring the intrinsics in the assembler block.

This is the main source of the problem for the pure Rust variant when optimizing for size. The other half for the pure Rust one is the equivalent prologue on the actual division function, but that's unavoidable as long as there's no way to disable Rust generating LLVM frame pointers for ARM Thumb.

The generated shims:

10002688 <__aeabi_uidivmod>:
10002688:       b580            push    {r7, lr}
1000268a:       af00            add     r7, sp, #0
1000268c:       f7ff ffa8       bl      100025e0 <rp2040_hal::sio::divider_unsigned>
10002690:       bd80            pop     {r7, pc}

10002692 <__aeabi_uidiv>:
10002692:       b580            push    {r7, lr}
10002694:       af00            add     r7, sp, #0
10002696:       f7ff ffa3       bl      100025e0 <rp2040_hal::sio::divider_unsigned>
1000269a:       bd80            pop     {r7, pc}

Add 11 cycles and 10 bytes each, compared to 24 cycles (no state save) for the actual interior division, so this is unfortunately not a trivial overhead.

Also note that the s and z levels seem to be kind of erratic in producing these shims: sometimes I was getting call sequences that look like: __aeabi_uidiv -> __aeabi_uidivmod -> rp2040_hal::sio::divider_unsigned on z but not on s (with the prologue/epilogue generated for each one).

@jannic
Copy link
Member

jannic commented Feb 23, 2022

I'm really impressed by this detailed analysis of the pros and cons of the different alternatives. Thank you!

Sizurka added 2 commits April 12, 2022 08:50
Using the full module structure generated by the intrinsics macro
interacts oddly with inlining on some optimization levels, causing
a duplicate identical function body to be generated.  This doesn't
affect performance, but it wastes space, so just declare the alias
directly which seems to cause the symbol to be aliased as it should
be.
Putting the quotient first makes the compiler emit slightly better
code by putting the quotient in r0 which is generally what the
intrinsics want.
@Sizurka Sizurka force-pushed the divider-global-asm branch from b187fc6 to bf4b88b Compare April 12, 2022 14:55
Convert the hardware divider to optimized assembler.
@Sizurka Sizurka force-pushed the divider-global-asm branch from bf4b88b to f9d2610 Compare April 12, 2022 16:18
@Sizurka Sizurka marked this pull request as ready for review April 12, 2022 16:25
@Sizurka
Copy link
Contributor Author

Sizurka commented Apr 12, 2022

With 1.60 released, this now builds on stable. So it should be able to be merged now.

Copy link
Member

@9names 9names left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally got around to reviewing this.
I have nothing to add except to say: This is very good. Thanks for the great PR!

LGTM

@9names 9names merged commit ed0cda5 into rp-rs:main Apr 30, 2022
@Sizurka Sizurka deleted the divider-global-asm branch May 1, 2022 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants