Use optimized assembly for hardware division #299
Merged
This implements the hardware divider usage directly in assembler. The result is a significant speedup in the common (no state save required) case as well as a small one when a state save is required.
In truth, I'm not sure this specific variant is the best option (see below), because blocks of assembler are always hard to maintain. I ultimately decided on this variant simply because division is common enough that having it be as fast as possible seemed worth the tradeoff of some gnarly code. If I could find a way to convince the compiler to eliminate the shims/prologues from the pure Rust implementation, that would likely be superior: it would only sacrifice some performance in the less common save/restore case.
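For context, here is a minimal sketch of what the pure Rust flavour looks like, assuming the SIO divider register layout from the RP2040 datasheet. The tuple-returning signature and the READY polling are simplifications of mine, not the crate's actual code, which also has to handle the save/restore case:

```rust
use core::ptr::{read_volatile, write_volatile};

// RP2040 SIO divider registers (offsets from the datasheet); this sketch
// only covers the unsigned, no-state-save path.
const SIO_BASE: usize = 0xd000_0000;
const DIV_UDIVIDEND: *mut u32 = (SIO_BASE + 0x060) as *mut u32;
const DIV_UDIVISOR: *mut u32 = (SIO_BASE + 0x064) as *mut u32;
const DIV_QUOTIENT: *mut u32 = (SIO_BASE + 0x070) as *mut u32;
const DIV_REMAINDER: *mut u32 = (SIO_BASE + 0x074) as *mut u32;
const DIV_CSR: *mut u32 = (SIO_BASE + 0x078) as *mut u32;

/// Returns (quotient, remainder). The real implementation also has to
/// handle an in-progress division (the DIRTY bit) and the signed case.
fn divider_unsigned(dividend: u32, divisor: u32) -> (u32, u32) {
    unsafe {
        write_volatile(DIV_UDIVIDEND, dividend);
        write_volatile(DIV_UDIVISOR, divisor);
        // The divider takes 8 cycles; polling READY (bit 0 of CSR) keeps
        // the sketch simple.
        while read_volatile(DIV_CSR) & 1 == 0 {}
        // Read the remainder before the quotient: reading the quotient
        // marks the result as consumed.
        let remainder = read_volatile(DIV_REMAINDER);
        let quotient = read_volatile(DIV_QUOTIENT);
        (quotient, remainder)
    }
}
```

The shims and prologues discussed below are the extra code the compiler generates around a function like this one.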
Requirements
This PR currently requires a nightly compiler. This is because it makes use of global asm and because it depends on a change to compiler-builtins (rust-lang/rust#93696) to allow the `__aeabi_[u]idivmod` intrinsics to be defined. I believe both of these are expected in 1.60, so it probably can't be merged until then.

`__aeabi_[u]idivmod` definitions

The `__aeabi_[u]idivmod` functions are defined with a modified AAPCS ABI: they return results in both `r0` and `r1`. However, this can be emulated with plain AAPCS because of how it defines returning a 64-bit result. Basically: just return a 64-bit value with the remainder in the high-order 32 bits, and AAPCS guarantees the correct register assignment.
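A rough sketch of that trick (not this PR's exact code, and reusing the `divider_unsigned` sketch from above): on little-endian AAPCS a `u64` return value comes back in `r0` (low word) and `r1` (high word), which is exactly where `__aeabi_uidivmod` is expected to leave the quotient and remainder.

```rust
// Sketch only: pack the quotient into the low 32 bits (returned in r0)
// and the remainder into the high 32 bits (returned in r1), so a plain
// AAPCS u64 return satisfies the modified __aeabi_uidivmod ABI.
// On arm-none-eabi targets, extern "C" is AAPCS.
#[no_mangle]
pub extern "C" fn __aeabi_uidivmod(dividend: u32, divisor: u32) -> u64 {
    let (quotient, remainder) = divider_unsigned(dividend, divisor);
    (quotient as u64) | ((remainder as u64) << 32)
}
```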
Alternatives

I also experimented with several alternatives to improve performance.
Some benchmarks, taken from a simple test program, comparing each variant at `opt-level = 's'`, `'z'`, and `'2'`.
Fighting the optimizer
I was not able to figure out a way to make the `s` and `z` optimization levels eliminate the shims generated for the intrinsics without also duplicating the bodies when any modulus operators are used (which defeats the point of optimizing for size). At `opt-level` ≥ 2, with a minor change to the intrinsics alias generation, the shims are correctly eliminated and reduced to a direct call to the divider function. The global asm version avoids this by declaring the intrinsics in the assembler block (see the sketch below).

This is the main source of the problem for the pure Rust variant when optimizing for size. The other half of the overhead for the pure Rust variant is the equivalent prologue on the actual division function, but that's unavoidable as long as there's no way to stop Rust from generating LLVM frame pointers for ARM Thumb.
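As a hedged illustration (the assembly body is simplified by me and is not this PR's actual code: it polls READY instead of counting the fixed 8 cycles, and skips the save/restore handling), defining the intrinsic symbol inside a `global_asm!` block looks roughly like this. Because the symbol is produced by the assembler rather than by rustc, there is no Rust-generated shim or prologue to pay for:

```rust
use core::arch::global_asm;

// Simplified sketch: define __aeabi_uidivmod directly in assembly.
// Offsets are the RP2040 SIO divider registers (UDIVIDEND 0x060,
// UDIVISOR 0x064, QUOTIENT 0x070, REMAINDER 0x074, CSR 0x078).
global_asm!(
    ".global __aeabi_uidivmod",
    ".type __aeabi_uidivmod, %function",
    ".thumb_func",
    "__aeabi_uidivmod:",
    "    movs r2, #0xd0",
    "    lsls r2, r2, #24",     // r2 = 0xd000_0000 (SIO base)
    "    str  r0, [r2, #0x60]", // DIV_UDIVIDEND
    "    str  r1, [r2, #0x64]", // DIV_UDIVISOR
    "1:  ldr  r3, [r2, #0x78]", // DIV_CSR
    "    lsrs r3, r3, #1",      // READY (bit 0) -> carry
    "    bcc  1b",
    "    ldr  r1, [r2, #0x74]", // DIV_REMAINDER -> r1
    "    ldr  r0, [r2, #0x70]", // DIV_QUOTIENT  -> r0
    "    bx   lr",
);
```

Since the intrinsic never exists as a Rust function, there is no shim for the `s` and `z` levels to (fail to) eliminate.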
The generated shims add 11 cycles and 10 bytes each, compared to 24 cycles (no state save) for the actual interior division, so this is unfortunately not a trivial overhead.
Also note that the `s` and `z` levels seem to be kind of erratic in producing these shims: sometimes I was getting call sequences that look like `__aeabi_uidiv` -> `__aeabi_uidivmod` -> `rp2040_hal::sio::divider_unsigned` on `z` but not on `s` (with the prologue/epilogue generated for each one).