-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Emit power-of-two constant multiply as shift #13128
Conversation
@dotnet/jit-contrib PTAL. This rewrites ~6,000 multiplications across ~570 files in the jit-diff input set. See stats and discussion for motivation/background. |
src/jit/codegenxarch.cpp
Outdated
else if (!requiresOverflowCheck && rmOp->isUsedFromReg() && isPow2(imm)) | ||
{ | ||
// Use shift for constant multiply when legal | ||
unsigned int shiftAmount = genLog2((uint64_t)imm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why uint64_t
and not size_t
? And a static_cast would be preferable to a C style cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why
uint64_t
and notsize_t
?
I thought it best to exactly match one of the overloads of genLog2
.
a static_cast would be preferable to a C style cast
Indeed; updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, right, unlike isPow2
genLog2
is not a template.
ff96170
to
15fab59
Compare
@JosephTremoulet can you post an example or two of the diffs you're seeing? Also, it might be productive to file an issue to track moving this to LGTM otherwise. |
What if |
imm should be cast to size_t in the call to isPow2. But multiplying by 2^31 is so rare that this is probably not useful. And presumably we don't handle negative multipliers either - does PS
Looks like morph handles this but this codegen implementation obviously won't. Anyway, this optimization should be moved out of morph into lower/codegen and at that point it will work properly. |
Most of these get translated to shifts at Morph, but adding a failsafe here next to the transform for 3/5/9->LEA ensures we won't have these in the emitted code when they (re-)appear in later phases.
15fab59
to
60be3ea
Compare
Updated to use a bit test that will include min_int (might as well), but not to check for negative powers of two and insert the neg (my analysis showed almost none of these popping up at CodeGen, so I agree that deferring this until migrating the more complete pattern-matching from morph to lower makes sense, same as 3/5/9 times power-of-two). Also updated to handle the case that src and target reg differ. |
Sure, here's a small method that changes: Before: ; Assembly listing for method TestApp:test_88(ref,long):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) ref -> rcx class-hnd
; V01 arg1 [V01,T01] ( 3, 3 ) long -> rdx
;* V02 loc0 [V02 ] ( 0, 0 ) long -> zero-ref
; V03 loc1 [V03 ] ( 2, 2 ) byref -> [rsp+0x20] must-init pinned
; V04 tmp0 [V04,T02] ( 2, 4 ) long -> rax
;* V05 tmp1 [V05 ] ( 0, 0 ) long -> zero-ref
;* V06 tmp2 [V06 ] ( 0, 0 ) long -> zero-ref
; V07 tmp3 [V07,T04] ( 3, 3 ) byref -> rax
; V08 tmp4 [V08,T03] ( 2, 4 ) long -> rax
; V09 OutArgs [V09 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 40
G_M10220_IG01:
sub rsp, 40
xor rax, rax
mov qword ptr [rsp+20H], rax
G_M10220_IG02:
lea rax, [rdx-1]
mov edx, 1
sub eax, dword ptr [rcx+24]
cmp eax, dword ptr [rcx+16]
jae SHORT G_M10220_IG04
sub edx, dword ptr [rcx+28]
cmp edx, dword ptr [rcx+20]
jae SHORT G_M10220_IG04
mov r8d, dword ptr [rcx+20]
imul r8, rax
mov rax, rdx
add rax, r8
imul rax, rax, 16 ; <-------- multiply by constant power-of-two
lea rax, bword ptr [rcx+rax+32]
cmp dword ptr [rax], eax
add rax, 8
mov bword ptr [rsp+20H], rax
mov rax, bword ptr [rsp+20H]
mov rax, qword ptr [rax]
G_M10220_IG03:
add rsp, 40
ret
G_M10220_IG04:
call CORINFO_HELP_RNGCHKFAIL
int3
; Total bytes of code 89, prolog size 11 for method TestApp:test_88(ref,long):long After: ; Assembly listing for method TestApp:test_88(ref,long):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) ref -> rcx class-hnd
; V01 arg1 [V01,T01] ( 3, 3 ) long -> rdx
;* V02 loc0 [V02 ] ( 0, 0 ) long -> zero-ref
; V03 loc1 [V03 ] ( 2, 2 ) byref -> [rsp+0x20] must-init pinned
; V04 tmp0 [V04,T02] ( 2, 4 ) long -> rax
;* V05 tmp1 [V05 ] ( 0, 0 ) long -> zero-ref
;* V06 tmp2 [V06 ] ( 0, 0 ) long -> zero-ref
; V07 tmp3 [V07,T04] ( 3, 3 ) byref -> rax
; V08 tmp4 [V08,T03] ( 2, 4 ) long -> rax
; V09 OutArgs [V09 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 40
G_M10220_IG01:
sub rsp, 40
xor rax, rax
mov qword ptr [rsp+20H], rax
G_M10220_IG02:
lea rax, [rdx-1]
mov edx, 1
sub eax, dword ptr [rcx+24]
cmp eax, dword ptr [rcx+16]
jae SHORT G_M10220_IG04
sub edx, dword ptr [rcx+28]
cmp edx, dword ptr [rcx+20]
jae SHORT G_M10220_IG04
mov r8d, dword ptr [rcx+20]
imul r8, rax
mov rax, rdx
add rax, r8
shl rax, 4 ; <-------- changed to shift
lea rax, bword ptr [rcx+rax+32]
cmp dword ptr [rax], eax
add rax, 8
mov bword ptr [rsp+20H], rax
mov rax, bword ptr [rsp+20H]
mov rax, qword ptr [rax]
G_M10220_IG03:
add rsp, 40
ret
G_M10220_IG04:
call CORINFO_HELP_RNGCHKFAIL
int3
; Total bytes of code 89, prolog size 11 for method TestApp:test_88(ref,long):long |
Created #13150 |
@BruceForstall, good with update? |
Most of these get translated to shifts at Morph, but adding a failsafe
here next to the transform for 3/5/9->LEA ensures we won't have these in
the emitted code when they (re-)appear in later phases.