Emit power-of-two constant multiply as shift #13128

JosephTremoulet · 2017-07-31T20:41:51Z

Most of these get translated to shifts at Morph, but adding a failsafe
here next to the transform for 3/5/9->LEA ensures we won't have these in
the emitted code when they (re-)appear in later phases.

JosephTremoulet · 2017-07-31T20:44:12Z

@dotnet/jit-contrib PTAL. This rewrites ~6,000 multiplications across ~570 files in the jit-diff input set. See stats and discussion for motivation/background.

mikedn · 2017-07-31T20:49:07Z

src/jit/codegenxarch.cpp

+        else if (!requiresOverflowCheck && rmOp->isUsedFromReg() && isPow2(imm))
+        {
+            // Use shift for constant multiply when legal
+            unsigned int shiftAmount = genLog2((uint64_t)imm);


Why uint64_t and not size_t? And a static_cast would be preferable to a C style cast.

Why uint64_t and not size_t?

I thought it best to exactly match one of the overloads of genLog2.

a static_cast would be preferable to a C style cast

Indeed; updated.

Hmm, right, unlike isPow2, genLog2 is not a template.
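To illustrate the overload point: when a non-template helper has separate 32-bit and 64-bit unsigned overloads, passing a signed immediate directly would typically be ambiguous, so the caller casts to exactly one overload's parameter type. The sketch below assumes that shape; `log2Sketch` and `shiftAmountFor` are hypothetical names for illustration, not the JIT's genLog2.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-bit overload: floor(log2(value)) for value >= 1.
inline unsigned log2Sketch(uint64_t value)
{
    unsigned result = 0;
    while (value > 1)
    {
        value >>= 1;
        ++result;
    }
    return result;
}

// Hypothetical 32-bit overload, forwarding to the 64-bit one.
inline unsigned log2Sketch(uint32_t value)
{
    return log2Sketch(static_cast<uint64_t>(value));
}

// The immediate is signed here. Calling log2Sketch(imm) directly would make
// both overloads equally good candidates, so pick one with a static_cast.
inline unsigned shiftAmountFor(int64_t imm)
{
    assert(imm > 0); // sketch only handles positive powers of two
    return log2Sketch(static_cast<uint64_t>(imm));
}
```

With these definitions, `shiftAmountFor(16)` yields a shift amount of 4.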

pgavlin · 2017-07-31T21:43:34Z

@JosephTremoulet can you post an example or two of the diffs you're seeing? Also, it might be productive to file an issue to track moving this to Lower.

LGTM otherwise.

BruceForstall · 2017-07-31T21:56:54Z

What if imm==2^31? In that case, ssize_t on x86 makes it negative, so I would think isPow2 will return false. (Similar for 2^63 on x64?)

mikedn · 2017-08-01T05:43:48Z

What if imm==2^31? In that case, ssize_t on x86 makes it negative, so I would think isPow2 will return false. (Similar for 2^63 on x64?)

imm should be cast to size_t in the call to isPow2. But multiplying by 2^31 is so rare that this is probably not useful.
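A small sketch of why the cast matters, assuming a 64-bit signed immediate: a test that requires a positive value rejects the minimum integer, while the same bit test on the unsigned reinterpretation accepts it. The function names are hypothetical and are not the JIT's isPow2.

```cpp
#include <cassert>
#include <cstdint>

// Signed test: anything that is not strictly positive is rejected,
// so INT64_MIN (bit pattern 2^63) is not seen as a power of two.
inline bool isPow2Signed(int64_t imm)
{
    return imm > 0 && (imm & (imm - 1)) == 0;
}

// Unsigned test: reinterpret the bits first, then check for a single set bit.
inline bool isPow2Unsigned(int64_t imm)
{
    const uint64_t bits = static_cast<uint64_t>(imm);
    return bits != 0 && (bits & (bits - 1)) == 0;
}

// Shift amount for an immediate accepted by the unsigned test.
inline unsigned shiftFor(int64_t imm)
{
    assert(isPow2Unsigned(imm));
    uint64_t bits  = static_cast<uint64_t>(imm);
    unsigned shift = 0;
    while (bits > 1)
    {
        bits >>= 1;
        ++shift;
    }
    return shift;
}
```

Note that the unsigned test still rejects values such as -4, whose bit pattern has many bits set.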

And presumably we don't handle negative multipliers either - does x * -4 work?

PS

And presumably we don't handle negative multipliers either - does x * -4 work?

Looks like morph handles this but this codegen implementation obviously won't. Anyway, this optimization should be moved out of morph into lower/codegen and at that point it will work properly.
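The `x * -4` case can be sketched as a shift followed by a negate. This is an illustration of the idea only, using wrapping unsigned arithmetic to model the register; `MulShiftPlan`, `planMulBy`, and `applyPlan` are hypothetical names, and nothing here is taken from morph or from this change's codegen.

```cpp
#include <cassert>
#include <cstdint>

struct MulShiftPlan
{
    bool     usable; // true if imm is +/- a power of two
    unsigned shift;  // shift amount when usable
    bool     negate; // true if a negate must follow the shift
};

inline bool isPow2Bits(uint64_t bits)
{
    return bits != 0 && (bits & (bits - 1)) == 0;
}

inline unsigned log2Bits(uint64_t bits)
{
    unsigned n = 0;
    while (bits > 1)
    {
        bits >>= 1;
        ++n;
    }
    return n;
}

inline MulShiftPlan planMulBy(int64_t imm)
{
    const uint64_t bits = static_cast<uint64_t>(imm);
    if (isPow2Bits(bits)) // positive powers of two, plus the minimum integer
    {
        return {true, log2Bits(bits), false};
    }
    const uint64_t negated = 0 - bits; // two's-complement negation
    if (isPow2Bits(negated)) // e.g. -4 becomes 4
    {
        return {true, log2Bits(negated), true};
    }
    return {false, 0, false};
}

// Models "shl reg, shift" optionally followed by "neg reg".
inline uint64_t applyPlan(uint64_t x, const MulShiftPlan& plan)
{
    assert(plan.usable);
    uint64_t result = x << plan.shift;
    return plan.negate ? (0 - result) : result;
}
```

For example, `applyPlan(7, planMulBy(-4))` produces the same bits as `7 * -4` in 64-bit two's-complement arithmetic.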

JosephTremoulet · 2017-08-01T20:32:31Z

Updated to use a bit test that also accepts min_int (might as well). It does not check for negative powers of two and insert the neg: my analysis showed almost none of these popping up at CodeGen, so I agree that it makes sense to defer this until the more complete pattern-matching migrates from morph to lower, same as 3/5/9 times power-of-two.

Also updated to handle the case that src and target reg differ.
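A rough sketch of the differing-register case, under the assumption that the shift operates in place on the target register: when the source lives elsewhere, it is copied first. The emitter below just produces text; `emitMulPow2` and its string-based interface are hypothetical and do not reflect the JIT's emitter API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the instruction sequence for target = src * 2^shift.
inline std::vector<std::string> emitMulPow2(const std::string& target,
                                            const std::string& src,
                                            unsigned           shift)
{
    assert(shift < 64); // shift must fit a 64-bit register
    std::vector<std::string> code;
    if (target != src)
    {
        // The shift is destructive, so bring the source into the target first.
        code.push_back("mov " + target + ", " + src);
    }
    code.push_back("shl " + target + ", " + std::to_string(shift));
    return code;
}
```

Calling `emitMulPow2("rax", "rax", 4)` gives the single `shl rax, 4`, while `emitMulPow2("rax", "rdx", 4)` puts a `mov rax, rdx` in front of it.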

JosephTremoulet · 2017-08-01T20:35:15Z

can you post an example or two of the diffs you're seeing?

Sure, here's a small method that changes:

Before:

; Assembly listing for method TestApp:test_88(ref,long):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     ref  ->  rcx         class-hnd
;  V01 arg1         [V01,T01] (  3,  3   )    long  ->  rdx        
;* V02 loc0         [V02    ] (  0,  0   )    long  ->  zero-ref   
;  V03 loc1         [V03    ] (  2,  2   )   byref  ->  [rsp+0x20]   must-init pinned
;  V04 tmp0         [V04,T02] (  2,  4   )    long  ->  rax        
;* V05 tmp1         [V05    ] (  0,  0   )    long  ->  zero-ref   
;* V06 tmp2         [V06    ] (  0,  0   )    long  ->  zero-ref   
;  V07 tmp3         [V07,T04] (  3,  3   )   byref  ->  rax        
;  V08 tmp4         [V08,T03] (  2,  4   )    long  ->  rax        
;  V09 OutArgs      [V09    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;
; Lcl frame size = 40
G_M10220_IG01:
       sub      rsp, 40
       xor      rax, rax
       mov      qword ptr [rsp+20H], rax
G_M10220_IG02:
       lea      rax, [rdx-1]
       mov      edx, 1
       sub      eax, dword ptr [rcx+24]
       cmp      eax, dword ptr [rcx+16]
       jae      SHORT G_M10220_IG04
       sub      edx, dword ptr [rcx+28]
       cmp      edx, dword ptr [rcx+20]
       jae      SHORT G_M10220_IG04
       mov      r8d, dword ptr [rcx+20]
       imul     r8, rax
       mov      rax, rdx
       add      rax, r8
       imul     rax, rax, 16             ; <-------- multiply by constant power-of-two
       lea      rax, bword ptr [rcx+rax+32]
       cmp      dword ptr [rax], eax
       add      rax, 8
       mov      bword ptr [rsp+20H], rax
       mov      rax, bword ptr [rsp+20H]
       mov      rax, qword ptr [rax]
G_M10220_IG03:
       add      rsp, 40
       ret      
G_M10220_IG04:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code 89, prolog size 11 for method TestApp:test_88(ref,long):long

After:

; Assembly listing for method TestApp:test_88(ref,long):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     ref  ->  rcx         class-hnd
;  V01 arg1         [V01,T01] (  3,  3   )    long  ->  rdx        
;* V02 loc0         [V02    ] (  0,  0   )    long  ->  zero-ref   
;  V03 loc1         [V03    ] (  2,  2   )   byref  ->  [rsp+0x20]   must-init pinned
;  V04 tmp0         [V04,T02] (  2,  4   )    long  ->  rax        
;* V05 tmp1         [V05    ] (  0,  0   )    long  ->  zero-ref   
;* V06 tmp2         [V06    ] (  0,  0   )    long  ->  zero-ref   
;  V07 tmp3         [V07,T04] (  3,  3   )   byref  ->  rax        
;  V08 tmp4         [V08,T03] (  2,  4   )    long  ->  rax        
;  V09 OutArgs      [V09    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;
; Lcl frame size = 40
G_M10220_IG01:
       sub      rsp, 40
       xor      rax, rax
       mov      qword ptr [rsp+20H], rax
G_M10220_IG02:
       lea      rax, [rdx-1]
       mov      edx, 1
       sub      eax, dword ptr [rcx+24]
       cmp      eax, dword ptr [rcx+16]
       jae      SHORT G_M10220_IG04
       sub      edx, dword ptr [rcx+28]
       cmp      edx, dword ptr [rcx+20]
       jae      SHORT G_M10220_IG04
       mov      r8d, dword ptr [rcx+20]
       imul     r8, rax
       mov      rax, rdx
       add      rax, r8
       shl      rax, 4             ; <-------- changed to shift
       lea      rax, bword ptr [rcx+rax+32]
       cmp      dword ptr [rax], eax
       add      rax, 8
       mov      bword ptr [rsp+20H], rax
       mov      rax, bword ptr [rsp+20H]
       mov      rax, qword ptr [rax]
G_M10220_IG03:
       add      rsp, 40
       ret      
G_M10220_IG04:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code 89, prolog size 11 for method TestApp:test_88(ref,long):long

JosephTremoulet · 2017-08-01T20:40:01Z

it might be productive to file an issue to track moving this to Lower.

Created #13150

JosephTremoulet · 2017-08-02T14:03:48Z

@BruceForstall, good with update?

JosephTremoulet requested a review from russellhadley July 31, 2017 20:41

dnfclas added the cla-already-signed label Jul 31, 2017

mikedn reviewed Jul 31, 2017

JosephTremoulet force-pushed the Mulshift branch from ff96170 to 15fab59 Compare July 31, 2017 20:55

pgavlin approved these changes Jul 31, 2017

Emit power-of-two constant multiply as shift

60be3ea

Most of these get translated to shifts at Morph, but adding a failsafe here next to the transform for 3/5/9->LEA ensures we won't have these in the emitted code when they (re-)appear in later phases.

JosephTremoulet force-pushed the Mulshift branch from 15fab59 to 60be3ea Compare August 1, 2017 20:27

BruceForstall approved these changes Aug 2, 2017

JosephTremoulet merged commit 21d7b38 into dotnet:master Aug 2, 2017

JosephTremoulet deleted the Mulshift branch August 2, 2017 18:28

karelz modified the milestone: 2.1.0 Aug 28, 2017
