Vector.Narrow performance #9766

@benaadams

I was trying to work out why I wasn't getting the performance I expected from Vector.Narrow. Repro:

using System;
using System.Linq;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Text;

class Program
{
    static unsafe void Main(string[] args)
    {
        var charArray = Enumerable.Repeat('a', Vector<ushort>.Count * 2).ToArray();
        var byteArray = new byte[Vector<byte>.Count];

        fixed (char* pChar = charArray)
        fixed (byte* pByte = byteArray)
        {
            Narrow(pChar, pByte);
        }

        Console.WriteLine(Encoding.ASCII.GetString(byteArray));
    }

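    // NoInlining keeps Narrow as a separate method so its codegen can be inspected on its own.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static unsafe void Narrow(char* input, byte* output)
    {
        // Reinterpret the input as two consecutive Vector<ushort> values
        // and narrow them into a single Vector<byte>.
        var bytes = Vector.Narrow(
            Unsafe.AsRef<Vector<ushort>>(input),
            Unsafe.AsRef<Vector<ushort>>(input + Vector<ushort>.Count));
        // Store the narrowed vector through the output pointer.
        Unsafe.AsRef<Vector<byte>>(output) = bytes;
    }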
}

This generates the following code for Narrow:

; Assembly listing for method Program:Narrow(long,long)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )    long  ->  rcx        
;  V01 arg1         [V01,T01] (  3,  3   )    long  ->  rdx        
;  V02 loc0         [V02,T07] (  2,  2   )  simd32  ->  mm0        
;  V03 tmp0         [V03,T02] (  2,  4   )  simd32  ->  mm0        
;* V04 tmp1         [V04,T08] (  0,  0   )   byref  ->  zero-ref   
;* V05 tmp2         [V05,T09] (  0,  0   )   byref  ->  zero-ref   
;  V06 tmp3         [V06,T03] (  2,  2   )   byref  ->  rax        
;* V07 tmp4         [V07    ] (  0,  0   )    long  ->  zero-ref   
;  V08 tmp5         [V08,T04] (  2,  2   )   byref  ->  rax        
;  V09 tmp6         [V09,T05] (  2,  2   )   byref  ->  rax        
;  V10 tmp7         [V10,T06] (  2,  2   )   byref  ->  rax        
;# V11 OutArgs      [V11    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]  
;
; Lcl frame size = 0

G_M2586_IG01:
       C5F877               vzeroupper 

G_M2586_IG02:
       C4E17D1001           vmovupd  ymm0, ymmword ptr[rcx]
       488D4120             lea      rax, [rcx+32]
       C4E17D1008           vmovupd  ymm1, ymmword ptr[rax]
       C4E37D46D920         vperm2i128 ymm3, ymm0, ymm1, 32
       C4E37D46D131         vperm2i128 ymm2, ymm0, ymm1, 49
       C4E16571F308         vpsllw   ymm3, 8
       C4E16571D308         vpsrlw   ymm3, 8
       C4E16D71F208         vpsllw   ymm2, 8
       C4E16D71D208         vpsrlw   ymm2, 8
       C4E16567C2           vpackuswb ymm0, ymm3, ymm2
       488BC2               mov      rax, rdx
       C4E17D1100           vmovupd  ymmword ptr[rax], ymm0

G_M2586_IG03:
       C5F877               vzeroupper 
       C3                   ret      

; Total bytes of code 70, prolog size 3 for method Program:Narrow(long,long)
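
For reference when reading the listing: the two vperm2i128 instructions (immediates 32 = 0x20 and 49 = 0x31) gather the low and the high 128-bit halves of the two inputs, each vpsllw/vpsrlw pair clears the upper byte of every 16-bit element, and vpackuswb packs the results into bytes. The net effect should match a plain truncating narrow. Below is a minimal sketch that compares Vector.Narrow against a scalar reference; the names NarrowSketch, NarrowScalar and MatchesScalarReference are hypothetical helpers for illustration and are not part of the repro above.

```csharp
using System;
using System.Numerics;

static class NarrowSketch
{
    // Scalar reference (hypothetical helper): keep only the low byte of each
    // ushort. Elements of 'low' fill the first half of the result and
    // elements of 'high' fill the second half.
    public static byte[] NarrowScalar(ushort[] low, ushort[] high)
    {
        var result = new byte[low.Length + high.Length];
        for (int i = 0; i < low.Length; i++)
            result[i] = unchecked((byte)low[i]);
        for (int i = 0; i < high.Length; i++)
            result[low.Length + i] = unchecked((byte)high[i]);
        return result;
    }

    // Returns true when Vector.Narrow agrees with the scalar reference
    // for one set of sample inputs.
    public static bool MatchesScalarReference()
    {
        int n = Vector<ushort>.Count;
        var low = new ushort[n];
        var high = new ushort[n];
        for (int i = 0; i < n; i++)
        {
            low[i] = (ushort)('a' + i);       // fits in a byte
            high[i] = (ushort)(0x1200 + i);   // upper byte set, exercises truncation
        }

        Vector<byte> narrowed = Vector.Narrow(
            new Vector<ushort>(low),
            new Vector<ushort>(high));

        byte[] expected = NarrowScalar(low, high);
        for (int i = 0; i < Vector<byte>.Count; i++)
        {
            if (narrowed[i] != expected[i])
                return false;
        }
        return true;
    }
}
```

Calling NarrowSketch.MatchesScalarReference() from any entry point checks the element ordering and truncation only; it says nothing about whether the instruction sequence above is the fastest way to get there.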

Is this correct (performance-wise)?

/cc @mikedn @CarolEidt

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

Labels

- Cost:S (Work that requires one engineer up to 1 week)
- JitUntriaged (CLR JIT issues needing additional triage)
- area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
- help wanted ([up-for-grabs] Good issue for external contributors)
- tenet-performance (Performance related issue)
