Closed
Labels
Cost:S (Work that requires one engineer up to 1 week)
JitUntriaged (CLR JIT issues needing additional triage)
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
help wanted ([up-for-grabs] Good issue for external contributors)
tenet-performance (Performance related issue)
Description
I was trying to work out why I wasn't getting the performance I expected from Vector.Narrow:
using System;
using System.Linq;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Text;

class Program
{
    static unsafe void Main(string[] args)
    {
        var charArray = Enumerable.Repeat('a', Vector<ushort>.Count * 2).ToArray();
        var byteArray = new byte[Vector<byte>.Count];
        fixed (char* pChar = charArray)
        fixed (byte* pByte = byteArray)
        {
            Narrow(pChar, pByte);
        }
        Console.WriteLine(Encoding.ASCII.GetString(byteArray));
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static unsafe void Narrow(char* input, byte* output)
    {
        var bytes = Vector.Narrow(
            Unsafe.AsRef<Vector<ushort>>(input),
            Unsafe.AsRef<Vector<ushort>>(input + Vector<ushort>.Count));
        Unsafe.AsRef<Vector<byte>>(output) = bytes;
    }
}

Generates
; Assembly listing for method Program:Narrow(long,long)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 4, 4 ) long -> rcx
; V01 arg1 [V01,T01] ( 3, 3 ) long -> rdx
; V02 loc0 [V02,T07] ( 2, 2 ) simd32 -> mm0
; V03 tmp0 [V03,T02] ( 2, 4 ) simd32 -> mm0
;* V04 tmp1 [V04,T08] ( 0, 0 ) byref -> zero-ref
;* V05 tmp2 [V05,T09] ( 0, 0 ) byref -> zero-ref
; V06 tmp3 [V06,T03] ( 2, 2 ) byref -> rax
;* V07 tmp4 [V07 ] ( 0, 0 ) long -> zero-ref
; V08 tmp5 [V08,T04] ( 2, 2 ) byref -> rax
; V09 tmp6 [V09,T05] ( 2, 2 ) byref -> rax
; V10 tmp7 [V10,T06] ( 2, 2 ) byref -> rax
;# V11 OutArgs [V11 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M2586_IG01:
C5F877 vzeroupper
G_M2586_IG02:
C4E17D1001 vmovupd ymm0, ymmword ptr[rcx]
488D4120 lea rax, [rcx+32]
C4E17D1008 vmovupd ymm1, ymmword ptr[rax]
C4E37D46D920 vperm2i128 ymm3, ymm0, ymm1, 32
C4E37D46D131 vperm2i128 ymm2, ymm0, ymm1, 49
C4E16571F308 vpsllw ymm3, 8
C4E16571D308 vpsrlw ymm3, 8
C4E16D71F208 vpsllw ymm2, 8
C4E16D71D208 vpsrlw ymm2, 8
C4E16567C2 vpackuswb ymm0, ymm3, ymm2
488BC2 mov rax, rdx
C4E17D1100 vmovupd ymmword ptr[rax], ymm0
G_M2586_IG03:
C5F877 vzeroupper
C3 ret
; Total bytes of code 70, prolog size 3 for method Program:Narrow(long,long)

Is this correct (performance-wise)?
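For reference, here is a scalar sketch of what the listing above appears to compute, as I read it. The two `vperm2i128` instructions regroup the 128-bit lanes so that the per-lane `vpackuswb` emits bytes in source order, each `vpsllw 8` / `vpsrlw 8` pair clears the high byte of every 16-bit element, and `vpackuswb` then packs the already-masked words into bytes, so its saturation never takes effect. `NarrowReference` and `NarrowScalar` are hypothetical names introduced only for this illustration; they are not part of the repro or of any library.

```csharp
using System;

static class NarrowReference
{
    // Hypothetical helper (not part of the repro above): a scalar equivalent of
    // narrowing 16-bit elements to bytes by keeping only the low 8 bits.
    public static void NarrowScalar(ReadOnlySpan<char> input, Span<byte> output)
    {
        if (output.Length < input.Length)
            throw new ArgumentException("Output is too small.", nameof(output));

        for (int i = 0; i < input.Length; i++)
        {
            // Same effect as the vpsllw 8 / vpsrlw 8 pair: clear the high byte.
            ushort masked = (ushort)(input[i] & 0x00FF);

            // After masking, the value always fits in a byte, so a saturating
            // pack (vpackuswb) has nothing left to clamp.
            output[i] = (byte)masked;
        }
    }
}
```

If this reading is right, the shift pairs are there only to turn the saturating pack into a truncating one, which would account for four of the ten vector instructions in the method body.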
/cc @mikedn @CarolEidt
category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium