-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Vector instead of direct Intrinsics #41790
Conversation
AVX may have startup-cost, so the claim "slightly improves" may or may not hold. Aside from that: there are many more places that could be updated here. |
That's why it is still a draft :) Is there an existing benchmark or a list of real-world values to consider for a benchmark? Or should I use random values of different length to validate the change? |
@gfoidl I've added benchmark results to the PR description. |
Thanks 👍🏻
My guess is
So once that issue is resolved (a PR is already open) Vector128 shouldn't be slower.But: one would need to cross-check the disassembly if my guess is correct. |
Good catch! assembly; Vector128Helper.ReplacePlusWithSpaceCore(System.Span`1<Char>, IntPtr)
push rax
vzeroupper
xor eax,eax
mov [rsp],rax
mov rax,[rcx]
mov [rsp],rax
xor r8d,r8d
mov ecx,[rcx+8]
cmp rcx,8
jl short M01_L01
vmovupd xmm0,[7FF9930F3F50]
vmovupd xmm1,[7FF9930F3F60]
lea r9,[rcx+0FFF8]
M01_L00:
vmovdqu xmm2,xmmword ptr [rdx+r8*2]
vpcmpeqw xmm3,xmm2,xmm0
vpand xmm4,xmm1,xmm3
vpandn xmm2,xmm3,xmm2
vpor xmm2,xmm4,xmm2
vmovdqu xmmword ptr [rax+r8*2],xmm2
add r8,8
cmp r8,r9
jle short M01_L00
M01_L01:
cmp r8,rcx
jge short M01_L05
M01_L02:
movzx r9d,word ptr [rdx+r8*2]
cmp r9d,2B
je short M01_L03
mov [rax+r8*2],r9w
jmp short M01_L04
M01_L03:
mov word ptr [rax+r8*2],20
M01_L04:
inc r8
cmp r8,rcx
jl short M01_L02
M01_L05:
xor eax,eax
mov [rsp],rax
add rsp,8
ret
; Total bytes of code 135 |
I've updated the benchmark results with the latest build available on dotnet/installer. The Vector128 implementation is now equivalent to the current implementation. The Vector256 implementation is slightly slower for small string values but improves the perf for larger strings (not by far). |
Sse2.LoadVector128
with cross-platform methods fromVector128
andVector256
. (There is a similar move in dotnet/runtime Switch from direct intrinsics usage to Vector/Vector64/Vector128/Vector256 runtime#64451)Vector256
which slightly improves the performance for strings larger than 16 characters.Benchmark (7.0.100-preview.6.22307.20)
source