Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace store of freeze in allocop and in emit new struct with memset since aggregate stores are bad #55879

Merged
merged 6 commits into from
Oct 18, 2024

Conversation

gbaraldi
Copy link
Member

This fixes the issues found in slack in the reinterprets of

julia> split128_v2(x::UInt128) = (first(reinterpret(NTuple{2, UInt}, x)), last(reinterpret(NTuple{2, UInt}, x)))
split128_v2 (generic function with 1 method)

julia> split128(x::UInt128) = reinterpret(NTuple{2, UInt}, x)
split128 (generic function with 1 method)

@code_native  split128(UInt128(5))
	push	rbp
	mov	rbp, rsp
	mov	rax, rdi
	mov	qword ptr [rdi + 8], rdx
	mov	qword ptr [rdi], rsi
	pop	rbp
	ret

@code_native  split128_v2(UInt128(5))
	push	rbp
	mov	rbp, rsp
	mov	rax, rdi
	mov	qword ptr [rdi], rsi
	mov	qword ptr [rdi + 8], rdx
	pop	rbp
	ret

vs on master where

julia> @code_native  split128(UInt128(5))

	push	rbp
	mov	rbp, rsp
	mov	eax, esi
	shr	eax, 8
	mov	ecx, esi
	shr	ecx, 16
	mov	r8, rsi
	mov	r9, rsi
	vmovd	xmm0, esi
	vpinsrb	xmm0, xmm0, eax, 1
	mov	rax, rsi
	vpinsrb	xmm0, xmm0, ecx, 2
	mov	rcx, rsi
	shr	esi, 24
	vpinsrb	xmm0, xmm0, esi, 3
	shr	r8, 32
	vpinsrb	xmm0, xmm0, r8d, 4
	shr	r9, 40
	vpinsrb	xmm0, xmm0, r9d, 5
	shr	rax, 48
	vpinsrb	xmm0, xmm0, eax, 6
	shr	rcx, 56
	vpinsrb	xmm0, xmm0, ecx, 7
	vpinsrb	xmm0, xmm0, edx, 8
	mov	eax, edx
	shr	eax, 8
	vpinsrb	xmm0, xmm0, eax, 9
	mov	eax, edx
	shr	eax, 16
	vpinsrb	xmm0, xmm0, eax, 10
	mov	eax, edx
	shr	eax, 24
	vpinsrb	xmm0, xmm0, eax, 11
	mov	rax, rdx
	shr	rax, 32
	vpinsrb	xmm0, xmm0, eax, 12
	mov	rax, rdx
	shr	rax, 40
	vpinsrb	xmm0, xmm0, eax, 13
	mov	rax, rdx
	shr	rax, 48
	vpinsrb	xmm0, xmm0, eax, 14
	mov	rax, rdi
	shr	rdx, 56
	vpinsrb	xmm0, xmm0, edx, 15
	vmovdqu	xmmword ptr [rdi], xmm0
	pop	rbp
	ret

@gbaraldi gbaraldi added performance Must go faster merge me PR is reviewed. Merge when all tests are passing and removed merge me PR is reviewed. Merge when all tests are passing labels Oct 16, 2024
@gbaraldi gbaraldi requested a review from vtjnash October 18, 2024 14:15
@vtjnash vtjnash merged commit e5aff12 into master Oct 18, 2024
10 checks passed
@vtjnash vtjnash deleted the gb/memset branch October 18, 2024 15:01
@oscardssmith oscardssmith added compiler:codegen Generation of LLVM IR and native code compiler:llvm For issues that relate to LLVM labels Oct 18, 2024
KristofferC pushed a commit that referenced this pull request Oct 21, 2024
…th memset since aggregate stores are bad (#55879)

This fixes the issues found in slack in the reinterprets of
```julia
julia> split128_v2(x::UInt128) = (first(reinterpret(NTuple{2, UInt}, x)), last(reinterpret(NTuple{2, UInt}, x)))
split128_v2 (generic function with 1 method)

julia> split128(x::UInt128) = reinterpret(NTuple{2, UInt}, x)
split128 (generic function with 1 method)

@code_native  split128(UInt128(5))
	push	rbp
	mov	rbp, rsp
	mov	rax, rdi
	mov	qword ptr [rdi + 8], rdx
	mov	qword ptr [rdi], rsi
	pop	rbp
	ret

@code_native  split128_v2(UInt128(5))
	push	rbp
	mov	rbp, rsp
	mov	rax, rdi
	mov	qword ptr [rdi], rsi
	mov	qword ptr [rdi + 8], rdx
	pop	rbp
	ret
```

vs on master where
```julia
julia> @code_native  split128(UInt128(5))

	push	rbp
	mov	rbp, rsp
	mov	eax, esi
	shr	eax, 8
	mov	ecx, esi
	shr	ecx, 16
	mov	r8, rsi
	mov	r9, rsi
	vmovd	xmm0, esi
	vpinsrb	xmm0, xmm0, eax, 1
	mov	rax, rsi
	vpinsrb	xmm0, xmm0, ecx, 2
	mov	rcx, rsi
	shr	esi, 24
	vpinsrb	xmm0, xmm0, esi, 3
	shr	r8, 32
	vpinsrb	xmm0, xmm0, r8d, 4
	shr	r9, 40
	vpinsrb	xmm0, xmm0, r9d, 5
	shr	rax, 48
	vpinsrb	xmm0, xmm0, eax, 6
	shr	rcx, 56
	vpinsrb	xmm0, xmm0, ecx, 7
	vpinsrb	xmm0, xmm0, edx, 8
	mov	eax, edx
	shr	eax, 8
	vpinsrb	xmm0, xmm0, eax, 9
	mov	eax, edx
	shr	eax, 16
	vpinsrb	xmm0, xmm0, eax, 10
	mov	eax, edx
	shr	eax, 24
	vpinsrb	xmm0, xmm0, eax, 11
	mov	rax, rdx
	shr	rax, 32
	vpinsrb	xmm0, xmm0, eax, 12
	mov	rax, rdx
	shr	rax, 40
	vpinsrb	xmm0, xmm0, eax, 13
	mov	rax, rdx
	shr	rax, 48
	vpinsrb	xmm0, xmm0, eax, 14
	mov	rax, rdi
	shr	rdx, 56
	vpinsrb	xmm0, xmm0, edx, 15
	vmovdqu	xmmword ptr [rdi], xmm0
	pop	rbp
	ret
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:codegen Generation of LLVM IR and native code compiler:llvm For issues that relate to LLVM performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants