
WIP: try integrate BFloat16 #124

Open · wants to merge 1 commit into master
Conversation

KristofferC (Collaborator) commented Aug 8, 2024

This is just to see how things behave (#123)

julia> using SIMD, BFloat16s

julia> v = Vec(ntuple(i->BFloat16(rand()), Val(8)))
<8 x BFloat16>[0.796875, 0.56640625, 0.97265625, 0.68359375, 0.26367188, 0.6640625, 0.63671875, 0.8125]

julia> v+v
<8 x BFloat16>[1.59375, 1.1328125, 1.9453125, 1.3671875, 0.52734375, 1.328125, 1.2734375, 1.625]

For v+v we generate the following LLVM code:

%3 = fadd  <8 x bfloat> %0, %1
ret <8 x bfloat> %3

but the code that actually gets generated demotes through Float32:

julia> @code_llvm v+v
; Function Signature: +(SIMD.Vec{8, Core.BFloat16}, SIMD.Vec{8, Core.BFloat16})
;  @ /home/kc/JuliaPkgs/SIMD.jl/src/simdvec.jl:256 within `+`
define void @"julia_+_6263"(ptr noalias nocapture noundef nonnull sret([1 x <8 x bfloat>]) align 16 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 16 dereferenceable(16) %"x::Vec", ptr nocapture noundef nonnull readonly align 16 dereferenceable(16) %"y::Vec") #0 {
top:
;  @ /home/kc/JuliaPkgs/SIMD.jl/src/simdvec.jl:257 within `+`
; ┌ @ /home/kc/JuliaPkgs/SIMD.jl/src/LLVM_intrinsics.jl:221 within `fadd` @ /home/kc/JuliaPkgs/SIMD.jl/src/LLVM_intrinsics.jl:221
; │┌ @ /home/kc/JuliaPkgs/SIMD.jl/src/LLVM_intrinsics.jl:231 within `macro expansion`
    %"x::Vec.data_ptr.unbox" = load <8 x bfloat>, ptr %"x::Vec", align 16
    %"y::Vec.data_ptr.unbox" = load <8 x bfloat>, ptr %"y::Vec", align 16
    %0 = fpext <8 x bfloat> %"x::Vec.data_ptr.unbox" to <8 x float>
    %1 = fpext <8 x bfloat> %"y::Vec.data_ptr.unbox" to <8 x float>
    %2 = fadd <8 x float> %0, %1
    %3 = fptrunc <8 x float> %2 to <8 x bfloat>
; └└
; ┌ @ /home/kc/JuliaPkgs/SIMD.jl/src/simdvec.jl:2 within `Vec`
   store <8 x bfloat> %3, ptr %sret_return, align 16
   ret void
; └
}
julia> versioninfo()
Julia Version 1.12.0-DEV.1013
Commit c767032b8ff (2024-08-08 02:12 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × 12th Gen Intel(R) Core(TM) i9-12900K
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, alderlake)

cc @maleadt

@codecov-commenter commented

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.61%. Comparing base (c332a03) to head (03a4e3d).


Additional details and impacted files
@@            Coverage Diff             @@
##           master     #124      +/-   ##
==========================================
- Coverage   88.77%   88.61%   -0.16%     
==========================================
  Files           5        5              
  Lines         561      562       +1     
==========================================
  Hits          498      498              
- Misses         63       64       +1     


maleadt (Contributor) commented Aug 8, 2024

Ah nice, that simplifies my example:

julia> v = Vec(ntuple(i->Float32(rand()), Val(16)))
<16 x Float32>[0.35481927, 0.14949146, 0.33511126, 0.23023836, 0.16776331, 0.9152977, 0.19988814, 0.22910726, 0.10502812, 0.54989743, 0.14419909, 0.19571519, 0.21844539, 0.84552854, 0.03142407, 0.9895877]

julia> @code_native convert(Vec{16,BFloat16},v)
	push	rbp
	mov	rbp, rsp
	mov	rax, rdi
; │ @ /home/sdp/SIMD/src/simdvec.jl:57 within `convert`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:716 within `fptrunc`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:730 within `macro expansion`
	vcvtneps2bf16	ymm0, zmmword ptr [rsi]
; │└└
	vmovups	ymmword ptr [rdi], ymm0
	pop	rbp
	vzeroupper
	ret

Is it possible to express a dot product using SIMD.jl? That should ideally result in the other AVX512BF16 instruction being emitted.

Other operations, like the + you're doing, are not natively supported and will get demoted to Float32 by an LLVM pass of ours. That should probably be revisited for architectures that do actually support native Float16, but that's not what I was testing here.

KristofferC (Collaborator, Author) commented

> Is it possible to express a dot product using SIMD.jl?

There is sum(v*v), at least.
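
For the @code_llvm snippets below, f can be taken to be a hypothetical one-liner along these lines (not part of this PR, but consistent with the reported IR):

julia> f(v) = sum(v * v)
f (generic function with 1 method)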

maleadt (Contributor) commented Aug 8, 2024

> There is sum(v*v), at least.

Hmm, with that, the demotion to Float32 is likely to get in the way and prevent any potential fusion between the fmul and reduce.fadd, if such fusion is even supported:

julia> @code_llvm f(v)
; Function Signature: f(SIMD.Vec{16, Core.BFloat16})
;  @ REPL[33]:1 within `f`
define bfloat @julia_f_11886(ptr nocapture noundef nonnull readonly align 16 dereferenceable(32) %"v::Vec") #0 {
top:
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:257 within `*`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221 within `fmul` @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:230 within `macro expansion`
     %"v::Vec.data_ptr.unbox" = load <16 x bfloat>, ptr %"v::Vec", align 16
     %0 = fpext <16 x bfloat> %"v::Vec.data_ptr.unbox" to <16 x float>
     %1 = fmul <16 x float> %0, %0
     %2 = fptrunc <16 x float> %1 to <16 x bfloat>
; └└└
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:483 within `sum`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:858 within `reduce_fadd`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:874 within `macro expansion`
     %res.i = call reassoc bfloat @llvm.vector.reduce.fadd.v16bf16(bfloat 0xR0000, <16 x bfloat> %2)
     ret bfloat %res.i
; └└└
}

I guess that needs some solution at the LLVM level so that we don't need these demotions (i.e., llvm/llvm-project#97975). In the meantime, it would be possible to add explicit calls to e.g. the @llvm.x86.avx512bf16.dpbf16ps intrinsic, but that's probably for another PR.
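
For illustration, such an explicit call could look roughly like the sketch below. This is untested and makes several assumptions: the bfloat-typed signature that recent LLVM versions use for this intrinsic, AVX512BF16 hardware, and that BFloat16s' BFloat16 lowers to LLVM's bfloat (as with Core.BFloat16 on Julia 1.11+):

using BFloat16s

const VE = Base.VecElement

# Hypothetical wrapper for the 128-bit variant of the intrinsic: it
# accumulates pairwise BFloat16 products into a <4 x float> accumulator.
function dpbf16ps(acc::NTuple{4,VE{Float32}},
                  a::NTuple{8,VE{BFloat16}},
                  b::NTuple{8,VE{BFloat16}})
    Base.llvmcall(
        ("declare <4 x float> @llvm.x86.avx512bf16.dpbf16ps.128(<4 x float>, <8 x bfloat>, <8 x bfloat>)",
         """
         %r = call <4 x float> @llvm.x86.avx512bf16.dpbf16ps.128(<4 x float> %0, <8 x bfloat> %1, <8 x bfloat> %2)
         ret <4 x float> %r
         """),
        NTuple{4,VE{Float32}},
        Tuple{NTuple{4,VE{Float32}},NTuple{8,VE{BFloat16}},NTuple{8,VE{BFloat16}}},
        acc, a, b)
end

A SIMD.jl-level wrapper would then only need to unwrap and rewrap the underlying tuples of the Vec arguments.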

KristofferC (Collaborator, Author) commented

> In the meantime, it would be possible to add explicit calls to e.g. the @llvm.x86.avx512bf16.dpbf16ps intrinsic, but that's probably for another PR.

So far, it has worked quite well here to map only to the "generic" LLVM intrinsics and let LLVM choose the actual native instruction to emit. That keeps the package's scope reasonably limited compared to supporting n_architectures * n_instructions, and it also means that code written with SIMD.jl is not platform- or vector-size-specific.
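
For example, a dot product written this way is the same source for every element type and width (hypothetical helper):

julia> dot(x, y) = sum(x * y)
dot (generic function with 1 method)

julia> dot(Vec(ntuple(Float64, Val(4))), Vec(ntuple(Float64, Val(4))))
30.0

and the identical dot also accepts the Vec{8,BFloat16} values from earlier in the thread, with no source changes.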

maleadt (Contributor) commented Aug 8, 2024

The "problem" is that LLVM doesn't have a generic dot product intrinsic, so for any matching of dot product-like instructions to vdpbf16ps to work we probably need to get rid of the demotions in between. Which is something that needs to be fixed upstream.

Anyway, this isn't terribly important right now. It's a good start that we already match the conversion intrinsics without more invasive changes. And this is only an x86 problem; on some ARM platforms we shouldn't be demoting at all (JuliaLang/julia#55417).

KristofferC (Collaborator, Author) commented

> And this is only an x86 problem; on some ARM platforms we shouldn't be demoting at all (JuliaLang/julia#55417).

It would be interesting to see how this PR performs on such a system.

maleadt (Contributor) commented Aug 21, 2024

Alright, with JuliaLang/julia#55486 the sum(v*v) dot product example from above yields much cleaner LLVM code:

julia> @code_llvm f(v)
; Function Signature: f(SIMD.Vec{8, Core.BFloat16})
;  @ REPL[4]:1 within `f`
define bfloat @julia_f_6882(ptr nocapture noundef nonnull readonly align 16 dereferenceable(16) %"v::Vec") #0 {
top:
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:257 within `*`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221 within `fmul` @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:230 within `macro expansion`
     %"v::Vec.data_ptr.unbox" = load <8 x bfloat>, ptr %"v::Vec", align 16
     %0 = fmul <8 x bfloat> %"v::Vec.data_ptr.unbox", %"v::Vec.data_ptr.unbox"
; └└└
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:483 within `sum`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:858 within `reduce_fadd`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:874 within `macro expansion`
     %res.i = call reassoc bfloat @llvm.vector.reduce.fadd.v8bf16(bfloat 0xR0000, <8 x bfloat> %0)
     ret bfloat %res.i
; └└└
}

On LLVM 18, as used by the current master branch of Julia, that still isn't enough to match AVX512BF16 operations other than the conversions to and from single precision, though:

julia> @code_native f(v)
	.text
	.file	"f"
	.section	.ltext,"axl",@progbits
	.globl	julia_f_6889                    # -- Begin function julia_f_6889
	.p2align	4, 0x90
	.type	julia_f_6889,@function
julia_f_6889:                           # @julia_f_6889
; Function Signature: f(SIMD.Vec{8, Core.BFloat16})
; ┌ @ REPL[4]:1 within `f`
# %bb.0:                                # %top
	#DEBUG_VALUE: f:v <- [$rdi+0]
	push	rbp
	mov	rbp, rsp
; │┌ @ /home/sdp/SIMD/src/simdvec.jl:257 within `*`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221 within `fmul` @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221
; │││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:230 within `macro expansion`
	vmovdqa	xmm0, xmmword ptr [rdi]
	vpextrw	eax, xmm0, 4
	shl	eax, 16
	vmovd	xmm1, eax
	vmulss	xmm1, xmm1, xmm1
	vcvtneps2bf16	xmm1, xmm1
	vpextrw	eax, xmm0, 6
	shl	eax, 16
	vmovd	xmm2, eax
	vmulss	xmm2, xmm2, xmm2
...

LLVM trunk looks better, but still uses single precision: https://godbolt.org/z/obhd33MKc

I guess this is as much as we can do in Julia though; fmul + @llvm.vector.reduce.fadd is pretty idiomatic and probably something that should be matched to VDPBF16PS?

KristofferC (Collaborator, Author) commented

> fmul + @llvm.vector.reduce.fadd is pretty idiomatic and probably something that should be matched to VDPBF16PS?

Hopefully, yes.
