
WIP: try integrate BFloat16 #124

Open · wants to merge 1 commit into master
Conversation

KristofferC (Collaborator) commented Aug 8, 2024

This is just to see how things behave (#123)

julia> using SIMD, BFloat16s

julia> v = Vec(ntuple(i->BFloat16(rand()), Val(8)))
<8 x BFloat16>[0.796875, 0.56640625, 0.97265625, 0.68359375, 0.26367188, 0.6640625, 0.63671875, 0.8125]

julia> v+v
<8 x BFloat16>[1.59375, 1.1328125, 1.9453125, 1.3671875, 0.52734375, 1.328125, 1.2734375, 1.625]

For v+v we generate the following LLVM code:

%3 = fadd  <8 x bfloat> %0, %1
ret <8 x bfloat> %3

but the code that actually gets generated demotes through Float32:

julia> @code_llvm v+v
; Function Signature: +(SIMD.Vec{8, Core.BFloat16}, SIMD.Vec{8, Core.BFloat16})
;  @ /home/kc/JuliaPkgs/SIMD.jl/src/simdvec.jl:256 within `+`
define void @"julia_+_6263"(ptr noalias nocapture noundef nonnull sret([1 x <8 x bfloat>]) align 16 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 16 dereferenceable(16) %"x::Vec", ptr nocapture noundef nonnull readonly align 16 dereferenceable(16) %"y::Vec") #0 {
top:
;  @ /home/kc/JuliaPkgs/SIMD.jl/src/simdvec.jl:257 within `+`
; ┌ @ /home/kc/JuliaPkgs/SIMD.jl/src/LLVM_intrinsics.jl:221 within `fadd` @ /home/kc/JuliaPkgs/SIMD.jl/src/LLVM_intrinsics.jl:221
; │┌ @ /home/kc/JuliaPkgs/SIMD.jl/src/LLVM_intrinsics.jl:231 within `macro expansion`
    %"x::Vec.data_ptr.unbox" = load <8 x bfloat>, ptr %"x::Vec", align 16
    %"y::Vec.data_ptr.unbox" = load <8 x bfloat>, ptr %"y::Vec", align 16
    %0 = fpext <8 x bfloat> %"x::Vec.data_ptr.unbox" to <8 x float>
    %1 = fpext <8 x bfloat> %"y::Vec.data_ptr.unbox" to <8 x float>
    %2 = fadd <8 x float> %0, %1
    %3 = fptrunc <8 x float> %2 to <8 x bfloat>
; └└
; ┌ @ /home/kc/JuliaPkgs/SIMD.jl/src/simdvec.jl:2 within `Vec`
   store <8 x bfloat> %3, ptr %sret_return, align 16
   ret void
; └
}
julia> versioninfo()
Julia Version 1.12.0-DEV.1013
Commit c767032b8ff (2024-08-08 02:12 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × 12th Gen Intel(R) Core(TM) i9-12900K
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, alderlake)

cc @maleadt

@codecov-commenter commented

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.61%. Comparing base (c332a03) to head (03a4e3d).


Additional details and impacted files
@@            Coverage Diff             @@
##           master     #124      +/-   ##
==========================================
- Coverage   88.77%   88.61%   -0.16%     
==========================================
  Files           5        5              
  Lines         561      562       +1     
==========================================
  Hits          498      498              
- Misses         63       64       +1     


maleadt (Contributor) commented Aug 8, 2024

Ah nice, that simplifies my example:

julia> v = Vec(ntuple(i->Float32(rand()), Val(16)))
<16 x Float32>[0.35481927, 0.14949146, 0.33511126, 0.23023836, 0.16776331, 0.9152977, 0.19988814, 0.22910726, 0.10502812, 0.54989743, 0.14419909, 0.19571519, 0.21844539, 0.84552854, 0.03142407, 0.9895877]

julia> @code_native convert(Vec{16,BFloat16},v)
	push	rbp
	mov	rbp, rsp
	mov	rax, rdi
; │ @ /home/sdp/SIMD/src/simdvec.jl:57 within `convert`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:716 within `fptrunc`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:730 within `macro expansion`
	vcvtneps2bf16	ymm0, zmmword ptr [rsi]
; │└└
	vmovups	ymmword ptr [rdi], ymm0
	pop	rbp
	vzeroupper
	ret

Is it possible to express a dot product using SIMD.jl? That should ideally result in the other AVX512BF16 instruction being emitted.

Other operations, like the + you're doing, are not natively supported and will get demoted to Float32 by an LLVM pass of ours. That should probably be revisited for architectures that do actually support native Float16, but that's not what I was testing here.

KristofferC (Collaborator, Author) commented

> Is it possible to express a dot product using SIMD.jl?

There is sum(v*v), at least.
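
For the @code_llvm snippets below, f can be taken to be a hypothetical one-liner along these lines (not part of this PR, but consistent with the reported IR):

julia> f(v) = sum(v * v)
f (generic function with 1 method)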

maleadt (Contributor) commented Aug 8, 2024

> There is sum(v*v), at least.

Hmm, with that, the demotion to Float32 is likely to get in the way and prevent any potential fusion between the fmul and reduce.fadd, if such fusion is even supported:

julia> @code_llvm f(v)
; Function Signature: f(SIMD.Vec{16, Core.BFloat16})
;  @ REPL[33]:1 within `f`
define bfloat @julia_f_11886(ptr nocapture noundef nonnull readonly align 16 dereferenceable(32) %"v::Vec") #0 {
top:
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:257 within `*`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221 within `fmul` @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:230 within `macro expansion`
     %"v::Vec.data_ptr.unbox" = load <16 x bfloat>, ptr %"v::Vec", align 16
     %0 = fpext <16 x bfloat> %"v::Vec.data_ptr.unbox" to <16 x float>
     %1 = fmul <16 x float> %0, %0
     %2 = fptrunc <16 x float> %1 to <16 x bfloat>
; └└└
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:483 within `sum`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:858 within `reduce_fadd`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:874 within `macro expansion`
     %res.i = call reassoc bfloat @llvm.vector.reduce.fadd.v16bf16(bfloat 0xR0000, <16 x bfloat> %2)
     ret bfloat %res.i
; └└└
}

I guess that needs some solution at the LLVM level so that we don't need these demotions (i.e., llvm/llvm-project#97975). In the meantime, it would be possible to add explicit calls to e.g. the @llvm.x86.avx512bf16.dpbf16ps intrinsic, but that's probably for another PR.
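
For illustration, such an explicit call could look roughly like the sketch below. This is untested and makes several assumptions: the bfloat-typed signature that recent LLVM versions use for this intrinsic, AVX512BF16 hardware, and that BFloat16s' BFloat16 lowers to LLVM's bfloat (as with Core.BFloat16 on Julia 1.11+):

using BFloat16s

const VE = Base.VecElement

# Hypothetical wrapper for the 128-bit variant of the intrinsic: it
# accumulates pairwise BFloat16 products into a <4 x float> accumulator.
function dpbf16ps(acc::NTuple{4,VE{Float32}},
                  a::NTuple{8,VE{BFloat16}},
                  b::NTuple{8,VE{BFloat16}})
    Base.llvmcall(
        ("declare <4 x float> @llvm.x86.avx512bf16.dpbf16ps.128(<4 x float>, <8 x bfloat>, <8 x bfloat>)",
         """
         %r = call <4 x float> @llvm.x86.avx512bf16.dpbf16ps.128(<4 x float> %0, <8 x bfloat> %1, <8 x bfloat> %2)
         ret <4 x float> %r
         """),
        NTuple{4,VE{Float32}},
        Tuple{NTuple{4,VE{Float32}},NTuple{8,VE{BFloat16}},NTuple{8,VE{BFloat16}}},
        acc, a, b)
end

A SIMD.jl-level wrapper would then only need to unwrap and rewrap the underlying tuples of the Vec arguments.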

KristofferC (Collaborator, Author) commented

> In the meantime, it would be possible to add explicit calls to e.g. the @llvm.x86.avx512bf16.dpbf16ps intrinsic, but that's probably for another PR.

So far, it has worked quite well here to map only to the "generic" LLVM intrinsics and let LLVM choose the actual native instruction to emit. That keeps the package's scope reasonably limited compared to supporting n_architectures * n_instructions, and it also means that code written with SIMD.jl is not platform- or vector-size-specific.
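
For example, a dot product written this way is the same source for every element type and width (hypothetical helper):

julia> dot(x, y) = sum(x * y)
dot (generic function with 1 method)

julia> dot(Vec(ntuple(Float64, Val(4))), Vec(ntuple(Float64, Val(4))))
30.0

and the identical dot also accepts the Vec{8,BFloat16} values from earlier in the thread, with no source changes.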

maleadt (Contributor) commented Aug 8, 2024

The "problem" is that LLVM doesn't have a generic dot product intrinsic, so for any matching of dot product-like instructions to vdpbf16ps to work we probably need to get rid of the demotions in between. Which is something that needs to be fixed upstream.

Anyway, this isn't terribly important right now. It's a good start that we already match the conversion intrinsics without more invasive changes. And this is only an x86 problem; on some ARM platforms we shouldn't be demoting at all (JuliaLang/julia#55417).

KristofferC (Collaborator, Author) commented

> And this is only an x86 problem; on some ARM platforms we shouldn't be demoting at all (JuliaLang/julia#55417).

It would be interesting to see how this PR performs on such a system.

maleadt (Contributor) commented Aug 21, 2024

Alright, with JuliaLang/julia#55486 the sum(v*v) dot product example from above yields much cleaner LLVM code:

julia> @code_llvm f(v)
; Function Signature: f(SIMD.Vec{8, Core.BFloat16})
;  @ REPL[4]:1 within `f`
define bfloat @julia_f_6882(ptr nocapture noundef nonnull readonly align 16 dereferenceable(16) %"v::Vec") #0 {
top:
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:257 within `*`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221 within `fmul` @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:230 within `macro expansion`
     %"v::Vec.data_ptr.unbox" = load <8 x bfloat>, ptr %"v::Vec", align 16
     %0 = fmul <8 x bfloat> %"v::Vec.data_ptr.unbox", %"v::Vec.data_ptr.unbox"
; └└└
; ┌ @ /home/sdp/SIMD/src/simdvec.jl:483 within `sum`
; │┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:858 within `reduce_fadd`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:874 within `macro expansion`
     %res.i = call reassoc bfloat @llvm.vector.reduce.fadd.v8bf16(bfloat 0xR0000, <8 x bfloat> %0)
     ret bfloat %res.i
; └└└
}

On LLVM 18, as used by the current master branch of Julia, that still isn't enough to match AVX512BF16 operations other than the conversions to and from single precision, though:

julia> @code_native f(v)
	.text
	.file	"f"
	.section	.ltext,"axl",@progbits
	.globl	julia_f_6889                    # -- Begin function julia_f_6889
	.p2align	4, 0x90
	.type	julia_f_6889,@function
julia_f_6889:                           # @julia_f_6889
; Function Signature: f(SIMD.Vec{8, Core.BFloat16})
; ┌ @ REPL[4]:1 within `f`
# %bb.0:                                # %top
	#DEBUG_VALUE: f:v <- [$rdi+0]
	push	rbp
	mov	rbp, rsp
; │┌ @ /home/sdp/SIMD/src/simdvec.jl:257 within `*`
; ││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221 within `fmul` @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:221
; │││┌ @ /home/sdp/SIMD/src/LLVM_intrinsics.jl:230 within `macro expansion`
	vmovdqa	xmm0, xmmword ptr [rdi]
	vpextrw	eax, xmm0, 4
	shl	eax, 16
	vmovd	xmm1, eax
	vmulss	xmm1, xmm1, xmm1
	vcvtneps2bf16	xmm1, xmm1
	vpextrw	eax, xmm0, 6
	shl	eax, 16
	vmovd	xmm2, eax
	vmulss	xmm2, xmm2, xmm2
...

LLVM trunk looks better, but still uses single precision: https://godbolt.org/z/obhd33MKc

I guess this is as much as we can do in Julia though; fmul + @llvm.vector.reduce.fadd is pretty idiomatic and probably something that should be matched to VDPBF16PS?

KristofferC (Collaborator, Author) commented

> fmul + @llvm.vector.reduce.fadd is pretty idiomatic and probably something that should be matched to VDPBF16PS?

Hopefully, yes.
