Refactor package into one part dealing with LLVM and one part that builds a Vec on top of that #63
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master      #63      +/-   ##
==========================================
+ Coverage   86.02%   88.61%   +2.58%
==========================================
  Files           1        4       +3
  Lines         866      404     -462
==========================================
- Hits          745      358     -387
+ Misses        121       46      -75
```

Continue to review full report at Codecov.
This is fantastic. I started a similar refactor twice, and those stalled out for lack of time!

Made some comments inline; the pattern `@eval @generated` can just be replaced with `@eval`, with the spliced-in piece being hoisted.

I'm unsure about this, but I feel the burden would be on me to provide the test cases where …
Does this mean that Julia 1.4 will be required for these changes? I know many people who still use Julia 1.2, and in particular, if there is also an API change, being backward compatible across multiple Julia versions would be a benefit.
I forgot the most important comment: thanks for the work! I agree that this is the right direction.
The only thing that requires 1.4 is the new horizontal reduction intrinsics, I think. So if we conditionally use the old hand-rolled one, it should be possible to support pre-1.4. On the other hand, it isn't as if the package was getting much development (the last functional change was about a year ago), so those who use it on older Julia versions are probably just happy to keep using it as is and not too keen on upgrading packages at all.
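As a rough sketch of what such a conditional fallback could look like (all names here are hypothetical, not SIMD.jl's actual code — on 1.4+ the `@static` branch would call the LLVM reduction intrinsic via `llvmcall`; the hand-rolled reducer stands in here to keep the sketch self-contained):

```julia
# Pairwise ("hand-rolled") horizontal sum, usable on Julia versions that
# lack the llvm.experimental.vector.reduce.* intrinsics (pre-1.4).
function reduce_add(v::NTuple{N,T}) where {N,T}
    N == 1 && return v[1]
    h = N ÷ 2
    lo = ntuple(i -> v[i], h)
    hi = ntuple(i -> v[h + i], N - h)
    return reduce_add(lo) + reduce_add(hi)
end

# Version gate: pick the implementation at load time.
@static if VERSION >= v"1.4"
    vsum(v::NTuple) = reduce_add(v)  # placeholder for the intrinsic path
else
    vsum(v::NTuple) = reduce_add(v)
end
```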
Regarding the indexing API, what about …? That's the API I settled on for LoopVectorization's code generation, since it means I can generate the same code largely independent of the broader context (when it comes to loads/stores) and have multiple dispatch make it do the right thing. Perhaps that is less desirable for code people are actively writing, because they will be more intentional about what they're doing; they're likely to know whether an index will be a scalar, …
@KristofferC In principle yes, but given that Julia 1.4 isn't released yet, that would be a bit harsh. It would be nice (if it isn't too much effort) to give people a bit of time to upgrade (in the sense that Julia and SIMD can be upgraded separately).
@chriselrod I modelled the …

I need to fix the "codegen" for e.g. …, and in some cases methods that are defined on … Should the … just be considered an …?
In SIMD code, changing the vector width is often expensive. Some systems therefore have different bool sizes, such as …

I don't know whether or how we can make LLVM generate efficient code here. I think the best would be to represent booleans as … I recommend against automatic conversion to … We could define a type …
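A minimal sketch of the kind of representation being discussed, in plain tuples — booleans stored at full lane width, all-bits-set for true (x86 `sext` style), so compares and blends stay at the same vector width. All names (`truemask`, `tomask`, `vblend`) are hypothetical illustrations, not SIMD.jl API:

```julia
# All-ones lane for "true", e.g. 0xff for UInt8.
truemask(::Type{T}) where {T<:Integer} = ~zero(T)

# Widen a tuple of Bools into full-width mask lanes.
tomask(b::NTuple{N,Bool}, ::Type{T}) where {N,T<:Integer} =
    ntuple(i -> b[i] ? truemask(T) : zero(T), N)

# Blend: pick lanes of x where the mask is set, else lanes of y,
# using only bitwise ops (no per-lane branch).
vblend(m::NTuple{N,T}, x::NTuple{N,T}, y::NTuple{N,T}) where {N,T} =
    ntuple(i -> (x[i] & m[i]) | (y[i] & ~m[i]), N)
```

The bitwise blend is the payoff of the all-bits convention: selection never leaves the vector domain.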
Yeah, agreed. FWIW, I ran the tests on https://github.com/KristofferC/Tensors.jl (which uses SIMD.jl) and they passed. If anyone else is actually using SIMD.jl in their code, it would be nice if you could run your tests with this PR to see that it is non-breaking.
This PR pretty much rewrites the package from scratch (with the exception of some of the indexing implemented by tkf) while keeping the API intact. The reason for this is that I felt that the code could gain a lot of clarity by clearly separating the parts that deal with LLVM/`llvmcall` and then building a `Vec` on top of that. The number of lines of code has also been reduced from ~1600 to ~1000.

The code is structured as follows:

- `LLVM_Intrinsics.jl` is pretty much a direct mapping of Julia vectors (`NTuple{N, VecElement{T}}`) to the operators and intrinsics defined in https://llvm.org/docs/LangRef.html. It contains almost no higher-level logic.
- `simdvec.jl` contains the `Vec` (wrapping the tuple of `VecElement`s) with definitions defined on it that map to the intrinsics defined in `LLVM.jl`. In some cases this is pretty automatic, but some cases require some logic (like in the bitshifts, partly to avoid undefined behavior, or in the different conversions).
- `arrayops.jl` is the stuff that deals with Julia `Array`s, like `vload`, `vstore`, `vgather`.

Things that have been added to the API:

- The `count_ones`, `count_zeros`, `leading_ones`, `leading_zeros`, `trailing_ones`, `trailing_zeros` family of functions.
- Type conversions and different kinds of reinterprets from scalar to vector and back, and between vectors of different size:

```jl
julia> v = Vec((Int32(2), Int32(4)))
<2 x Int32>[2, 4]

julia> reinterpret(Int64, v)
17179869186

julia> reinterpret(Vec{4, Int16}, v)
<4 x Int16>[2, 0, 4, 0]

julia> reinterpret(Vec{2, Int32}, 4)
<2 x Int32>[4, 0]

julia> convert(Vec{2, Float32}, v)
<2 x Float32>[2.0, 4.0]
```

- Uses the LLVM vector reduction intrinsics (https://llvm.org/docs/LangRef.html#experimental-vector-reduction-intrinsics) instead of a hand-rolled reducer.

Things that have been removed from the API:

- Removed the `Val` arguments from many functions (`setindex`, `>>`, etc.). Julia's constant propagation plus LLVM's optimizations are enough for these not to be needed; things are specialized on the constant just as well as if using `Val`.
- Removed the `Val{}` arguments and just use `Val()` consistently everywhere.
- Removed `exp10`. This used to just call `10^v`, but the reason you would use `exp10` is that there is a more efficient implementation for it than the naive one. I feel that providing `exp10` gives the false impression that it provides a benefit over the naive version.

Co-Authored-By: Valentin Churavy <[email protected]>
fixup: fix supported element types
From my point of view, it is almost ready to be merged. The last feature I added runs into a bug on x86 (https://ci.appveyor.com/project/eschnett/simd-jl/builds/30989111/job/p0g8uhmkamy7ut8n#L56), which is similar to JuliaLang/julia#29447. I think Julia needs to ship compiler-rt for it to work, so I will just disable that feature (the overflow intrinsics) on x86. This PR requires Julia 1.4 (not 1.5), but the overflow intrinsics are only available on 1.5. Regarding wanting to support previous Julia versions, one often splits out a …
The error we get is "LLVM ERROR: Symbols not found: { __mulodi4 }", which seems like it would require compiler-rt support.
But before we tag, I want to go through the existing issues and see which ones are no longer relevant or which ones can be easily fixed.
From my point of view, this could be merged whenever. I still have a few tests to add (like making sure there are no spurious bounds checks inside …).
Just read through it again. LGTM
Super cool!
```jl
###########
# Bitcast #
###########
```
We also need a trunc-and-bitcast instruction to represent `vpmovmskb`. That would take `<n x i8>`, truncate it to `<n x i1>`, zero-pad it, and cast it to e.g. `i32` (or avx2). This can AFAIK not be reproduced with the other intrinsics here.

Likewise, we probably want an operation that takes `<n x i8>`, truncates to `<n x i1>`, and `sext`s to `<n x i8>`, possibly extending even more. This is in order to reduce the mismatch between Julia/LLVM and x86 semantics (x86 tends to set all the bits, i.e. `sext`; Julia has no representation of `<n x i1>`, and Julia's `Bool` uses `zext`-style `<n x i8>`).
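As a scalar model of what these two proposed operations compute (the names `movemask` and `sextmask` are hypothetical, not SIMD.jl or LLVM API), written on plain tuples:

```julia
# vpmovmskb-style "movemask": keep only the low bit of each byte lane
# (the trunc to i1) and pack those bits into an integer.
movemask(v::NTuple{N,UInt8}) where {N} =
    foldl((acc, i) -> acc | (Int(v[i] & 0x01) << (i - 1)), 1:N; init = 0)

# The sext-style companion: broadcast the low bit back to all 8 bits,
# matching x86's all-bits-set convention for "true".
sextmask(v::NTuple{N,UInt8}) where {N} =
    ntuple(i -> (v[i] & 0x01) == 0x01 ? 0xff : 0x00, N)
```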
```jl
# See https://github.com/JuliaLang/julia/blob/7426625b5c07b0d93110293246089a259a0a677d/src/intrinsics.cpp#L1179-L1196
# Shifting with a value larger than the number of bits in the type is undefined behavior
# so set to zero in those cases.
@inline function shl_int(x::Vec{N, T1}, y::Vec{N, T2}) where {N, T1<:IntegerTypes, T2<:IntegerTypes}
```
I think it would be much better to use LLVM semantics and document that they differ from Julia semantics. People who use SIMD.jl care about speed and can deal with the fact that shifts by more than the width in bits behave weirdly.
Not "weirdly": it gives you a poison value, which gives you undefined behavior in many cases. I don't think we should expose undefined behavior so easily from the `Vec` type. You can always just call `Intrinsics.shl`?
We could also do the bitshifts modulo the size of the integer (which would match e.g. Rust). That would be cheaper, since you can just `and` it.
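A minimal sketch of that modulo-width convention (the name `shl_mod` is hypothetical; the lanewise method uses a plain tuple standing in for a `Vec`):

```julia
# Shift count taken modulo the bit width (Rust/OpenCL-style): one `and`
# on the count instead of a compare-and-select against the width.
shl_mod(x::T, y::T) where {T<:Unsigned} = x << (y & (8 * sizeof(T) - 1))

# Lanewise version on a tuple of lanes.
shl_mod(x::NTuple{N,T}, y::NTuple{N,T}) where {N,T<:Unsigned} =
    ntuple(i -> shl_mod(x[i], y[i]), N)
```

For `UInt8`, a count of 9 is masked to 1, so `shl_mod(0x01, 0x09)` shifts by one lane bit rather than zeroing the value as Julia's `<<` would.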
Yes, this is a good idea. OpenCL does the same.
What do we do about negative bitshifts?
Julia distinguishes between signed and unsigned bitshifts. For signed ones, it handles negative values as positive shifts in the opposite direction.
If we use the same function/operator name, we should use the same semantics. For full performance, people have to specify unsigned shift counts. If we think this is confusing, then we can reject signed shift counts.
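Julia's signed-count behavior can be modeled branchlessly; this mirrors what Base already does for integers (the name `jshl` is hypothetical):

```julia
# Julia-style signed shift count: a negative count shifts in the
# opposite direction. Both ifelse arms are safe to evaluate because
# Julia's << and >> already zero out-of-range counts.
jshl(x::T, y::Integer) where {T<:Unsigned} = ifelse(y >= 0, x << y, x >> (-y))
```

The cost relative to an unsigned count is exactly this extra select, which is why full performance needs unsigned shift counts.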
> If we use the same function/operator name, we should use the same semantics.

But in #63 (comment) we basically said that we would not keep the same semantics? In Julia, a shift larger than the bit size does not wrap the shift amount; it sets the value to zero, which to me is a very different semantics.
Right. I thought Julia was undefined here, and we'd only be tightening things. But that's LLVM's semantics, not Julia's.

Should we keep Julia's semantics, and add new functions (`shl`/`shr`) that reinterpret the count as unsigned and wrap around?

In many cases, the shift count will be known at compile time, and in that case things will be efficient anyway.
So, I did some more reading on https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&cats=Shift

For some hare-brained reason, `_mm256_slli_epi64` et al. have Julia semantics, while non-vectorized shifts ignore the upper bits. LLVM often fails to properly optimize code that enforces Julia/AVX semantics; but if we emit x86_64 semantics, we have a mismatch with the AVX instructions we actually want. ARM does something else, because why not.

I see your point that the poison-value variant is slightly too poisonous for the default. So I guess we should document that the poisonous `shl` is preferable to `<<` when known safe, and consider it our offering to ill-thought-out LLVM semantics.
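A tiny REPL-style illustration of the two scalar conventions disagreeing at the edge (Julia zeroes out-of-range shifts; x86 scalar shifts instead mask the count to the low bits):

```julia
# Julia semantics: a 64-bit shift of a UInt64 is out of range, result is 0.
julia_style = UInt64(1) << 64

# x86-scalar-style semantics: the count is masked to its low 6 bits,
# so 64 & 63 == 0 and the value is unchanged.
x86_scalar_style = UInt64(1) << (64 & 63)
```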
I hereby volunteer you to write the doc string for these functions.
I added `@fastmath` support:

```jl
julia> f(a,b,c,d) = @fastmath a * b + c - d;

julia> v = Vec(1.0, 2.0, 3.0, 4.0);

julia> f(v, 1.0, v, 2.0)
<4 x Float64>[0.0, 2.0, 4.0, 6.0]

julia> @code_llvm debuginfo=:none f(v, 1.0, v, 2.0)
define void @julia_f_17933([1 x <4 x double>]* noalias nocapture sret, [1 x <4 x double>]
...
  %5 = insertelement <4 x double> undef, double %2, i32 0
  %res.i = shufflevector <4 x double> %5, <4 x double> undef, <4 x i32> zeroinitializer
  %6 = getelementptr inbounds [1 x <4 x double>], [1 x <4 x double>] addrspace(11)* %1, i64 0, i64 0
  %7 = load <4 x double>, <4 x double> addrspace(11)* %6, align 16
  %8 = fmul fast <4 x double> %7, %res.i
  %9 = getelementptr inbounds [1 x <4 x double>], [1 x <4 x double>] addrspace(11)* %3, i64 0, i64 0
  %10 = load <4 x double>, <4 x double> addrspace(11)* %9, align 16
  %11 = insertelement <4 x double> undef, double %4, i32 0
  %12 = fsub fast <4 x double> <double -0.000000e+00, double undef, double undef, double undef>, %11
  %res.i1.neg = shufflevector <4 x double> %12, <4 x double> undef, <4 x i32> zeroinitializer
  %13 = fadd fast <4 x double> %8, %res.i1.neg
  %14 = fadd fast <4 x double> %13, %10
  %15 = getelementptr inbounds [1 x <4 x double>], [1 x <4 x double>]* %0, i64 0, i64 0
  store <4 x double> %14, <4 x double>* %15, align 32
  ret void
}

julia> @code_native debuginfo=:none f(v, 1.0, v, 2.0)
	.section	__TEXT,__text,regular,pure_instructions
	movq	%rdi, %rax
	vbroadcastsd	%xmm0, %ymm0
	vbroadcastsd	%xmm1, %ymm1
	vfmsub231pd	(%rsi), %ymm0, %ymm1  ## ymm1 = (ymm0 * mem) - ymm1
	vaddpd	(%rdx), %ymm1, %ymm0
	vmovapd	%ymm0, (%rdi)
	vzeroupper
	retq
	nop

julia> g(a,b,c,d) = a * b + c - d; # no @fastmath

julia> @code_native debuginfo=:none g(v, 1.0, v, 2.0)
	.section	__TEXT,__text,regular,pure_instructions
	movq	%rdi, %rax
	vbroadcastsd	%xmm0, %ymm0
	vmulpd	(%rsi), %ymm0, %ymm0
	vaddpd	(%rdx), %ymm0, %ymm0
	vbroadcastsd	%xmm1, %ymm1
	vsubpd	%ymm1, %ymm0, %ymm0
	vmovapd	%ymm0, (%rdi)
	vzeroupper
	retq
	nopw	%cs:(%rax,%rax)
	nopl	(%rax,%rax)
```
Added the fastmath commit to this PR.
I just noticed that … It calls … Does it make sense to use … when it's …? Alternatively, if …
Added some docs for the …
1.4 is now released. Any thoughts on how to progress here, @eschnett?
@KristofferC In general, once your pull request is ready, it should be applied. Are you referring to a particular question or suggestion (shifts, alignment, ...)?
This PR pretty much rewrites the package from scratch (with the exception of the indexing implemented by @tkf) while keeping the API intact. The reason for this is that I felt that the code could gain a lot of clarity by clearly separating the parts that deal with LLVM/`llvmcall` and then building a `Vec` on top of that. The number of lines of code has also been reduced from ~1600 to ~1000, giving some support to this claim.

The code is structured as follows:

- `LLVM_intrinsics.jl` is pretty much a direct mapping of Julia vectors (`NTuple{N, VecElement{T}}`) to the operators and intrinsics defined in https://llvm.org/docs/LangRef.html. It contains almost no higher-level logic.
- `simdvec.jl` contains the `Vec` (wrapping the tuple of `VecElement`s) with definitions defined on it that map to the intrinsics defined in `LLVM.jl`. In some cases this is pretty automatic, but some cases require some logic (like in the bitshifts, partly to avoid undefined behavior, or in the different conversions).
- `arrayops.jl` is the stuff that deals with Julia `Array`s, like `vload`, `vstore`, `vgather`.

Things that have been added to the API:

- The `count_ones`, `count_zeros`, `leading_ones`, `leading_zeros`, `trailing_ones`, `trailing_zeros` family of functions.
- Type conversions and different kinds of reinterprets from scalar to vector and back, and between vectors of different size:

```jl
julia> v = Vec((Int32(2), Int32(4)))
<2 x Int32>[2, 4]

julia> reinterpret(Int64, v)
17179869186

julia> reinterpret(Vec{4, Int16}, v)
<4 x Int16>[2, 0, 4, 0]

julia> reinterpret(Vec{2, Int32}, 4)
<2 x Int32>[4, 0]

julia> convert(Vec{2, Float32}, v)
<2 x Float32>[2.0, 4.0]
```

Things that have been removed from the API:

- Removed the `Val` arguments from many functions (`setindex`, `>>`, etc.). Julia's constant propagation plus LLVM's optimizations are enough for these not to be needed; things are specialized on the constant just as well as if using `Val`.
- Removed the `Val{}` arguments and just use `Val()` consistently everywhere.
- Removed `exp10`. This used to just call `10^v`, but the reason you would use `exp10` is that there is a more efficient implementation for it than the naive one. I feel that providing `exp10` gives a false impression that it provides a benefit over the naive version.
- Removed `all` on `Vec` of `Int`. There is no such correspondence for Julia numbers (`all` should operate on `Bool`s).

For the future, we should also think a bit about how we could allow one to hook into the fast-math flags defined in https://llvm.org/docs/LangRef.html#fast-math-flags. I guess we could try to hook into the functionality provided by `@fastmath`.

I think a weak spot right now is all the different indexing. The combination of being able to use `Vec{N, T}` or `VecRange{N}` as the first argument, as well as the combination of alignments and non-temporal settings, creates a huge number of method combinations. In SIMD.jl, many hundreds of lines are just defining similar methods with different orders of arguments and default values. Somehow the abstraction doesn't feel right here. I would want to at least make `VecRange{N}(i)` the default way to index, because it feels unnecessary to have to pass the `T` in `Vec{N,T}` when it doesn't add any information.

Also, the `Aligned` flag sets the alignment to `N * sizeof(T)`. Can that really be right?

Tagging some people that might be interested / can review: @tkf, @eschnett, @nlw0, @vchuravy, @chethega, @chriselrod

Fixes #65
Fixes #54
Fixes #51
Fixes #20
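To illustrate why a `VecRange`-style index needs no element type: the lane count `N` can live in the index type alone, with `T` inferred from the array. This is a hedged sketch on plain `Vector`s and tuples — `VRange` is a hypothetical stand-in, not SIMD.jl's actual `VecRange`:

```julia
# A contiguous block of N lanes starting at index i; T is never mentioned.
struct VRange{N}
    i::Int
end

# Load N lanes as a tuple; the element type comes from the array.
Base.getindex(a::AbstractVector, r::VRange{N}) where {N} =
    ntuple(j -> a[r.i + j - 1], N)

# Store N lanes back.
function Base.setindex!(a::AbstractVector, v::NTuple{N}, r::VRange{N}) where {N}
    for j in 1:N
        a[r.i + j - 1] = v[j]
    end
    return a
end
```

With this shape, `a[VRange{4}(2)]` reads lanes 2:5 and multiple dispatch picks the lane width from the index and the element type from the array, which is the consolidation argued for above.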