
Refactor package into one part dealing with LLVM and one part that builds a Vec on top of that #63

Merged
KristofferC merged 20 commits into eschnett:master from kc/rewrite on Mar 31, 2020

Conversation

KristofferC
Collaborator

@KristofferC KristofferC commented Feb 12, 2020

This PR pretty much rewrites the package from scratch (with the exception of the indexing implemented by @tkf) while keeping the API intact. The reason for this is that I felt the code could gain a lot of clarity by clearly separating the part that deals with LLVM/llvmcall from the part that builds a SIMD Vec on top of it. The number of lines of code has also been reduced from ~1600 to ~1000, giving some support to this claim.


The code is structured as follows:

  • LLVM_intrinsics.jl is pretty much a direct mapping of Julia vectors (NTuple{N, VecElement{T}}) to the operators and intrinsics defined in https://llvm.org/docs/LangRef.html. It contains almost no higher-level logic.
  • simdvec.jl contains the Vec type (wrapping the tuple of VecElements) with definitions that map to the intrinsics defined in LLVM_intrinsics.jl. In some cases this is pretty automatic, but in others it requires some logic (as in the bitshifts, partly to avoid undefined behavior, or in the various conversions).
  • arrayops.jl contains the functionality that deals with Julia Arrays, like vload, vstore, and vgather.
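To illustrate the first layer, here is a minimal standalone sketch (not the package's actual code; the name fadd4 and the alias Vec4F are made up for this example) of how a single LLVM instruction can be wrapped over NTuple{N, VecElement{T}} with llvmcall:

```julia
# Hypothetical sketch of the LLVM_intrinsics.jl approach: map a vector
# operation directly onto one LLVM instruction via llvmcall.
const Vec4F = NTuple{4, VecElement{Float32}}

# Arguments are %0 and %1; the implicit entry block takes %2, so the
# first instruction result is %3.
@inline fadd4(x::Vec4F, y::Vec4F) = Base.llvmcall(
    """
    %3 = fadd <4 x float> %0, %1
    ret <4 x float> %3
    """,
    Vec4F, Tuple{Vec4F, Vec4F}, x, y)

a = ntuple(i -> VecElement(Float32(i)), 4)
fadd4(a, a)  # each lane doubled
```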

Things that have been added to the API:

  • The count_ones, count_zeros, leading_ones, leading_zeros, trailing_ones, trailing_zeros family of functions.

  • Type conversions and different kinds of reinterprets from scalars to vectors and back, and between vectors of different sizes:

julia> v = Vec((Int32(2), Int32(4)))
<2 x Int32>[2, 4]

julia> reinterpret(Int64, v)
17179869186

julia> reinterpret(Vec{4, Int16}, v)
<4 x Int16>[2, 0, 4, 0]

julia> reinterpret(Vec{2, Int32}, 4)
<2 x Int32>[4, 0]

julia> convert(Vec{2, Float32}, v)
<2 x Float32>[2.0, 4.0]

Things that have been removed from the API:

  • Removed the Val arguments from many functions (setindex, >>, etc.). Julia's constant propagation plus LLVM's optimizations are enough for these not to be needed: things are specialized on the constant just as well as if Val were used.

  • Replaced the Val{} arguments with Val(), used consistently everywhere.

  • Removed exp10. This used to just call 10^v, but the reason you would use exp10 is that there is a more efficient implementation than the naive one. Providing exp10 gives the false impression that it offers a benefit over the naive version.

  • Removed all on Vecs of Int. There is no corresponding method for Julia numbers (all should operate on Bools).


For the future, we should also think a bit about how to let users hook into the fast-math flags defined in https://llvm.org/docs/LangRef.html#fast-math-flags. I guess we could try to hook into the functionality provided by @fastmath.

I think a weak spot right now is all the different indexing. The combination of being able to use Vec{N, T} or VecRange{N} as the first argument, together with the alignment and non-temporal settings, creates a huge number of method combinations. In SIMD.jl, many hundreds of lines just define similar methods with different argument orders and default values. Somehow the abstraction doesn't feel right here. I would at least want to make VecRange{N}(i) the default way to index, because it feels unnecessary to have to pass the T in Vec{N,T} when it doesn't add any information.

Also, the Aligned flag sets the alignment to N * sizeof(T). Can that really be right?


Tagging some people that might be interested / can review. @tkf, @eschnett , @nlw0, @vchuravy, @chethega, @chriselrod

Fixes #65
Fixes #54
Fixes #51
Fixes #20

@KristofferC KristofferC changed the title Refactor package into a one part dealing with LLVM and one part that builds on top of that Refactor package into a one part dealing with LLVM and one part that builds a Vec on top of that Feb 12, 2020
@KristofferC KristofferC force-pushed the kc/rewrite branch 2 times, most recently from 6c4876a to e9bc504 Compare February 12, 2020 13:18
@codecov-io

codecov-io commented Feb 12, 2020

Codecov Report

Merging #63 into master will increase coverage by 2.58%.
The diff coverage is 88.58%.


@@            Coverage Diff             @@
##           master      #63      +/-   ##
==========================================
+ Coverage   86.02%   88.61%   +2.58%     
==========================================
  Files           1        4       +3     
  Lines         866      404     -462     
==========================================
- Hits          745      358     -387     
+ Misses        121       46      -75
Impacted Files Coverage Δ
src/SIMD.jl 100% <ø> (+13.97%) ⬆️
src/simdvec.jl 77.01% <77.01%> (ø)
src/arrayops.jl 95.91% <95.91%> (ø)
src/LLVM_intrinsics.jl 96.52% <96.52%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a9fc0a...e8f5815.

@KristofferC KristofferC changed the title Refactor package into a one part dealing with LLVM and one part that builds a Vec on top of that Refactor package into one part dealing with LLVM and one part that builds a Vec on top of that Feb 12, 2020
Collaborator

@vchuravy vchuravy left a comment


This is fantastic. I started a similar refactor twice, and both attempts stalled for lack of time!

I made some comments inline; the pattern @eval @generated can just be replaced with @eval, with the spliced-in piece hoisted out.

@eschnett
Owner

Things that have been removed from the API:

Removed the Val arguments from many functions (setindex, >> etc). Julia's constant propagation + LLVM's optimization is enough for these not to be needed. Things are specialized on the constant just as well as if using Val.

I'm unsure about this, but I feel the burden would be on me to provide the test cases where Val provides a benefit.

Removed the Val{} arguments and just use Val() consistently everywhere.

The Val{} arguments were a design error and should now indeed be removed; I kept them for compatibility.

Removed exp10. This used to just call 10^v but the reason you would use exp10 is that there is a more efficient implementation for it than the naive one. I feel that providing exp10 gives a false impression that it provides a benefit over the naive version.

exp10 should be implemented similarly to exp2, in a way that is much more efficient than 10^; there is a large performance difference. A simple implementation would be something like exp10(x) = exp2(log2(10) * x). exp2 is much faster than pow in libc.
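A scalar sketch of that suggestion (the name fast_exp10 is hypothetical, not SIMD.jl code), using the identity 10^x = 2^(x·log2(10)):

```julia
# exp10 built on exp2, which is much cheaper than a generic pow:
# 10^x == 2^(x * log2(10)).
fast_exp10(x) = exp2(log2(10) * x)

fast_exp10(2.0)  # ≈ 100.0, up to floating-point rounding
```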

@eschnett
Owner

Does this mean that Julia 1.4 will be required for these changes? I know many people who still use Julia 1.2, and in particular if there is also an API change, being backward compatible across multiple Julia versions would be a benefit.

@eschnett
Owner

I forgot the most important comment: Thanks for the work! I agree that this is the right direction.

@KristofferC
Collaborator Author

Does this mean that Julia 1.4 will be required for these changes? I know many people who still use Julia 1.2, and in particular if there is also an API change, being backward compatible across multiple Julia versions would be a benefit.

The only thing that requires 1.4 is the new horizontal reduction intrinsics, I think. So if we conditionally fall back to the old hand-rolled ones, it should be possible to support pre-1.4.

On the other hand, it isn't like the package was getting much development (the last functional change was ~a year ago), so those who use it on older Julia versions are probably happy to keep using it as is and not too keen on upgrading packages at all.

@chriselrod
Contributor

Regarding indexing API, what about vload/vstore methods that accept AbstractArray{T,N}s and N-length tuples of indices, using promotion to determine which method should be used?
If all indices are integers, it may be a scalar load.
If the AbstractArray is contiguous and column-major, a VecRange as the first index, with all following indices integers, would result in an LLVM load/store.
The presence of a VecRange in any other position within the index tuple, or the presence of a Vec{<:Any,<:Integer}, causes use of gather/scatter.

That's the API I settled on for LoopVectorization's code generation, since it means I can generate the same code largely independent of broader context (when it comes to loads/stores), and have multiple dispatch make it do the right thing.

Perhaps that is less desirable for code people are actively writing, because they will be more intentional about what they're doing; they're likely to know whether an index will be a scalar, VecRange, or a Vec.
But it may still help write more generic code in the future, where iterators may generate different sorts of indexes based on input arguments.
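A rough, dispatch-only sketch of that promotion idea (all names here are hypothetical — index_kind does not exist in SIMD.jl, and a real implementation would perform the actual loads/stores):

```julia
# Pick the kind of memory operation from the types in the index tuple.
struct VecRange{N}
    i::Int
end

index_kind(::Tuple{Vararg{Integer}}) = :scalar          # all-integer indices
index_kind(::Tuple{VecRange, Vararg{Integer}}) = :load  # contiguous vector load/store
index_kind(::Tuple) = :gather   # anything else (Vec indices, or a VecRange
                                # in a non-leading position): gather/scatter

index_kind((1, 2))               # :scalar
index_kind((VecRange{4}(1), 2))  # :load
index_kind((1, VecRange{4}(1)))  # :gather
```

Multiple dispatch then does the "promotion" for free: the most specific index-tuple signature wins.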

@eschnett
Owner

@KristofferC In principle yes, but given that Julia 1.4 isn't released yet, that would be a bit harsh. It would be nice (if it isn't too much effort) to give people a bit of time to upgrade (in the sense that Julia and SIMD can be upgraded separately).

@eschnett
Owner

@chriselrod I modelled the vload / vstore functions after OpenCL C. Since Julia has a much better type system than C, maybe there is room for improvement. Your suggestion sounds interesting.

@KristofferC
Collaborator Author

KristofferC commented Feb 13, 2020

I need to fix the "codegen" for e.g. Bool + Bool because I haven't yet special-cased that to do the trunc + zext dance. It is a bit unclear to me what the semantics for Bool should be. In Julia, Bools immediately promote when used in arithmetic:

julia> true - true
0

and in some cases methods that are defined on Int are not defined on Bool:

julia> count_zeros(true)
ERROR: MethodError: no method matching count_zeros(::Bool)

Should they just be considered an i1 and then have all the normal Int functions defined on them? Codegen for Bools seems to have some issues, though (JuliaLang/julia#21712).
Or should we try to match Julia as much as possible? Any implicit promotion in SIMD.jl feels pretty pointless, however.

@eschnett
Owner

In SIMD code, changing the vector width is often expensive. Some systems therefore have different bool sizes, such as Bool32 and Bool64. These might internally be represented very differently from regular bools; they might even live in floating-point registers. I think SSE2 uses floating-point values with all-zero bits and all-one bits, PPC uses floating-point values that compare >= 0, etc. Of course, more modern architectures (AVX512, CUDA) have special mask registers.

I don't know whether or how we can make LLVM generate efficient code here. I think the best would be to represent booleans as i1, and hope that LLVM knows about the efficient low-level SIMD representations.

I recommend against automatic conversion to Int, as Int would often have the wrong bit width and thus lead to inefficient code. There is basically no automatic type conversion in SIMD.jl. I would also keep a distinction between Bool and Int, so even if the internal representations of i1 and i8 sound similar, they don't have to be the same at a higher level. I don't see a point in defining -(::Bool, ::Bool) without automatic conversion and would omit it, and similarly for count_zeros etc.

We could define a type Int1 (and I know cases where a vectorized Int4 would be most convenient to have), but that's a different bag of chips.

@KristofferC
Collaborator Author

I recommend against automatic conversion to Int, as Int would often have the wrong bit-width and thus lead to inefficient code. There is basically no automatic type conversion in SIMD.jl. I would also keep a distinction between Bool and Int, so even if the internal representation of i1 and i8 sounds similar, it doesn't have to be the same at a higher level. I don't see a point in defining -(::Bool, ::Bool) without automatic conversion and would omit it, similar for count_zeros etc.

Yeah, agreed.

FWIW, I ran the tests on https://github.com/KristofferC/Tensors.jl (which uses SIMD.jl) and they passed. If anyone else is actually using SIMD.jl in their code, it would be nice if you could run with this PR to see that it is non-breaking.

KristofferC and others added 2 commits February 21, 2020 19:11
(The squashed commit message duplicates the PR description above, with one addition:)

- Uses the LLVM vector reduction intrinsics
  (https://llvm.org/docs/LangRef.html#experimental-vector-reduction-intrinsics)
  instead of a hand-rolled reducer.

Co-Authored-By: Valentin Churavy <[email protected]>
@KristofferC
Collaborator Author

From my point of view, it is almost ready to be merged. The last feature I added runs into a bug on x86 (https://ci.appveyor.com/project/eschnett/simd-jl/builds/30989111/job/p0g8uhmkamy7ut8n#L56) which is similar to JuliaLang/julia#29447. I think Julia needs to ship compiler-rt for it to work. I will just disable that feature (the overflow intrinsics) on x86.

This PR requires Julia 1.4 (not 1.5) but the overflow intrinsics are only available on 1.5.

Regarding wanting to support previous Julia versions, one often splits out a release-2.0 branch from the commit before the Julia version requirement got updated and if bugfixes are needed for the previous Julia version those are made against that branch. Releases for 2.x can then be made from that branch at will.

The error we get is "LLVM ERROR: Symbols not found: { __mulodi4 }", which seems
like it would require compiler-rt support.
@KristofferC
Collaborator Author

But before we tag, I want to go through the existing issues and see which ones are no longer relevant or see which ones can be easily fixed.

@KristofferC
Collaborator Author

From my point of view, this could be merged whenever. I still have a few tests to add (like making sure there are no spurious bounds checks inside @inbounds), but the PR is getting pretty big as it is.

Collaborator

@vchuravy vchuravy left a comment


Just read through it again. LGTM


@chethega chethega left a comment


Super cool!


###########
# Bitcast #
###########


We also need a trunc-and-bitcast instruction to represent vpmovmskb.

That would take <n x i8>, truncate it to <n x i1>, zero-pad it, and cast it to e.g. i32 (or wider for AVX2). AFAIK this cannot be reproduced with the other intrinsics here.

Likewise, we probably want an operation that takes <n x i8>, truncates to <n x i1>, and sexts back to <n x i8>, possibly extending even further. This is in order to reduce the mismatch between Julia/LLVM and x86 semantics (x86 tends to set all the bits, i.e. sext; Julia has no representation of <n x i1>, and Julia's Bool uses zext-style <n x i8>).

# See https://github.com/JuliaLang/julia/blob/7426625b5c07b0d93110293246089a259a0a677d/src/intrinsics.cpp#L1179-L1196
# Shifting with a value larger than the number of bits in the type is undefined behavior
# so set to zero in those cases.
@inline function shl_int(x::Vec{N, T1}, y::Vec{N, T2}) where {N, T1<:IntegerTypes, T2<:IntegerTypes}


I think it would be much better to use LLVM semantics and document that they differ from Julia semantics. People who use SIMD.jl care about speed and can deal with the fact that shifts by more than the bit width behave weirdly.

Collaborator Author


Not "weirdly": it gives you a poison value, which results in undefined behavior in many cases. I don't think we should expose undefined behavior this easily from the Vec type. You can always just call Intrinsics.shl?

Collaborator Author


We could also do the bitshifts modulo the bit size of the integer (which would match e.g. Rust). That would be cheaper, since you can just AND the shift count.
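A scalar sketch of this modulo-width option (the helper name shl_wrap is hypothetical), matching Rust's wrapping_shl; since the bit width is a power of two, the modulo is a single AND:

```julia
# Wrap the shift count modulo the bit width before shifting.
shl_wrap(x::T, y::Integer) where {T<:Integer} =
    x << (y & (8 * sizeof(T) - 1))

shl_wrap(UInt8(1), 9)  # count 9 & 7 == 1, so 0x02
UInt8(1) << 9          # Julia's native semantics zero the result instead
```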

Owner


Yes, this is a good idea. OpenCL does the same.

Collaborator Author


What do we do about negative bitshifts?

Owner


Julia distinguishes between signed and unsigned bitshifts. For signed ones, it handles negative values as positive shifts in the opposite direction.

If we use the same function/operator name, we should use the same semantics. For full performance, people have to specify unsigned shift counts. If we think this is confusing, then we can reject signed shift counts.
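For reference, Julia's scalar shift behavior being discussed here can be checked at the REPL:

```julia
# Julia's scalar shift semantics, for comparison:
# a negative signed count shifts in the opposite direction, and an
# over-wide count zeroes the result (or sign-fills for >>) rather
# than wrapping.
UInt8(0x30) >> -2   # 0xc0: >> by -2 behaves as << by 2
UInt8(0x01) << 9    # 0x00: counts >= the bit width do not wrap
```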

Collaborator Author

@KristofferC KristofferC Feb 28, 2020


If we use the same function/operator name, we should use the same semantics.

But in #63 (comment) we basically said that we would not keep the same semantics? In Julia, a shift count larger than the bit size does not wrap around but sets the value to zero, which to me is a very different semantics.

Owner


Right. I thought Julia was undefined here, and we'd only be tightening things. But that's LLVM's semantics, not Julia's.

Should we keep Julia's semantics, and add new functions (shl / shr) that re-interpret as unsigned and wrap around?

In many cases, the shift count will be known at compile time, and in this case, things will be efficient anyway.


So, I did some more reading on https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&cats=Shift

For some hare-brained reason, _mm256_slli_epi64 et al. have Julia semantics, while non-vectorized shifts ignore the upper bits. LLVM often fails to properly optimize code that enforces Julia/AVX semantics; but if we emit x86_64 semantics, we have a mismatch with the AVX instructions we actually want. ARM does something else, because why not.

I see your point that the poison-value variant is slightly too poisonous for the default. So I guess we should document that the poisonous shl is preferable to << when known safe, and consider it our offering to ill-thought-out LLVM semantics.

Owner


I hereby volunteer you to write the doc string for these functions.

@KristofferC
Collaborator Author

I added @fastmath support in KristofferC#1.

julia> f(a,b,c,d) = @fastmath a * b + c - d;

julia> v = Vec(1.0, 2.0, 3.0, 4.0);

julia> f(v, 1.0, v, 2.0)
<4 x Float64>[0.0, 2.0, 4.0, 6.0]

julia> @code_llvm debuginfo=:none f(v, 1.0, v, 2.0)
define void @julia_f_17933([1 x <4 x double>]* noalias nocapture sret, [1 x <4 x double>] 
...
  %5 = insertelement <4 x double> undef, double %2, i32 0
  %res.i = shufflevector <4 x double> %5, <4 x double> undef, <4 x i32> zeroinitializer
  %6 = getelementptr inbounds [1 x <4 x double>], [1 x <4 x double>] addrspace(11)* %1, i64 0, i64 0
  %7 = load <4 x double>, <4 x double> addrspace(11)* %6, align 16
  %8 = fmul fast <4 x double> %7, %res.i
  %9 = getelementptr inbounds [1 x <4 x double>], [1 x <4 x double>] addrspace(11)* %3, i64 0, i64 0
  %10 = load <4 x double>, <4 x double> addrspace(11)* %9, align 16
  %11 = insertelement <4 x double> undef, double %4, i32 0
  %12 = fsub fast <4 x double> <double -0.000000e+00, double undef, double undef, double undef>, %11
  %res.i1.neg = shufflevector <4 x double> %12, <4 x double> undef, <4 x i32> zeroinitializer
  %13 = fadd fast <4 x double> %8, %res.i1.neg
  %14 = fadd fast <4 x double> %13, %10
  %15 = getelementptr inbounds [1 x <4 x double>], [1 x <4 x double>]* %0, i64 0, i64 0
  store <4 x double> %14, <4 x double>* %15, align 32
  ret void
}
julia> @code_native debuginfo=:none f(v, 1.0, v, 2.0)
        .section        __TEXT,__text,regular,pure_instructions
        movq    %rdi, %rax
        vbroadcastsd    %xmm0, %ymm0
        vbroadcastsd    %xmm1, %ymm1
        vfmsub231pd     (%rsi), %ymm0, %ymm1 ## ymm1 = (ymm0 * mem) - ymm1
        vaddpd  (%rdx), %ymm1, %ymm0
        vmovapd %ymm0, (%rdi)
        vzeroupper
        retq
        nop

julia> g(a,b,c,d) =  a * b + c - d; # no @fastmath

julia> @code_native debuginfo=:none g(v, 1.0, v, 2.0)
        .section        __TEXT,__text,regular,pure_instructions
        movq    %rdi, %rax
        vbroadcastsd    %xmm0, %ymm0
        vmulpd  (%rsi), %ymm0, %ymm0
        vaddpd  (%rdx), %ymm0, %ymm0
        vbroadcastsd    %xmm1, %ymm1
        vsubpd  %ymm1, %ymm0, %ymm0
        vmovapd %ymm0, (%rdi)
        vzeroupper
        retq
        nopw    %cs:(%rax,%rax)
        nopl    (%rax,%rax)

@KristofferC
Collaborator Author

Added the fastmath commit to this PR.

@tkf
Contributor

tkf commented Mar 20, 2020

Also, the Aligned flag sets the alignment to N * sizeof(T). Can that really be right?

I just noticed that Threads.Atomic uses Base.gc_alignment. For example: https://github.com/JuliaLang/julia/blob/644753491de0d5d933c96e6f1f5876b3c7603cf4/base/atomics.jl#L347-L352

It calls jl_gc_alignment C API: https://github.com/JuliaLang/julia/blob/644753491de0d5d933c96e6f1f5876b3c7603cf4/src/julia_internal.h#L216-L236

Does it make sense to use Base.gc_alignment(T) when the alignment flag is false?

When it's true, it's basically SIMD.jl that defines what it means to be "aligned", so I suppose it's OK? (Or was the question asking whether it is "right" in terms of performance?)

Alternatively, if SIMD.Intrinsics is meant to be a direct API to LLVM IR, maybe it can use Val(integer)? The default argument can still use Base.gc_alignment(T).
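For concreteness, Base.gc_alignment (an internal API, so subject to change; values below assume a 64-bit build) can be compared against the N * sizeof(T) rule at the REPL:

```julia
# Base.gc_alignment caps at 16 bytes, well below N * sizeof(T)
# for wider vector types.
Base.gc_alignment(Int64)                           # 8
Base.gc_alignment(NTuple{4, VecElement{Float64}})  # 16, not 4 * 8 == 32
```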

@KristofferC KristofferC reopened this Mar 22, 2020
@KristofferC
Collaborator Author

Added some docs for the @fastmath integration.

@KristofferC
Collaborator Author

1.4 is now released. Any thoughts on how to progress here, @eschnett?

@eschnett
Owner

@KristofferC In general – once your pull request is ready, it should be applied. Are you referring to a particular question or suggestion (shifts, alignment, ...)?

@KristofferC KristofferC merged commit 8eb7d50 into eschnett:master Mar 31, 2020
@KristofferC KristofferC mentioned this pull request Jul 13, 2020