
change type parameter range for AbstractStridedPointer #49

Merged · 3 commits · Feb 25, 2021

Conversation

GiggleLiu
Contributor

This is for supporting different number types.

JuliaSIMD/LoopVectorization.jl#201

@chriselrod
Member

There is an ambiguity error and an error in the memory tests. The memory test error should be fixed on master.

Mind merging in master and trying to fix the ambiguity?

 vstore!(ptr::Union{Ptr, VectorizationBase.AbstractStridedPointer}, v::Number) in VectorizationBase at /home/runner/work/VectorizationBase.jl/VectorizationBase.jl/src/llvm_intrin/memory_addr.jl:850
vstore!(ptr::VectorizationBase.AbstractStridedPointer{T, N, C, B, R, X, O} where {N, C, B, R, X<:Tuple{Vararg{Any, N}}, O<:Tuple{Vararg{Any, N}}}, v::T) where T in VectorizationBase at /home/runner/work/VectorizationBase.jl/VectorizationBase.jl/src/strided_pointers/stridedpointers.jl:121

Previously, the second method was always more specific, as all `NativeTypes <: Number`, and `T <: NativeTypes` implicitly in the second definition.

So, perhaps try changing the first definition to just `vstore!(ptr::Union{Ptr, VectorizationBase.AbstractStridedPointer}, v)`?
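For reference, the same kind of ambiguity can be reproduced with plain dummy types; the names below (`AbsPtr`, `MyPtr`, `store!`, `store2!`) are hypothetical illustrations, not the real VectorizationBase definitions:

```julia
# Minimal standalone sketch of the ambiguity, with dummy types.
abstract type AbsPtr{T} end
struct MyPtr{T} <: AbsPtr{T} end

# Analogue of the first definition: constrains the *value* to Number.
store!(p::Union{Nothing,AbsPtr}, v::Number) = :union_method
# Analogue of the second: ties the pointer's eltype to the value's type.
store!(p::AbsPtr{T}, v::T) where {T} = :parametric_method

# For store!(MyPtr{Float64}(), 1.0), neither method is more specific:
# the first wins on the second argument (Number vs unconstrained T),
# the second wins on the first. Calling it throws an ambiguity MethodError.

# Loosening the first signature (drop ::Number) restores a clear ordering,
# since the parametric method is now more specific in both arguments:
store2!(p::Union{Nothing,AbsPtr}, v) = :union_method
store2!(p::AbsPtr{T}, v::T) where {T} = :parametric_method
```

With the loosened signature, `store2!(MyPtr{Float64}(), 1.0)` dispatches cleanly to the parametric method.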

@GiggleLiu
Contributor Author

Thanks for noticing, I just fixed it.

"Less Interface"ism

I notice that there are 57 methods called `vstore!`. Since I am a "less interface"-ist, I have to preach the benefits of keeping a smaller interface,

e.g.

  1. When I can curry a function

f(b) = x -> f(x, b)

I choose not to.

  2. When I can specify default parameters like this

f(x, b) = ...
f(x) = f(x, 1.0)

I choose not to.

The "less interface" rule does not apply when the interface decreases user-input redundancy.

e.g. when

f(x) = f(x, x^2)

is good, because people are less likely to make an error from repeated input.

Just sharing my coding aesthetics; I am by no means asking to change the current design, since it is already thoroughly tested.
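To make the three cases concrete, here is a small sketch with hypothetical functions `f`, `g`, and `h` (my own illustration, not code from this package):

```julia
# 1. Currying: callers could write f(b)(x) instead of f(x, b).
f(x, b) = x + b
f(b) = x -> f(x, b)   # the curried form I would choose *not* to add

# 2. Default argument: a second method that exists only to supply a default.
g(x, b) = x * b
g(x) = g(x, 1.0)      # the default-supplying method I would choose *not* to add

# 3. The exception: the extra method removes redundant user input, so a
#    caller cannot accidentally pass an inconsistent second argument.
h(x, y) = x + y
h(x) = h(x, x^2)      # here the convenience method is worth having
```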

@codecov-io

Codecov Report

Merging #49 (2800fa0) into master (c79e5d4) will decrease coverage by 1.45%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master      #49      +/-   ##
==========================================
- Coverage   60.88%   59.43%   -1.46%     
==========================================
  Files          31       31              
  Lines        3981     3976       -5     
==========================================
- Hits         2424     2363      -61     
- Misses       1557     1613      +56     
Impacted Files Coverage Δ
src/strided_pointers/stridedpointers.jl 26.92% <0.00%> (ø)
src/llvm_intrin/verf.jl 60.00% <0.00%> (-13.08%) ⬇️
src/promotion.jl 23.75% <0.00%> (-12.50%) ⬇️
src/llvm_intrin/masks.jl 58.36% <0.00%> (-6.83%) ⬇️
src/llvm_intrin/intrin_funcs.jl 54.63% <0.00%> (-6.02%) ⬇️
src/llvm_intrin/conversion.jl 63.20% <0.00%> (-4.00%) ⬇️
src/llvm_intrin/nonbroadcastingops.jl 97.46% <0.00%> (-1.27%) ⬇️
src/VectorizationBase.jl 67.83% <0.00%> (-0.70%) ⬇️
src/early_definitions.jl 63.15% <0.00%> (+0.65%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@chriselrod chriselrod merged commit 088edc8 into JuliaSIMD:master Feb 25, 2021
@chriselrod
Member

chriselrod commented Feb 25, 2021

The interface is much smaller than the number of methods.
Many of the methods come from the equivalent of:

f(x::Float64, b) = ...
f(x::Int, b) = ...
f(x::String, b) = ...
f(x::Tuple, b) = ...

That is, many of the methods do not represent a different API.
They're specializations for different input types, to make different inputs act the same way.
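A toy sketch of the distinction (hypothetical `describe` function, not part of VectorizationBase): four methods, but only one conceptual API.

```julia
# One conceptual operation, several dispatch specializations: the API is just
# `describe(x)`; the method count is an implementation detail.
describe(x::Float64) = "a 64-bit float"
describe(x::Int)     = "an integer"
describe(x::String)  = "a string"
describe(x::Tuple)   = join(describe.(x), ", ")
```

`length(methods(describe))` reports 4, yet from the caller's perspective there is a single interface: every specialization just makes a different input type behave the same way.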

There is also a processing pipeline from the surface API to the final code generation.
You should never specify the ::StaticInt{RS} parameter yourself, for example.
Every vstore! method where you see ::StaticInt{RS} as one of the arguments is an internal method that you shouldn't be manually calling.

So a change I probably should make is renaming all those vstore! and vloads (probably to _vstore! and _vload) to make that obvious.

For vload, the API is simply:

vload(p::AbstractStridedPointer, i::Tuple)
vload(p::AbstractStridedPointer, i::Tuple, m::Mask)
vload(p::AbstractStridedPointer, i::Unroll)
vload(p::AbstractStridedPointer, i::Unroll, m::Mask)

p is the stridedpointer you want to load from, i is the index, and m is an optional mask to avoid loading where the mask is off.
There are also vloada equivalents that assume the memory is aligned, but you should use vload in general. So there are 4, or maybe 8 variants if you want to count vloada. Yet we have 41 vload methods:

julia> vload
vload (generic function with 41 methods)

Example usage, and what all that internal processing is for:

julia> colors = [(R = rand(), G = rand(), B = rand()) for i in 1:100];

julia> colormat = reinterpret(reshape, Float64, colors)
3×100 reinterpret(reshape, Float64, ::Vector{NamedTuple{(:R, :G, :B), Tuple{Float64, Float64, Float64}}}) with eltype Float64:
 0.101462  0.566472  0.395725   0.156595  0.247472  0.58609   0.451675  0.329288  0.628678  0.0161756    0.457912  0.929442  0.209048  0.225449  0.570513    0.373047   0.0599544  0.487379  0.48549   0.219519    0.758767  0.349387  0.575183  0.0668234  0.204329  0.371181   0.197348  0.652758  0.630789
 0.919857  0.852129  0.0622284  0.479014  0.596842  0.600178  0.188863  0.411865  0.909364  0.000210617  0.254854  0.469283  0.552034  0.202351  0.948111     0.537816   0.65176    0.619699  0.520092  0.515346    0.791153  0.326027  0.104903  0.888952   0.459569  0.0474623  0.161321  0.59407   0.0856169
 0.980903  0.841869  0.525539   0.933114  0.766957  0.116385  0.167431  0.964176  0.636977  0.491324     0.584939  0.135019  0.58517   0.467258  0.654256     0.0479113  0.162333   0.363111  0.162873  0.00662285  0.204877  0.789972  0.536066  0.0866196  0.753488  0.279345   0.556696  0.625233  0.364667

julia> colors[1]
(R = 0.10146162439576822, G = 0.9198573051020194, B = 0.980902779143441)

julia> vload(stridedpointer(colormat), (2,3)) # load element [2,3]
0.0622284186540214

julia> vload(stridedpointer(colormat), (2,MM{8}(3))) # load elements colormat[2,3:10]
Vec{8, Float64}<0.0622284186540214, 0.47901368495584085, 0.596842316190509, 0.6001775663920466, 0.18886316876982878, 0.41186519614364814, 0.9093641942999946, 0.00021061724559756634>

julia> colormat[2,3:10]'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
 0.0622284  0.479014  0.596842  0.600178  0.188863  0.411865  0.909364  0.000210617

julia> vload(stridedpointer(colormat), (MM{4}(1),3), mask(Val(4), 3)) # load elements colormat[1:3,3]
Vec{4, Float64}<0.39572490568820506, 0.0622284186540214, 0.5255392424137211, 0.0>

julia> colormat[1:3,3]'
1×3 adjoint(::Vector{Float64}) with eltype Float64:
 0.395725  0.0622284  0.525539

julia> vload(stridedpointer(colormat), (MM{4}(1),MM{4}(3)), mask(Val(4), 3)) # load elements colormat[(1,3), (2,4), (3,5)]
Vec{4, Float64}<0.39572490568820506, 0.47901368495584085, 0.7669572786918826, 0.0>

julia> getindex.(Ref(colormat), 1:3, 3:5)'
1×3 adjoint(::Vector{Float64}) with eltype Float64:
 0.395725  0.479014  0.766957

MMs represent a vector of indices. Passing them as arguments basically means to take a slice. Adding a mask lets you avoid reading out of bounds.

The Unroll indices are for when we want to load multiple vectors at one time. The API is really awkward (I always have to check `?Unroll` myself) and I'm open to suggestions for improvement. Example:

julia> vload(stridedpointer(colormat), Unroll{1,1,3,2,8,zero(UInt),1}((1,1))) # axis 1 unrolled w/ step of 1, unrolled 3x, axis 2 vectorized, loading 8 elements, mask 0 of the loads, loaded vector elements are 1 apart
3 x Vec{8, Float64}
Vec{8, Float64}<0.10146162439576822, 0.5664720363768123, 0.39572490568820506, 0.1565950460890433, 0.24747228039940494, 0.586089973376654, 0.4516746290804887, 0.3292882151577483>
Vec{8, Float64}<0.9198573051020194, 0.852129002305116, 0.0622284186540214, 0.47901368495584085, 0.596842316190509, 0.6001775663920466, 0.18886316876982878, 0.41186519614364814>
Vec{8, Float64}<0.980902779143441, 0.8418685033011257, 0.5255392424137211, 0.9331143002012057, 0.7669572786918826, 0.11638477303011419, 0.16743089573675807, 0.964175916870037>

Why do this instead of just using the first API 3 times?

julia> vload(stridedpointer(colormat), (1,MM{8}(1))) # load [1,1:8]
Vec{8, Float64}<0.10146162439576822, 0.5664720363768123, 0.39572490568820506, 0.1565950460890433, 0.24747228039940494, 0.586089973376654, 0.4516746290804887, 0.3292882151577483>

julia> vload(stridedpointer(colormat), (2,MM{8}(1))) # load [2,1:8]
Vec{8, Float64}<0.9198573051020194, 0.852129002305116, 0.0622284186540214, 0.47901368495584085, 0.596842316190509, 0.6001775663920466, 0.18886316876982878, 0.41186519614364814>

julia> vload(stridedpointer(colormat), (3,MM{8}(1))) # load [3,1:8]
Vec{8, Float64}<0.980902779143441, 0.8418685033011257, 0.5255392424137211, 0.9331143002012057, 0.7669572786918826, 0.11638477303011419, 0.16743089573675807, 0.964175916870037>

Because the Unroll is sometimes much faster:

julia> @benchmark (vload(stridedpointer($colormat), (1,MM{8}(1))), vload(stridedpointer($colormat), (2,MM{8}(1))),  vload(stridedpointer($colormat), (3,MM{8}(1))))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.940 ns (0.00% GC)
  median time:      6.040 ns (0.00% GC)
  mean time:        6.076 ns (0.00% GC)
  maximum time:     14.404 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark vload(stridedpointer($colormat), Unroll{1,1,3,2,8,zero(UInt),1}((1,1)))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.080 ns (0.00% GC)
  median time:      3.097 ns (0.00% GC)
  mean time:        3.108 ns (0.00% GC)
  maximum time:     8.722 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

It's often possible to optimize groups of loads and stores that are done together, and that is what the Unroll type enables.
Note that the master branch of LoopVectorization doesn't use the Unroll type yet, but the vecunroll branch does, and I will merge it with master in the next few weeks or perhaps days.
While Unroll looks awkward for us, it's straightforward for LoopVectorization to plug in the correct parameters.

The case for vstore! is similar, but it also accepts a function as a first argument to tell it to possibly use that function to "reduce" a vector into a scalar. This is again something that is possible to optimize when done in Unrolled batches.
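The idea can be sketched in plain Julia (hypothetical `store_reduce!`, an analogue rather than the real `vstore!` signature):

```julia
# Hypothetical plain-Julia analogue of the reducing-store idea (not the real
# VectorizationBase API): the first argument is a function used to "reduce"
# a vector of values into the scalar that actually gets stored.
function store_reduce!(op, A::AbstractVector, i::Integer, v::AbstractVector)
    A[i] = reduce(op, v)   # e.g. op = + collapses the vector to a sum
    return A
end

A = zeros(3)
store_reduce!(+, A, 1, [1.0, 2.0, 3.0])   # A[1] == 6.0
```

In the real library, batching such reducing stores via `Unroll` is what opens up the optimization opportunities.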

I'm definitely open to suggestions for cleaning things up, and do think the vload/vstore! and also the type promotion code could use a lot of cleaning up. At the very least, renaming all the internal vload/vstore!s is an easy thing to do.
The code could almost certainly be orthogonalized better.

But the actual API of these functions is pretty straightforward, with most of the methods coming from determining the behavior from the types.

I should probably document this somewhere more clearly!
But I hope the explanation helps.

@GiggleLiu
Contributor Author

Thanks for your very detailed explanation. The reason I was confused is that the list is very long when I type `?vload` and `?vstore!` in a REPL. Now I am completely convinced.

Writing many docstrings at too early a stage is generally not good for future design changes, but minimal documentation of the most important functions would help people understand the design. Let me see if I can help add some documentation.

Cheers.

chriselrod added a commit that referenced this pull request Mar 3, 2021