
Making sum(skipmissing(x)) faster #27679

Closed
nalimilan opened this issue Jun 20, 2018 · 2 comments
Labels: missing data (Base.missing and related functionality), performance (Must go faster)

Comments

@nalimilan (Member)

sum(skipmissing(x)) is reasonably fast, but since #27651 a naive implementation of sum is even faster (up to 5×). This is particularly clear when the input array actually contains missing values, probably because the naive sum uses masked SIMD instructions while skipmissing relies on branching (and the presence of missing values kills branch prediction).

I haven't investigated this deeply, but IIUC sum(skipmissing(x)) dispatches to mapfoldl, which uses the iteration protocol to walk over the input. This means we lose the benefit of @inbounds and @simd. Maybe something like #27384 would help with @inbounds at least?

function mysum(X)
    s = zero(eltype(X))
    @inbounds @simd for x in X
        if x !== missing  # skip missing values inline
            s += x
        end
    end
    s
end

# With Vector{Int}
X1 = rand(Int, 10_000_000);
X2 = Vector{Union{Missing,Int}}(X1);
X3 = ifelse.(rand(length(X2)) .< 0.9, X2, missing);

julia> @btime mysum(X1);
 5.594 ms (1 allocation: 16 bytes)

julia> @btime mysum(X2);
 5.791 ms (1 allocation: 16 bytes)

julia> @btime mysum(X3);
 5.681 ms (1 allocation: 16 bytes)

julia> @btime sum(skipmissing(X1));
 7.316 ms (2 allocations: 32 bytes)

julia> @btime sum(skipmissing(X2));
 18.175 ms (2 allocations: 32 bytes)

julia> @btime sum(skipmissing(X3));
 28.240 ms (2 allocations: 32 bytes)

# With Vector{Float64}
Y1 = rand(Float64, 10_000_000);
Y2 = Vector{Union{Missing,Float64}}(Y1);
Y3 = ifelse.(rand(length(Y2)) .< 0.9, Y2, missing);

julia> @btime mysum(Y1);
 5.860 ms (1 allocation: 16 bytes)

julia> @btime mysum(Y2);
 13.849 ms (1 allocation: 16 bytes)

julia> @btime mysum(Y3);
 17.428 ms (1 allocation: 16 bytes)

julia> @btime sum(skipmissing(Y1));
 13.816 ms (2 allocations: 32 bytes)

julia> @btime sum(skipmissing(Y2));
 18.254 ms (2 allocations: 32 bytes)

julia> @btime sum(skipmissing(Y3));
 28.430 ms (2 allocations: 32 bytes)
nalimilan added the performance (Must go faster) and missing data (Base.missing and related functionality) labels on Jun 20, 2018
@nalimilan (Member, Author)

After some experimentation, it turns out @simd is really the annotation that makes a difference for performance, while @inbounds alone has only a very small effect. This is particularly the case for Vector{Int}, which is surprising, since I would have expected @simd to give the compiler more room with Vector{Float64} (where it licenses reassociating floating-point additions). This is worth investigating, as it doesn't seem we should need an explicit @simd to get good performance for Int (another question is why SIMD instructions aren't used for Float64, but that's somewhat orthogonal).
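For reference, the experiment above can be reproduced by benchmarking a variant of mysum with only @inbounds against the full @inbounds @simd version under @btime (mysum_inbounds_only is an illustrative name, not anything from Base):

```julia
# Same kernel as mysum but without @simd, to isolate the effect of
# @inbounds alone (illustrative helper, not part of Base):
function mysum_inbounds_only(X)
    s = zero(eltype(X))
    @inbounds for x in X
        if x !== missing
            s += x
        end
    end
    s
end
```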

The more general problem is that we cannot use @simd the way sum(::Array) does, since the methods involved here use while loops rather than for loops. The pattern used by iterate on SkipMissing objects is probably too hard for LLVM to optimize: it's a while loop returning the first non-missing value. Unless/until we can make LLVM detect that this pattern can be translated to masked SIMD instructions, it looks like we should implement special reduction methods for SkipMissing arguments which call reduce on the underlying array with a helper function that does nothing when the input is missing.
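To make the problematic pattern concrete, here is a minimal sketch of what iterate on a skip-missing wrapper looks like, using a simplified stand-in type (MySkipMissing is illustrative; Base's actual SkipMissing iterate differs in details and supports generic indexing):

```julia
# Simplified stand-in for Base.SkipMissing, assuming a 1-based vector.
struct MySkipMissing{A}
    x::A
end

function Base.iterate(itr::MySkipMissing, i=1)
    A = itr.x
    # This data-dependent while loop is what defeats vectorization:
    # the distance to the next non-missing element is unknown, so LLVM
    # cannot turn it into a fixed-stride masked SIMD loop.
    while i <= length(A)
        v = A[i]
        v === missing || return (v, i + 1)
        i += 1
    end
    return nothing
end

# The number of non-missing elements is unknown up front.
Base.IteratorSize(::Type{<:MySkipMissing}) = Base.SizeUnknown()
```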

@nalimilan (Member, Author)

Actually it doesn't seem possible to implement an efficient mapreduce method for SkipMissing by just calling mapreduce on the wrapped array with a custom function wrapping op:

  • x === missing is exploited by inference only when it appears in the same function (see ismissing(x) much slower than x === missing #27681 (comment)), so calling it inside a wrapper around op that is passed to mapreduce isn't efficient (even if the wrapper is inlined).
  • Even if that were fast, it's not clear how the wrapper around op should behave when passed two missing values, since it's always supposed to return a non-missing value. That would work if a valid v0 were always passed, but that's not the case, and we only know a neutral v0 for standard operations like sum, not in general. Maybe it could work with a pseudo-missing value which would disappear instead of propagating, but that would require redefining all the methods that missing overrides.

So I had to write a full mapreduce implementation for SkipMissing{<:AbstractArray}. It's too bad we need to duplicate/adapt methods, but at least the performance results are good. See #27743.
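The shape of such a specialized reduction can be sketched as follows (illustrative names and a deliberately minimal signature; this is not the actual code from #27743):

```julia
# Hedged sketch: a reduction specialized on the wrapped array, keeping the
# missing check inside an @inbounds @simd loop so it sits at the reduction
# site where inference can exploit it. Note that @simd allows the compiler
# to reassociate, so `op` should be associative (as + and * are).
# `mapreduce_skipmissing` and `v0` are illustrative names.
function mapreduce_skipmissing(f, op, A::AbstractVector, v0)
    v = v0
    @inbounds @simd for i in eachindex(A)
        x = A[i]
        if x !== missing
            v = op(v, f(x))
        end
    end
    return v
end
```

For example, mapreduce_skipmissing(identity, +, X2, 0) should match sum(skipmissing(X2)) while leaving the loop in a form the vectorizer can handle.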
