
Transducer as an optimization: map, filter and flatten #33526

Merged: 11 commits into JuliaLang:master on Dec 4, 2019

Conversation

@tkf (Member) commented Oct 11, 2019:

This PR implements very minimal (stateless) transducers in Base and uses them internally in foldl (mapfoldl). My intention is to make the transducer-based code a complete implementation detail, so that code using iterators simply becomes faster. Here is a benchmark:

using BenchmarkTools
xs = [abs(x) < 1 ? x : missing for x in randn(1000)]
@btime foldl(+, (x for x in $xs if x !== missing))
  • Before (a68237f): 1.484 μs (9 allocations: 144 bytes)
  • After (fb89009): 692.771 ns (2 allocations: 32 bytes)
  • @btime sum(skipmissing($xs)): 881.146 ns (5 allocations: 80 bytes)

This benchmark shows that the transducer-based optimization can yield a 2x speedup for a simple filter. It is also "faster" than skipmissing, although skipmissing uses pairwise summation, so this is not really a fair comparison. However, the transducer-based optimization can also be applied to reduce, which would then eliminate the need for the specialized mapreduce_impl implementation (ref #27743). Furthermore, a similar speedup may occur for any type-based filtering, not just missing.

(By the way, the specialized mapreduce_impl for skipmissing seems to be required because === missing has to be hard-coded, according to this comment #27681 (comment) by @Keno. However, this requirement does not seem to apply to transducers. Why is that? It has been puzzling me for a while.)

I think I can optimize it further, but this is a good starting point where the code is still simple and therefore easy to review. Other "stateless" iterator transforms like Flatten should be easy to support.

tl;dr

  • Transducers make foldl combined with iterator comprehensions (Iterators.filter and Generator) faster.
  • Full comprehension support is easy (i.e., adding Flatten). (Edit: Flatten is now supported in this PR.)
  • Some skipmissing-specific code can be removed (maybe).

What do people think about this? Is it worth using simple transducers in Julia Base?

Implementation

Quoting the docstring, the central idea is implemented in _xfadjoint, which (roughly speaking) converts iterators to transducers:

_xfadjoint(op, itr) -> op′, itr′

Given a pair of reducing function op and an iterator itr, return a pair (op′, itr′) of similar types. If the iterator itr is transformed by an iterator transform ixf whose adjoint transducer xf is known, op′ = xf(op) and itr′ = ixf⁻¹(itr) is returned. Otherwise, op and itr are returned as-is. For example, transducer rf -> MappingRF(f, rf) is the adjoint of iterator transform itr -> Generator(f, itr).

Nested iterator transforms are converted recursively. That is to say, given op and

itr = (ixf₁ ∘ ixf₂ ∘ ... ∘ ixfₙ)(itr′)

what is returned is itr′ and

op′ = (xfₙ ∘ ... ∘ xf₂ ∘ xf₁)(op)

This conversion is invoked inside foldl/mapfoldl just before starting the main loop.
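
For concreteness, here is a rough sketch of the wrappers and the recursive conversion. This is not the exact Base code: the wrapper structs are simplified, and the field accesses (itr.f, itr.iter, itr.flt, itr.itr) assume the current field names of Generator and Iterators.Filter.

# Simplified sketch; the real definitions live in Base's reduce.jl.
struct MappingRF{F,T}
    f::F
    rf::T
end
@inline (op::MappingRF)(acc, x) = op.rf(acc, op.f(x))

struct FilteringRF{F,T}
    f::F
    rf::T
end
@inline (op::FilteringRF)(acc, x) = op.f(x) ? op.rf(acc, x) : acc

_xfadjoint(op, itr) = (op, itr)  # fallback: no known adjoint, return as-is
_xfadjoint(op, itr::Base.Generator) =
    _xfadjoint(MappingRF(itr.f, op), itr.iter)
_xfadjoint(op, itr::Base.Iterators.Filter) =
    _xfadjoint(FilteringRF(itr.flt, op), itr.itr)

A FlatteningRF for Iterators.Flatten follows the same pattern; its call runs an inner fold over each sub-iterator.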

@JeffBezanson (Member):

Nice, this seems a lot more elegant than having mapfold as a special case.

Now I better understand why you want inner "transformed" iterators to have consistent field names. It would be fine to change them all to be consistent.

@JeffBezanson JeffBezanson added the performance Must go faster label Oct 11, 2019
@ararslan (Member):

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@tkf (Member, Author) commented Oct 12, 2019:

I added the transducer for Flatten too, as I noticed that doing so would force me to refactor/simplify some implementations.

@tkf tkf changed the title Transducer as an optimization: map and filter Transducer as an optimization: map, filter and flatten Oct 12, 2019
@chriselrod (Contributor):

Isn't Array{Union{T,Missing}} an Array{T} plus a BitArray under the hood somehow? If so, and all we're talking about is summing the non-missing elements, we can get a lot faster. Of course, foldl comes with guaranteed left associativity, so it shouldn't be vectorized like the simple loop below.

using BenchmarkTools
xs = [abs(x) < 1 ? x : missing for x in randn(1000)];
xsf64 = [x isa Missing ? NaN : x for x ∈ xs];
xsb = isa.(xs, Missing);
function summissing(x::AbstractArray{T}, b::BitArray) where {T}
    s = zero(T)
    @inbounds @simd for i ∈ eachindex(x,b)
        s += b[i] ? zero(T) : x[i]
    end
    s
end

Now:

julia> @btime foldl(+, (x for x in $xs if x !== missing))
  1.473 μs (9 allocations: 144 bytes)
-17.98924543402577

julia> @btime summissing($xsf64, $xsb)
  311.917 ns (0 allocations: 0 bytes)
-17.989245434025776

julia> @btime sum(skipmissing($xs))
  1.252 μs (5 allocations: 80 bytes)
-17.98924543402577

312 ns vs >1.2 μs.

The 312 ns is also a lot better than I get when trying to use the Union Array:

julia> function summissing(x::AbstractVector{Union{T,Missing}}) where {T}
           s = zero(T)
           @inbounds @simd for i ∈ eachindex(x)
               xᵢ = x[i]; s += ismissing(xᵢ) ? zero(T) : xᵢ
           end
           s
       end
summissing (generic function with 2 methods)

julia> @btime summissing($xs)
  874.055 ns (0 allocations: 0 bytes)
25.598370606424098

The actual Float64 elements are stored contiguously, so as long as there is a string of bits located somewhere (instead of Bools), we should be able to get that kind of performance.

julia> xsp = Base.unsafe_convert(Ptr{Float64}, pointer(xs)); # Don't want Ptr{Union{Float64,Missing}}

julia> xs'
1×1000 LinearAlgebra.Adjoint{Union{Missing, Float64},Array{Union{Missing, Float64},1}}:
 -0.0323233  -0.426845  missing  -0.141168  0.0985111  -0.258741  missing  missing  0.116394  missing  0.0855161  -0.949796  0.751656  0.998423  missing  -0.921113  -0.952986  0.330352  missing  0.322523    -0.0225253  -0.117459  missing  missing  0.327171  missing  missing  missing  0.663337  missing  0.25228  0.152217  -0.658007  -0.1019  -0.903252  -0.832974  0.964921  0.117777  missing  0.760332  0.000516222

julia> unsafe_load(xsp,1), unsafe_load(xsp,2), unsafe_load(xsp,4)
(-0.032323336118395316, -0.42684453051201715, -0.1411677613982569)

@tkf (Member, Author) commented Oct 27, 2019:

The slowness of ismissing(xᵢ) ? zero(T) : xᵢ is exactly why ismissing has a special mapreduce (see #27679 and #27681 (comment)). If I use if xᵢ !== missing to help inference,

function summissing2(x::AbstractVector{Union{T,Missing}}) where {T}
    s = zero(T)
    @inbounds @simd for i ∈ eachindex(x)
        xᵢ = x[i]
        if xᵢ !== missing
            s += xᵢ
        end
    end
    s
end

then I get

julia> @btime summissing($xsf64, $xsb)
  430.477 ns (0 allocations: 0 bytes)
11.40270702296504

julia> @btime summissing($xs)
  985.214 ns (0 allocations: 0 bytes)
11.402707022965044

julia> @btime summissing2($xs)
  666.930 ns (0 allocations: 0 bytes)
11.402707022965044

summissing2 is still slower than summissing(x::AbstractArray{T}, b::BitArray), but the gap is not as drastic as with summissing(::AbstractVector{Union{T,Missing}}).

Note that the code structure of summissing2 is exactly what the filtering transducer produces (except for the @simd).
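
To make that concrete, the loop the transducer machinery effectively runs for foldl(+, (x for x in xs if x !== missing)) looks roughly like this (an illustrative sketch with made-up names, not the actual generated code):

function foldl_filtered(op, xs; init)
    acc = init
    for x in xs
        if x !== missing      # test contributed by the FilteringRF layer
            acc = op(acc, x)  # the innermost reducing function
        end
    end
    return acc
end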

Isn't Array{Union{T,Missing}} an Array{T} plus a BitArray under the hood somehow?

If the implementation hasn't changed since this blog post, https://julialang.org/blog/2018/06/missing, the mask is an Array{UInt8}. The rationale is explained in the post:

Even if it consumes more memory, the Array{UInt8} mask approach is faster (at least in the current state of BitArray), and it generalizes to Unions of more than two types.

(though I guess your benchmark is a counterexample to the claim that it's faster)
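
For anyone who wants to test that claim directly, here is a sketch of the analogous kernel with an explicitly constructed byte mask (summissing8 and xsb8 are made-up names; this builds the mask by hand rather than reading Base's internal type-tag array):

xsb8 = UInt8.(isa.(xs, Missing));  # byte mask analogous to the internal representation
function summissing8(x::AbstractArray{T}, b::Array{UInt8}) where {T}
    s = zero(T)
    @inbounds @simd for i ∈ eachindex(x, b)
        s += b[i] == 0x00 ? x[i] : zero(T)  # 0x00 marks a non-missing element here
    end
    s
end
# usage: summissing8(xsf64, xsb8)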

@JeffBezanson (Member):

Please rebase and I'll merge this.

Review thread on the MappingRF definition:

struct MappingRF{F,T}
    f::F
    rf::T
end

@inline (op::MappingRF)(acc, x) = op.rf(acc, op.f(x))
Comment (Member):

I'm curious about the use of @inline here; with such a minimal implementation, you'd think it wouldn't be needed. Yet I think I've found myself doing the same thing because, if I understand correctly, even though op.f might itself have @inline, this simple function barrier can prevent total inlining (I imagine because op.f gets inlined into this one-liner, which then itself doesn't get inlined because of its now-increased cost). Is that correct thinking?

Reply (Member):

Yes, that's probably what happens.
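
A toy illustration of the effect being described (hypothetical names, not code from this PR): the wrapper below is trivially small on its own, but once w.f is inlined into it, its inlining cost grows, so without an explicit @inline the wrapper itself may be left as an opaque call inside the hot loop.

struct Wrapped{F}
    f::F
end

# Forcing @inline keeps both this wrapper and the user's function flattened
# into the caller's loop instead of leaving a residual call.
@inline (w::Wrapped)(acc, x) = acc + w.f(x)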

@tkf (Member, Author) commented Dec 3, 2019:

In d0b1c04, I merged master by re-doing what was done in #33917.

@tkf (Member, Author) commented Dec 3, 2019:

Hmm... So, it seems that the speedup from using transducers is completely gone on current master. What is puzzling is that @descending into _foldl_impl does not show any difference in the IR when comparing d0b1c04 to dfcd792 (which shows the speedup). I uploaded the Julia IR and LLVM IR here: https://gist.github.com/tkf/d9397c14f8612f61be9ea72257c50591 You can see that there is basically no difference in the Julia IR, but there are some differences in the LLVM IR. Were there some recent updates to the compiler pipeline between these two commits?

@KristofferC (Member):

What is puzzling is that @descending into _foldl_impl does not show any difference in the IR when comparing d0b1c04 to dfcd792 (which shows the speedup)

In cases where specialization doesn't happen due to compiler heuristics, the code reflection doesn't show the code that will actually execute (see e.g. #33142).

@JeffBezanson (Member):

Wow, thanks for checking that again.

@tkf (Member, Author) commented Dec 4, 2019:

@KristofferC Thanks, that makes sense.

@JeffBezanson I probably should have been using a less "magic" example of transducers, i.e., one that does not interact with compiler details. After adding an iterator-to-transducer conversion that I had forgotten to implement (ce49c7e), the following example shows a good speedup over current master:

using BenchmarkTools
@btime foldl(+, (y for x in 1:1000 for y in 1:x if y % 2 == 0))
  • Current master (5a9cce1): 219.535 μs (0 allocations: 0 bytes)
  • This PR (ce49c7e): 72.512 μs (0 allocations: 0 bytes)

Can this be merged?

@tkf (Member, Author) commented Dec 4, 2019:

Actually, I still see the speedup I mentioned in the OP if I set init:

using BenchmarkTools
xs = [abs(x) < 1 ? x : missing for x in randn(1000)]
@btime foldl(+, (x for x in $xs if x !== missing); init=0.0)
  • This PR (ce49c7e): 676.987 ns (0 allocations: 0 bytes)
  • Master (5a9cce1): 1.279 μs (2 allocations: 32 bytes)

I guess the init-less version is just complicated enough that it crosses some threshold changed by the recent compilation-time fixes for 1.3, or something like that?

@StefanKarpinski (Member):

I'm not too worried about the lost optimization for the moment, since the fact that it used to be faster proves that this approach generates simpler, easier-to-optimize code, and therefore it seems to me to remain justified. Of course, it would be nice to recover the optimization as well, but that's somewhat independent of this change.

@JeffBezanson JeffBezanson merged commit 3c182bc into JuliaLang:master Dec 4, 2019
@nalimilan (Member):

Sorry, I had missed this. Can anything be done to improve the interaction of skipmissing with reductions and/or drop the specialized implementation? Would adding _xfadjoint(op, itr::SkipMissing) make sense? Would a special Filter-like iterator that uses === work?

In particular, it would be nice to be able to implement reductions over dimensions without copying all the definitions like #28027 currently does.

@tkf tkf deleted the filterxf branch January 5, 2020 20:04
@tkf (Member, Author) commented Jan 5, 2020:

I think we need two things. First, as you said, we need to add

_xfadjoint(op, itr::SkipMissing) = _xfadjoint(FilteringRF(!ismissing, op), itr.itr)

Second, we need to hook the transducers into reduce. See mapfoldl_impl for how to use _xfadjoint.
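
For illustration, the hookup looks roughly like this (my_mapfoldl is a hypothetical name; Base's actual mapfoldl_impl also handles the init-less case and differs in other details):

function my_mapfoldl(f, op, itr; init)
    op′, itr′ = Base._xfadjoint(Base.MappingRF(f, op), itr)
    acc = init
    for x in itr′             # main fold loop over the stripped iterator
        acc = op′(acc, x)     # op′ applies the transducer stack, then op
    end
    return acc
end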

But, as I mentioned in #33526 (comment) and #33526 (comment), it seems that you need a concrete identity element (init) in order to get union splitting (?) with the current compiler heuristics. Maybe a "tail-call function barrier" can solve this, but it is a bit tricky to implement due to the problem I discussed in this discourse post.

@tkf (Member, Author) commented Jan 7, 2020:

Maybe "tail call function-barrier" can solve this

It does. See #34293.

@tkf (Member, Author) commented Jan 7, 2020:

I thought I'd give this a shot (at least the single-dimensional version), but then realized that it would introduce a big conflict with my other PR #31020...

@nalimilan (Member):

OK, thanks. Maybe let's revive #31020? It would be nice to get that merged anyway!

@tkf (Member, Author) commented Jan 7, 2020:

#31020 is still alive and waiting for @mbauman to review.
