Transducer as an optimization: map, filter and flatten #33526
Conversation
Nice, this seems a lot more elegant than having […]. Now I better understand why you want the inner "transformed" iterators to have consistent field names. It would be fine to change them all to be consistent.
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
I added the transducer for […].
Aren't […]? Setup:

```julia
using BenchmarkTools
xs = [abs(x) < 1 ? x : missing for x in randn(1000)];
xsf64 = [x isa Missing ? NaN : x for x ∈ xs];
xsb = isa.(xs, Missing);
function summissing(x::AbstractArray{T}, b::BitArray) where {T}
    s = zero(T)
    @inbounds @simd for i ∈ eachindex(x, b)
        s += b[i] ? zero(T) : x[i]
    end
    s
end
```

Now:

```julia
julia> @btime foldl(+, (x for x in $xs if x !== missing))
  1.473 μs (9 allocations: 144 bytes)
-17.98924543402577

julia> @btime summissing($xsf64, $xsb)
  311.917 ns (0 allocations: 0 bytes)
-17.989245434025776

julia> @btime sum(skipmissing($xs))
  1.252 μs (5 allocations: 80 bytes)
-17.98924543402577
```

312 ns vs. >1.2 μs. The 312 ns is also a lot better than what I get when trying to use the Union array:

```julia
julia> function summissing(x::AbstractVector{Union{T,Missing}}) where {T}
           s = zero(T)
           @inbounds @simd for i ∈ eachindex(x)
               xᵢ = x[i]; s += ismissing(xᵢ) ? zero(T) : xᵢ
           end
           s
       end
summissing (generic function with 2 methods)

julia> @btime summissing($xs)
  874.055 ns (0 allocations: 0 bytes)
25.598370606424098
```

The actual […]:

```julia
julia> xsp = Base.unsafe_convert(Ptr{Float64}, pointer(xs)); # Don't want Ptr{Union{Float64,Missing}}

julia> xs'
1×1000 LinearAlgebra.Adjoint{Union{Missing, Float64},Array{Union{Missing, Float64},1}}:
 -0.0323233 -0.426845 missing -0.141168 0.0985111 -0.258741 missing missing 0.116394 missing 0.0855161 -0.949796 0.751656 0.998423 missing -0.921113 -0.952986 0.330352 missing 0.322523 … -0.0225253 -0.117459 missing missing 0.327171 missing missing missing 0.663337 missing 0.25228 0.152217 -0.658007 -0.1019 -0.903252 -0.832974 0.964921 0.117777 missing 0.760332 0.000516222

julia> unsafe_load(xsp, 1), unsafe_load(xsp, 2), unsafe_load(xsp, 4)
(-0.032323336118395316, -0.42684453051201715, -0.1411677613982569)
```
Slowness of […]: if I define

```julia
function summissing2(x::AbstractVector{Union{T,Missing}}) where {T}
    s = zero(T)
    @inbounds @simd for i ∈ eachindex(x)
        xᵢ = x[i]
        if xᵢ !== missing
            s += xᵢ
        end
    end
    s
end
```

then I get:

```julia
julia> @btime summissing($xsf64, $xsb)
  430.477 ns (0 allocations: 0 bytes)
11.40270702296504

julia> @btime summissing($xs)
  985.214 ns (0 allocations: 0 bytes)
11.402707022965044

julia> @btime summissing2($xs)
  666.930 ns (0 allocations: 0 bytes)
11.402707022965044
```
Note that the code structure of […].

If the implementation hasn't changed since this blog post, https://julialang.org/blog/2018/06/missing, it is […] (though I guess your benchmark is a counterexample of the claim that it's faster).
Please rebase and I'll merge this.
```julia
    rf::T
end

@inline (op::MappingRF)(acc, x) = op.rf(acc, op.f(x))
```
I'm curious about the use of `@inline` here; with such a minimal implementation, you'd think it wouldn't be needed. Yet I think I've found myself doing the same thing because, if I understand correctly, even though `op.f` might itself have `@inline`, having this simple function barrier can prevent total inlining (I imagine because `op.f` gets inlined into this one-liner, but the result then doesn't get inlined because it is no longer small enough to be inlineable). Is that correct thinking?
Yes, that's probably what happens.
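(A minimal standalone sketch of the wrapper pattern under discussion — `MapRF` here is a hypothetical stand-in for Base's `MappingRF`, not the actual implementation:)

```julia
# Hypothetical stand-in for Base's MappingRF, illustrating the
# reducing-function wrapper plus the @inline function barrier.
struct MapRF{F,T}
    f::F   # mapped function
    rf::T  # inner reducing function
end

# Apply `f` to each element before handing it to the inner reducer;
# @inline asks the compiler to keep this barrier cost-free.
@inline (op::MapRF)(acc, x) = op.rf(acc, op.f(x))

# Any callable works as the operator for foldl:
foldl(MapRF(abs2, +), 1:10; init=0) == sum(abs2, 1:10)  # true
```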
Hmm... So, it seems that the speedup from using transducers is completely gone on current master. What is puzzling is that […]
In cases where specialization doesn't happen due to compiler heuristics, code reflection doesn't show the code that will actually execute (see e.g. #33142).
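(For illustration — a sketch of the heuristic being referred to, per the "Be aware of when Julia avoids specializing" section of the performance tips; the function names here are made up:)

```julia
# Julia's heuristics may skip specializing on a function-typed argument
# that is only passed through to another call rather than called directly:
passthrough(f, xs) = map(f, xs)

# An explicit type parameter forces specialization:
forced(f::F, xs) where {F} = map(f, xs)

# Reflection such as @code_typed always shows the fully specialized
# version, which may differ from what actually runs (cf. #33142).
```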
Wow, thanks for checking that again.
@KristofferC Thanks, that makes sense. @JeffBezanson I probably should be using a less "magic" example of transducers, i.e., the part that does not interact with compiler details. After adding an iterator-to-transducer conversion I had forgotten to implement before (ce49c7e), the following example shows a good speedup over current master:

```julia
using BenchmarkTools
@btime foldl(+, (y for x in 1:1000 for y in 1:x if y % 2 == 0))
```
Can this be merged?
Actually, I still see the speedup I mentioned in the OP if I set `init`:

```julia
using BenchmarkTools
xs = [abs(x) < 1 ? x : missing for x in randn(1000)]
@btime foldl(+, (x for x in $xs if x !== missing); init=0.0)
```
I guess […]
I'm not too worried about the lost optimization for the moment: the fact that this used to be faster proves that it generates simpler, easier-to-optimize code, which seems to me to justify the change on its own. Of course, it would be nice to recover the optimization as well, but that's somewhat independent of this change.
Sorry, I had missed this. Can anything be done to improve the interaction of […]? In particular, it would be nice to be able to implement reductions over dimensions without copying all the definitions the way #28027 currently does.
I think we need two things. First, as you said, we need to add

```julia
_xfadjoint(op, itr::SkipMissing) = _xfadjoint(FilteringRF(!ismissing, op), itr.itr)
```

Second, we need to hook the transducers into […]. But, as I mentioned in #33526 (comment) and #33526 (comment), it seems that you need to have the concrete identity element (…)
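(A standalone sketch of this peel-the-iterator idea, with hypothetical names — `FilterRF`/`xfadjoint` stand in for Base's `FilteringRF`/`_xfadjoint`, and the real internals differ:)

```julia
# Hypothetical stand-ins showing how an iterator wrapper is converted
# into a reducing-function wrapper ("transducer").
struct FilterRF{P,T}
    pred::P  # filtering predicate
    rf::T    # inner reducing function
end
@inline (op::FilterRF)(acc, x) = op.pred(x) ? op.rf(acc, x) : acc

# Peel Iterators.Filter off the iterator (flt/itr are its internal
# fields) and push the corresponding transducer onto the op instead:
xfadjoint(op, itr) = (op, itr)
xfadjoint(op, itr::Base.Iterators.Filter) =
    xfadjoint(FilterRF(itr.flt, op), itr.itr)

op, inner = xfadjoint(+, Iterators.filter(isodd, 1:10))
foldl(op, inner; init=0)  # == 25, the sum of odd numbers in 1:10
```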
It does. See #34293.
I thought I'd give this a shot (at least the single-dimensional version) but then realized that it would introduce a big conflict with my other PR #31020...
OK, thanks. Maybe let's revive #31020? It would be nice to get this merged anyway!
Use map and filter transducers in foldl as an optimization
This PR implements very minimal (stateless) transducers in Base and uses them internally in `foldl` (`mapfoldl`). My intention is to make the transducer-based code completely an implementation detail, so that code using iterators just becomes faster. Here is a benchmark:

- […]
- `@btime sum(skipmissing($xs))`: 881.146 ns (5 allocations: 80 bytes)

This benchmark shows that transducer-based optimization can yield a 2x speedup for a simple filter. Also, it is "faster" than `skipmissing`, although the latter does pairwise summation, so this is not really a fair comparison. However, transducer-based optimization can also be applied to `reduce`, which would then eliminate the need for the special `mapreduce_impl` implementation (ref #27743). Furthermore, a similar speedup may happen for any type-based filtering, not just `missing`.

(By the way, the specialized `mapreduce_impl` for `skipmissing` seems to be required because `=== missing` must be hard-coded, according to this comment #27681 (comment) by @Keno. However, this does not seem to apply to transducers. Why is this the case? It has been puzzling me for a while.)

I think I can optimize it more, but this is a good starting point where the code is still simple and easy to review. Other "stateless" iterator transforms like `Flatten` should be easy to support.

tl;dr:

- Makes `foldl` combined with iterator comprehensions (`Iterators.filter` and `Generator`) faster.
- Full comprehension support is easy (i.e., adding `Flatten`). (Edit: `Flatten` is supported in this PR now.)
- `skipmissing`-specific code can be removed (maybe).

What do people think about this? Is it worth using simple transducers in Julia Base?

Implementation

Quoting the docstring, the central idea is implemented in `_xfadjoint`, which (roughly speaking) converts iterators to transducers: […] This conversion is invoked inside `foldl`/`mapfoldl` just before starting the main loop.
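(To make the mechanism concrete, here is a hedged, standalone sketch of how such a hook-in could look — `xfadjoint`/`myfoldl` are hypothetical names, not the actual Base code:)

```julia
# Hypothetical mapping transducer (as in the MapRF sketch earlier).
struct MapRF{F,T}; f::F; rf::T; end
@inline (op::MapRF)(acc, x) = op.rf(acc, op.f(x))

# Convert a Generator (internal fields f/iter) into a mapping
# transducer pushed onto the reducing function:
xfadjoint(op, itr) = (op, itr)
xfadjoint(op, itr::Base.Generator) = xfadjoint(MapRF(itr.f, op), itr.iter)

# The fold driver peels iterator transforms just before its main loop:
function myfoldl(op, itr; init)
    op, itr = xfadjoint(op, itr)
    acc = init
    for x in itr
        acc = op(acc, x)
    end
    return acc
end

myfoldl(+, (x^2 for x in 1:10); init=0)  # == 385 == sum(abs2, 1:10)
```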