Add parallel reduction supports for RowIterator and NamedTupleIterator #187

tkf · 2020-08-10T00:12:45Z

This PR implements the SplittablesBase.jl interface function halve on RowIterator and NamedTupleIterator. This lets us use parallel reductions built on top of SplittablesBase.jl, such as those provided by Transducers.jl, ThreadsX.jl, and FLoops.jl:

using Tables
table = Tables.rows((key = 1:1000, value = randn(1000)))

using FLoops
using UnPack: @unpack

@floop for row in table
    @unpack key, value = row
    @reduce() do (kmax; key), (vmax; value)
        if vmax < value
            vmax = value
            kmax = key
        end
    end
end
@show kmax vmax

A tricky part of this PR is that, since SplittablesTesting.test_ordered uses isequal to compare items (rows), I needed to relax isequal to ignore the storage type of columns. The difference is that

isequal(
    first(Tables.rows((a = view([0], 1:1),))),
    first(Tables.rows((a = [0],))),
)

is false before this PR and true after this PR. I think it makes sense for ColumnsRow to be compared as if they were lowered to NamedTuples. This is also consistent with the fact that equality on arrays ignores the storage type:

julia> [0] == view([0], 1:1)
true

What do you think?
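
As a rough illustration of "compared as if they are lowered to NamedTuples", here is a minimal sketch in plain Julia. RowSketch and lower are hypothetical names made up for this example; they are not Tables.jl's ColumnsRow or any part of this PR's diff.

```julia
# Hypothetical stand-in for a row that points into a column table.
# This is NOT Tables.jl's ColumnsRow; it only illustrates the idea.
struct RowSketch{T<:NamedTuple}
    columns::T
    index::Int
end

# Lower a row to a NamedTuple of its values, discarding how each column is stored.
# `map` over a NamedTuple returns a NamedTuple with the same names.
lower(row::RowSketch) = map(col -> col[row.index], row.columns)

# Compare (and hash) rows through their lowered form.
Base.isequal(x::RowSketch, y::RowSketch) = isequal(lower(x), lower(y))
Base.hash(row::RowSketch, h::UInt) = hash(lower(row), h)

r1 = RowSketch((a = view([0], 1:1),), 1)  # column stored as a SubArray
r2 = RowSketch((a = [0],), 1)             # column stored as a Vector
@assert isequal(r1, r2)
```

Defining hash alongside isequal keeps the two consistent, so rows that compare equal also behave the same as Dict keys or Set elements.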

codecov · 2020-08-10T00:37:31Z

Codecov Report

Merging #187 into master will increase coverage by 0.09%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #187      +/-   ##
==========================================
+ Coverage   96.71%   96.80%   +0.09%     
==========================================
  Files           6        6              
  Lines         456      469      +13     
==========================================
+ Hits          441      454      +13     
  Misses         15       15

Impacted Files        Coverage Δ
src/Tables.jl         92.92% <ø> (ø)
src/fallbacks.jl      97.69% <100.00%> (+0.21%) ⬆️
src/namedtuples.jl    98.27% <100.00%> (+0.06%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4a875bf...ecab1ac.

quinnj · 2020-08-11T04:28:40Z

Can you share a bit more on the motivation here? Obviously, we want to be cautious/wary of taking on new dependencies when it will also affect so many downstream packages. Questions that pop up in my mind:

What are some examples of what you could do w/ the implementation here?
Why only implemented for NamedTupleIterator and not rows/columns more generally?
What kind of package is SplittablesBase.jl? Minimal? Lots of changes? What kind of commitment to stability there?

tkf · 2020-08-20T06:53:51Z

Hi, thanks for the response and sorry for this late reply.

> Obviously, we want to be cautious/wary of taking on new dependencies when it will also affect so many downstream packages.

Yes, I understand this and I should've clarified what SplittablesBase.jl is.

Essentially, I am hoping that halve becomes fundamental infrastructure for parallel processing in Julia, in the same sense that iterate is currently fundamental infrastructure for sequential processing. The goal is to make halve+iterate (or halve+foldl) the interface between data structures (tables, arrays, dicts, sets, strings, ...) and parallel processing functions (map, reduce, group-by, join, ...).

I should probably open an RFC in JuliaLang/julia, but I've been a bit hesitant to do so since I don't feel this interface has been tested outside my packages. I thought of this PR as a step toward accumulating such experience.

> What are some examples of what you could do w/ the implementation here?

The example in the OP with FLoops.jl is one thing. We'd also be able to use ThreadsX.jl with this. I think ThreadsX.jl + OnlineStats.jl integration would be appealing to Tables.jl users. Underneath, they all boil down to Transducers.foldxt, which uses halve. For example, you can compute min/max of max/min over columns a, b and c by foldxt(ProductRF(min, max), table |> Map(r -> (max(r.a, r.b, r.c), min(r.a, r.b, r.c)))) in one go (OK, I have no idea when you'd need this particular function, but it's a fun example).
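
To make "boils down to halve" more concrete, here is a minimal divide-and-conquer sketch in plain Julia. It is not how Transducers.foldxt is implemented; halve_sketch, reduce_by_halving, and the basesize keyword are hypothetical names for this illustration. It assumes a non-empty vector of rows and an associative op.

```julia
# Hypothetical helper: split a vector into two views at the midpoint.
# This stands in for a `halve` method; it is not SplittablesBase.jl code.
function halve_sketch(xs::AbstractVector)
    mid = (firstindex(xs) + lastindex(xs)) ÷ 2
    return view(xs, firstindex(xs):mid), view(xs, mid+1:lastindex(xs))
end

# Reduce `f.(xs)` with an associative `op` by recursive halving.
# Chunks of at most `basesize` elements are reduced sequentially.
# Assumes `xs` is non-empty.
function reduce_by_halving(op, f, xs::AbstractVector; basesize::Integer = 64)
    length(xs) <= max(basesize, 1) && return mapreduce(f, op, xs)
    left, right = halve_sketch(xs)
    # Process the right half in another task while this task handles the left.
    task = Threads.@spawn reduce_by_halving(op, f, right; basesize = basesize)
    acc_left = reduce_by_halving(op, f, left; basesize = basesize)
    return op(acc_left, fetch(task))
end

# Same flavor as the min/max-of-max/min example, on a vector of NamedTuple rows.
rows = [(a = i, b = 2i, c = -i) for i in 1:1000]
rowstats(r) = (max(r.a, r.b, r.c), min(r.a, r.b, r.c))
combine(x, y) = (min(x[1], y[1]), max(x[2], y[2]))

result = reduce_by_halving(combine, rowstats, rows)
@assert result == mapreduce(rowstats, combine, rows)
```

The only things the reduction needs from the collection are a way to split it and a way to fold each piece sequentially, which is the halve+iterate (or halve+foldl) interface described above.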

> Why only implemented for NamedTupleIterator and not rows/columns more generally?

If it is already an array, the generic fallback in SplittablesBase.jl covers it, so I don't need to add a specific implementation for RowTable. I can't provide an implementation for AbstractColumns (or column tables in general) because I'd like to keep halve and iterate consistent, in the sense that each halve implementation satisfies what I call the "vcat law":

(1) If the original collection is ordered, concatenating the
sub-collections returned by halve must create a collection that is
equivalent to the original collection. More precisely,
isequal(
    vec(collect(collection)),
    vcat(vec(collect(left)), vec(collect(right))),
)
must hold.

--- https://juliafolds.github.io/SplittablesBase.jl/dev/#SplittablesBase.halve
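
A minimal sketch of checking this law in plain Julia is shown below. halve_sketch and satisfies_vcat_law are hypothetical helpers written for this illustration; they are not part of SplittablesBase.jl or SplittablesTesting.

```julia
# Hypothetical stand-in for a `halve` method on a vector.
function halve_sketch(xs::AbstractVector)
    mid = (firstindex(xs) + lastindex(xs)) ÷ 2
    left = view(xs, firstindex(xs):mid)
    right = view(xs, mid+1:lastindex(xs))
    return left, right
end

# The "vcat law" quoted above, written as a predicate.
function satisfies_vcat_law(collection, left, right)
    return isequal(
        vec(collect(collection)),
        vcat(vec(collect(left)), vec(collect(right))),
    )
end

xs = [3, 1, 4, 1, 5, 9, 2, 6]
left, right = halve_sketch(xs)
@assert satisfies_vcat_law(xs, left, right)

# Returning the halves in the wrong order violates the law for this ordered collection.
@assert !satisfies_vcat_law(xs, right, left)
```

Because the law is phrased in terms of collect, a halve method has to split along the same elements that iterate produces, which is the consistency requirement mentioned above.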

> What kind of package is SplittablesBase.jl? Minimal? Lots of changes? What kind of commitment to stability there?

My intention is to keep it very minimal, although I have to put the implementations for Base types there. It currently also contains the code for testing. However, the public API for that code is the shim package SplittablesTesting, so I can remove the testing code at any point without introducing breaking changes.

I think it's almost 1.0-ready, but there is one specification of an optional API, amount (JuliaFolds/SplittablesBase.jl#31), that I want to clarify before 1.0.

If you want to postpone merging this at least until SplittablesBase.jl hits 1.0, I think that's a very reasonable decision. I can extract this PR into a separate package, SplittableTables.jl, for this to work (by touching the internals of Tables.jl a bit). But it'd be nice if we could tweak isequal as in this PR (as this is impossible to do outside Tables.jl without serious type piracy).

mattwigway · 2021-12-04T02:33:06Z

This is a really useful PR, but I see @quinnj's point about introducing an extra dependency on all of the downstream dependents of Tables. Maybe it would make more sense to have a separate package SplittableTables.jl that implements halve() for table iterators?

ParadaCarleton · 2023-06-23T01:55:00Z

> If it is already an array, the generic fallback in SplittablesBase.jl covers it already.

I'm not sure this is true, given that halve doesn't work with DataFrameRows (which is an AbstractVector).

> This is a really useful PR, but I see @quinnj's point about introducing an extra dependency on all of the downstream dependents of Tables. Maybe it would make more sense to have a separate package SplittableTables.jl that implements halve() for table iterators?

I think SplittablesBase.jl is fine, given it's a very small dependency.

cc @quinnj and @MasonProtter -- being able to use JuliaFolds with Tables and DataFrames would be awesome.

tkf added 2 commits August 9, 2020 17:02

Define halve on RowIterator and NamedTupleIterator

dbfd05c

Relax isless and isequal to discard column storage types

92d57d0

tkf changed the title ~~Add parallel reduction supports on RowIterator and NamedTupleIterator~~ Add parallel reduction supports for RowIterator and NamedTupleIterator Aug 10, 2020

Test empty rows

ecab1ac
