Add parallel reduction supports for RowIterator and NamedTupleIterator #187

Open
tkf wants to merge 3 commits into master


tkf
Contributor

@tkf tkf commented Aug 10, 2020

This PR implements the SplittablesBase.jl interface function halve for RowIterator and NamedTupleIterator. This lets us use the parallel reductions built on top of SplittablesBase.jl, such as those in Transducers.jl, ThreadsX.jl, and FLoops.jl:

using Tables
table = Tables.rows((key = 1:1000, value = randn(1000)))

using FLoops
using UnPack: @unpack

@floop for row in table
    @unpack key, value = row
    @reduce() do (kmax; key), (vmax; value)
        if vmax < value
            vmax = value
            kmax = key
        end
    end
end
@show kmax vmax

A tricky part of this PR is that, since SplittablesTesting.test_ordered uses isequal to compare items (rows), I needed to relax isequal to ignore the storage type of columns. The difference is that

isequal(
    first(Tables.rows((a = view([0], 1:1),))),
    first(Tables.rows((a = [0],))),
)

is false before this PR and true after this PR. I think it makes sense for ColumnsRow values to be compared as if they were lowered to NamedTuples. This is also consistent with the fact that equality on arrays ignores the array type:

julia> [0] == view([0], 1:1)
true

What do you think?
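
For concreteness, the relaxed comparison can be pictured along the following lines. This is only a sketch of the intended semantics, not the code in this PR: as_namedtuple is a hypothetical helper, built here from the public Tables.columnnames and Tables.getcolumn accessors.

```julia
using Tables

# Hypothetical helper (not part of Tables.jl): lower a row to a NamedTuple
# using only the public row accessors.
function as_namedtuple(row)
    names = Tuple(Tables.columnnames(row))
    vals = map(name -> Tables.getcolumn(row, name), names)
    return NamedTuple{names}(vals)
end

row_view = first(Tables.rows((a = view([0], 1:1),)))
row_vec = first(Tables.rows((a = [0],)))

# Once lowered, the storage type of the column no longer takes part
# in the comparison.
@show isequal(as_namedtuple(row_view), as_namedtuple(row_vec))
```

Comparing the lowered NamedTuples only looks at column names and values, which is the behavior proposed for isequal on rows.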

@tkf tkf changed the title Add parallel reduction supports on RowIterator and NamedTupleIterator Add parallel reduction supports for RowIterator and NamedTupleIterator Aug 10, 2020
@codecov

codecov bot commented Aug 10, 2020

Codecov Report

Merging #187 into master will increase coverage by 0.09%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #187      +/-   ##
==========================================
+ Coverage   96.71%   96.80%   +0.09%     
==========================================
  Files           6        6              
  Lines         456      469      +13     
==========================================
+ Hits          441      454      +13     
  Misses         15       15              
Impacted Files Coverage Δ
src/Tables.jl 92.92% <ø> (ø)
src/fallbacks.jl 97.69% <100.00%> (+0.21%) ⬆️
src/namedtuples.jl 98.27% <100.00%> (+0.06%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4a875bf...ecab1ac.

@quinnj
Member

quinnj commented Aug 11, 2020

Can you share a bit more on the motivation here? Obviously, we want to be cautious/wary of taking on new dependencies when it will also affect so many downstream packages. Questions that pop up in my mind:

  1. What are some examples of what you could do w/ the implementation here?
  2. Why only implemented for NamedTupleIterator and not rows/columns more generally?
  3. What kind of package is SplittablesBase.jl? Minimal? Lots of changes? What kind of commitment to stability there?

@tkf
Contributor Author

tkf commented Aug 20, 2020

Hi, thanks for the response and sorry for this late reply.

Obviously, we want to be cautious/wary of taking on new dependencies when it will also affect so many downstream packages.

Yes, I understand this and I should've clarified what SplittablesBase.jl is.

Essentially, I am hoping that halve will become fundamental infrastructure for parallel processing in Julia, in the same sense that iterate is currently fundamental infrastructure for sequential processing. The goal is to make halve+iterate (or halve+foldl) the interface between data structures (tables, arrays, dicts, sets, strings, ...) and parallel processing functions (map, reduce, group-by, join, ...).
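
To illustrate the halve+foldl idea, here is a minimal sketch of a divide-and-conquer reduction. The function name halving_reduce and the basesize keyword are hypothetical, and the sketch assumes a non-empty collection that supports length; it is not how Transducers.jl is actually implemented.

```julia
using SplittablesBase: halve

# Hypothetical sketch: split the collection with `halve` until the pieces
# are small, reduce each piece sequentially with `foldl`, and combine the
# partial results with `op` (which is assumed to be associative).
function halving_reduce(op, xs; basesize = 256)
    if length(xs) <= basesize
        return foldl(op, xs)  # sequential base case
    end
    left, right = halve(xs)
    # Reduce the right half in another task while this task handles the left.
    task = Threads.@spawn halving_reduce(op, right; basesize = basesize)
    acc_left = halving_reduce(op, left; basesize = basesize)
    return op(acc_left, fetch(task))
end

@show halving_reduce(+, 1:1000; basesize = 16)
```

The data structure only has to say how to split itself (halve) and how to be traversed sequentially (foldl or iterate); everything else lives in the parallel processing function.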

I should probably open an RFC in JuliaLang/julia, but I've been a bit hesitant to do so since I don't feel this interface has been tested enough outside my own packages. I thought of this PR as a step toward accumulating such experience.

  1. What are some examples of what you could do w/ the implementation here?

The example in the OP with FLoops.jl is one thing. We'd also be able to use ThreadsX with this. I think ThreadsX.jl + OnlineStats.jl integration would be appealing to Tables.jl users. Underneath, they all boil down to Transducers.foldxt, which uses halve. For example, you can compute the min/max of the max/min over columns a, b and c by foldxt(ProductRF(min, max), table |> Map(r -> (max(r.a, r.b, r.c), min(r.a, r.b, r.c)))) in one go (OK, I have no idea when you need this particular function but it's a fun example).
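
Spelled out, the foldxt one-liner above looks roughly like this. To keep the sketch independent of this PR, the table here is assumed to be a plain Vector of NamedTuples (a row table), which is already an array; the data values are made up for illustration.

```julia
using Transducers

# Illustrative data: a row table stored as a Vector of NamedTuples.
table = [
    (a = 1, b = 5, c = 3),
    (a = 4, b = 2, c = 6),
    (a = 0, b = 7, c = 8),
]

# Each row is mapped to (row maximum, row minimum). ProductRF(min, max)
# then reduces the first component with min and the second with max.
result = foldxt(
    ProductRF(min, max),
    table |> Map(r -> (max(r.a, r.b, r.c), min(r.a, r.b, r.c))),
)
@show result
```

The first component of the result is the smallest row maximum and the second is the largest row minimum, both computed in a single parallel pass.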

  2. Why only implemented for NamedTupleIterator and not rows/columns more generally?

If the table is already an array, the generic fallback in SplittablesBase.jl covers it, so I don't need to add a specific implementation for RowTable. I can't provide an implementation for AbstractColumns (or column tables in general) because I'd like to keep halve and iterate consistent, in the sense that each halve implementation satisfies what I call the "vcat law":

(1) If the original collection is ordered, concatenating the
sub-collections returned by halve must create a collection that is
equivalent to the original collection. More precisely,

isequal(
    vec(collect(collection)),
    vcat(vec(collect(left)), vec(collect(right))),
)

must hold.

--- https://juliafolds.github.io/SplittablesBase.jl/dev/#SplittablesBase.halve
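
The law quoted above can be checked directly for a given collection. Below is a small sketch with a hypothetical helper, satisfies_vcat_law, applied to collections that the generic array fallback already handles; SplittablesTesting.test_ordered remains the proper test entry point.

```julia
using SplittablesBase: halve

# Hypothetical helper: check the "vcat law" for one ordered collection.
function satisfies_vcat_law(collection)
    left, right = halve(collection)
    return isequal(
        vec(collect(collection)),
        vcat(vec(collect(left)), vec(collect(right))),
    )
end

@show satisfies_vcat_law(1:10)
@show satisfies_vcat_law([(a = 1, b = 2.0), (a = 3, b = 4.0), (a = 5, b = 6.0)])
```

Because the check goes through collect, it ties halve to iteration order: a halve that split a table in an order that iterate does not follow would fail it.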

  3. What kind of package is SplittablesBase.jl? Minimal? Lots of changes? What kind of commitment to stability there?

My intention is to keep it very minimal, although I have to put the implementations for Base types there. It currently also contains the testing code. However, the public API for the testing code is the shim package SplittablesTesting, so I can move that code out at any point without introducing breaking changes.

I think it's almost 1.0-ready, but there is one point in the specification of an optional API, amount (JuliaFolds/SplittablesBase.jl#31), that I want to clarify before 1.0.

If you want to postpone merging this at least until SplittablesBase.jl hits 1.0, I think that's a very reasonable decision. I can extract this PR into a separate package, SplittableTables.jl, and make it work there (by touching the internals of Tables.jl a bit). But it'd be nice if we could tweak isequal as in this PR, since that is impossible to do outside Tables.jl without serious type piracy.

@mattwigway

This is a really useful PR, but I see @quinnj's point about introducing an extra dependency on all of the downstream dependents of Tables. Maybe it would make more sense to have a separate package SplittableTables.jl that implements halve() for table iterators?

@ParadaCarleton

ParadaCarleton commented Jun 23, 2023

If it is already an array, the generic fallback in SplittablesBase.jl covers it already.

I'm not sure this is true, given halve doesn't work with DataFrameRows (which is an AbstractVector).

This is a really useful PR, but I see @quinnj's point about introducing an extra dependency on all of the downstream dependents of Tables. Maybe it would make more sense to have a separate package SplittableTables.jl that implements halve() for table iterators?

I think SplittablesBase.jl is fine, given it's a very small dependency.

cc @quinnj and @MasonProtter -- being able to use JuliaFolds with Tables and DataFrames would be awesome.

4 participants