Skip to content
This repository has been archived by the owner on May 5, 2019. It is now read-only.

update vcat to mimic Base.vcat and enhance promotion rules of mixed column type #45

Merged
merged 2 commits into from
May 12, 2017
Merged

Conversation

cjprybol
Copy link
Contributor

@cjprybol cjprybol commented Mar 31, 2017

The current vcat operates on Arrays of DataTables DataTable[], while Base.vcat utilizes slurping to capture any number of inputs as a tuple or argument to concatenate. This PR changes vcat to follow Base.vcat's lead and utilizes a function structure of vcat(dts::AbstractDataTable...). Uses @nalimilan's @generated function to implement a new type of AbstractArray promotion rule that ideally can be tested here and transfered into Base in the near future JuliaLang/julia#20815. This also removes the prior assumptions that were made about how to join datatables that were out of order or had columns that were only present in some of the arguments and not others.

@cjprybol cjprybol changed the title update vcat to better mimic Base.vcat and better preserve column type update vcat to mimic Base.vcat and better preserve column type Mar 31, 2017
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks mostly good. Please copy the explanations to the commit message (BTW, I think by "slurping" you mean "splatting"?).

It's not completely correct to say "The current vcat operates on Arrays of DataTable", since the vararg form was also supported: what you are doing is that you remove the form taking a vector of tables.

allheaders = map(names, dts)
# don't vcat empty DataTables
notempty = find(x -> length(x) > 0, allheaders)
uniqueheaders = unique(allheaders[notempty])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this: is there a strong reason not to check headers of empty tables? If they don't match, it could indicate a bug in the user code, and I can't find a situation where this would be useful.

Also, it's not correct to return DataTable() in that case since that doesn't preserve the headers if only empty tables are passed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we agree here, but just to clarify

julia> vcat(DataTable(), DataTable(), DataTable()) #this is correct
0×0 DataTables.DataTable


julia> vcat(DataTable(), DataTable(), DataTable(x=[])) # this should be an error
0×1 DataTables.DataTable


julia> vcat(DataTable(), DataTable(), DataTable(x=[1])) # should also be an error
1×1 DataTables.DataTable
│ Row │ x │
├─────┼───┤
│ 11

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I meant.

estrings = Vector{String}(length(uniqueheaders))
for (i, u) in enumerate(uniqueheaders)
matchingloci = find(h -> u == h, allheaders)
headerdiff = filter(x -> !in(x, u), coldiff)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just setdiff(coldiff, u), right?

for (i, u) in enumerate(uniqueheaders)
matchingloci = find(h -> u == h, allheaders)
headerdiff = filter(x -> !in(x, u), coldiff)
headerdiff = join(string.(headerdiff), ", ", " and ")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid reusing variable names for values of different types, this creates a type instability. Same below.

Also string isn't needed here an below. So you might be able to put the commands directly inside the string literal.

filter!(u -> Set(u) != Set(unionunique), uniqueheaders)
estrings = Vector{String}(length(uniqueheaders))
for (i, u) in enumerate(uniqueheaders)
matchingloci = find(h -> u == h, allheaders)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "loci" really useful here?

for i in 1:length(cols)
data = [dt[i] for dt in dts_to_vcat]
res = promote_col_type(data...)(mapreduce(length, +, data))
cols[i] = copy!(res, vcat(data...))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling vcat will create a temporary array. Better repeatedly call copy! with a changing offset (see do argument).

@@ -193,7 +193,7 @@ combine(map(d -> mean(dropnull(d[:c])), gd))
"""
function combine(ga::GroupApplied)
gd, vals = ga.gd, ga.vals
valscat = vcat(vals)
valscat = vcat(vals...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this go into a separate commit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different commit but same PR? I can do that 👍

@cjprybol
Copy link
Contributor Author

cjprybol commented Apr 3, 2017

Thanks for the review! I'll get to these edits soon. re: slurping/splatting, I'm not sure if I used the right one. Both descriptions are used in the docs, but I'll say something more explicit in the commit to avoid any ambiguity.

@nalimilan
Copy link
Member

Funny, I didn't remember that term. It's fine then.

@cjprybol cjprybol changed the title update vcat to mimic Base.vcat and better preserve column type update vcat to mimic Base.vcat and enhance promotion rules of mixed column type Apr 4, 2017
@cjprybol
Copy link
Contributor Author

cjprybol commented Apr 4, 2017

That should address all of the comments. I added more tests to assert that the array promotion was working as intended, and a couple tests for vcatting 3 or more datatables to make sure my offset copy!s were working as intended. I think I'm finally starting to get squashing/rebasing/amending commits too. Coverage drops seem unrelated and should be addressed by #31.

@cjprybol
Copy link
Contributor Author

@nalimilan

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it's hard for me to keep up with the reviews. Should be good to go after this round.

for i in 1:length(cols)
data = [dt[i] for dt in dts]
lens = map(length, data)
cols[i] = promote_col_type(data...)(reduce(+, lens))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just sum(lens)? Same below.

cols[i] = promote_col_type(data...)(reduce(+, lens))
copy!(cols[i], data[1])
for j in 2:length(data)
offset = reduce(+, lens[1:j-1]) + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offset = lens[1] for initialization and offset += lens[j] after each iteration would be simpler.

test/cat.jl Outdated
@test eltypes(dt)[1] == Any
dt = vcat(DataTable([NullableArray([1])], [:x]), DataTable([[1]], [:x]))
@test dt == DataTable([NullableArray([1, 1])], [:x])
@test eltypes(dt)[1] == Nullable{Int}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rather test the type of the column, to make sure we get a NullableArray rather than an Array{Nullable}? Same below for (Nullable)CategoricalArray vs. Array{CategoricalValue}`.

test/grouping.jl Outdated
@test isequal(sum(dt2[:x]), Nullable(0))
@test size(by(e->1, DataTable(x=Int64[1]), :x)) == (1,2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add this line in particular, and check only its size? It looks out of place in this block. Either remove it or make it more complete in its own block.

@cjprybol
Copy link
Contributor Author

cjprybol commented May 2, 2017

No worries! It's not that easy to keep up with the rate at which you do manage to review either. Thanks, as always, for putting in the time to help me improve these PRs.

@cjprybol
Copy link
Contributor Author

cjprybol commented May 3, 2017

Coverage drop doesn't seem related here either. Several functions flagged by Coveralls are handled by #31

test/cat.jl Outdated
@test typeof.(dt.columns) == [NullableVector{Int}]
dt = vcat(DataTable([CategoricalArray([1])], [:x]), DataTable([[1]], [:x]))
@test dt == DataTable([CategoricalArray([1, 1])], [:x])
@test typeof.(dt.columns) == [CategoricalVector{Int, UInt32}]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave out the UInt32 (or replace it with CategoricalArrays.DefaultRefType, but probably not worth it).

test/grouping.jl Outdated
@@ -148,7 +148,6 @@ module TestGrouping
dt2 = by(e->1, DataTable(x=Int64[]), :x)
@test size(dt2) == (0,2)
@test isequal(sum(dt2[:x]), Nullable(0))
@test size(by(e->1, DataTable(x=Int64[1]), :x)) == (1,2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should put this in the first commit.

This change is helpful to support changing vcat to be more consistent
with Base.vcat, where passing an array is not allowed.
This PR removes vcat support for arrays of datatables and makes the
Base.vcat style of vcat(args...) the only call option. Removes
assumptions for joining datatables with missing, unique, and out of
order columns. vcat'ing datatables with unmatched headers results in
error messages that explain how the columns are inconsistent. Uses
@nalimilan's @generated function to implement a new type of
AbstractArray promotion rule that improves handling of NullableArrays
and CategoricalArrays. Extends vcat testing.
@nalimilan nalimilan merged commit 776f293 into JuliaData:master May 12, 2017
@cjprybol
Copy link
Contributor Author

Thanks!

This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants