This repository has been archived by the owner on May 5, 2019. It is now read-only.

Enhance joining and grouping #17

Merged: 33 commits, merged on Mar 6, 2017
Changes from 1 commit

Commits (33):
dd68a65  merge alyst groupby to DataTables (cjprybol, Feb 20, 2017)
a5fd472  passing tests (cjprybol, Feb 20, 2017)
9424201  Merge branch 'master' into cjp/alyst-groupby (cjprybol, Feb 20, 2017)
a652768  revert seemingly unrelated changes (cjprybol, Feb 20, 2017)
0cdf755  revert unnecessary changes for variable name and spacing (cjprybol, Feb 21, 2017)
d292dd3  fix indentation issue (cjprybol, Feb 21, 2017)
53774f5  add nonunique() (cjprybol, Feb 22, 2017)
2adc883  commit join.jl merge for Alyst to debug (cjprybol, Feb 22, 2017)
d52c791  make the easy changes requested during review (cjprybol, Feb 22, 2017)
5e9664a  add docstrings to row permutation functions (cjprybol, Feb 23, 2017)
e1b4d0e  clarify error message (cjprybol, Feb 23, 2017)
74c36d1  remove unused function (cjprybol, Feb 23, 2017)
de09a5c  update function to use isequal(a::Nullable, b::Nullable) from base (cjprybol, Feb 23, 2017)
160be5c  frame -> table (cjprybol, Feb 23, 2017)
7f28a14  update merge based on helpful diff (cjprybol, Feb 23, 2017)
61bf607  pass all tests that don't use Categorical (cjprybol, Feb 23, 2017)
6147d0c  added back commented out functions (cjprybol, Feb 23, 2017)
bab097f  minor cleanup (cjprybol, Feb 23, 2017)
cdac010  Merge branch 'master' into cjp/alyst-groupby (cjprybol, Feb 23, 2017)
8cf4a67  more changes suggested during review (cjprybol, Feb 23, 2017)
199f96b  use explicit vcat, indentation, parentheses (cjprybol, Feb 23, 2017)
f3b06a3  more indentation (cjprybol, Feb 23, 2017)
8308879  fix test/join.jl errors using `resize!` in (cjprybol, Feb 24, 2017)
1c842dc  passing all tests! (cjprybol, Feb 25, 2017)
637b8cf  update CategoricalArrays version (cjprybol, Feb 27, 2017)
49d6328  incorporate edits suggested during review (cjprybol, Mar 1, 2017)
b6c1f98  more fixes (cjprybol, Mar 3, 2017)
839c558  added tests and trimmed unnecessary functions (cjprybol, Mar 4, 2017)
46aaae2  update new function name (cjprybol, Mar 4, 2017)
01b3ce8  revert code deletions and address most comments (cjprybol, Mar 6, 2017)
7b9b8e2  revert bad edit, function is untested (cjprybol, Mar 6, 2017)
7fe0389  revert some changes, clean up tests (cjprybol, Mar 6, 2017)
cf0486a  change docstring to comment (cjprybol, Mar 6, 2017)
21 changes: 10 additions & 11 deletions src/datatablerow/datatablerow.jl
@@ -37,18 +37,17 @@ Base.convert(::Type{Array}, r::DataTableRow) = convert(Array, r.dt[r.row,:])

Base.collect(r::DataTableRow) = Tuple{Symbol, Any}[x for x in r]

# the equal elements of nullable and normal arrays would have the same hashes
const NULL_MAGIC = 0xBADDEED # what to hash if the element is null

# hash column element
Base.@propagate_inbounds hash_colel(v::AbstractArray, i, h::UInt = zero(UInt)) = hash(v[i], h)
Base.@propagate_inbounds hash_colel{T<:Nullable}(v::AbstractArray{T}, i, h::UInt = zero(UInt)) =
isnull(v[i]) ? hash(Base.nullablehash_seed, h) : hash(get(v[i]), h)
Member:
Also use unsafe_get here.
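A sketch of the suggestion (unsafe_get from Julia 0.6 Base, or Compat on 0.5): unsafe_get skips the null check that get performs, which is safe immediately after an explicit isnull test.

```julia
h = zero(UInt)
x = Nullable(42)
# get(x) re-checks isnull and can throw; unsafe_get(x) just reads the value,
# which is safe on this branch because isnull(x) was already tested:
isnull(x) ? hash(Base.nullablehash_seed, h) : hash(unsafe_get(x), h)
```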

Base.@propagate_inbounds hash_colel{T}(v::NullableArray{T}, i, h::UInt = zero(UInt)) =
- isnull(v, i) ? hash(NULL_MAGIC, h) : hash(get(v[i]), h)
+ isnull(v, i) ? hash(Base.nullablehash_seed, h) : hash(v.values[i], h)
Base.@propagate_inbounds hash_colel{T}(v::AbstractCategoricalArray{T}, i, h::UInt = zero(UInt)) =
hash(CategoricalArrays.index(v.pool)[v.refs[i]], h)
Member:
Are you sure this is really more efficient than the default hash method for CategoricalValue?

Contributor (alyst, Feb 22, 2017):
I guess it's not as efficient. But in these functions the constraint is to make hashes invariant to the hashed value representation: whether it's nullable or not, and whether it's stored "as is" or in a categorical array. Otherwise joins would not work (we could require that joins only use columns of identical types, but that would put too much overhead on the user side). So we have to check whether the default hash functions have this property (Nullable's, AFAIR, does not).

Member (nalimilan, Feb 22, 2017):
It would be surprising if it were significantly slower, since the code is very similar. Though since we need the special method for NullableCategoricalArray to avoid the cost of creating a Nullable just to unwrap it, I guess it doesn't matter too much what we do here.
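A minimal sketch of the invariant described above (array constructors assumed from NullableArrays.jl and CategoricalArrays.jl; hash_colel as defined in this file):

```julia
using NullableArrays, CategoricalArrays

# The same logical values stored three different ways:
v_plain    = [1, 2, 3]
v_nullable = NullableArray([1, 2, 3])
v_categ    = CategoricalArray([1, 2, 3])

# Hash-based grouping and joining across these column types relies on,
# for every non-null index i:
#   hash_colel(v_plain, i) == hash_colel(v_nullable, i) == hash_colel(v_categ, i)
# hence these methods unwrap values before hashing instead of relying on the
# default hash of Nullable or CategoricalValue.
```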

- Base.@propagate_inbounds function hash_colel{T}(v::AbstractCategoricalArray{T}, i, h::UInt = zero(UInt))
+ Base.@propagate_inbounds function hash_colel{T}(v::AbstractNullableCategoricalArray{T}, i, h::UInt = zero(UInt))
Contributor Author:
Need to add a test to distinguish this from the function above, per the now out-of-date comment.

AbstractCategoricalArray should be AbstractNullableCategoricalArray AFAICT. That also means a test is missing to catch this.

ref = v.refs[i]
- ref == 0 ? hash(NULL_MAGIC, h) : hash(CategoricalArrays.index(v.pool)[ref], h)
+ ref == 0 ? hash(Base.nullablehash_seed, h) : hash(CategoricalArrays.index(v.pool)[ref], h)
end

# hash of DataTable rows based on its values
@@ -79,7 +78,7 @@ function @compat(Base.:(==))(r1::DataTableRow, r2::DataTableRow)
end
return eq
else
r1.row == r2.row && return Nullable(true)
eq = Nullable(true)
@inbounds for col in columns(r1.dt)
eq_col = convert(Nullable{Bool}, col[r1.row] == col[r2.row])
@@ -104,13 +103,13 @@ function isequal_colel{T}(col::Union{NullableArray{T},
end

isequal_colel(a::Any, b::Any) = isequal(a, b)
Member:
These two-argument definitions are not needed AFAICT. The only place where they are called could use isequal directly instead.

Contributor Author:
Tests fail after making the changes. They all seem necessary to get around Nullable comparisons:

julia> using DataTables

julia> DataTables.isequal_colel(Nullable(1), 1)
true

julia> isequal(Nullable(1), 1)
false

Member:
Ah, right, I forgot about the need to unwrap nullables when comparing with non-nullables. That behavior is quite annoying, but well...

- isequal_colel(a::Nullable, b::Any) = !isnull(a) && isequal(get(a), b)
+ isequal_colel(a::Nullable, b::Any) = !isnull(a) & isequal(unsafe_get(a), b)
isequal_colel(a::Any, b::Nullable) = isequal_colel(b, a)
- isequal_colel(a::Nullable, b::Nullable) = isnull(a)==isnull(b) && (isnull(a) || isequal(get(a), get(b)))
+ isequal_colel(a::Nullable, b::Nullable) = isnull(a)==isnull(b) && (isnull(a) || isequal(a, b))
Contributor:
I suppose here it could be just:

isequal_colel(a::Nullable, b::Nullable) = isequal(a, b)


# comparison of DataTable rows
function isequal_row(dt::AbstractDataTable, r1::Int, r2::Int)
- (r1 == r2) && return true # same raw
+ (r1 == r2) && return true # same row
@inbounds for col in columns(dt)
isequal_colel(col, r1, r2) || return false
end
@@ -120,7 +119,7 @@ end
function isequal_row(dt1::AbstractDataTable, r1::Int, dt2::AbstractDataTable, r2::Int)
(dt1 === dt2) && return isequal_row(dt1, r1, r2)
(ncol(dt1) == ncol(dt2)) ||
- throw(ArgumentError("Rows of the data frames that have different number of columns cannot be compared ($(ncol(dt1)) and $(ncol(dt2)))"))
+ throw(ArgumentError("Rows of the data tables that have different number of columns cannot be compared ($(ncol(dt1)) and $(ncol(dt2)))"))
Member:
"of data tables"
Also, inside the parentheses, probably more explicit to say "got X and Y columns".

@inbounds for (col1, col2) in zip(columns(dt1), columns(dt2))
isequal_colel(col1[r1], col2[r2]) || return false
end
26 changes: 13 additions & 13 deletions src/datatablerow/utils.jl
@@ -25,11 +25,11 @@ function hashrows_col!(h::Vector{UInt}, v::AbstractVector)
h
end

- function hashrows_col!{T}(h::Vector{UInt}, v::AbstractVector{T})
+ function hashrows_col!{T<:Nullable}(h::Vector{UInt}, v::AbstractVector{T})
@inbounds for i in eachindex(h)
Member:
Can't you call hash_colel instead of rewriting the hashing logic? That would allow merging all the similar methods.

Contributor:
AFAIR, the hashrows_col!() methods were supposed to provide optimizations of the hashing for specific data types (e.g. a pre-built hash for CategoricalArray). But probably most of the current methods could be replaced by the default implementation that uses hash_colel().

Member:
Yes, hash_colel can still be specialized, but hashrows_col! doesn't need to be, AFAICT.
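A sketch of the merge the reviewer suggests: one generic hashrows_col! that delegates per-element hashing to hash_colel, so only hash_colel needs type-specific methods (a sketch of the suggestion, not the code the PR settled on):

```julia
# Generic column hasher: mixes each element's hash into the running
# per-row hash h[i]; all type-specific logic stays in hash_colel.
function hashrows_col!(h::Vector{UInt}, v::AbstractVector)
    @inbounds for i in eachindex(h)
        h[i] = hash_colel(v, i, h[i])
    end
    h
end
```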

h[i] = isnull(v[i]) ?
- hash(NULL_MAGIC, h[i]) :
- hash(v[i], h[i])
+ hash(Base.nullablehash_seed, h[i]) :
Member:
Should be h[i] + nullablehash_seed, no need to call hash in that case (like in base/nullable.jl).

Member:
Also applies above.

Contributor Author:
I got hash errors after making these changes (and other failures probably downstream of this):

Test Failed
  Expression: dt_rowhashes == [hash(dr) for dr = eachrow(dt)]
   Evaluated: UInt64[0x1bdefb2976bd94c3,0xdd9fa2a42135ac50,0x1e6098864a87ed7d,0x1bdefb2976bd94c3,0xdd9fa2a42135ac50,0x0e42569badde05fc] == UInt64[0x1bdefb2976bd94c3,0xbb2ac60630b5eb56,0x1e6098864a87ed7d,0x1bdefb2976bd94c3,0xbb2ac60630b5eb56,0xd4ac22e6c0fc8065]

I think you're referencing this line, but hash(Base.nullablehash_seed, h[i]) doesn't call that function, since the first argument is not a Nullable but Base.nullablehash_seed.

Member:
I didn't mean to imply that that function was called, just that this would be consistent with it. I don't think changing this should generate any failures, since the hash functions are completely under our control. Have you made the same change in all places where hash is involved in the PR?

+ hash(unsafe_get(v[i]), h[i])
end
h
end
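For context on the thread above, a sketch of the two null-hashing conventions under discussion: the hash(Base.nullablehash_seed, h) call used in this diff versus the h + nullablehash_seed addition that Base's hash(::Nullable, ::UInt) uses (as of Julia 0.5). They produce different values, which is why changing one call site without the others breaks the row-hash tests:

```julia
h = zero(UInt)
a = hash(Base.nullablehash_seed, h)   # convention used in this diff
b = h + Base.nullablehash_seed        # convention used by Base for null Nullables
a == b                                # false: the two conventions disagree
```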
@@ -48,7 +48,7 @@ function hashrows_col!{T}(h::Vector{UInt}, v::AbstractNullableCategoricalVector{
# TODO is it possible to optimize by hashing the pool values once?
@inbounds for (i, ref) in enumerate(v.refs)
h[i] = ref == 0 ?
- hash(NULL_MAGIC, h[i]) :
+ hash(Base.nullablehash_seed, h[i]) :
Member:
Same here.

hash(CategoricalArrays.index(v.pool)[ref], h[i])
end
h
@@ -72,7 +72,7 @@ end
# the indices of the first row in a group
# Optional group vector is set to the group indices of each row
function row_group_slots(dt::AbstractDataTable,
groups::Union{Vector{Int}, Void} = nothing)
@assert groups === nothing || length(groups) == nrow(dt)
rhashes = hashrows(dt)
sz = Base._tablesz(length(rhashes))
Member:
Can you add a comment explaining why we need this?

Contributor (alyst, Feb 22, 2017):
This function is based on the Dict hash table from Base. AFAIR, this line should reduce the number of reallocations, since we know the final hash size.

Member:
OK. So it would be good to explain that it's based on the Base code. Isn't there any generic data structure that we could use instead of custom code? Would it make sense to have this in e.g. DataStructures.jl?

Member:
@alyst Could you explain what the general strategy is here and how it differs from the current one? Can it be as fast as the current one when grouping on CategoricalArray columns? AFAICT hash-based lookups are necessarily quite a bit slower than integer-based indexing.

While the benchmark improvements compared with master are great, they don't seem to match the performance I obtained in #12 for the most efficient case. Of course this PR greatly improves performance for other cases, which are quite common, but ideally we would get the best possible performance for each case.

See also my question above.

Contributor:
@nalimilan I'm not quite sure I understand what you mean by the general strategy; I partly explained it above. Here we use hashes to group identical rows. AFAIR, at the time I wrote the original PR, DataFrames used a temporary array of row ids, indexed like this:

row_id = index(a[i]) + N(a)*(index(b[i]) + N(b)*(index(c[i]) + ...))

where a, b, c are columns, i is the row index, index() gives the unique integer identifier for a column value, and N() is the number of unique elements in a column. As you can see, the size of that array had to be at least N(a)*N(b)*N(c)*..., which led to memory overflow quite easily (or at least to very inefficient memory usage). Hashing solves this issue.
For single-column joins over categorical arrays one could potentially write an alternative version, although it would still need to use the element value hashes, so that joins between categorical and non-categorical columns work properly.
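A small worked illustration of the blow-up described above (all numbers hypothetical):

```julia
# One row's per-column integer codes, and the number of unique levels per column:
index_a, index_b, index_c = 3, 7, 2
N_a, N_b, N_c = 1000, 1000, 1000

# Mixed-radix row id as in the old cartesian scheme:
row_id = index_a + N_a*(index_b + N_b*index_c)   # 2_007_003

# The id space is N_a*N_b*N_c = 10^9 possible ids regardless of how many
# distinct rows actually occur, so the lookup array explodes; hashing keeps
# the table proportional to the number of rows instead.
```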

Member:
Yes, the old code overflowed quite easily, but Pandas (from which it was inspired) has a solution that compresses the integer codes to avoid that. I'm not saying it's necessarily the best solution, but they must have considered this question quite carefully. So before adopting a different approach, I'd like to be sure it can be about as fast as Pandas. I guess the only way to find out is to benchmark it.

Contributor:
@nalimilan Benchmarking is a nice idea anyway, because a proper benchmark should be based on big-data-derived datasets (millions of rows, arrays with >1000 categories, multi-column joins using different data types, etc.). There's a long-term benefit in having it. I was testing PR850 using somewhat realistic datasets. Unfortunately, ATM I don't have the test system or the time to help you with the benchmark.
Re Pandas or master vs the PR approach: IMO the main performance determinant is memory allocation and memory access patterns. Hashing should not be significantly slower than Pandas indexing, with the exception of grouping/joining by one categorical column, as we discussed above.

Member:
Could you also explain why we can't simply use a Dict here?

Contributor:
One of my intermediate implementations used Dict{DataFrameRow, ...}, but the current one has more efficient memory access patterns when generating the hashes (by column instead of by row). And it doesn't allocate memory for Dict values (the values would be the sets of row indices of each group), because this information can be retrieved efficiently from RowGroupDict (rperm, starts, etc., which are just vectors of integers).

Member:
OK, thanks, it's good to have this written down somewhere.
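A sketch of the retrieval just described, using the RowGroupDict field names from this PR (rperm, starts, stops):

```julia
# rperm permutes row indices so each group's rows are contiguous;
# starts[gix]:stops[gix] is group gix's span inside rperm. A group's rows
# therefore come back as a view, with no per-group allocation:
rows_of_group(rperm, starts, stops, gix) = view(rperm, starts[gix]:stops[gix])
```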

Member:
Can you just add a comment mentioning that this code is inspired by the dict code from Base?
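To make the Base.Dict-inspired strategy discussed in this thread concrete, a minimal self-contained sketch (not the PR's actual row_group_slots: isequal_rows stands in for the row comparison, and the real code additionally bounds the probe count):

```julia
# Group rows by hash using an open-addressing table with linear probing,
# sized to a power of two like Base.Dict.
function group_ids(rhashes::Vector{UInt}, isequal_rows)
    sz = Base._tablesz(length(rhashes))   # power-of-two table size, as in the PR
    szm1 = sz - 1
    gslots = zeros(Int, sz)               # slot -> index of a group's first row (0 = empty)
    groups = Vector{Int}(length(rhashes))
    ngroups = 0
    @inbounds for i in eachindex(rhashes)
        slotix = Int(rhashes[i] & szm1) + 1        # initial slot from the row hash
        while true
            g_row = gslots[slotix]
            if g_row == 0                          # empty slot: row starts a new group
                gslots[slotix] = i
                ngroups += 1
                groups[i] = ngroups
                break
            elseif rhashes[i] == rhashes[g_row] && isequal_rows(i, g_row)
                groups[i] = groups[g_row]          # equal row found: reuse its group
                break
            end
            slotix = slotix & szm1 + 1             # linear probing: try the next slot
        end
    end
    return ngroups, groups
end
```

With rhashes produced column-wise (as hashrows does) and isequal_rows comparing two rows of the table, this yields per-row group ids analogous to what row_group_slots returns.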

@@ -106,7 +106,7 @@ function row_group_slots(dt::AbstractDataTable,
end
slotix = slotix & szm1 + 1 # check the next slot
probe += 1
- probe < sz || error("Cannot find free row slot")
+ @assert probe < sz
end
if groups !== nothing
groups[i] = gix
Expand All @@ -115,15 +115,15 @@ function row_group_slots(dt::AbstractDataTable,
return ngroups, rhashes, gslots
end

- # Builds RowGroupDict for a given dataframe.
+ # Builds RowGroupDict for a given datatable.
# Partly uses the code of Wes McKinney's groupsort_indexer in pandas (file: src/groupby.pyx).
Member:
Should update the name of the file while we remember it.

function group_rows(dt::AbstractDataTable)
groups = Vector{Int}(nrow(dt))
ngroups, rhashes, gslots = row_group_slots(dt, groups)

# count elements in each group
stops = zeros(Int, ngroups)
- for g_ix in groups
+ @inbounds for g_ix in groups
stops[g_ix] += 1
end

@@ -170,20 +170,20 @@ function findrow(gd::RowGroupDict, dt::DataTable, row::Int)
return 0 # not found
end

- # Finds indices of rows in 'gd' that match given row by content.
- # returns empty set if no row matches
+ # Find indices of rows in 'gd' that match given row by content.
+ # return empty set if no row matches
function Base.get(gd::RowGroupDict, dt::DataTable, row::Int)
Member:
This function and the next one don't seem to match the signature and behavior of the corresponding Base methods. Better give them a separate name or adapt their signature so that they take a single key/index.

g_row = findrow(gd, dt, row)
- (g_row == 0) && return Compat.view(gd.rperm, 0:-1)
+ (g_row == 0) && return view(gd.rperm, 0:-1)
gix = gd.groups[g_row]
- return Compat.view(gd.rperm, gd.starts[gix]:gd.stops[gix])
+ return view(gd.rperm, gd.starts[gix]:gd.stops[gix])
end

function Base.getindex(gd::RowGroupDict, dtr::DataTableRow)
g_row = findrow(gd, dtr.dt, dtr.row)
(g_row == 0) && throw(KeyError(dtr))
gix = gd.groups[g_row]
- return Compat.view(gd.rperm, gd.starts[gix]:gd.stops[gix])
+ return view(gd.rperm, gd.starts[gix]:gd.stops[gix])
end

# Check if there is matching row in gd
4 changes: 2 additions & 2 deletions src/groupeddatatable/grouping.jl
@@ -122,9 +122,9 @@ function groupby{T}(dt::AbstractDataTable, cols::Vector{T}; sort::Bool = false)
dt_groups = group_rows(sdt)
# sort the groups
if sort
- group_perm = sortperm(sub(sdt, dt_groups.rperm[dt_groups.starts]))
+ group_perm = sortperm(view(sdt, dt_groups.rperm[dt_groups.starts]))
permute!(dt_groups.starts, group_perm)
- permute!(dt_groups.stops, group_perm)
+ Base.permute!!(dt_groups.stops, group_perm)
end
GroupedDataTable(dt, cols, dt_groups.rperm,
dt_groups.starts, dt_groups.stops)
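On the last two changed lines, presumably: permute! leaves the permutation vector intact (it works on an internal copy), while Base.permute!! is allowed to scramble the permutation it is given and so avoids that copy. That makes it safe only on the final use of group_perm, which would be why the diff keeps permute! for starts and switches only the stops call. A small sketch of the difference (Julia 0.5-era Base semantics):

```julia
p = [2, 3, 1]
v = [10, 20, 30]
permute!(v, p)          # v == [20, 30, 10]; p is still [2, 3, 1]

w = [10, 20, 30]
Base.permute!!(w, p)    # w == [20, 30, 10]; p's contents are no longer meaningful
```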