groupby() and InexactError (again) #985

Closed · tcovert opened this issue Jun 1, 2016 · 15 comments

@tcovert commented Jun 1, 2016

I've found another way to break groupby(): pass it a DataFrame and a set of columns whose Cartesian product of levels exceeds what a 32-bit integer can address, even though the actual number of existing groups is much smaller. The issue arises on line 106 of grouping.jl:

ngroups = ngroups * (length(dv.pool) + dv_has_nas)

It seems that JMW (or someone else) predicted this would eventually be a problem, as seen in the comment on the following line...

Here is an MWE: https://gist.github.com/tcovert/6df691c5308e1ddd6c5103804cc2bb05
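
For illustration, here is a minimal sketch (with hypothetical pool sizes, not the gist's exact data) of how the running product computed on line 106 exceeds what a 32-bit integer can hold even when only a handful of groups actually occur:

# Hypothetical per-column pool sizes: the combined key space is their
# product, independent of how many combinations occur in the data.
pool_sizes = [1500, 100, 100, 100, 100]

ngroups = prod(pool_sizes)     # 1_500_000_000_000, the line-106 running product
ngroups > typemax(UInt32)      # true: storing such a code in a Vector{UInt32}
                               # is what throws InexactError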

@nalimilan (Member)

Good catch, though I wonder how often this can happen in practice. Have you observed this with real data?

I guess we could drop unused levels from dv before calling groupsort_indexer. This should probably be done only when ngroups is high, though, since that will require going over the whole vector. Maybe a good rule would be to do it when ngroups > length(dv), since groupsort_indexer needs to allocate and go over a vector with ngroups elements.

I would have sworn DataArrays included a function to drop unused levels and recode the values accordingly, but it doesn't seem to exist. Not very hard to write, though. Would you give it a try?
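
A rough sketch of such a helper (hypothetical code, not an existing DataArrays API) that drops pool levels never referenced by the data and recodes the refs accordingly:

# refs uses 0 to encode NA and i > 0 to index into pool, as in DataArrays.
function drop_unused_levels(refs::Vector{Int}, pool::Vector)
    used = falses(length(pool))
    for r in refs
        r > 0 && (used[r] = true)
    end
    remap = zeros(Int, length(pool))   # old level index -> new level index
    newpool = similar(pool, 0)
    for i in eachindex(pool)
        if used[i]
            push!(newpool, pool[i])
            remap[i] = length(newpool)
        end
    end
    newrefs = [r > 0 ? remap[r] : 0 for r in refs]
    return newrefs, newpool
end

# Levels "b" and "d" never occur, so they are dropped and "c" is recoded:
drop_unused_levels([1, 3, 1, 0], ["a", "b", "c", "d"])  # ([1, 2, 1, 0], ["a", "c"])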

@tcovert (Author) commented Jun 1, 2016

I only find bugs when they arise in real-world data! In my case, it was a groupby() over 5 columns, one of which had about 1500 unique values while the other 4 each had fewer than 100. In total there are about half a million unique groups in the data, even though the Cartesian product is on the order of 15 billion.

@tcovert (Author) commented Jun 1, 2016

By the way, I don't think dropping unused levels before groupsort_indexer will work. The InexactError() I get happens on line 106, which is strictly before the call to groupsort_indexer.

@merl-dev commented Oct 9, 2016

I seem to be running into the same issue. Is there a suggested (simple) workaround other than running this section of our pipeline through R?

@nalimilan (Member)

I don't think so. Fixing it shouldn't require too much work, but it's not a trivial change either.

@tcovert (Author) commented Oct 9, 2016

Isn't this going to be fixed under the CategoricalArrays branch?

@nalimilan (Member)

Not really, as that's mostly orthogonal (have a look at the current code in master). If you want to help, looking at how Pandas handles this would make it easier for somebody to implement a solution.

@nalimilan (Member)

Something you can try is replacing UInt32 on line 96 with UInt64, which should hopefully be enough for a reasonable number of combinations.

@merl-dev

I tried that before finding this issue; this has been a long-standing problem with using Julia for our data processing. We often (ad tech) have dataframes of >100K rows with 5 or more variables, one of which usually has about 5K-10K unique values while the others have <100. We need to group and summarize the data, and we regularly run that part of the process through R, which does the work in a blink. The Julia workaround involves a lot of extra code and loops, and adds significant runtime to the process.

Using the example at https://gist.github.com/tcovert/6df691c5308e1ddd6c5103804cc2bb05, with UInt32 we get:

julia> gb0 = groupby(df, [:v1, :v2, :v3])
ERROR: OutOfMemoryError()
 in groupsort_indexer(::Array{UInt32,1}, ::Int64, ::Bool) at /home/ubuntu/.julia/v0.5/DataArrays/src/grouping.jl:13
 in groupsort_indexer(::Array{UInt32,1}, ::Int64) at /home/ubuntu/.julia/v0.5/DataArrays/src/grouping.jl:5
 in groupby(::DataFrames.DataFrame, ::Array{Symbol,1}) at /home/ubuntu/.julia/v0.5/DataFrames/src/groupeddataframe/grouping.jl:109

julia> gb1 = groupby(df, [:v1, :v2, :v3, :v4])
ERROR: InexactError()
 in setindex!(::Array{UInt32,1}, ::Int64, ::Int64) at ./array.jl:415
 in groupby(::DataFrames.DataFrame, ::Array{Symbol,1}) at /home/ubuntu/.julia/v0.5/DataFrames/src/groupeddataframe/grouping.jl:104

and with UInt64:

julia> gb0 = groupby(df, [:v1, :v2, :v3])
ERROR: OutOfMemoryError()
 in groupsort_indexer(::Array{UInt64,1}, ::Int64, ::Bool) at /home/ubuntu/.julia/v0.5/DataArrays/src/grouping.jl:13
 in groupsort_indexer(::Array{UInt64,1}, ::Int64) at /home/ubuntu/.julia/v0.5/DataArrays/src/grouping.jl:5
 in groupby(::DataFrames.DataFrame, ::Array{Symbol,1}) at /home/ubuntu/.julia/v0.5/DataFrames/src/groupeddataframe/grouping.jl:109

julia> gb1 = groupby(df, [:v1, :v2, :v3, :v4])
ERROR: OutOfMemoryError()
 in groupsort_indexer(::Array{UInt64,1}, ::Int64, ::Bool) at /home/ubuntu/.julia/v0.5/DataArrays/src/grouping.jl:7
 in groupsort_indexer(::Array{UInt64,1}, ::Int64) at /home/ubuntu/.julia/v0.5/DataArrays/src/grouping.jl:5
 in groupby(::DataFrames.DataFrame, ::Array{Symbol,1}) at /home/ubuntu/.julia/v0.5/DataFrames/src/groupeddataframe/grouping.jl:109
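
Both failures trace back to the size of the key space. A back-of-the-envelope sketch (using hypothetical level counts on the scale discussed above) of why UInt64 merely trades InexactError for OutOfMemoryError:

ngroups = 1500 * 56^4               # ≈ 1.5e10 possible combinations

ngroups > typemax(UInt32)           # true: with UInt32 codes -> InexactError

# With UInt64 the codes fit, but groupsort_indexer still allocates one
# counter per *possible* group rather than per observed group:
(ngroups + 1) * sizeof(Int) / 2^30  # ≈ 110 GiB -> OutOfMemoryError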

@joshbode (Contributor) commented Nov 26, 2016

My simple workaround for the same issue when using the by function was to group each dimension sequentially, e.g.

# Group by the first column, then recurse on each sub-frame with the rest;
# note shift!(cols) mutates the caller's vector (popfirst! on Julia >= 1.0).
by_(d::AbstractDataFrame, cols, f::Function) = by(d, shift!(cols), isempty(cols) ? f : (x) -> by_(x, copy(cols), f))
by_(f::Function, d::AbstractDataFrame, cols) = by_(d, cols, f)

Probably not as efficient as it could be, but it got the job done. As a bonus, if you have some dimensions that you know can be grouped together without exceeding the limit, you can specify a "group-down" path, e.g.

by_(d, [[:x, :y], :z], f)

and have fewer intermediate calls to by.
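
For example, with the columns from the gist and a hypothetical summary function, the helper never computes a combined group code:

# Each recursion level groups by a single column, so the per-call key
# space is just that column's level count.
by_(df, [:v1, :v2, :v3, :v4], sub -> DataFrame(n = size(sub, 1)))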

@cjprybol (Contributor) commented Mar 6, 2017

Fixed in DataTables JuliaData/DataTables.jl#17
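
For context, the general technique for making grouping scale with observed rather than possible groups (a generic sketch, not necessarily what the linked PR implements) is to assign group ids through a dictionary keyed on each row's tuple of values:

# Group ids grow with the number of *observed* combinations, so neither
# integer overflow nor key-space-sized allocations can occur.
function group_codes(cols::Vector{<:AbstractVector})
    seen = Dict{Any,Int}()
    n = length(first(cols))
    codes = Vector{Int}(undef, n)
    for i in 1:n
        key = ntuple(j -> cols[j][i], length(cols))
        codes[i] = get!(seen, key, length(seen) + 1)
    end
    return codes, length(seen)
end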

@ararslan (Member) commented Mar 6, 2017

The fix in DataTables should be ported over here to address this issue.

@nalimilan (Member)

Yeah, but that would take quite some work that I'd rather put into improving the new framework. The priority for DataFrames is to have it work on Julia 0.6...

@ararslan (Member) commented Mar 6, 2017

Right, I just wanted to be sure this issue wasn't closed prematurely.

@quinnj (Member) commented Sep 7, 2017

Closing, as the fix has been ported over here.

quinnj closed this as completed on Sep 7, 2017