Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide what to do with null values in grouping #1249

Closed
nalimilan opened this issue Oct 10, 2017 · 3 comments
Closed

Decide what to do with null values in grouping #1249

nalimilan opened this issue Oct 10, 2017 · 3 comments

Comments

@nalimilan
Copy link
Member

On current master, and since JuliaData/DataTables.jl#17, we create separate groups for null values:

julia> groupby(df, :x)
DataFrames.GroupedDataFrame  2 groups with keys: Symbol[:x]
First Group:
1×1 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x    │
├─────┼──────┤
│ 1   │ null │

Last Group:
1×1 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x │
├─────┼───┤
│ 10

The old code used to discard nulls silently, without an option to retain them (just like Pandas does). Since the behavior changes so radically, we should decide what's the best default before tagging a release. I think keeping nulls is a good default as it's consistent with what we do everywhere. But we should probably provide an argument to skip them so that people can get the old behavior easily. This is not so easy to implement when there are multiple grouping variables, each of which can be null independently from others.

A second issue is that by default we no longer sort groups, which means that null groups are not necessarily at the end. Maybe we should enable sorting by default. IIRC it isn't very costly it is in terms of performance, and I suspect it would be a more useful default.

Note that when using N grouping variables with nulls, there will be (2^N)-1 groups with a null. We could imagine providing an option to collapse all of these into a single "at least one null" group, but probably not by default so that can wait.

@nalimilan nalimilan added this to the 0.11 milestone Oct 10, 2017
@cjprybol
Copy link
Contributor

I'd prefer requiring explicit handling of nulls and treat them as a group just like the other values. The behavior you describe of:

The old code used to discard nulls silently, without an option to retain them

is the kind of thing (my) data analysis nightmares are made of. Groups with missing data could be one of the most informative or important groups in the dataset, I think it makes sense to support them by default as first-class citizens. Removing them by default would be at best an annoyance and at worst a paper retraction waiting to happen.

I think the best argument in favor of handling nulls in groupby is that we still don't have great support for dropping nulls on subsets of the DataFrame as was brought up in JuliaData/DataTables.jl#38, but I think we'd be better served by implementing something like @mkborregaard's proposal of allowing columns to be specified.

e.g. groupby(dropnull(df, :x), :x) would achieve the same purpose but allows more flexibility (because the functions can be used independently) and better re-use of the DataFrames method library than implementing something like groupby(df, :x, dropnull=true) would.

IIRC it isn't very costly it is in terms of performance, and I suspect it would be a more useful default.

I'd actually have a preference of moving the groupby sort functionality into a new method of sort(::GroupedDataFrame) and then requiring users to sort the DataFrame before sending it to groupby or to sort the output using that new function, but that's starting to get a little off-topic... Anyway, benchmarks with the TL;DR of sorting is always expensive :)

julia> using DataFrames, BenchmarkTools

julia> srand(1)
MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[1749029653, 1072851681, 1610647787, 1072862326, 1841712345, 1073426746, -198061126, 1073322060, -156153802, 1073567984    1977574422, 1073209915, 278919868, 1072835605, 1290372147, 18858467, 1815133874, -1716870370, 382, 0]), [1.02678, 1.46735, 1.12098, 1.81267, 1.39528, 1.62253, 1.39736, 1.92446, 1.02628, 1.35725    1.84562, 1.62919, 1.63941, 1.64108, 1.55719, 1.67337, 1.26605, 1.19832, 1.58489, 1.89534], 382)

julia> a = DataFrame(x = shuffle!(vcat(1:3, null)));

julia> b = DataFrame(x = shuffle!(vcat(1:10^1, null)));

julia> c = DataFrame(x = shuffle!(vcat(1:10^2, null)));

julia> d = DataFrame(x = shuffle!(vcat(1:10^3, null)));

julia> e = DataFrame(x = shuffle!(vcat(1:10^4, null)));

julia> f = DataFrame(x = shuffle!(vcat(1:10^5, null)));

julia> g = DataFrame(x = shuffle!(vcat(1:10^6, null)));

julia> @benchmark groupby(a, :x)
BenchmarkTools.Trial:
  memory estimate:  3.33 KiB
  allocs estimate:  49
  --------------
  minimum time:     4.101 μs (0.00% GC)
  median time:      4.392 μs (0.00% GC)
  mean time:        5.302 μs (10.16% GC)
  maximum time:     787.240 μs (97.34% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark groupby(a, :x, sort=true)
BenchmarkTools.Trial:
  memory estimate:  4.13 KiB
  allocs estimate:  61
  --------------
  minimum time:     5.792 μs (0.00% GC)
  median time:      6.553 μs (0.00% GC)
  mean time:        7.709 μs (8.81% GC)
  maximum time:     858.635 μs (95.88% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark groupby(sort!($(deepcopy(a))), :x)
BenchmarkTools.Trial:
  memory estimate:  4.09 KiB
  allocs estimate:  65
  --------------
  minimum time:     8.070 μs (0.00% GC)
  median time:      9.393 μs (0.00% GC)
  mean time:        10.809 μs (5.82% GC)
  maximum time:     1.531 ms (96.99% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark groupby(sort(a), :x)
BenchmarkTools.Trial:
  memory estimate:  5.11 KiB
  allocs estimate:  78
  --------------
  minimum time:     9.640 μs (0.00% GC)
  median time:      11.639 μs (0.00% GC)
  mean time:        13.049 μs (5.63% GC)
  maximum time:     3.785 ms (97.89% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark groupby(b, :x)
BenchmarkTools.Trial:
  memory estimate:  3.75 KiB
  allocs estimate:  56
  --------------
  minimum time:     4.315 μs (0.00% GC)
  median time:      4.668 μs (0.00% GC)
  mean time:        5.394 μs (9.50% GC)
  maximum time:     587.941 μs (96.97% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark groupby(b, :x, sort=true)
BenchmarkTools.Trial:
  memory estimate:  4.73 KiB
  allocs estimate:  68
  --------------
  minimum time:     9.680 μs (0.00% GC)
  median time:      11.251 μs (0.00% GC)
  mean time:        12.269 μs (6.53% GC)
  maximum time:     4.373 ms (97.46% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark groupby(sort!($(deepcopy(b))), :x)
BenchmarkTools.Trial:
  memory estimate:  4.64 KiB
  allocs estimate:  72
  --------------
  minimum time:     9.463 μs (0.00% GC)
  median time:      11.547 μs (0.00% GC)
  mean time:        13.292 μs (6.06% GC)
  maximum time:     4.426 ms (97.91% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark groupby(sort(b), :x)
BenchmarkTools.Trial:
  memory estimate:  5.66 KiB
  allocs estimate:  85
  --------------
  minimum time:     13.032 μs (0.00% GC)
  median time:      15.735 μs (0.00% GC)
  mean time:        17.852 μs (4.95% GC)
  maximum time:     5.080 ms (97.28% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark groupby(c, :x)
BenchmarkTools.Trial:
  memory estimate:  9.61 KiB
  allocs estimate:  146
  --------------
  minimum time:     6.957 μs (0.00% GC)
  median time:      7.708 μs (0.00% GC)
  mean time:        9.782 μs (16.61% GC)
  maximum time:     1.402 ms (96.74% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark groupby(c, :x, sort=true)
BenchmarkTools.Trial:
  memory estimate:  13.14 KiB
  allocs estimate:  159
  --------------
  minimum time:     75.216 μs (0.00% GC)
  median time:      78.611 μs (0.00% GC)
  mean time:        84.476 μs (2.30% GC)
  maximum time:     5.431 ms (97.12% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark groupby(sort!($(deepcopy(c))), :x)
BenchmarkTools.Trial:
  memory estimate:  12.34 KiB
  allocs estimate:  163
  --------------
  minimum time:     38.252 μs (0.00% GC)
  median time:      40.498 μs (0.00% GC)
  mean time:        46.064 μs (3.96% GC)
  maximum time:     3.864 ms (97.58% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark groupby(sort(c), :x)
BenchmarkTools.Trial:
  memory estimate:  13.36 KiB
  allocs estimate:  176
  --------------
  minimum time:     76.977 μs (0.00% GC)
  median time:      81.806 μs (0.00% GC)
  mean time:        91.327 μs (2.72% GC)
  maximum time:     6.243 ms (96.28% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark groupby(d, :x)
BenchmarkTools.Trial:
  memory estimate:  66.36 KiB
  allocs estimate:  1051
  --------------
  minimum time:     45.072 μs (0.00% GC)
  median time:      53.142 μs (0.00% GC)
  mean time:        73.577 μs (20.75% GC)
  maximum time:     6.136 ms (97.84% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark groupby(d, :x, sort=true)
BenchmarkTools.Trial:
  memory estimate:  255.80 KiB
  allocs estimate:  11366
  --------------
  minimum time:     1.068 ms (0.00% GC)
  median time:      1.122 ms (0.00% GC)
  mean time:        1.262 ms (7.28% GC)
  maximum time:     12.725 ms (83.42% GC)
  --------------
  samples:          3957
  evals/sample:     1

julia> @benchmark groupby(sort!($(deepcopy(d))), :x)
BenchmarkTools.Trial:
  memory estimate:  162.80 KiB
  allocs estimate:  5925
  --------------
  minimum time:     497.095 μs (0.00% GC)
  median time:      519.817 μs (0.00% GC)
  mean time:        601.688 μs (8.28% GC)
  maximum time:     10.612 ms (92.50% GC)
  --------------
  samples:          8288
  evals/sample:     1

julia> @benchmark groupby(sort(d), :x)
BenchmarkTools.Trial:
  memory estimate:  264.25 KiB
  allocs estimate:  12366
  --------------
  minimum time:     1.091 ms (0.00% GC)
  median time:      1.138 ms (0.00% GC)
  mean time:        1.283 ms (7.41% GC)
  maximum time:     12.676 ms (90.59% GC)
  --------------
  samples:          3896
  evals/sample:     1

julia> @benchmark groupby(e, :x)
BenchmarkTools.Trial:
  memory estimate:  678.27 KiB
  allocs estimate:  10057
  --------------
  minimum time:     323.982 μs (0.00% GC)
  median time:      381.920 μs (0.00% GC)
  mean time:        508.820 μs (22.14% GC)
  maximum time:     6.460 ms (91.71% GC)
  --------------
  samples:          9779
  evals/sample:     1

julia> @benchmark groupby(e, :x, sort=true)
BenchmarkTools.Trial:
  memory estimate:  5.54 MiB
  allocs estimate:  288044
  --------------
  minimum time:     17.031 ms (0.00% GC)
  median time:      18.230 ms (0.00% GC)
  mean time:        20.772 ms (10.68% GC)
  maximum time:     32.934 ms (36.42% GC)
  --------------
  samples:          241
  evals/sample:     1

julia> @benchmark groupby(sort!($(deepcopy(e))), :x)
BenchmarkTools.Trial:
  memory estimate:  1.54 MiB
  allocs estimate:  57530
  --------------
  minimum time:     1.808 ms (0.00% GC)
  median time:      1.936 ms (0.00% GC)
  mean time:        2.525 ms (19.89% GC)
  maximum time:     13.133 ms (81.65% GC)
  --------------
  samples:          1977
  evals/sample:     1

julia> @benchmark groupby(sort(e), :x)
BenchmarkTools.Trial:
  memory estimate:  5.75 MiB
  allocs estimate:  307043
  --------------
  minimum time:     16.820 ms (0.00% GC)
  median time:      18.341 ms (0.00% GC)
  mean time:        20.925 ms (11.64% GC)
  maximum time:     36.118 ms (37.38% GC)
  --------------
  samples:          240
  evals/sample:     1

julia> @benchmark groupby(f, :x)
BenchmarkTools.Trial:
  memory estimate:  6.34 MiB
  allocs estimate:  100057
  --------------
  minimum time:     3.761 ms (0.00% GC)
  median time:      4.367 ms (0.00% GC)
  mean time:        5.503 ms (17.80% GC)
  maximum time:     12.438 ms (30.90% GC)
  --------------
  samples:          908
  evals/sample:     1

julia> @benchmark groupby(f, :x, sort=true)
BenchmarkTools.Trial:
  memory estimate:  64.17 MiB
  allocs estimate:  3403352
  --------------
  minimum time:     270.340 ms (9.12% GC)
  median time:      282.425 ms (9.04% GC)
  mean time:        290.332 ms (11.02% GC)
  maximum time:     417.644 ms (37.02% GC)
  --------------
  samples:          18
  evals/sample:     1

julia> @benchmark groupby(sort!($(deepcopy(f))), :x)
BenchmarkTools.Trial:
  memory estimate:  15.46 MiB
  allocs estimate:  597530
  --------------
  minimum time:     18.934 ms (0.00% GC)
  median time:      28.783 ms (32.24% GC)
  mean time:        27.513 ms (25.24% GC)
  maximum time:     37.588 ms (31.23% GC)
  --------------
  samples:          182
  evals/sample:     1

julia> @benchmark groupby(sort(f), :x)
BenchmarkTools.Trial:
  memory estimate:  66.45 MiB
  allocs estimate:  3602351
  --------------
  minimum time:     286.174 ms (13.46% GC)
  median time:      306.774 ms (12.66% GC)
  mean time:        323.814 ms (14.53% GC)
  maximum time:     435.845 ms (36.44% GC)
  --------------
  samples:          16
  evals/sample:     1

julia> @benchmark groupby(g, :x)
BenchmarkTools.Trial:
  memory estimate:  61.41 MiB
  allocs estimate:  1000057
  --------------
  minimum time:     116.534 ms (10.03% GC)
  median time:      139.712 ms (21.51% GC)
  mean time:        170.733 ms (33.41% GC)
  maximum time:     279.206 ms (51.86% GC)
  --------------
  samples:          30
  evals/sample:     1

julia> @benchmark groupby(g, :x, sort=true)
BenchmarkTools.Trial:
  memory estimate:  768.23 MiB
  allocs estimate:  41836262
  --------------
  minimum time:     4.910 s (9.40% GC)
  median time:      4.977 s (9.35% GC)
  mean time:        4.977 s (9.35% GC)
  maximum time:     5.043 s (9.31% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> @benchmark groupby(sort!($(deepcopy(g))), :x)
BenchmarkTools.Trial:
  memory estimate:  152.92 MiB
  allocs estimate:  5997530
  --------------
  minimum time:     463.496 ms (36.73% GC)
  median time:      646.319 ms (52.37% GC)
  mean time:        627.043 ms (50.87% GC)
  maximum time:     668.340 ms (54.92% GC)
  --------------
  samples:          8
  evals/sample:     1

julia> @benchmark groupby(sort(g), :x)
BenchmarkTools.Trial:
  memory estimate:  791.10 MiB
  allocs estimate:  43835261
  --------------
  minimum time:     4.397 s (14.04% GC)
  median time:      4.575 s (16.08% GC)
  mean time:        4.575 s (16.08% GC)
  maximum time:     4.753 s (17.98% GC)
  --------------
  samples:          2
  evals/sample:     1

@nalimilan
Copy link
Member Author

I think the best argument in favor of handling nulls in groupby is that we still don't have great support for dropping nulls on subsets of the DataFrame as was brought up in JuliaData/DataTables.jl#38, but I think we'd be better served by implementing something like @mkborregaard's proposal of allowing columns to be specified.

e.g. groupby(dropnull(df, :x), :x) would achieve the same purpose but allows more flexibility (because the functions can be used independently) and better re-use of the DataFrames method library than implementing something like groupby(df, :x, dropnull=true) would.

The problem is that in most cases people will have to repeat the names of the grouping columns twice (as in your example). So we should provide a more convenient way. I'm not sure we can escape adding a dropnull/skipnull keyword argument, unfortunately. Under the hood it could just call a more general function, but I'm also not sure that checking for nulls first and then passing the result to groupby can be as fast as discarding nulls directly while performing the grouping operation.

Regarding sorting, I like the idea of calling sort on the result explicitly, but OTOH I have the impression that you generally want to have the results sorted. The benchmarks you show are indeed worrying, but that might reflect a problem in our current sorting code: how is it possible that changing the order of 10 groups makes such a big difference? It should be virtually free if done at the right point in the algorithm, since the ordering is arbitrary and choosing an ordering should be equivalent to sorting ten integers. Maybe this should be done directly in row_group_slots or group_rows, i.e. before computing the indices, so that we don't need to reorder them? We probably need to improve these functions anyway to be able to identify and skip nulls.

@nalimilan
Copy link
Member Author

It wasn't too hard to implement, see #1267.

I've actually realized that in 0.10.1 we did not skip NA values already, they were just sorted first and there was absolutely no way to skip them. So even without sort=true by default, the current state of master isn't really worse than the last release. And with skipnull added, it will be strictly better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants