-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decide what to do with null values in grouping #1249
Comments
I'd prefer requiring explicit handling of nulls and treat them as a group just like the other values. The behavior you describe of:
is the kind of thing (my) data analysis nightmares are made of. Groups with missing data could be one of the most informative or important groups in the dataset, I think it makes sense to support them by default as first-class citizens. Removing them by default would be at best an annoyance and at worst a paper retraction waiting to happen. I think the best argument in favor of handling nulls in e.g.
I'd actually have a preference of moving the julia> using DataFrames, BenchmarkTools
julia> srand(1)
MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[1749029653, 1072851681, 1610647787, 1072862326, 1841712345, 1073426746, -198061126, 1073322060, -156153802, 1073567984 … 1977574422, 1073209915, 278919868, 1072835605, 1290372147, 18858467, 1815133874, -1716870370, 382, 0]), [1.02678, 1.46735, 1.12098, 1.81267, 1.39528, 1.62253, 1.39736, 1.92446, 1.02628, 1.35725 … 1.84562, 1.62919, 1.63941, 1.64108, 1.55719, 1.67337, 1.26605, 1.19832, 1.58489, 1.89534], 382)
julia> a = DataFrame(x = shuffle!(vcat(1:3, null)));
julia> b = DataFrame(x = shuffle!(vcat(1:10^1, null)));
julia> c = DataFrame(x = shuffle!(vcat(1:10^2, null)));
julia> d = DataFrame(x = shuffle!(vcat(1:10^3, null)));
julia> e = DataFrame(x = shuffle!(vcat(1:10^4, null)));
julia> f = DataFrame(x = shuffle!(vcat(1:10^5, null)));
julia> g = DataFrame(x = shuffle!(vcat(1:10^6, null)));
julia> @benchmark groupby(a, :x)
BenchmarkTools.Trial:
memory estimate: 3.33 KiB
allocs estimate: 49
--------------
minimum time: 4.101 μs (0.00% GC)
median time: 4.392 μs (0.00% GC)
mean time: 5.302 μs (10.16% GC)
maximum time: 787.240 μs (97.34% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark groupby(a, :x, sort=true)
BenchmarkTools.Trial:
memory estimate: 4.13 KiB
allocs estimate: 61
--------------
minimum time: 5.792 μs (0.00% GC)
median time: 6.553 μs (0.00% GC)
mean time: 7.709 μs (8.81% GC)
maximum time: 858.635 μs (95.88% GC)
--------------
samples: 10000
evals/sample: 6
julia> @benchmark groupby(sort!($(deepcopy(a))), :x)
BenchmarkTools.Trial:
memory estimate: 4.09 KiB
allocs estimate: 65
--------------
minimum time: 8.070 μs (0.00% GC)
median time: 9.393 μs (0.00% GC)
mean time: 10.809 μs (5.82% GC)
maximum time: 1.531 ms (96.99% GC)
--------------
samples: 10000
evals/sample: 3
julia> @benchmark groupby(sort(a), :x)
BenchmarkTools.Trial:
memory estimate: 5.11 KiB
allocs estimate: 78
--------------
minimum time: 9.640 μs (0.00% GC)
median time: 11.639 μs (0.00% GC)
mean time: 13.049 μs (5.63% GC)
maximum time: 3.785 ms (97.89% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark groupby(b, :x)
BenchmarkTools.Trial:
memory estimate: 3.75 KiB
allocs estimate: 56
--------------
minimum time: 4.315 μs (0.00% GC)
median time: 4.668 μs (0.00% GC)
mean time: 5.394 μs (9.50% GC)
maximum time: 587.941 μs (96.97% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark groupby(b, :x, sort=true)
BenchmarkTools.Trial:
memory estimate: 4.73 KiB
allocs estimate: 68
--------------
minimum time: 9.680 μs (0.00% GC)
median time: 11.251 μs (0.00% GC)
mean time: 12.269 μs (6.53% GC)
maximum time: 4.373 ms (97.46% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark groupby(sort!($(deepcopy(b))), :x)
BenchmarkTools.Trial:
memory estimate: 4.64 KiB
allocs estimate: 72
--------------
minimum time: 9.463 μs (0.00% GC)
median time: 11.547 μs (0.00% GC)
mean time: 13.292 μs (6.06% GC)
maximum time: 4.426 ms (97.91% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark groupby(sort(b), :x)
BenchmarkTools.Trial:
memory estimate: 5.66 KiB
allocs estimate: 85
--------------
minimum time: 13.032 μs (0.00% GC)
median time: 15.735 μs (0.00% GC)
mean time: 17.852 μs (4.95% GC)
maximum time: 5.080 ms (97.28% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark groupby(c, :x)
BenchmarkTools.Trial:
memory estimate: 9.61 KiB
allocs estimate: 146
--------------
minimum time: 6.957 μs (0.00% GC)
median time: 7.708 μs (0.00% GC)
mean time: 9.782 μs (16.61% GC)
maximum time: 1.402 ms (96.74% GC)
--------------
samples: 10000
evals/sample: 4
julia> @benchmark groupby(c, :x, sort=true)
BenchmarkTools.Trial:
memory estimate: 13.14 KiB
allocs estimate: 159
--------------
minimum time: 75.216 μs (0.00% GC)
median time: 78.611 μs (0.00% GC)
mean time: 84.476 μs (2.30% GC)
maximum time: 5.431 ms (97.12% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark groupby(sort!($(deepcopy(c))), :x)
BenchmarkTools.Trial:
memory estimate: 12.34 KiB
allocs estimate: 163
--------------
minimum time: 38.252 μs (0.00% GC)
median time: 40.498 μs (0.00% GC)
mean time: 46.064 μs (3.96% GC)
maximum time: 3.864 ms (97.58% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark groupby(sort(c), :x)
BenchmarkTools.Trial:
memory estimate: 13.36 KiB
allocs estimate: 176
--------------
minimum time: 76.977 μs (0.00% GC)
median time: 81.806 μs (0.00% GC)
mean time: 91.327 μs (2.72% GC)
maximum time: 6.243 ms (96.28% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark groupby(d, :x)
BenchmarkTools.Trial:
memory estimate: 66.36 KiB
allocs estimate: 1051
--------------
minimum time: 45.072 μs (0.00% GC)
median time: 53.142 μs (0.00% GC)
mean time: 73.577 μs (20.75% GC)
maximum time: 6.136 ms (97.84% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark groupby(d, :x, sort=true)
BenchmarkTools.Trial:
memory estimate: 255.80 KiB
allocs estimate: 11366
--------------
minimum time: 1.068 ms (0.00% GC)
median time: 1.122 ms (0.00% GC)
mean time: 1.262 ms (7.28% GC)
maximum time: 12.725 ms (83.42% GC)
--------------
samples: 3957
evals/sample: 1
julia> @benchmark groupby(sort!($(deepcopy(d))), :x)
BenchmarkTools.Trial:
memory estimate: 162.80 KiB
allocs estimate: 5925
--------------
minimum time: 497.095 μs (0.00% GC)
median time: 519.817 μs (0.00% GC)
mean time: 601.688 μs (8.28% GC)
maximum time: 10.612 ms (92.50% GC)
--------------
samples: 8288
evals/sample: 1
julia> @benchmark groupby(sort(d), :x)
BenchmarkTools.Trial:
memory estimate: 264.25 KiB
allocs estimate: 12366
--------------
minimum time: 1.091 ms (0.00% GC)
median time: 1.138 ms (0.00% GC)
mean time: 1.283 ms (7.41% GC)
maximum time: 12.676 ms (90.59% GC)
--------------
samples: 3896
evals/sample: 1
julia> @benchmark groupby(e, :x)
BenchmarkTools.Trial:
memory estimate: 678.27 KiB
allocs estimate: 10057
--------------
minimum time: 323.982 μs (0.00% GC)
median time: 381.920 μs (0.00% GC)
mean time: 508.820 μs (22.14% GC)
maximum time: 6.460 ms (91.71% GC)
--------------
samples: 9779
evals/sample: 1
julia> @benchmark groupby(e, :x, sort=true)
BenchmarkTools.Trial:
memory estimate: 5.54 MiB
allocs estimate: 288044
--------------
minimum time: 17.031 ms (0.00% GC)
median time: 18.230 ms (0.00% GC)
mean time: 20.772 ms (10.68% GC)
maximum time: 32.934 ms (36.42% GC)
--------------
samples: 241
evals/sample: 1
julia> @benchmark groupby(sort!($(deepcopy(e))), :x)
BenchmarkTools.Trial:
memory estimate: 1.54 MiB
allocs estimate: 57530
--------------
minimum time: 1.808 ms (0.00% GC)
median time: 1.936 ms (0.00% GC)
mean time: 2.525 ms (19.89% GC)
maximum time: 13.133 ms (81.65% GC)
--------------
samples: 1977
evals/sample: 1
julia> @benchmark groupby(sort(e), :x)
BenchmarkTools.Trial:
memory estimate: 5.75 MiB
allocs estimate: 307043
--------------
minimum time: 16.820 ms (0.00% GC)
median time: 18.341 ms (0.00% GC)
mean time: 20.925 ms (11.64% GC)
maximum time: 36.118 ms (37.38% GC)
--------------
samples: 240
evals/sample: 1
julia> @benchmark groupby(f, :x)
BenchmarkTools.Trial:
memory estimate: 6.34 MiB
allocs estimate: 100057
--------------
minimum time: 3.761 ms (0.00% GC)
median time: 4.367 ms (0.00% GC)
mean time: 5.503 ms (17.80% GC)
maximum time: 12.438 ms (30.90% GC)
--------------
samples: 908
evals/sample: 1
julia> @benchmark groupby(f, :x, sort=true)
BenchmarkTools.Trial:
memory estimate: 64.17 MiB
allocs estimate: 3403352
--------------
minimum time: 270.340 ms (9.12% GC)
median time: 282.425 ms (9.04% GC)
mean time: 290.332 ms (11.02% GC)
maximum time: 417.644 ms (37.02% GC)
--------------
samples: 18
evals/sample: 1
julia> @benchmark groupby(sort!($(deepcopy(f))), :x)
BenchmarkTools.Trial:
memory estimate: 15.46 MiB
allocs estimate: 597530
--------------
minimum time: 18.934 ms (0.00% GC)
median time: 28.783 ms (32.24% GC)
mean time: 27.513 ms (25.24% GC)
maximum time: 37.588 ms (31.23% GC)
--------------
samples: 182
evals/sample: 1
julia> @benchmark groupby(sort(f), :x)
BenchmarkTools.Trial:
memory estimate: 66.45 MiB
allocs estimate: 3602351
--------------
minimum time: 286.174 ms (13.46% GC)
median time: 306.774 ms (12.66% GC)
mean time: 323.814 ms (14.53% GC)
maximum time: 435.845 ms (36.44% GC)
--------------
samples: 16
evals/sample: 1
julia> @benchmark groupby(g, :x)
BenchmarkTools.Trial:
memory estimate: 61.41 MiB
allocs estimate: 1000057
--------------
minimum time: 116.534 ms (10.03% GC)
median time: 139.712 ms (21.51% GC)
mean time: 170.733 ms (33.41% GC)
maximum time: 279.206 ms (51.86% GC)
--------------
samples: 30
evals/sample: 1
julia> @benchmark groupby(g, :x, sort=true)
BenchmarkTools.Trial:
memory estimate: 768.23 MiB
allocs estimate: 41836262
--------------
minimum time: 4.910 s (9.40% GC)
median time: 4.977 s (9.35% GC)
mean time: 4.977 s (9.35% GC)
maximum time: 5.043 s (9.31% GC)
--------------
samples: 2
evals/sample: 1
julia> @benchmark groupby(sort!($(deepcopy(g))), :x)
BenchmarkTools.Trial:
memory estimate: 152.92 MiB
allocs estimate: 5997530
--------------
minimum time: 463.496 ms (36.73% GC)
median time: 646.319 ms (52.37% GC)
mean time: 627.043 ms (50.87% GC)
maximum time: 668.340 ms (54.92% GC)
--------------
samples: 8
evals/sample: 1
julia> @benchmark groupby(sort(g), :x)
BenchmarkTools.Trial:
memory estimate: 791.10 MiB
allocs estimate: 43835261
--------------
minimum time: 4.397 s (14.04% GC)
median time: 4.575 s (16.08% GC)
mean time: 4.575 s (16.08% GC)
maximum time: 4.753 s (17.98% GC)
--------------
samples: 2
evals/sample: 1
|
The problem is that in most cases people will have to repeat the names of the grouping columns twice (as in your example). So we should provide a more convenient way. I'm not sure we can escape adding a Regarding sorting, I like the idea of calling |
It wasn't too hard to implement, see #1267. I've actually realized that in 0.10.1 we did not skip |
On current master, and since JuliaData/DataTables.jl#17, we create separate groups for
null
values:The old code used to discard nulls silently, without an option to retain them (just like Pandas does). Since the behavior changes so radically, we should decide what's the best default before tagging a release. I think keeping nulls is a good default as it's consistent with what we do everywhere. But we should probably provide an argument to skip them so that people can get the old behavior easily. This is not so easy to implement when there are multiple grouping variables, each of which can be null independently from others.
A second issue is that by default we no longer sort groups, which means that null groups are not necessarily at the end. Maybe we should enable sorting by default. IIRC it isn't very costly it is in terms of performance, and I suspect it would be a more useful default.
Note that when using
N
grouping variables with nulls, there will be(2^N)-1
groups with anull
. We could imagine providing an option to collapse all of these into a single "at least onenull
" group, but probably not by default so that can wait.The text was updated successfully, but these errors were encountered: