This repository has been archived by the owner on May 5, 2019. It is now read-only.

WIP: Restore old grouping algorithm and improve it #76

Closed
wants to merge 2 commits

Conversation

nalimilan
Member

Follow the strategy used by Pandas. The new implementation is more efficient
since it avoids creating a NullableCategoricalArray: the integer codes are
combined on the fly with those computed from previous columns. Hashing only
happens once by giving arbitrary codes to levels in the first pass; after that,
only integer codes are used.
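The combining step described above can be sketched in Python with pandas (the library whose strategy is being followed). The column values and variable names here are illustrative only, not taken from this PR:

```python
import numpy as np
import pandas as pd

# Two hypothetical grouping columns (values made up for illustration).
a = np.array(["x", "y", "x", "y"])
b = np.array([1, 1, 2, 2])

# First pass: hashing assigns arbitrary integer codes to each column's levels.
codes_a, levels_a = pd.factorize(a)  # codes: [0, 1, 0, 1]
codes_b, levels_b = pd.factorize(b)  # codes: [0, 0, 1, 1]

# After that, only integer codes are used: combine them on the fly by
# treating each (code_a, code_b) pair as a flat index into the cartesian
# product of the two level sets.
combined = codes_a * len(levels_b) + codes_b  # [0, 2, 1, 3]
```

Each distinct value of `combined` identifies one group, without materializing an intermediate categorical array for the pair of columns.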

Move the per-column operations to separate functions which can be specialized
by the compiler for each column type. This also allows using a more efficient
method for CategoricalArray.

Fix ordering of CategoricalArray levels when levels have been reordered,
and sort null values last for consistency with other nullable arrays. Enable
sorting by default since its cost is relatively small compared with the rest.

Avoid some allocations by using in-place operations and Base.unique!().


This is WIP in particular because we need to implement the code to compress the codes when the cartesian product is going to overflow (see this code in Pandas). Also, tests for sorted and unsorted grouping should be adapted.
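A minimal Python sketch of the missing compression step, assuming pandas' `factorize` as the hashing pass (the function name and structure are mine, not code from this PR or from Pandas): whenever multiplying the running group count by the next column's level count would overflow, the combined codes are re-factorized into dense indices before continuing.

```python
import numpy as np
import pandas as pd

MAX = np.iinfo(np.int64).max

def group_codes(columns):
    """Combine per-column codes into one group index per row,
    re-compressing whenever the cartesian product would overflow."""
    combined = np.zeros(len(columns[0]), dtype=np.int64)
    ngroups = 1
    for col in columns:
        codes, levels = pd.factorize(col)
        if ngroups > MAX // len(levels):
            # Product would overflow: treat the combined codes themselves
            # as a grouping column and map them to dense indices.
            combined, uniques = pd.factorize(combined)
            ngroups = len(uniques)
        combined = combined * len(levels) + codes
        ngroups *= len(levels)
    return combined

cols = [np.array(["x", "y", "x"]), np.array([1, 1, 2])]
print(list(group_codes(cols)))  # [0, 2, 1]
```

The compression costs roughly one extra grouping pass over an integer column each time it triggers, matching the cost estimate given later in this thread.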

This branch cuts down the time to run this example from 10.7 seconds (similar to DataFrames) to 5.7 seconds.

using DataTables
N = 100_000_000
A = rand(N);
B = rand(1:100, N);
dt = DataTable(A = A, B = B);
@time by(dt, :B, d -> mean(d[:A]));

@nalimilan
Member Author

@alyst I've just added a mechanism to compress group indices when they would overflow. It was actually very simple: we just need to take the computed indices as if they were a grouping column. So every time overflow would happen, we have to pay a cost equivalent to grouping on an integer column.
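The idea of "taking the computed indices as if they were a grouping column" can be illustrated in Python with pandas (values are made up; this is a hypothetical sketch, not the PR's code): the combined codes index the full cartesian product, so most values may never occur, and grouping on them again maps them to dense indices 0..ngroups-1.

```python
import numpy as np
import pandas as pd

# Made-up sparse product indices: only 3 of the possible values occur.
combined = np.array([0, 7, 0, 42, 7])

# Group on the indices themselves: factorize assigns each observed
# value a dense code, exactly as for an ordinary integer column.
dense, observed = pd.factorize(combined)
print(list(dense))      # [0, 1, 0, 2, 1]
print(len(observed))    # 3
```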

The code now also always compresses the group indices at the end when there are multiple grouping columns, since we allocate a vector whose length is equal to the number of groups. Without this, a memory error can happen in the indexer even if we avoided overflow in the previous step. This seems to be what Pandas does (though I'm not completely sure), but in some cases it's clearly not useful, e.g. when there are few empty combinations. So I wonder whether some kind of threshold would be in order (e.g. compress only when there are more than half as many groups as rows).

I also realized that by moving to a UInt index rather than UInt32, it's quite hard to have enough groups to trigger an overflow. What was the use case which triggered it?

@cjprybol
Contributor

Thanks for working on this, Milan! I read through the Discourse discussion and the comments on #17, and the reasons for switching back to the old approach make sense to me.

I've started working on a benchmark suite for DataTables again. Given how often people ask about performance, and that we missed the regressions on join in #17 (too excited about the improvements in groupby!), I think the sooner we can get some good, library-wide performance coverage the better. I'll try and get the tests for split-apply-combine related functionality done ASAP so we can compare these changes to the current approach from #17, and also so I can check #65.

@nalimilan
Member Author

It would really be great to have a benchmark suite. Without one, it's hard to make systematic performance improvements. You can probably borrow examples from Pandas and data.table; they show benchmarks on their websites.

@davidanthoff
Contributor

@nalimilan
Member Author

I have found an alternative approach at #79 which seems to give similar or even better performance. Unfortunately, the issues with join are quite different, so we need to find something else for those.

@nalimilan
Member Author

Closing in favor of #79.

@nalimilan nalimilan closed this Aug 7, 2017
@ararslan ararslan deleted the nl/group branch August 7, 2017 19:43

3 participants