When deduplicating, I need to get the mapping from the deduplicated records back to the originals.
Unfortunately:
- drop_duplicates and duplicated provide no information about this mapping
- np.unique has a return_inverse parameter (illustrated below), but does not work with object data due to python3: regression for unique on dtype=object arrays with varying items types (Trac #2188) numpy/numpy#641
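For context, this is the kind of mapping return_inverse provides on plain numeric data (a minimal, made-up example):

import numpy as np

a = np.array([3, 1, 3, 2, 1])
uniques, inverse = np.unique(a, return_inverse=True)
# uniques -> array([1, 2, 3])
# inverse -> array([2, 0, 2, 1, 0]): position of each original element in uniques
assert (uniques[inverse] == a).all()  # the inverse reconstructs the original array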
Let's say I have a large DataFrame (df) with several object columns (say, sig) that form the deduplication signature (the subset parameter of duplicated).
Then I have (here only for a subset of 100k records):
import timeit

import pandas as pd

# df is the large DataFrame described above
sig = ['list', 'of', 'column', 'names']
N = 100000

tic = timeit.default_timer()
### no reverse mapping
df[sig].sample(N).duplicated(keep=False)
toc = timeit.default_timer()
print(f'duplicated: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
pd.unique([x for x in df[sig].sample(N).itertuples(name=None, index=False)])
toc = timeit.default_timer()
print(f'pd.unique: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
dups = df.sample(N).groupby(sig).groups.values()  # row labels per group
redup = pd.concat([pd.Series(x[0], index=x) for x in dups], axis=0)  # assumes sorted index; otherwise use x.min()
toc = timeit.default_timer()
print(f'with groupby: {toc-tic:.2f} sec.')
resulting in:
duplicated: 2.45 sec.
pd.unique: 1.64 sec.
with groupby: 16.89 sec.
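To make the groupby workaround concrete, here is what redup looks like on a toy frame (hypothetical data, same construction as above):

import pandas as pd

toy = pd.DataFrame({'a': [1, 1, 2, 1], 'b': ['x', 'x', 'y', 'x']})
dups = toy.groupby(['a', 'b']).groups.values()  # row labels per signature group
redup = pd.concat([pd.Series(x[0], index=x) for x in dups], axis=0)
# redup maps every original row label to the first label of its duplicate group:
# 0 -> 0, 1 -> 0, 2 -> 2, 3 -> 0
print(redup.sort_index())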
The performance gets worse with size: for a few million records, I need to wait ~20 min to get the inverse, whereas a pure duplicated call only takes around 10 sec.
This caused me to hunt down the inner workings of duplicated to see where the inverse gets lost. It turns out that until almost the very end of duplicated, the indices are still there:
Line 4384 in 3147a86:
ids = get_group_index(labels, shape, sort=False, xnull=False)
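For reference, a sketch of how those ids can be computed in user code, following the internals referenced above (get_group_index is pandas-internal, so its import path may differ across versions):

import pandas as pd
from pandas.core.sorting import get_group_index  # internal; path may vary

# factorize each signature column into integer codes, as duplicated does internally
labels, shape = [], []
for col in sig:
    codes, uniques = pd.factorize(df[col])
    labels.append(codes.astype('int64', copy=False))
    shape.append(len(uniques))

# one int64 id per row; rows with an equal signature get an equal id
ids = get_group_index(labels, shape, sort=False, xnull=False)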
I extracted the guts of duplicated and applied them directly; once ids is available for the whole set (~10 sec), I get the following:
[calculate ids]
import numpy as np
from pandas._libs.hashtable import duplicated_int64  # pandas-internal helper behind duplicated

tic = timeit.default_timer()
### no reverse mapping
duplicated_int64(ids, keep=False)
toc = timeit.default_timer()
print(f'pd.duplicated_int64: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
np.unique(ids)
toc = timeit.default_timer()
print(f'np.unique, no inverse: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
np.unique(ids, return_inverse=True)
toc = timeit.default_timer()
print(f'np.unique, with inverse: {toc-tic:.2f} sec.')
yielding
pd.duplicated_int64: 1.60 sec.
np.unique, no inverse: 0.31 sec.
np.unique, with inverse: 0.64 sec.
So, by exposing a return_inverse argument -- either in .duplicated, .drop_duplicates, or in a separate .unique method for DataFrames -- my problem would be accelerated from ~20 min to around 11 sec.
I would imagine the following signature:
df.unique(subset=None, keep={'first'|'last'}, return_inverse=False)
If return_inverse were added to drop_duplicates, it would have to raise an error for the combination keep=False, return_inverse=True, because no inverse can be returned in this case.
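For concreteness, the intended semantics could be emulated with today's API roughly as follows (just a sketch built on the slow groupby workaround from above; unique_with_inverse is a made-up name, not an existing pandas method):

import pandas as pd

def unique_with_inverse(df, subset=None):
    # deduplicate with keep='first' and also return the inverse mapping
    subset = list(df.columns) if subset is None else subset
    groups = df.groupby(subset, sort=False).groups.values()
    # map every original row label to the label of its first occurrence
    # (note: rows with NaN in the signature are dropped by groupby)
    inverse = pd.concat([pd.Series(idx[0], index=idx) for idx in groups]).reindex(df.index)
    deduped = df[~df.duplicated(subset=subset, keep='first')]
    return deduped, inverse

# usage / round trip: re-expanding the deduplicated frame recovers the original rows
# deduped, inverse = unique_with_inverse(df, subset=sig)
# (deduped.loc[inverse, sig].values == df[sig].values).all()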
I wouldn't know how to adapt the cython code for duplicated_int64, but since it's 5x slower than np.unique anyway, there's no point, IMHO. I would know how to call the numpy function at the appropriate point...
Edit: I just saw that np.unique sorts, which makes the 'first'|'last' behaviour a fair bit harder...
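That said, the sorting seems recoverable: np.unique can also return the first-occurrence positions, from which a keep='first' ordering can be reconstructed with one extra argsort (a sketch on the ids from above):

import numpy as np

uniq, first_pos, inv = np.unique(ids, return_index=True, return_inverse=True)
order = np.argsort(first_pos)        # sorted uniques reordered by first appearance
rank = np.empty_like(order)
rank[order] = np.arange(len(order))  # sorted position -> first-occurrence position
inverse_first = rank[inv]            # inverse mapping in keep='first' order
keep_pos = first_pos[order]          # row positions of the kept first occurrences
assert (ids[keep_pos][inverse_first] == ids).all()  # round trip still holds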