When deduplicating, I need to get the mapping from the deduplicated records back to the originals.
Unfortunately:
- drop_duplicates and duplicated provide no information about this mapping
- np.unique has a return_inverse parameter (illustrated below), but does not work with object data due to python3: regression for unique on dtype=object arrays with varying items types (Trac #2188) numpy/numpy#641
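For context, this is the kind of mapping return_inverse provides on plain numeric data (a minimal, made-up example):

import numpy as np

a = np.array([3, 1, 3, 2, 1])
uniques, inverse = np.unique(a, return_inverse=True)
# uniques -> array([1, 2, 3])
# inverse -> array([2, 0, 2, 1, 0]): position of each original element in uniques
assert (uniques[inverse] == a).all()  # the inverse reconstructs the original array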
Let's say I have a large DataFrame (df) with several object columns (say, sig) that form the deduplication signature (the subset parameter of duplicated).
Then I have (here only for a subset of 100k records):
import timeit

import pandas as pd

# df is the large DataFrame described above
sig = ['list', 'of', 'column', 'names']
N = 100000

tic = timeit.default_timer()
### no reverse mapping
df[sig].sample(N).duplicated(keep=False)
toc = timeit.default_timer()
print(f'duplicated: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
pd.unique([x for x in df[sig].sample(N).itertuples(name=None, index=False)])
toc = timeit.default_timer()
print(f'pd.unique: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
dups = df.sample(N).groupby(sig).groups.values()  # row labels per group
redup = pd.concat([pd.Series(x[0], index=x) for x in dups], axis=0)  # assumes sorted index; otherwise use x.min()
toc = timeit.default_timer()
print(f'with groupby: {toc-tic:.2f} sec.')
resulting in:
duplicated: 2.45 sec.
pd.unique: 1.64 sec.
with groupby: 16.89 sec.
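To make the groupby workaround concrete, here is what redup looks like on a toy frame (hypothetical data, same construction as above):

import pandas as pd

toy = pd.DataFrame({'a': [1, 1, 2, 1], 'b': ['x', 'x', 'y', 'x']})
dups = toy.groupby(['a', 'b']).groups.values()  # row labels per signature group
redup = pd.concat([pd.Series(x[0], index=x) for x in dups], axis=0)
# redup maps every original row label to the first label of its duplicate group:
# 0 -> 0, 1 -> 0, 2 -> 2, 3 -> 0
print(redup.sort_index())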
The performance gets worse with size: for a few million records, I need to wait ~20 min to get the inverse, whereas a pure duplicated call only takes around 10 sec.
This caused me to hunt down the inner workings of duplicated to see where the inverse gets lost. It turns out that until almost the very end of duplicated, the indices are still there:
Line 4384 in 3147a86:
ids = get_group_index(labels, shape, sort=False, xnull=False)
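For reference, a sketch of how those ids can be computed in user code, following the internals referenced above (get_group_index is pandas-internal, so its import path may differ across versions):

import pandas as pd
from pandas.core.sorting import get_group_index  # internal; path may vary

# factorize each signature column into integer codes, as duplicated does internally
labels, shape = [], []
for col in sig:
    codes, uniques = pd.factorize(df[col])
    labels.append(codes.astype('int64', copy=False))
    shape.append(len(uniques))

# one int64 id per row; rows with an equal signature get an equal id
ids = get_group_index(labels, shape, sort=False, xnull=False)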
I extracted the guts of duplicated and applied them directly; once ids is available for the whole set (~10 sec), I get the following:
[calculate ids]
import numpy as np
from pandas._libs.hashtable import duplicated_int64  # pandas-internal helper behind duplicated

tic = timeit.default_timer()
### no reverse mapping
duplicated_int64(ids, keep=False)
toc = timeit.default_timer()
print(f'pd.duplicated_int64: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
np.unique(ids)
toc = timeit.default_timer()
print(f'np.unique, no inverse: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
np.unique(ids, return_inverse=True)
toc = timeit.default_timer()
print(f'np.unique, with inverse: {toc-tic:.2f} sec.')
yielding
pd.duplicated_int64: 1.60 sec.
np.unique, no inverse: 0.31 sec.
np.unique, with inverse: 0.64 sec.
So, by exposing a return_inverse argument -- either in .duplicated, .drop_duplicates, or in a separate .unique method for DataFrames -- my problem would be accelerated from ~20 min to around 11 sec.
I would imagine the following signature:
df.unique(subset=None, keep={'first'|'last'}, return_inverse=False)
If return_inverse were added to drop_duplicates, it would have to raise an error for the combination keep=False, return_inverse=True, because no inverse can be returned in this case.
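For concreteness, the intended semantics could be emulated with today's API roughly as follows (just a sketch built on the slow groupby workaround from above; unique_with_inverse is a made-up name, not an existing pandas method):

import pandas as pd

def unique_with_inverse(df, subset=None):
    # deduplicate with keep='first' and also return the inverse mapping
    subset = list(df.columns) if subset is None else subset
    groups = df.groupby(subset, sort=False).groups.values()
    # map every original row label to the label of its first occurrence
    # (note: rows with NaN in the signature are dropped by groupby)
    inverse = pd.concat([pd.Series(idx[0], index=idx) for idx in groups]).reindex(df.index)
    deduped = df[~df.duplicated(subset=subset, keep='first')]
    return deduped, inverse

# usage / round trip: re-expanding the deduplicated frame recovers the original rows
# deduped, inverse = unique_with_inverse(df, subset=sig)
# (deduped.loc[inverse, sig].values == df[sig].values).all()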
I wouldn't know how to adapt the cython code for duplicated_int64, but since it's 5x slower than np.unique anyway, there's no point, IMHO. I would know how to call the numpy function at the appropriate point...
Edit: I just saw that np.unique sorts, which makes the 'first'|'last' behaviour a fair bit harder...
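That said, the sorting seems recoverable: np.unique can also return the first-occurrence positions, from which a keep='first' ordering can be reconstructed with one extra argsort (a sketch on the ids from above):

import numpy as np

uniq, first_pos, inv = np.unique(ids, return_index=True, return_inverse=True)
order = np.argsort(first_pos)        # sorted uniques reordered by first appearance
rank = np.empty_like(order)
rank[order] = np.arange(len(order))  # sorted position -> first-occurrence position
inverse_first = rank[inv]            # inverse mapping in keep='first' order
keep_pos = first_pos[order]          # row positions of the kept first occurrences
assert (ids[keep_pos][inverse_first] == ids).all()  # round trip still holds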