Description
The state of the various flavours of `.unique` as of v0.23:
- `[pd/Series/Index].unique` does not have a `keep`-kwarg
- `Series.unique` returns an array, `Series.drop_duplicates` returns a `Series`. Returning a plain `np.ndarray` is quite unusual for a `Series` method, and furthermore the differences between these closely-related methods are confusing from a user perspective, IMO
- same point for `Index`
- `DataFrame.unique` does not exist, but is a much more natural candidate (from the behaviour of numpy, resp. `Series/Index`) than `.drop_duplicates`
- `pd.unique` chokes on 2-dimensional data
- no `return_inverse`-kwarg for any of the `.unique` variants; see API: provide a better way of doing np.unique(return_inverses=True) #4087 (milestoned since 0.14), ENH: adding .unique() to DF (or return_inverse for duplicated) #21357
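A minimal sketch of some of the inconsistencies listed above, assuming a plain integer `Series` (the sample values are made up for illustration, and behaviour may differ for extension dtypes or other pandas versions):

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 3, 2, 1])  # illustrative data, not from the issue

# Series.unique returns a plain NumPy array for this dtype ...
u = s.unique()
print(type(u))

# ... while drop_duplicates returns a Series and keeps the original labels.
d = s.drop_duplicates()
print(type(d), d.index.tolist())

# drop_duplicates accepts keep=, which .unique does not offer.
last = s.drop_duplicates(keep="last")
print(last.tolist())

# DataFrame has drop_duplicates but no unique method.
has_df_unique = hasattr(pd.DataFrame, "unique")
print(has_df_unique)

# NumPy can return the inverse mapping; note that it sorts the uniques.
uniq, inverse = np.unique(s.to_numpy(), return_inverse=True)
print(uniq, inverse)
```

The inverse satisfies `uniq[inverse] == s.to_numpy()` elementwise, which is the round-trip that none of the pandas `.unique` variants currently expose.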
I originally wanted to add `df.unique(..., return_inverse=True|False)` for #21357, but was directed to add it to `duplicated` instead. After slow progress over 3 months in #21645 (the PR has been essentially finished for 2 of them), @jorisvandenbossche gave the (IMO justified) feedback that:
> I think my main worry is that we are adding a `return_inverse` keyword which actually does not return the inverse for that function (it does return the inverse for another function), and that it is in name similar to numpy's keyword, but in usage also different.

and

> [...] it might make sense to add this to `pd.unique` / `Series.unique` as well? (not necessarily at the same time; or might actually be an easier starter)
This prompted me to have another look at the situation with `.unique`, and I found the inconsistencies listed above. To resolve them, I suggest the following:
- Change the return type of `[Series/Index].unique` to be the same as the caller (deprecation cycle by introducing `raw=None`, which at first defaults to True?)
- Add a `keep`-kwarg to `[Series/Index].unique` (make `.unique` a wrapper around `.drop_duplicates`?)
- Add `df.unique` (as a thin wrapper around `.drop_duplicates`?)
- Add a `keep`-kwarg to `pd.unique` and dispatch to `DataFrame/Series/Index` as necessary
- Add a `return_inverse`-kwarg to all of them (and add it to the EA interface); under the hood by exposing the same kwarg to `duplicated` and `drop_duplicates` as well
- (something for later) solve BUG: df.duplicated treats None as np.nan in object columns #21720 (treatment of `np.nan/None` in `df.duplicated` is inconsistent vs. Series behaviour)
Each point is essentially self-contained and independent of the others, but of course they make more sense together.
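To make the proposal more concrete, here is a rough sketch of what a caller-typed `unique` with `keep` and an inverse could look like today. The helper names `unique_like_caller` and `unique_with_inverse` are hypothetical and not part of pandas; the sketch relies only on the existing `drop_duplicates` and `pd.factorize`, and assumes data without missing values:

```python
import pandas as pd

def unique_like_caller(obj, keep="first"):
    """Hypothetical helper: unique values with the same type as the caller.

    Works for Series, Index and DataFrame because all three already
    implement drop_duplicates with a keep= argument.
    """
    return obj.drop_duplicates(keep=keep)

def unique_with_inverse(s):
    """Hypothetical helper: uniques in order of first appearance, plus inverse.

    Assumes no missing values: by default pd.factorize encodes NaN as -1
    and leaves it out of the uniques, which would break the round-trip.
    """
    codes, uniques = pd.factorize(s.to_numpy())
    return uniques, codes

s = pd.Series([3, 1, 3, 2, 1])                      # illustrative data
df = pd.DataFrame({"a": [1, 1, 2], "b": [0, 0, 5]})  # illustrative data

s_unique = unique_like_caller(s)
df_unique = unique_like_caller(df)

uniques, inverse = unique_with_inverse(s)
# The inverse reconstructs the original values from the uniques.
roundtrip = uniques[inverse]
print(s_unique.tolist(), len(df_unique), roundtrip.tolist())
```

Unlike `np.unique`, `pd.factorize` keeps the order of first appearance, which matches what `drop_duplicates(keep="first")` returns. Whether a real `return_inverse` should follow that ordering is one of the design questions the points above leave open.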