Description
The discussion in #6134 has inspired an idea that I'm writing down for
discussion. The idea is pretty obvious so it should've been considered before,
but I still think pandas as it is right now can benefit from it.
My main complaint about pandas when using it in a non-interactive way is that
lookups are significantly slower than with ndarray containers. I do realize
that this happens because of the many ways indexing may be done, but at some
point I really started thinking about ditching pandas in some
performance-critical paths of my project and replacing it with the dreadful
dict/ndarray combo. Not only does writing arr = df.values[df.index.get_loc(key)]
get old pretty fast, but it's also slower when the frame contains different
dtypes, and then you need to go deeper to fix that.
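For concreteness, here is a minimal sketch of the workaround described above, on a made-up three-row frame (the data and variable names are illustrative only). It resolves the label to a position once and then indexes the underlying ndarray directly. With mixed dtypes, `df.values` has to build a single array of a common dtype, which is where the extra cost mentioned above comes from.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a single dtype, so df.values is a plain float64 array.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
    index=["x", "y", "z"],
)

# Convenient label-based lookup: goes through the general indexing machinery.
row_via_loc = df.loc["y"]

# Manual fast path: resolve the label to an integer position once,
# then index the underlying ndarray directly.
values = df.values
pos = df.index.get_loc("y")  # an integer for a unique index
row_via_values = values[pos]

# Both routes return the same data for this row.
assert np.array_equal(row_via_loc.values, row_via_values)
```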
So I thought: what if this slowdown could be reduced by creating fastpath
indexers that look like the IndexSlice from #6134 and convey a
message to pandas' indexing facilities, like "trust me, I've done all the
preprocessing, just look it up already"? I'm talking about something like this
(the names are arbitrary and chosen for illustrative purposes only):
masked_rows = df.fastloc[pd.bool_slice[bool_array]]
# or
masked_rows = df.fastloc[pd.bool_series_slice[bool_series]]
# or
rows_3_and_10 = df.fastloc[pd.pos_slice[3, 10]]
# or
rows_3_through_10 = df.fastloc[pd.range_slice[3:10]]
# or
rows_for_two_days = df.fastloc[pd.tpos_slice['2014-01-01', '2014-01-08']]

Given that the actual slice objects will have a common base class, the
implementation could be as easy as:
class FastLocAttribute(object):
    def __init__(self, container):
        self._container = container

    def __getitem__(self, smth):
        if not isinstance(smth, FastpathIndexer):
            raise TypeError("Indexing object is not a FastpathIndexer")
        # open to custom FastpathIndexer implementations
        return smth.getitem(self._container)
        # or a better encapsulated, but not so open, alternative:
        # return self._container._index_method[type(smth)](smth)

Cons:
- a change in public API
- one more lookup type
- inconvenient to use interactively
Pros:
- adheres to the Zen of Python (explicit is better than implicit)
- when used in programs, most of the time you know what the indexing
object will look like and how you want to use its contents (e.g. no guessing
whether np.array([0,1,0,1]) is a boolean mask or a series of "takeable" indices)
- lengthier than existing lookup schemes but still shorter than jumping through
the hoops of NDFrame and Index internals to avoid premature
pessimization (also, more reliable w.r.t. new releases)
- fastpath indexing API could be used in pandas internally for the speed (and
clarity, as in "interesting, what does this function pass to df.loc[...],
let's find this out")
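To make the proposal easier to discuss, here is a runnable sketch of the dispatch idea. Everything in it is hypothetical: `FastpathIndexer`, `BoolMask`, and `Positions` are stand-ins for the arbitrary names used above, and none of them exist in pandas. The lookups go through the existing `iloc`, so this shows only the shape of the API and how it removes the mask-versus-positions ambiguity. It does not demonstrate any speed-up.

```python
import numpy as np
import pandas as pd


class FastpathIndexer:
    """Hypothetical base class: each subclass states what its payload means."""

    def __init__(self, payload):
        self.payload = payload

    def getitem(self, container):
        raise NotImplementedError


class BoolMask(FastpathIndexer):
    """Payload is a boolean mask with one entry per row."""

    def getitem(self, container):
        mask = np.asarray(self.payload, dtype=bool)
        return container.iloc[np.flatnonzero(mask)]


class Positions(FastpathIndexer):
    """Payload is a sequence of integer row positions."""

    def getitem(self, container):
        return container.iloc[np.asarray(self.payload, dtype=np.intp)]


class FastLocAttribute:
    def __init__(self, container):
        self._container = container

    def __getitem__(self, indexer):
        if not isinstance(indexer, FastpathIndexer):
            raise TypeError("Indexing object is not a FastpathIndexer")
        return indexer.getitem(self._container)


df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=list("abcd"))
fastloc = FastLocAttribute(df)  # stand-in for a df.fastloc attribute
raw = np.array([0, 1, 0, 1])

# The same array, with the caller stating explicitly what it means.
as_mask = fastloc[BoolMask(raw)]            # rows where the mask is true
as_positions = fastloc[Positions(raw)]      # rows at positions 0, 1, 0, 1
```

In a real implementation the `getitem` methods would presumably call into lower-level internals instead of `iloc`; which internals those would be is left open here.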