-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DEPR: __finalize__ and _metadata #51280
Comments
EDIT: as clarified below, although |
EDIT: as clarified below, although |
We do use this in the interchange protocol:
|
I think this is mixing different things:
For the |
Thanks @jorisvandenbossche ! A question is have is - is GeoPandas affected by the many bugs in EDIT: I'll take a look at how geopandas uses them, let's not rush anything |
Looks like geopandas uses Also our own I'd be OK with deprecating flags and having allow_duplicate_labels checking move to third-party validation libraries. |
GeoPandas customizes
Occasionally we run into an issue where the metadata is not properly propagated from eg a missing Performance is certainly an important aspect that I don't want to just ignore. But I think we should then do some more analysis of the extent of the problem in actual use cases, to know if this is actually a problem (or how big it is, and to see if there are ways to address it without throwing away the feature in its entirety). |
Another option (i think this was @phofl's idea) would be to keep |
I would propose to let this issue focus on the mechanics of how we propagate some information (using |
Do we have any user feedback that metadata propagation causing issues? Or that it's not worth the slowdown they do incur? It seems like xarray manages to do this successfully. @dcherian have you heard feedback from users about metadata propagation being a bottleneck in xarray? |
I personally have never observed this to be a bottleneck in real-world benchmarks that I encountered myself. We do have a recent report about this at #46505. But I think one could argue that this example is showing this is not necessarily a big problem: the specific example is a good "worst case" example (it is doing a groupby using 500,000 groups each consisting of just 1 row, so creating a lot of tiny DataFrame or Series objects, exaggerating the slow down you will typically see from the constructor overhead), it is also an example of something that is relatively slow anyway (applying a python function and not using one of the built-in cythonized options), and so in this worst case scenario the finalize/metadata propagation is then "only" giving a 30% overhead (10 -> 13s, based on #51109 (comment)). If performance is the main argument that we would want to deprecate the related features to metadata propagation (attrs/flags, #52165 and #52166, where it is currently the only argument), I think we need some more investigation / proof that this is actually a problem. It could also be useful to have some more detailed profiling of what aspect of it takes time / could be optimized (eg if it's mostly the |
It isn't hard to come up with examples where But the issue isn't that its a big overhead it is that it is an everywhere overhead. And its one we incur in order to support a feature that is not-- and has no real prospect of becoming-- fully functional.
Related: #52183 saves a little bit by checking _attrs/_flags instead of attrs/flags. |
Not a performance bottleneck no. The issues are around propagating attributes, and solving attribute conflicts when merging/concatenating. |
To be explicit: 14% of something that takes 25µs, so we are talking about around 4 microseconds. So while being a significant portion of the construction time, it is insignificant overhead in most cases. And to repeat myself from above (#51280 (comment)): there might be ways that we could further reduce this overhead inside
Whatever you think of it, it is a feature that is already useful for many of our users, right now. For subclasses this has already been functional for years, for the public |
Would it be possible (or necessary?) to use |
It's possible, but I'm skeptical that this is the right tool for the job. 5ish years ago I implemented Series/DataFrame subclasses that propagated column-specific metadata and it was a giant hassle. I concluded at the time that doing this at the EA level made more sense |
I tried attr, finalize, _metadata and subclasses yet nothing seems to work. Does anyone have a working solution as of today for attaching an attribute to a series? import pandas as pd
data = {
'name': ['A', 'B', 'C'],
'age': [1, 2, 3],
'country': ['CH', 'DE', 'GB']
}
class SubclassedSeries(pd.Series):
_metadata = ["id"]
@property
def _constructor(self):
return lambda *args, **kwargs: SubclassedSeries(*args, **kwargs).__finalize__(self)
@property
def _constructor_expanddim(self):
return lambda *args, **kwargs: SubclassedDataFrame(*args, **kwargs).__finalize__(self)
def __init__(self, *args, **kwargs):
self.id = kwargs.pop('id', None)
super().__init__(*args, **kwargs)
class SubclassedDataFrame(pd.DataFrame):
_metadata = ["id"]
@property
def _constructor(self):
return lambda *args, **kwargs: SubclassedDataFrame(*args, **kwargs).__finalize__(self)
@property
def _constructor_sliced(self):
return lambda *args, **kwargs: SubclassedSeries(*args, **kwargs).__finalize__(self)
df = SubclassedDataFrame(data)
for idx, col in enumerate(df.columns):
df[col].id = "xxx" + str(idx)
df[col].attrs = {"id": "xxx" + str(idx)}
print("------------- Before Rename -------------")
for col in df.columns:
print(type(df[col]))
print(df[col].attrs)
print(df[col].id)
df.rename(columns={'country': 'countries'}, inplace=True)
print("------------- After Rename in place-------------")
for col in df.columns:
print(type(df[col]))
print(df[col].attrs)
print(df[col].id) this gives:
I am trying to establish some sort of column lineage and would be nice to attach a unique id to a series such that even after a rename I could map the new series back to the original column. |
|
@Flix6x Thanks for the link but there is a lot of code and I don't fully understand which part makes attribute retention work. Care to highlight how it works or give a minimal example? |
We have some support for propagating metadata, but it is incomplete (38 xfails in tests/generic/test_finalize.py) for plain methods and has no real prospect for getting robust support for concat/merge etc.
At the same time, there is a real performance cost to calling
__finalize__
everywhere, in an extreme case like #46505 these calls take up a quarter of the runtime.AFAICT this is useful for a fraction of users and perf-hurting for everyone else. Do we have any idea how common it is to rely on these?
cc @MarcoGorelli
The text was updated successfully, but these errors were encountered: