-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
DEPR: na_sentinel in factorize #47157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @rhshadrach! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-06-24 18:33:26 UTC |
The warning handling in algorithms.factorize isn't my favorite, but i'll take your word for it that this is the best viable option. |
pandas/core/common.py
Outdated
use_na_sentinel: bool | lib.NoDefault, | ||
warn: bool = True, | ||
) -> int | None: | ||
"""Determine value of na_sentinel for factorize methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick newline before Determine
needs rebase, one nitpick, otherwise LGTM |
… into depr_na_sentinel � Conflicts: � doc/source/whatsnew/v1.5.0.rst � pandas/core/algorithms.py � pandas/core/arrays/base.py � pandas/core/arrays/sparse/array.py � pandas/core/common.py � pandas/tests/extension/base/methods.py
…_na_sentinel � Conflicts: � doc/source/whatsnew/v1.5.0.rst
The other option is to use catch_warnings when the call to EA.factorize is made from within pd.factorize. I typically try to avoid using it, but perhaps this is a place where it's worth it. It would certainly simplify that logic (which I myself am not a fan of). |
pandas/core/algorithms.py
Outdated
if passed_na_sentinel is lib.no_default: | ||
# User didn't specify na_sentinel; avoid warning. Note EA path always | ||
# uses a na_sentinel value. | ||
codes, uniques = values.factorize(use_na_sentinel=True) | ||
elif passed_na_sentinel is None: | ||
# Emit the appropriate warning message for None | ||
_ = com.resolve_na_sentinel(passed_na_sentinel, use_na_sentinel) | ||
codes, uniques = values.factorize(use_na_sentinel=True) | ||
else: | ||
# EA.factorize will warn | ||
codes, uniques = values.factorize(na_sentinel=na_sentinel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to distinguish deprecating the keyword that the user passes (I think that is handled above?) vs deprecating what the EA can accept.
Currently, if you have an external EA that has a custom factorize(self, na_sentinel=-1)
method, then passing use_na_sentinel
will raise an error. So to provide a smooth upgrade path for EA authors (and not users calling Series.factorize()
), we might need to check here if the new keyword can be passed or not? (and if not, raising a deprecation warning urging the EA author to update their factorize() implementation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess also worth clarifying is whether we expect the user to call EA.factorize directly vs pd.factorize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally we don't "expect" users to directly call EA.factorize, but it is a public method on a public object, so it falls under a deprecation policy (as is being done in this PR already). So I think for EA authors that have a custom EA.factorize, we should recommend doing a similar deprecation in their custom method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jorisvandenbossche. I think anyone using a mypy will be getting errors that their signature does not match the base class. I guess this not sufficient? It does seem noisy to add warnings for users, so I added the warning via __init_subclass__
so that it's emitted only once for each EA subclass. The potential downside is that it's emitted regardless of whether the method is used. Note users can silence warnings that are emitted when using pd.factorize
even before the EA author adopts the new signature.
This also caught a defect in my previous implementation - pd.factorize was passing use_na_sentinel
in some cases, which would cause code to fail if the EA author has not yet made the update. I went the catch_warnings route as I mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't notice you mentioned deprecation warning instead of future warning. That makes sense and will be less noisy.
@jorisvandenbossche - friendly ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For testing, I think it would be good if we have one of our test extension arrays (eg decimal) to still have the old signature, to ensure that we don't start to error for that
The na_sentinel argument is deprecated and | ||
will be removed in a future version of pandas. Specify use_na_sentinel as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The na_sentinel argument is deprecated and | |
will be removed in a future version of pandas. Specify use_na_sentinel as | |
The `na_sentinel` argument is deprecated and | |
will be removed in a future version of pandas. Specify `use_na_sentinel` as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took this to mean use backticks for any argument / code in the warnings. I went and implemented that for all warnings.
pandas/core/algorithms.py
Outdated
with warnings.catch_warnings(): | ||
# We've already warned above | ||
warnings.filterwarnings("ignore", ".*use_na_sentinel.*", FutureWarning) | ||
codes, uniques = values.factorize(na_sentinel=na_sentinel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do a similar "if use_na_sentinel in signature" check (as in init_subclass) and in that case already pass the new keyword to the underlying EA? (in that case we don't have to catch any warning)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear to me as we'd be trading out catch_warnings for inspect and more complex logic (though not by much). I thought inspect would take somewhat longer (on the order of 1ms) but on my machine it's pretty quick:
values = pd.array([1, 2, 3])
print("use_na_sentinel" in inspect.signature(values.factorize).parameters)
%timeit "use_na_sentinel" in inspect.signature(values.factorize).parameters
# True
# 12.8 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
I've gone an implemented it; easy to revert if we don't like it.
…_na_sentinel � Conflicts: � doc/source/whatsnew/v1.5.0.rst
Thanks @jorisvandenbossche - changes made and ready for another review.
I don't think there is decimal, so I did this for datetimelike. It has the easiest implementation to change and doesn't generate warnings on its own (it uses super to do that), so seemed like the best option. Plus it also starts with a 'd' :D |
Encountered a minor difficulty with the warnings and leaving TimelikeOps with the old signature. In order to not get the warning on import of pandas, need to special case it in With the current implementation, worst case is some code does not get warned to update the signature for classes named "TimelikeOps", "DatetimeArray", "TimedeltaArray". |
…_na_sentinel � Conflicts: � doc/source/whatsnew/v1.5.0.rst
elif not isinstance(values.dtype, np.dtype): | ||
if ( | ||
na_sentinel == -1 | ||
and "use_na_sentinel" in inspect.signature(values.factorize).parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you inspecting like this? its always passed (with no_default maybe)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're inspecting whether the EA values
can accept the new argument "use_na_sentinel". #47157 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to add a comment to that effect
pandas/core/common.py
Outdated
warnings.warn(msg, FutureWarning, stacklevel=find_stack_level()) | ||
|
||
|
||
def resolve_na_sentinel( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put this somewhere else, closer to where its used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to algorithms.
@jreback @jorisvandenbossche friendly ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO looks pretty good
…_na_sentinel � Conflicts: � pandas/core/arrays/arrow/array.py
@rhshadrach thanks for the updates! Looks good to me from a quick look. I won't be able to further look at it the coming two weeks, so don't wait on me to move forward here |
Thanks @rhshadrach. Follow ups can be made as needed |
Thanks @jorisvandenbossche; if you do end up circling back, let me know if you have any more comments and happy to do a followup. |
* DEPR: na_sentinel in factorize * WIP * DEPR: na_sentinel in factorize * Fixups * Fixups * black * fixup * docs * newline * Warn on class construction, rework pd.factorize warnings * FutureWarning -> DeprecationWarning * Remove old comment * backticks in warnings, revert datetimelike, avoid catch_warnings * fixup for warnings * mypy fixups * Move resolve_na_sentinel * Remove underscores Co-authored-by: Jeff Reback <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.The NotImplementedError will be removed as part of #46601