-
-
Couldn't load subscription status.
- Fork 19.2k
ENH/PERF: use mask in factorize for nullable dtypes #33064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
246b787
af1cdea
d030ff1
be5a21d
b0a88b9
2e94842
bcde548
c97d357
175a8f6
a6bc6fc
889b1b9
0370045
24b528f
6ed5239
1f7e994
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -27,6 +27,7 @@ | |
| from pandas.core.dtypes.missing import isna | ||
|
|
||
| from pandas.core import nanops, ops | ||
| from pandas.core.algorithms import _factorize_array | ||
| import pandas.core.common as com | ||
| from pandas.core.indexers import check_array_indexer | ||
| from pandas.core.ops import invalid_comparison | ||
|
|
@@ -481,7 +482,16 @@ def astype(self, dtype, copy: bool = True) -> ArrayLike: | |
| def _values_for_factorize(self) -> Tuple[np.ndarray, float]: | ||
| # TODO: https://github.com/pandas-dev/pandas/issues/30037 | ||
| # use masked algorithms, rather than object-dtype / np.nan. | ||
| return self.to_numpy(na_value=np.nan), np.nan | ||
| return self.to_numpy(dtype=float, na_value=np.nan), np.nan | ||
|
|
||
| def factorize2(self, na_sentinel: int = -1) -> Tuple[np.ndarray, "ExtensionArray"]: | ||
|
||
| arr = self._data | ||
| mask = self._mask | ||
|
|
||
| codes, uniques = _factorize_array(arr, na_sentinel=na_sentinel, mask=mask) | ||
|
|
||
| uniques = IntegerArray(uniques, np.zeros(len(uniques), dtype=bool)) | ||
| return codes, uniques | ||
|
|
||
| def _values_for_argsort(self) -> np.ndarray: | ||
| """ | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're worried about perf for existing cases, could take this check outside of the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to check the mask for each value inside the loop, so not sure what can be moved outside?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to the use_mask check; it would basically become a separate loop or even method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think duplicating the full loop is worth it (the loop itself is 40 lines below here), given the minor performance impact I showed in the timings.