FIX-#7486: Add support for .astype(pandas.CategoricalDtype(…))#7487
Conversation
…pe(…)) Signed-off-by: Cameron Riddell <cameron@dutc.io>
| else: | ||
| # Assume that the dtype is a scalar. | ||
| if not (col_dtypes == self_dtypes).all(): | ||
| if not (self_dtypes == col_dtypes).all(): |
There was a problem hiding this comment.
This points to an underlying issue with how binary_op "eq" is handled for dtypes.
Is it maybe dtypes.py 326?
There was a problem hiding this comment.
Hmm, I can't tell in where in this code-path DtypesDescriptor.equals would be invoked as self.dtypes returns a pandas.Series and col_dtypes is a user-passed object. Definitely open to further discussion as I'm not terribly familiar with the code-base so there likely are mechanics that I am not aware of after a superficial skim.
(for now) I am still of the opinion that this is a order-induced precedence issue, where we want to use the logic defined in the modin object to see if it is equivalent to a user-passed in object.
For the case of col_dtypes == self_dtypes (the current implementation) we always delegate to col_dtypes.__eq__ to check if the two objects are equivalent. Only if this function returns NotImplemented do we invoke the self_dtypes.__eq__ method.
In the common case, col_dtypes is a built-in object (most commonly a string at this point in the code), then we can rely on the fact that strict type checking is done as part of comparison and if the types don't align the a NotImplemented is returned, thus Python then check the __eq__ method of the other object.
In the current case modin.pandas.CategoricalDtype defines an __eq__ that never returns NotImplemented (it returns False in our case) therefore the broadcasted comparison logic is never executed (as that logic exists inside of the modin.Series. In this case pandas.CategoricalDtype defines an __eq__ implementation that returns False (which is logically correct). However since we're comparing against a Series we want a Series[bool] back instead of just a single bool. This is why .all() fails as shown in the traceback.
By swapping this comparison around, we are forcing the Series.__eq__ method to be invoked which returns an object of the correct shape/results.
A brief snippet to describe my observations
import modin.pandas as mpd
class T:
def __eq__(self, other):
return False
s = mpd.Series([1, 2, 3])
print(
f'{s == 1}', # Series → [True, False, False]
f'{(1).__eq__(s) = }', # NotImpelemented
f'{1 == s}', # Series → [True, False, False]
f'{T() == s}', # False (will always be False)
f'{s == T()}', # Series → [False, False, False]
sep='\n{}\n'.format('-'*40)
)There was a problem hiding this comment.
I don't fully understand the root cause here, but this seems like a simple fix.
sfc-gh-jkew
left a comment
There was a problem hiding this comment.
Makes sense to me. @devin-petersohn might want to take a look.
| else: | ||
| # Assume that the dtype is a scalar. | ||
| if not (col_dtypes == self_dtypes).all(): | ||
| if not (self_dtypes == col_dtypes).all(): |
There was a problem hiding this comment.
I don't fully understand the root cause here, but this seems like a simple fix.
Co-authored-by: Mahesh Vashishtha <mahesh.vashishtha@snowflake.com>
This will be fixed in a future modin release modin-project/modin#7487
… to Enum (only in main namespace, not stable.v1) (#2192) * enh nw.Enum to accept categories - add conversion from native to pandas - add conversion from native to Polars * add tests for nw.Enum(categories) * fix enum type checking for Enum dtype * fix enum doctest * positive check for enum instance for pyright * enum use implementation specific CategoricalDtype * enum preserve v1 behavior * preserve v1 polars enum conversion * add Enum support to dask * modin to xfail on Enum dtype This will be fixed in a future modin release modin-project/modin#7487 * Enum support outside of V1 * Fix v1 enum missing argument teset * fix enum error match for py38 * add pragma: no cover to v1.Enum from aligned with DType class * parametrize api versions for dtypes tests * decouple narwhals versioned dtypes * enum types to pass tests * fix: pyright warning on incorrect line for Enum Let this be a warning, check your files after you use a linter. * test(typing): Fix redefinition * test: Add `test_enum_v1_is_enum_unstable` Last statement is failing * fix: Get more of `__eq__` working * fix: Correct subtyping for `__eq__` #2192 (comment) * chore: Coverage https://github.com/narwhals-dev/narwhals/actions/runs/14361432835/job/40263682474?pr=2192 * list -> tuple * use str in annotation but at runtime dont raise for hashable (pandas..) * improve error message * move another test to v1test * update backcompat doc * coverage * test: fix doctest https://github.com/narwhals-dev/narwhals/actions/runs/14427770420/job/40458883250?pr=2192, #2192 (comment) * test: Add `test_enum_from_series` Resolves comment (#2192 (comment)) * enum add non-string and duplicate checking * Update docs/backcompat.md Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com> * fix(typing): Resolve some `Enum.categories` issues * refactor: Simplify `__init__`, raise earlier Also adapted an error message to be closer to https://github.com/pola-rs/polars/blob/319a9a84ab573886b2a13548a8e462fee353acef/py-polars/polars/datatypes/classes.py#L694-L696 * remove enum duplication/null checks --------- Co-authored-by: dangotbanned <125183946+dangotbanned@users.noreply.github.com> Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.pyblack --check modin/ asv_bench/benchmarks scripts/doc_checker.pygit commit -s.astype(mpd.CategoricalDtype(…))fails to convert via.astype#7486docs/development/architecture.rstis up-to-date