Skip to content

FIX-#7486: Add support for .astype(pandas.CategoricalDtype(…))#7487

Merged
sfc-gh-jkew merged 2 commits intomodin-project:mainfrom
camriddell:fix/support-categoricaldtype
Apr 3, 2025
Merged

FIX-#7486: Add support for .astype(pandas.CategoricalDtype(…))#7487
sfc-gh-jkew merged 2 commits intomodin-project:mainfrom
camriddell:fix/support-categoricaldtype

Conversation

@camriddell
Copy link
Contributor

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves BUG: .astype(mpd.CategoricalDtype(…)) fails to convert via .astype #7486
  • tests added and passing: extended 2 existing tests for converting Categorical/Category data types
  • module layout described at docs/development/architecture.rst is up-to-date

…pe(…))

Signed-off-by: Cameron Riddell <cameron@dutc.io>
else:
# Assume that the dtype is a scalar.
if not (col_dtypes == self_dtypes).all():
if not (self_dtypes == col_dtypes).all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This points to an underlying issue with how binary_op "eq" is handled for dtypes.

Is it maybe dtypes.py 326?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I can't tell in where in this code-path DtypesDescriptor.equals would be invoked as self.dtypes returns a pandas.Series and col_dtypes is a user-passed object. Definitely open to further discussion as I'm not terribly familiar with the code-base so there likely are mechanics that I am not aware of after a superficial skim.


(for now) I am still of the opinion that this is a order-induced precedence issue, where we want to use the logic defined in the modin object to see if it is equivalent to a user-passed in object.

For the case of col_dtypes == self_dtypes (the current implementation) we always delegate to col_dtypes.__eq__ to check if the two objects are equivalent. Only if this function returns NotImplemented do we invoke the self_dtypes.__eq__ method.

In the common case, col_dtypes is a built-in object (most commonly a string at this point in the code), then we can rely on the fact that strict type checking is done as part of comparison and if the types don't align the a NotImplemented is returned, thus Python then check the __eq__ method of the other object.

In the current case modin.pandas.CategoricalDtype defines an __eq__ that never returns NotImplemented (it returns False in our case) therefore the broadcasted comparison logic is never executed (as that logic exists inside of the modin.Series. In this case pandas.CategoricalDtype defines an __eq__ implementation that returns False (which is logically correct). However since we're comparing against a Series we want a Series[bool] back instead of just a single bool. This is why .all() fails as shown in the traceback.

By swapping this comparison around, we are forcing the Series.__eq__ method to be invoked which returns an object of the correct shape/results.


A brief snippet to describe my observations

import modin.pandas as mpd

class T:
    def __eq__(self, other):
        return False

s = mpd.Series([1, 2, 3])

print(
    f'{s == 1}',           # Series → [True, False, False]
    f'{(1).__eq__(s) = }', # NotImpelemented
    f'{1 == s}',           # Series → [True, False, False]

    f'{T() == s}',         # False (will always be False)
    f'{s == T()}',         # Series → [False, False, False]
    sep='\n{}\n'.format('-'*40)
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the root cause here, but this seems like a simple fix.

Copy link
Contributor

@sfc-gh-jkew sfc-gh-jkew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. @devin-petersohn might want to take a look.

else:
# Assume that the dtype is a scalar.
if not (col_dtypes == self_dtypes).all():
if not (self_dtypes == col_dtypes).all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the root cause here, but this seems like a simple fix.

Co-authored-by: Mahesh Vashishtha <mahesh.vashishtha@snowflake.com>
@sfc-gh-jkew sfc-gh-jkew merged commit d512c53 into modin-project:main Apr 3, 2025
14 checks passed
camriddell added a commit to camriddell/narwhals that referenced this pull request Apr 4, 2025
This will be fixed in a future modin release
modin-project/modin#7487
dangotbanned added a commit to narwhals-dev/narwhals that referenced this pull request Apr 18, 2025
… to Enum (only in main namespace, not stable.v1) (#2192)

* enh nw.Enum to accept categories

- add conversion from native to pandas
- add conversion from native to Polars

* add tests for nw.Enum(categories)

* fix enum type checking for Enum dtype

* fix enum doctest

* positive check for enum instance for pyright

* enum use implementation specific CategoricalDtype

* enum preserve v1 behavior

* preserve v1 polars enum conversion

* add Enum support to dask

* modin to xfail on Enum dtype

This will be fixed in a future modin release
modin-project/modin#7487

* Enum support outside of V1

* Fix v1 enum missing argument teset

* fix enum error match for py38

* add pragma: no cover to v1.Enum from aligned with DType class

* parametrize api versions for dtypes tests

* decouple narwhals versioned dtypes

* enum types to pass tests

* fix: pyright warning on incorrect line for Enum

Let this be a warning, check your files after you use a linter.

* test(typing): Fix redefinition

* test: Add `test_enum_v1_is_enum_unstable`

Last statement is failing

* fix: Get more of `__eq__` working

* fix: Correct subtyping for `__eq__`

#2192 (comment)

* chore: Coverage

https://github.com/narwhals-dev/narwhals/actions/runs/14361432835/job/40263682474?pr=2192

* list -> tuple

* use str in annotation but at runtime dont raise for hashable  (pandas..)

* improve error message

* move another test to v1test

* update backcompat doc

* coverage

* test: fix doctest

https://github.com/narwhals-dev/narwhals/actions/runs/14427770420/job/40458883250?pr=2192, #2192 (comment)

* test: Add `test_enum_from_series`

Resolves comment (#2192 (comment))

* enum add non-string and duplicate checking

* Update docs/backcompat.md

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* fix(typing): Resolve some `Enum.categories` issues

* refactor: Simplify `__init__`, raise earlier

Also adapted an error message to be closer to https://github.com/pola-rs/polars/blob/319a9a84ab573886b2a13548a8e462fee353acef/py-polars/polars/datatypes/classes.py#L694-L696

* remove enum duplication/null checks

---------

Co-authored-by: dangotbanned <125183946+dangotbanned@users.noreply.github.com>
Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: .astype(mpd.CategoricalDtype(…)) fails to convert via .astype

3 participants