feat: let nw.Enum accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1) #2192

Merged
dangotbanned merged 48 commits into narwhals-dev:main from camriddell:enh-enum-creation
Apr 18, 2025

Conversation

@camriddell
Member

@camriddell camriddell commented Mar 11, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Adds support for the nw.Enum datatype for pandas (backed by pandas.CategoricalDtype(…, ordered=True)).

The current implementation diverges from pandas/Polars in two broad ways:

  1. We do not check for None, NaN, or Null (both pandas and Polars raise when they construct a CategoricalDtype/Enum with these in the categories list).
  2. pandas allows arbitrary (hashable) objects to be stored as the categories, whereas Polars only allows strings. The current implementation is type-hinted to follow suit with pandas, but we do not perform this check, instead letting the backend library raise as needed.
>>> import narwhals as nw
>>> import pandas as pd
>>> s = nw.new_series('foo', ['a', 'b', 'c'], dtype=nw.Enum(['a', 'b', 'c', 'd']), native_namespace=pd)
>>> s
┌───────────────────────────────────────────────┐
|                Narwhals Series                |
|-----------------------------------------------|
|0    a                                         |
|1    b                                         |
|2    c                                         |
|Name: foo, dtype: category                     |
|Categories (4, object): ['a' < 'b' < 'c' < 'd']|
└───────────────────────────────────────────────┘

- add conversion from native to pandas
- add conversion from native to Polars
@camriddell camriddell changed the title from "Feat:" to "Feat: nw.Enum support for pandas" Mar 11, 2025
@camriddell camriddell requested a review from MarcoGorelli March 12, 2025 16:03
@camriddell camriddell requested a review from FBruzzesi March 13, 2025 18:16
@MarcoGorelli
Member

thanks! it's encouraging that this doesn't break downstream tests

sorry i didn't get round to it for tomorrow's release, will try to get it in for next week's one 👍

except ImportError as exc:  # pragma: no cover
    msg = f"Unable to convert to {dtype} due to the following exception: {exc.msg}"
    raise ImportError(msg) from exc
return pd.CategoricalDtype(categories=dtype.categories, ordered=True)
Member

not sure if we can do something pandas-specific here, as this is used by cudf and modin too - could we generalise?

Member

@dangotbanned dangotbanned Mar 29, 2025

I'm a little confused by this.
pandas is already a module-level import?

import pandas as pd

Member Author

@dangotbanned you're right- I pulled this code from a pretty old branch I had so that must have just been leftover. I'll delete it.

@MarcoGorelli I'll look into generalizing this to cudf and modin

Comment on lines 359 to 362
if dtype == "category":
    if native_dtype.ordered:
        return dtypes.Enum(categories=native_dtype.categories)
    return dtypes.Categorical()
Member

this would be a breaking change, so i'm not totally sure about it - could we preserve the current behaviour in v1 and only make this change in the main namespace? the version variable is available in this function, you can use that
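
A minimal sketch of the version gate suggested above, using stand-in classes rather than the real narwhals internals. `Version`, `Categorical`, `Enum`, and `map_category_dtype` are hypothetical names chosen for illustration; only the idea (keep `Categorical` under stable.v1, return `Enum` for ordered categoricals in the main namespace) comes from the comment.

```python
from dataclasses import dataclass
from enum import Enum as PyEnum, auto


class Version(PyEnum):
    """Hypothetical stand-in for the namespace version flag."""

    V1 = auto()
    MAIN = auto()


@dataclass(frozen=True)
class Categorical:
    """Stand-in for an unordered categorical dtype."""


@dataclass(frozen=True)
class Enum:
    """Stand-in for an ordered dtype with a fixed set of categories."""

    categories: tuple


def map_category_dtype(categories, ordered, version):
    # Under V1, ordered categoricals keep mapping to Categorical so that
    # existing users see no change; only MAIN opts in to Enum.
    if ordered and version is not Version.V1:
        return Enum(tuple(categories))
    return Categorical()


v1_result = map_category_dtype(["a", "b"], ordered=True, version=Version.V1)
main_result = map_category_dtype(["a", "b"], ordered=True, version=Version.MAIN)
unordered_result = map_category_dtype(["a", "b"], ordered=False, version=Version.MAIN)
```

Gating on the version keeps the breaking part of the change out of the stable API while still letting the main namespace carry the richer type.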

dangotbanned added a commit that referenced this pull request Mar 29, 2025
I noticed a new one in (#2192) and thought I'd get them all in one sweep
MarcoGorelli pushed a commit that referenced this pull request Mar 29, 2025
* chore(typing): Resolve `_polars.utils` dtype ignores

I noticed a new one in (#2192) and thought I'd get them all in one sweep

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* chore: "coverage"

Just replacing the original `getattr`, there was already no coverage for that

https://github.com/narwhals-dev/narwhals/actions/runs/14145863466/job/39633072966?pr=2312

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@MarcoGorelli MarcoGorelli added the enhancement New feature or request label Apr 4, 2025
@MarcoGorelli
Member

thanks Cam - looks like there's an xpass

FAILED tests/series_only/cast_test.py::test_cast_to_enum_v1[modin[pyarrow]]

@camriddell camriddell requested a review from dangotbanned April 16, 2025 20:45
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
@dangotbanned dangotbanned self-requested a review April 17, 2025 08:07
Member

@dangotbanned dangotbanned left a comment

Thanks for the PR @camriddell!

I've left some non-blocking comments/questions.
Looking pretty ready to me 🎉

Comment on lines +214 to +216
def non_object_native_to_narwhals_dtype(native_dtype: Any, version: Version) -> DType:
    dtype = str(native_dtype)

Member

This change seems to have been there since the first commit (3581985), but doesn't seem to be documented?

It looks like this part is related

https://github.com/camriddell/narwhals/blob/d2504a40efc606d8e626a5b9049ff8054417d64c/narwhals/_pandas_like/utils.py#L320-L321

Which would mean we do the str(...) call twice now. Just an observation, not sure if there is a cost to that

https://github.com/camriddell/narwhals/blob/d2504a40efc606d8e626a5b9049ff8054417d64c/narwhals/_pandas_like/utils.py#L306-L309

Are all non-object pandas data types guaranteed to be immutable?
I think str was used because it is hashable, so is safe to use in functools.lru_cache

Member Author

Which would mean we do the str(...) call twice now. Just an observation, not sure if there is a cost to that

I think the cost of a repeated call to str should be fairly negligible; we can always come back later to refactor if a profiler shows that it adds meaningful overhead.

Are all non-object pandas data types guaranteed to be immutable?
I think str was used because it is hashable, so is safe to use in functools.lru_cache

Since the tests pass, I am at least confident that all of the datatypes are hashable. However, if that hash is just the default id(self) / 16 rather than something meaningful, then caching may not be reliable. That said, perhaps we can leave it as is for now and revisit it if we catch wind of a slowdown in the future? Trying to avoid premature optimization here.

Member

@camriddell agreed on the str part.

My concern on the hashability though is related to #2051 (comment)

Right now we won't get a warning like that because we have:

native_dtype: Any

However - good news!
I changed it to this locally:

@functools.lru_cache(maxsize=16)
def non_object_native_to_narwhals_dtype(
    native_dtype: pd.api.extensions.ExtensionDtype, version: Version
) -> DType:

And followed through to the docs to find:

ExtensionDtypes are required to be hashable. The base class provides

Looks like we're all good 🙂
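
To make the hashability concern in this thread concrete, here is a small sketch using only the standard library. `ValueHashedDtype`, `IdentityHashedDtype`, and `describe` are hypothetical stand-ins, not pandas or narwhals objects. It shows why a value-based hash matters for `functools.lru_cache`: two equal-looking keys only share a cache entry when `__eq__` and `__hash__` are defined by value.

```python
import functools
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueHashedDtype:
    """Stand-in dtype: equality and hash are derived from its fields."""

    name: str


class IdentityHashedDtype:
    """Stand-in dtype: inherits object's identity-based __eq__ and __hash__."""

    def __init__(self, name):
        self.name = name


@functools.lru_cache(maxsize=16)
def describe(dtype):
    # Placeholder for an expensive native-to-narwhals dtype conversion.
    return f"dtype:{dtype.name}"


# Two separately constructed but equal keys: the second call is a cache hit.
describe(ValueHashedDtype("category"))
describe(ValueHashedDtype("category"))
value_hits = describe.cache_info().hits

describe.cache_clear()

# Two separately constructed identity-hashed keys: never equal, so no hit.
describe(IdentityHashedDtype("category"))
describe(IdentityHashedDtype("category"))
identity_hits = describe.cache_info().hits
```

With identity-based hashing the cache still returns correct results, it just stops saving any work, which is the "caching may not be reliable" scenario raised earlier in the thread.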

Member Author

Great find on that one, thanks so much for diving in there!

Comment on lines +132 to +138
if isinstance(dtype, dtypes.Enum):
    import pandas as pd

    # NOTE: `pandas-stubs.core.dtypes.dtypes.CategoricalDtype.categories` is too narrow
    # Should be one of the `ListLike*` types
    # https://github.com/pandas-dev/pandas-stubs/blob/8434bde95460b996323cc8c0fea7b0a8bb00ea26/pandas-stubs/_typing.pyi#L497-L505
    return pd.CategoricalDtype(dtype.categories, ordered=True)  # pyright: ignore[reportArgumentType]
Member

Member

@camriddell ignore this, I only meant to add as a comment - not the review 🫣

Member

@MarcoGorelli gentle nudge on this, in case it was missed

Member

hey - yeah, probably, the pandas stubs definitely don't get all the attention they probably deserve

Member

@MarcoGorelli MarcoGorelli left a comment

This is going to cause issues for people even just inspecting the schema of a dataframe:

In [1]: import narwhals as nw

In [2]: import pandas as pd

In [3]: s = pd.Series([1,2,3], dtype=pd.CategoricalDtype(ordered=True))

In [4]: nw.from_native(s, series_only=True)
Out[4]: 
┌──────────────────────────────────┐
|         Narwhals Series          |
|----------------------------------|
|0    1                            |
|1    2                            |
|2    3                            |
|dtype: category                   |
|Categories (3, int64): [1 < 2 < 3]|
└──────────────────────────────────┘

In [5]: nw.from_native(s, series_only=True).dtype
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 nw.from_native(s, series_only=True).dtype

File ~/polars-api-compat-dev/narwhals/series.py:368, in Series.dtype(self)
    353 @property
    354 def dtype(self: Self) -> DType:
    355     """Get the data type of the Series.
    356 
    357     Returns:
   (...)    366         Int64
    367     """
--> 368     return self._compliant_series.dtype

File ~/polars-api-compat-dev/narwhals/_pandas_like/series.py:236, in PandasLikeSeries.dtype(self)
    232 @property
    233 def dtype(self: Self) -> DType:
    234     native_dtype = self.native.dtype
    235     return (
--> 236         native_to_narwhals_dtype(native_dtype, self._version, self._implementation)
    237         if native_dtype != "object"
    238         else object_native_to_narwhals_dtype(
    239             self.native, self._version, self._implementation
    240         )
    241     )

File ~/polars-api-compat-dev/narwhals/_pandas_like/utils.py:321, in native_to_narwhals_dtype(native_dtype, version, implementation)
    319     return arrow_native_to_narwhals_dtype(native_dtype.pyarrow_dtype, version)
    320 if str_dtype != "object":
--> 321     return non_object_native_to_narwhals_dtype(native_dtype, version)
    322 elif implementation is Implementation.DASK:
    323     # Per conversations with their maintainers, they don't support arbitrary
    324     # objects, so we can just return String.
    325     dtypes = import_dtypes_module(version)

File ~/polars-api-compat-dev/narwhals/_pandas_like/utils.py:260, in non_object_native_to_narwhals_dtype(native_dtype, version)
    258         return dtypes.Categorical()
    259     if native_dtype.ordered:
--> 260         return dtypes.Enum(native_dtype.categories)
    261     return dtypes.Categorical()
    262 if (match_ := PATTERN_PD_DATETIME.match(dtype)) or (
    263     match_ := PATTERN_PA_DATETIME.match(dtype)
    264 ):

File ~/polars-api-compat-dev/narwhals/dtypes.py:464, in Enum.__init__(self, categories)
    462     if not isinstance(cat, str):
    463         msg = f"{type(self).__name__} categories must be strings; found data of type {type(cat).__name__!r}"
--> 464         raise TypeError(msg)
    465     seen.add(cat)
    466 self.categories = sequence

TypeError: Enum categories must be strings; found data of type 'int'

In particular, it would be a breaking change for Altair users, who'd no longer be able to plot pandas dataframes where columns are of categorical dtype and have non-string categories. It's probably not showing up at the moment in the downstream tests because we were careful to use narwhals.stable.v1

@dangotbanned
Member

dangotbanned commented Apr 17, 2025

#2192 (review)

This is going to cause issues for people even just inspecting the schema of a dataframe

That's a good point @MarcoGorelli

If someone is currently doing that operation, on v1, it would look like this:

import pandas as pd

from narwhals.stable import v1 as nw_v1

s = pd.Series([1, 2, 3], dtype=pd.CategoricalDtype(ordered=True))
>>> nw_v1.from_native(s, series_only=True).dtype
Categorical

So far we've had two options:

  1. Being lax with categories 1, 2
  2. Using the stricter rules from polars 3, 4

I see two other tweaks we could do to option 2 - when we can't meet the constraints of pl.Enum

  • Just continue mapping pd.CategoricalDtype -> nw.Categorical
    • No change in behavior, breaks no-one
  • Use an alternative constructor for pd.CategoricalDtype -> nw.Enum
    • So we'd still reject nw.Enum([1, 2, 3])
    • But we'd allow existing ordered categoricals to be represented by an ordered type

I think either of those would solve the problem, but I think the simplest is to just keep using nw.Categorical
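
For readers weighing the second tweak, a rough sketch of what an alternative constructor could look like. `StrictEnum` and `from_native_categories` are hypothetical names invented for illustration and are not part of narwhals; the string check mirrors the one visible in the traceback above.

```python
class StrictEnum:
    """Hypothetical dtype: the public constructor only accepts strings."""

    def __init__(self, categories):
        cats = tuple(categories)
        for cat in cats:
            if not isinstance(cat, str):
                msg = (
                    f"{type(self).__name__} categories must be strings; "
                    f"found data of type {type(cat).__name__!r}"
                )
                raise TypeError(msg)
        self.categories = cats

    @classmethod
    def from_native_categories(cls, categories):
        # Lenient path for dtypes that already exist in a backend:
        # bypass __init__ so non-string categories are preserved as-is.
        obj = cls.__new__(cls)
        obj.categories = tuple(categories)
        return obj


# User-facing construction stays strict.
try:
    StrictEnum([1, 2, 3])
except TypeError:
    rejected = True
else:
    rejected = False

# Inspecting an existing ordered categorical would go through the lenient path.
from_backend = StrictEnum.from_native_categories([1, 2, 3])
```

This keeps schema inspection from raising on valid pandas data, at the cost of having two construction paths with different guarantees.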

@dangotbanned
Member

altair-related

@MarcoGorelli

In particular, it would be a breaking change for Altair users, who'd no longer be able to plot pandas dataframes where columns are of categorical dtype and have non-string categories.
It's probably not showing up at the moment in the downstream tests because we were careful to use narwhals.stable.v1

I could be wrong, but I don't think we have any paths that would hit this - even if we weren't using v1?
AFAICT, pandas is handled natively - since the type conversion logic predates narwhals and (I assume) we didn't wanna make a breaking change.

Impl

Tests

Docs

@MarcoGorelli
Member

This is the part that would break in Altair:

https://github.com/vega/altair/blob/f1e0049e6f6669ec46ec462cec81ce62aae8cbf2/altair/utils/core.py#L670-L671

It would also affect Plotly and other libraries

I think it's fine to be laxer here - Polars only allows string column names, but we allow pandas dataframe with non-string column names. Similarly, we can allow pandas dataframes with non-string categories

I think it's legit to do something like

s: nw.Series
categories = list(s.dtype.categories)
categories.append(new_value)
nw.new_series('a', values, dtype=nw.Enum(categories))

where the categories are taken from user inputs. If the user is starting with something which a backend permits, they can continue with that, no issues

@dangotbanned
Member

This is the part that would break in Altair:

vega/altair@f1e0049/altair/utils/core.py#L670-L671

Well spotted @MarcoGorelli, I stand corrected 😄

where the categories are taken from user inputs. If the user is starting with something which a backend permits, they can continue with that, no issues

I guess I'm just more in the camp of what @camriddell said in (#2192 (comment))

If we only let backends raise, we will hit an issue where some code only works with specific backends, which undermines the purpose of Narwhals.
With the current Enum targeting the pandas_like and Polars backends, I see this primarily happening in the space where writing code with a pandas backend in mind will break if a user passes in a Polars DataFrame because the Enum(…) had non-string categories.

It just seems to me like we're introducing a footgun by deviating from how polars interprets the same situation:

import pandas as pd
import polars as pl

# NOTE: Strings
>>> pl.Series(pd.Series(["1", "2", "3"], dtype=pd.CategoricalDtype(ordered=True))).to_pandas()
0    1
1    2
2    3
Name: , dtype: category
Categories (3, object): ['1', '2', '3']

# NOTE: Not strings
>>> pl.Series(pd.Series([1, 2, 3], dtype=pd.CategoricalDtype(ordered=True))).to_pandas()
0    1
1    2
2    3
Name: , dtype: int64

Important

Happy to follow your lead on this @MarcoGorelli, just wanna make sure I've raised my concerns 🙂

@MarcoGorelli
Member

MarcoGorelli commented Apr 17, 2025

sure, thanks for explaining

true, there is a risk that someone writes something which doesn't end up working for polars, but i'd rather accept that risk than disallow people from passing valid pandas dataframes to narwhals

i think one reason for narwhals' relatively rapid growth has been that, relative to similar/competing projects, we've put a lot of emphasis on there not being any cost to existing pandas users

@camriddell
Member Author

@MarcoGorelli I believe the current version meets the changes you requested? When you have a chance can you take another look?

Member

@MarcoGorelli MarcoGorelli left a comment

thanks, looks good to me!

@dangotbanned any objections?

@dangotbanned
Member

dangotbanned commented Apr 18, 2025

thanks, looks good to me!

@dangotbanned any objections?

@MarcoGorelli just wanna double check this was what you asked for?

remove enum duplication/null checks

I thought in (#2192 (comment)) you just wanted to allow non-strings - not allow duplicates and None.

But no objections from me

@MarcoGorelli
Member

thanks!

the backends themselves already disallow duplicates and nulls, so tbh i'm not super-bothered, especially given that people will usually be inspecting schemas of dataframes containing enums rather than making new ones
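
A quick way to see the pandas side of this claim, assuming pandas is installed. `check_categories` is a hypothetical helper written for this example; Polars is not exercised here.

```python
import pandas as pd


def check_categories(categories):
    """Return "ok" if pandas accepts the categories, else the exception name."""
    try:
        pd.CategoricalDtype(categories, ordered=True)
    except ValueError as exc:
        return type(exc).__name__
    return "ok"


# Duplicates and nulls are rejected by pandas itself.
duplicate_result = check_categories(["a", "a"])
null_result = check_categories(["a", None])

# Non-string categories are fine for pandas.
integer_result = check_categories([1, 2, 3])
```

Because the backend already raises here, dropping the equivalent checks from the dtype constructor moves the error to construction of the native object rather than removing it.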

@dangotbanned dangotbanned changed the title from "Feat: let nw.Enum accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1)" to "feat: let nw.Enum accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1)" Apr 18, 2025
@dangotbanned dangotbanned merged commit 0eff60b into narwhals-dev:main Apr 18, 2025
30 checks passed
@dangotbanned dangotbanned linked an issue Apr 18, 2025 that may be closed by this pull request

Labels

dtypes enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

enh: let Enum take arguments, allow it in construction

3 participants