@dangotbanned dangotbanned commented Oct 23, 2025

Related issues

Upstream

Motivation

Adapting the Selectors implementation to be more like (pola-rs/polars#23351) has been on my mind for a while now:

My most recent woe in (#3224) - when trying to allow `ColumnNameOrSelector` in more places - is that not all column-selection expressions have a selector equivalent:

  • `col(s)` doesn't translate to `cs.by_name`
    • It should be `cs.by_name(..., require_all=True)`
  • `{nth,index_columns}` would have the same issue
    • But `cs.by_index` is missing

That poses a problem for implicitly converting `str` (`ColumnName*`) into a `Selector` (`*OrSelector`) under the hood.
A user can request columns that don't exist, and selector semantics (empty sets are okay in most cases) allow it:
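The strict/relaxed distinction can be sketched in plain Python. This is a hypothetical `resolve_names` helper (not the narwhals API), showing why selector-style resolution can silently produce an empty selection while a strict `require_all=True` mode raises:

```python
def resolve_names(
    schema: dict[str, str], names: list[str], *, require_all: bool = True
) -> list[str]:
    """Resolve requested column names against a schema.

    Strict mode raises on missing columns; relaxed (selector-style) mode
    silently drops them, which is how an empty result becomes possible.
    """
    missing = [n for n in names if n not in schema]
    if require_all and missing:
        raise KeyError(f"columns not found: {missing}")
    return [n for n in names if n in schema]


schema = {"a": "Int64", "b": "String"}
# Selector semantics: missing names drop out, possibly down to an empty set
assert resolve_names(schema, ["a", "z"], require_all=False) == ["a"]
assert resolve_names(schema, ["z"], require_all=False) == []
```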

Tests

```python
# TODO @dangotbanned: Stricter selectors
@pytest.mark.xfail(
    reason="TODO: Handle missing columns in `strict`/`require_all` selectors."
)
def test_partition_by_missing_names(data: Data) -> None:  # pragma: no cover
    df = dataframe(data)
    with pytest.raises(ColumnNotFoundError, match=r"\"d\""):
        df.partition_by("d")
    with pytest.raises(ColumnNotFoundError, match=r"\"e\""):
        df.partition_by("c", "e")


def test_partition_by_fully_empty_selector(data: Data) -> None:
    df = dataframe(data)
    with pytest.raises(
        ComputeError, match=r"at least one key is required in a group_by operation"
    ):
        df.partition_by(ncs.array(ncs.numeric()), ncs.struct(), ncs.duration())


# NOTE: Matching polars behavior
def test_partition_by_partially_missing_selector(data: Data) -> None:
    df = dataframe(data)
    results = df.partition_by(ncs.string() | ncs.list() | ncs.enum())
    expected = nw.Schema({"a": nw.String(), "b": nw.Int64(), "c": nw.Int64()})
    for df in results:
        assert df.schema == expected
```

> [!NOTE]
> TL;DR: This PR is about closing these gaps and finally addressing a TODO from 4-5 months ago 😄

Show Tasks

🏆 Highlights

Selectors

Reimplemented Selectors to align with (pola-rs/polars#23351).
TL;DR: all of the following can now be used independently of backend/version:

  • cs.by_{index,name}(require_all: bool)
  • cs.{array,list}(inner: Selector | None)
  • cs.empty
  • cs.enum
  • cs.first
  • cs.float
  • cs.integer
  • cs.last
  • cs.struct
  • cs.temporal

All Selectors (including `nw.nth`, `nw.all`, `nw.exclude`) and `nw.col` can now be used in *Frame-level methods, which so far include:

  • BaseFrame.sort
  • BaseFrame.drop
  • BaseFrame.drop_nulls
  • DataFrame.partition_by (started in #3224)

> [!NOTE]
> All of this is opaque to the compliant level, which continues to receive already-resolved column names.
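As a rough sketch of that boundary (hypothetical names, not the actual internals): a frame-level method expands selector input into concrete column names up-front, and only those names cross into the compliant level.

```python
from typing import Callable

# A selector modelled as a predicate over (name, dtype) pairs
Predicate = Callable[[str, str], bool]


def drop(schema: dict[str, str], *selectors: Predicate) -> dict[str, str]:
    """Expand selectors to names first, then 'dispatch' using names only."""
    to_drop: set[str] = set()
    for sel in selectors:
        to_drop.update(n for n, dt in schema.items() if sel(n, dt))
    # Everything below this point (the "compliant level") sees only `to_drop`
    return {n: dt for n, dt in schema.items() if n not in to_drop}


schema = {"a": "i64", "b": "str", "c": "f64"}
is_float: Predicate = lambda _name, dtype: dtype == "f64"
assert drop(schema, is_float) == {"a": "i64", "b": "str"}
```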

Well-defined Expansion Rules

I ran into some issues relating to (pola-rs/polars#25022), (pola-rs/polars#21773), (narwhals-dev/narwhals#3029)

The precise details of what expands and how can be found in this fancy new class:

Expander

```python
class Expander:
    __slots__ = ("ignored", "schema")

    schema: FrozenSchema
    ignored: Ignored

    def __init__(self, scope: IntoFrozenSchema, ignored: Ignored = ()) -> None:
        self.schema = freeze_schema(scope)
        self.ignored = ignored

    def iter_expand_exprs(self, exprs: Iterable[ExprIR], /) -> Iterator[ExprIR]:
        # Iteratively expand all of exprs
        for expr in exprs:
            yield from self._expand(expr)

    def iter_expand_selector_names(
        self, selectors: Iterable[SelectorIR], /
    ) -> Iterator[str]:
        for s in selectors:
            yield from s.iter_expand_names(self.schema, self.ignored)

    def prepare_projection(self, exprs: Collection[ExprIR], /) -> Seq[NamedIR]:
        output_names = deque[str]()
        named_irs = deque[NamedIR]()
        root_names = set[str]()
        # NOTE: Collecting here isn't ideal (perf-wise), but the expanded `ExprIR`s
        # have more useful information to add in an error message.
        # Another option could be keeping things lazy, but repeating the work for
        # the error case? That way, there isn't a cost paid on the happy path - and
        # it doesn't matter when we're raising if we take our time displaying the message
        expanded = tuple(self.iter_expand_exprs(exprs))
        for e in expanded:
            # NOTE: Empty string is allowed as a name, but is falsy
            if (name := e.meta.output_name(raise_if_undetermined=False)) is not None:
                target = e
            elif meta.has_expr_ir(e, KeepName):
                replaced = replace_keep_name(e)
                name = replaced.meta.output_name()
                target = replaced
            else:
                msg = f"Unable to determine output name for expression, got: `{e!r}`"
                raise NotImplementedError(msg)
            output_names.append(name)
            named_irs.append(ir.named_ir(name, remove_alias(target)))
            root_names.update(meta.iter_root_names(e))
        if len(output_names) != len(set(output_names)):
            raise duplicate_error(expanded)
        if not (set(self.schema).issuperset(root_names)):
            raise column_not_found_error(root_names, self.schema)
        return tuple(named_irs)

    def _expand(self, expr: ExprIR, /) -> Iterator[ExprIR]:
        # For a single expr, fully expand all parts of it
        if all(not e.needs_expansion() for e in expr.iter_left()):
            yield expr
        else:
            yield from self._expand_recursive(expr)

    def _expand_recursive(self, origin: ExprIR, /) -> Iterator[ExprIR]:
        # Dispatch the kind of expansion, based on the type of expr.
        # Every other method will call back here.
        # Based on https://github.com/pola-rs/polars/blob/5b90db75911c70010d0c0a6941046e6144af88d4/crates/polars-plan/src/plans/conversion/dsl_to_ir/expr_expansion.rs#L253-L850
        if isinstance(origin, _EXPAND_NONE):
            yield origin
        elif isinstance(origin, ir.SelectorIR):
            names = origin.iter_expand_names(self.schema, self.ignored)
            yield from (ir.Column(name=name) for name in names)
        elif isinstance(origin, _EXPAND_SINGLE):
            for expr in self._expand_recursive(origin.expr):
                yield origin.__replace__(expr=expr)
        elif isinstance(origin, _EXPAND_COMBINATION):
            yield from self._expand_combination(origin)
        elif isinstance(origin, ir.FunctionExpr):
            yield from self._expand_function_expr(origin)
        else:
            msg = f"Didn't expect to see {type(origin).__name__}"
            raise NotImplementedError(msg)

    def _expand_inner(self, children: Seq[ExprIR], /) -> Iterator[ExprIR]:
        """Use when we want to expand non-root nodes, *without* duplicating the root.

        If we wrote:

            col("a").over(col("c", "d", "e"))

        Then the expanded version should be:

            col("a").over(col("c"), col("d"), col("e"))

        An **incorrect** output would cause an error without aliasing:

            col("a").over(col("c"))
            col("a").over(col("d"))
            col("a").over(col("e"))

        And cause an error if we needed to expand both sides:

            col("a", "b").over(col("c", "d", "e"))

        Since that would become:

            col("a").over(col("c"))
            col("b").over(col("d"))
            col(<MISSING>).over(col("e"))  # InvalidOperationError: cannot combine selectors that produce a different number of columns (3 != 2)
        """
        # used by
        # - `_expand_combination` (tuple fields)
        # - `_expand_function_expr` (horizontal)
        for child in children:
            yield from self._expand_recursive(child)

    def _expand_only(self, child: ExprIR, /) -> ExprIR:
        # used by
        # - `_expand_combination` (ExprIR fields)
        # - `_expand_function_expr` (all others that have len(inputs)>=2, call on non-root)
        iterable = self._expand_recursive(child)
        first = next(iterable)
        if second := next(iterable, None):
            msg = f"Multi-output expressions are not supported in this context, got: `{second!r}`"  # pragma: no cover
            raise MultiOutputExpressionError(msg)  # pragma: no cover
        return first

    # TODO @dangotbanned: It works, but all this class-specific branching belongs in the classes themselves
    def _expand_combination(self, origin: Combination, /) -> Iterator[Combination]:
        changes: dict[str, Any] = {}
        if isinstance(origin, (ir.WindowExpr, ir.Filter, ir.SortBy)):
            if isinstance(origin, ir.WindowExpr):
                if partition_by := origin.partition_by:
                    changes["partition_by"] = tuple(self._expand_inner(partition_by))
                if isinstance(origin, ir.OrderedWindowExpr):
                    changes["order_by"] = tuple(self._expand_inner(origin.order_by))
            elif isinstance(origin, ir.SortBy):
                changes["by"] = tuple(self._expand_inner(origin.by))
            else:
                changes["by"] = self._expand_only(origin.by)
            replaced = common.replace(origin, **changes)
            for root in self._expand_recursive(replaced.expr):
                yield common.replace(replaced, expr=root)
        elif isinstance(origin, ir.BinaryExpr):
            yield from self._expand_binary_expr(origin)
        elif isinstance(origin, ir.TernaryExpr):
            changes["truthy"] = self._expand_only(origin.truthy)
            changes["predicate"] = self._expand_only(origin.predicate)
            changes["falsy"] = self._expand_only(origin.falsy)
            yield origin.__replace__(**changes)
        else:
            assert_never(origin)

    def _expand_binary_expr(self, origin: ir.BinaryExpr, /) -> Iterator[ir.BinaryExpr]:
        it_lefts = self._expand_recursive(origin.left)
        it_rights = self._expand_recursive(origin.right)
        # NOTE: Fast-path that doesn't require collection
        # - Will miss selectors that expand to 1 column
        if not origin.meta.has_multiple_outputs():
            for left, right in zip_strict(it_lefts, it_rights):
                yield origin.__replace__(left=left, right=right)
            return
        # NOTE: Covers 1:1 (where either is a selector), N:N
        lefts, rights = tuple(it_lefts), tuple(it_rights)
        len_left, len_right = len(lefts), len(rights)
        if len_left == len_right:
            for left, right in zip_strict(lefts, rights):
                yield origin.__replace__(left=left, right=right)
        # NOTE: 1:M
        elif len_left == 1:
            binary = origin.__replace__(left=lefts[0])
            yield from (binary.__replace__(right=right) for right in rights)
        # NOTE: M:1
        elif len_right == 1:
            binary = origin.__replace__(right=rights[0])
            yield from (binary.__replace__(left=left) for left in lefts)
        else:
            raise binary_expr_multi_output_error(origin, lefts, rights)

    def _expand_function_expr(
        self, origin: ir.FunctionExpr, /
    ) -> Iterator[ir.FunctionExpr]:
        if origin.options.is_input_wildcard_expansion():
            reduced = tuple(self._expand_inner(origin.input))
            yield origin.__replace__(input=reduced)
        else:
            if non_root := origin.input[1:]:
                children = tuple(self._expand_only(child) for child in non_root)
            else:
                children = ()
            for root in self._expand_recursive(origin.input[0]):
                yield origin.__replace__(input=(root, *children))


_EXPAND_NONE = (ir.Column, ir.Literal, ir.Len)
"""We're at the root, nothing left to expand."""

_EXPAND_SINGLE = (ir.Alias, ir.Cast, ir.AggExpr, ir.Sort, ir.KeepName, ir.RenameAlias)
"""One (direct) child, always stored in `self.expr`.

An expansion will always just be cloning *everything but* `self.expr`,
we only need to be concerned with a **single** attribute.

Say we had:

    origin = Cast(expr=ByName(names=("one", "two"), require_all=True), dtype=String)

This would expand to:

    cast_one = Cast(expr=Column(name="one"), dtype=String)
    cast_two = Cast(expr=Column(name="two"), dtype=String)
"""

_EXPAND_COMBINATION = (
    ir.SortBy,
    ir.BinaryExpr,
    ir.TernaryExpr,
    ir.Filter,
    ir.OrderedWindowExpr,
    ir.WindowExpr,
)
"""More than one (direct) child and those can be nested."""
```
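The broadcast rules in `_expand_binary_expr` (N:N zips pairwise, 1:M and M:1 broadcast the single side, anything else errors) boil down to something like this simplified sketch, using plain strings in place of `ExprIR` nodes:

```python
def expand_binary(lefts: list[str], rights: list[str]) -> list[tuple[str, str]]:
    """Simplified sketch of the binary-expression broadcast rules."""
    if len(lefts) == len(rights):  # N:N (also covers 1:1)
        return list(zip(lefts, rights))
    if len(lefts) == 1:  # 1:M - broadcast the single left side
        return [(lefts[0], r) for r in rights]
    if len(rights) == 1:  # M:1 - broadcast the single right side
        return [(lf, rights[0]) for lf in lefts]
    msg = f"cannot combine selectors that produce a different number of columns ({len(lefts)} != {len(rights)})"
    raise ValueError(msg)


# col("a", "b") + col("c", "d") -> pairwise
assert expand_binary(["a", "b"], ["c", "d"]) == [("a", "c"), ("b", "d")]
# col("a") + col("c", "d", "e") -> broadcast the single left side
assert expand_binary(["a"], ["c", "d", "e"]) == [("a", "c"), ("a", "d"), ("a", "e")]
```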

There are some more examples in (#3233 (comment)) and (#3029 (comment)).

Also, this test features a particularly odd blend of the new features 😅

```python
def test_over_order_by_expr(data_alt: Data) -> None:
    df = dataframe(data_alt)
    result = df.select(
        nwp.all()
        + nwp.all().last().over(order_by=[nwp.nth(1), ncs.first()], descending=True)
    )
    expected = {"a": [6, 8, 4, 5, None], "b": [0, 1, 3, 2, 1], "c": [18, 10, 11, 10, 10]}
    assert_equal_data(result, expected)
```

Testing

Speaking of tests ...

A LOT of the positive diff (+2013/3125) comes from a major effort to improve test coverage.
I focused on selectors and expression-expansion the most, where many of the tests are adapted directly from the polars test suite.

Writing those got repetitive quickly, so I wrapped some bits up in a new testing util and sprinkled in some examples:

Frame

```python
class Frame:
    """Schema-only `{Expr,Selector}` projection testing tool.

    Arguments:
        schema: A Narwhals Schema.

    Examples:
        >>> import narwhals as nw
        >>> import narwhals._plan.selectors as ncs
        >>> df = Frame.from_mapping(
        ...     {
        ...         "abc": nw.UInt16(),
        ...         "bbb": nw.UInt32(),
        ...         "cde": nw.Float64(),
        ...         "def": nw.Float32(),
        ...         "eee": nw.Boolean(),
        ...     }
        ... )

        Determine the column names that expression input would select

        >>> df.project_names(ncs.numeric() - ncs.by_index(1, 2))
        ('abc', 'def')

        Assert an expression selects names in a given order

        >>> df.assert_selects(ncs.by_name("eee", "abc"), "eee", "abc")

        Raising a helpful error if something went wrong

        >>> df.assert_selects(ncs.duration(), "eee", "abc")
        Traceback (most recent call last):
        AssertionError: Projected column names do not match expected names:
        result : ()
        expected: ('eee', 'abc')
    """

    def __init__(self, schema: nw.Schema) -> None:
        self.schema = schema
        self.columns = tuple(schema.names())

    @staticmethod
    def from_mapping(mapping: IntoSchema) -> Frame:
        """Construct from inputs accepted in `nw.Schema`."""
        return Frame(nw.Schema(mapping))

    @staticmethod
    def from_names(*column_names: str) -> Frame:
        """Construct with all `nw.Int64()`."""
        return Frame(nw.Schema((name, nw.Int64()) for name in column_names))

    @property
    def width(self) -> int:
        """Get the number of columns in the schema."""
        return len(self.columns)

    def project(
        self, exprs: OneOrIterable[IntoExpr], *more_exprs: IntoExpr
    ) -> Seq[ir.NamedIR]:
        """Parse and expand expressions into named representations.

        Arguments:
            exprs: Column(s) to select. Accepts expression input. Strings are parsed
                as column names, other non-expression inputs are parsed as literals.
            *more_exprs: Column(s) to select, specified as positional arguments.

        Note:
            `NamedIR` is the form of expression passed to the compliant-level.

        Examples:
            >>> import datetime as dt
            >>> import narwhals._plan.selectors as ncs
            >>> df = Frame.from_names("a", "b", "c", "d", "idx1", "idx2")
            >>> expr_1 = (
            ...     ncs.by_name("a", "d")
            ...     .first()
            ...     .over(ncs.by_index(range(1, 4)), order_by=ncs.matches(r"idx"))
            ... )
            >>> expr_2 = (ncs.by_name("a") | ncs.by_index(2)).abs().name.suffix("_abs")
            >>> expr_3 = dt.date(2000, 1, 1)
            >>> df.project(expr_1, expr_2, expr_3)  # doctest: +NORMALIZE_WHITESPACE
            (a=col('a').first().over(partition_by=[col('b'), col('c'), col('d')], order_by=[col('idx1'), col('idx2')]),
             d=col('d').first().over(partition_by=[col('b'), col('c'), col('d')], order_by=[col('idx1'), col('idx2')]),
             a_abs=col('a').abs(),
             c_abs=col('c').abs(),
             literal=lit(date: 2000-01-01))
        """
        expr_irs = _parse.parse_into_seq_of_expr_ir(exprs, *more_exprs)
        named_irs, _ = _expansion.prepare_projection(expr_irs, schema=self.schema)
        return named_irs

    def project_names(self, *exprs: IntoExpr) -> Seq[str]:
        named_irs = self.project(*exprs)
        return tuple(e.name for e in named_irs)

    def assert_selects(self, selector: Selector | Expr, *column_names: str) -> None:
        result = self.project_names(selector)
        expected = column_names
        assert result == expected, (
            f"Projected column names do not match expected names:\n"
            f"result : {result!r}\n"
            f"expected: {expected!r}"
        )
```

I'm quite happy with how readable these tests are 🙂

```python
def test_selector_array(schema_nested_2: nw.Schema) -> None:
    df = Frame(schema_nested_2)
    df.assert_selects(ncs.array(), "b", "c", "d", "f")
    df.assert_selects(ncs.array(ncs.all()), "b", "c", "d", "f")
    df.assert_selects(ncs.array(size=4), "b", "c", "f")
    df.assert_selects(ncs.array(inner=ncs.integer()), "b", "c", "d")
    df.assert_selects(ncs.array(inner=ncs.string()), "f")
```
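Selector set algebra like `ncs.numeric() - ncs.by_index(1, 2)` can be modelled as predicates over `(name, dtype)` pairs, evaluated in schema order. A minimal sketch with hypothetical helpers (not the narwhals implementation), mirroring the `Frame` docstring example above:

```python
from typing import Callable

schema = {"abc": "u16", "bbb": "u32", "cde": "f64", "def": "f32", "eee": "bool"}
Predicate = Callable[[str, str], bool]


def select(pred: Predicate) -> tuple[str, ...]:
    """Evaluate a selector predicate against the schema, in schema order."""
    return tuple(n for n, dt in schema.items() if pred(n, dt))


def numeric(_name: str, dtype: str) -> bool:
    return dtype in {"u16", "u32", "f64", "f32"}


def by_index(*idxs: int) -> Predicate:
    names = list(schema)
    return lambda name, _dtype: names.index(name) in idxs


def difference(a: Predicate, b: Predicate) -> tuple[str, ...]:
    """`a - b`: keep a's selection order, remove anything b selects."""
    removed = set(select(b))
    return tuple(n for n in select(a) if n not in removed)


# Mirrors `ncs.numeric() - ncs.by_index(1, 2)` selecting ('abc', 'def')
assert difference(numeric, by_index(1, 2)) == ("abc", "def")
```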

*@dangotbanned changed the base branch from main to expr-ir/over-and-over-and-over-again — October 23, 2025 10:12*

Assorted notes from the thread:

  • Have quite a few more to port and almost all follow the same format
  • This is a subset of the actual test suite, but still a very big subset
  • What's here so far has already caught a few bugs, I'm sure there's more to come (sadly?)
  • I'm sure I've done this at least once before 😭
  • Knowing that these return a selector requires knowing the special-casing for `col`, but that can't be known statically without making `Expr` generic (not happening)
  • This also simplifies the compliant-level, since resolving names is part of expansion
  • Heavily based on what `polars` does
  • Need to rewrite `ArrowExpr.over_ordered` to make use of the simplification. Maybe something like loop `order_by` and use `meta.as_selector`?
*@dangotbanned changed the title from "refactor(expr-ir): Even concrete-ier Selectors" to "feat(expr-ir): The big Selector overhaul" — November 3, 2025*
*@dangotbanned marked this pull request as ready for review — November 3, 2025 19:51*
*@dangotbanned merged commit d9b9b3f into expr-ir/over-and-over-and-over-again — November 4, 2025 (40 checks passed)*
*@dangotbanned deleted the expr-ir/strict-selectors branch — November 4, 2025 21:25*

Labels

`enhancement` (New feature or request), `internal`, `tests`
