@dangotbanned dangotbanned commented Oct 23, 2025

Related issues

Upstream

Motivation

Adapting the Selectors implementation to be more like (pola-rs/polars#23351) has been on my mind for a while now:

My most recent woe in (#3224) - when trying to allow `ColumnNameOrSelector` in more places - is that not all column-selection expressions have a selector equivalent:

  • `col(s)` doesn't translate to `cs.by_name`
    • It should be `cs.by_name(..., require_all=True)`
  • `{nth,index_columns}` would have the same issue
    • But `cs.by_index` is missing

That poses a problem for implicitly converting `str` (`ColumnName*`) into a `Selector` (`*OrSelector`) under the hood.
A user can request columns that don't exist, and selector semantics (empty sets are okay in most cases) allow it:
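The strict/relaxed distinction can be sketched in plain Python. This is a hypothetical `resolve_names` helper (not the narwhals API), showing why selector-style resolution can silently produce an empty selection while a strict `require_all=True` mode raises:

```python
def resolve_names(
    schema: dict[str, str], names: list[str], *, require_all: bool = True
) -> list[str]:
    """Resolve requested column names against a schema.

    Strict mode raises on missing columns; relaxed (selector-style) mode
    silently drops them, which is how an empty result becomes possible.
    """
    missing = [n for n in names if n not in schema]
    if require_all and missing:
        raise KeyError(f"columns not found: {missing}")
    return [n for n in names if n in schema]


schema = {"a": "Int64", "b": "String"}
# Selector semantics: missing names drop out, possibly down to an empty set
assert resolve_names(schema, ["a", "z"], require_all=False) == ["a"]
assert resolve_names(schema, ["z"], require_all=False) == []
```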

Tests

```python
# TODO @dangotbanned: Stricter selectors
@pytest.mark.xfail(
    reason="TODO: Handle missing columns in `strict`/`require_all` selectors."
)
def test_partition_by_missing_names(data: Data) -> None:  # pragma: no cover
    df = dataframe(data)
    with pytest.raises(ColumnNotFoundError, match=r"\"d\""):
        df.partition_by("d")
    with pytest.raises(ColumnNotFoundError, match=r"\"e\""):
        df.partition_by("c", "e")


def test_partition_by_fully_empty_selector(data: Data) -> None:
    df = dataframe(data)
    with pytest.raises(
        ComputeError, match=r"at least one key is required in a group_by operation"
    ):
        df.partition_by(ncs.array(ncs.numeric()), ncs.struct(), ncs.duration())


# NOTE: Matching polars behavior
def test_partition_by_partially_missing_selector(data: Data) -> None:
    df = dataframe(data)
    results = df.partition_by(ncs.string() | ncs.list() | ncs.enum())
    expected = nw.Schema({"a": nw.String(), "b": nw.Int64(), "c": nw.Int64()})
    for df in results:
        assert df.schema == expected
```

> [!NOTE]
> TL;DR: This PR is about closing these gaps and finally addressing a TODO from 4-5 months ago 😄

Show Tasks

🏆 Highlights

Selectors

Reimplemented Selectors to align with (pola-rs/polars#23351).
TL;DR: all of the following can now be used independently of backend/version:

  • cs.by_{index,name}(require_all: bool)
  • cs.{array,list}(inner: Selector | None)
  • cs.empty
  • cs.enum
  • cs.first
  • cs.float
  • cs.integer
  • cs.last
  • cs.struct
  • cs.temporal

All Selectors (including `nw.nth`, `nw.all`, `nw.exclude`) and `nw.col` can now be used in *Frame-level methods, which so far include:

  • BaseFrame.sort
  • BaseFrame.drop
  • BaseFrame.drop_nulls
  • DataFrame.partition_by (started in #3224)

> [!NOTE]
> All of this is opaque to the compliant level, which continues to receive already-resolved column names.
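As a rough sketch of that boundary (hypothetical names, not the actual internals): a frame-level method expands selector input into concrete column names up-front, and only those names cross into the compliant level.

```python
from typing import Callable

# A selector modelled as a predicate over (name, dtype) pairs
Predicate = Callable[[str, str], bool]


def drop(schema: dict[str, str], *selectors: Predicate) -> dict[str, str]:
    """Expand selectors to names first, then 'dispatch' using names only."""
    to_drop: set[str] = set()
    for sel in selectors:
        to_drop.update(n for n, dt in schema.items() if sel(n, dt))
    # Everything below this point (the "compliant level") sees only `to_drop`
    return {n: dt for n, dt in schema.items() if n not in to_drop}


schema = {"a": "i64", "b": "str", "c": "f64"}
is_float: Predicate = lambda _name, dtype: dtype == "f64"
assert drop(schema, is_float) == {"a": "i64", "b": "str"}
```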

Well-defined Expansion Rules

I ran into some issues relating to (pola-rs/polars#25022), (pola-rs/polars#21773), (narwhals-dev/narwhals#3029)

The precise details of what expands and how can be found in this fancy new class:

Expander

```python
class Expander:
    __slots__ = ("ignored", "schema")

    schema: FrozenSchema
    ignored: Ignored

    def __init__(self, scope: IntoFrozenSchema, ignored: Ignored = ()) -> None:
        self.schema = freeze_schema(scope)
        self.ignored = ignored

    def iter_expand_exprs(self, exprs: Iterable[ExprIR], /) -> Iterator[ExprIR]:
        # Iteratively expand all of exprs
        for expr in exprs:
            yield from self._expand(expr)

    def iter_expand_selector_names(
        self, selectors: Iterable[SelectorIR], /
    ) -> Iterator[str]:
        for s in selectors:
            yield from s.iter_expand_names(self.schema, self.ignored)

    def prepare_projection(self, exprs: Collection[ExprIR], /) -> Seq[NamedIR]:
        output_names = deque[str]()
        named_irs = deque[NamedIR]()
        root_names = set[str]()
        # NOTE: Collecting here isn't ideal (perf-wise), but the expanded `ExprIR`s
        # have more useful information to add in an error message.
        # Another option could be keeping things lazy, but repeating the work for
        # the error case? That way, there isn't a cost paid on the happy path - and
        # it doesn't matter when we're raising if we take our time displaying the message
        expanded = tuple(self.iter_expand_exprs(exprs))
        for e in expanded:
            # NOTE: Empty string is allowed as a name, but is falsy
            if (name := e.meta.output_name(raise_if_undetermined=False)) is not None:
                target = e
            elif meta.has_expr_ir(e, KeepName):
                replaced = replace_keep_name(e)
                name = replaced.meta.output_name()
                target = replaced
            else:
                msg = f"Unable to determine output name for expression, got: `{e!r}`"
                raise NotImplementedError(msg)
            output_names.append(name)
            named_irs.append(ir.named_ir(name, remove_alias(target)))
            root_names.update(meta.iter_root_names(e))
        if len(output_names) != len(set(output_names)):
            raise duplicate_error(expanded)
        if not (set(self.schema).issuperset(root_names)):
            raise column_not_found_error(root_names, self.schema)
        return tuple(named_irs)

    def _expand(self, expr: ExprIR, /) -> Iterator[ExprIR]:
        # For a single expr, fully expand all parts of it
        if all(not e.needs_expansion() for e in expr.iter_left()):
            yield expr
        else:
            yield from self._expand_recursive(expr)

    def _expand_recursive(self, origin: ExprIR, /) -> Iterator[ExprIR]:
        # Dispatch the kind of expansion, based on the type of expr.
        # Every other method will call back here.
        # Based on https://github.com/pola-rs/polars/blob/5b90db75911c70010d0c0a6941046e6144af88d4/crates/polars-plan/src/plans/conversion/dsl_to_ir/expr_expansion.rs#L253-L850
        if isinstance(origin, _EXPAND_NONE):
            yield origin
        elif isinstance(origin, ir.SelectorIR):
            names = origin.iter_expand_names(self.schema, self.ignored)
            yield from (ir.Column(name=name) for name in names)
        elif isinstance(origin, _EXPAND_SINGLE):
            for expr in self._expand_recursive(origin.expr):
                yield origin.__replace__(expr=expr)
        elif isinstance(origin, _EXPAND_COMBINATION):
            yield from self._expand_combination(origin)
        elif isinstance(origin, ir.FunctionExpr):
            yield from self._expand_function_expr(origin)
        else:
            msg = f"Didn't expect to see {type(origin).__name__}"
            raise NotImplementedError(msg)

    def _expand_inner(self, children: Seq[ExprIR], /) -> Iterator[ExprIR]:
        """Use when we want to expand non-root nodes, *without* duplicating the root.

        If we wrote:

            col("a").over(col("c", "d", "e"))

        Then the expanded version should be:

            col("a").over(col("c"), col("d"), col("e"))

        An **incorrect** output would cause an error without aliasing:

            col("a").over(col("c"))
            col("a").over(col("d"))
            col("a").over(col("e"))

        And cause an error if we needed to expand both sides:

            col("a", "b").over(col("c", "d", "e"))

        Since that would become:

            col("a").over(col("c"))
            col("b").over(col("d"))
            col(<MISSING>).over(col("e"))  # InvalidOperationError: cannot combine selectors that produce a different number of columns (3 != 2)
        """
        # used by
        # - `_expand_combination` (tuple fields)
        # - `_expand_function_expr` (horizontal)
        for child in children:
            yield from self._expand_recursive(child)

    def _expand_only(self, child: ExprIR, /) -> ExprIR:
        # used by
        # - `_expand_combination` (ExprIR fields)
        # - `_expand_function_expr` (all others that have len(inputs)>=2, call on non-root)
        iterable = self._expand_recursive(child)
        first = next(iterable)
        if second := next(iterable, None):
            msg = f"Multi-output expressions are not supported in this context, got: `{second!r}`"  # pragma: no cover
            raise MultiOutputExpressionError(msg)  # pragma: no cover
        return first

    # TODO @dangotbanned: It works, but all this class-specific branching belongs in the classes themselves
    def _expand_combination(self, origin: Combination, /) -> Iterator[Combination]:
        changes: dict[str, Any] = {}
        if isinstance(origin, (ir.WindowExpr, ir.Filter, ir.SortBy)):
            if isinstance(origin, ir.WindowExpr):
                if partition_by := origin.partition_by:
                    changes["partition_by"] = tuple(self._expand_inner(partition_by))
                if isinstance(origin, ir.OrderedWindowExpr):
                    changes["order_by"] = tuple(self._expand_inner(origin.order_by))
            elif isinstance(origin, ir.SortBy):
                changes["by"] = tuple(self._expand_inner(origin.by))
            else:
                changes["by"] = self._expand_only(origin.by)
            replaced = common.replace(origin, **changes)
            for root in self._expand_recursive(replaced.expr):
                yield common.replace(replaced, expr=root)
        elif isinstance(origin, ir.BinaryExpr):
            yield from self._expand_binary_expr(origin)
        elif isinstance(origin, ir.TernaryExpr):
            changes["truthy"] = self._expand_only(origin.truthy)
            changes["predicate"] = self._expand_only(origin.predicate)
            changes["falsy"] = self._expand_only(origin.falsy)
            yield origin.__replace__(**changes)
        else:
            assert_never(origin)

    def _expand_binary_expr(self, origin: ir.BinaryExpr, /) -> Iterator[ir.BinaryExpr]:
        it_lefts = self._expand_recursive(origin.left)
        it_rights = self._expand_recursive(origin.right)
        # NOTE: Fast-path that doesn't require collection
        # - Will miss selectors that expand to 1 column
        if not origin.meta.has_multiple_outputs():
            for left, right in zip_strict(it_lefts, it_rights):
                yield origin.__replace__(left=left, right=right)
            return
        # NOTE: Covers 1:1 (where either is a selector), N:N
        lefts, rights = tuple(it_lefts), tuple(it_rights)
        len_left, len_right = len(lefts), len(rights)
        if len_left == len_right:
            for left, right in zip_strict(lefts, rights):
                yield origin.__replace__(left=left, right=right)
        # NOTE: 1:M
        elif len_left == 1:
            binary = origin.__replace__(left=lefts[0])
            yield from (binary.__replace__(right=right) for right in rights)
        # NOTE: M:1
        elif len_right == 1:
            binary = origin.__replace__(right=rights[0])
            yield from (binary.__replace__(left=left) for left in lefts)
        else:
            raise binary_expr_multi_output_error(origin, lefts, rights)

    def _expand_function_expr(
        self, origin: ir.FunctionExpr, /
    ) -> Iterator[ir.FunctionExpr]:
        if origin.options.is_input_wildcard_expansion():
            reduced = tuple(self._expand_inner(origin.input))
            yield origin.__replace__(input=reduced)
        else:
            if non_root := origin.input[1:]:
                children = tuple(self._expand_only(child) for child in non_root)
            else:
                children = ()
            for root in self._expand_recursive(origin.input[0]):
                yield origin.__replace__(input=(root, *children))


_EXPAND_NONE = (ir.Column, ir.Literal, ir.Len)
"""We're at the root, nothing left to expand."""

_EXPAND_SINGLE = (ir.Alias, ir.Cast, ir.AggExpr, ir.Sort, ir.KeepName, ir.RenameAlias)
"""One (direct) child, always stored in `self.expr`.

An expansion will always just be cloning *everything but* `self.expr`,
we only need to be concerned with a **single** attribute.

Say we had:

    origin = Cast(expr=ByName(names=("one", "two"), require_all=True), dtype=String)

This would expand to:

    cast_one = Cast(expr=Column(name="one"), dtype=String)
    cast_two = Cast(expr=Column(name="two"), dtype=String)
"""

_EXPAND_COMBINATION = (
    ir.SortBy,
    ir.BinaryExpr,
    ir.TernaryExpr,
    ir.Filter,
    ir.OrderedWindowExpr,
    ir.WindowExpr,
)
"""More than one (direct) child and those can be nested."""
```
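The broadcast rules in `_expand_binary_expr` (N:N zips pairwise, 1:M and M:1 broadcast the single side, anything else errors) boil down to something like this simplified sketch, using plain strings in place of `ExprIR` nodes:

```python
def expand_binary(lefts: list[str], rights: list[str]) -> list[tuple[str, str]]:
    """Simplified sketch of the binary-expression broadcast rules."""
    if len(lefts) == len(rights):  # N:N (also covers 1:1)
        return list(zip(lefts, rights))
    if len(lefts) == 1:  # 1:M - broadcast the single left side
        return [(lefts[0], r) for r in rights]
    if len(rights) == 1:  # M:1 - broadcast the single right side
        return [(lf, rights[0]) for lf in lefts]
    msg = f"cannot combine selectors that produce a different number of columns ({len(lefts)} != {len(rights)})"
    raise ValueError(msg)


# col("a", "b") + col("c", "d") -> pairwise
assert expand_binary(["a", "b"], ["c", "d"]) == [("a", "c"), ("b", "d")]
# col("a") + col("c", "d", "e") -> broadcast the single left side
assert expand_binary(["a"], ["c", "d", "e"]) == [("a", "c"), ("a", "d"), ("a", "e")]
```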

There are some more examples in (#3233 (comment)) and (#3029 (comment)).

Also, this test features a particularly odd blend of the new features 😅

```python
def test_over_order_by_expr(data_alt: Data) -> None:
    df = dataframe(data_alt)
    result = df.select(
        nwp.all()
        + nwp.all().last().over(order_by=[nwp.nth(1), ncs.first()], descending=True)
    )
    expected = {"a": [6, 8, 4, 5, None], "b": [0, 1, 3, 2, 1], "c": [18, 10, 11, 10, 10]}
    assert_equal_data(result, expected)
```

Testing

Speaking of tests ...

A LOT of the positive diff (+2013/3125) comes from a major effort to improve test coverage.
I focused on selectors and expression-expansion the most, where many of the tests are adapted directly from the polars test suite.

Writing those got repetitive quickly, so I wrapped some bits up in a new testing util and sprinkled in some examples:

Frame

```python
class Frame:
    """Schema-only `{Expr,Selector}` projection testing tool.

    Arguments:
        schema: A Narwhals Schema.

    Examples:
        >>> import narwhals as nw
        >>> import narwhals._plan.selectors as ncs
        >>> df = Frame.from_mapping(
        ...     {
        ...         "abc": nw.UInt16(),
        ...         "bbb": nw.UInt32(),
        ...         "cde": nw.Float64(),
        ...         "def": nw.Float32(),
        ...         "eee": nw.Boolean(),
        ...     }
        ... )

        Determine the column names that expression input would select

        >>> df.project_names(ncs.numeric() - ncs.by_index(1, 2))
        ('abc', 'def')

        Assert an expression selects names in a given order

        >>> df.assert_selects(ncs.by_name("eee", "abc"), "eee", "abc")

        Raising a helpful error if something went wrong

        >>> df.assert_selects(ncs.duration(), "eee", "abc")
        Traceback (most recent call last):
        AssertionError: Projected column names do not match expected names:
        result : ()
        expected: ('eee', 'abc')
    """

    def __init__(self, schema: nw.Schema) -> None:
        self.schema = schema
        self.columns = tuple(schema.names())

    @staticmethod
    def from_mapping(mapping: IntoSchema) -> Frame:
        """Construct from inputs accepted in `nw.Schema`."""
        return Frame(nw.Schema(mapping))

    @staticmethod
    def from_names(*column_names: str) -> Frame:
        """Construct with all `nw.Int64()`."""
        return Frame(nw.Schema((name, nw.Int64()) for name in column_names))

    @property
    def width(self) -> int:
        """Get the number of columns in the schema."""
        return len(self.columns)

    def project(
        self, exprs: OneOrIterable[IntoExpr], *more_exprs: IntoExpr
    ) -> Seq[ir.NamedIR]:
        """Parse and expand expressions into named representations.

        Arguments:
            exprs: Column(s) to select. Accepts expression input. Strings are parsed
                as column names, other non-expression inputs are parsed as literals.
            *more_exprs: Column(s) to select, specified as positional arguments.

        Note:
            `NamedIR` is the form of expression passed to the compliant-level.

        Examples:
            >>> import datetime as dt
            >>> import narwhals._plan.selectors as ncs
            >>> df = Frame.from_names("a", "b", "c", "d", "idx1", "idx2")
            >>> expr_1 = (
            ...     ncs.by_name("a", "d")
            ...     .first()
            ...     .over(ncs.by_index(range(1, 4)), order_by=ncs.matches(r"idx"))
            ... )
            >>> expr_2 = (ncs.by_name("a") | ncs.by_index(2)).abs().name.suffix("_abs")
            >>> expr_3 = dt.date(2000, 1, 1)
            >>> df.project(expr_1, expr_2, expr_3)  # doctest: +NORMALIZE_WHITESPACE
            (a=col('a').first().over(partition_by=[col('b'), col('c'), col('d')], order_by=[col('idx1'), col('idx2')]),
             d=col('d').first().over(partition_by=[col('b'), col('c'), col('d')], order_by=[col('idx1'), col('idx2')]),
             a_abs=col('a').abs(),
             c_abs=col('c').abs(),
             literal=lit(date: 2000-01-01))
        """
        expr_irs = _parse.parse_into_seq_of_expr_ir(exprs, *more_exprs)
        named_irs, _ = _expansion.prepare_projection(expr_irs, schema=self.schema)
        return named_irs

    def project_names(self, *exprs: IntoExpr) -> Seq[str]:
        named_irs = self.project(*exprs)
        return tuple(e.name for e in named_irs)

    def assert_selects(self, selector: Selector | Expr, *column_names: str) -> None:
        result = self.project_names(selector)
        expected = column_names
        assert result == expected, (
            f"Projected column names do not match expected names:\n"
            f"result : {result!r}\n"
            f"expected: {expected!r}"
        )
```

I'm quite happy with how readable these tests are 🙂

```python
def test_selector_array(schema_nested_2: nw.Schema) -> None:
    df = Frame(schema_nested_2)
    df.assert_selects(ncs.array(), "b", "c", "d", "f")
    df.assert_selects(ncs.array(ncs.all()), "b", "c", "d", "f")
    df.assert_selects(ncs.array(size=4), "b", "c", "f")
    df.assert_selects(ncs.array(inner=ncs.integer()), "b", "c", "d")
    df.assert_selects(ncs.array(inner=ncs.string()), "f")
```
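Selector set algebra like `ncs.numeric() - ncs.by_index(1, 2)` can be modelled as predicates over `(name, dtype)` pairs, evaluated in schema order. A minimal sketch with hypothetical helpers (not the narwhals implementation), mirroring the `Frame` docstring example above:

```python
from typing import Callable

schema = {"abc": "u16", "bbb": "u32", "cde": "f64", "def": "f32", "eee": "bool"}
Predicate = Callable[[str, str], bool]


def select(pred: Predicate) -> tuple[str, ...]:
    """Evaluate a selector predicate against the schema, in schema order."""
    return tuple(n for n, dt in schema.items() if pred(n, dt))


def numeric(_name: str, dtype: str) -> bool:
    return dtype in {"u16", "u32", "f64", "f32"}


def by_index(*idxs: int) -> Predicate:
    names = list(schema)
    return lambda name, _dtype: names.index(name) in idxs


def difference(a: Predicate, b: Predicate) -> tuple[str, ...]:
    """`a - b`: keep a's selection order, remove anything b selects."""
    removed = set(select(b))
    return tuple(n for n in select(a) if n not in removed)


# Mirrors `ncs.numeric() - ncs.by_index(1, 2)` selecting ('abc', 'def')
assert difference(numeric, by_index(1, 2)) == ("abc", "def")
```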

*@dangotbanned changed the base branch from main to expr-ir/over-and-over-and-over-again — October 23, 2025 10:12*

Assorted notes from the thread:

  • Have quite a few more to port and almost all follow the same format
  • This is a subset of the actual test suite, but still a very big subset
  • What's here so far has already caught a few bugs, I'm sure there's more to come (sadly?)
  • I'm sure I've done this at least once before 😭
  • Knowing that these return a selector requires knowing the special-casing for `col`, but that can't be known statically without making `Expr` generic (not happening)
  • This also simplifies the compliant-level, since resolving names is part of expansion
  • Heavily based on what `polars` does
  • Need to rewrite `ArrowExpr.over_ordered` to make use of the simplification. Maybe something like loop `order_by` and use `meta.as_selector`?
*@dangotbanned changed the title from "refactor(expr-ir): Even concrete-ier Selectors" to "feat(expr-ir): The big Selector overhaul" — November 3, 2025*
*@dangotbanned marked this pull request as ready for review — November 3, 2025 19:51*
*@dangotbanned merged commit d9b9b3f into expr-ir/over-and-over-and-over-again — November 4, 2025 (40 checks passed)*
*@dangotbanned deleted the expr-ir/strict-selectors branch — November 4, 2025 21:25*

Labels

`enhancement` (New feature or request), `internal`, `tests`
