feat(expr-ir): Acero `order_by`, `hashjoin` , `DataFrame.{filter,join}`, `Expr.is_{first,last}_distinct` #3173

dangotbanned · 2025-10-02T16:08:41Z

Tracking

Bug: DataFrame.filter silently ignores **constraints when using list[bool] #3182

Related issues

Child of feat(RFC): A richer Expr IR #2572
Follow-up to feat(expr-ir): Support group_by, utilize pyarrow.acero #3143

Description

Note

I've used the name `sort_by` for our wrapper of `order_by`.
The node corresponds to `pa.Table.sort_by`, whereas the name `order_by` doesn't appear anywhere else in pyarrow.

Building out more acero parts to be able to support .over(*partition_by)
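
For readers unfamiliar with the shape involved, here is a minimal sketch of how `by`/`descending`/`nulls_last` arguments could be turned into Arrow-style sort keys. The helper names `to_sort_keys` and `null_placement` are hypothetical and are not taken from this PR; the internals of `SortMultipleOptions` are not shown here. The `(name, "ascending" | "descending")` pairs are the form `pa.Table.sort_by` accepts, and `"at_start"`/`"at_end"` are the null placement values pyarrow uses.

```python
from __future__ import annotations

from collections.abc import Iterable, Sequence
from typing import Literal

Order = Literal["ascending", "descending"]


def to_sort_keys(
    by: Sequence[str], descending: bool | Iterable[bool] = False
) -> list[tuple[str, Order]]:
    """Pair each column name with a sort direction (hypothetical helper)."""
    if isinstance(descending, bool):
        # A single flag applies to every key.
        flags = [descending] * len(by)
    else:
        flags = list(descending)
        if len(flags) != len(by):
            msg = f"Expected {len(by)} `descending` flags, got {len(flags)}"
            raise ValueError(msg)
    return [
        (name, "descending" if desc else "ascending")
        for name, desc in zip(by, flags)
    ]


def null_placement(nulls_last: bool) -> Literal["at_start", "at_end"]:
    """Translate a `nulls_last` flag into a null placement string."""
    return "at_end" if nulls_last else "at_start"
```

Broadcasting a single `descending` flag across all keys is an assumption about the intended semantics, mirroring the `OneOrIterable[bool]` annotation in the wrapper's signature.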

dangotbanned · 2025-10-02T16:13:06Z

narwhals/_plan/common.py

+# NOTE: See (https://github.com/microsoft/pyright/issues/10673#issuecomment-3033789021)
 # The issue is `T` possibly being `Iterable`
 # Ignoring here still leaks the issue to the caller, where you need to annotate the base case
+@overload
+def flatten_hash_safe(iterable: Iterable[OneOrIterable[str]], /) -> Iterator[str]: ...


It's an improvement over the previous version, but far from ideal.

Still doesn't resolve this case, and I'm not entirely sure why yet

narwhals/narwhals/_plan/compliant/column.py

Lines 49 to 60 in f77bb4c

    @classmethod
    def align(
        cls, *exprs: OneOrIterable[SupportsBroadcast[SeriesT, LengthT]]
    ) -> Iterator[SeriesT]:
        exprs = tuple[SupportsBroadcast[SeriesT, LengthT], ...](flatten_hash_safe(exprs))
        length = cls._length_required(exprs)
        if length is None:
            for e in exprs:
                yield e.to_series()
        else:
            for e in exprs:
                yield e.broadcast(length)
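
To make the typing problem concrete, this is a rough sketch of the kind of one-level flattening being annotated. It is not the actual `flatten_hash_safe` implementation (which isn't shown in this thread); the name `flatten_one_level` and the one-level, `str`-is-atomic behavior are assumptions for illustration. The difficulty mentioned above comes from an element type that may itself be `Iterable`, which is why the runtime check has to special-case `str`.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from typing import Any


def flatten_one_level(iterable: Iterable[Any], /) -> Iterator[Any]:
    """Flatten one level of nesting, keeping `str` and `bytes` atomic."""
    for element in iterable:
        if isinstance(element, Iterable) and not isinstance(element, (str, bytes)):
            # A nested iterable of items: yield each item.
            yield from element
        else:
            # A single item (including strings): yield it unchanged.
            yield element
```

A call shaped like `flatten_one_level((by, more_by))` then accepts either a single name or an iterable of names for `by`, followed by any number of extra names.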

dangotbanned · 2025-10-02T16:18:35Z

narwhals/_plan/arrow/acero.py

+def sort_by(
+    by: OneOrIterable[str],
+    *more_by: str,
+    descending: OneOrIterable[bool] = False,
+    nulls_last: bool = False,
+) -> Decl:
+    return SortMultipleOptions.parse(
+        descending=descending, nulls_last=nulls_last
+    ).to_arrow_acero(tuple(flatten_hash_safe((by, more_by))))


As of feat(expr-ir): Impl acero.sort_by, I still need to make use of this in a plan.

A good candidate might be in either/both of

over(order_by=...)

narwhals/narwhals/_plan/arrow/expr.py

Lines 328 to 350 in f77bb4c

    def over_ordered(
        self, node: ir.OrderedWindowExpr, frame: Frame, name: str
    ) -> Self | Scalar:
        if node.partition_by:
            msg = f"Need to implement `group_by`, `join` for:\n{node!r}"
            raise NotImplementedError(msg)

        # NOTE: Converting `over(order_by=..., options=...)` into the right shape for `DataFrame.sort`
        sort_by = tuple(NamedIR.from_ir(e) for e in node.order_by)
        options = node.sort_options.to_multiple(len(node.order_by))
        idx_name = temp.column_name(frame)
        sorted_context = frame.with_row_index(idx_name).sort(sort_by, options)
        evaluated = node.expr.dispatch(self, sorted_context.drop([idx_name]), name)
        if isinstance(evaluated, ArrowScalar):
            # NOTE: We're already sorted, defer broadcasting to the outer context
            # Wouldn't be suitable for partitions, but will be fine here
            # - https://github.com/narwhals-dev/narwhals/pull/2528/commits/2ae42458cae91f4473e01270919815fcd7cb9667
            # - https://github.com/narwhals-dev/narwhals/pull/2528/commits/b8066c4c57d4b0b6c38d58a0f5de05eefc2cae70
            return self._with_native(evaluated.native, name)
        indices = pc.sort_indices(sorted_context.get_column(idx_name).native)
        height = len(sorted_context)
        result = evaluated.broadcast(height).native.take(indices)
        return self._with_native(result, name)
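
The sort, evaluate, restore-order steps in `over_ordered` can be illustrated without pyarrow. The following is a plain-Python sketch under simplifying assumptions (a single ascending sort key, list inputs, a length-preserving function); `evaluate_ordered` and `argsort` are hypothetical names, not part of the codebase.

```python
from __future__ import annotations

from collections.abc import Callable, Sequence
from itertools import accumulate
from typing import Any


def argsort(keys: Sequence[Any]) -> list[int]:
    """Return the indices that would sort `keys` (stable)."""
    return sorted(range(len(keys)), key=keys.__getitem__)


def evaluate_ordered(
    values: Sequence[Any],
    order_by: Sequence[Any],
    func: Callable[[list[Any]], list[Any]],
) -> list[Any]:
    """Evaluate `func` on `values` sorted by `order_by`, then undo the sort."""
    # 1. Sort the original row indices by the ordering key.
    row_index_sorted = argsort(order_by)
    # 2. Evaluate in the sorted context.
    evaluated = func([values[i] for i in row_index_sorted])
    # 3. Sorting the permuted row index gives the inverse permutation;
    #    taking with it puts every result back on its original row.
    inverse = argsort(row_index_sorted)
    return [evaluated[i] for i in inverse]


def cum_sum(xs: list[int]) -> list[int]:
    return list(accumulate(xs))
```

For example, `evaluate_ordered([10, 20, 30], [3, 1, 2], cum_sum)` gives `[60, 20, 50]`: the row with the smallest key starts the running total, and each result ends up back on the row it was computed for.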

is_{first,last}_distinct

test: Port over is_first_distinct tests

narwhals/narwhals/_arrow/series.py

Lines 719 to 747 in 715be22

    def is_first_distinct(self) -> Self:
        import numpy as np  # ignore-banned-import

        row_number = pa.array(np.arange(len(self)))
        col_token = generate_temporary_column_name(n_bytes=8, columns=[self.name])
        first_distinct_index = (
            pa.Table.from_arrays([self.native], names=[self.name])
            .append_column(col_token, row_number)
            .group_by(self.name)
            .aggregate([(col_token, "min")])
            .column(f"{col_token}_min")
        )

        return self._with_native(pc.is_in(row_number, first_distinct_index))

    def is_last_distinct(self) -> Self:
        import numpy as np  # ignore-banned-import

        row_number = pa.array(np.arange(len(self)))
        col_token = generate_temporary_column_name(n_bytes=8, columns=[self.name])
        last_distinct_index = (
            pa.Table.from_arrays([self.native], names=[self.name])
            .append_column(col_token, row_number)
            .group_by(self.name)
            .aggregate([(col_token, "max")])
            .column(f"{col_token}_max")
        )

        return self._with_native(pc.is_in(row_number, last_distinct_index))
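
The idea behind both methods is: attach a row number, group by value, take the min (or max) row number per group, then mark rows whose number is in that set. A plain-Python sketch of the same logic, assuming hashable values and using dictionaries in place of the group-by (the function names simply mirror the methods above for illustration):

```python
from __future__ import annotations

from collections.abc import Hashable, Sequence


def is_first_distinct(values: Sequence[Hashable]) -> list[bool]:
    """Mark the first occurrence of each distinct value."""
    first_index: dict[Hashable, int] = {}
    for row_number, value in enumerate(values):
        # Equivalent to a per-group "min" over the row number.
        first_index.setdefault(value, row_number)
    keep = set(first_index.values())
    return [row_number in keep for row_number in range(len(values))]


def is_last_distinct(values: Sequence[Hashable]) -> list[bool]:
    """Mark the last occurrence of each distinct value."""
    # Later rows overwrite earlier ones: a per-group "max" over the row number.
    last_index = {value: row_number for row_number, value in enumerate(values)}
    keep = set(last_index.values())
    return [row_number in keep for row_number in range(len(values))]
```

The two functions differ only in the aggregation, which is what makes a shared implementation with two aliases plausible.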


dangotbanned · 2025-10-08T12:45:28Z

narwhals/_plan/arrow/dataframe.py

+    def filter(self, predicate: NamedIR) -> Self:
        mask: pc.Expression | ChunkedArrayAny
-        if not fn.is_series(predicate):
-            resolved = Expr.from_named_ir(predicate, self)
-            if isinstance(resolved, Expr):
-                mask = resolved.broadcast(len(self)).native
-            else:
-                mask = acero.lit(resolved.native)
+        resolved = Expr.from_named_ir(predicate, self)
+        if isinstance(resolved, Expr):
+            mask = resolved.broadcast(len(self)).native
        else:
-            mask = predicate.native
+            mask = acero.lit(resolved.native)
        return self._with_native(self.native.filter(mask))


@FBruzzesi #3183 (comment)

This is very nice now 😄
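
As a rough illustration of the two branches in the simplified `filter`, here is a plain-Python sketch: a column-length mask is applied row by row, while a scalar predicate is treated as a literal that applies to every row. `filter_rows` is a hypothetical stand-in, and treating `None` as "drop the row" is an assumption that follows pyarrow's default behavior for null mask values.

```python
from __future__ import annotations

from collections.abc import Sequence
from typing import Any


def filter_rows(rows: Sequence[Any], mask: bool | None | Sequence[bool | None]) -> list[Any]:
    """Keep the rows whose mask entry is True; a scalar mask applies to all rows."""
    if mask is None or isinstance(mask, bool):
        # Scalar predicate: the same literal decides every row.
        full_mask: Sequence[bool | None] = [mask] * len(rows)
    else:
        if len(mask) != len(rows):
            msg = f"Mask length {len(mask)} does not match row count {len(rows)}"
            raise ValueError(msg)
        full_mask = mask
    # `None` (null) entries drop the row, same as False.
    return [row for row, keep in zip(rows, full_mask) if keep]
```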

dangotbanned added 3 commits October 1, 2025 21:11

refactor: Use temp.column_name(s) some more

cb470b4

fix(typing): Resolve some cases for flatten_hash_safe

23e9d43

feat(expr-ir): Impl acero.sort_by

f77bb4c

dangotbanned added the internal label Oct 2, 2025

dangotbanned commented Oct 2, 2025


dangotbanned added 14 commits October 2, 2025 17:49

test: Port over is_first_distinct tests

36ddce0

Prep for #3173 (comment)

chore: Add Compliant{Expr,Scalar}.is_{first,last}_distinct

0e49f57

test: Update to cover is_last_distinct as well

a5f192c

feat(DRAFT): Initial is_first_distinct impl

6a1b08a

Mostly following what is on `main` (so far)

test: Port over more cases

1c026bf

refactor: Generalize is_first_distinct impl

e7e8a04

feat: Add is_last_distinct

2d46521

refactor: Make both is_*_distinct methods, aliases

cfb775d

feat: (Properly) add get_column, to_series

9db603b

Both are available at all levels, and `to_series` is implemented in terms of `get_columns`

chore: Add pc.is_in wrapper

f8255d3

docs: Add detail to FunctionFlags.LENGTH_PRESERVING

6fe2a0a

`is_{first,last}_distinct` are one of a few that fit that case

test: More test porting

938befb

typo

516f4a6

https://results.pre-commit.ci/run/github/760058710/1759512874.Rw7gJ59gRbyoGTojPwa9pw

feat(DRAFT): Some progress on hashjoin port

ead4e62

https://arrow.apache.org/docs/cpp/acero/user_guide.html#hash-join

dangotbanned changed the title ~~feat(expr-ir): Implement Acero order_by/sort_by pair~~ feat(expr-ir): Implement Acero order_by, hashjoin for over Oct 5, 2025

dangotbanned commented Oct 5, 2025

narwhals/_plan/arrow/acero.py (outdated)

dangotbanned added 8 commits October 5, 2025 14:39

fix: Correctly pass down join keys

273bdcc

- Starting to build up the join test suite
- At some point, `"cross"` support will be needed

test: Port over inner, left & clean up

ce37617

Everything else requires another feature to be implemented:
- `DataFrame.filter` for semi, anti
- `DataFrame.collect_schema` for suffix
- `how="cross"` is just being deferred currently (#3173 (comment))

test: Add test_suffix

18ef26a

test: Add how="cross" tests

94baf1e

test: Add how={"anti","semi"} tests

733b45a

test: replace "antananarivo"->"a", "bob"->"b"

ce321e0

50 lines! Even after all this refactoring 😔

test: Port the other duplicate test

cc0d379

test: Make all the xfails more visible

dd40e3a

dangotbanned added 4 commits October 6, 2025 17:29

feat: Support single Series as well

d514ad0

test: Use parametrize

d452920

feat: Add predicate expansion

4c7c23d

Quite handy that I did `Expr.filter` and `When` first 😄

feat(expr-ir): Full DataFrame.filter support

2ebca30

dangotbanned changed the title ~~feat(expr-ir): Implement Acero order_by, hashjoin for over~~ feat(expr-ir): Implement Acero order_by, hashjoin for over + DataFrame.filter Oct 6, 2025

test: Merge the anti/semi tests

1b66786

dangotbanned mentioned this pull request Oct 6, 2025

Bug: DataFrame.filter silently ignores **constraints when using list[bool] #3182

Closed

test: parametrize exception messages

fd38911

dangotbanned commented Oct 6, 2025

narwhals/_plan/dataframe.py (outdated)

dangotbanned added 10 commits October 6, 2025 20:48

test: relax more error messages

3537cac

the exact text is allowed to change

typo

b5ef86b

test: Add test_filter_mask_mixed

8433b2d

Some basic cases to consider for #3182.
If we decide against supporting them, then all can be converted into a `pytest.raises`

fix: Raise on duplicate column names

7668abb

Really don't want this being part of the `ArrowDataFrame` constructor.
Viewing `join` as an edge case, whereas things like `select`, `with_columns` already handle duplicates during `prepare_projections`

cov

3ca43d1

https://github.com/narwhals-dev/narwhals/actions/runs/18312754814/job/52145087735

perf: Avoid multiple collections during cross join

0f06479

test: Stop repeating the same data so many times

7e9ee74

test: Add some cases from polars

1523dbb

fix: typing mypy

a479f32

https://github.com/narwhals-dev/narwhals/actions/runs/18344025005/job/52246113538?pr=3173

feat(expr-ir): Full-er DataFrame.filter support

8e840e0

Compared to (2ebca30), now supports the new semantics that will appear in #3183, following #3182

dangotbanned commented Oct 8, 2025


dangotbanned mentioned this pull request Oct 8, 2025

feat(RFC): A richer Expr IR #2572

Draft

74 tasks

refactor: Simplify the NonCrossJoinStrategy split

af26916

dangotbanned mentioned this pull request Oct 8, 2025

fix: BaseFrame.filter with list[bool] in predicates #3183

Merged

10 tasks

test: Convert raising test into a conformance test

6aaf75d

dangotbanned changed the title ~~feat(expr-ir): Implement Acero order_by, hashjoin for over + DataFrame.filter~~ feat(expr-ir): Acero order_by, hashjoin , DataFrame.{filter,join}, Expr.is_{first,last}_distinct Oct 12, 2025

dangotbanned added the enhancement New feature or request label Oct 12, 2025

dangotbanned marked this pull request as ready for review October 12, 2025 16:15

dangotbanned merged commit bde22ac into oh-nodes Oct 12, 2025
31 of 33 checks passed

dangotbanned deleted the expr-ir/acero-order-by branch October 12, 2025 16:15

dangotbanned mentioned this pull request Oct 18, 2025

feat(expr-ir): Support over(*partition_by) #3224

Merged

9 tasks
