
Conversation

@dangotbanned dangotbanned commented Oct 2, 2025

Tracking

Related issues

Description

Note

I've used the name sort_by for our wrapper of Acero's order_by.
The node corresponds to pa.Table.sort_by, whereas the name order_by doesn't appear anywhere else in pyarrow.

Building out more acero parts to be able to support .over(*partition_by)
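For orientation, here's a minimal standalone sketch of the Acero pieces being wrapped, using only public pyarrow.acero API (Declaration, TableSourceNodeOptions, OrderByNodeOptions; requires a pyarrow version that exposes OrderByNodeOptions). The table and column names are illustrative only, not the narwhals internals:

import pyarrow as pa
import pyarrow.acero as acero

table = pa.table({"a": [2, 1, 3], "b": ["x", "y", "z"]})
plan = acero.Declaration.from_sequence(
    [
        # Feed an in-memory table into the plan, then sort it with an `order_by` node.
        acero.Declaration("table_source", acero.TableSourceNodeOptions(table)),
        acero.Declaration(
            "order_by",
            acero.OrderByNodeOptions([("a", "descending")], null_placement="at_end"),
        ),
    ]
)
print(plan.to_table())  # rows ordered by `a` descending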

Comment on lines +114 to +118
# NOTE: See (https://github.com/microsoft/pyright/issues/10673#issuecomment-3033789021)
# The issue is `T` possibly being `Iterable`
# Ignoring here still leaks the issue to the caller, where you need to annotate the base case
@overload
def flatten_hash_safe(iterable: Iterable[OneOrIterable[str]], /) -> Iterator[str]: ...
@dangotbanned (Member Author)

It's an improvement over the previous version, but far from ideal.

It still doesn't resolve the case below, and I'm not entirely sure why yet:

@classmethod
def align(
    cls, *exprs: OneOrIterable[SupportsBroadcast[SeriesT, LengthT]]
) -> Iterator[SeriesT]:
    exprs = tuple[SupportsBroadcast[SeriesT, LengthT], ...](flatten_hash_safe(exprs))
    length = cls._length_required(exprs)
    if length is None:
        for e in exprs:
            yield e.to_series()
    else:
        for e in exprs:
            yield e.broadcast(length)
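For context, a rough runtime sketch of what a helper shaped like flatten_hash_safe presumably does: flatten one level of nesting while treating str (itself Iterable) as an atomic leaf, which is exactly why `T` possibly being `Iterable` trips pyright. This is an assumption for illustration, not the actual narwhals implementation:

from collections.abc import Iterable, Iterator
from typing import TypeVar, Union

T = TypeVar("T")
OneOrIterable = Union[T, Iterable[T]]

def flatten_hash_safe(iterable: Iterable[OneOrIterable[T]], /) -> Iterator[T]:
    # Flatten a single level, but never iterate into `str`/`bytes`,
    # which would otherwise be exploded character-by-character.
    for item in iterable:
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from item
        else:
            yield item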

Comment on lines +194 to +202
def sort_by(
    by: OneOrIterable[str],
    *more_by: str,
    descending: OneOrIterable[bool] = False,
    nulls_last: bool = False,
) -> Decl:
    return SortMultipleOptions.parse(
        descending=descending, nulls_last=nulls_last
    ).to_arrow_acero(tuple(flatten_hash_safe((by, more_by))))
@dangotbanned (Member Author) commented Oct 2, 2025

As of "feat(expr-ir): Impl acero.sort_by", I still need to make use of this in a plan.

A good candidate might be either or both of:

over(order_by=...)

def over_ordered(
    self, node: ir.OrderedWindowExpr, frame: Frame, name: str
) -> Self | Scalar:
    if node.partition_by:
        msg = f"Need to implement `group_by`, `join` for:\n{node!r}"
        raise NotImplementedError(msg)
    # NOTE: Converting `over(order_by=..., options=...)` into the right shape for `DataFrame.sort`
    sort_by = tuple(NamedIR.from_ir(e) for e in node.order_by)
    options = node.sort_options.to_multiple(len(node.order_by))
    idx_name = temp.column_name(frame)
    sorted_context = frame.with_row_index(idx_name).sort(sort_by, options)
    evaluated = node.expr.dispatch(self, sorted_context.drop([idx_name]), name)
    if isinstance(evaluated, ArrowScalar):
        # NOTE: We're already sorted, defer broadcasting to the outer context
        # Wouldn't be suitable for partitions, but will be fine here
        # - https://github.com/narwhals-dev/narwhals/pull/2528/commits/2ae42458cae91f4473e01270919815fcd7cb9667
        # - https://github.com/narwhals-dev/narwhals/pull/2528/commits/b8066c4c57d4b0b6c38d58a0f5de05eefc2cae70
        return self._with_native(evaluated.native, name)
    indices = pc.sort_indices(sorted_context.get_column(idx_name).native)
    height = len(sorted_context)
    result = evaluated.broadcast(height).native.take(indices)
    return self._with_native(result, name)
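The sort-evaluate-restore trick above can be illustrated standalone with plain pyarrow (assumed example data; pc.cumulative_sum stands in for whatever ordered window expression gets dispatched):

import pyarrow as pa
import pyarrow.compute as pc

tbl = pa.table({"key": [3, 1, 2], "val": [30, 10, 20]})
tbl = tbl.append_column("_idx", pa.array(range(len(tbl))))  # row index to undo the sort later
sorted_tbl = tbl.sort_by([("key", "ascending")])
evaluated = pc.cumulative_sum(sorted_tbl["val"])   # "window expression" evaluated on the sorted view
restore = pc.sort_indices(sorted_tbl["_idx"])      # positions that restore the original row order
print(evaluated.take(restore).to_pylist())         # [60, 10, 30]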

is_{first,last}_distinct

def is_first_distinct(self) -> Self:
    import numpy as np  # ignore-banned-import

    row_number = pa.array(np.arange(len(self)))
    col_token = generate_temporary_column_name(n_bytes=8, columns=[self.name])
    first_distinct_index = (
        pa.Table.from_arrays([self.native], names=[self.name])
        .append_column(col_token, row_number)
        .group_by(self.name)
        .aggregate([(col_token, "min")])
        .column(f"{col_token}_min")
    )
    return self._with_native(pc.is_in(row_number, first_distinct_index))

def is_last_distinct(self) -> Self:
    import numpy as np  # ignore-banned-import

    row_number = pa.array(np.arange(len(self)))
    col_token = generate_temporary_column_name(n_bytes=8, columns=[self.name])
    last_distinct_index = (
        pa.Table.from_arrays([self.native], names=[self.name])
        .append_column(col_token, row_number)
        .group_by(self.name)
        .aggregate([(col_token, "max")])
        .column(f"{col_token}_max")
    )
    return self._with_native(pc.is_in(row_number, last_distinct_index))
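The expected semantics, as a small self-contained check (hypothetical data, inlining the same group_by/min + is_in approach used above):

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["a", "b", "a", "b", "c"])
row_number = pa.array(np.arange(len(arr)))
first_idx = (
    pa.Table.from_arrays([arr], names=["x"])
    .append_column("i", row_number)
    .group_by("x")
    .aggregate([("i", "min")])
    .column("i_min")
)
# A row is "first distinct" iff its row number is the minimum for its group.
print(pc.is_in(row_number, first_idx))  # [true, true, false, false, true]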

@dangotbanned dangotbanned changed the title feat(expr-ir): Implement Acero order_by/sort_by pair feat(expr-ir): Implement Acero order_by, hashjoin for over Oct 5, 2025
- Starting to build up the join test suite
- At some point, `"cross"` support will be needed
Everything else requires another feature to be implemented:
- `DataFrame.filter` for semi, anti
- `DataFrame.collect_schema` for suffix
- `how="cross"` is just being deferred currently (#3173 (comment))
50 lines! Even after all this refactoring 😔
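Regarding the join variants listed above, a hedged sketch of the underlying Acero hashjoin node in plain pyarrow.acero (not the narwhals wrapper); note that "cross" is not one of Acero's hashjoin join types:

import pyarrow as pa
import pyarrow.acero as acero

left = pa.table({"id": [1, 2, 3], "l": ["a", "b", "c"]})
right = pa.table({"id": [2, 3, 4], "r": ["x", "y", "z"]})
decl = acero.Declaration(
    "hashjoin",
    acero.HashJoinNodeOptions(
        "inner",  # other join types include "left semi", "left anti", "full outer", ...
        left_keys=["id"],
        right_keys=["id"],
    ),
    inputs=[
        acero.Declaration("table_source", acero.TableSourceNodeOptions(left)),
        acero.Declaration("table_source", acero.TableSourceNodeOptions(right)),
    ],
)
print(decl.to_table())  # inner join of `left` and `right` on `id`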
@dangotbanned dangotbanned changed the title feat(expr-ir): Implement Acero order_by, hashjoin for over feat(expr-ir): Implement Acero order_by, hashjoin for over + DataFrame.filter Oct 6, 2025
The exact text is allowed to change.
Some basic cases to consider for #3182

If we decide against supporting them, then all can be converted into a `pytest.raises`
Really don't want this being part of the `ArrowDataFrame` constructor
Viewing `join` as an edge case, whereas things like `select`, `with_columns` already handle duplicates during `prepare_projections`
Compared to (2ebca30)

Now supports the new semantics that will appear in #3183, following #3182
Comment on lines 166 to 173
def filter(self, predicate: NamedIR) -> Self:
    mask: pc.Expression | ChunkedArrayAny
    resolved = Expr.from_named_ir(predicate, self)
    if isinstance(resolved, Expr):
        mask = resolved.broadcast(len(self)).native
    else:
        mask = acero.lit(resolved.native)
    return self._with_native(self.native.filter(mask))
@dangotbanned (Member Author)

@FBruzzesi #3183 (comment)

This is very nice now 😄
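For reference, both mask shapes accepted here can be exercised directly against pa.Table.filter (assumed example data; a reasonably recent pyarrow is needed for the Expression path):

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"a": [1, 2, 3, 4]})

# Boolean-mask path (the broadcast `Expr` branch above):
mask = pc.greater(table["a"], 2)
print(table.filter(mask).to_pydict())              # {'a': [3, 4]}

# Expression path (`pc.Expression`, the other half of the mask annotation):
print(table.filter(pc.field("a") > 2).to_pydict())  # {'a': [3, 4]}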

@dangotbanned dangotbanned mentioned this pull request Oct 8, 2025
74 tasks
@dangotbanned dangotbanned changed the title feat(expr-ir): Implement Acero order_by, hashjoin for over + DataFrame.filter feat(expr-ir): Acero order_by, hashjoin , DataFrame.{filter,join}, Expr.is_{first,last}_distinct Oct 12, 2025
@dangotbanned dangotbanned added the enhancement New feature or request label Oct 12, 2025
@dangotbanned dangotbanned marked this pull request as ready for review October 12, 2025 16:15
@dangotbanned dangotbanned merged commit bde22ac into oh-nodes Oct 12, 2025
31 of 33 checks passed
@dangotbanned dangotbanned deleted the expr-ir/acero-order-by branch October 12, 2025 16:15
Labels: enhancement (New feature or request), internal