feat(expr-ir): Acero order_by, hashjoin , DataFrame.{filter,join}, Expr.is_{first,last}_distinct
#3173
| # NOTE: See (https://github.com/microsoft/pyright/issues/10673#issuecomment-3033789021) | ||
| # The issue is `T` possibly being `Iterable` | ||
| # Ignoring here still leaks the issue to the caller, where you need to annotate the base case | ||
| @overload | ||
| def flatten_hash_safe(iterable: Iterable[OneOrIterable[str]], /) -> Iterator[str]: ... |
It's an improvement over the previous version, but far from ideal.
Still doesn't resolve this case, and I'm not entirely sure why yet
narwhals/narwhals/_plan/compliant/column.py
Lines 49 to 60 in f77bb4c
| @classmethod | |
| def align( | |
| cls, *exprs: OneOrIterable[SupportsBroadcast[SeriesT, LengthT]] | |
| ) -> Iterator[SeriesT]: | |
| exprs = tuple[SupportsBroadcast[SeriesT, LengthT], ...](flatten_hash_safe(exprs)) | |
| length = cls._length_required(exprs) | |
| if length is None: | |
| for e in exprs: | |
| yield e.to_series() | |
| else: | |
| for e in exprs: | |
| yield e.broadcast(length) |
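The body of `flatten_hash_safe` is not shown in this thread, so the following is only a minimal sketch of the kind of flattening being discussed. The name `flatten_sketch` is hypothetical, and the typing is deliberately loose (`Any`) to sidestep the overload problem described above rather than solve it.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from typing import Any


def flatten_sketch(iterable: Iterable[Any], /) -> Iterator[Any]:
    """Recursively flatten nested iterables, treating `str` and `bytes` as atoms.

    Hypothetical stand-in, not the PR's `flatten_hash_safe`.
    """
    for element in iterable:
        # A `str` is itself iterable, so it must be excluded explicitly,
        # otherwise "abc" would be split into characters (and recurse forever).
        if isinstance(element, Iterable) and not isinstance(element, (str, bytes)):
            yield from flatten_sketch(element)
        else:
            yield element


# Mirrors the `(by, more_by)` call shape used by `sort_by` in this PR.
by = "a"
more_by = ("b", "c")
flat = tuple(flatten_sketch((by, more_by)))
```

Because the return type here is `Iterator[Any]`, a caller such as `align` would still need to annotate the result itself, which is the leak the note above describes.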
| def sort_by( | ||
| by: OneOrIterable[str], | ||
| *more_by: str, | ||
| descending: OneOrIterable[bool] = False, | ||
| nulls_last: bool = False, | ||
| ) -> Decl: | ||
| return SortMultipleOptions.parse( | ||
| descending=descending, nulls_last=nulls_last | ||
| ).to_arrow_acero(tuple(flatten_hash_safe((by, more_by)))) |
As of "feat(expr-ir): Impl `acero.sort_by`", I still need to make use of this in a plan.
A good candidate might be either or both of:
`over(order_by=...)`
narwhals/narwhals/_plan/arrow/expr.py
Lines 328 to 350 in f77bb4c
| def over_ordered( | |
| self, node: ir.OrderedWindowExpr, frame: Frame, name: str | |
| ) -> Self | Scalar: | |
| if node.partition_by: | |
| msg = f"Need to implement `group_by`, `join` for:\n{node!r}" | |
| raise NotImplementedError(msg) | |
| # NOTE: Converting `over(order_by=..., options=...)` into the right shape for `DataFrame.sort` | |
| sort_by = tuple(NamedIR.from_ir(e) for e in node.order_by) | |
| options = node.sort_options.to_multiple(len(node.order_by)) | |
| idx_name = temp.column_name(frame) | |
| sorted_context = frame.with_row_index(idx_name).sort(sort_by, options) | |
| evaluated = node.expr.dispatch(self, sorted_context.drop([idx_name]), name) | |
| if isinstance(evaluated, ArrowScalar): | |
| # NOTE: We're already sorted, defer broadcasting to the outer context | |
| # Wouldn't be suitable for partitions, but will be fine here | |
| # - https://github.com/narwhals-dev/narwhals/pull/2528/commits/2ae42458cae91f4473e01270919815fcd7cb9667 | |
| # - https://github.com/narwhals-dev/narwhals/pull/2528/commits/b8066c4c57d4b0b6c38d58a0f5de05eefc2cae70 | |
| return self._with_native(evaluated.native, name) | |
| indices = pc.sort_indices(sorted_context.get_column(idx_name).native) | |
| height = len(sorted_context) | |
| result = evaluated.broadcast(height).native.take(indices) | |
| return self._with_native(result, name) |
`is_{first,last}_distinct`
narwhals/narwhals/_arrow/series.py
Lines 719 to 747 in 715be22
| def is_first_distinct(self) -> Self: | |
| import numpy as np # ignore-banned-import | |
| row_number = pa.array(np.arange(len(self))) | |
| col_token = generate_temporary_column_name(n_bytes=8, columns=[self.name]) | |
| first_distinct_index = ( | |
| pa.Table.from_arrays([self.native], names=[self.name]) | |
| .append_column(col_token, row_number) | |
| .group_by(self.name) | |
| .aggregate([(col_token, "min")]) | |
| .column(f"{col_token}_min") | |
| ) | |
| return self._with_native(pc.is_in(row_number, first_distinct_index)) | |
| def is_last_distinct(self) -> Self: | |
| import numpy as np # ignore-banned-import | |
| row_number = pa.array(np.arange(len(self))) | |
| col_token = generate_temporary_column_name(n_bytes=8, columns=[self.name]) | |
| last_distinct_index = ( | |
| pa.Table.from_arrays([self.native], names=[self.name]) | |
| .append_column(col_token, row_number) | |
| .group_by(self.name) | |
| .aggregate([(col_token, "max")]) | |
| .column(f"{col_token}_max") | |
| ) | |
| return self._with_native(pc.is_in(row_number, last_distinct_index)) |
Mostly following what is on `main` (so far)
Both are available at all levels, + `to_series` is implemented in terms of `get_columns`
`is_{first,last}_distinct` are among the few that fit that case
Title changed from "order_by/sort_by pair" to "order_by, hashjoin for over"
- Starting to build up the join test suite
- At some point, `"cross"` support will be needed
Everything else requires another feature to be implemented:
- `DataFrame.filter` for semi, anti
- `DataFrame.collect_schema` for suffix
- `how="cross"` is just being deferred currently (#3173 (comment))
50 lines! Even after all this refactoring 😔
Quite handy that I did `Expr.filter` and `When` first 😄
Title changed from "order_by, hashjoin for over" to "order_by, hashjoin for over + DataFrame.filter"
the exact text is allowed to change
Some basic cases to consider for #3182. If we decide against supporting them, then all can be converted into a `pytest.raises`.
Really don't want this being part of the `ArrowDataFrame` constructor. Viewing `join` as an edge case, whereas things like `select`, `with_columns` already handle duplicates during `prepare_projections`.
| def filter(self, predicate: NamedIR) -> Self: | ||
| mask: pc.Expression | ChunkedArrayAny | ||
| - if not fn.is_series(predicate): | ||
| - resolved = Expr.from_named_ir(predicate, self) | ||
| - if isinstance(resolved, Expr): | ||
| - mask = resolved.broadcast(len(self)).native | ||
| - else: | ||
| - mask = acero.lit(resolved.native) | ||
| + resolved = Expr.from_named_ir(predicate, self) | ||
| + if isinstance(resolved, Expr): | ||
| + mask = resolved.broadcast(len(self)).native | ||
| else: | ||
| - mask = predicate.native | ||
| + mask = acero.lit(resolved.native) | ||
| return self._with_native(self.native.filter(mask)) |
This is very nice now 😄
Title changed from "order_by, hashjoin for over + DataFrame.filter" to "order_by, hashjoin , DataFrame.{filter,join}, Expr.is_{first,last}_distinct"
Tracking
- `DataFrame.filter` silently ignores `**constraints` when using `list[bool]` #3182
Related issues
- ExprIR #2572
- `group_by`, utilize `pyarrow.acero` #3143
Description
Note
I've used the name `sort_by` for our wrapper of `order_by`.
The node corresponds to `pa.Table.sort_by`, whereas the name `order_by` doesn't appear anywhere else in `pyarrow`.
Building out more `acero` parts to be able to support `.over(*partition_by)`.