-
Notifications
You must be signed in to change notification settings - Fork 186
feat: add list aggregate methods #3332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
fadddc3
040527b
9b45555
7bdd2d2
74934c0
7efe2d2
8e04fc1
7ab1ebf
6b57809
7dfc9fa
1243980
7eca29a
545abf8
74de5c6
65072ff
ca4c794
3251865
eff085c
fdcf3f3
146c458
4d18654
9975334
be000f5
a470f8f
9b53f03
a549576
76c70ff
d716fd1
aa7bad3
54d7041
47f5a4b
9bcdebe
3ab7639
2c913d7
687c4ae
c851f10
310daa6
eca1c02
398c350
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -7,6 +7,11 @@ | |
| - contains | ||
| - get | ||
| - len | ||
| - max | ||
| - mean | ||
| - median | ||
| - min | ||
| - sum | ||
| - unique | ||
| show_source: false | ||
| show_bases: false | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -7,6 +7,11 @@ | |
| - contains | ||
| - get | ||
| - len | ||
| - max | ||
| - mean | ||
| - median | ||
| - min | ||
| - sum | ||
| - unique | ||
| show_source: false | ||
| show_bases: false | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -11,6 +11,7 @@ | |
|
|
||
| if TYPE_CHECKING: | ||
| from collections.abc import Iterable, Iterator, Mapping | ||
| from typing import Literal | ||
|
|
||
| from typing_extensions import TypeAlias, TypeIs | ||
|
|
||
|
|
@@ -494,3 +495,40 @@ def arange(start: int, end: int, step: int) -> ArrayAny: | |
| return pa.array(np.arange(start, end, step)) | ||
| # NOTE: Added in https://github.com/apache/arrow/pull/46778 | ||
| return pa.arange(start, end, step) # type: ignore[attr-defined] | ||
|
|
||
|
|
||
| def list_agg( | ||
| array: ChunkedArrayAny, | ||
| func: Literal["min", "max", "mean", "approximate_median", "sum"], | ||
| ) -> ChunkedArrayAny: | ||
| lit_: Incomplete = lit | ||
| aggregation = ( | ||
| ("values", func, pc.ScalarAggregateOptions(min_count=0)) | ||
| if func == "sum" | ||
| else ("values", func) | ||
| ) | ||
| agg = pa.array( | ||
| pa.Table.from_arrays( | ||
| [pc.list_flatten(array), pc.list_parent_indices(array)], | ||
| names=["values", "offsets"], | ||
| ) | ||
| .group_by("offsets") | ||
| .aggregate([aggregation]) | ||
| .sort_by("offsets") | ||
| .column(f"values_{func}") | ||
| ) | ||
|
Comment on lines
+510
to
519
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a comment rather than anything else, but I want mention it - maybe someone has ideas. In the case of multiple expressions on the same list column, we would end up running the same (expensive?) operations (flatten + group by + sort). I wonder if for the case of expressions there is a way to execute them together. In code, I would the following: frame.select(
a_sum = nw.col("a").list.sum(),
a_mean = nw.col("a").list.mean()
)to execute as: agg = pa.array(
pa.Table.from_arrays(
[pc.list_flatten(array), pc.list_parent_indices(array)],
names=["values", "offsets"],
)
.group_by("offsets")
.aggregate([("values", "sum"), ("values", "mean")])
.sort_by("offsets")
)
... # Here get the 2 columns
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
100% agree on the idea, but probably too complex in current This issue isn't unique to ExampleEfficient: frame.select(
nw.col("a", "b").sum().over("c"),
)Same intention, but worse performance for frame.select(
nw.col("a").sum().over("c"),
nw.col("b").sum().over("c")
)Why are these too hard?I think what you're describing corresponds to common subplan/expression elimination (see The main challenges are:
@FBruzzesi if this sounds similar to what I mentioned RE (#2572) and a Basically, the problem would be much easier to generalize if we had: frame.select(
a_sum = nw.col("a").list.sum(),
a_mean = nw.col("a").list.mean()
)Outputs: nw_plan = Select(
inputs=[
Alias(name="a_sum", expr=ListSum(Column(name="a"))),
Alias(name="a_mean", expr=ListMean(Column(name="a"))),
]
)But then we transform the Warning - verbosity overload and invented data model
pyarrow_list_sum = Sort(
inputs=[
GroupBy(
inputs=[
HConcat(
inputs=[ListExplode(Column(name="a")), ListOffsets(Column(name="a"))],
names=["values", "offsets"],
)
],
keys=["offsets"],
aggs=[Alias(name="a_sum", expr=Sum(Column(name="value")))],
)
],
by=["offsets"],
)
pyarrow_list_mean = Sort(
inputs=[
GroupBy(
inputs=[
HConcat(
inputs=[ListExplode(Column(name="a")), ListOffsets(Column(name="a"))],
names=["values", "offsets"],
)
],
keys=["offsets"],
aggs=[Alias(name="a_mean", expr=Mean(Column(name="value")))],
)
],
by=["offsets"],
)Then detecting what is shared between |
||
| non_empty_mask = pa.array(pc.not_equal(pc.list_value_length(array), lit(0))) | ||
| if func == "sum": | ||
| # Make sure sum of empty list is 0. | ||
| base_array = pc.if_else(non_empty_mask.is_null(), None, 0) | ||
| else: | ||
| base_array = pa.repeat(lit_(None, type=agg.type), len(array)) | ||
| return pa.chunked_array( | ||
| [ | ||
| pc.replace_with_mask( | ||
| base_array, | ||
| non_empty_mask.fill_null(False), # type: ignore[arg-type] | ||
| agg, | ||
| ) | ||
| ] | ||
| ) | ||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@raisadz I'm pretty excited by this! 😄
+1 from me on (#3332 (review))
I've just tried this out with the test case for
list.unique:narwhals/tests/expr_and_series/list/unique_test.py
Line 13 in 2c6c3c3
The result for that should be:
But using
list_aggseems to have dropped 2/4 lists and all nulls 🤔I managed to get slightly closer to what we want, by passing in
optionsfor thegroup_by:Show
list_agg_optsThese are the correct results for 2/4 of the lists 🎉
But where did the other 2 go? 😳
Edit: I missed it myself lol, fixed in (d8363e1)