feat: add list aggregate methods #3332

Merged
MarcoGorelli merged 39 commits into narwhals-dev:main from raisadz:feat/list-agg
Dec 14, 2025

Conversation

Contributor

@raisadz raisadz commented Nov 28, 2025

Description

The following list methods are implemented:

    - list.max
    - list.mean
    - list.median
    - list.min
    - list.sum

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #<issue number>

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

@raisadz raisadz marked this pull request as ready for review November 28, 2025 12:29
Member

@FBruzzesi FBruzzesi left a comment


Thanks @raisadz - this looks amazing! Just left a comment to possibly support more in the spark-like case 😇

@FBruzzesi FBruzzesi added labels enhancement (New feature or request) and nested data (`list`, `struct`, etc) Nov 28, 2025
Member

@FBruzzesi FBruzzesi left a comment


Thanks @raisadz, I have a couple more comments, apologies for the fragmented review 🙈

  1. Could you add a couple of test cases:
    a. All nulls in list
    b. Empty list
    c. polars.Expr.list.sum says: If there are no non-null elements in a row, the output is 0. For the other aggregations it's unclear what the output should be, and I wonder how consistent it is across all different backends.
  2. Could you mix up the docstring examples a bit, using backends other than polars?
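For reference, the polars convention in question can be mimicked in plain Python (a sketch of the assumed semantics, not backend code): summing only the non-null elements yields 0 for empty and all-null rows, while a null list propagates null.

```python
rows = [[2, 3], [], [None], [None, 2], None]

def list_sum(row):
    if row is None:
        return None  # a null list propagates null
    # No non-null elements -> sum of an empty sequence -> 0
    return sum(v for v in row if v is not None)

print([list_sum(r) for r in rows])  # [5, 0, 0, 2, None]
```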

Comment on lines 500 to 512
def list_agg(
    array: ChunkedArrayAny,
    func: Literal["min", "max", "mean", "approximate_median", "sum"],
) -> ChunkedArrayAny:
    return (
        pa.Table.from_arrays(
            [pc.list_flatten(array), pc.list_parent_indices(array)],
            names=["values", "offsets"],
        )
        .group_by("offsets")
        .aggregate([("values", func)])
        .column(f"values_{func}")
    )
Member

@dangotbanned dangotbanned Nov 28, 2025


@raisadz I'm pretty excited by this! 😄

+1 from me on (#3332 (review))


I've just tried this out with the test case for list.unique:

data = {"a": [[2, 2, 3, None, None], None, [], [None]]}

The result for that should be:

[[None, 2, 3], None, [], [None]]

But using list_agg seems to have dropped 2/4 lists and all nulls 🤔

import pyarrow as pa

data = {"a": [[2, 2, 3, None, None], None, [], [None]]}
ca = pa.chunked_array([pa.array(data["a"])])
result = list_agg(ca, "distinct").to_pylist()
print(result)
# [[2, 3], []]

I managed to get slightly closer to what we want, by passing in options for the group_by:

Show list_agg_opts

from typing import Any

import pyarrow as pa
import pyarrow.compute as pc

def list_agg_opts(
    array: pa.ChunkedArray[Any], func: Any, options: Any = None
) -> pa.ChunkedArray[Any]:
    return (
        pa.Table.from_arrays(
            [pc.list_flatten(array), pc.list_parent_indices(array)],
            names=["values", "offsets"],
        )
        .group_by("offsets")
        .aggregate([("values", func, options)])  # <-------
        .column(f"values_{func}")
    )

These are the correct results for 2/4 of the lists 🎉

But where did the other 2 go? 😳

result = list_agg_opts(ca, "distinct", pc.CountOptions("all")).to_pylist()
print(result)
# [[2, 3, None], [None]]

Edit: I missed it myself lol, fixed in (d8363e1)

Contributor Author

raisadz commented Nov 29, 2025

Thank you both! Yes, in fact those tests with empty lists or all-None lists fail for multiple backends. I pushed some changes, but this is still WIP and I will continue working on it.

Member

@FBruzzesi FBruzzesi left a comment


Thanks @raisadz - I left a couple of comments more.
After those are solved I am happy to merge 🙏🏼

raisadz and others added 3 commits December 9, 2025 11:13
Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
@FBruzzesi
Member

@raisadz the action failing for Python 3.9 seems to be Windows-only (I cannot replicate it on mac nor ubuntu, but I don't have a Windows machine to try it). I would say let's xfail that case and consider:

  • reporting it upstream
  • issue a warning from our side? I am 50-50 split on this

if TYPE_CHECKING:
    from tests.utils import Constructor, ConstructorEager

data = {"a": [[3, None, 2, 2, 4, None], [-1], None, [None, None, None], []]}
Member


I've been a bit afraid to bring this up, but AFAICT all of the list.* methods have only been tested against 1 level of nesting.

I think 2 (or more) levels might be a problem for pyarrow and pandas, because lists aren't hashable yet.

I'd be really happy to be wrong on this though 🙏

If it is an issue, I don't think it needs to be a blocker - just something to keep in mind 🙂

Member


polars seems to properly support these ops only for List(<numeric_type>) (and maybe temporal types?), and either fails or returns nulls otherwise:

import polars as pl

data = [
    [[1], [2,3]],
    [[4,5,6], [7,8]]
]

series = pl.Series(data)
print(series.dtype)
print()
for op in ("min", "mean", "max", "median", "sum"):
    print(f"Executing {op}")
    try:
        print("result", getattr(series.list, op)())
    except Exception as exc:
        print("error", exc)
    print()
List(List(Int64))

Executing min
error `min` operation not supported for dtype `list[i64]`

Executing mean
result shape: (2,)
Series: '' [f64]
[
        null
        null
]

Executing max
error `max` operation not supported for dtype `list[i64]`

Executing median
result shape: (2,)
Series: '' [f64]
[
        null
        null
]

Executing sum
error `sum` operation not supported for dtype `list[i64]`

Hint: you may mean to call `concat_list`

There are a few other cases for which we check the dtype before performing an operation. We might do the same here
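A dtype guard along those lines could look like this (a hypothetical sketch: `check_inner_dtype`, the dtype names, and the error text are illustrative, not existing narwhals helpers; the message mirrors polars' wording):

```python
NUMERIC_INNER = {"Int32", "Int64", "Float32", "Float64"}  # illustrative subset

def check_inner_dtype(inner: str, op: str) -> None:
    # Raise eagerly with a consistent message, instead of letting each
    # backend fail (or silently return nulls) in its own way.
    if inner not in NUMERIC_INNER:
        msg = f"`{op}` operation not supported for dtype `list[{inner}]`"
        raise TypeError(msg)

check_inner_dtype("Int64", "min")  # passes silently
try:
    check_inner_dtype("List(Int64)", "min")
except TypeError as exc:
    print(exc)
```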

Member

@FBruzzesi FBruzzesi Dec 12, 2025


@raisadz I think this is the last missing bit before merging. If we can standardize the error message across all backends it would be amazing. I would be in favor of raising for list.{mean,median} as well, and asking in polars whether the current output is expected.

Update1: Asked in their discord channel

Update2: reply:

it's an older leftover I think
unsupported aggregates used to return None and we've been slowly transitioning over time to nice errors

Member


I think 2 (or more) levels might be a problem for pyarrow and pandas, because lists aren't hashable yet.

tbh i wouldn't worry about it, just leave it to each backend

Contributor Author

raisadz commented Dec 12, 2025

@raisadz the action failing for python 3.9 seems to be on windows only (I cannot replicate that on mac nor ubuntu, but I don't have a windows machine to try it). I would say let's xfail such case and consider:

  • reporting it upstream
  • issue a warning from our side? I am 50-50 split on this

Thanks @FBruzzesi ! I missed this failure and I am skipping it now. I don't think we should worry about Python 3.9 on Windows as it is almost obsolete at this point.

@dangotbanned
Member

I don't think we should worry about 3.9 version on windows as it is almost obsolete at this point

This might be good to mention in #3204?

@raisadz raisadz mentioned this pull request Dec 12, 2025
Comment on lines 519 to 521
base_array = pc.if_else(
    non_empty_mask, 0, None
)  # zero is just a placeholder which is replaced below
Member


does it work to just do

        base_array = pa.repeat(lit(None, type=agg.type), len(array))

?

the 0 placeholder feels a bit magical

Member


here's something i worked through in the live stream, could it work?

    lit_: Incomplete = lit
    aggregation = (
        ("values", "sum", pc.ScalarAggregateOptions(min_count=0))
        if func == "sum"
        else ("values", func)
    )
    agg = pa.array(
        pa.Table.from_arrays(
            [pc.list_flatten(array), pc.list_parent_indices(array)],
            names=["values", "offsets"],
        )
        .group_by("offsets")
        .aggregate([aggregation])
        .sort_by("offsets")
        .column(f"values_{func}")
    )
    non_empty_mask = pa.array(pc.not_equal(pc.list_value_length(array), lit(0)))
    if func == "sum":
        # Make sure sum of empty list is 0.
        base_array = pc.if_else(non_empty_mask.is_null(), None, 0)
    else:
        base_array = pa.repeat(lit_(None, type=agg.type), len(array))
    return pa.chunked_array(
        [
            pc.replace_with_mask(
                base_array,
                non_empty_mask.fill_null(False),  # type: ignore[arg-type]
                agg,
            )
        ]
    )

Member


I've got like 97 variations of this now 😂

Pretty much everything that involves list.* needs to do this dance around [] and None

Show rabbit hole

class ExplodeBuilder:
    """Tools for exploding lists.

    The complexity of these operations increases with:

    - Needing to preserve null/empty elements
        - All variants are cheaper if this can be skipped
    - Exploding in the context of a table
        - Where a single column is much simpler than multiple
    """

    options: ExplodeOptions

    def __init__(self, *, empty_as_null: bool = True, keep_nulls: bool = True) -> None:
        self.options = ExplodeOptions(empty_as_null=empty_as_null, keep_nulls=keep_nulls)

    @classmethod
    def from_options(cls, options: ExplodeOptions, /) -> Self:
        obj = cls.__new__(cls)
        obj.options = options
        return obj

    @t.overload
    def explode(
        self, native: ChunkedList[DataTypeT] | ListScalar[DataTypeT]
    ) -> ChunkedArray[Scalar[DataTypeT]]: ...
    @t.overload
    def explode(self, native: ListArray[DataTypeT]) -> Array[Scalar[DataTypeT]]: ...
    @t.overload
    def explode(
        self, native: Arrow[ListScalar[DataTypeT]]
    ) -> ChunkedOrArray[Scalar[DataTypeT]]: ...
    def explode(
        self, native: Arrow[ListScalar[DataTypeT]]
    ) -> ChunkedOrArray[Scalar[DataTypeT]]:
        """Explode list elements, expanding one level into a new array.

        Equivalent to `polars.{Expr,Series}.explode`.
        """
        safe = self._fill_with_null(native) if self.options.any() else native
        if not isinstance(safe, pa.Scalar):
            return _list_explode(safe)
        return chunked_array(_list_explode(safe))

    def explode_with_indices(self, native: ChunkedList | ListArray) -> pa.Table:
        safe = self._fill_with_null(native) if self.options.any() else native
        arrays = [_list_parent_indices(safe), _list_explode(safe)]
        return concat_horizontal(arrays, ["idx", "values"])

    def explode_column(self, native: pa.Table, column_name: str, /) -> pa.Table:
        """Explode a list-typed column in the context of `native`."""
        ca = native.column(column_name)
        if native.num_columns == 1:
            return native.from_arrays([self.explode(ca)], [column_name])
        safe = self._fill_with_null(ca) if self.options.any() else ca
        exploded = _list_explode(safe)
        col_idx = native.schema.get_field_index(column_name)
        if len(exploded) == len(native):
            return native.set_column(col_idx, column_name, exploded)
        return (
            native.remove_column(col_idx)
            .take(_list_parent_indices(safe))
            .add_column(col_idx, column_name, exploded)
        )

    def explode_columns(self, native: pa.Table, subset: Collection[str], /) -> pa.Table:
        """Explode multiple list-typed columns in the context of `native`."""
        subset = list(subset)
        arrays = native.select(subset).columns
        first = arrays[0]
        first_len = list_len(first)
        if self.options.any():
            mask = self._predicate(first_len)
            first_safe = self._fill_with_null(first, mask)
            it = (
                _list_explode(self._fill_with_null(arr, mask))
                for arr in self._iter_ensure_shape(first_len, arrays[1:])
            )
        else:
            first_safe = first
            it = (
                _list_explode(arr)
                for arr in self._iter_ensure_shape(first_len, arrays[1:])
            )
        first_result = _list_explode(first_safe)
        if len(first_result) != len(native):
            gathered = native.drop_columns(subset).take(_list_parent_indices(first_safe))
            for name, arr in zip(subset, chain([first_result], it)):
                gathered = gathered.append_column(name, arr)
            return gathered.select(native.column_names)
        # NOTE: Not too happy about this import
        from narwhals._plan.arrow.dataframe import with_arrays

        return with_arrays(native, zip(subset, chain([first_result], it)))

    @classmethod
    def explode_column_fast(cls, native: pa.Table, column_name: str, /) -> pa.Table:
        """Explode a list-typed column in the context of `native`, ignoring empty and nulls."""
        return cls(empty_as_null=False, keep_nulls=False).explode_column(
            native, column_name
        )

    def _iter_ensure_shape(
        self,
        first_len: ChunkedArray[pa.UInt32Scalar],
        arrays: Iterable[ChunkedArrayAny],
        /,
    ) -> Iterator[ChunkedArrayAny]:
        for arr in arrays:
            if not first_len.equals(list_len(arr)):
                msg = "exploded columns must have matching element counts"
                raise ShapeError(msg)
            yield arr

    def _predicate(self, lengths: ArrowAny, /) -> Arrow[pa.BooleanScalar]:
        """Return True for each sublist length that indicates the original sublist should be replaced with `[None]`."""
        empty_as_null, keep_nulls = self.options.empty_as_null, self.options.keep_nulls
        if empty_as_null and keep_nulls:
            return or_(is_null(lengths), eq(lengths, lit(0)))
        if empty_as_null:
            return eq(lengths, lit(0))
        return is_null(lengths)

    def _fill_with_null(
        self, native: ArrowListT, mask: Arrow[BooleanScalar] | NoDefault = no_default
    ) -> ArrowListT:
        """Replace each sublist in `native` with `[None]`, according to `self.options`.

        Arguments:
            native: List-typed arrow data.
            mask: An optional, pre-computed replacement mask. By default, this is generated from `native`.
        """
        predicate = self._predicate(list_len(native)) if mask is no_default else mask
        result: ArrowListT = when_then(predicate, lit([None], native.type), native)
        return result


@t.overload
def _list_explode(native: ChunkedList[DataTypeT]) -> ChunkedArray[Scalar[DataTypeT]]: ...
@t.overload
def _list_explode(
    native: ListArray[NonListTypeT] | ListScalar[NonListTypeT],
) -> Array[Scalar[NonListTypeT]]: ...
@t.overload
def _list_explode(native: ListArray[DataTypeT]) -> Array[Scalar[DataTypeT]]: ...
@t.overload
def _list_explode(native: ListScalar[ListTypeT]) -> ListArray[ListTypeT]: ...
def _list_explode(native: Arrow[ListScalar]) -> ChunkedOrArrayAny:
    result: ChunkedOrArrayAny = pc.call_function("list_flatten", [native])
    return result


@t.overload
def _list_parent_indices(native: ChunkedList) -> ChunkedArray[pa.Int64Scalar]: ...
@t.overload
def _list_parent_indices(native: ListArray) -> pa.Int64Array: ...
def _list_parent_indices(
    native: ChunkedOrArray[ListScalar],
) -> ChunkedOrArray[pa.Int64Scalar]:
    """Don't use this without handling nulls!"""
    result: ChunkedOrArray[pa.Int64Scalar] = pc.call_function(
        "list_parent_indices", [native]
    )
    return result


@t.overload
def list_len(native: ChunkedList) -> ChunkedArray[pa.UInt32Scalar]: ...
@t.overload
def list_len(native: ListArray) -> pa.UInt32Array: ...
@t.overload
def list_len(native: ListScalar) -> pa.UInt32Scalar: ...
@t.overload
def list_len(native: ChunkedOrScalar[ListScalar]) -> ChunkedOrScalar[pa.UInt32Scalar]: ...
@t.overload
def list_len(native: Arrow[ListScalar[Any]]) -> Arrow[pa.UInt32Scalar]: ...
def list_len(native: ArrowAny) -> ArrowAny:
    length: Incomplete = pc.list_value_length
    result: ArrowAny = length(native).cast(pa.uint32())
    return result


@t.overload
def list_get(
    native: ChunkedList[DataTypeT], index: int
) -> ChunkedArray[Scalar[DataTypeT]]: ...
@t.overload
def list_get(native: ListArray[DataTypeT], index: int) -> Array[Scalar[DataTypeT]]: ...
@t.overload
def list_get(native: ListScalar[DataTypeT], index: int) -> Scalar[DataTypeT]: ...
@t.overload
def list_get(native: SameArrowT, index: int) -> SameArrowT: ...
@t.overload
def list_get(native: ChunkedOrScalarAny, index: int) -> ChunkedOrScalarAny: ...
def list_get(native: ArrowAny, index: int) -> ArrowAny:
    list_get_: Incomplete = pc.list_element
    result: ArrowAny = list_get_(native, index)
    return result


@t.overload
def list_join(
    native: ChunkedList[StringType],
    separator: Arrow[StringScalar] | str,
    *,
    ignore_nulls: bool = ...,
) -> ChunkedArray[StringScalar]: ...
@t.overload
def list_join(
    native: ListArray[StringType],
    separator: Arrow[StringScalar] | str,
    *,
    ignore_nulls: bool = ...,
) -> pa.StringArray: ...
@t.overload
def list_join(
    native: ListScalar[StringType],
    separator: Arrow[StringScalar] | str,
    *,
    ignore_nulls: bool = ...,
) -> pa.StringScalar: ...
def list_join(
    native: ArrowAny, separator: Arrow[StringScalar] | str, *, ignore_nulls: bool = False
) -> ArrowAny:
    """Join all string items in a sublist and place a separator between them.

    Each list of values in the first input is joined using each second input as separator.
    If any input list is null or contains a null, the corresponding output will be null.

    Edge cases:

    >>> import polars as pl
    >>> data = {
    ...     "s": [
    ...         ["a", "b", "c"],
    ...         ["x", "y"],
    ...         ["1", None, "3"],
    ...         [None],
    ...         None,
    ...         [],
    ...         [None, None],  # <-- everything works except this, for now
    ...     ]
    ... }
    >>> s = pl.col("s")
    >>> result = pl.DataFrame(data).select(
    ...     s,
    ...     ignore_nulls=s.list.join("-", ignore_nulls=True),
    ...     propagate_nulls=s.list.join("-", ignore_nulls=False),
    ... )
    >>> result
    shape: (7, 3)
    ┌──────────────────┬──────────────┬─────────────────┐
    │ s                ┆ ignore_nulls ┆ propagate_nulls │
    │ ---              ┆ ---          ┆ ---             │
    │ list[str]        ┆ str          ┆ str             │
    ╞══════════════════╪══════════════╪═════════════════╡
    │ ["a", "b", "c"]  ┆ a-b-c        ┆ a-b-c           │
    │ ["x", "y"]       ┆ x-y          ┆ x-y             │
    │ ["1", null, "3"] ┆ 1-3          ┆ null            │
    │ [null]           ┆              ┆ null            │
    │ null             ┆ null         ┆ null            │
    │ []               ┆              ┆                 │
    │ [null, null]     ┆              ┆ null            │
    └──────────────────┴──────────────┴─────────────────┘
    """
    join = t.cast(
        "Callable[[Any, Any], ChunkedArray[StringScalar] | pa.StringArray]",
        pc.binary_join,
    )
    if not ignore_nulls:
        return pc.binary_join(native, separator)
    # NOTE: `polars` default is `True`
    if isinstance(native, pa.Scalar):
        to_join = (
            implode(_list_explode(native).drop_null()) if native.is_valid else native
        )
        return pc.binary_join(to_join, separator)
    result = join(native, separator)
    if not result.null_count:
        # if we got here and there were no nulls, then we're done
        return result
    todo_mask = pc.and_not(result.is_null(), native.is_null())
    todo_lists = native.filter(todo_mask)
    list_len_1: ChunkedOrArrayAny = eq(list_len(todo_lists), lit(1))  # pyright: ignore[reportAssignmentType]
    only_single_null = any_(list_len_1).as_py()
    if only_single_null:
        todo_lists = when_then(list_len_1, lit([""], todo_lists.type), todo_lists)
    builder = ExplodeBuilder(empty_as_null=False, keep_nulls=False)
    replacements = join(
        builder.explode_with_indices(todo_lists)
        .drop_null()
        .group_by("idx")
        .aggregate([("values", "hash_list")])
        .column(1),
        separator,
    )
    if len(replacements) != len(list_len_1):
        # probably do-able, but the edge cases here are getting hairy
        msg = f"TODO: `ArrowExpr.list.join` w/ `[None, None , ...]` element\n{native!r}"
        raise NotImplementedError(msg)
    return replace_with_mask(result, todo_mask, replacements)


@overload
def list_unique(native: ChunkedList) -> ChunkedList: ...
@overload
def list_unique(native: ListScalar) -> ListScalar: ...
@overload
def list_unique(native: ChunkedOrScalar[ListScalar]) -> ChunkedOrScalar[ListScalar]: ...
def list_unique(native: ChunkedOrScalar[ListScalar]) -> ChunkedOrScalar[ListScalar]:
    """Get the unique/distinct values in the list.

    There's lots of tricky stuff going on in here, but for good reasons!
    Whenever possible, we want to avoid having to deal with these pesky guys:

        [["okay", None, "still fine"], None, []]
        #                              ^^^^  ^^

    - Those kinds of list elements are ignored natively
    - `unique` is a length-changing operation
    - We can't use [`pc.replace_with_mask`] on a list
    - We can't join when a table contains list columns [apache/arrow#43716]

    **But** - if we're lucky, and we got a non-awful list (or only one element) - then
    most issues vanish.

    [`pc.replace_with_mask`]: https://arrow.apache.org/docs/python/generated/pyarrow.compute.replace_with_mask.html
    [apache/arrow#43716]: https://github.com/apache/arrow/issues/43716
    """
    from narwhals._plan.arrow.group_by import AggSpec

    if isinstance(native, pa.Scalar):
        scalar = t.cast("pa.ListScalar[Any]", native)
        if scalar.is_valid and (len(scalar) > 1):
            return implode(_list_explode(native).unique())
        return scalar
    idx, v = "index", "values"
    names = idx, v
    len_not_eq_0 = not_eq(list_len(native), lit(0))
    can_fastpath = all_(len_not_eq_0, ignore_nulls=False).as_py()
    if can_fastpath:
        arrays = [_list_parent_indices(native), _list_explode(native)]
        return AggSpec.unique(v).over_index(concat_horizontal(arrays, names), idx)
    # Oh no - we caught a bad one!
    # We need to split things into good/bad - and only work on the good stuff.
    # `int_range` is acting like `parent_indices`, but doesn't give up when it sees `None` or `[]`
    indexed = concat_horizontal([int_range(len(native)), native], names)
    valid = indexed.filter(len_not_eq_0)
    invalid = indexed.filter(or_(native.is_null(), not_(len_not_eq_0)))
    # To keep track of where we started, our index needs to be exploded with the list elements
    explode_with_index = ExplodeBuilder.explode_column_fast(valid, v)
    valid_unique = AggSpec.unique(v).over(explode_with_index, [idx])
    # And now, because we can't join - we do a poor man's version of one 😉
    return concat_tables([valid_unique, invalid]).sort_by(idx).column(v)


def list_contains(
    native: ChunkedOrScalar[ListScalar], item: NonNestedLiteral | ScalarAny
) -> ChunkedOrScalar[pa.BooleanScalar]:
    from narwhals._plan.arrow.group_by import AggSpec

    if isinstance(native, pa.Scalar):
        scalar = t.cast("pa.ListScalar[Any]", native)
        if scalar.is_valid:
            if len(scalar):
                value_type = scalar.type.value_type
                return any_(eq_missing(_list_explode(scalar), lit(item).cast(value_type)))
            return lit(False, BOOL)
        return lit(None, BOOL)
    builder = ExplodeBuilder(empty_as_null=False, keep_nulls=False)
    tbl = builder.explode_with_indices(native)
    idx, name = tbl.column_names
    contains = eq_missing(tbl.column(name), item)
    l_contains = AggSpec.any(name).over_index(tbl.set_column(1, name, contains), idx)
    # Here's the really key part: this mask has the same result we want to return
    # So by filling the `True`, we can flip those to `False` if needed
    # But if we were already `None` or `False` - then that's sticky
    propagate_invalid: ChunkedArray[pa.BooleanScalar] = not_eq(list_len(native), lit(0))
    return replace_with_mask(propagate_invalid, propagate_invalid, l_contains)


def implode(native: Arrow[Scalar[DataTypeT]]) -> ListScalar[DataTypeT]:
    """Aggregate values into a list.

    The returned list itself is a scalar value of `list` dtype.
    """
    arr = array(native)
    return pa.ListArray.from_arrays([0, len(arr)], arr)[0]


def str_join(
    native: Arrow[StringScalar], separator: str, *, ignore_nulls: bool = True
) -> StringScalar:
    """Vertically concatenate the string values in the column to a single string value."""
    if isinstance(native, pa.Scalar):
        # already joined
        return native
    if ignore_nulls and native.null_count:
        native = native.drop_null()
    return list_join(implode(native), separator)

Member

@MarcoGorelli MarcoGorelli left a comment


thanks all! looks like this is very much on the right track

Contributor Author

raisadz commented Dec 13, 2025

Thanks for the suggestion @MarcoGorelli ! I applied it now to the list_agg function

Member

@MarcoGorelli MarcoGorelli left a comment


thanks @raisadz , and @FBruzzesi + @dangotbanned for reviews

@MarcoGorelli MarcoGorelli merged commit 4e6f646 into narwhals-dev:main Dec 14, 2025
35 of 36 checks passed
dangotbanned added a commit that referenced this pull request Dec 14, 2025
Comment on lines +33 to +39
if (
    any(backend in str(constructor) for backend in ("pandas", "pyarrow"))
    and sys.version_info < (3, 10)
    and is_windows
):  # pragma: no cover
    reason = "The issue only affects old Python versions on Windows."
    pytest.skip(reason=reason)
Member

@dangotbanned dangotbanned Dec 14, 2025


Sorry to have caught this so late, I just noticed in (oh-nodes...expr-ir/list-agg)

Since is_windows is a function ...

Show is_windows

[screenshot of the `is_windows` function from tests/utils]

... this condition is always True:

from tests.utils import is_windows

print(bool(is_windows))
# True

Meaning that all platforms skip on sys.version_info < (3, 10) 😱

I usually prefer xfail to skip since the former is the only one that'll tell you when something's amiss 😉
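The same trap applies to any function object: it is truthy, so the missing call parentheses silently turn a platform check into a constant. A minimal stdlib sketch (this `is_windows` is a stand-in for the helper in `tests.utils`):

```python
import sys

def is_windows() -> bool:
    return sys.platform.startswith("win")

# A function object is always truthy, regardless of platform:
print(bool(is_windows))  # True everywhere
# The intended check requires actually calling it:
print(bool(is_windows()))
```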

Member


good catch, thanks!

Contributor Author


Thanks @dangotbanned ! I added a follow-up PR to fix this #3354

dangotbanned added a commit that referenced this pull request Dec 14, 2025
Tried to keep everything as close to original as possible
Next step is simplifying everything and fixing `list.sum`
Comment on lines +15 to +17
data = {"a": [[3, None, 2, 2, 4, None], [-1], None, [None, None, None], [], [3, 4, None]]}
expected = [2.5, -1, None, None, None, 3.5]
expected_pyarrow = [2.5, -1, None, None, None, 3]
Member

@dangotbanned dangotbanned Dec 14, 2025


@raisadz @FBruzzesi (re #3332 (comment), #3332 (review))

I've done some experimenting and think I've found what the pyarrow issue is:

import pyarrow as pa
import pyarrow.compute as pc


def median(*values: float | None) -> pa.DoubleScalar:
    return pc.approximate_median(pa.array(values, pa.float64()))


def median_pretty(*values: float | None) -> None:
    print(f"median({list(values)!a:21}) = {median(*values)}")

I wonder if you can spot it too 😄:

median_pretty()
median_pretty(3)
median_pretty(3, 4)
median_pretty(3, 4, None)
median_pretty(3, 4, None, None)
median_pretty(3, 4, None, None, 5)
median_pretty(3, 4, 5)
median_pretty(5, 3, 4)
median_pretty(5, 3)
median_pretty(5, 2)
median_pretty(5, 2, 50)
median_pretty(None, 2, 50)
median_pretty(None, 2, 50, 2)
median_pretty(50, 2, 50, 2)
median([]                   ) = None
median([3]                  ) = 3.0
median([3, 4]               ) = 3.0
median([3, 4, None]         ) = 3.0
median([3, 4, None, None]   ) = 3.0
median([3, 4, None, None, 5]) = 4.0
median([3, 4, 5]            ) = 4.0
median([5, 3, 4]            ) = 4.0
median([5, 3]               ) = 3.0
median([5, 2]               ) = 2.0
median([5, 2, 50]           ) = 5.0
median([None, 2, 50]        ) = 2.0
median([None, 2, 50, 2]     ) = 2.0
median([50, 2, 50, 2]       ) = 26.0
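For comparison, the exact median interpolates between the two middle values on even-length input, which `pc.approximate_median` (a t-digest based approximation) does not always do. The expected values for the divergent cases above, via the stdlib:

```python
from statistics import median

# Exact medians for inputs where approximate_median diverged above:
print(median([3, 4]))          # 3.5 (pyarrow returned 3.0)
print(median([5, 2]))          # 3.5 (pyarrow returned 2.0)
print(median([50, 2, 50, 2]))  # 26.0 (here the approximation agrees)
```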

dangotbanned added a commit that referenced this pull request Dec 14, 2025
Demonstrated in (#3332 (comment))
The issue is unrelated to group_by and lists