feat: add DataFrame and LazyFrame explode method #1542
Conversation
narwhals/_arrow/dataframe.py
Outdated
```python
# TODO(Unassigned): Even with promote_options="permissive", pyarrow does not
# upcast numeric to non-numeric (e.g. string) datatypes

def explode(self: Self, columns: str | Sequence[str], *more_columns: str) -> Self:
```
pyarrow has two paths:
- if nulls or empty lists are not present, then it is enough to:
  - make sure the element counts are the same
  - explode each array individually
- if nulls or empty lists are present, then these are ignored by `pc.list_parent_indices` and `pc.list_flatten`, which is a problem. This implementation falls back to a Python list both to flatten the array(s) and to create the corresponding indices.

After flattening, a new table is created by `take`-ing the indices of the non-flattened arrays and the flattened arrays.
```python
def explode(self: Self, columns: str | Sequence[str], *more_columns: str) -> Self:
```
If a single column is to be exploded, then we use the pandas native method. If there are multiple columns, the strategy is to explode one column together with the rest of the dataframe, explode the other series individually, and finally concatenate them back, sorting by the original column name order.
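A rough sketch of that multi-column strategy on a toy frame (illustrative only, not the actual narwhals implementation):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [[1, 2], [3]], "b": [["x", "y"], ["z"]], "c": [10, 20]}
)

# Explode the first list column together with the non-list columns...
head = df[["a", "c"]].explode("a").reset_index(drop=True)
# ...explode every other list column on its own...
rest = df["b"].explode().reset_index(drop=True)
# ...then concatenate positionally and restore the original column order.
out = pd.concat([head, rest], axis=1)[df.columns.tolist()]
print(out)
```

This relies on every list column producing the same number of rows per input row, which is exactly the element-count check mentioned above.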
narwhals/_arrow/dataframe.py
Outdated
```python
def explode_null_array(array: pa.ChunkedArray) -> pa.ChunkedArray:
    exploded_values = []  # type: ignore[var-annotated]
    for lst_element in array.to_pylist():
```
i might be missing something but this looks potentially very expensive
It definitely is 😐 happy to raise for nullable cases for now and try to figure out a native alternative
I did some improvements 🙌🏼 now there is only one .to_pylist call to create the indices and no loop in the case of nulls or empty lists
thanks - tbh i'm still not too keen on having `to_pylist`. should we just not support pyarrow and raise this as a feature request with them?
Let me take a detour into the rabbit hole before a final decision 😇
@MarcoGorelli thoughts on the following? The idea being: if consecutive parent indices are the same, their diff will be zero, therefore the counter for such an index should increase, otherwise it should not. By how much, you ask? By the cumulative count of the matching parent indices, starting from 0 (hence the final `- 1`). I will definitely add a comment if we decide to move forward with this. This is the `else` block:

```python
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

parent_indices = pc.list_parent_indices(native_frame[to_explode[0]])
diff = pc.pairwise_diff(parent_indices.combine_chunks())
is_change_point = pc.or_kleene(pc.equal(diff, 0), pc.is_null(diff))
indices = pc.subtract(
    pc.add(
        parent_indices,
        pc.cumulative_sum(is_change_point.cast(pa.int32())),
    ),
    1,
)
# `counts` holds the per-row list lengths (e.g. from pc.list_value_length)
exploded_size = pc.sum(pc.max_element_wise(counts, 1, skip_nulls=True)).as_py()
valid_mask = pc.is_in(pa.array(np.arange(exploded_size), type=indices.type), indices)

def flatten_func(array: pa.ChunkedArray) -> pa.ChunkedArray:
    dtype = array.type.value_type
    return pc.replace_with_mask(
        pa.array([None] * exploded_size, type=dtype),
        valid_mask,
        pc.list_flatten(array).combine_chunks(),
    )
```

Edit: As the initial explanation above was a flow of consciousness, I will try to rephrase it in a proper way, which could actually end up documenting the "algorithm".

**The issue**

Both Polars explode will result in …

**The solution**

So far the native solution is not present in the code, yet the proposal is the code block above in this comment. The idea goes as follows:
Edit pt. 2: It turns out that the solution is still missing how to compute indices for the …

```python
indices = pa.array(
    [
        i
        for i, count in enumerate(filled_counts.to_pylist())
        for _ in range(count)
    ]
)
```
nice - imma have to spend some time understanding this 😄 😳
I deduce that my explanation was garbage. Let me explode the previous comment
Not at all, I just hadn't spent enough time on it, will take a look and make sense of it!
MarcoGorelli
left a comment
OK thanks for explaining - sure, happy to try this out, maybe I could write a hypothesis test to stress test it
would also be open to merging for now without pyarrow (as the rest is quite uncontroversial) and then adding pyarrow later?
narwhals/_arrow/dataframe.py
Outdated
```python
if fast_path:
    indices = pc.list_parent_indices(native_frame[to_explode[0]])
    flatten_func = pc.list_flatten
else:
    msg = (
        "`DataFrame.explode` is not supported for the pyarrow backend with columns"
        " containing nulls or empty list elements"
    )
    raise NotImplementedError(msg)
```
do we want value-dependent behaviour determining whether an error is raised? would you be opposed to raising unconditionally for pyarrow and raising a feature request on their side?
Oh, I see! I am happy to keep pyarrow in a dedicated PR. I will adjust here
Edit: feat/pyarrow-explode branch
Maybe I should also check how pandas does it with the pyarrow backend?
MarcoGorelli
left a comment
thanks @FBruzzesi ! feel free to merge when ready