feat: add DataFrame and LazyFrame explode method #1542
Conversation
narwhals/_arrow/dataframe.py
Outdated
```python
# TODO(Unassigned): Even with promote_options="permissive", pyarrow does not
# upcast numeric to non-numeric (e.g. string) datatypes

def explode(self: Self, columns: str | Sequence[str], *more_columns: str) -> Self:
```
pyarrow has two paths:
- if nulls or empty lists are not present, then it is enough to:
  - make sure the element counts are the same
  - explode each array individually
- if nulls or empty lists are present, then these are ignored by `pc.list_parent_indices` and `pc.list_flatten`, which is a problem. This implementation falls back to a Python list both to flatten the array(s) and to create the corresponding indices.

After flattening, a new table is created by `take`-ing the indices of the non-flattened arrays and the flattened arrays.
```python
def explode(self: Self, columns: str | Sequence[str], *more_columns: str) -> Self:
```
If a single column is to be exploded, then we use the pandas native method. If there are multiple columns, the strategy is to explode one column together with the rest of the dataframe, explode the other series individually, and finally concatenate them back, sorting by the original column name order.
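A rough sketch of that multi-column strategy on a toy frame (illustrative only, not the actual narwhals implementation):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [[1, 2], [3]], "b": [["x", "y"], ["z"]], "c": [10, 20]}
)

# Explode the first list column together with the non-list columns...
head = df[["a", "c"]].explode("a").reset_index(drop=True)
# ...explode every other list column on its own...
rest = df["b"].explode().reset_index(drop=True)
# ...then concatenate positionally and restore the original column order.
out = pd.concat([head, rest], axis=1)[df.columns.tolist()]
print(out)
```

This relies on every list column producing the same number of rows per input row, which is exactly the element-count check mentioned above.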
narwhals/_arrow/dataframe.py
Outdated
```python
def explode_null_array(array: pa.ChunkedArray) -> pa.ChunkedArray:
    exploded_values = []  # type: ignore[var-annotated]
    for lst_element in array.to_pylist():
```
i might be missing something but this looks potentially very expensive
It definitely is 😐 happy to raise for nullable cases for now and try to figure out a native alternative
I did some improvements 🙌🏼 now there is only one .to_pylist call to create the indices and no loop in the case of nulls or empty lists
thanks - tbh i'm still not too keen on having `to_pylist`. should we just not support pyarrow and raise this as a feature request with them?
Let me take a detour into the rabbit hole before a final decision 😇
@MarcoGorelli thoughts on the following? The idea being: if consecutive parent indices are the same, their diff will be zero, therefore the counter for such an index should increase, otherwise it should not. By how much, you ask? By the cumulative count of the matching parent indices, starting from 0 (hence the final `- 1`). I will definitely add a comment if we decide to move forward with this. This is the `else` block:

```python
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

parent_indices = pc.list_parent_indices(native_frame[to_explode[0]])
diff = pc.pairwise_diff(parent_indices.combine_chunks())
is_change_point = pc.or_kleene(pc.equal(diff, 0), pc.is_null(diff))
indices = pc.subtract(
    pc.add(
        parent_indices,
        pc.cumulative_sum(is_change_point.cast(pa.int32())),
    ),
    1,
)
# `counts` holds the per-row list lengths (e.g. from pc.list_value_length)
exploded_size = pc.sum(pc.max_element_wise(counts, 1, skip_nulls=True)).as_py()
valid_mask = pc.is_in(pa.array(np.arange(exploded_size), type=indices.type), indices)

def flatten_func(array: pa.ChunkedArray) -> pa.ChunkedArray:
    dtype = array.type.value_type
    return pc.replace_with_mask(
        pa.array([None] * exploded_size, type=dtype),
        valid_mask,
        pc.list_flatten(array).combine_chunks(),
    )
```

Edit: As the initial explanation above was a flow of consciousness, I will try to rephrase it in a proper way, which could actually end up documenting the "algorithm".

**The issue**

Both Polars explode will result in …

**The solution**

So far the native solution is not present in the code, yet the proposal is the code block above in this comment. The idea goes as follows:
Edit pt. 2: It turns out that the solution is still missing how to compute indices for the …

```python
indices = pa.array(
    [
        i
        for i, count in enumerate(filled_counts.to_pylist())
        for _ in range(count)
    ]
)
```
nice - imma have to spend some time understanding this 😄 😳
I deduce that my explanation was garbage. Let me explode the previous comment
Not at all, I just hadn't spent enough time on it, will take a look and make sense of it!
MarcoGorelli
left a comment
OK thanks for explaining - sure, happy to try this out, maybe I could write a hypothesis test to stress test it
would also be open to merging for now without pyarrow (as the rest is quite uncontroversial) and then adding pyarrow later?
narwhals/_arrow/dataframe.py
Outdated
```python
if fast_path:
    indices = pc.list_parent_indices(native_frame[to_explode[0]])
    flatten_func = pc.list_flatten
else:
    msg = (
        "`DataFrame.explode` is not supported for the pyarrow backend with columns"
        " containing nulls or empty list elements"
    )
    raise NotImplementedError(msg)
```
do we want value-dependent behaviour determining whether an error is raised? would you be opposed to raising unconditionally for pyarrow and raising a feature request on their side?
Oh, I see! I am happy to keep pyarrow in a dedicated PR. I will adjust here
Edit: feat/pyarrow-explode branch
Maybe I should also check how pandas does it with the pyarrow backend?
MarcoGorelli
left a comment
thanks @FBruzzesi ! feel free to merge when ready