Skip to content

feat: ArrowDataFrame.explode#1644

Draft
FBruzzesi wants to merge 27 commits intomainfrom
feat/pyarrow-explode
Draft

feat: ArrowDataFrame.explode#1644
FBruzzesi wants to merge 27 commits intomainfrom
feat/pyarrow-explode

Conversation

@FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented Dec 22, 2024

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

I will leave this as draft until we decide how to move forward.

To summarize the discussion(s) in #1542 :

  • pyarrow native methods ignore nulls and empty list in explode
  • the workaround here is to have a "fast_path" for when nulls or empty lists are not present, and a dedicated path for when they are
  • the issue is that we enter python world to create the index via one .to_pylist() call
  • pandas seems to enter cython anyway to explode a list

@dangotbanned dangotbanned mentioned this pull request Mar 17, 2025
12 tasks
@dangotbanned dangotbanned added enhancement New feature or request pyarrow Issue is related to pyarrow backend labels Mar 25, 2025
@dangotbanned
Copy link
Member

dangotbanned commented Mar 25, 2025

@FBruzzesi I feel like this shouldn't have got lost!

ArrowDataFrame.explode is 1 of 3 remaining implementations we need

explode = not_implemented()

join_asof = not_implemented()

I might add a PR for ArrowDataFrame.clone - since it can just utilize arrow data being immutable

clone = not_implemented()

@FBruzzesi
Copy link
Member Author

FBruzzesi commented Mar 25, 2025

I feel like this shouldn't have got lost!

Thanks @dangotbanned ♥️ The main concern was a conversion to python object: filled_counts.to_pylist() in:

    if fast_path:
        indices = pc.list_parent_indices(native_frame[to_explode[0]])
        flatten_func = pc.list_flatten

    else:
        filled_counts = pc.max_element_wise(counts, 1, skip_nulls=True)
        indices = pa.array(
            [
                i
                for i, count in enumerate(filled_counts.to_pylist())
                for _ in range(count)
            ]
        )

@dangotbanned
Copy link
Member

#1644 (comment)

Maybe we can figure out another path hidden somewhere in the stubs? 🤔

dangotbanned added a commit that referenced this pull request Mar 25, 2025
`.to_pylist` being called on a scalar is all that is left
@dangotbanned
Copy link
Member

dangotbanned commented Mar 25, 2025

@FBruzzesi @MarcoGorelli

It seems like polars wants to make a breaking change in the next major version - resulting in the same behavior as pyarrow.

If we had that behavior as the goal - I think pc.list_flatten(..., recursive=True) would get us most of the way there.
Just something to keep in mind for the future 🙂

dangotbanned and others added 3 commits March 25, 2025 13:41
Just leaving as-is, since this'll probably change in the future
#1644 (comment)
> error: Incompatible redefinition (redefinition with type "Callable[[ChunkedArray[ListScalar[Any]]], ChunkedArray[Any]]", original type overloaded function)  [misc]

https://github.com/narwhals-dev/narwhals/actions/runs/14060304329/job/39369169923?pr=1644
@dangotbanned
Copy link
Member

dangotbanned commented Dec 7, 2025

@FBruzzesi I'm coming back to this after my recent fiddling with pyarrow 😄
There may still be a way to match polars - while staying in pyarrow-land - but I'm only just starting to look now 🤞

Updated: Yep, it works! 🥳 (#3347)


Regarding: (#1644 (comment))

3 weeks ago on that issue (pola-rs/polars#17664 (comment))

Note that the option got added to resolve this in #25289.

So now polars.DataFrame.explode has this signature:

    def explode(
        self,
        columns: ColumnNameOrSelector | Iterable[ColumnNameOrSelector],
        *more_columns: ColumnNameOrSelector,
        empty_as_null: bool = True,
        keep_nulls: bool = True,
    ) -> DataFrame:

dangotbanned added a commit that referenced this pull request Dec 7, 2025
Mentioned in (#1644 (comment))
Multi-column coming up next 😄
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request pyarrow Issue is related to pyarrow backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants