Add PyCapsule support for Arrow import and export #825
Conversation
You can see my casting implementation here. But note that's slightly different as I'm using an
Ok, added the requested schema and unit tests for export. All that I think is left is unit tests on the import. I'm not sure if I should keep the nanoarrow dependency in the unit tests. If so, I need to add it to the requirements files.
docs/source/user-guide/io/arrow.rst (Outdated)
important to note that this will cause the DataFrame execution to happen, which may be a time-consuming task. That is, you will cause a :py:func:`datafusion.dataframe.DataFrame.collect` operation call to occur.
I'd suggest putting this into an "admonition" box with a warning color to make this clearer. I'm not sure how to do that in sphinx, but this is what I'm referring to in mkdocs-material: https://squidfunk.github.io/mkdocs-material/reference/admonitions/#supported-types
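For what it's worth, Sphinx has this built in as the reStructuredText `warning` admonition directive. A minimal sketch of how the quoted passage could be wrapped (assuming the wording stays roughly as written):

```rst
.. warning::

   Exporting will cause the DataFrame execution to happen, which may be a
   time-consuming task. That is, it causes a
   :py:func:`datafusion.dataframe.DataFrame.collect` operation to occur.
```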
``__arrow_c_stream__`` or ``__arrow_c_array__``. For the latter, it must return a struct array. Common examples of sources from pyarrow include
For both they must emit a struct array. Any Arrow array can be passed through an `__arrow_c_stream__`. Canonically, to transfer a DataFrame you have a stream of struct arrays, where each one is unpacked to become the columns of a RecordBatch. But it doesn't have to be a struct array: you can also transfer a Series through an `__arrow_c_stream__`, where each batch in the stream iterator is just a primitive array.
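To make the distinction concrete, here is a small pyarrow sketch (an illustration only, not code from this PR; it assumes pyarrow >= 14, which implements the PyCapsule interface):

```python
import pyarrow as pa

# A Table exposes __arrow_c_stream__; each batch in the stream is logically
# a struct array that consumers unpack into the columns of a RecordBatch.
table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
table_capsule = table.__arrow_c_stream__()

# A ChunkedArray (a Series-like value) also exposes __arrow_c_stream__,
# but here each batch in the stream is just a primitive int64 array.
chunked = pa.chunked_array([[1, 2], [3]])
series_capsule = chunked.__arrow_c_stream__()
```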
    merged_schema.project(&project_indices)
    }

    fn record_batch_into_schema(
I am surprised that neither `arrow-rs` nor `datafusion` has such a utility for converting a record batch, but I did take a quick look around and didn't find anything.
Well, there is `cast`. Cast works on struct arrays, so you could make a simple wrapper around `cast` to work on `RecordBatch` by creating a struct array from the record batch. This is what I do in pyo3-arrow.

The main difference is that cast doesn't also project. It's not clear to me whether the PyCapsule Interface intends to support projection or not. I don't think anyone has asked.
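The `cast` discussed here is the arrow-rs kernel in Rust; purely to illustrate the same wrapper idea, a pyarrow analogue might look like the sketch below (`record_batch_cast` is a hypothetical name, and struct-to-struct casts require a reasonably recent pyarrow):

```python
import pyarrow as pa

def record_batch_cast(batch: pa.RecordBatch, target: pa.Schema) -> pa.RecordBatch:
    """Cast a RecordBatch to a target schema by going through a struct array."""
    # Wrap the batch's columns into a single struct array...
    as_struct = pa.StructArray.from_arrays(batch.columns, fields=list(batch.schema))
    # ...cast the whole struct in one call (the wrapper-around-cast idea)...
    cast_struct = as_struct.cast(pa.struct(list(target)))
    # ...then unwrap it back into a RecordBatch.
    return pa.RecordBatch.from_struct_array(cast_struct)
```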
Since the user isn't calling the pycapsule interface directly, it's also not clear how the user API would look to ask for a projection via pycapsules.
I applied your suggestions. Do either of you want a re-review before we ask to merge?
Excellent.
Co-authored-by: Michael J Ward <[email protected]>
(force-pushed from f1ef382 to 955d7d5)
Which issue does this PR close?
Closes #752
Rationale for this change
User requested.
What changes are included in this PR?
With this change you can import any Arrow table that implements the PyCapsule Interface using the `SessionContext.from_arrow_table` function. Additionally, PyCapsule export of `DataFrame` is added. Now any Python-based project that uses Arrow with the PyCapsule interface can directly consume a DataFusion DataFrame. The DataFrame will be executed at the point of export.
You can see a minimal example in the issue ticket.
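For orientation, a round-trip sketch of the new behavior (hedged: `from_arrow` behaves as described in this PR, and consuming arbitrary stream objects with `pa.table` assumes pyarrow >= 14):

```python
import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()

# Import: any object exposing the PyCapsule interface works, not just a
# pyarrow Table.
df = ctx.from_arrow(pa.table({"a": [1, 2, 3]}))

# Export: the DataFrame implements __arrow_c_stream__, so any PyCapsule-aware
# consumer can read it. Note that this triggers execution (a collect).
exported = pa.table(df)
```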
Are there any user-facing changes?
This PR adds `SessionContext.from_arrow`, which serves the same purpose as `from_arrow_table` except that it takes any object that implements the required PyCapsule functions. `from_arrow_table` is now an alias to `from_arrow`.

Still to do: