Skip to content

Add to_arrow method to pylibcudf core types#19787

Merged
rapids-bot[bot] merged 14 commits intorapidsai:branch-25.10from
Matt711:fea/plc/to-arrow
Sep 18, 2025
Merged

Add to_arrow method to pylibcudf core types#19787
rapids-bot[bot] merged 14 commits intorapidsai:branch-25.10from
Matt711:fea/plc/to-arrow

Conversation

@Matt711
Copy link
Contributor

@Matt711 Matt711 commented Aug 25, 2025

Description

Addsto_arrow methods to pylibcudf core types. Not a priority but a quality of life improvement for pylibcudf developers. Eg. Call .to_arrow instead of plc.interop.to_arrow

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 requested a review from a team as a code owner August 25, 2025 16:44
@Matt711 Matt711 added feature request New feature or request non-breaking Non-breaking change labels Aug 25, 2025
@Matt711 Matt711 marked this pull request as draft August 25, 2025 16:45
@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Copy link
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, is there an immediate use case for this API?

@Matt711
Copy link
Contributor Author

Matt711 commented Aug 25, 2025

Curious, is there an immediate use case for this API?

This is really just a quality of life improvement for pylibcudf devs. Easier to type to_arrow vs plc.interop.to_arrow

@Matt711 Matt711 moved this to In Progress in cuDF Python Aug 26, 2025
@Matt711 Matt711 marked this pull request as ready for review August 26, 2025 13:36
@mroeschke
Copy link
Contributor

OK I would be open to having to_arrow methods on our objects (to mirror to_py for example) if we eventually also depreciate and remove plc.interop.to_arrow as well. Could use a second opinion cc @vyasr

@vyasr
Copy link
Contributor

vyasr commented Sep 3, 2025

I would prefer not to have two ways to do the arrow conversion. The argument for keeping the interop module approach is that I would also like to isolate our pyarrow dependency as much as possible so that we could completely avoid the pyarrow dependency except when a user strictly requests it. I'm not very convinced of the quality-of-life improvement of this PR TBH. Are you finding the current approach cumbersome? My hope is that we move away from converting to pyarrow representations as much as possible actually. Outside of our test suite, my view is that we view the arrow data spec as the mode of interop so that any library that directly supports it will "just work", but we shouldn't be converting device->host outside of debugging too often. Would it help to lazily alias to_arrow into the top-level pylibcudf namespace (e.g. with a module-level __getattr__ implementation)? I say lazily so that we still don't import pyarrow if we don't have to.

@Matt711
Copy link
Contributor Author

Matt711 commented Sep 4, 2025

I would prefer not to have two ways to do the arrow conversion.

Sure, but we have two ways of ingesting arrow objects Table.from_arrow and plc.interop.from_arrow. If we prefer to have one way, maybe it should hold for both directions?

I'm not very convinced of the quality-of-life improvement of this PR TBH. Are you finding the current approach cumbersome?

Heh, maybe I'm in the minority here 😅, but I do find the current approach cumbersome.

Outside of our test suite, my view is that we view the arrow data spec as the mode of interop so that any library that directly supports it will "just work", but we shouldn't be converting device->host outside of debugging too often.

OK yeah that makes sense.

Would it help to lazily alias to_arrow into the top-level pylibcudf namespace (e.g. with a module-level __getattr__ implementation)? I say lazily so that we still don't import pyarrow if we don't have to.

I don't think so. I like this style

(
    plc.Column.from_arrow(...)
    .do_stuff()
    .to_arrow()
)

@vyasr
Copy link
Contributor

vyasr commented Sep 4, 2025

Matt and I discussed earlier today and we decided to move forward with this PR and make the class methods the source of truth and eventually deprecate the interop functions instead. We can still minimize and isolate our pyarrow interactions, it'll just have to be spread over a few more files to do it.

@Matt711 Matt711 added the 3 - Ready for Review Ready for review by team label Sep 15, 2025
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me aside from a couple of tiny notes. Could you open a follow-up issue to migrate all calls in cudf-polars and cudf classic to use the new APIs, and then to deprecate the old interop module's functions?

Comment on lines 83 to 85
# TODO: Once the arrow C device interface registers more
# types that it supports, we can call pa.array(self) if
# no metadata is passed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this comment referring to? I don't understand I'm afraid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a typo that should be pa.table

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean is that once arrow can consume more kinds of device arrays, we can call pa.table(self) directly. The following example fails today

In [1]: import pylibcudf as plc

In [2]: import pyarrow as pa

In [3]: pa.table(plc.Table([plc.Column.from_iterable_of_py([1,2])]))
---------------------------------------------------------------------------
ArrowKeyError                             Traceback (most recent call last)
Cell In[3], line 1
----> 1 pa.table(plc.Table([plc.Column.from_iterable_of_py([1,2])]))

File ~/.conda/envs/rapids/lib/python3.13/site-packages/pyarrow/table.pxi:6235, in pyarrow.lib.table()

File ~/.conda/envs/rapids/lib/python3.13/site-packages/pyarrow/table.pxi:6054, in pyarrow.lib.record_batch()

File ~/.conda/envs/rapids/lib/python3.13/site-packages/pyarrow/table.pxi:4044, in pyarrow.lib.RecordBatch._import_from_c_device_capsule()

File ~/.conda/envs/rapids/lib/python3.13/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.conda/envs/rapids/lib/python3.13/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowKeyError: Device type 2is not registered

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. Yeah... I'm not sure that'll ever be supported, but if it is then great. The best bet might be backing pyarrow with pylibcudf for GPU bits!

@github-actions github-actions bot added the cudf-polars Issues specific to cudf-polars label Sep 17, 2025
@Matt711
Copy link
Contributor Author

Matt711 commented Sep 18, 2025

/merge

@rapids-bot rapids-bot bot merged commit 54c9ce0 into rapidsai:branch-25.10 Sep 18, 2025
130 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Sep 18, 2025
@Matt711 Matt711 deleted the fea/plc/to-arrow branch September 18, 2025 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team cudf-polars Issues specific to cudf-polars feature request New feature or request non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants

Comments