Support sliced list arrays in cast #2461

Merged: 2 commits, Jun 8, 2021

lhoestq commented Jun 8, 2021

pyarrow currently cannot cast a sliced list array, i.e. one with a non-zero offset:

import pyarrow as pa

arr = pa.array([[i * 10] for i in range(4)])
arr.cast(pa.list_(pa.int32()))  # works

arr = arr.slice(1)
arr.cast(pa.list_(pa.int32()))  # fails
# ArrowNotImplementedError("Casting sliced lists (non-zero offset) not yet implemented")

However, Dataset.cast slices tables before casting their types, because casting is memory intensive, so it hits the same error.
As a result, it is currently not possible to cast a Dataset with a Sequence feature type, unless the table is small enough that it is not sliced.

This PR fixes that by resetting the offset of the pyarrow.ListArray arrays in the table to zero before casting.
The ListArray offsets are updated with the pyarrow.compute.subtract function.

cc @abhi1thakur @SBrandeis

lhoestq merged commit a7fd3e5 into master on Jun 8, 2021
lhoestq deleted the support-sliced-list-arrays-in-cast branch on June 8, 2021 17:56