-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dont use take with arrow #2336
base: master
Are you sure you want to change the base?
dont use take with arrow #2336
Conversation
Yeah, when using sliced arrays, that seems to be the case. It will try to concatenate them first, which will explode the memory use! This is quite a bad arrow situation... :/ digesting this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this Ben, please take a look at my comment.
packages/vaex-core/vaex/column.py
Outdated
# Don't use .take in arrow anymore | ||
# https://issues.apache.org/jira/browse/ARROW-9773 | ||
# slice is zero-copy | ||
if isinstance(ar_unfiltered, pa.Array): | ||
ar = pa.concat_arrays( | ||
[ar_unfiltered.slice(i, 1) for i in take_indices] | ||
) | ||
elif isinstance(ar_unfiltered, pa.ChunkedArray): | ||
ar = pa.concat_arrays( | ||
[ar_unfiltered.slice(i, 1).combine_chunks() for i in take_indices] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Don't use .take in arrow anymore | |
# https://issues.apache.org/jira/browse/ARROW-9773 | |
# slice is zero-copy | |
if isinstance(ar_unfiltered, pa.Array): | |
ar = pa.concat_arrays( | |
[ar_unfiltered.slice(i, 1) for i in take_indices] | |
) | |
elif isinstance(ar_unfiltered, pa.ChunkedArray): | |
ar = pa.concat_arrays( | |
[ar_unfiltered.slice(i, 1).combine_chunks() for i in take_indices] | |
) | |
if isinstance(ar_unfiltered, pa.Array): | |
ar = ar_unfiltered.take(vaex.array_types.to_arrow(take_indices)) | |
elif isinstance(ar_unfiltered, pa.ChunkedArray): | |
# Don't use .take in arrow for chunked arrays | |
# https://issues.apache.org/jira/browse/ARROW-9773 | |
# slice is zero-copy | |
ar = pa.concat_arrays( | |
[ar_unfiltered.slice(i, 1).combine_chunks() for i in take_indices] | |
) |
it is only an issue for chunked arrays right?
Maybe the best place to put this is vaex.array_types.take ?
There is already a take function there, and if all places in vaex uses that function, we have a nice central place where we can solve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maartenbreddels i do not think this is only an issue for chunked_arrays, I think it's an issue with take
broadly. I definitely like the idea of moving it down to a centralized place, but I think no matter the array type, you need to replace take
with slice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://issues.apache.org/jira/browse/ARROW-9773 only talks about chunked arrays, and I also think it's only a problem for that (since it's more complex).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay i'll take your word on this one, you certainly know more here. I'll move this into the vaex array_types.take
@maartenbreddels i updated the function to use the array_types I tried to dig into the dataframe |
I'll scan over it tomorrow! |
This directly addresses #2335
And is directly the fix for https://issues.apache.org/jira/browse/ARROW-9773 (which is now apache/arrow#33049)
I believe fixing all of the
.take
s to.slice
would also fix #2334 because.take
uses memory, but.slice
is zero-copy. The memory is exploding going to hdf5 because we keep.take
ingYou can see huggingface datasets does the same thing: https://github.com/huggingface/datasets/pull/645/files
That being said, there are a number of other places that vaex uses
.take
which should be fixed. But because of the lack of typing in the vaex repo, it's hard for me to know which ones are pyarrow arrays, which are pyarrow tables, and which are numpy arrays. I'm happy to help move the rest over, but I would need some guidance.Here are all of the places
.take
is used