[C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays #33049
Comments
David Li / @lidavidm:
Ben Epstein:
Is there a way for me to check this before casting? I.e., instead of trying, catching the failure, casting, and trying again, is there a (fast) computation I can run to see if I'm over the size limit?
David Li / @lidavidm:
Ben Epstein:
I'm using vaex for all my arrow work, so using your suggestion, here's how I'm handling it (in case others find themselves here):

import pyarrow as pa
import vaex
import numpy as np
from vaex.dataframe import DataFrame

n = 50_000
x = str(np.random.randint(low=0, high=1000, size=(30_000,)).tolist())

# Create a df with a string too large
df = vaex.from_arrays(
    id=list(range(n)),
    y=np.random.randint(low=0, high=1000, size=n),
)
df["text"] = vaex.vconstant(x, len(df))

# byte limit for arrow strings
# because 1 character = 1 byte, the total number of characters in the
# column in question must be less than the size_limit
size_limit = 2 * 1e9

def validate_str_cols(df: DataFrame) -> DataFrame:
    for col, dtype in zip(df.get_column_names(), df.dtypes):
        if dtype == str and df[col].str.len().sum() >= size_limit:
            df[col] = df[col].to_arrow().cast(pa.large_string())
    return df

# text is type string
print(df.dtypes)
df = validate_str_cols(df)
# text is type large_string
print(df.dtypes)

# works!
y = df.text.values.combine_chunks()
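Note that str.len() counts characters, while Arrow's 2 GB limit applies to UTF-8 bytes, so for non-ASCII text a byte-based check against the underlying Arrow buffers is safer. A minimal sketch (the helper name is hypothetical, not a vaex or pyarrow API):

```python
import pyarrow as pa

def exceeds_string_offset_limit(arr: pa.ChunkedArray, limit: int = 2**31 - 1) -> bool:
    # For a pa.string() array, buffers() returns
    # [validity bitmap, int32 offsets, character data]; if the character
    # data alone approaches 2**31 - 1 bytes, combining or concatenating
    # the chunks overflows the 32-bit offsets and raises ArrowInvalid.
    total_data_bytes = 0
    for chunk in arr.chunks:
        data_buf = chunk.buffers()[2]  # value buffer of this chunk
        if data_buf is not None:
            # buffer sizes can overcount for sliced arrays, so this is a
            # conservative (over-)estimate
            total_data_bytes += data_buf.size
    return total_data_bytes >= limit

# Cast only when needed, instead of try/except around combine_chunks:
# if exceeds_string_offset_limit(arr):
#     arr = arr.cast(pa.large_string())
```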
David Li / @lidavidm: In general the large_X vs X type handling could use some consideration, I think
Hi @asfimport, I would like to work on this issue. Could you please assign it to me?
Could you add a comment that contains only "take", like #33849 (comment), here?
Current workaround I've developed for vaex in general with this pyarrow-related error, for dataframes where the technique mentioned above does not work (materialisation of a pandas array from a joined multi-file data frame where I was unable to set the arrow data type on the column):
take
@leprechaunt33 I was working on a PR in vaex to remove the use of See the related issue referenced in the PR for more details
@Ben-Epstein thanks for the heads up. I'm still digesting the code. In my use case this workaround had been successful but only fully after ensuring the iterator was always chunked. |
@leprechaunt33 thanks for checking back in. Since this same issue happened in huggingface datasets, it would make me believe that it really is the .take explosion. That's certainly an interesting workaround, but not one I can use in production :)
@Ben-Epstein yup, it's really only a workaround that's applicable to cases where you need the data in memory after filtering, and since vaex is meant to be lazy execution / memory map as required, a lot of folks will be encountering the problem in other contexts. It's certainly possible it's a take explosion; the context in this case is that I'm working with a filterset resulting from a df.take on an already potentially filtered set of indices of rows identified as having matching report ids, due to the lack of indexing and a many-to-one record relationship between two tables that prevents a join from being used.

The columns failing are free-text reports relating to leiomyosarcoma cases, so there are fewer than 100 scattered throughout this table of >2M reports that get filtered via a regex query on a MedDRA term. It's possible the take is being multiplied across different tables/arrays from the different hdf5 files in the dataset, multiplied by the separate chunks of those files, and just creating a polynomial complexity, but I'm not familiar enough yet with the vaex internals to confirm that. As you figured out there, the DataFrame take vs the arrow take and the code complexity make it a little challenging to debug. I'll be able to look more under the hood of what's going on in a couple of days.
I managed to get to this a bit earlier and it definitely looks like it's a take issue. I printed out the indices and sizes of data structures for calls to pyarrow take, and I see this for the column in question (the details are sys.getsizeof of the data, len(indices), and the indices themselves). The first of the 3 above is the column without using an iterator; the second and third are using an iterator with chunk size (number of rows) 10.

If I'm reading things right here, it is the fact that the data structure being taken from has hit the limit, not the size of the data being collated. To be clear, it's not the size of the filtered dataset but the size of the unfiltered data set, or chunk thereof, that is being taken from to obtain the filtered rows. I also ran a memory_profile on the same code and determined that the iterator and to_pandas_df both cause allocation of almost identical amounts of memory (around 165MB, hardly a memory issue).

@Ben-Epstein I just did a double take on those numbers; that looks like a little more than garbage collection above the 2GB limit, so it could be some kind of segmentation issue with indexing the string chunks that close to the 2GB limit. Not sure whether it would be necessary or not to limit chunk size allocation in vaex to 2GB along with the slice to fully fix the problem... something to consult the code about I guess :) Will watch with interest.
I'll ping @maartenbreddels to get a deeper opinion.
This at least hopefully gives a minimal reproducible example for testing that is database independent, and I assume this would happen with any hdf5 with chunks that big. Create an hdf5 with a single text field of arbitrary lengths that totals at least over 2GB, and then try to combine chunks from a limited number of rows spaced evenly from the start to the end of that text field. You should be able to trigger the exception that way.
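For reference, something along those lines can be sketched in pyarrow alone, without hdf5 or vaex; the sizes below are assumptions chosen to just cross the 2 GB mark, and the failure is the one reported in this thread:

```python
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# Build a chunked string column whose character data totals ~2.25 GB.
# Reusing the same chunk keeps actual memory use near 250 MB, but the
# combined column would need int32 offsets past 2**31 - 1 bytes.
chunk = pa.array(["x" * 10_000] * 25_000)   # ~250 MB of character data
col = pa.chunked_array([chunk] * 9)         # ~2.25 GB in total

# Take a handful of rows spaced evenly from the start to the end of the column.
indices = pa.array(np.linspace(0, len(col) - 1, 16).astype(np.int64))

# In the Arrow versions discussed in this thread, take on a ChunkedArray
# concatenates the chunks first, so this is expected to raise
# ArrowInvalid: offset overflow while concatenating arrays.
taken = pc.take(col, indices)
print(taken)
```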
I think there are 2 issues being reported here, right?
I believe number 2 was already reported before in a different issue.
@maartenbreddels it seems from my testing that this may be a third issue also related to take, which only occurs when vaex is forced to do a df.take on rows which contain a string column whose unfiltered in-memory representation is larger than 2GB. For example, I have been able to consistently reproduce the bug with the leiomyosarcoma data set (16 indices above) but not for juvenile polymyositis, which generates indices at the start of the data set (indices 0, 44726, 225143), and consequently the full 2GB of data is not memory mapped.

Wherever I've been able to consistently reproduce the problem, vaex has been attempting a take (df.take in this case, which turns into a column take) on high row indices, which has caused vaex to read in more than 2GB of data, and the exception occurs precisely after the lazy execution attempts to take from that large memory-mapped column. I have not noted a speed issue or memory explosion in these cases (testing on the 16-index case above only allocated 165MB), but the ArrowInvalid is consistently yielded.

It should be fairly simple to test this by generating an hdf5 data set with arbitrary-length string columns of at least 2GB+ in total combined length and doing a df.take on high indices, or on a small number of indices which span the column. The error is then yielded when attempting to access the data in that column (for example for conversion to pandas).

Edit: to be clear, the size of the data being combined does not seem to matter in these instances, but rather the index within the column. It could be tangentially related to the original issue, but attempting the explicit column type conversion does not resolve the problem; only using an iterator to ensure the filtered chunks are less than 2GB resolves it.

Edit again: on further reflection the root cause is similar, as it's failing in the _compute/take function and the cause is vaex being convinced to generate a chunked array of that size in order to take from it. But there's no large data set involved; the chunked array is generated indirectly and is combining small (typically 0-5000 character) strings.
Looking at the API, it seems the simplest way of not triggering it to read in such large chunks in the meantime is going to be to take the nbytes of the unfiltered data frame and split the filtered one up into suitable chunk sizes (shl 31 plus one) to operate on, assuming sufficient rows; shl 30 plus one would probably be most optimal. Or some other more clever use of the into parameter where the take indices and unfiltered length are known.
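A rough pyarrow-level sketch of the same idea (a hypothetical helper, not vaex code): group the take indices by the chunk they fall into, so Arrow never has to concatenate the full >2GB column:

```python
import bisect
from collections import defaultdict

import pyarrow as pa

def per_chunk_take(col: pa.ChunkedArray, indices) -> pa.ChunkedArray:
    # Cumulative start offset of each chunk within the column.
    starts = [0]
    for chunk in col.chunks:
        starts.append(starts[-1] + len(chunk))
    # Group the global row indices by the chunk they fall into.
    by_chunk = defaultdict(list)
    for i in sorted(indices):
        c = bisect.bisect_right(starts, i) - 1
        by_chunk[c].append(i - starts[c])
    # Take from each chunk separately; rows come back in column order,
    # and no single array ever exceeds the 32-bit offset limit.
    pieces = [col.chunks[c].take(pa.array(local)) for c, local in sorted(by_chunk.items())]
    return pa.chunked_array(pieces, type=col.type)
```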
That sounds like #25822
@wjones127 yup, that looks a lot more like the bug in question! Perhaps my confusion here has come from the fact that the same exception is being thrown for multiple different reasons (all of them likely connected to the 32-bit limitations). Is this just a take kernel thing as the title suggests, or would it also impact slicing (which is mentioned in the PR above as a zero-copy alternative)? It looked to me very much like a 32-bit pointer limitation issue.
Yes, it essentially is. We use 32-bit offsets for the default string and binary types. This means you always have to be careful when generating the arrays to chunk them if they get too large, or otherwise use the large types. Basically the bug with take and concatenation is that we aren't automatically handling the chunking for you. But you are right that essentially the same issue shows up in multiple places.
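Concretely, the two options described here might look like this (a sketch, where col stands in for a large chunked string column and process is a hypothetical per-chunk function):

```python
import pyarrow as pa

def process(chunk: pa.Array) -> None:
    ...  # hypothetical per-chunk work

# col stands in for a chunked string column; in practice it would hold
# more than 2 GB of character data.
col = pa.chunked_array([pa.array(["some", "long", "strings"])])

# Option 1: switch to 64-bit offsets.  large_string (and large_binary)
# store offsets as int64, so a single array can hold far more than 2 GB
# of character data and combine_chunks no longer overflows.
large = col.cast(pa.large_string())
combined = large.combine_chunks()

# Option 2: stay on the default string type, but never materialize a
# single array past 2**31 - 1 bytes of character data -- keep the data
# chunked and operate chunk by chunk.
for chunk in col.chunks:
    process(chunk)
```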
When working with medium-sized datasets that have very long strings, arrow fails when trying to operate on the strings. The root is the combine_chunks function. Here is a minimally reproducible example.

I get the following error:

ArrowInvalid: offset overflow while concatenating arrays

With smaller strings or smaller arrays this works fine.

The first example that fails takes a few minutes to run. If you'd like a faster example for experimentation, you can use vaex to generate the chunked array much faster. This will throw the identical error and will run in about 1 second.

Reporter: Ben Epstein
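The original snippet is not preserved above; a minimal sketch of the kind of example described (very long strings in a chunked array, then combine_chunks), with sizes chosen as assumptions to cross the 2 GB offset limit, could be:

```python
import pyarrow as pa

# One chunk of ~250 MB of character data, repeated so the column totals
# ~2.5 GB.  Reusing the same chunk keeps memory use low, but combining
# the chunks must build one contiguous string array whose int32 offsets
# cannot address more than 2**31 - 1 bytes.
row = "a" * 100_000
chunk = pa.array([row] * 2_500)
col = pa.chunked_array([chunk] * 10)

# Raises ArrowInvalid: offset overflow while concatenating arrays
col.combine_chunks()
```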
Related issues:
Note: This issue was originally created as ARROW-17828. Please see the migration documentation for further details.