Thank you for reaching out and helping us improve Vaex!
Before you submit a new Issue, please read through the documentation. Also, make sure you search through the Open and Closed Issues - your problem may already be discussed or addressed.
Description
Please provide a clear and concise description of the problem. This should contain all the steps needed to reproduce the problem. A minimal code example that exposes the problem is very appreciated.
Software information
Vaex version (import vaex; vaex.__version__): 4.16.1
Vaex was installed via: pip
Additional information
This likely has to do with https://issues.apache.org/jira/browse/ARROW-17828 and is probably related to #2334.
Part 1
We create a dataframe whose text column holds more character data than the maximum size of an Arrow string column. We write it to an Arrow file, and that works fine. Note that if we wrote this to an HDF5 file without converting to a large string, we would get an error; see #2334.
```python
import vaex
import numpy as np

# from https://issues.apache.org/jira/browse/ARROW-17828
# arrow string maxsize is 2GB. Then you need large string
maxsize_string = 2e9

df = vaex.example()
df["text"] = vaex.vconstant("OHYEA" * 2000, len(df))
df["id"] = np.array(list(range(len(df))))
assert df.text.str.len().sum() > maxsize_string
df[["x", "y", "z", "text", "id"]].export("part1.arrow")
```
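For context (not part of the original report): the 2 GB limit exists because Arrow's `string` type stores its offsets as 32-bit integers, while `large_string` uses 64-bit offsets. A minimal pyarrow sketch of the difference:

```python
import pyarrow as pa

# Arrow's `string` type addresses its character data with int32 offsets, so a
# single array can hold at most ~2**31 bytes (~2 GB) of text in total.
# `large_string` uses int64 offsets and is not subject to that limit.
arr = pa.array(["OHYEA" * 2000] * 3, type=pa.string())
big = arr.cast(pa.large_string())

print(arr.type)   # string
print(big.type)   # large_string
```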
Part 2
If we join this dataframe with another, we see that there are no issues
```python
# Works fine
newdf = vaex.example()[["id"]]
newdf["col1"] = vaex.vconstant("foo", len(newdf))
join_df = newdf.join(df, on="id")
join_df
```
Part 3 (issue)
But if we instead read that file from disk, where it is memory-mapped, joining it to another dataframe causes an issue:
```python
# Reading from disk will break the join
newdf2 = vaex.example()[["id"]]
newdf3 = vaex.open("part1.arrow")
newdf2["col1"] = vaex.vconstant("foo", len(newdf2))
join_df2 = newdf2.join(newdf3, on="id")
join_df2
```
This throws the error:
```
  return call_function('take', [data, indices], options, memory_pool)
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```
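As a possible workaround (an untested sketch, not from the original report): read the exported file eagerly with pyarrow, cast the oversized column to `large_string`, and hand the table to vaex, so the join never has to concatenate int32-offset string chunks:

```python
import pyarrow as pa
import pyarrow.feather as feather
import vaex

# Untested workaround sketch: load the exported file with pyarrow instead of
# memory-mapping it through vaex.open, cast the big text column to
# large_string (int64 offsets), and build the dataframe from the table.
# The cost is that the data is loaded into memory rather than memory-mapped.
table = feather.read_table("part1.arrow")
idx = table.schema.get_field_index("text")
table = table.set_column(idx, "text", table["text"].cast(pa.large_string()))
newdf3 = vaex.from_arrow_table(table)

# The join from Part 3 should then no longer hit the int32 offset overflow.
```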
Related issues
This is, I think, the root cause: https://issues.apache.org/jira/browse/ARROW-9773
This is certainly related: https://issues.apache.org/jira/browse/ARROW-17828
More usefully, I think the core of the issue is the `.take` used here. The reason I think that is this closely related issue in the huggingface datasets repo, huggingface/datasets#615, whose fix simply removes the use of `take`: huggingface/datasets#645
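To illustrate the `take` angle, here is a standalone pyarrow sketch (not from the report; it needs several GB of RAM, and whether it still raises depends on the pyarrow version, per ARROW-9773). `take` on a chunked `string` array concatenates the chunks first, and once the combined character data exceeds 2 GB the int32 offsets overflow, while `large_string` does not:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Build a chunked `string` array whose total character data exceeds 2**31
# bytes: each chunk holds ~1.1 GB of text, two chunks together ~2.2 GB.
chunk = pa.array(["x" * 1_000_000] * 1100)
chunked = pa.chunked_array([chunk, chunk])

try:
    # Depending on the pyarrow version, this concatenates the chunks and
    # overflows the int32 offsets, mirroring the traceback above.
    pc.take(chunked, pa.array([0, 1]))
except pa.ArrowInvalid as exc:
    print(exc)  # offset overflow while concatenating arrays

# With int64 offsets (large_string) the same take succeeds.
pc.take(chunked.cast(pa.large_string()), pa.array([0, 1]))
```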