
[BUG-REPORT] Offset Overflow when joining dataframes with large strings #2335

Ben-Epstein (Contributor) opened this issue Feb 9, 2023


Software information

  • Vaex version (import vaex; vaex.__version__): 4.16.1
  • Vaex was installed via: pip
  • OS: mac/linux

Additional information
This likely has to do with https://issues.apache.org/jira/browse/ARROW-17828 and is probably related to #2334

Part 1

We create a dataframe whose text column exceeds the maximum total size of an Arrow string array. We write it to an arrow file, and that works fine. Note that if we wrote this to an hdf5 file without converting to a large string, we'd get an error; see #2334.

import vaex
import numpy as np

# from https://issues.apache.org/jira/browse/ARROW-17828
# arrow string max size is 2GB; beyond that you need large_string
maxsize_string = 2e9

df = vaex.example()
df["text"] = vaex.vconstant("OHYEA"*2000, len(df))
df["id"] = np.array(list(range(len(df))))

assert df.text.str.len().sum() > maxsize_string

df[["x", "y", "z", "text", "id"]].export("part1.arrow")

Part 2

If we join this dataframe with another in-memory dataframe, there are no issues:

# Works fine
newdf = vaex.example()[["id"]]
newdf["col1"] = vaex.vconstant("foo", len(newdf))

join_df = newdf.join(df, on="id")
join_df

(screenshot of the joined dataframe output)

Part 3 (issue)

But if we instead read that file back from disk, where it is memory mapped, joining it to another dataframe causes an issue:

# Reading from disk will break the join
newdf2 = vaex.example()[["id"]]
newdf3 = vaex.open("part1.arrow")

newdf2["col1"] = vaex.vconstant("foo", len(newdf))

join_df2 = newdf2.join(newdf3, on="id")
join_df2

This throws the error:

    return call_function('take', [data, indices], options, memory_pool)
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
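
As a quick sanity check (hypothetical, not part of the original report), one can inspect what Arrow type the text column comes back with from the memory-mapped file, assuming vaex's to_arrow_table:

import vaex

# Hypothetical check: if the column comes back as plain `string` (32-bit
# offsets) rather than `large_string`, the concatenation pyarrow performs
# during the join's take can overflow.
newdf3 = vaex.open("part1.arrow")
table = newdf3.to_arrow_table(["text"])
print(table.schema.field("text").type)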

Related issues

This is, I think, the root cause: https://issues.apache.org/jira/browse/ARROW-9773
This is certainly related: https://issues.apache.org/jira/browse/ARROW-17828

More usefully, I think the core of the issue is actually the .take used here:

  File ".venv/lib/python3.9/site-packages/pyarrow/compute.py", line 470, in take
    return call_function('take', [data, indices], options, memory_pool)

The reason I think that is this closely related issue from the huggingface datasets repo, huggingface/datasets#615, and the fix they made (huggingface/datasets#645), which simply removes the use of take.
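
To illustrate the idea (a hedged sketch, not the fix from those PRs): since take on a chunked string column concatenates chunks and the combined 32-bit offsets can overflow, casting to large_string before the take is one way the overflow could be avoided.

import pyarrow as pa
import pyarrow.compute as pc

# Sketch only: take() on a chunked `string` array concatenates chunks, so the
# combined offsets must fit in 32 bits. Casting the chunked array to
# `large_string` (64-bit offsets) first sidesteps that limit.
chunked = pa.chunked_array([["OHYEA" * 2000] * 1000] * 2, type=pa.string())
indices = pa.array(range(len(chunked)))
result = pc.take(chunked.cast(pa.large_string()), indices)
print(result.type)   # -> large_string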
