
[BUG-REPORT] Offset Overflow when joining dataframes with large strings #2335

Ben-Epstein (Contributor) opened this issue Feb 9, 2023


Software information

  • Vaex version (import vaex; vaex.__version__): 4.16.1
  • Vaex was installed via: pip
  • OS: mac/linux

Additional information
This likely has to do with https://issues.apache.org/jira/browse/ARROW-17828 and is probably related to #2334

Part 1

We create a dataframe whose text column exceeds the maximum total size of an Arrow string array. We write it to an arrow file, and that works fine. Note that if we wrote this to an hdf5 file without converting to a large string, we'd get an error; see #2334.

import vaex
import numpy as np

# from https://issues.apache.org/jira/browse/ARROW-17828
# arrow string max size is 2GB; beyond that you need large_string
maxsize_string = 2e9

df = vaex.example()
df["text"] = vaex.vconstant("OHYEA"*2000, len(df))
df["id"] = np.array(list(range(len(df))))

assert df.text.str.len().sum() > maxsize_string

df[["x", "y", "z", "text", "id"]].export("part1.arrow")

Part 2

If we join this dataframe with another in-memory dataframe, there are no issues:

# Works fine
newdf = vaex.example()[["id"]]
newdf["col1"] = vaex.vconstant("foo", len(newdf))

join_df = newdf.join(df, on="id")
join_df

(screenshot of the joined dataframe output)

Part 3 (issue)

But if we instead read that file back from disk, where it is memory mapped, joining it to another dataframe causes an issue:

# Reading from disk will break the join
newdf2 = vaex.example()[["id"]]
newdf3 = vaex.open("part1.arrow")

newdf2["col1"] = vaex.vconstant("foo", len(newdf))

join_df2 = newdf2.join(newdf3, on="id")
join_df2

This throws the error:

    return call_function('take', [data, indices], options, memory_pool)
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
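
As a quick sanity check (hypothetical, not part of the original report), one can inspect what Arrow type the text column comes back with from the memory-mapped file, assuming vaex's to_arrow_table:

import vaex

# Hypothetical check: if the column comes back as plain `string` (32-bit
# offsets) rather than `large_string`, the concatenation pyarrow performs
# during the join's take can overflow.
newdf3 = vaex.open("part1.arrow")
table = newdf3.to_arrow_table(["text"])
print(table.schema.field("text").type)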

Related issues

This is, I think, the root cause: https://issues.apache.org/jira/browse/ARROW-9773
This is certainly related: https://issues.apache.org/jira/browse/ARROW-17828

More usefully, I think the core of the issue is actually the .take used here:

  File ".venv/lib/python3.9/site-packages/pyarrow/compute.py", line 470, in take
    return call_function('take', [data, indices], options, memory_pool)

The reason I think that is this closely related issue from the huggingface datasets repo, huggingface/datasets#615, and the fix they made (huggingface/datasets#645), which simply removes the use of take.
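
To illustrate the idea (a hedged sketch, not the fix from those PRs): since take on a chunked string column concatenates chunks and the combined 32-bit offsets can overflow, casting to large_string before the take is one way the overflow could be avoided.

import pyarrow as pa
import pyarrow.compute as pc

# Sketch only: take() on a chunked `string` array concatenates chunks, so the
# combined offsets must fit in 32 bits. Casting the chunked array to
# `large_string` (64-bit offsets) first sidesteps that limit.
chunked = pa.chunked_array([["OHYEA" * 2000] * 1000] * 2, type=pa.string())
indices = pa.array(range(len(chunked)))
result = pc.take(chunked.cast(pa.large_string()), indices)
print(result.type)   # -> large_string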
