-
Notifications
You must be signed in to change notification settings - Fork 3k
Fix nested data conversions error in parquet loading (fixes #7793) #7794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix nested data conversions error in parquet loading (fixes #7793) #7794
Conversation
- Add fallback mechanism for PyArrow's 'Nested data conversions not implemented for chunked array outputs' error - Implement selective chunk combining for nested data types (lists, structs, maps, unions) - Maintain backward compatibility with non-nested datasets - Fixes issue huggingface#7793 The error occurs when parquet files contain deeply nested data structures that exceed PyArrow's 16MB chunk limit. The fix detects this specific error and falls back to reading the full table, then applying selective chunk combining only for problematic nested columns before manually splitting into batches.
Unfortunately, I'm running into this error:
|
Also the gated dataset has automatic approval so you should feel free to sign in and test if you'd like! |
hi @neevparikh I've updated the fix based on your feedback. The new approach uses row group reading as a fallback when both to_batches() and to_table() fail. I've successfully tested it with an actual file from your dataset and it loads correctly. Could you test the updated version? |
Now we're failing with this error:
|
it seems to me that we dropped the ones we couldn't read? |
@Aishwarya0811 let me know if there's helpful things here I can do? |
Fixes #7793
Problem
Loading datasets with deeply nested structures (like
metr-evals/malt-public
) fails with:ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
This occurs when parquet files contain nested data (lists, structs, maps) that exceed PyArrow's 16MB chunk limit.
Root Cause
PyArrow's C++ implementation explicitly rejects nested data conversions when data is split across multiple chunks. The limitation exists in the
WrapIntoListArray
function where repetition levels cannot be reconstructed across chunk boundaries.Solution
Implementation Details
_is_nested_type()
helper to detect nested PyArrow types_handle_nested_chunked_conversion()
for selective chunk combining_generate_tables()
to catch and handle the specific errorTesting
Note: This fix is based on thorough research of PyArrow limitations and similar issues in the ecosystem. While we cannot test with the original dataset due to access restrictions, the implementation follows established patterns for handling this PyArrow limitation.
Request for Testing
@neevparikh Could you please test this fix with your original failing dataset? The implementation should resolve the nested data conversion error you encountered.