[Data] Support serializing zero-length numpy arrays #57858
Conversation
Signed-off-by: Chris O'Hara <[email protected]>
Code Review
This pull request correctly addresses the serialization issue with zero-length numpy arrays by adding a special case for when num_items_per_element is zero. The fix is clean and directly solves the ValueError from np.arange. The addition of a new test case ensures this scenario is covered going forward. I've suggested a small improvement to the new test to make it more robust by parameterizing it with various zero-length array shapes.
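The review's suggestion to cover several zero-length array shapes can be sketched with plain numpy (the function name and signature here are hypothetical stand-ins, not Ray's actual internals): a helper that special-cases `num_items_per_element == 0` when building offsets, exercised over a few empty shapes.

```python
import numpy as np


def tensor_offsets(outer_len: int, num_items_per_element: int) -> np.ndarray:
    """Hypothetical stand-in for the patched offset computation."""
    if num_items_per_element == 0:
        # Special case: every element is empty, so every offset is 0.
        return np.zeros(outer_len + 1, dtype=np.int64)
    return np.arange(
        0,
        (outer_len + 1) * num_items_per_element,
        num_items_per_element,
        dtype=np.int64,
    )


# Exercise a few zero-length shapes, as the review suggests.
for shape in [(0,), (2, 0), (0, 3), (4, 0, 5)]:
    num_items = int(np.prod(shape))
    offsets = tensor_offsets(3, num_items)  # a batch of 3 elements
    assert offsets.tolist() == [0, 0, 0, 0]

# Sanity check that the non-empty path still behaves normally.
assert tensor_offsets(2, 5).tolist() == [0, 5, 10]
```

Parameterizing over shapes like these catches both trailing-zero and interior-zero dimensions, which all reduce to the same zero-items case.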
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Chris O'Hara <[email protected]>
## Description
Ray Data can't serialize zero-byte-length numpy arrays:
```python
import numpy as np
import ray.data

array = np.empty((2, 0), dtype=np.int8)
ds = ray.data.from_items([{"array": array}])
for batch in ds.iter_batches(batch_size=1):
    print(batch)
```
What I expect to see:
```
{'array': array([], shape=(1, 2, 0), dtype=int8)}
```
What I see:
```
/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py:736: RuntimeWarning: invalid value encountered in scalar divide
offsets = np.arange(
2025-10-17 17:18:09,499 WARNING arrow.py:189 -- Failed to convert column 'array' into pyarrow array due to: Error converting data to Arrow: column: 'array', shape: (1, 2, 0), dtype: int8, data: []; falling back to serialize as pickled python objects
Traceback (most recent call last):
File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 672, in from_numpy
return cls._from_numpy(arr)
^^^^^^^^^^^^^^^^^^^^
File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 736, in _from_numpy
offsets = np.arange(
^^^^^^^^^^
ValueError: arange: cannot compute length
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 141, in convert_to_pyarrow_array
return ArrowTensorArray.from_numpy(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 678, in from_numpy
raise ArrowConversionError(data_str) from e
ray.air.util.tensor_extensions.arrow.ArrowConversionError: Error converting data to Arrow: column: 'array', shape: (1, 2, 0), dtype: int8, data: []
2025-10-17 17:18:09,789 INFO logging.py:293 -- Registered dataset logger for dataset dataset_0_0
2025-10-17 17:18:09,815 WARNING resource_manager.py:134 -- ⚠️ Ray's object store is configured to use only 33.5% of available memory (2.0GiB out of 6.0GiB total). For optimal Ray Data performance, we recommend setting the object store to at least 50% of available memory. You can do this by setting the 'object_store_memory' parameter when calling ray.init() or by setting the RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION environment variable.
{'array': array([array([], shape=(2, 0), dtype=int8)], dtype=object)}
```
This PR fixes the issue so that zero-length arrays are serialized
correctly, and the shape and dtype are preserved.
## Additional information
This is `ray==2.50.0`.
---------
Signed-off-by: Chris O'Hara <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Kamil Kaczmarek <[email protected]>
Signed-off-by: xgui <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>