[Java] Handle offset field from ArrowArray when BufferImportTypeVisitor imports offset buffer #74

@viirya

Description

Describe the bug, including details regarding any error messages, version, and platform.

This bug was found while debugging apache/datafusion-comet#540.

We found that the offsets of some string arrays fall outside the range of their value buffer. For example, a string array's value buffer is only 147456 bytes, but the offsets of its last string are (294894, 294912).

The string array is output by the DataFusion aggregation operator AggregateExec. When producing an output batch, the operator may slice the batch if it is larger than a configured size. Slicing a string array in arrow-rs keeps the original value buffer and moves the pointer of the offset buffer, so the slice is zero-copy.
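To make the slicing behaviour concrete, here is a minimal plain-Java sketch of a zero-copy string slice. It does not use arrow-rs or Arrow Java; the class `ZeroCopySliceDemo` and its fields are hypothetical names chosen for illustration. The point it shows is that a slice shares the value buffer and only shifts where reading of the offsets starts, so the slice's offsets stay absolute positions into the full value buffer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical model of a variable-width string array view (not an Arrow class).
final class ZeroCopySliceDemo {
    final byte[] values;  // shared value buffer, never copied by slice()
    final int[] offsets;  // shared offset buffer
    final int start;      // index of the first offset visible to this view
    final int length;     // number of strings in this view

    ZeroCopySliceDemo(byte[] values, int[] offsets, int start, int length) {
        this.values = values;
        this.offsets = offsets;
        this.start = start;
        this.length = length;
    }

    // Build a full array from strings: offsets[i] is where string i begins.
    static ZeroCopySliceDemo of(String... strings) {
        int[] offsets = new int[strings.length + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < strings.length; i++) {
            byte[] b = strings[i].getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
            offsets[i + 1] = out.size();
        }
        return new ZeroCopySliceDemo(out.toByteArray(), offsets, 0, strings.length);
    }

    // Zero-copy slice: same buffers, only the starting offset index moves.
    ZeroCopySliceDemo slice(int from, int count) {
        return new ZeroCopySliceDemo(values, offsets, start + from, count);
    }

    int firstOffset() { return offsets[start]; }

    int lastOffset() { return offsets[start + length]; }

    String get(int i) {
        int begin = offsets[start + i];
        int end = offsets[start + i + 1];
        return new String(values, begin, end - begin, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ZeroCopySliceDemo full = of("aa", "bbb", "c", "dddd");
        ZeroCopySliceDemo slice = full.slice(2, 2);
        // The slice's offsets are still absolute: 5 and 10, not 0 and 5.
        System.out.println(slice.firstOffset() + " " + slice.lastOffset());
        // The value buffer is the full one, longer than lastOffset - firstOffset.
        System.out.println(slice.values.length);
    }
}
```

In this toy case the slice spans offsets 5 to 10 of a 10-byte buffer, so "last minus first" (5) understates the buffer size, which is the same shape of mismatch described in this issue.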

During the importOffsets call in BufferImportTypeVisitor, the sliced offsets are imported correctly, because the call uses the moved pointer and computes the offset buffer correctly.

But when it imports the value buffer, it computes the buffer's capacity as the difference between the last and first imported offsets. Because the imported offsets come from the slice, the computed capacity covers only that slice of the value buffer.

For example, the original string array's value buffer is 346536 bytes and its last offset is 346536. We take a slice of 8192 strings from it. The slice's last offset is 294912, but its value buffer is unchanged (346536 bytes).

When BufferImportTypeVisitor imports the slice, the imported offsets are [147456, ..., 294912]. It computes the length of the value buffer as 294912 - 147456 = 147456, but the actual length of the value buffer is 346536.

The offsets are therefore out of range for the incorrectly sized 147456-byte value buffer.
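The arithmetic above can be checked with a small plain-Java sketch. It uses no Arrow classes; `SlicedValueCapacityDemo` and its methods are hypothetical names, and the offsets array is abbreviated to the first offset and the last two offsets quoted in this report rather than all 8193 entries.

```java
// Hypothetical illustration of the capacity miscalculation (not Arrow code).
final class SlicedValueCapacityDemo {
    // Capacity as described in this issue: last imported offset minus first.
    static long capacityFromOffsetDifference(int[] offsets) {
        return (long) offsets[offsets.length - 1] - offsets[0];
    }

    // True when every offset is a valid byte position within the capacity.
    static boolean offsetsFit(int[] offsets, long capacity) {
        for (int offset : offsets) {
            if (offset < 0 || offset > capacity) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Abbreviated slice offsets: first offset, then the last string's range.
        int[] sliceOffsets = {147456, 294894, 294912};
        long computed = capacityFromOffsetDifference(sliceOffsets);
        long actual = 346536L; // size of the shared value buffer in the report

        System.out.println(computed);                           // 147456
        System.out.println(offsetsFit(sliceOffsets, computed)); // false
        System.out.println(offsetsFit(sliceOffsets, actual));   // true
    }
}
```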

To be clear, the root cause is apache/arrow-rs#5896: arrow-rs exports the moved pointer of the offset buffer for a slice of a string array. Instead, it should use the offset field in ArrowArray. We are going to fix that in arrow-rs.
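As a rough sketch of what honoring the offset field could look like on the importing side: the exporter leaves the offset buffer pointer at the start of the buffer and reports the slice position in `offset`, and the importer indexes the offset buffer from there. This is plain Java with hypothetical names (`OffsetFieldSketch`, `requiredValueBytes`, `byteRange`); it is not the Arrow Java API and not necessarily how the fix is or will be implemented.

```java
// Hypothetical importer-side arithmetic when ArrowArray.offset is honored.
final class OffsetFieldSketch {
    // Smallest value-buffer size (bytes) that covers every string of the
    // logical array: the end offset of its last element.
    static long requiredValueBytes(int[] offsetBuffer, int arrayOffset, int length) {
        return offsetBuffer[arrayOffset + length];
    }

    // Byte range [begin, end) of logical element i within the value buffer.
    static int[] byteRange(int[] offsetBuffer, int arrayOffset, int i) {
        return new int[] {
            offsetBuffer[arrayOffset + i],
            offsetBuffer[arrayOffset + i + 1]
        };
    }

    public static void main(String[] args) {
        int[] offsetBuffer = {0, 2, 5, 6, 10}; // full, unshifted offset buffer
        int arrayOffset = 2;                   // assumed ArrowArray.offset of the slice
        int length = 2;                        // assumed ArrowArray.length of the slice

        System.out.println(requiredValueBytes(offsetBuffer, arrayOffset, length)); // 10
        int[] range = byteRange(offsetBuffer, arrayOffset, 0);
        System.out.println(range[0] + ".." + range[1]);                            // 5..6
    }
}
```

Sizing the value buffer by the last offset rather than by the difference of the last and first offsets keeps every absolute offset of the slice in range, at the cost of importing bytes that precede the slice.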

However, it seems that neither ArrayImporter nor BufferImportTypeVisitor takes the offset field in ArrowArray into account either.

Component(s)

Java
