Use Typed Buffers in Arrays (#1811) (#1176) #3743
Conversation
@@ -162,12 +183,15 @@ impl Buffer {
     /// Panics iff `(offset + length)` is larger than the existing length.
     pub fn slice_with_length(&self, offset: usize, length: usize) -> Self {
         assert!(
-            offset + length <= self.len(),
+            offset.saturating_add(length) <= self.length,
This was actually a bug, now covered by https://github.com/apache/arrow-rs/pull/3743/files#diff-24dc7184f64fe7a484db137ef61b2ec31090bb959fc9013beefea7865cdfea59R628
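For context, here is a minimal standalone sketch (hypothetical function names, not the arrow-rs code) of the overflow the saturating add guards against: with overflow checks disabled, `offset + length` can wrap around and still satisfy the bound check, while the saturating form clamps at `usize::MAX` and correctly rejects the slice.

```rust
/// Hypothetical bound checks, only to illustrate the overflow scenario.
fn passes_wrapping_check(buffer_len: usize, offset: usize, length: usize) -> bool {
    // With overflow checks disabled, `offset + length` behaves like a wrapping add.
    offset.wrapping_add(length) <= buffer_len
}

fn passes_saturating_check(buffer_len: usize, offset: usize, length: usize) -> bool {
    offset.saturating_add(length) <= buffer_len
}

fn main() {
    // offset = usize::MAX, length = 2 wraps around to 1 and incorrectly passes
    assert!(passes_wrapping_check(8, usize::MAX, 2));
    // the saturating form clamps to usize::MAX and correctly rejects it
    assert!(!passes_saturating_check(8, usize::MAX, 2));
}
```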
@@ -603,7 +603,7 @@ mod tests {
         let record_batch =
             RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a), Arc::new(b)])
                 .unwrap();
-        assert_eq!(record_batch.get_array_memory_size(), 592);
+        assert_eq!(record_batch.get_array_memory_size(), 640);
There is a slight regression in the size of the arrays themselves. They are typically passed around as references, so I am inclined to think this doesn't really matter; it is also likely temporary.
        unsafe {
            std::slice::from_raw_parts(
                self.buffer.as_ptr() as *const T,
                self.buffer.len() / std::mem::size_of::<T>(),
I found that caching this value doesn't impact performance. This is likely because the "hot" codepaths either iterate the slice, amortising this cost over multiple values, or use unchecked addressing, where this length isn't ever consulted.
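As a rough sketch of why that is (hypothetical `TypedBuffer` type, not the PR's `ScalarBuffer`): the division by `size_of::<T>()` is paid once per `as_slice` call, and the hot loops then iterate the returned slice, so the cost is amortised over every element.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

/// Simplified stand-in for a typed view over raw bytes.
struct TypedBuffer<T> {
    bytes: Vec<u8>,
    _marker: PhantomData<T>,
}

impl<T> TypedBuffer<T> {
    fn as_slice(&self) -> &[T] {
        // Safety: this sketch assumes the bytes are properly aligned and
        // initialised for `T` (the real buffer type enforces this invariant).
        unsafe {
            std::slice::from_raw_parts(
                self.bytes.as_ptr() as *const T,
                self.bytes.len() / size_of::<T>(), // recomputed on each call
            )
        }
    }
}

/// One length computation, then a tight loop over the slice.
fn sum(buffer: &TypedBuffer<i32>) -> i64 {
    buffer.as_slice().iter().map(|v| *v as i64).sum()
}
```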
@@ -637,7 +637,7 @@ pub fn new_null_array(data_type: &DataType, length: usize) -> ArrayRef {
 }

 // Helper function for printing potentially long arrays.
-pub(crate) fn print_long_array<A, F>(
+fn print_long_array<A, F>(
This is just a drive-by cleanup
    ///
    /// We store a pointer instead of an offset to avoid pointer arithmetic
    /// which causes LLVM to fail to vectorise code correctly
    ptr: *const u8,
This seemingly inconsequential change makes a substantial difference to the benchmarks. LLVM seems to fail to realise that the offset doesn't change in the body of the loop and so fails to lift it out, which in turn causes the loop itself to fail to vectorise correctly.
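A simplified sketch of the two shapes being contrasted (hypothetical view types, not the PR's actual structs); in the real code the offset lives behind a struct field that LLVM cannot prove loop-invariant, whereas storing the already-offset pointer removes that dependency entirely.

```rust
/// Offset-based view: every access re-derives `base + offset + i`.
struct OffsetView<'a> {
    data: &'a [i32],
    offset: usize,
}

impl<'a> OffsetView<'a> {
    fn value(&self, i: usize) -> i32 {
        self.data[self.offset + i]
    }
}

/// Pointer-based view: the slice is positioned at the logical start once,
/// at construction time, so the loop body is a plain sequential load.
struct PointerView<'a> {
    values: &'a [i32],
}

impl<'a> PointerView<'a> {
    fn value(&self, i: usize) -> i32 {
        self.values[i]
    }
}

fn sum_offset(view: &OffsetView, len: usize) -> i64 {
    (0..len).map(|i| view.value(i) as i64).sum()
}

fn sum_pointer(view: &PointerView, len: usize) -> i64 {
    (0..len).map(|i| view.value(i) as i64).sum()
}
```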
Thank you @tustvold for moving this forward! I plan to review this in the next few days.
        // Soundness
        // pointer alignment & location is ensured by RawPtrBox
        // buffer bounds/offset is ensured by the ArrayData instance.
        unsafe {
            std::slice::from_raw_parts(
                self.value_offsets.as_ptr().add(self.data.offset()),
                self.len() + 1,
            )
        }
Oh that's good. 👍
arrow-array/src/array/list_array.rs (outdated)
        let value_offsets = match data.is_empty() && data.buffers()[0].is_empty() {
            true => OffsetBuffer::new_empty(),
            false => {
                let buffer = ScalarBuffer::new(
                    data.buffers()[0].clone(),
                    data.offset(),
                    data.len() + 1,
                );
                // Safety:
                // ArrayData is valid
                unsafe { OffsetBuffer::new_unchecked(buffer) }
            }
This seems to occur more than once. Maybe we can have a function for it?
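One possible shape for such a helper, sketched below using the types already shown in this diff (the crate paths are assumptions; the unsafe `get_offsets` function quoted further down appears to be the form the PR settles on).

```rust
// Sketch only: assumes the ArrayData / ScalarBuffer / OffsetBuffer APIs
// used elsewhere in this diff; adjust the paths to the local module layout.
use arrow_buffer::{ArrowNativeType, OffsetBuffer, ScalarBuffer};
use arrow_data::ArrayData;

/// Builds the offsets buffer of a variable-length array from its ArrayData.
fn offsets_from_data<O: ArrowNativeType>(data: &ArrayData) -> OffsetBuffer<O> {
    match data.is_empty() && data.buffers()[0].is_empty() {
        true => OffsetBuffer::new_empty(),
        false => {
            let buffer = ScalarBuffer::new(
                data.buffers()[0].clone(),
                data.offset(),
                data.len() + 1,
            );
            // Safety: a valid ArrayData guarantees valid, monotonically
            // increasing offsets in its first buffer
            unsafe { OffsetBuffer::new_unchecked(buffer) }
        }
    }
}
```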
/// A non-empty buffer of monotonically increasing, positive integers
#[derive(Debug, Clone)]
pub struct OffsetBuffer<O: ArrowNativeType>(ScalarBuffer<O>);
Suggested change:
-pub struct OffsetBuffer<O: ArrowNativeType>(ScalarBuffer<O>);
+pub struct OffsetBuffer<O: OffsetSizeTrait>(ScalarBuffer<O>);
OffsetSizeTrait is defined in arrow-array and so can't be used here.
        Self { buffer, ptr, len }
        let byte_offset = offset.checked_mul(size).expect("offset overflow");
        let byte_len = len.checked_mul(size).expect("length overflow");
        buffer.slice_with_length(byte_offset, byte_len).into()
No need to check alignment anymore?
The alignment gets checked in the From conversion
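For reference, a hedged sketch of the kind of layout check such a conversion can perform (hypothetical free function, not the actual `From` impl): the pointer must be aligned for `T` and the byte length must be a whole number of elements.

```rust
/// Hypothetical illustration of an alignment and length check.
fn assert_valid_layout<T>(ptr: *const u8, byte_len: usize) {
    let align = std::mem::align_of::<T>();
    let size = std::mem::size_of::<T>();
    assert_eq!(
        ptr as usize % align,
        0,
        "buffer pointer is not aligned for the target type"
    );
    assert_eq!(
        byte_len % size,
        0,
        "buffer length is not a multiple of the element size"
    );
}

fn main() {
    let values: [u64; 2] = [1, 2];
    // a u64-aligned pointer with 16 bytes passes for T = u64
    assert_valid_layout::<u64>(values.as_ptr() as *const u8, 16);
}
```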
/// # Safety
///
/// - ArrayData must contain a valid [`OffsetBuffer`] as its first buffer
unsafe fn get_offsets<O: ArrowNativeType>(data: &ArrayData) -> OffsetBuffer<O> {
This will eventually get removed as we make ArrayData an enumeration of typed buffers
Benchmark runs are scheduled for baseline = ebe6f53 and contender = 47e4b61. 47e4b61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Part of #1176
Closes #1811
Rationale for this change
The first part of moving towards a fully typed ArrayData as part of #1176
As an added bonus, the change to precompute the offset for ByteArray improves the performance of the string comparison kernels by up to 25%. The arithmetic kernels show a very slight performance regression of ~1-2%; I'm inclined to think this does not matter, as we are talking microseconds here.
What changes are included in this PR?
Are there any user-facing changes?