
Conversation

@jorgecarleitao
Member

@jorgecarleitao jorgecarleitao commented Feb 14, 2021

The background and rationale for this are described here; the idea is that this is groundwork to make our buffers typed, so that we can start introducing strong typing in the crate.

This change is backward incompatible:

  1. Our allocator is now generic over a type `T: NativeType`, which means that we can now allocate memory for specific types.
  2. The allocator moved from `memory` to a new module, `alloc` (inspired by `std::alloc`).

Necessary steps to migrate existing code:

  1. `use arrow::memory` -> `use arrow::alloc`
  2. `memory::allocate_aligned(...)` -> `alloc::allocate_aligned::<u8>(...)`

Note how `NativeType` contains `to_le_bytes`; we will use this method for IPC, where we need to serialize buffers with a specific endianness. This is groundwork to enable support for multiple endiannesses.
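As a rough illustration, here is a minimal, self-contained sketch of the shape this describes — hypothetical names, not the crate's actual definitions, and ignoring details such as cache-line alignment and allocation-failure handling that the real allocator deals with:

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Hypothetical stand-in for `NativeType`: a fixed-size primitive that can
/// serialize itself to little-endian bytes (the hook later used for IPC).
pub trait NativeType: Copy {
    type Bytes: AsRef<[u8]>;
    fn to_le_bytes(&self) -> Self::Bytes;
}

impl NativeType for u8 {
    type Bytes = [u8; 1];
    fn to_le_bytes(&self) -> Self::Bytes {
        u8::to_le_bytes(*self)
    }
}

impl NativeType for i32 {
    type Bytes = [u8; 4];
    fn to_le_bytes(&self) -> Self::Bytes {
        i32::to_le_bytes(*self)
    }
}

/// Allocate space for `len` values of `T` (len > 0), aligned for `T`.
/// The caller must free the pointer with `deallocate_aligned::<T>(ptr, len)`.
pub fn allocate_aligned<T: NativeType>(len: usize) -> *mut T {
    let layout = Layout::array::<T>(len).expect("allocation size overflows usize");
    unsafe { alloc(layout) as *mut T }
}

/// Free a pointer previously returned by `allocate_aligned::<T>(len)`.
pub unsafe fn deallocate_aligned<T: NativeType>(ptr: *mut T, len: usize) {
    let layout = Layout::array::<T>(len).unwrap();
    dealloc(ptr as *mut u8, layout);
}

fn main() {
    // Migration, roughly: `memory::allocate_aligned(64)` becomes
    // `alloc::allocate_aligned::<u8>(64)`.
    let ptr: *mut u8 = allocate_aligned::<u8>(64);
    unsafe { deallocate_aligned(ptr, 64) };
}
```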

@jorgecarleitao jorgecarleitao changed the title [Rust] Refactored allocator to be a generic over type T ARROW-11627: [Rust] Make allocator be a generic over type T Feb 14, 2021

@apache apache deleted a comment from github-actions bot Feb 14, 2021
@jorgecarleitao
Member Author

jorgecarleitao commented Feb 14, 2021

cc: @sunchao @nevi-me @pitrou @andygrove @alamb, this is a core change to the arrow crate, as it will enable us to allocate memory regions without losing the type information about what they contain. I would appreciate knowing if anyone sees an issue here.

The general direction is to stop transmuting byte buffers, which is unsafe and has caused us too many bugs, and instead use Rust's type system on them.

AFAIK, this will also enable us to support multiple endiannesses, but we are some PRs away from that. For now, we just stick to little endian (AFAIK we do not support big endian atm), and at least we make it explicit.
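To make the direction concrete, here is a small self-contained sketch (hypothetical code, not taken from the crate) contrasting the transmute-style access being moved away from with a buffer that keeps its element type:

```rust
/// Old pattern: an untyped byte buffer reinterpreted at the call site.
/// This is unsafe and wrong if the bytes are misaligned or the length is
/// not a multiple of 4 — the class of bug mentioned above.
fn sum_reinterpreted(bytes: &[u8]) -> i32 {
    let values: &[i32] = unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const i32, bytes.len() / 4)
    };
    values.iter().sum()
}

/// Typed direction: the buffer carries its element type, so the compiler
/// checks it and no reinterpretation is needed.
struct TypedBuffer<T> {
    values: Vec<T>,
}

fn sum_typed(buffer: &TypedBuffer<i32>) -> i32 {
    buffer.values.iter().sum()
}
```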

cc @maxburke, since you depend on the `memory` module of this crate.

@codecov-io

Codecov Report

Merging #9495 (a15c5b1) into master (8547c61) will decrease coverage by 0.06%.
The diff coverage is 51.45%.


@@            Coverage Diff             @@
##           master    #9495      +/-   ##
==========================================
- Coverage   82.12%   82.05%   -0.07%     
==========================================
  Files         235      236       +1     
  Lines       54729    54774      +45     
==========================================
  Hits        44944    44944              
- Misses       9785     9830      +45     
Impacted Files                                    Coverage Δ
rust/arrow/src/alloc/types.rs                     0.00% <0.00%> (ø)
rust/arrow/src/alloc/mod.rs                       92.68% <92.68%> (ø)
rust/arrow/src/array/array_list.rs                93.40% <100.00%> (ø)
rust/arrow/src/array/raw_pointer.rs               100.00% <100.00%> (ø)
rust/arrow/src/buffer.rs                          95.62% <100.00%> (+0.35%) ⬆️
rust/arrow/src/bytes.rs                           53.12% <100.00%> (ø)
rust/parquet/src/encodings/encoding.rs            94.86% <0.00%> (-0.20%) ⬇️
rust/arrow/src/array/transform/fixed_binary.rs    84.21% <0.00%> (+5.26%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8547c61...8ac48f4.

Contributor

@alamb alamb left a comment

I went through this carefully and it looks like a significant improvement to me. I would not call myself an expert in this area (low level Rust allocators) but the changes make sense to me.

I vote :shipit:. @nevi-me or @vertexclique, do you have any thoughts about this PR?

Contributor

FWIW I know you didn't introduce this comment in this PR (it just moved) but I find this comment more confusing than enlightening.

Contributor

I've written this before, which part is confusing? Let's rephrase.

Contributor

I think I am overwhelmed by the detail before I get the context of what the information is for.

Maybe if the comment started with something like

// Pick the best memory alignment based on the target architecture.
// 
// The rationale for doing so is ....

Contributor

Suggested change
- /// * ptr must denote a block of memory currently allocated via this allocator,
+ /// * ptr must denote a block of memory previously returned from `allocate_aligned` or
+ /// `allocate_aligned_zeroed`

Contributor

Suggested change
- /// * ptr must be currently allocated via this allocator,
+ /// ptr must denote a block of memory previously returned from `allocate_aligned` or
+ /// `allocate_aligned_zeroed`

Contributor

If your allocation size is even anywhere near overflowing a 64-bit usize you are likely going to have problems 😆

Member Author

I agree. I do think that this comment applies mostly to 32-bit systems, where usize = u32.
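For reference, a small sketch of how the overflow case can be caught with the standard library alone (not necessarily what the PR does): `Layout::array` reports overflow of `len * size_of::<T>()` instead of wrapping, which is the scenario most relevant on 32-bit targets where `usize` is 32 bits:

```rust
use std::alloc::Layout;

/// Compute the layout for `len` elements of `T`, returning `None` instead of
/// wrapping if the total size overflows.
fn layout_for<T>(len: usize) -> Option<Layout> {
    Layout::array::<T>(len).ok()
}

fn main() {
    assert!(layout_for::<u64>(1024).is_some());
    // usize::MAX elements of 8 bytes each cannot be represented.
    assert!(layout_for::<u64>(usize::MAX).is_none());
}
```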

Contributor

👍

Contributor

Suggested change
- /// Whether a DataType is a valid type for this physical representation.
+ /// Whether an array element of `data_type` can be stored using this `NativeType` as
+ /// its physical representation. For example, `i32` stores `DataType::Int32` as well as
+ /// `DataType::Date32`
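A hedged illustration of what the suggested wording describes (hypothetical types and method, not the crate's actual API): one 32-bit physical type can back more than one logical type.

```rust
/// Hypothetical logical data types, mirroring the example in the suggestion.
enum DataType {
    Int32,
    Date32,
    Int64,
}

/// Hypothetical check: can an array of `data_type` use this Rust type as its
/// physical (in-memory) representation?
trait PhysicalRepr {
    fn is_valid(data_type: &DataType) -> bool;
}

impl PhysicalRepr for i32 {
    fn is_valid(data_type: &DataType) -> bool {
        // `i32` stores `DataType::Int32` as well as `DataType::Date32`:
        // both are 32-bit values at the physical level.
        matches!(data_type, DataType::Int32 | DataType::Date32)
    }
}

fn main() {
    assert!(<i32 as PhysicalRepr>::is_valid(&DataType::Date32));
    assert!(!<i32 as PhysicalRepr>::is_valid(&DataType::Int64));
}
```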

@alamb
Contributor

alamb commented Feb 16, 2021

Any thoughts @nevi-me or @maxburke ?

@alamb
Contributor

alamb commented Feb 18, 2021

I plan to merge this on the weekend unless I hear any comments otherwise

@alamb
Contributor

alamb commented Feb 20, 2021

@jorgecarleitao this needs a rebase :(

But then I think we'll (finally) get it in. Woohoo!

Contributor

@vertexclique vertexclique left a comment

The atomics need some improvement, and there is no need for SeqCst, since ordering doesn't matter while test threads are still running; what is needed here is atomicity.

In addition to all this, there is an existing zz_memory_check.rs. That code will fail, because the ordering is not related to thread access. So I have refactored it for you:

#[cfg(feature = "memory-check")]
mod tests {
    use crate::alloc::ALLOCATIONS;
    use std::sync::atomic::*;

    // verify that there is no data un-allocated
    #[test]
    fn test_memory_check() {
        let mut i: usize = 0;
        unsafe {
            while ALLOCATIONS.load(Ordering::Relaxed) != 0 {
                i+=1;
                spin_loop_hint();
                
                // Fair __mm_pause can be 10m.
                assert!(i < 10_000_000);
            }
        }
    }
}

Comment on lines +34 to +35
Contributor

Suggested change
  // If this number is not zero after all objects have been `drop`, there is a memory leak
- pub static mut ALLOCATIONS: AtomicIsize = AtomicIsize::new(0);
+ #[cfg(feature = "memory-check")]
+ pub static mut ALLOCATIONS: AtomicIsize = AtomicIsize::new(0);

Contributor

Suggested change
- ALLOCATIONS.fetch_add(size as isize, std::sync::atomic::Ordering::SeqCst);
+ #[cfg(feature = "memory-check")]
+ ALLOCATIONS.fetch_add(size as isize, Ordering::Relaxed);

Contributor

Suggested change
- ALLOCATIONS.fetch_add(size as isize, std::sync::atomic::Ordering::SeqCst);
+ #[cfg(feature = "memory-check")]
+ ALLOCATIONS.fetch_add(size as isize, Ordering::Relaxed);

Contributor

Suggested change
- ALLOCATIONS.fetch_sub(size as isize, std::sync::atomic::Ordering::SeqCst);
+ #[cfg(feature = "memory-check")]
+ ALLOCATIONS.fetch_sub(size as isize, Ordering::Relaxed);

Comment on lines +123 to +127
Contributor

Suggested change
- ALLOCATIONS.fetch_add(
-     new_size as isize - old_size as isize,
-     std::sync::atomic::Ordering::SeqCst,
- );
+ #[cfg(feature = "memory-check")]
+ ALLOCATIONS.fetch_add(
+     new_size.checked_sub(old_size).unwrap() as isize,
+     Ordering::Relaxed,
+ );
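Taken together, the suggestions above amount to a pattern like the following (a self-contained sketch, not the PR's exact code, with the `#[cfg(feature = "memory-check")]` gating omitted): only atomicity of the counter is needed, so `Relaxed` ordering suffices for the updates.

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

// Net bytes currently allocated; if this is non-zero after all buffers have
// been dropped, something leaked. (Illustrative, not the crate's actual static.)
static ALLOCATIONS: AtomicIsize = AtomicIsize::new(0);

fn on_allocate(size: usize) {
    // Relaxed is enough: each update only needs to be atomic, not ordered
    // with respect to surrounding memory operations.
    ALLOCATIONS.fetch_add(size as isize, Ordering::Relaxed);
}

fn on_deallocate(size: usize) {
    ALLOCATIONS.fetch_sub(size as isize, Ordering::Relaxed);
}

fn on_reallocate(old_size: usize, new_size: usize) {
    // A signed delta handles both growing and shrinking reallocations.
    ALLOCATIONS.fetch_add(new_size as isize - old_size as isize, Ordering::Relaxed);
}

fn main() {
    on_allocate(64);
    on_reallocate(64, 128);
    on_deallocate(128);
    assert_eq!(ALLOCATIONS.load(Ordering::Relaxed), 0);
}
```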

@vertexclique
Contributor

@alamb Please don't get it in, it will slow down allocations with MOESI communication between cores.

Comment on lines +23 to +26
Contributor

Suggested change
- use std::{
-     alloc::{handle_alloc_error, Layout},
-     sync::atomic::AtomicIsize,
- };
+ use std::alloc::{handle_alloc_error, Layout};
+ use std::sync::atomic::*;

Contributor

It would be better to add the checked_sub check written below for memory-check; also, go here.

Comment on lines +40 to +43
Contributor

TIOLI (take it or leave it): adding to_ne_bytes and to_be_bytes would be cool. Why not :)

@jorgecarleitao
Member Author

@vertexclique, thanks a lot for going through this and for your comments.

For my own understanding, could you give some insight into the ordering comments and the spin_loop_hint?

Note that this PR does not modify the ALLOCATIONS bit. That is the same as it was before. It was added in 2.0 in the context of FFI; I measured no performance difference (maybe 2-3%).

The change set of this PR is essentially bf1a7eb

@vertexclique
Contributor

> For my own understanding, could you give some insight into the ordering comments and the spin_loop_hint?

The loop hint is for waiting on the CPU without OS intervention until the expected value is reached in this test.

> Note that this PR does not modify the ALLOCATIONS bit. That is the same as it was before. It was added in 2.0 in the context of FFI; I measured no performance difference (maybe 2-3%).

> The change set of this PR is essentially bf1a7eb

I've only just seen this; probably I wasn't a reviewer or I simply missed the review cycle.
I don't think the difference can be seen in test execution, but it can be seen in the normal execution of the code.
Moreover, in the actual release code it is not needed and can be filtered out.

Contributor

@alamb alamb left a comment

For what it is worth, here is a pointer to the ALLOCATIONS code that was moved (aka confirming that the behavior has not changed):

https://github.com/apache/arrow/pull/9495/files#diff-d2c73184ac4d2feb55c895ed48bc8af4433d309276de1fd9bb0aa21fd9588f14L227-L230

@vertexclique I suggest we get this PR in and then update/optimize the use of ALLOCATIONS as a follow-on PR. I think @jorgecarleitao has many other PRs/plans waiting on this PR, so we should not delay it unnecessarily.

@nevi-me nevi-me added the needs-rebase label Feb 24, 2021
@vertexclique
Contributor

vertexclique commented Feb 24, 2021

@alamb @jorgecarleitao @nevi-me Yeah, I am totally ok merging it like this. We can keep track of the suggestions in a Jira ticket. :)

@vertexclique
Contributor

FYI: I've assigned these improvements to myself. Truly let's not block each other: https://issues.apache.org/jira/browse/ARROW-11760 @alamb @jorgecarleitao

@nevi-me
Contributor

nevi-me commented Feb 24, 2021

I've been away for over a week due to personal issues.
When Jorge started work on this, we had discussed the need for Bytes or Buffer to know its byte width, so that we could correct our zero-copy slicing/offsets. I had taken a different approach from this PR (directly tracking the byte_size in the Buffer struct), but because I made a lot of changes to the code base, it was difficult for me to reconcile the changes.

This PR seems to deal more with the allocator, and not yet necessarily with Buffer. I'll try to see what impact it has on my (draft) work when it's merged. [EDIT: this is done in https://github.com//pull/9496, which I'm now looking at]

> The change set of this PR is essentially bf1a7eb

I'm fine with getting this PR merged. I'm a bit overwhelmed by the detail, but looking at bf1a7eb has certainly helped me with reviewing the change.

@nevi-me nevi-me removed the needs-rebase label Feb 24, 2021
@nevi-me
Contributor

nevi-me commented Feb 24, 2021

I've rebased, and I'll merge this when CI is green

@alamb
Contributor

alamb commented Feb 24, 2021

Thanks @nevi-me

@alamb
Contributor

alamb commented Feb 24, 2021

> FYI: I've assigned these improvements to myself. Truly let's not block each other:

Thank you @vertexclique !

@nevi-me nevi-me closed this in b5ac048 Feb 25, 2021
alamb pushed a commit that referenced this pull request Feb 25, 2021
Polars uses the `arrow::memory` module. With the backwards incompatible change of #9495, the API is refactored to `arrow::alloc`.

By making `alloc` public, users can shift to the new changes.

Closes #9572 from ritchie46/make_alloc_public

Authored-by: Ritchie Vink <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>