
Replace mmap with file io in merkle tree hash calculation #3547

Open
HaoranYi wants to merge 7 commits into master from cli_hash_bins

Conversation


@HaoranYi HaoranYi commented Nov 8, 2024

Problem

We have noticed that during hash calculation, the performance of the
block-producing process degrades. Part of this is due to the stress that the
accounts hash threads put on memory and disk io.

When we compute the merkle tree hash, we use mmap to store the extracted
accounts' hashes. Mmap is heavy on resource usage, such as memory and disk io,
and puts stress on the whole system.

In this PR, we propose switching to file io, which is less resource intensive,
for the merkle tree hash computation.

Studies on mainnet with this PR show that file io uses less memory and puts
less stress on disk io. The "pure" hash computation time with file io is a
little longer than with mmap, but we also save the mmap drop time, and that
saving more than offsets the extra time spent on hash calculation. Thus, the
overall time for computing the hash is smaller.

Note that there is an upcoming lattice hash feature, which will be the ultimate
solution for hashing, i.e. it removes the merkle tree hash calculation
entirely. However, before that feature is activated, we can still use this PR
as an interim enhancement for the merkle tree hash computation.

Summary of Changes

  • replace mmap with file io for merkle tree hash calculation
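The change summarized above can be sketched in isolation. This is a minimal illustration, not the actual accounts-db code; the `[u8; 32]` hash type and the helper names are assumptions. Hashes are appended through a `BufWriter` and later streamed back with a `BufReader`, so the kernel page cache, rather than a large mapping, mediates memory use:

```rust
use std::fs::File;
use std::io::{BufReader, BufWriter, Read, Write};

const HASH_SIZE: usize = 32; // size of one account hash

// Append hashes through a buffered writer instead of into an mmap;
// no up-front size estimate (like max_inclusive_num_pubkeys) is needed.
fn write_hashes(writer: &mut BufWriter<File>, hashes: &[[u8; HASH_SIZE]]) -> std::io::Result<()> {
    for h in hashes {
        writer.write_all(h)?;
    }
    Ok(())
}

// Stream the hashes back without mapping the whole file into memory.
fn read_back(file: File) -> std::io::Result<Vec<[u8; HASH_SIZE]>> {
    let mut reader = BufReader::new(file);
    let mut out = Vec::new();
    let mut buf = [0u8; HASH_SIZE];
    loop {
        match reader.read_exact(&mut buf) {
            Ok(()) => out.push(buf),
            // a file whose length is a multiple of HASH_SIZE ends cleanly here
            Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```

Dropping a plain `File` is also cheap compared to unmapping a large mmap, which is where the drop-time saving in the measurements comes from.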

Fixes #

@HaoranYi HaoranYi marked this pull request as draft November 8, 2024 17:22
@HaoranYi HaoranYi changed the title cli hash bins Replace mmap with file io in merkle tree hash calculation Nov 8, 2024
@HaoranYi HaoranYi force-pushed the cli_hash_bins branch 2 times, most recently from f8d3ca6 to d9b0c8a Compare November 9, 2024 02:50

HaoranYi commented Nov 11, 2024

Performance comparison

mmap-64Kbins (pink)
mmap-4kbins (orange)
file/io - 64kbins (blue)

[performance comparison chart]


HaoranYi commented Nov 11, 2024

Summary

  • Smaller bins use more memory and are a bit slower in hash time, but faster in drop time.
  • File io uses less memory and requires a bit longer to compute the hash, but its saving on drop time is larger than mmap's, which makes it overall faster than mmap with the same number of bins.
  • File io puts less stress on disk io.

@HaoranYi

rebase to pick up #3589

@HaoranYi HaoranYi force-pushed the cli_hash_bins branch 2 times, most recently from beb8a1f to f6c5c61 Compare November 15, 2024 15:41
@@ -1160,16 +1255,15 @@ impl AccountsHasher<'_> {
let binner = PubkeyBinCalculator24::new(bins);

// working_set hold the lowest items for each slot_group sorted by pubkey descending (min_key is the last)
let (mut working_set, max_inclusive_num_pubkeys) = Self::initialize_dedup_working_set(
let (mut working_set, _max_inclusive_num_pubkeys) = Self::initialize_dedup_working_set(
Author


max_inclusive_num_pubkeys is an estimate of the upper bound for the hash file size. It is only required when we use mmap: creating an mmap requires specifying an initial size, which could be over-allocated. After switching to the file writer, we don't need this any more.

@HaoranYi HaoranYi marked this pull request as ready for review January 15, 2025 16:11
@@ -619,6 +540,180 @@ impl AccountsHasher<'_> {
(num_hashes_per_chunk, levels_hashed, three_level)
}

// This function is called at the top level to compute the merkle hash. It


This fn is copied from fn compute_merkle_root_from_slices<'b, F, T>. Do we still need the other fn?

Author


Yes, we do.
compute_merkle_root_from_slices is still used when we compute the merkle tree at level 2 and above, where we already have all the data in memory.

The comments here may be helpful to understand.

// This function is called at the top level to compute the merkle hash. It
// takes a closure that returns an owned vec of hash data at the leaf level
// of the merkle tree. The input data for this bottom level are read from a
// file. For non-leaves nodes, where the input data is already in memory, we
// will use `compute_merkle_root_from_slices`, which is a version that takes
// a borrowed slice of hash data instead.
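As a rough illustration of the split this comment describes, a toy sketch follows. Everything here is a stand-in: `toy_hash` uses the std `DefaultHasher` rather than the real hasher, `Hash32` stands in for the 32-byte `Hash`, and the function names are simplified. The leaf level arrives as an owned vec produced by a closure (in the PR, read from a file), and upper levels reuse the in-memory slice path:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash as StdHash, Hasher};

type Hash32 = u64; // stand-in for the 32-byte Hash type

// Toy combiner for two child hashes; the real code hashes the
// concatenated child bytes cryptographically.
fn toy_hash(a: Hash32, b: Hash32) -> Hash32 {
    let mut h = DefaultHasher::new();
    (a, b).hash(&mut h);
    h.finish()
}

// Upper levels: the input is already in memory, so reduce it pairwise.
fn merkle_from_slice(mut level: Vec<Hash32>) -> Hash32 {
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| {
                // odd element at the end is promoted unchanged
                if pair.len() == 2 { toy_hash(pair[0], pair[1]) } else { pair[0] }
            })
            .collect();
    }
    level[0]
}

// Leaf level: the closure hands back an owned vec (e.g. read from a
// file), then the in-memory path finishes the upper levels.
fn merkle_from_start<F>(get_leaves: F) -> Hash32
where
    F: FnOnce() -> Vec<Hash32>,
{
    merkle_from_slice(get_leaves())
}
```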

fn get_slice(&self, start: usize) -> &[Hash] {
// return the biggest hash data possible that starts at the overall index 'start'
// start is the index of hashes
fn get_data(&self, start: usize) -> Arc<Vec<u8>> {


Can you help me understand the reason for returning Vec<u8> here and not Vec<Hash>? I know it will require some casting somewhere.

Author

@HaoranYi HaoranYi Jan 15, 2025


The cast happens inside the caller, fn compute_merkle_root_from_start<F, T>, where T is the type Hash.

My thinking is that this is a low-level function that just returns the raw bytes (similar to a file io read) and lets the caller cast them to the type it wants. In a hypothetical case, the caller may want to use a different hash type with more or fewer bytes, depending on the type T passed in.

Since fn compute_merkle_root_from_start<F, T> is already designed to be generic over T, if we limit ourselves to Hash here, we are limiting compute_merkle_root_from_start.

Author

@HaoranYi HaoranYi Jan 16, 2025


I ran an experiment changing the return value to Vec<Hash>.

This is what I came up with. It requires introducing unsafe code, which is awkward. And we may also have an issue when the capacity is not divisible by 32 (the size of a hash).

Therefore, I prefer to keep it as Vec<u8> and let the caller cast it. wdyt?

@@ -337,7 +338,7 @@ impl CumulativeHashesFromFiles {

     // return the biggest hash data possible that starts at the overall index 'start'
     // start is the index of hashes
-    fn get_data(&self, start: usize) -> Arc<Vec<u8>> {
+    fn get_data(&self, start: usize) -> Arc<Vec<Hash>> {
         let (start, offset) = self.cumulative.find(start);
         let data_source_index = offset.index[0];
         let mut data = self.readers[data_source_index].lock().unwrap();
@@ -349,7 +350,16 @@ impl CumulativeHashesFromFiles {

         let mut result_bytes: Vec<u8> = vec![];
         data.read_to_end(&mut result_bytes).unwrap();
-        Arc::new(result_bytes)
+
+        let hashes = unsafe {
+            let len = result_bytes.len() / std::mem::size_of::<Hash>();
+            let capacity = result_bytes.capacity() / std::mem::size_of::<Hash>();
+            let ptr = result_bytes.as_mut_ptr() as *mut Hash;
+            std::mem::forget(result_bytes);
+            Vec::from_raw_parts(ptr, len, capacity)
+        };
+
+        Arc::new(hashes)
     }
 }
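For comparison, the capacity issue in the `from_raw_parts` experiment above disappears if the bytes are re-chunked by value instead of reinterpreted in place. A hedged sketch, using `[u8; 32]` as a stand-in for `Hash` (this helper is illustrative, not the PR's code):

```rust
const HASH_LEN: usize = 32; // size_of::<Hash>() in the PR

// Safe alternative to Vec::from_raw_parts: copy the bytes into
// fixed-size arrays. The capacity no longer needs to divide evenly by
// HASH_LEN; only the length does, and that is checked explicitly.
fn bytes_to_hashes(bytes: &[u8]) -> Option<Vec<[u8; HASH_LEN]>> {
    if bytes.len() % HASH_LEN != 0 {
        return None; // truncated read
    }
    Some(
        bytes
            .chunks_exact(HASH_LEN)
            .map(|c| c.try_into().unwrap()) // each chunk is exactly HASH_LEN bytes
            .collect(),
    )
}
```

The trade-off is one extra copy of the data, which may matter at the sizes involved here.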


I also think we should return Hashes instead of bytes here. And a Box<[Hash]> should make life easier as well. And no Arc in the return type.

let mut hashes = vec![Hash::default(); num_hashes].into_boxed_slice();
// todo: ensure there are `num_hashes` left to read in `data`. This ensures `.unwrap()` below is safe.
data.read_exact(bytemuck::must_cast_slice_mut(hashes.as_mut())).unwrap();

hashes

I prefer this method because it ensures we always have the correct alignment/size for each element. (I know in this case it is Hash, which has an alignment of 1. By using Hash and casting to bytes, the reader doesn't need to go and check every caller/use if we cast the opposite direction.)

And:

  • Box vs Vec: we don't want to allow growing the data by the caller after this function returns.
  • Arc vs no-Arc: this is something the caller can decide, but nothing requires this function to return an Arc.
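The suggested shape can be sketched without the bytemuck dependency (the names and the `[u8; 32]` stand-in for `Hash` are illustrative, not the PR's code): allocate exactly `num_hashes` elements up front, fill them with `read_exact`, and hand back a boxed slice so the caller cannot grow it.

```rust
use std::io::Read;

const HASH_LEN: usize = 32;
type Hash = [u8; HASH_LEN]; // stand-in for the real Hash type

// Allocate the exact number of hashes, fill them from the reader, and
// return a Box<[Hash]> (no growth by the caller, no Arc imposed).
fn read_hashes(mut data: impl Read, num_hashes: usize) -> std::io::Result<Box<[Hash]>> {
    let mut hashes = vec![[0u8; HASH_LEN]; num_hashes].into_boxed_slice();
    for h in hashes.iter_mut() {
        data.read_exact(h)?; // errors (instead of panicking) on a short read
    }
    Ok(hashes)
}
```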

Author


pushed a commit to share the hash calculation between the slice and the owned-vec paths.

Author


yes. done - change get_data() to return Box<[Hash]>. 9cc3366

.unwrap();

let mut result_bytes: Vec<u8> = vec![];
data.read_to_end(&mut result_bytes).unwrap();


Note that this allocates what could be a large Vec, depending on hashing tuning parameters (num bins, etc.).


We have found in pop testing that large memory allocations can cause us to OOM.

Author


Good point.

We could add a cap on how many bytes we load here. The downside is that we may need to call this function multiple times.

Author


I committed a change to cap the file read buffer size to 64MB.

@jeffwashington

I'm certainly happy with and supportive of the idea of the change here. What is the estimated timeline for the lattice hash feature? Even then, we may need to use this code to create a full lattice hash from scratch initially, or for full accountsdb verification.


// initial fetch - could return entire slice
let data_bytes = get_hash_slice_starting_at_index(0);
let data: &[T] = bytemuck::cast_slice(&data_bytes);
Author


@jeffwashington The caller casts the data to the generic type T here.

for _k in 0..end {
if data_index >= data_len {
// we exhausted our data, fetch next slice starting at i
data_bytes = get_hash_slice_starting_at_index(i);
Author


@jeffwashington With a cap on how many bytes we load each time, we may need to load the data more times.
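The capped-read scheme being discussed can be sketched in isolation (the names and the in-memory byte source are illustrative; the PR reads from files): each fetch returns at most `cap` bytes, and the consumer loops and refetches until the source is exhausted.

```rust
// Return at most `cap` bytes starting at `start`; the caller loops,
// advancing `start`, until an empty chunk signals exhaustion.
fn get_data_capped(source: &[u8], start: usize, cap: usize) -> &[u8] {
    let begin = start.min(source.len());
    let end = source.len().min(begin + cap);
    &source[begin..end]
}

// Consumer loop mirroring the hunk above: refetch whenever the current
// chunk runs out. Returns (total bytes seen, number of fetches).
fn consume_all(source: &[u8], cap: usize) -> (usize, usize) {
    let (mut start, mut fetches, mut total) = (0, 0, 0);
    loop {
        let chunk = get_data_capped(source, start, cap);
        if chunk.is_empty() {
            return (total, fetches);
        }
        fetches += 1;
        total += chunk.len();
        start += chunk.len();
    }
}
```

With a 64MB cap, most inputs still complete in one fetch; the loop only pays extra fetches when the hash data is larger than the cap, which is exactly the case that previously risked a large allocation.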


@brooksprumo brooksprumo left a comment


I still need to go through the merkle tree code.

Heh, trying to fit the file io impl into the existing mmap api is really clunky... That's probably the right choice though, as this code will go away after the accounts lt hash is activated.


Comment on lines 351 to 354
#[cfg(test)]
const MAX_BUFFER_SIZE: usize = 128; // for testing
#[cfg(not(test))]
const MAX_BUFFER_SIZE: usize = 64 * 1024 * 1024; // 64MB


I'd prefer if the function took a start offset and a max number of hashes to fetch. That way we avoid magic constants within the implementation here.

Author


yeah. done.

@brooksprumo brooksprumo self-requested a review January 17, 2025 16:13