Skip to content

[TieredStorage] Enable file hash in TieredStorage#170

Closed
yhchiang-sol wants to merge 1 commit intoanza-xyz:masterfrom
yhchiang-sol:ts-file-hash
Closed

[TieredStorage] Enable file hash in TieredStorage#170
yhchiang-sol wants to merge 1 commit intoanza-xyz:masterfrom
yhchiang-sol:ts-file-hash

Conversation

@yhchiang-sol
Copy link
Copy Markdown

@yhchiang-sol yhchiang-sol commented Mar 11, 2024

Problem

File hash feature isn't implemented in tiered-storage.

Summary of Changes

This PR enables file hash in TieredStorage with the following changes.

Test Plan

Added new unit-tests for file hash match and mismatch cases.
Updated existing tests to also cover the file hash verification.

@yhchiang-sol yhchiang-sol force-pushed the ts-file-hash branch 2 times, most recently from e6300f7 to ea04c44 Compare March 11, 2024 06:50
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 11, 2024

Codecov Report

❌ Patch coverage is 94.40559% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.9%. Comparing base (30eecd6) to head (caa8af6).
⚠️ Report is 7050 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #170     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         840      841      +1     
  Lines      228107   228357    +250     
=========================================
+ Hits       186851   187047    +196     
- Misses      41256    41310     +54     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you split up this PR please, it appears to contain multiple orthogonal changes.

In particular these:

  • Deprecate tiered_storage/writer.rs as we currently don't have plans
    to develop writers other than HotStorageWriter for TieredStorage.

Should be its own PR. Not fully on board with this change though.

  • Bump the footer format version as the older file will have hash mismatch.

Should be it's own PR

The others appear related.

@yhchiang-sol
Copy link
Copy Markdown
Author

Can you split up this PR please, it appears to contain multiple orthogonal changes.

Sure. I feel the same way as well. Will split the PR!

@yhchiang-sol yhchiang-sol marked this pull request as draft March 11, 2024 23:15
@yhchiang-sol yhchiang-sol force-pushed the ts-file-hash branch 5 times, most recently from 002b063 to a73c0ea Compare March 18, 2024 21:35
@yhchiang-sol yhchiang-sol marked this pull request as ready for review March 19, 2024 02:10
@yhchiang-sol yhchiang-sol force-pushed the ts-file-hash branch 2 times, most recently from a584276 to bc11e3a Compare March 20, 2024 02:18
@yhchiang-sol yhchiang-sol marked this pull request as draft March 20, 2024 04:34
@yhchiang-sol yhchiang-sol force-pushed the ts-file-hash branch 2 times, most recently from 3275073 to c2ab194 Compare March 20, 2024 05:39
@yhchiang-sol
Copy link
Copy Markdown
Author

Rebased on top of #195 so that file-hash check are also done inside the constructor of TieredReadableFile.

@yhchiang-sol yhchiang-sol marked this pull request as ready for review March 20, 2024 17:47
Comment on lines +199 to +203
self.hash = file.hash();
file.write_pod(&self.hash)?;

file.write_pod(&self.format_version)?;
file.write_pod(&self.footer_size)?;
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brooksprumo: I think we can consider reordering the items in the footer tail so that the format version and footer-size are also hashed.

And, together with your previous comment #195 (comment). If we also exclude hash or even footer tail from the footer, then we can directly do file.write_type(footer) here.

What do you think?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me. So like a two-part footer. The top is hashed, and the bottom is not hashed. The bottom is where the hash itself lives, along with the magic number/etc. Is that right?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right!

@yhchiang-sol
Copy link
Copy Markdown
Author

Rebased to include the refactoring open-path PR.

Comment on lines +186 to +188
pub fn hash(&self) -> Hash {
Hash::new_from_array(self.hasher.finalize().into())
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this function is called multiple times? Iow, is it safe to call finalize() or into() repeatedly? And if it's safe, is it expensive or free?

pub fn write_bytes(&mut self, bytes: &[u8]) -> IoResult<usize> {
self.0.write_all(bytes)?;
self.file.write_all(bytes)?;
self.hasher.update(bytes);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'll be good to benchmark how much time this adds. A normal mnb slot probably has ~1000 accounts written, and each account is about 200 bytes of account data. I think a benchmark comparing with and without the hash would be good. If this hash is relatively expensive, we may need to consider an alternative impl.

Comment on lines +38 to +39
#[error("FileHashMismatch: {0} {1}")]
FileHashMismatch(Hash, Hash),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this error message be updated to indicate which one was read from the file and which one was computed?

self.seek(0)?;

let len = self.0.metadata()?.len() as usize;
let hashed_len = len - std::mem::size_of::<Hash>() - FOOTER_TAIL_SIZE;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to handle wrapping/overflow. Please use safe math here.

Comment on lines +70 to +72
(hash == hash_from_file)
.then(|| Ok(()))
.unwrap_or_else(|| Err(TieredStorageError::FileHashMismatch(hash, hash_from_file)))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably simpler to do:

Suggested change
(hash == hash_from_file)
.then(|| Ok(()))
.unwrap_or_else(|| Err(TieredStorageError::FileHashMismatch(hash, hash_from_file)))
if hash == hash_from_file {
Ok(())
} else {
Err(TieredStorageError::FileHashMismatch(hash, hash_from_file))
}

hash: Hash::new_unique(),
min_account_address: Pubkey::default(),
max_account_address: Pubkey::default(),
hash: Hash::default(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up on line 111, where the struct is defined, I think we should use our own newtype for the hash. i.e.:

struct TieredStorageFooterHash(Hash)

(or some other name)

And likely derive all the same things that TieredStorageMagicNumber does (plus the static asserts).

Comment on lines +199 to +203
self.hash = file.hash();
file.write_pod(&self.hash)?;

file.write_pod(&self.format_version)?;
file.write_pod(&self.footer_size)?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me. So like a two-part footer. The top is hashed, and the bottom is not hashed. The bottom is where the hash itself lives, along with the magic number/etc. Is that right?

@yhchiang-sol
Copy link
Copy Markdown
Author

Rebase to address merge conflict.

@yhchiang-sol yhchiang-sol marked this pull request as draft March 22, 2024 09:06
@yhchiang-sol
Copy link
Copy Markdown
Author

Converting this one to draft. Addressing comments and collecting perf numbers.

@yhchiang-sol
Copy link
Copy Markdown
Author

Rebased on top of recent master to include bench result.

From my local bench runs, the cost seems high. Will try several ways and see which one leads to a better result.

@yhchiang-sol
Copy link
Copy Markdown
Author

Here're the results from bigger samples. Looks like the hashing overhead is around +140% with the current implementation.

Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high severe
write_accounts_file/hot_storage/1000
                        time:   [473.22 µs 474.85 µs 477.72 µs]
                        thrpt:  [2.0933 Melem/s 2.1059 Melem/s 2.1132 Melem/s]
                 change:
                        time:   [+136.29% +138.10% +140.00%] (p = 0.00 < 0.05)
                        thrpt:  [-58.334% -58.001% -57.679%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
write_accounts_file/hot_storage/10000
                        time:   [4.7196 ms 4.7296 ms 4.7406 ms]
                        thrpt:  [2.1094 Melem/s 2.1143 Melem/s 2.1188 Melem/s]
                 change:
                        time:   [+139.84% +141.41% +142.82%] (p = 0.00 < 0.05)
                        thrpt:  [-58.817% -58.577% -58.305%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)

@steviez
Copy link
Copy Markdown

steviez commented Apr 9, 2025

Closing this PR since it is more than 1 year old

@steviez steviez closed this Apr 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants