[TieredStorage] Enable file hash in TieredStorage#170
yhchiang-sol wants to merge 1 commit into anza-xyz:master
Codecov Report
Coverage Diff
                master     #170     +/-
Coverage         81.9%    81.9%   -0.1%
Files              840      841      +1
Lines           228107   228357    +250
Hits            186851   187047    +196
Misses           41256    41310     +54
brooksprumo left a comment:
Can you split up this PR, please? It appears to contain multiple orthogonal changes.
In particular these:
- Deprecate tiered_storage/writer.rs as we currently don't have plans to develop writers other than HotStorageWriter for TieredStorage.
  Should be its own PR. Not fully on board with this change though.
- Bump the footer format version as the older file will have hash mismatch.
  Should be its own PR.
The others appear related.
Sure. I feel the same way as well. Will split the PR!
Rebased on top of #195 so that file-hash checks are also done inside the constructor of TieredReadableFile.
    self.hash = file.hash();
    file.write_pod(&self.hash)?;

    file.write_pod(&self.format_version)?;
    file.write_pod(&self.footer_size)?;
@brooksprumo: I think we can consider reordering the items in the footer tail so that the format version and footer size are also hashed.
Together with your previous comment on #195 (comment): if we also exclude the hash, or even the whole footer tail, from the footer, then we can directly do file.write_type(footer) here.
What do you think?
Seems reasonable to me. So like a two-part footer. The top is hashed, and the bottom is not hashed. The bottom is where the hash itself lives, along with the magic number/etc. Is that right?
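The two-part footer described here could be sketched roughly as follows. This is only an illustration under stated assumptions: `HashingWriter`, `write_hashed`, `write_unhashed`, `write_footer`, and `MAGIC` are hypothetical names, the field widths are made up, and `DefaultHasher` from std stands in for whatever hasher the PR actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Write};

/// Writer that either hashes the bytes it writes or passes them through.
/// `DefaultHasher` is only a std placeholder for the real file hasher.
struct HashingWriter<W: Write> {
    inner: W,
    hasher: DefaultHasher,
}

impl<W: Write> HashingWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, hasher: DefaultHasher::new() }
    }

    /// Writes bytes and feeds them into the running hash.
    fn write_hashed(&mut self, bytes: &[u8]) -> io::Result<()> {
        self.inner.write_all(bytes)?;
        self.hasher.write(bytes);
        Ok(())
    }

    /// Writes bytes without touching the running hash.
    fn write_unhashed(&mut self, bytes: &[u8]) -> io::Result<()> {
        self.inner.write_all(bytes)
    }

    /// Digest of everything written through `write_hashed` so far.
    fn digest(&self) -> u64 {
        self.hasher.finish()
    }
}

/// Hypothetical magic number, for illustration only.
const MAGIC: u64 = 0x1234_5678_9ABC_DEF0;

/// Top part (hashed): format version and footer size.
/// Bottom part (not hashed): the hash itself, then the magic number.
fn write_footer<W: Write>(
    w: &mut HashingWriter<W>,
    format_version: u64,
    footer_size: u64,
) -> io::Result<u64> {
    w.write_hashed(&format_version.to_le_bytes())?;
    w.write_hashed(&footer_size.to_le_bytes())?;
    let digest = w.digest();
    w.write_unhashed(&digest.to_le_bytes())?;
    w.write_unhashed(&MAGIC.to_le_bytes())?;
    Ok(digest)
}
```

With this layout the unhashed bottom part has a fixed size, so a reader can locate the stored hash by seeking back from the end of the file before verifying everything above it.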
Rebased to include the refactoring open-path PR.
    pub fn hash(&self) -> Hash {
        Hash::new_from_array(self.hasher.finalize().into())
    }
What happens if this function is called multiple times? In other words, is it safe to call finalize() or into() repeatedly? And if it's safe, is it expensive or free?
    pub fn write_bytes(&mut self, bytes: &[u8]) -> IoResult<usize> {
        self.0.write_all(bytes)?;
        self.file.write_all(bytes)?;
        self.hasher.update(bytes);
I think it'll be good to benchmark how much time this adds. A normal mainnet-beta (mnb) slot probably has ~1000 accounts written, each with about 200 bytes of account data. I think a benchmark comparing with and without the hash would be good. If this hash is relatively expensive, we may need to consider an alternative impl.
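A rough shape for such a comparison, using the numbers from the comment (about 1000 accounts of about 200 bytes each), might look like the sketch below. It is not the benchmark from this PR: `write_accounts` is a hypothetical helper, the sink is an in-memory `Vec` rather than a file, and std's `DefaultHasher` is a stand-in for the real hasher, so the absolute timings it reports say nothing about the actual overhead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::hint::black_box;
use std::io::Write;
use std::time::{Duration, Instant};

const NUM_ACCOUNTS: usize = 1_000; // roughly one slot's worth of accounts
const ACCOUNT_DATA_LEN: usize = 200; // bytes of account data per account

/// Writes NUM_ACCOUNTS fixed-size accounts into an in-memory sink,
/// optionally hashing each one as it is written.
/// Returns the number of bytes written and the elapsed time.
fn write_accounts(hash_while_writing: bool) -> (usize, Duration) {
    let account = vec![0xABu8; ACCOUNT_DATA_LEN];
    let mut sink: Vec<u8> = Vec::with_capacity(NUM_ACCOUNTS * ACCOUNT_DATA_LEN);
    // Placeholder hasher; the PR's hasher would be swapped in here.
    let mut hasher = DefaultHasher::new();

    let start = Instant::now();
    for _ in 0..NUM_ACCOUNTS {
        sink.write_all(&account).expect("writing to a Vec cannot fail");
        if hash_while_writing {
            hasher.write(&account);
        }
    }
    // Keep the digest observable so the hashing work is not optimized away.
    black_box(hasher.finish());
    let elapsed = start.elapsed();

    (sink.len(), elapsed)
}
```

Calling `write_accounts(false)` and `write_accounts(true)` many times and comparing the elapsed times gives a first impression of relative cost; a proper measurement would use the repository's bench harness and real file I/O.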
    #[error("FileHashMismatch: {0} {1}")]
    FileHashMismatch(Hash, Hash),
Can this error message be updated to indicate which one was read from the file and which one was computed?
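One way to address this is to give the two hashes names, so the message states which is which. The sketch below is std-only and hand-writes `Display` instead of using the `#[error(...)]` attribute from the diff; the field names `computed` and `stored`, the message wording, and the `[u8; 32]` stand-in for `Hash` are all assumptions.

```rust
use std::fmt;

/// Stand-in for the real 32-byte `Hash` type.
type Hash = [u8; 32];

#[derive(Debug)]
enum TieredStorageError {
    /// Named fields make it explicit which hash came from where.
    FileHashMismatch { computed: Hash, stored: Hash },
}

impl fmt::Display for TieredStorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TieredStorageError::FileHashMismatch { computed, stored } => write!(
                f,
                "FileHashMismatch: computed from file contents {:?}, read from footer {:?}",
                computed, stored
            ),
        }
    }
}

impl std::error::Error for TieredStorageError {}
```

With the derive-based approach in the diff, the same idea would presumably be a matter of switching to named fields and referring to them by name in the format string.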
    self.seek(0)?;

    let len = self.0.metadata()?.len() as usize;
    let hashed_len = len - std::mem::size_of::<Hash>() - FOOTER_TAIL_SIZE;
This needs to handle wrapping/overflow. Please use safe math here.
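A minimal sketch of the checked version, assuming placeholder sizes: `HASH_SIZE` and `FOOTER_TAIL_SIZE` below are illustrative constants, not the real values, and `hashed_len` is a hypothetical helper. A file shorter than the hash plus the footer tail yields `None` instead of wrapping or panicking.

```rust
/// Placeholder sizes; the real values come from `size_of::<Hash>()`
/// and the footer definition.
const HASH_SIZE: u64 = 32;
const FOOTER_TAIL_SIZE: u64 = 24;

/// Returns the number of leading bytes covered by the file hash, or `None`
/// if the file is too short to hold both the hash and the footer tail.
fn hashed_len(file_len: u64) -> Option<u64> {
    file_len
        .checked_sub(HASH_SIZE)?
        .checked_sub(FOOTER_TAIL_SIZE)
}
```

The caller can then map `None` to an error for a truncated or corrupt file rather than continuing with a bogus length.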
    (hash == hash_from_file)
        .then(|| Ok(()))
        .unwrap_or_else(|| Err(TieredStorageError::FileHashMismatch(hash, hash_from_file)))
Probably simpler to do:
    if hash == hash_from_file {
        Ok(())
    } else {
        Err(TieredStorageError::FileHashMismatch(hash, hash_from_file))
    }
    hash: Hash::new_unique(),
    min_account_address: Pubkey::default(),
    max_account_address: Pubkey::default(),
    hash: Hash::default(),
Up on line 111, where the struct is defined, I think we should use our own newtype for the hash, i.e.:
    struct TieredStorageFooterHash(Hash)
(or some other name).
And likely derive all the same things that TieredStorageMagicNumber does (plus the static asserts).
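The newtype suggestion might look something like the sketch below. The derive list is a guess, since the derives on `TieredStorageMagicNumber` are not shown here, and the local `Hash` is a 32-byte stand-in for the real type.

```rust
/// Stand-in for the real 32-byte `Hash` type.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
#[repr(transparent)]
struct Hash([u8; 32]);

/// Newtype that marks a hash as the tiered-storage footer hash.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
#[repr(transparent)]
struct TieredStorageFooterHash(Hash);

// Static assert: the newtype must not change the on-disk size of the field.
const _: () = assert!(std::mem::size_of::<TieredStorageFooterHash>() == 32);
```

The compile-time assert guards the file format: if the wrapped type ever changes size, the build fails instead of silently producing files with a different footer layout.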
Rebased to address merge conflict.
Converting this one to draft. Addressing comments and collecting perf numbers.
Rebased on top of recent master to include bench results. From my local bench runs, the cost seems high. Will try several ways and see which one leads to a better result.
Here are the results from bigger samples. Looks like the hashing overhead is around +140% with the current implementation.
Closing this PR since it is more than 1 year old.
Problem
The file hash feature isn't implemented in tiered-storage.
Summary of Changes
This PR enables file hash in TieredStorage with the following changes.
Test Plan
Added new unit tests for file hash match and mismatch cases.
Updated existing tests to also cover the file hash verification.