Remove needless last_root for better reclaims#8148
Remove needless last_root for better reclaims#8148ryoqun merged 3 commits intosolana-labs:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8148 +/- ##
========================================
- Coverage 82% 82% -0.1%
========================================
Files 249 249
Lines 53530 53503 -27
========================================
- Hits 43913 43887 -26
+ Misses 9617 9616 -1 |
|
Seems right to me. |
|
Sad status report: It seems the description is wrong. I can't create a valid test demonstrating the described error condition... And I cannot find another alternative bad scenario at the moment.... This bug is hard. ;) |
4132eb6 to
5cdec64
Compare
5cdec64 to
3120504
Compare
|
|
||
| fn cleanup_dead_slots(&self, dead_slots: &mut HashSet<Slot>, last_root: u64) { | ||
| // a slot is not totally dead until it is older than the root | ||
| dead_slots.retain(|slot| *slot < last_root); |
There was a problem hiding this comment.
I just realized all inputs of this function is ensured to be older or equal to last_root.
Notice the or equal to part, this assertion must be *slot <= last_root for purge_zero_lamport_accounts. But, we don't need this to begin with, as said above.
There was a problem hiding this comment.
sorry, why are they guaranteed to be older than last_root?
There was a problem hiding this comment.
I can explain this by reading the code from here upwards the call graph.
Firstly, this cleanup_dead_slots(dead_slots, last_root) is only called from handle_reclaims(reclaims).
dead_slots are derived from reclaims by calling remove_dead_accounts() inside handle_reclaims(), which does just simple 1-to-1 storage -> slot mapping. Thus, we just need to explain in terms of reclaims further on.
handle_reclaims() are called only from purge_zero_lamport_accounts()(1) and store_with_hashes() (2). These are the only 2 code paths reaching to this function.
(1) For the purge_zero_lamport_accounts() cod epath, reasoning is simple:
The reclaims is a collection of results of accounts_index.purge(&pubkey), which in turn the sum of get_rooted_entries() for all such zero lamport pubkeys. So, reclaims (thus, dead_slots) are always rooted slots only. This is true for normal operation because the validator calls add_root() as it roots slots.
Also this is also true for snapshot restoration because generate_index() is called before purge_zero_lamport_accounts() and generate_index() does accounts_index.roots.insert(*slot_id).
(2) For the store_with_hashes(), reasoning is a bit longer:
The reclaims is ultimately only generated at accounts_index.update(). There is intermediately fns like accounts_db.update_index() and accounts_index.insert(), but the destination of this call graph is same.
Next, accounts_index.update(slot, ...) creates reclaims by concatenating two sets:
A: can_purge-able (= slot < max_root) entries
B: Old entries in slot by replacing with updated one
A is obviously fine because max_root is always member of accounts_index.roots. B is ensured not to be dead at remove_dead_accounts() because the case B is always inserting an updated alive entry into the slot.
So, as a normal operation, cleanup_dead_slots() never removes anything such that slot > last_root.
And leaving this line's retain causes halfway-deallocated state because cleanup_dead_slots() is only called once ever for each given slot. That's because later invocations of remove_dead_accounts() would remove any slots whose storage are already removed.
Thanks for reading a rather long comment. :)
So I think this retain does add little value and complicates code reading with seemingly overlapping and spread responsibility for ensuring purge-able slots at caller-site or at callee-site.
As a second thought, if you're a bit concerned for future changes that might break these assumptions, this PR can be changed to assert! that.
There was a problem hiding this comment.
Ok. Thanks for the description. I think it can be removed.
|
|
||
| pub fn add_root(&mut self, slot: Slot) { | ||
| assert!( | ||
| (self.last_root == 0 && slot == 0) || (slot >= self.last_root), |
There was a problem hiding this comment.
I think removing this assertion is ok to remove last_root, first of all.
|
Removing the |
|
@sakridge Thanks for checking this! I've just merged! |
|
This got needed in the v0.23 branch for #8436. |
* Restore last_root to fix unintended storage delete * Remove last_root thing altogether * Remove unneeded test... (cherry picked from commit 5872746)
Problem
last_root wasn't correctly restored from snapshot and its related range check is bad for
purge_zero_lamport_account, which ultimately causes dangling storage entries some times when restoring from snapshot.(a lot of original investigation report for #8130 is moved to new PR: #8176)
Summary of changes
After careful code scrutinization, I'm pretty sure the
last_rootthing can go away.Just remove
last_rootBackground
I found this as part of investigation for #8130.
Part of #7167.