blockstore: account for blockstore cleanup during shred insertion by AshwinSekar · Pull Request #1259 · anza-xyz/agave

AshwinSekar · 2024-05-09T04:29:24Z

v1.18 introduces many new expect and unwraps on blockstore invariants.

An audit shows that some of these invariants might not hold under extreme situations. Specifically any check relying on receiving a shred from get_shred_from_just_inserted_or_db may fail if the slot is cleaned up during shred insertion:

Shred for slot S is received and compared successfully against blockstore.max_root() in shred fetch stage
Shred enters blockstore shred insertion, insertion lock is grabbed
Replay freezes and votes on a block, updating blockstore.max_root() past S
Blockstore cleanup purges all slot columns up to this new blockstore.max_root(), including the columns for the shred currently being inserted (insertion lock is not checked here)
Shred insertion panics as the shred expected from get_shred_from_just_inserted_or_db is no longer present.

(4) is unlikely to happen as @steviez pointed out here #1151 (comment), but for safety this should still be accounted for.

mergify · 2024-05-09T04:45:11Z

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

codecov-commenter · 2024-05-09T06:26:41Z

Codecov Report

Attention: Patch coverage is 25.00000% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 82.1%. Comparing base (fd8b075) to head (b5e8182).
Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##           master    #1259     +/-   ##
=========================================
- Coverage    82.1%    82.1%   -0.1%     
=========================================
  Files         893      893             
  Lines      236670   236679      +9     
=========================================
- Hits       194460   194437     -23     
- Misses      42210    42242     +32

AshwinSekar · 2024-05-09T18:28:02Z

Going to explore a different approach here by having blockstore cleanup wait for the shred insertion lock to expire. This would avoid having to check shred output every time in shred insertion.

carllin · 2024-05-09T18:45:57Z

Something like this should work:

updates max_root/max_cleanup_slot
waits for lock to expire
proceeds with cleanup up to max_root

Any new insertions after 2 should see the update from 1

steviez · 2024-05-09T19:10:36Z

blockstore cleanup wait for the shred insertion lock to expire.

When you say wait for "shred insertion lock to expire", you're saying that the blockstore cleanup service would acquire the shred insertion lock right ? I was initially hesitant about this, but given that cleanup runs fairly infrequently (once every ~3.5 min) and that we should be able to minimize work in cleanup service with that lock (write batch commit + updating cleanup slots), this approach doesn't seem so bad

behzadnouri · 2024-05-09T20:58:23Z

Going to explore a different approach here by having blockstore cleanup wait for the shred insertion lock to expire.

I think any new logic would be risky in the short term.
For v1.18 we just need to relax expects to error logs and we can revise later if needed.

AshwinSekar · 2024-05-09T21:41:09Z

When you say wait for "shred insertion lock to expire", you're saying that the blockstore cleanup service would acquire the shred insertion lock right ?

doesn't need to acquire the shred insertion lock, can wait until notified by a cond var. essentially as carl said we just need to be sure that the shred insertion thread has seen the new value of blockstore.max_root() before we proceed with the blockstore cleanup to avoid cleaning up any in flight shreds that will not be rejected due to the outdated root check.

I think any new logic would be risky in the short term.
For v1.18 we just need to relax expects to error logs and we can revise later if needed.

I'm fine with this as well. @carllin @steviez if you're okay with that i'll merge this right now and we can revisit in a future pr.

steviez · 2024-05-09T23:17:40Z

we just need to be sure that the shred insertion thread has seen the new value of blockstore.max_root() ...

Thanks for the additional context, Carl's comment is more clear / less ambiguous to me with that. This is potentially overkill / redundant, but here is a clear statement of the problem with relevant points all in one place:

Blockstore cleanup thread reads out a max root value at some point in time, call it MR
Blockstore cleanup is guaranteed to NOT clean any slot > MR
Blockstore shred insertion read max root value and ONLY accepts shreds with parent_slot >= max_root
Blockstore shred insertion does the parent_slot >= max_root check fairly early in Blockstore::do_insert_shreds()
The value of Blockstore::max_root() can change while a Blockstore::do_insert_shreds() is in-flight
Thus, if max root advanced while an insert was in-flight, blockstore cleanup could hypothetically delete data that shred insertion has already checked and is now assuming to be present
- Technically, some of this data could still be alive in the _working_set HashMap's that are created at the top of shred insertion, but stuff like data shreds that had previously been inserted would disappear
Making blockstore cleanup wait until an-flight shred insertion has finished prevents the issue mentioned in previous bullet + guarantees that the next insertion will use a max root that is greater or equal to the one blockstore cleanup is using to clean

With that all stated, yes, the approach you and Carl mentioned seems like it'll do the trick / is better than having cleanup thread grab the shred insertion lock too

steviez

I think any new logic would be risky in the short term.
For v1.18 we just need to relax expects to error logs and we can revise later if needed.

I'm fine with this as well. @carllin @steviez if you're okay with that i'll merge this right now and we can revisit in a future pr.

This is a good point and for the sake of trying to keep commits 1-to-1 between master and v1.18, I'm onboard with it. These are also all demotions of panic to error log, so I think it is fine if we try to sneak the v1.18 BP into the release tomorrow instead of waiting for a week of runtime on the tip of master canaries

) (cherry picked from commit b5c5bd3)

…ion (backport of #1259) (#1279) blockstore: account for blockstore cleanup during shred insertion (#1259) (cherry picked from commit b5c5bd3) Co-authored-by: Ashwin Sekar <ashwin@anza.xyz>

…ion (backport of anza-xyz#1259) (anza-xyz#1279) blockstore: account for blockstore cleanup during shred insertion (anza-xyz#1259) (cherry picked from commit b5c5bd3) Co-authored-by: Ashwin Sekar <ashwin@anza.xyz>

AshwinSekar mentioned this pull request May 9, 2024

blockstore: use erasure meta index field to find conflicting shreds #1151

Merged

AshwinSekar requested review from behzadnouri and steviez May 9, 2024 04:39

blockstore: account for blockstore cleanup during shred insertion

b5e8182

AshwinSekar force-pushed the demote-expect branch from 633bdd8 to b5e8182 Compare May 9, 2024 04:42

AshwinSekar added the v1.18 label May 9, 2024

AshwinSekar marked this pull request as draft May 9, 2024 18:26

behzadnouri approved these changes May 9, 2024

View reviewed changes

AshwinSekar marked this pull request as ready for review May 9, 2024 21:41

steviez approved these changes May 9, 2024

View reviewed changes

AshwinSekar merged commit b5c5bd3 into anza-xyz:master May 10, 2024

mergify Bot pushed a commit that referenced this pull request May 10, 2024

blockstore: account for blockstore cleanup during shred insertion (#1259

e810e27

) (cherry picked from commit b5c5bd3)

mergify Bot mentioned this pull request May 10, 2024

v1.18: blockstore: account for blockstore cleanup during shred insertion (backport of #1259) #1279

Merged

AshwinSekar deleted the demote-expect branch May 10, 2024 01:14

AshwinSekar mentioned this pull request May 10, 2024

patches Blockstore::is_shred_duplicate for resigned chained Merkle shreds #1271

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

blockstore: account for blockstore cleanup during shred insertion#1259

blockstore: account for blockstore cleanup during shred insertion#1259
AshwinSekar merged 1 commit intoanza-xyz:masterfrom
AshwinSekar:demote-expect

AshwinSekar commented May 9, 2024

Uh oh!

mergify Bot commented May 9, 2024

Uh oh!

codecov-commenter commented May 9, 2024

Uh oh!

AshwinSekar commented May 9, 2024

Uh oh!

carllin commented May 9, 2024

Uh oh!

steviez commented May 9, 2024

Uh oh!

behzadnouri commented May 9, 2024

Uh oh!

AshwinSekar commented May 9, 2024

Uh oh!

steviez commented May 9, 2024

Uh oh!

steviez left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

AshwinSekar commented May 9, 2024

Uh oh!

mergify Bot commented May 9, 2024

Uh oh!

codecov-commenter commented May 9, 2024

Codecov Report

Uh oh!

AshwinSekar commented May 9, 2024

Uh oh!

carllin commented May 9, 2024

Uh oh!

steviez commented May 9, 2024

Uh oh!

behzadnouri commented May 9, 2024

Uh oh!

AshwinSekar commented May 9, 2024

Uh oh!

steviez commented May 9, 2024

Uh oh!

steviez left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants