blockstore: account for blockstore cleanup during shred insertion #1259
AshwinSekar merged 1 commit into anza-xyz:master
Conversation
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1259 +/- ##
=========================================
- Coverage 82.1% 82.1% -0.1%
=========================================
Files 893 893
Lines 236670 236679 +9
=========================================
- Hits 194460 194437 -23
- Misses 42210 42242 +32
Going to explore a different approach here by having blockstore cleanup wait for the shred insertion lock to expire. This would avoid having to check shred output every time in shred insertion.
Something like this should work:
Any new insertions after 2 should see the update from 1 |
When you say wait for the "shred insertion lock to expire", you're saying that the blockstore cleanup service would acquire the shred insertion lock, right? I was initially hesitant about this, but given that cleanup runs fairly infrequently (once every ~3.5 min) and that we should be able to minimize the work done in the cleanup service while holding that lock (write batch commit + updating cleanup slots), this approach doesn't seem so bad.
I think any new logic would be risky in the short term. For v1.18 we just need to relax expects to error logs and we can revise later if needed.
Cleanup doesn't need to acquire the shred insertion lock; it can wait until notified by a cond var. Essentially, as Carl said, we just need to be sure that the shred insertion thread has seen the new value of
I'm fine with this as well. @carllin @steviez if you're okay with that I'll merge this right now and we can revisit in a future PR.
Thanks for the additional context; Carl's comment is clearer / less ambiguous to me with that. This is potentially overkill / redundant, but here is a clear statement of the problem with the relevant points all in one place:
With that all stated, yes, the approach you and Carl mentioned seems like it'll do the trick / is better than having cleanup thread grab the shred insertion lock too |
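The cond-var handoff discussed in this thread might look roughly like the sketch below. This is hypothetical illustration only, not the actual agave Blockstore code: the CleanupSync struct, its fields, and the handoff function are invented names. The idea is that cleanup publishes a new lowest cleanup slot and then blocks on a Condvar until the insertion thread acknowledges it has observed the value, instead of cleanup grabbing the shred insertion lock itself.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical sketch of the cleanup <-> insertion handoff (invented names,
// not the agave Blockstore API).
struct CleanupSync {
    // (slot published by cleanup, slot acknowledged by insertion)
    slots: Mutex<(u64, u64)>,
    cvar: Condvar,
}

fn handoff(new_lowest_slot: u64) -> u64 {
    let sync = Arc::new(CleanupSync {
        slots: Mutex::new((0, 0)),
        cvar: Condvar::new(),
    });

    // "Insertion" thread: waits for a published update, then acknowledges it.
    let insertion = {
        let sync = Arc::clone(&sync);
        thread::spawn(move || {
            let mut slots = sync.slots.lock().unwrap();
            // Loop guards against spurious wakeups.
            while slots.0 <= slots.1 {
                slots = sync.cvar.wait(slots).unwrap();
            }
            // The insertion path has now "seen" the new lowest slot; in the
            // real service any subsequent insert would respect it.
            slots.1 = slots.0;
            sync.cvar.notify_all();
        })
    };

    // "Cleanup" side (the caller): publish the new slot, then wait for the
    // acknowledgement instead of taking the shred insertion lock.
    {
        let mut slots = sync.slots.lock().unwrap();
        slots.0 = new_lowest_slot;
        sync.cvar.notify_all();
        while slots.1 < slots.0 {
            slots = sync.cvar.wait(slots).unwrap();
        }
    }
    insertion.join().unwrap();
    let acked = sync.slots.lock().unwrap().1;
    acked
}

fn main() {
    // Cleanup only returns once insertion has acknowledged slot 42.
    println!("{}", handoff(42));
}
```

Since cleanup runs only once every few minutes, briefly blocking it on the acknowledgement is cheap, and the insertion hot path takes no additional lock.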
steviez left a comment
I think any new logic would be risky in the short term. For v1.18 we just need to relax expects to error logs and we can revise later if needed.

I'm fine with this as well. @carllin @steviez if you're okay with that I'll merge this right now and we can revisit in a future PR.
This is a good point, and for the sake of trying to keep commits 1-to-1 between master and v1.18, I'm onboard with it. These are also all demotions of panics to error logs, so I think it is fine if we try to sneak the v1.18 BP into the release tomorrow instead of waiting for a week of runtime on the tip of master canaries.
blockstore: account for blockstore cleanup during shred insertion (backport of anza-xyz#1259) (anza-xyz#1279)

blockstore: account for blockstore cleanup during shred insertion (anza-xyz#1259)

(cherry picked from commit b5c5bd3)

Co-authored-by: Ashwin Sekar <ashwin@anza.xyz>
v1.18 introduces many new expects and unwraps on blockstore invariants. An audit shows that some of these invariants might not hold under extreme situations. Specifically, any check relying on receiving a shred from get_shred_from_just_inserted_or_db may fail if the slot is cleaned up during shred insertion:

1. Shred S is received and compared successfully against blockstore.max_root() in shred fetch stage
2. Cleanup service advances blockstore.max_root() past S
3. Cleanup purges slots up through blockstore.max_root(), including the columns for the shred currently being inserted (insertion lock is not checked here)
4. The lookup via get_shred_from_just_inserted_or_db fails because the shred is no longer present

(4) is unlikely to happen as @steviez pointed out here #1151 (comment), but for safety this should still be accounted for.
get_shred_from_just_inserted_or_dbmay fail if the slot is cleaned up during shred insertion:Sis received and compared successfully againstblockstore.max_root()in shred fetch stageblockstore.max_root()pastSblockstore.max_root(), including the columns for the shred currently being inserted (insertion lock is not checked here)get_shred_from_just_inserted_or_dbis no longer present.(4) is unlikely to happen as @steviez pointed out here #1151 (comment), but for safety this should still be accounted for.