Finally introduce sane unified scheduler shutdown #5866
apfitzge merged 6 commits into anza-xyz:master
impl Drop for BankForks {
    fn drop(&mut self) {
        error!("BankForks::drop(): successfully dropped");
Note to self: now that ProgramCache's circular reference to BankForks is gone, maybe I can wire prepare_to_drop from here to exercise the shutdown code path (especially in tests!).
apfitzge left a comment
I'm a bit biased toward my solution in #5826 because I think having strong circular references leaves quite a bit of room for us to introduce a similar bug in the future, whereas weak references make that harder to do.
However I think most of these changes make sense in addition to my PR, and give us stronger confidence in shutdown logic. AFAICT these two PRs are not conflicting, please let me know if you see a reason why this would not be the case.
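To make the weak-reference argument concrete, here is a minimal sketch. It assumes two hypothetical stand-in types, `Forks` and `Pool`, which are not the real agave structs, and it is not claimed to mirror the actual change in #5826. The owner holds the pool strongly and the pool points back weakly, so dropping the last external handle frees both sides without any explicit shutdown call.

```rust
use std::sync::{Arc, Mutex, Weak};

// Hypothetical stand-ins for illustration; not the real agave types.
struct Forks {
    // The owner holds the pool with a strong reference.
    pool: Mutex<Option<Arc<Pool>>>,
}

struct Pool {
    // The back-reference is weak, so it never keeps `Forks` alive.
    forks: Weak<Forks>,
}

impl Pool {
    /// Reports whether the owner can still be reached through the weak link.
    fn owner_alive(&self) -> bool {
        self.forks.upgrade().is_some()
    }
}

/// Builds the two-way link and returns the owner plus a weak probe for the pool.
fn build() -> (Arc<Forks>, Weak<Pool>) {
    let forks = Arc::new(Forks { pool: Mutex::new(None) });
    let pool = Arc::new(Pool { forks: Arc::downgrade(&forks) });
    let weak_pool = Arc::downgrade(&pool);
    *forks.pool.lock().unwrap() = Some(pool);
    (forks, weak_pool)
}

/// Returns true if dropping the last external handle frees both objects.
fn drops_cleanly() -> bool {
    let (forks, weak_pool) = build();
    let weak_forks = Arc::downgrade(&forks);
    // While the owner is alive, the pool can still reach it.
    let reachable = weak_pool.upgrade().map(|p| p.owner_alive()).unwrap_or(false);
    drop(forks);
    reachable && weak_forks.upgrade().is_none() && weak_pool.upgrade().is_none()
}
```

The trade-off is that every use of the back-reference has to handle a failed `upgrade()`.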
drop(self.cluster_info);

self.poh_service.join().expect("poh_service");
macro_rules! join_then_log {
This seems like a separate change to assist in debugging thread-joining issues. Probably a good change to make, but not related to fixing this issue AFAICT?
Yeah. Another piggybacked salvage from the ancient PR #593 by procrastinating me.
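For readers who have not seen this kind of helper, a join-and-log macro might look roughly like the sketch below. Only the name `join_then_log` comes from the diff; the body, the argument shape, and the log format are assumptions, since the actual macro body is not shown in this conversation.

```rust
use std::thread;
use std::time::Instant;

// Hypothetical body; only the macro name appears in the PR diff.
macro_rules! join_then_log {
    ($handle:expr, $name:literal) => {{
        let started = Instant::now();
        // Panic with the thread's name if the joined thread itself panicked.
        let result = $handle.join().expect($name);
        // Log how long the join took, so slow shutdowns are easy to spot.
        eprintln!("joined {} in {:?}", $name, started.elapsed());
        result
    }};
}

/// Spawns a trivial worker and joins it through the macro.
fn spawn_and_join() -> u32 {
    let worker = thread::spawn(|| 40 + 2);
    join_then_log!(worker, "worker")
}
```

Logging after each join makes it visible which thread a hung shutdown is waiting on.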
    bank
}

pub fn prepare_to_drop(&mut self) {
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5866 +/- ##
========================================
Coverage 83.2% 83.2%
========================================
Files 853 853
Lines 375116 375214 +98
========================================
+ Hits 312225 312330 +105
+ Misses 62891 62884 -7
First of all, thanks for working on the unified scheduler code.

To be clear, I have a strong preference when it comes to strong circular references versus relaxing them into weak ones, but I'm not planning to push my idea on the team too hard. So, if I cannot change your mind with this comment, I'm fine with merging #5826. It was written with a good understanding of the unified scheduler, and it is indeed a good solution. I'm quite happy to see these kinds of PRs from someone other than me.. :) And it reduces my workload for now. However, I might revisit it in the future, as I say at the bottom of this wall of text (for real perf reasons)... xD

Maybe the difference in approach between the two PRs boils down to different coding stances. I agree that strong circular references should be avoided if possible and that they are error-prone. However, at the same time, I think this downside can be kept under control with strict coding of full of

Fyi, this bug has been known (at least by me!) since the start, and I still don't think local-cluster test (#5435) and production

On the other hand, I agree that weak circular references do indeed fix the pressing problem (if any!). However, I think they cloud system responsibilities. Namely, guarantee of

Lastly, there's a rationale for strong circular references other than the introduction of a strict runtime invariant: performance. For
Yeah, we can mix the two technically. However, I think merging either one should be enough. I think we should stick to one philosophy and be consistent with it.
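The strong-reference stance can be illustrated with a small sketch. It assumes hypothetical `Forks` and `Pool` types, and borrows only the name `prepare_to_drop` from the diff (here taking `&self` for simplicity); the real method's receiver and behavior are not shown in this conversation. Both links are strong, so the pair is freed only if shutdown explicitly severs the cycle first.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for illustration; not the real agave types.
struct Forks {
    pool: Mutex<Option<Arc<Pool>>>,
}

struct Pool {
    // A strong back-reference: together with `Forks::pool` this forms a cycle.
    forks: Mutex<Option<Arc<Forks>>>,
}

impl Forks {
    /// Hypothetical shutdown hook: severs both strong links of the cycle.
    fn prepare_to_drop(&self) {
        if let Some(pool) = self.pool.lock().unwrap().take() {
            pool.forks.lock().unwrap().take();
        }
    }
}

/// Builds a strong reference cycle between the two objects.
fn build_cycle() -> Arc<Forks> {
    let forks = Arc::new(Forks { pool: Mutex::new(None) });
    let pool = Arc::new(Pool { forks: Mutex::new(Some(forks.clone())) });
    *forks.pool.lock().unwrap() = Some(pool);
    forks
}

/// Returns whether `Forks` is freed once the last external handle is dropped.
fn freed_after_drop(call_shutdown: bool) -> bool {
    let forks = build_cycle();
    let weak = Arc::downgrade(&forks);
    if call_shutdown {
        forks.prepare_to_drop();
    }
    drop(forks);
    weak.upgrade().is_none()
}
```

Here `freed_after_drop(true)` frees everything, while `freed_after_drop(false)` leaks the cycle.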
@ryoqun is this ready to review?
f6f9acb to 903fd09
903fd09 to 0842d6a
0842d6a to c943b74
apfitzge left a comment
Minor comments on the unreachable!()s.
Would like to remove the additional contention on poh_recorder before merging.
impl BankingStageMonitor for DecisionMaker {
    fn status(&mut self) -> BankingStageStatus {
        if matches!(
        if self.poh_recorder.read().unwrap().is_exited.load(Relaxed) {
Would be good to just clone is_exited off of poh_recorder. It's already wrapped in an Arc, so we can avoid locking poh_recorder each time we check status.
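The suggestion can be sketched as follows, using simplified stand-ins for `PohRecorder` and `DecisionMaker`. The struct layouts, the constructor, and the `is_exited()` helper are assumptions for illustration, not the real agave definitions. The flag's `Arc` is cloned once at construction, so later status checks read the atomic directly and never take the `RwLock`.

```rust
use std::sync::atomic::{AtomicBool, Ordering::Relaxed};
use std::sync::{Arc, RwLock};

// Hypothetical, simplified stand-in for the real PohRecorder.
struct PohRecorder {
    is_exited: Arc<AtomicBool>,
}

// Hypothetical, simplified stand-in for the real DecisionMaker.
struct DecisionMaker {
    // Cloned once at construction; later checks never touch the RwLock.
    is_exited: Arc<AtomicBool>,
}

impl DecisionMaker {
    fn new(poh_recorder: &RwLock<PohRecorder>) -> Self {
        // Lock only once, to grab a shared handle to the flag.
        let is_exited = poh_recorder.read().unwrap().is_exited.clone();
        Self { is_exited }
    }

    fn is_exited(&self) -> bool {
        self.is_exited.load(Relaxed)
    }
}

/// Returns the flag as seen before and after the recorder marks itself exited.
fn demo() -> (bool, bool) {
    let recorder = RwLock::new(PohRecorder {
        is_exited: Arc::new(AtomicBool::new(false)),
    });
    let decision_maker = DecisionMaker::new(&recorder);
    let before = decision_maker.is_exited();
    // Hold the write lock to show that the check below needs no lock at all.
    let guard = recorder.write().unwrap();
    guard.is_exited.store(true, Relaxed);
    let after = decision_maker.is_exited();
    drop(guard);
    (before, after)
}
```

Because both handles point at the same `AtomicBool`, a store made through the recorder is visible through the clone.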
a78c13b to 5509cde
😱 New commits were pushed while the automerge label was present.
bf31400 to 3e35d9f
23ac0b4 to 2211a02
Fyi, I added 63dc14c after noticing CI delays across the board...
2211a02 to 63dc14c
* Finally introduce sane unified scheduler shutdown
* Avoid lock contentions on poh_recorder
* Remove confusing redundant comment
* Provide explicit msg to unreachable!()s
* Minor edits
* Improve ci stability with faster joining
Problem
I was lazy. There was a WIP shutdown impl buried in an ancient PR..
And that WIP work is now well integrated with the bp scheduler code path.
Summary of Changes
An alternative approach to #5826.
Fyi: salvaged from #593 by porting it to #3946, then extracted from there and force-pushed to this PR.