LeaderBankNotifier #30395
@ryoqun requesting an early review on this. You had a similar mechanism in your branch, which I had adapted for my own uses: using a condvar to wake threads up when they have a valid bank to commit to. LMK what you want/need for this so we can start using a common struct for it.
Thinking that an issue with the current impl is that when we are no longer the leader, we'd end up waiting forever on txs waiting for the bank to commit into. Probably need to allow some short timeout for the |
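To make the "wait forever" concern concrete, here is a minimal sketch of a condvar wait that gives up after a timeout. It is not the PR's code: `BankReady`, `set_ready`, `wait_ready`, and `demo_timeout_and_wakeup` are hypothetical names, and a plain `bool` stands in for "there is a bank to commit into". Only `std::sync::Condvar::wait_timeout_while` is a real standard-library call.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for "do we currently have a bank to commit into?".
#[derive(Default)]
pub struct BankReady {
    ready: Mutex<bool>,
    condvar: Condvar,
}

impl BankReady {
    /// Mark a bank as available and wake every waiting thread.
    pub fn set_ready(&self) {
        *self.ready.lock().unwrap() = true;
        self.condvar.notify_all();
    }

    /// Block until a bank is available or `timeout` elapses.
    /// Returns `true` if a bank became available, `false` on timeout.
    pub fn wait_ready(&self, timeout: Duration) -> bool {
        let guard = self.ready.lock().unwrap();
        // The closure is re-checked on every wakeup, so spurious wakeups
        // do not end the wait early.
        let (guard, _timeout_result) = self
            .condvar
            .wait_timeout_while(guard, timeout, |ready| !*ready)
            .unwrap();
        *guard
    }
}

/// Usage: the first wait times out, the second is woken by another thread.
pub fn demo_timeout_and_wakeup() -> (bool, bool) {
    let shared = Arc::new(BankReady::default());

    // Nobody signals yet, so this wait gives up after the timeout.
    let timed_out = !shared.wait_ready(Duration::from_millis(20));

    // A second thread signals; the waiter returns once the flag is set.
    let notifier = Arc::clone(&shared);
    let handle = thread::spawn(move || notifier.set_ready());
    let woken = shared.wait_ready(Duration::from_secs(5));
    handle.join().unwrap();

    (timed_out, woken)
}
```

With a timeout, a thread that is no longer leader gets control back after a bounded delay instead of blocking indefinitely; how long that delay should be is a tuning question this sketch does not answer.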
hehe, yeah, i had similar struct called That said, the impl itself looks good at quick glance. As for name, I'd prefer something which alludes to the blocking nature here. (so, i don't like |
this is just fyi, but, as for |
I've tried signaling for retry, and what I found was that the scheduler scheduled a bunch of work, most of it was not processed and sent back, then it needed to be re-scheduled. That seemed inefficient and I ended up trying a few different versions of waiting for the next bank, though condvars seem most reliable. Not convinced it's the 100% best way, but right now it seems to be better than other options I've tested and I'd like to move forward - "don't let perfect be the enemy of good" and all. |
Could you elaborate what you mean here? AFAICT Wasn't entirely clear what you meant in terms of changes with that crossbeam PR - is your desired synchronization in one of your branches? |
43ccdd2 to 81a60e4
Codecov Report
@@           Coverage Diff            @@
##           master   #30395    +/-   ##
========================================
  Coverage    81.5%    81.5%
========================================
  Files         726      727      +1
  Lines      204809   204971    +162
========================================
+ Hits       167027   167197    +170
+ Misses      37782    37774      -8
here, it's creating solana/poh/src/poh_recorder.rs Lines 212 to 224 in 65cd552
nope, it's only in my mind and @behzadnouri's (he was faster than me to report to crossbeam: crossbeam-rs/crossbeam#861 for this optimization opportunity) ... lol anyway, it's straightforward: we can remove so, we can elide 2 syscalls per batch at banking thread side and we can elide 1 syscall per batch at poh thread side. worse, those syscalls must be serialized.... 🤮 I already confirmed this is a significant bottleneck according to my off cpu profiling: https://github.com/solana-labs/solana/wiki/General-Debugging#perf-based-profiling
btw, diff coverage is quite good. 💯 |
I missed testing the timeout case for both of the functions. Added tests which should bump this to 100% - and actually found a bug where even in timeout |
27616ae to 42d778b
@ryoqun, I had to rebase due to some dependency audit failures unrelated to the change, which screwed up some of the commit links I replied with... In addition to addressing your comments I added an additional commit 42d778b to add appropriate assertions in |
poh/src/poh_recorder.rs
Outdated
#[derive(Clone)]
pub struct TransactionRecorder {
    // shared by all users of PohRecorder
    pub record_sender: Sender<Record>,
    pub is_exited: Arc<AtomicBool>,
}

impl Clone for TransactionRecorder {
    fn clone(&self) -> Self {
        TransactionRecorder::new(self.record_sender.clone(), self.is_exited.clone())
    }
}

impl TransactionRecorder {
    pub fn new(record_sender: Sender<Record>, is_exited: Arc<AtomicBool>) -> Self {
        Self {
            // shared
            record_sender,
            // shared
            is_exited,
        }
hmm, seems rebase miss... manual Clone is revived unexpectedly...
🤦 nice catch. I manually "reverted" my original derive by re-adding these manually.
That's what I get for not using git properly!
Added a single commit to make your re-review simpler, which should do nothing once we squash everything on merge: b679dee
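For readers following the thread above, a small sketch of why a manual `Clone` impl is redundant for a struct like this: when every field is itself `Clone`, `#[derive(Clone)]` produces the same field-by-field clone. `RecorderSketch` and `demo_derived_clone` are hypothetical names; `u64` stands in for `Record`, and `std::sync::mpsc::Sender` stands in for whatever channel type the real code uses.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;

/// Hypothetical stand-in for the struct in the diff. The derive clones
/// each field, which is what the hand-written impl did as well.
#[derive(Clone)]
pub struct RecorderSketch {
    pub record_sender: Sender<u64>,
    pub is_exited: Arc<AtomicBool>,
}

/// Usage: the clone shares the same channel and the same exit flag.
pub fn demo_derived_clone() -> (u64, bool) {
    let (record_sender, receiver) = channel();
    let original = RecorderSketch {
        record_sender,
        is_exited: Arc::new(AtomicBool::new(false)),
    };
    let copy = original.clone();

    // A message sent through the clone arrives on the original receiver.
    copy.record_sender.send(7).unwrap();
    // A flag set through the original is visible through the clone.
    original.is_exited.store(true, Ordering::Relaxed);

    (
        receiver.recv().unwrap(),
        copy.is_exited.load(Ordering::Relaxed),
    )
}
```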
lgtm with nits.
super hard work here. really appreciate addressing all of my comments seriously.. :) Through that journey, I think the impl pivoting made sense.
At this point, I've thoroughly analyzed the code path and i think all of newly added panic sources are safe.
There's a few of nits still. but i think it's ok to merge as-is, unless you're so obsessed with cleanest code possible like me... xD
@ryoqun I pushed those 2 nits...
I think you may have me pegged 😆 |
re-lgtm. thanks for the extra chores.
Problem
Currently there's not a great way to wait for a new leader slot to start or end without busy waiting. Often a leader has multiple (4) back-to-back slots, and will end up with txs hitting PohRecorderError::MaxHeightReached during the transition between bank N and bank N+1. Instead of just running a fast loop for the next slot to begin, we can use a Condvar and notify other threads when the slot begins/ends.

Summary of Changes
LeaderBankNotifier, with state InProgress or StandBy
InProgress state when we set the working bank in PohRecorder.
StandBy state when the working bank is cleared in PohRecorder.

This also enables us to have less busy threads in BankingStage - probably not going to rework existing BankingStage, but plan to use the functionality in scheduler "worker" threads.
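The two-state design described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the struct added by this PR: `NotifierSketch`, `Status`, and all method names are hypothetical, and a `u64` slot number stands in for the working bank.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical two-state status; the slot number stands in for a bank.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    StandBy,
    InProgress(u64),
}

pub struct NotifierSketch {
    status: Mutex<Status>,
    condvar: Condvar,
}

impl NotifierSketch {
    pub fn new() -> Self {
        Self {
            status: Mutex::new(Status::StandBy),
            condvar: Condvar::new(),
        }
    }

    /// Called where the working bank is set: enter InProgress, wake waiters.
    pub fn set_in_progress(&self, slot: u64) {
        *self.status.lock().unwrap() = Status::InProgress(slot);
        self.condvar.notify_all();
    }

    /// Called where the working bank is cleared: enter StandBy, wake waiters.
    pub fn set_stand_by(&self) {
        *self.status.lock().unwrap() = Status::StandBy;
        self.condvar.notify_all();
    }

    /// Wait for a slot to begin. Returns the slot, or `None` on timeout.
    pub fn wait_for_in_progress(&self, timeout: Duration) -> Option<u64> {
        let guard = self.status.lock().unwrap();
        let (guard, _) = self
            .condvar
            .wait_timeout_while(guard, timeout, |s| *s == Status::StandBy)
            .unwrap();
        match *guard {
            Status::InProgress(slot) => Some(slot),
            Status::StandBy => None,
        }
    }

    /// Wait for the slot to end. Returns `false` if it is still in progress
    /// when the timeout elapses.
    pub fn wait_for_stand_by(&self, timeout: Duration) -> bool {
        let guard = self.status.lock().unwrap();
        let (guard, _) = self
            .condvar
            .wait_timeout_while(guard, timeout, |s| *s != Status::StandBy)
            .unwrap();
        *guard == Status::StandBy
    }
}

/// Usage: a "poh" thread starts slot 42 while a worker waits for it.
pub fn demo_transitions() -> (Option<u64>, bool, Option<u64>) {
    let notifier = Arc::new(NotifierSketch::new());

    let poh_side = Arc::clone(&notifier);
    let handle = thread::spawn(move || poh_side.set_in_progress(42));
    let started = notifier.wait_for_in_progress(Duration::from_secs(5));
    handle.join().unwrap();

    // The slot has not ended, so a short wait for StandBy times out.
    let ended_early = notifier.wait_for_stand_by(Duration::from_millis(10));

    // After the bank is cleared, waiting for a new slot times out.
    notifier.set_stand_by();
    let next = notifier.wait_for_in_progress(Duration::from_millis(10));

    (started, ended_early, next)
}
```

Waiters sleep inside the condvar instead of spinning, which is the reduction in busy threads the description refers to.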
An example of a slot ending notification being useful is in #30396