
LeaderBankNotifier #30395

Merged: 36 commits merged into solana-labs:master on Mar 27, 2023
Conversation

@apfitzge (Contributor) commented Feb 17, 2023

Problem

Currently there's not a great way to wait for a new leader slot to start or end without busy waiting. Often a leader has multiple (4) back-to-back slots, and transactions end up hitting PohRecorderError::MaxHeightReached during the transition between bank N and bank N+1. Instead of just running a fast loop waiting for the next slot to begin, we can use a Condvar and notify other threads when the slot begins/ends.

Summary of Changes

  • New struct LeaderBankNotifier
    • Locked state machine for slot transitions that notifies on changes to InProgress or StandBy
    • Also stores a weak reference to the most recent leader bank
  • Set the InProgress state when we set the working bank in PohRecorder.
  • Set the StandBy state when the working bank is cleared in PohRecorder.

This also enables us to have less-busy threads in BankingStage - probably not going to rework the existing BankingStage, but I plan to use the functionality in scheduler "worker" threads.
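For illustration, here is a minimal sketch of what a Condvar-based notifier along these lines might look like. This is not the code added in this PR: Bank is a stand-in type, the locking is simplified, and only the method names mentioned in this conversation (set_in_progress, set_completed, wait_for_in_progress) are mirrored.

```rust
use std::sync::{Arc, Condvar, Mutex, Weak};
use std::time::Duration;

// Stand-in for the real runtime Bank type.
pub struct Bank;

// Slot state: StandBy between leader slots, InProgress while a leader bank is active.
enum Status {
    StandBy,
    InProgress(Weak<Bank>),
}

pub struct LeaderBankNotifier {
    state: Mutex<Status>,
    condvar: Condvar,
}

impl LeaderBankNotifier {
    pub fn new() -> Self {
        Self {
            state: Mutex::new(Status::StandBy),
            condvar: Condvar::new(),
        }
    }

    // Called when PohRecorder sets the working bank: keep only a weak reference and wake waiters.
    pub fn set_in_progress(&self, bank: &Arc<Bank>) {
        *self.state.lock().unwrap() = Status::InProgress(Arc::downgrade(bank));
        self.condvar.notify_all();
    }

    // Called when PohRecorder clears the working bank: return to StandBy and wake waiters.
    pub fn set_completed(&self) {
        *self.state.lock().unwrap() = Status::StandBy;
        self.condvar.notify_all();
    }

    // Block for up to `timeout` until a leader bank is in progress; return None on timeout
    // or if the bank has already been dropped.
    pub fn wait_for_in_progress(&self, timeout: Duration) -> Option<Arc<Bank>> {
        let guard = self.state.lock().unwrap();
        let (guard, result) = self
            .condvar
            .wait_timeout_while(guard, timeout, |state| matches!(state, Status::StandBy))
            .unwrap();
        if result.timed_out() {
            return None;
        }
        match &*guard {
            Status::InProgress(bank) => bank.upgrade(),
            Status::StandBy => None,
        }
    }
}
```

Worker threads can then block on wait_for_in_progress instead of spinning, and the weak reference means a waiter holding the notifier never keeps a completed bank alive.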

An example of a slot ending notification being useful is in #30396

Fixes #

@apfitzge (Contributor Author):

@ryoqun requesting an early review on this. You had a similar mechanism in your branch, which I had adapted for my own uses - using a condvar to wake threads up when they have a valid bank to commit to.

LMK what you want/need for this so we can start using a common struct for it.

@apfitzge (Contributor Author):

Thinking that an issue with the current impl is that when we are no longer the leader, we'd end up with txs waiting forever for a bank to commit into.

Probably need to allow some short timeout for the wait_for_in_progress as well - maybe wait up to 50ms before we just stop processing txs in the execution thread(s).
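As a rough sketch of that idea, building on the toy notifier sketched in the PR description above (worker_loop, process_batch, the exit flag, and the 50ms value are all illustrative, not part of this PR):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

// Hypothetical execution-thread loop: wait briefly for a leader bank instead of spinning,
// and give up on this iteration if none shows up within 50ms.
fn worker_loop(notifier: Arc<LeaderBankNotifier>, exit: Arc<AtomicBool>) {
    while !exit.load(Ordering::Relaxed) {
        let Some(bank) = notifier.wait_for_in_progress(Duration::from_millis(50)) else {
            continue; // no leader bank right now; re-check the exit flag and wait again
        };
        process_batch(&bank); // placeholder for executing/committing a batch of txs
    }
}

fn process_batch(_bank: &Arc<Bank>) {
    // placeholder: execute and commit transactions against the bank
}
```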

@ryoqun (Member) commented Mar 1, 2023

> @ryoqun requesting an early review on this. You had a similar mechanism in your branch, which I had adapted for my own uses - using a condvar to wake threads up when they have a valid bank to commit to.
>
> LMK what you want/need for this so we can start using a common struct for it.

hehe, yeah, I had a similar struct called CommitStatus in my branch. Actually, I'm still not decided whether parking worker threads is the best idea during the gap of bank freezing... alternatively, signal a retry to the scheduler thread, and let the scheduler thread process incoming txs in search of a higher-prio tx while waiting for a new tpu bank.

That said, the impl itself looks good at a quick glance. As for the name, I'd prefer something which alludes to the blocking nature here (so I don't like CommitStatus either...).

@ryoqun (Member) commented Mar 1, 2023

this is just fyi, but, as for TransactionRecorder, its synchronization arrangement is too inefficient. namely, a one-shot channel isn't needed. I'm planning to fix it using this upstream change: crossbeam-rs/crossbeam#959

@apfitzge (Contributor Author) commented Mar 6, 2023

> alternatively, signal a retry to the scheduler thread, and let the scheduler thread process incoming txs in search of a higher-prio tx while waiting for a new tpu bank.

I've tried signaling for retry, and what I found was that the scheduler scheduled a bunch of work, most of which was not processed and was sent back, and then needed to be re-scheduled. That seemed inefficient, and I ended up trying a few different versions of waiting for the next bank; condvars seem the most reliable.

Not convinced it's 100% the best way, but right now it seems better than the other options I've tested and I'd like to move forward - "don't let perfect be the enemy of good" and all.
In terms of searching for a higher-prio tx while we wait, an option there is that we could potentially use skiplists instead of channels - that way we're essentially building dynamically updating queues for the threads. I have not tested this yet, however.

@apfitzge (Contributor Author) commented Mar 7, 2023

> this is just fyi, but, as for TransactionRecorder, its synchronization arrangement is too inefficient. namely, a one-shot channel isn't needed. I'm planning to fix it using this upstream change: crossbeam-rs/crossbeam#959

Could you elaborate on what you mean here? AFAICT TransactionRecorder doesn't use a oneshot right now, as it's got a loop to receive potentially multiple messages. 100% agree it's inefficient though.

Wasn't entirely clear what you meant in terms of changes with that crossbeam PR - is your desired synchronization in one of your branches?

@codecov bot commented Mar 13, 2023

Codecov Report

Merging #30395 (974cf7e) into master (5a05e9b) will increase coverage by 0.0%.
The diff coverage is 98.1%.

@@           Coverage Diff            @@
##           master   #30395    +/-   ##
========================================
  Coverage    81.5%    81.5%            
========================================
  Files         726      727     +1     
  Lines      204809   204971   +162     
========================================
+ Hits       167027   167197   +170     
+ Misses      37782    37774     -8     

@ryoqun (Member) commented Mar 15, 2023

> > this is just fyi, but, as for TransactionRecorder, its synchronization arrangement is too inefficient. namely, a one-shot channel isn't needed. I'm planning to fix it using this upstream change: crossbeam-rs/crossbeam#959
>
> Could you elaborate on what you mean here? AFAICT TransactionRecorder doesn't use a oneshot right now, as it's got a loop to receive potentially multiple messages. 100% agree it's inefficient though.

here, it's creating an unbounded() channel per batch (i.e. per transaction for my unified scheduler):

let (result_sender, result_receiver) = unbounded();
let res = self
    .record_sender
    .send(Record::new(mixin, transactions, bank_slot, result_sender));
if res.is_err() {
    // If the channel is dropped, then the validator is shutting down so return that we are hitting
    // the max tick height to stop transaction processing and flush any transactions in the pipeline.
    return Err(PohRecorderError::MaxHeightReached);
}
// Besides validator exit, this timeout should primarily be seen to affect test execution environments where the various pieces can be shutdown abruptly
let mut is_exited = false;
loop {
    let res = result_receiver.recv_timeout(Duration::from_millis(1000));

@ryoqun (Member) commented Mar 15, 2023

> Wasn't entirely clear what you meant in terms of changes with that crossbeam PR - is your desired synchronization in one of your branches?

nope, it's only in my mind and @behzadnouri's (he was faster than me to report this optimization opportunity to crossbeam: crossbeam-rs/crossbeam#861) ... lol

anyway, it's straightforward: we can remove the .recv_timeout entirely, along with the FUTEX_WAKE in .send in the quoted code from my previous comment in the worst case (it's likely to happen, considering poh's idling nature; so there is no amortization fluff applied here).

so, we can elide 2 syscalls per batch on the banking thread side and 1 syscall per batch on the poh thread side. worse, those syscalls must be serialized.... 🤮 I already confirmed this is a significant bottleneck according to my off-cpu profiling: https://github.com/solana-labs/solana/wiki/General-Debugging#perf-based-profiling

@ryoqun (Member) commented Mar 15, 2023

> The diff coverage is 98.8%.

btw, diff coverage is quite good. 💯

@apfitzge (Contributor Author) commented Mar 15, 2023

> > The diff coverage is 98.8%.
>
> btw, diff coverage is quite good. 💯

I missed testing the timeout case for both of the functions. Added tests which should bump this to 100% - and actually found a bug where, even on timeout, wait_for_in_progress still returned a bank!

@apfitzge apfitzge marked this pull request as ready for review March 15, 2023 23:19
@ryoqun (Member) commented Mar 20, 2023

> > > The diff coverage is 98.8%.
> >
> > btw, diff coverage is quite good. 💯
>
> I missed testing the timeout case for both of the functions. Added tests which should bump this to 100% - and actually found a bug where, even on timeout, wait_for_in_progress still returned a bank!

see? codecov has some value. :)

Resolved review threads (outdated): poh/src/poh_recorder.rs (6), core/src/replay_stage.rs (1)
@apfitzge (Contributor Author):

@ryoqun, I had to rebase due to some dependency audit failures unrelated to the change, which screwed up some of the commit links I replied with...

In addition to addressing your comments, I added an additional commit 42d778b to add appropriate assertions in set_completed, since it is now only called by the recorder - similar to our previous assertions for set_in_progress.
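As an illustration only (not the contents of 42d778b), such an assertion on the toy notifier sketched earlier could look like this:

```rust
// Hypothetical variant of set_completed on the toy notifier above, with the kind of
// assertion described here: the recorder should only clear a bank that was actually set.
pub fn set_completed(&self) {
    let mut state = self.state.lock().unwrap();
    assert!(
        matches!(*state, Status::InProgress(_)),
        "set_completed called while no leader bank was in progress"
    );
    *state = Status::StandBy;
    self.condvar.notify_all();
}
```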

Comment on lines 139 to 159

#[derive(Clone)]
pub struct TransactionRecorder {
    // shared by all users of PohRecorder
    pub record_sender: Sender<Record>,
    pub is_exited: Arc<AtomicBool>,
}

impl Clone for TransactionRecorder {
    fn clone(&self) -> Self {
        TransactionRecorder::new(self.record_sender.clone(), self.is_exited.clone())
    }
}

impl TransactionRecorder {
    pub fn new(record_sender: Sender<Record>, is_exited: Arc<AtomicBool>) -> Self {
        Self {
            // shared
            record_sender,
            // shared
            is_exited,
        }
@ryoqun (Member):

hmm, seems like a rebase miss... the manual Clone is revived unexpectedly...

@apfitzge (Contributor Author):

🤦 nice catch. I manually "reverted" my original derive by re-adding these manually.
That's what I get for not using git properly!

@apfitzge (Contributor Author):

Added a single commit to make your re-review simpler, which should do nothing once we squash everything on merge: b679dee
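For reference, the end state is presumably just the derived impl, since both fields are themselves Clone. A sketch (not necessarily the exact content of b679dee; the imports are assumptions, and Record is assumed to be the request type defined alongside TransactionRecorder):

```rust
use crossbeam_channel::Sender;
use std::sync::{atomic::AtomicBool, Arc};

// With both fields Clone, the derive alone is sufficient; no manual `impl Clone` is needed.
#[derive(Clone)]
pub struct TransactionRecorder {
    // shared by all users of PohRecorder
    pub record_sender: Sender<Record>,
    pub is_exited: Arc<AtomicBool>,
}
```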

ryoqun previously approved these changes Mar 25, 2023
@ryoqun (Member) left a comment:

lgtm with nits.

super hard work here. really appreciate you addressing all of my comments seriously.. :) Through that journey, I think the impl pivoting made sense.

At this point, I've thoroughly analyzed the code path and I think all of the newly added panic sources are safe.

There are a few nits still, but I think it's ok to merge as-is, unless you're as obsessed with the cleanest code possible as I am... xD

@apfitzge apfitzge requested a review from ryoqun March 25, 2023 23:47
@apfitzge (Contributor Author):

@ryoqun I pushed those 2 nits...

unless you're so obsessed with cleanest code possible like me... xD

I think you may have me pegged 😆

@ryoqun (Member) left a comment:

re-lgtm. thanks for the extra chores.

@apfitzge apfitzge merged commit a575ea2 into solana-labs:master Mar 27, 2023
@apfitzge apfitzge deleted the feature/leader_bank_status branch March 27, 2023 15:17
@ryoqun ryoqun mentioned this pull request May 8, 2023