Stabilize some banking stage tests #6251
| .map(|(_bank, (entry, _tick_height))| entry) | ||
| .collect(); | ||
|
|
||
| assert!(entries.verify(&blockhash)); |
This assertion should be moved inside the newly added if clause that guards against empty entries.
| vote_receiver, | ||
| ); | ||
| trace!("sending bank"); | ||
| sleep(Duration::from_millis(600)); |
Because poh_config's lifetime is controlled by target_tick_count, this sleep is no longer needed.
| bank.process_transactions(&entry.transactions) | ||
| .iter() | ||
| .for_each(|x| assert_eq!(*x, Ok(()))); | ||
| if !entries.is_empty() { |
Under high load, I observed empty entries being returned.
| if poh_config.target_tick_count.is_none() { | ||
| Self::sleepy_tick_producer(poh_recorder, &poh_config, &poh_exit_); | ||
| } else { | ||
| Self::short_lived_tick_producer(poh_recorder, &poh_config, &poh_exit_); |
Naming: alternatively, short_lived_sleepy_tick_producer (too long?)
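The split between the two producers can be pictured with a minimal sketch. This is not the code from the PR: run_tick_producer, its parameters, and the counter standing in for the PoH recorder are hypothetical names chosen for illustration, and only the idea of an optional target_tick_count comes from the diff above.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the two tick producers in the diff.
/// `ticks` plays the role of the PoH recorder: each tick just bumps a counter.
pub fn run_tick_producer(
    target_tick_count: Option<u64>,
    tick_interval: Duration,
    exit: &AtomicBool,
    ticks: &AtomicU64,
) {
    match target_tick_count {
        // Unbounded case: keep ticking until someone sets the exit flag.
        None => {
            while !exit.load(Ordering::Relaxed) {
                thread::sleep(tick_interval);
                ticks.fetch_add(1, Ordering::Relaxed);
            }
        }
        // Bounded case: emit at most `target` ticks, then return.
        // A test can therefore know exactly how many ticks were produced.
        Some(target) => {
            for _ in 0..target {
                if exit.load(Ordering::Relaxed) {
                    break;
                }
                thread::sleep(tick_interval);
                ticks.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
}
```

With a bounded count, a test no longer depends on how long the producer thread happens to run before the assertions execute.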
| ) { | ||
| let exit = Arc::new(AtomicBool::new(false)); | ||
| let poh_config = Arc::new(PohConfig::default()); | ||
| let poh_config = if let Some(poh_config) = poh_config { |
Hmm, I'm sure there is a more idiomatic way to write this in Rust.
Maybe let poh_config = Arc::new(poh_config.unwrap_or(PohConfig::default())) 😄
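A self-contained sketch of that suggestion follows. The PohConfig struct here is a hypothetical stand-in that models only target_tick_count, and resolve_poh_config is an illustrative name, not a function from the code base. Since the fallback is the Default value, Option::unwrap_or_default expresses the same thing as unwrap_or(PohConfig::default()) without constructing the default eagerly.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real PohConfig; only the field
/// discussed in this PR is modeled.
#[derive(Debug, Default, PartialEq)]
pub struct PohConfig {
    pub target_tick_count: Option<u64>,
}

/// Turn an optional config into a shared one, falling back to the default
/// when the caller passes None.
pub fn resolve_poh_config(poh_config: Option<PohConfig>) -> Arc<PohConfig> {
    Arc::new(poh_config.unwrap_or_default())
}
```

This replaces the if let Some(..) / else branching with a single expression.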
|
@ryoqun, thanks for taking this on! In the |
|
@carllin Thanks for checking this out! Sure thing, I'll elaborate on it! I'll work on finishing this up during tomorrow's working hours in JST. :)
|
@carllin Sorry for being late... I added the details of the race conditions. Hope that clarifies what I wanted to fix... I also updated this PR per your comments!
| Blocktree::open(&ledger_path).expect("Expected to be able to open database ledger"), | ||
| ); | ||
| let mut poh_config = PohConfig::default(); | ||
| poh_config.target_tick_count = Some(6); |
The purpose of this test is to check that the number of ticks output is genesis_block.ticks_per_slot, no matter how many extra ticks are sent by the PohService. For clarity, instead of the magic number 6, can we write this as genesis_block.ticks_per_slot + num_extra_ticks for some number num_extra_ticks?
You were right 😄 for the same reasons as you found in the other cases listed below, this should also use bank.max_tick_height() instead of genesis_block.ticks_per_slot for clarity and consistency
| ); | ||
| let mut poh_config = PohConfig::default(); | ||
| // limit the tick to 1 to prevent clearing working_bank at PohRecord then PohRecorderError(MaxHeightReached) at BankingStage | ||
| poh_config.target_tick_count = Some(1); |
Good catch, quite a subtle bug, especially for your first time through the code base :)
Instead of the magic number 1, can we use something like genesis_block.ticks_per_slot - 1, to make it clear that there's nothing special about this number? It just needs to be < genesis_block.ticks_per_slot.
Well, I couldn't directly use genesis_block.ticks_per_slot because subtracting 1 from it still causes the race condition, so I chose a different one: b07bf58
@ryoqun, ah yes, it's because of a special case where slot 0 generates one less tick than all the other slots. For instance, if genesis_block.ticks_per_slot is equal to 10, then the first slot will have a bank.max_tick_height of 9, so PoH will, starting at 0, generate a tick for tick heights 0->1, 1->2, 2->3, ..., 8->9, which is only 9 ticks. The next slot, slot 1, will then generate 10 ticks because its max tick height is 19, and it will generate a tick for tick heights 9->10, 10->11, ..., 18->19. This is all because in the past we wanted to keep the first tick in a slot 0-indexed.
This open PR here actually will address this issue: #6263, if you are curious and want to take a look :D
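The arithmetic in that explanation can be checked with a small sketch. The two functions are hypothetical helpers, not the bank's API; they assume the model described in the comment, where slot s ends at tick height (s + 1) * ticks_per_slot - 1, ticking starts at height 0, and ticks_per_slot is at least 1.

```rust
/// Assumed model from the comment above: the last tick height of `slot`.
pub fn max_tick_height(slot: u64, ticks_per_slot: u64) -> u64 {
    (slot + 1) * ticks_per_slot - 1
}

/// Number of ticks PoH generates while `slot` is the working slot.
/// Slot 0 starts at tick height 0; every later slot starts at the
/// previous slot's max tick height.
pub fn ticks_generated_in_slot(slot: u64, ticks_per_slot: u64) -> u64 {
    let start = if slot == 0 {
        0
    } else {
        max_tick_height(slot - 1, ticks_per_slot)
    };
    max_tick_height(slot, ticks_per_slot) - start
}
```

Under this model, with ticks_per_slot equal to 10, slot 0 yields 9 ticks and slot 1 yields 10. That is why a target of ticks_per_slot - 1 still reaches the max tick height in slot 0.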
Thanks for the very detailed explanation! I'll look at it. :D
| ); | ||
| let mut poh_config = PohConfig::default(); | ||
| // limit the tick to 1 to prevent clearing working_bank at PohRecord then PohRecorderError(MaxHeightReached) at BankingStage | ||
| poh_config.target_tick_count = Some(1); |
Instead of the magic number 1, can we use something like genesis_block.ticks_per_slot - 1, to make it clear that there's nothing special about this number? It just needs to be < genesis_block.ticks_per_slot.
Well, I couldn't directly use genesis_block.ticks_per_slot because subtracting 1 from it still causes the race condition, so I chose a different one: b07bf58
|
@ryoqun, looking good! Just a few nits, and a little more plumbing to make CI pass :) |
| } | ||
|
|
||
| #[test] | ||
| #[ignore] |
Codecov Report
@@            Coverage Diff            @@
##           master    #6251     +/-   ##
=========================================
- Coverage    77.9%    64.4%   -13.6%
=========================================
  Files         216      216
  Lines       40601    49205    +8604
=========================================
+ Hits        31654    31717      +63
- Misses       8947    17488    +8541
|
@carllin - does this look good to land? It's green |
Problem
Various banking stage tests are fragile.
Summary of Changes
Fixed several race conditions in tests.
This fixes 3 unstable unit tests. For all of them, the root cause of the instability is a race condition between the solana-poh-service-tick_producer thread and the solana-banking-stage-tx thread. So, to precisely control the timing of the solana-poh-service-tick_producer thread, I introduced a new PohConfig field just for testing purposes.
The requested explanations for each test follow (in simplest-to-hardest order):
test_banking_stage_tick
test_banking_stage_entryfication (originally mentioned by the issue)
test_banking_stage_entries_only
test_banking_stage_entryfication has additional issues. First, the assertion may run before the banking stage finishes. Second, the wall time resulting from for _ in 0..10 + sleep(Duration::from_millis(200)); may not be enough. The fix moves banking_stage.join().unwrap(); up in the test code and replaces the 10-time repetition with an unbounded loop, with a guard for empty entries from entry_receiver.
(Starting from #5660, I ended up spotting other similar race conditions in neighboring tests, so I fixed them as well.)
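The join-then-drain pattern described here can be sketched with the standard library alone. This is an illustration under stated assumptions, not the test code from the PR: collect_after_join is a hypothetical name, the producer thread stands in for the banking stage, and batches of u64 stand in for entries.

```rust
use std::sync::mpsc::channel;
use std::thread;

/// Hypothetical sketch: a producer thread sends batches (some possibly
/// empty), the caller joins it first, then drains the channel without a
/// fixed retry count or sleep.
pub fn collect_after_join(batches: Vec<Vec<u64>>) -> Vec<u64> {
    let (sender, receiver) = channel();

    let producer = thread::spawn(move || {
        for batch in batches {
            sender.send(batch).unwrap();
        }
        // `sender` is dropped here, which closes the channel.
    });

    // Join before asserting anything, so all output is already queued.
    producer.join().unwrap();

    let mut collected = Vec::new();
    // Unbounded loop: recv() returns Err only once the channel is both
    // closed and fully drained, so no wall-clock guess is needed.
    while let Ok(batch) = receiver.recv() {
        // Guard against empty batches, which can appear under load.
        if !batch.is_empty() {
            collected.extend(batch);
        }
    }
    collected
}
```

Because termination is tied to the channel closing rather than to ten 200 ms sleeps, the result does not depend on machine load.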
Fixes #5660