Banking_stage drops TooOld vote transactions #19152
tao-stones wants to merge 3 commits into solana-labs:master from
Yeah, iirc on our last call the actions we thought to take were:
|
Yea, I am exploring whether there are lightweight options to detect "old" votes without having to invoke the validator's VoteState entirely. Maybe banking_stage keeps a HashMap<validator_key, its_latest_voted_slot>; any vote from a validator whose last vote is earlier than |
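A minimal sketch of the HashMap idea floated above, using only the standard library. The type aliases and the `LatestVotedSlots` name are hypothetical stand-ins chosen for illustration; they are not the types used in the PR.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real key and slot types.
type ValidatorKey = [u8; 32];
type Slot = u64;

/// Remembers the highest slot each validator has voted on so far.
#[derive(Default)]
struct LatestVotedSlots {
    latest: HashMap<ValidatorKey, Slot>,
}

impl LatestVotedSlots {
    /// Returns true (and records the slot) if the vote is newer than anything
    /// seen from this validator; returns false if the vote could be dropped.
    fn observe(&mut self, validator: ValidatorKey, last_voted_slot: Slot) -> bool {
        let seen = self.latest.get(&validator).copied();
        match seen {
            Some(slot) if last_voted_slot <= slot => false,
            _ => {
                self.latest.insert(validator, last_voted_slot);
                true
            }
        }
    }
}
```

This only filters by the order in which votes are observed; it says nothing about the order in which they are later executed.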
|
I ran today PR build in @carllin @sakridge What do you feel about this approach and result? |
@taozhu-chicago, what is the metric the first graph is measuring? Do we have a metric for measuring |
    vote_redundancy_checker
        .write()
        .unwrap()
        .check(&tx, bank)
        .ok()?;
Hmm, I think there's a race here. Even if the vote_redundancy_checker sees slots 1->2->3->4 in order, the execution could still be out of order, because ultimately what matters is the order in which these banking threads hit process_and_record_transactions_locked() and grab the locks for the accounts.
This implies this check should probably happen after we've acquired the account locks.
Super! Thanks for the important point - vote check should be done after bank is locked, and redundant votes are not to be submitted to bank.
On second thought, if banking_stage tracks the last voted slot, and we assume all transactions it lets through will eventually land in blockstore, it should be able to drop votes for older slots (as implemented as solution 2 at https://github.com/solana-labs/solana/pull/19152/files#diff-bb8464c2f02c8863fb60e6e29f023f3e6da93a93fbb9199a70a44c9c1b41c7f3R1103-R1118 ). Do you think so?
Hmm, this still seems to run into the same issue. I can have two threads, one that gets a vote for slot 1 and one that gets a vote for slot 2 from the same validator. They insert the slots in order into latest_voted_slots, but the thread with slot 2 processes its vote transaction first. Then the thread adding slot 1 will incur the TooOld error when it tries to process the transaction.
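The interleaving described above can be replayed deterministically without real threads. This is a sketch under simplifying assumptions: `LandedVotes` is a hypothetical, heavily reduced stand-in for the on-chain vote state, and a vote counts as too old whenever it is not newer than the last landed vote.

```rust
type Slot = u64;

/// Hypothetical, simplified stand-in for the on-chain vote state.
#[derive(Default)]
struct LandedVotes {
    last_landed: Option<Slot>,
}

#[derive(Debug, PartialEq)]
enum VoteResult {
    Landed,
    TooOld,
}

impl LandedVotes {
    /// Rejects a vote unless it is newer than the last vote that landed.
    fn process(&mut self, slot: Slot) -> VoteResult {
        let last = self.last_landed;
        match last {
            Some(l) if slot <= l => VoteResult::TooOld,
            _ => {
                self.last_landed = Some(slot);
                VoteResult::Landed
            }
        }
    }
}

/// Replays votes in the order the threads actually acquired the account
/// locks, which may differ from the order in which they passed the filter.
fn execute_in_order(execution_order: &[Slot]) -> Vec<VoteResult> {
    let mut state = LandedVotes::default();
    execution_order.iter().map(|&slot| state.process(slot)).collect()
}
```

Both votes pass an arrival-order filter when they arrive as 1 then 2, yet `execute_in_order(&[2, 1])` still yields a `TooOld` result for slot 1, which is the case a pre-lock check cannot rule out.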
Right! "Solution 1", in which banking_stage checks votes without involving the bank, inherently has a chance of passing old votes to the bank for processing, though the likelihood is largely reduced (by filtering out-of-order votes received by banking_stage). The benefit of this approach, as mentioned in the Summary of Changes, is that it doesn't require a bank, which aligns with the bank-less leader goal.
The other solution implemented in this PR (https://github.com/solana-labs/solana/pull/19152/files#diff-ed47b4a0198313377e091bb3957bbbc63d937805426d1b2b6de39d0a50d32a0cR3461) checks votes in the bank without the race condition, but my concern is whether this approach will be invalidated when implementing the bank-less leader (I haven't thought it through, just a general concern).
In the end, we only need one of these two implementations in this PR. The debate is "which one" :)
A few things:
- Doesn't the bankless leader design also require access to a bank in order to figure out whether an account has sufficient funds to pay the transaction fee?
- It seems like the stateless VoteRedundancyChecker could theoretically also be used in solution 2 after we grab the locks?
- If we go with the VoteRedundancyChecker, I also think we may need a separate VoteRedundancyChecker per bank because of forking. For example, if you had

      1
     / \
    2   3

  a vote for slot 1 may have landed in the bank for slot 2, but not in the bank for slot 3. If you were to check the account state for slots 2 and 3, though, this is not a problem, because the account fetch takes into account the fork on which the bank resides.
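The forking point can be illustrated with a toy fork tree. This is a sketch, assuming a hypothetical `Forks` structure in which each bank knows its parent and the votes that landed in it; real account lookups are far more involved.

```rust
use std::collections::{HashMap, HashSet};

type Slot = u64;

/// Hypothetical toy fork tree: each bank records its parent and the vote
/// slots that landed in that bank itself.
#[derive(Default)]
struct Forks {
    parent: HashMap<Slot, Slot>,
    landed_votes: HashMap<Slot, HashSet<Slot>>,
}

impl Forks {
    fn add_bank(&mut self, slot: Slot, parent: Option<Slot>) {
        if let Some(p) = parent {
            self.parent.insert(slot, p);
        }
        self.landed_votes.entry(slot).or_default();
    }

    fn land_vote(&mut self, bank_slot: Slot, voted_slot: Slot) {
        self.landed_votes.entry(bank_slot).or_default().insert(voted_slot);
    }

    /// A vote is visible from `bank_slot` if it landed in that bank or in
    /// one of its ancestors, never if it only landed in a sibling fork.
    fn vote_visible(&self, bank_slot: Slot, voted_slot: Slot) -> bool {
        let mut current = Some(bank_slot);
        while let Some(slot) = current {
            let landed_here = self
                .landed_votes
                .get(&slot)
                .map_or(false, |votes| votes.contains(&voted_slot));
            if landed_here {
                return true;
            }
            current = self.parent.get(&slot).copied();
        }
        false
    }
}
```

With banks 2 and 3 both children of 1, a vote for slot 1 landed in bank 2 is visible from bank 2 but not from bank 3, so a single fork-unaware tracker would wrongly treat it as already landed on the other fork.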
- Yeah, we will need the bank to check the account balance; it would be great if it were simply in-and-out for that purpose only. But I am not completely sure, and I'm not opposed to doing the check after the account lock at all.
- Yep, the Bank solution currently does the same thing VoteRedundancyChecker does.
- This is another point for using the bank-based solution, so the current implementation (https://github.com/solana-labs/solana/pull/19152/files#diff-ed47b4a0198313377e091bb3957bbbc63d937805426d1b2b6de39d0a50d32a0cR3117) covers the forking scenario, I think.
Answering these questions, I started to see the benefit of the bank-based solution. I'll remove the banking_stage-based code to make the PR easier to review.
Sorry, I was sloppy here. The first metric is the count of verified transactions (https://github.com/solana-labs/solana/pull/19152/files#diff-bb8464c2f02c8863fb60e6e29f023f3e6da93a93fbb9199a70a44c9c1b41c7f3L1069-L1072); if redundant votes are dropped, this counter should drop too. The second metric is the time it takes to convert packets (https://github.com/solana-labs/solana/pull/19152/files#diff-bb8464c2f02c8863fb60e6e29f023f3e6da93a93fbb9199a70a44c9c1b41c7f3L1163); with fewer transactions to convert, this metric should go down too. |
5357d5c to f1574f8
c952156 to a951c88
|
Settled on checking votes using the current bank's VoteState, after accounts are locked and right before TXs are loaded and executed. Still working on manipulating test bank instances to test the redundancy check function, but the implementation is ready for review. Ran a testnet cluster; the metrics show almost half of the votes were identified as redundant and dropped (orange is the vote TX count, blue is the dropped vote TX count): |
e664c9b to 8b7a37a
|
rebased master |
… vote_state; 2. banking_stage calls above function after accounts were locked, and before calling bank.load_and_execute_transactions, redundant vote transactions are marked as TransactionError::ProcessedAlready in result;
8b7a37a to ba832a9
|
The PR is consistently failing on the CI local-cluster test. I traced the test by log and noticed there are mainly two types of vote tx being dropped:
Could any of these cases negatively impact consensus? @carllin |
Hmmm dup vote shouldn't affect consensus, since it only needs one copy to be ingested
I would be worried if the older, dropped vote had slots that were not present in the processed vote, but the old vote here is a strict subset of the newer vote, so all the slot information should have landed. It seems like certain validators' towers aren't making progress with this change. It would be helpful to log the vote state that has landed in a bank on each bank in the validator: Line 223 in 8ad52fa
Goal here is to see if roots are making progress. Then compare to the local tower of each validator with that Line 369 in 8ad52fa
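The strict-subset condition mentioned above could be checked along these lines when comparing a dropped vote against the vote that was processed. This is a sketch; `is_strict_subset` is a hypothetical helper, and votes are reduced to plain lists of slots.

```rust
use std::collections::BTreeSet;

type Slot = u64;

/// True if every slot in `old_vote` also appears in `new_vote` and
/// `new_vote` carries at least one additional slot.
fn is_strict_subset(old_vote: &[Slot], new_vote: &[Slot]) -> bool {
    let old: BTreeSet<Slot> = old_vote.iter().copied().collect();
    let new: BTreeSet<Slot> = new_vote.iter().copied().collect();
    old.is_subset(&new) && old.len() < new.len()
}
```

A dropped vote that fails this check would contain slot information that never landed, which is the worrying case described in the comment.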
|
|
Thanks @carllin ! It makes good sense and super helpful. I'll do your suggestions -- helps me to understand consensus much more too. |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
|
This stale pull request has been automatically closed. Thank you for your contributions. |



Problem
A lot of vote transactions are transmitted and replayed only to fail at the instruction_process stage. To improve efficiency, the leader could leave TooOld vote transactions out of the block.
Summary of Changes
- check_redundant_votes function to check vote by current bank's vote_state.
- banking_stage calls above function after accounts were locked, and before calling bank.load_and_execute_transactions; redundant vote transactions are marked as TransactionError::ProcessedAlready in result.
- bank.rs is unchanged, no impact to replay_stage

Fixes #17877
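For orientation, the check described in the summary (run after the accounts are locked and before execution) might look roughly like the sketch below. Everything here is an assumption made for illustration: `VoteTx`, `PreCheck`, and `mark_redundant_votes` are hypothetical names, the bank's vote state is reduced to a map from vote account to last landed slot, and "redundant" is taken to mean "not newer than what the bank already holds". It is not the PR's actual code.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins, not the PR's actual types.
type VoteAccount = u8;
type Slot = u64;

struct VoteTx {
    vote_account: VoteAccount,
    last_voted_slot: Slot,
}

#[derive(Debug, PartialEq)]
enum PreCheck {
    Execute,
    Redundant,
}

/// Intended to run once the batch's accounts are locked: a vote is marked
/// redundant if the bank already holds an equal or newer vote for that account.
fn mark_redundant_votes(
    batch: &[VoteTx],
    landed_in_bank: &HashMap<VoteAccount, Slot>,
) -> Vec<PreCheck> {
    batch
        .iter()
        .map(|tx| match landed_in_bank.get(&tx.vote_account) {
            Some(&landed) if tx.last_voted_slot <= landed => PreCheck::Redundant,
            _ => PreCheck::Execute,
        })
        .collect()
}
```

Because the comparison reads the state of the specific bank being built, it is fork-aware and is not subject to the ordering race discussed in the review thread.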