-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Several changes to improve Consensus stability: #4505
Conversation
1e9cd9c
to
bb74967
Compare
@HowardHinnant @ChronusZ I just pushed a change and rebased it -- I hadn't seen that you guys were reviewiing it yet before pushing. If you haven't started looking at the PR, nevermind. If you have started looking at it, reload the latest. I won't make any more changes except as part of the review. What I added was something I though I had added before, but didn't, and I caught it in testing today. It doesn't change the overall scope of the project. |
haveCloseTimeConsensus_ = true; | ||
// Make sure that the winning close time is the one | ||
// that propagates to the rest of the function. | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is needed? Can two times have more than threshConsensus votes? Why not have a fixed rule in that case that the large votes wins or the large time if vots are equal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left an explanation in the code about the change, from lines 1740-8.
std::size_t minCONSENSUS_PCT = 80; | ||
float minCONSENSUS_FACTOR = static_cast<float>(minCONSENSUS_PCT / 100.0f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not good i think. Floating points are bad. You should use the std::ratio<80, 100>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic has always been a floating point. The change I made calculates it into a variable so that it can be re-used in the code. Anyway, I've never heard of std::ratio until right now and prefer not to re-implement perfectly functioning consensus code as my first use of it.
public: | ||
std::atomic<bool> validating_{false}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I return validating_
back to private, this commit still compiles for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments that I think need to be addressed before approval. But I didn't see any showstoppers. Looks good!
I'm getting a repeatable unit test error:
I only get the error when I run all of the unit tests:
I don't get it when I run the test by itself:
There's probably a global setting that a previous test is setting and not resetting back. |
I've confirmed that the assert is introduced with bb74967. It doesn't occur prior to this commit and does afterwards. I didn't detect it previously because I wasn't running with assert enabled. |
I confirm this. Funny, the stack trace is in Consensus_test.cpp, which comes right after the "conditions" test. It makes sense for this assertion to come as a result of Consensus. I'm not sure why the crash happened during conditions, but anyway, I'll figure it out. Thanks for finding it. Going forward, I'll do debug builds locally to find this in the future. Recently, I had been doing release builds locally once we moved to Conan, and it didn't occur to me to make the build debug. |
I just recently learned to add this: |
Fixed, and all unittests pass with no assertions. |
@ChronusZ will you be able to review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you remove the #include
s that I flagged, it resolves any spurious circular dependencies that were introduced, but you'll still need to run ./Builds/levelization/levelization.sh
and commit the changes to ordering.txt
, since new dependencies were introduced to make the levelization check job happy.
* Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not. * Always store proposals. * Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network. * Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up. * Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up. * Fix impasse achieving close time consensus. * Don't wait between open and establish phases.
* Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not. * Always store proposals. * Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network. * Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up. * Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up. * Fix impasse achieving close time consensus. * Don't wait between open and establish phases.
* Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not. * Always store proposals. * Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network. * Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up. * Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up. * Fix impasse achieving close time consensus. * Don't wait between open and establish phases.
This reverts commit e8a7b2a.
Since this was reverted in 1.12.0-rc3, this PR is being reopened for additional review, testing, and potential improvements, with the expectation that it will be merged again soon (e.g. for the 1.13 or 2.0 release). |
in progress: considering a txThreshold so new behavior is used under high volume. change would likely be pushed to this |
* Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not. * Always store proposals. * Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network. * Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up. * Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up. * Fix impasse achieving close time consensus. * Don't wait between open and establish phases.
This reverts commit e8a7b2a.
* Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not. * Always store proposals. * Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network. * Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up. * Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up. * Fix impasse achieving close time consensus. * Don't wait between open and establish phases.
* Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not. * Always store proposals. * Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network. * Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up. * Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up. * Fix impasse achieving close time consensus. * Don't wait between open and establish phases.
This reverts commit e8a7b2a.
* Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not. * Always store proposals. * Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network. * Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up. * Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up. * Fix impasse achieving close time consensus. * Don't wait between open and establish phases.
This reverts commit f259cc1.
* Revert "Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes (#4760)" This reverts commit 8ce85a9. * Revert "Several changes to improve Consensus stability: (#4505)" This reverts commit f259cc1. * Add missing include --------- Co-authored-by: seelabs <[email protected]>
This reverts commit e8a7b2a.
* Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not. * Always store proposals. * Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network. * Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up. * Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up. * Fix impasse achieving close time consensus. * Don't wait between open and establish phases.
* Revert "Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes (XRPLF#4760)" This reverts commit 8ce85a9. * Revert "Several changes to improve Consensus stability: (XRPLF#4505)" This reverts commit f259cc1. * Add missing include --------- Co-authored-by: seelabs <[email protected]>
This PR is related to 2 others that, when combined, increase throughput significantly:
#4503
#4504
High Level Overview of Change
There are several changes to improve consensus stability--namely, helping nodes to stay in sync as transaction volume increases. Analysis under high load shows that rippled often falls out of sync and is unable to get back into sync with the rest of the network. Namely, once rippled falls out of sync, it then attempts to acquire SHAMaps for the subsequent ledgers. As volume increases, this becomes more difficult. Nodes can fall out of sync for a variety of reasons--often having to do with intermittent slowdowns that occur as volume increases, such as garbage collection, network processing delays, and various others. Instead of fixing those issues (which should be addressed separately), this patch focuses on making rippled more robust in the event of intermittent delays. This patch has several fixes that keep a node in sync even if it falls slightly behind the rest of the network.
From the commit message:
Don't wait between open and establish phases.
Context of Change
Analysis under high volume shows that individual peers sometimes come to consensus on a position that the rest of the network discards in favor of another. This causes rippled to build a ledger that never becomes validated and falls out of sync. This could be caused by several peers changing their position but which is not received by another peer until after it comes to consensus on the wrong position. The fix for this is to essentially check if the newly built ledger has become the validated ledger. If not, look for a newer position to possibly adopt, and then retry. This step often introduces a delay in the accept phase. To counteract this, the patch measures the time delayed this way and then factors it into the open phase interval timing.
Essentially, this is implemented by splitting the the current doAccept() function into 2 functions:
buildAndValidate() and prepareOpenLedger(). Once buildAndValidate() produces a confirmed validated ledger, prepareOpenLedger() finishes up the accept phase and then launches a new consensus round. This relies upon calling updateOurPositions() in case we've built something that needs to be discarded. Care is taken to only send a single validation, and to send no new proposals after the initial decision to build and validate a ledger. This is because sending conflicting messages like this to the rest of the network could cause problems with consensus as a whole. However, it's OK to abandon a position in favor of something that actually achieves network consensus.
To illustrate, take a network with 5 validators: A,B,C,D,E. A sees proposal z from B,C,D that it agrees with and then builds a ledger. However, B,C,D changed their minds to agree with E on proposal y. With the fix, A sees that z never becomes validated and discovers that y instead is more popular. It retries with y and stays in sync with the network.
This effort has been made complex by the abstraction in the code distinguishing Consensus in an abstract sense from the XRP Ledger implementation (called RCLConsensus). This abstraction is intended allow other products to be built using the same consensus mechanism but with such things as different transaction formats. There only implementation like this is a test implementation within the rippled codebase. Note that this is distinct from other unit tests that test consensus. There is no real world use of this abstraction. This adds complexity because these 2 classes that have to interact very closely with each other. The main design implication of this has to do with the current doAccept() existing as part of RCLConsensus. However, the new implementation requires buildAndValidate()'s parameters to possibly change based on the result of updateOurPositions(). This essentially means that buildAndValidate() is now in Consensus while the remainder (prepareOpenLedger()) stays in RCLConsensus. However, all existing tests and functionality still work, including the abstraction.
Always store proposals.
For example, rippled currently doesn't consider proposals received during the accept phase. But this is necessary for changing the ledger that we need to work on. On a similar note, transaction sets are also discarded during the accept phase.
Fix impasse achieving close time consensus.
A corner case was discovered in the existing code where close time consensus never is reached. The fix is described in updateOurPositions() in Consensus.h.
The remaining fixes are related to the same thing. Namely, rippled currently doesn't track peer proposals robustly enough to stay in sync after a slight delay:
Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network.
Currently, proposals are stored in a map by peer nodeid -> deque. We know the ledger upon which the proposal is based by its hash, but not necessarily by ledger sequence. The fix implements an optional new field in proposal messages to assign the ledger sequence. This allows us to know exactly which proposals correspond to ledger sequence.
Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up.
Related to this is that we now acquire transaction sets for each proposal.
After an intermittent slowdown, we will already have what are now current proposals and transaction sets.
Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up.
The establish phase takes a minimum of 1950ms. This is to allow the whole network to spend plenty of time reconciling transaction sets.. However, if peers have already been sending proposals for at least 1950ms, then it means that we are probably lagging behind. Therefore, we should try to find consensus right now. This is implemented by placing a timestamp on proposals as received. Then, a heuristic in phaseEstablish() determines the start of the countdown to 1950ms based on proposals received vs when we first adopted a position. There is protection against byzantine validators. The implementation comments provide precise explanation.
Don't wait between open and establish phases.
There's a 1s pause for the heartbeat timer once the ledger is closed. This is no good for a peer that is falling behind.
Type of Change