Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Several changes to improve Consensus stability: #4505

Merged
merged 2 commits into from
Sep 11, 2023

Conversation

mtrippled
Copy link
Collaborator

@mtrippled mtrippled commented Apr 21, 2023

  • Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not.
  • Always store proposals.
  • Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network.
  • Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up.
  • Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up.
  • Fix impasse achieving close time consensus.
  • Don't wait between open and establish phases.

This PR is related to 2 others that, when combined, increase throughput significantly:
#4503
#4504

High Level Overview of Change

There are several changes to improve consensus stability--namely, helping nodes to stay in sync as transaction volume increases. Analysis under high load shows that rippled often falls out of sync and is unable to get back into sync with the rest of the network. Namely, once rippled falls out of sync, it then attempts to acquire SHAMaps for the subsequent ledgers. As volume increases, this becomes more difficult. Nodes can fall out of sync for a variety of reasons--often having to do with intermittent slowdowns that occur as volume increases, such as garbage collection, network processing delays, and various others. Instead of fixing those issues (which should be addressed separately), this patch focuses on making rippled more robust in the event of intermittent delays. This patch has several fixes that keep a node in sync even if it falls slightly behind the rest of the network.

From the commit message:

  • Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not.
  • Always store proposals.
  • Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network.
  • Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up.
  • Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up.
  • Fix impasse achieving close time consensus.
    Don't wait between open and establish phases.

Context of Change

  • Verify accepted ledger becomes validated, and retry with a new consensus transaction set if not:

Analysis under high volume shows that individual peers sometimes come to consensus on a position that the rest of the network discards in favor of another. This causes rippled to build a ledger that never becomes validated and falls out of sync. This could be caused by several peers changing their position but which is not received by another peer until after it comes to consensus on the wrong position. The fix for this is to essentially check if the newly built ledger has become the validated ledger. If not, look for a newer position to possibly adopt, and then retry. This step often introduces a delay in the accept phase. To counteract this, the patch measures the time delayed this way and then factors it into the open phase interval timing.

Essentially, this is implemented by splitting the the current doAccept() function into 2 functions:
buildAndValidate() and prepareOpenLedger(). Once buildAndValidate() produces a confirmed validated ledger, prepareOpenLedger() finishes up the accept phase and then launches a new consensus round. This relies upon calling updateOurPositions() in case we've built something that needs to be discarded. Care is taken to only send a single validation, and to send no new proposals after the initial decision to build and validate a ledger. This is because sending conflicting messages like this to the rest of the network could cause problems with consensus as a whole. However, it's OK to abandon a position in favor of something that actually achieves network consensus.

To illustrate, take a network with 5 validators: A,B,C,D,E. A sees proposal z from B,C,D that it agrees with and then builds a ledger. However, B,C,D changed their minds to agree with E on proposal y. With the fix, A sees that z never becomes validated and discovers that y instead is more popular. It retries with y and stays in sync with the network.

This effort has been made complex by the abstraction in the code distinguishing Consensus in an abstract sense from the XRP Ledger implementation (called RCLConsensus). This abstraction is intended allow other products to be built using the same consensus mechanism but with such things as different transaction formats. There only implementation like this is a test implementation within the rippled codebase. Note that this is distinct from other unit tests that test consensus. There is no real world use of this abstraction. This adds complexity because these 2 classes that have to interact very closely with each other. The main design implication of this has to do with the current doAccept() existing as part of RCLConsensus. However, the new implementation requires buildAndValidate()'s parameters to possibly change based on the result of updateOurPositions(). This essentially means that buildAndValidate() is now in Consensus while the remainder (prepareOpenLedger()) stays in RCLConsensus. However, all existing tests and functionality still work, including the abstraction.

  • Always store proposals.
    For example, rippled currently doesn't consider proposals received during the accept phase. But this is necessary for changing the ledger that we need to work on. On a similar note, transaction sets are also discarded during the accept phase.

  • Fix impasse achieving close time consensus.
    A corner case was discovered in the existing code where close time consensus never is reached. The fix is described in updateOurPositions() in Consensus.h.

The remaining fixes are related to the same thing. Namely, rippled currently doesn't track peer proposals robustly enough to stay in sync after a slight delay:

  • Track proposals by ledger sequence. This helps slow peers catch up with the rest of the network.
    Currently, proposals are stored in a map by peer nodeid -> deque. We know the ledger upon which the proposal is based by its hash, but not necessarily by ledger sequence. The fix implements an optional new field in proposal messages to assign the ledger sequence. This allows us to know exactly which proposals correspond to ledger sequence.

  • Acquire transaction sets for proposals with future ledger sequences. This also helps slow peers catch up.
    Related to this is that we now acquire transaction sets for each proposal.

After an intermittent slowdown, we will already have what are now current proposals and transaction sets.

  • Optimize timer delay for establish phase to wait based on how long validators have been sending proposals. This also helps slow peers to catch up.
    The establish phase takes a minimum of 1950ms. This is to allow the whole network to spend plenty of time reconciling transaction sets.. However, if peers have already been sending proposals for at least 1950ms, then it means that we are probably lagging behind. Therefore, we should try to find consensus right now. This is implemented by placing a timestamp on proposals as received. Then, a heuristic in phaseEstablish() determines the start of the countdown to 1950ms based on proposals received vs when we first adopted a position. There is protection against byzantine validators. The implementation comments provide precise explanation.

  • Don't wait between open and establish phases.
    There's a 1s pause for the heartbeat timer once the ledger is closed. This is no good for a peer that is falling behind.

Type of Change

  • New feature (non-breaking change which adds functionality)

@mtrippled
Copy link
Collaborator Author

@HowardHinnant @ChronusZ I just pushed a change and rebased it -- I hadn't seen that you guys were reviewiing it yet before pushing. If you haven't started looking at the PR, nevermind. If you have started looking at it, reload the latest. I won't make any more changes except as part of the review. What I added was something I though I had added before, but didn't, and I caught it in testing today. It doesn't change the overall scope of the project.

src/ripple/consensus/Consensus.h Outdated Show resolved Hide resolved
src/ripple/app/consensus/RCLConsensus.h Outdated Show resolved Hide resolved
src/ripple/app/consensus/RCLConsensus.h Show resolved Hide resolved
src/ripple/consensus/Consensus.h Show resolved Hide resolved
src/ripple/consensus/Consensus.h Outdated Show resolved Hide resolved
src/ripple/consensus/Consensus.h Outdated Show resolved Hide resolved
src/ripple/consensus/Consensus.h Show resolved Hide resolved
haveCloseTimeConsensus_ = true;
// Make sure that the winning close time is the one
// that propagates to the rest of the function.
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is needed? Can two times have more than threshConsensus votes? Why not have a fixed rule in that case that the large votes wins or the large time if vots are equal?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left an explanation in the code about the change, from lines 1740-8.

std::size_t minCONSENSUS_PCT = 80;
float minCONSENSUS_FACTOR = static_cast<float>(minCONSENSUS_PCT / 100.0f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not good i think. Floating points are bad. You should use the std::ratio<80, 100>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic has always been a floating point. The change I made calculates it into a variable so that it can be re-used in the code. Anyway, I've never heard of std::ratio until right now and prefer not to re-implement perfectly functioning consensus code as my first use of it.

public:
std::atomic<bool> validating_{false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I return validating_ back to private, this commit still compiles for me.

Copy link
Contributor

@HowardHinnant HowardHinnant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments that I think need to be addressed before approval. But I didn't see any showstoppers. Looks good!

@HowardHinnant
Copy link
Contributor

I'm getting a repeatable unit test error:

ripple.conditions.PreimageSha256 Known Vectors
Assertion failed: (id != result_->position.position()), function gotTxSet, file Consensus.h, line 992.
Abort trap: 6

I only get the error when I run all of the unit tests:

rippled --unittest

I don't get it when I run the test by itself:

rippled --unittest=conditions

There's probably a global setting that a previous test is setting and not resetting back.

@HowardHinnant
Copy link
Contributor

I've confirmed that the assert is introduced with bb74967. It doesn't occur prior to this commit and does afterwards. I didn't detect it previously because I wasn't running with assert enabled.

@mtrippled
Copy link
Collaborator Author

I've confirmed that the assert is introduced with bb74967. It doesn't occur prior to this commit and does afterwards. I didn't detect it previously because I wasn't running with assert enabled.

I confirm this. Funny, the stack trace is in Consensus_test.cpp, which comes right after the "conditions" test. It makes sense for this assertion to come as a result of Consensus. I'm not sure why the crash happened during conditions, but anyway, I'll figure it out. Thanks for finding it. Going forward, I'll do debug builds locally to find this in the future. Recently, I had been doing release builds locally once we moved to Conan, and it didn't occur to me to make the build debug.

@HowardHinnant
Copy link
Contributor

I just recently learned to add this: -Dassert=ON after my -DCMAKE_BUILD_TYPE=Release, so I get release speed but with assert checking.

@mtrippled
Copy link
Collaborator Author

I'm getting a repeatable unit test error:

ripple.conditions.PreimageSha256 Known Vectors
Assertion failed: (id != result_->position.position()), function gotTxSet, file Consensus.h, line 992.
Abort trap: 6

I only get the error when I run all of the unit tests:

rippled --unittest

I don't get it when I run the test by itself:

rippled --unittest=conditions

There's probably a global setting that a previous test is setting and not resetting back.

Fixed, and all unittests pass with no assertions.

@intelliot
Copy link
Collaborator

@ChronusZ will you be able to review this?

Copy link
Collaborator

@ximinez ximinez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove the #includes that I flagged, it resolves any spurious circular dependencies that were introduced, but you'll still need to run ./Builds/levelization/levelization.sh and commit the changes to ordering.txt, since new dependencies were introduced to make the levelization check job happy.

src/ripple/proto/ripple.proto Outdated Show resolved Hide resolved
src/ripple/consensus/Consensus.h Show resolved Hide resolved
src/ripple/consensus/Consensus.h Show resolved Hide resolved
src/ripple/app/consensus/RCLConsensus.cpp Outdated Show resolved Hide resolved
src/ripple/overlay/impl/PeerImp.cpp Outdated Show resolved Hide resolved
src/ripple/overlay/impl/PeerImp.cpp Show resolved Hide resolved
src/ripple/overlay/impl/PeerImp.cpp Outdated Show resolved Hide resolved
src/test/csf/Peer.h Outdated Show resolved Hide resolved
src/ripple/consensus/Consensus.h Outdated Show resolved Hide resolved
ximinez pushed a commit to manojsdoshi/rippled that referenced this pull request Aug 19, 2023
 * Verify accepted ledger becomes validated, and retry
   with a new consensus transaction set if not.
 * Always store proposals.
 * Track proposals by ledger sequence. This helps slow peers catch
   up with the rest of the network.
 * Acquire transaction sets for proposals with future ledger sequences.
   This also helps slow peers catch up.
 * Optimize timer delay for establish phase to wait based on how
   long validators have been sending proposals. This also helps slow
   peers to catch up.
 * Fix impasse achieving close time consensus.
 * Don't wait between open and establish phases.
ximinez pushed a commit to manojsdoshi/rippled that referenced this pull request Aug 19, 2023
 * Verify accepted ledger becomes validated, and retry
   with a new consensus transaction set if not.
 * Always store proposals.
 * Track proposals by ledger sequence. This helps slow peers catch
   up with the rest of the network.
 * Acquire transaction sets for proposals with future ledger sequences.
   This also helps slow peers catch up.
 * Optimize timer delay for establish phase to wait based on how
   long validators have been sending proposals. This also helps slow
   peers to catch up.
 * Fix impasse achieving close time consensus.
 * Don't wait between open and establish phases.
manojsdoshi pushed a commit to manojsdoshi/rippled that referenced this pull request Aug 21, 2023
 * Verify accepted ledger becomes validated, and retry
   with a new consensus transaction set if not.
 * Always store proposals.
 * Track proposals by ledger sequence. This helps slow peers catch
   up with the rest of the network.
 * Acquire transaction sets for proposals with future ledger sequences.
   This also helps slow peers catch up.
 * Optimize timer delay for establish phase to wait based on how
   long validators have been sending proposals. This also helps slow
   peers to catch up.
 * Fix impasse achieving close time consensus.
 * Don't wait between open and establish phases.
intelliot added a commit that referenced this pull request Aug 29, 2023
@intelliot
Copy link
Collaborator

Since this was reverted in 1.12.0-rc3, this PR is being reopened for additional review, testing, and potential improvements, with the expectation that it will be merged again soon (e.g. for the 1.13 or 2.0 release).

@intelliot intelliot reopened this Sep 1, 2023
@intelliot
Copy link
Collaborator

in progress: considering a txThreshold so new behavior is used under high volume. change would likely be pushed to this consensus branch/PR.

@intelliot intelliot merged commit f259cc1 into XRPLF:develop Sep 11, 2023
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 22, 2023
 * Verify accepted ledger becomes validated, and retry
   with a new consensus transaction set if not.
 * Always store proposals.
 * Track proposals by ledger sequence. This helps slow peers catch
   up with the rest of the network.
 * Acquire transaction sets for proposals with future ledger sequences.
   This also helps slow peers catch up.
 * Optimize timer delay for establish phase to wait based on how
   long validators have been sending proposals. This also helps slow
   peers to catch up.
 * Fix impasse achieving close time consensus.
 * Don't wait between open and establish phases.
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 22, 2023
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 22, 2023
* Verify accepted ledger becomes validated, and retry
   with a new consensus transaction set if not.
 * Always store proposals.
 * Track proposals by ledger sequence. This helps slow peers catch
   up with the rest of the network.
 * Acquire transaction sets for proposals with future ledger sequences.
   This also helps slow peers catch up.
 * Optimize timer delay for establish phase to wait based on how
   long validators have been sending proposals. This also helps slow
   peers to catch up.
 * Fix impasse achieving close time consensus.
 * Don't wait between open and establish phases.
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 25, 2023
 * Verify accepted ledger becomes validated, and retry
   with a new consensus transaction set if not.
 * Always store proposals.
 * Track proposals by ledger sequence. This helps slow peers catch
   up with the rest of the network.
 * Acquire transaction sets for proposals with future ledger sequences.
   This also helps slow peers catch up.
 * Optimize timer delay for establish phase to wait based on how
   long validators have been sending proposals. This also helps slow
   peers to catch up.
 * Fix impasse achieving close time consensus.
 * Don't wait between open and establish phases.
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 25, 2023
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 25, 2023
* Verify accepted ledger becomes validated, and retry
   with a new consensus transaction set if not.
 * Always store proposals.
 * Track proposals by ledger sequence. This helps slow peers catch
   up with the rest of the network.
 * Acquire transaction sets for proposals with future ledger sequences.
   This also helps slow peers catch up.
 * Optimize timer delay for establish phase to wait based on how
   long validators have been sending proposals. This also helps slow
   peers to catch up.
 * Fix impasse achieving close time consensus.
 * Don't wait between open and establish phases.
sophiax851 added a commit to sophiax851/rippled that referenced this pull request Nov 30, 2023
intelliot pushed a commit that referenced this pull request Dec 1, 2023
* Revert "Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes (#4760)"

This reverts commit 8ce85a9.

* Revert "Several changes to improve Consensus stability: (#4505)"

This reverts commit f259cc1.

* Add missing include

---------

Co-authored-by: seelabs <[email protected]>
@intelliot intelliot mentioned this pull request Dec 20, 2023
2 tasks
@intelliot intelliot added Reverted Changes which should still be considered for re-merging. See "Closed" PRs with this label Perf Attn Needed Attention needed from RippleX Performance Team and removed Performance/Resource Improvement labels Jan 10, 2024
@intelliot intelliot modified the milestones: TPS 2023-09, TPS Jan 10, 2024
This was referenced Jan 24, 2024
sophiax851 pushed a commit to sophiax851/rippled that referenced this pull request Jun 12, 2024
sophiax851 pushed a commit to sophiax851/rippled that referenced this pull request Jun 12, 2024
* Verify accepted ledger becomes validated, and retry
   with a new consensus transaction set if not.
 * Always store proposals.
 * Track proposals by ledger sequence. This helps slow peers catch
   up with the rest of the network.
 * Acquire transaction sets for proposals with future ledger sequences.
   This also helps slow peers catch up.
 * Optimize timer delay for establish phase to wait based on how
   long validators have been sending proposals. This also helps slow
   peers to catch up.
 * Fix impasse achieving close time consensus.
 * Don't wait between open and establish phases.
sophiax851 added a commit to sophiax851/rippled that referenced this pull request Jun 12, 2024
* Revert "Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes (XRPLF#4760)"

This reverts commit 8ce85a9.

* Revert "Several changes to improve Consensus stability: (XRPLF#4505)"

This reverts commit f259cc1.

* Add missing include

---------

Co-authored-by: seelabs <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Perf Attn Needed Attention needed from RippleX Performance Team Ready to merge (Author) Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. Reverted Changes which should still be considered for re-merging. See "Closed" PRs with this label
Projects
Status: 📋 Backlog
Development

Successfully merging this pull request may close these issues.

6 participants