Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes #4760

mtrippled · 2023-10-12T17:12:57Z

High Level Overview of Change

This fixes a bug introduced in #4505 that causes gratuitous position changes, resulting in validator desyncs.

Context of Change

This change refactors the way that proposal close times are determined for purposes of a node adopting a new close time. The change in #4505 that introduced the bug also fixed a bug that resulted in impasses in testing. However, it also increased the likelihood of a peer changing its close time and, therefore, its position. This fix avoids the impasse while minimizing position changes. Namely, it only acts to avoid an impasse if consensus is already taking a long time.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Refactor (non-breaking change that only restructures code)
Tests (You added tests for code that already exists, or your new feature included in this PR)
Documentation Updates
Release

Run throughput test, observe network stability after transaction processing stops.

intelliot · 2023-10-17T05:29:30Z

I applied the clang-format.patch and merged 'develop'. @mtrippled, if that inconveniences you in any way, please feel free to force push over my commits.

thejohnfreeman · 2023-10-30T18:46:15Z

src/ripple/consensus/Consensus.h

+        // Avoiding the impasse means that validators should not necessarily
+        // count their own close time towards the total until they know
+        // the most popular among their peers, and then change their vote
+        // to that of the most popular. In this case, the validators in set
+        // t2 from the example would have switched to close time t1 and ended
+        // the network impasse.


If the validators in t1 do not count their own close times, then they perceive two equal sets of 2 validators each for different close times. Are we sure they won't switch, even if t2 is greater? Perhaps the code handles this correctly (I have not yet analyzed it), but it seems ambiguous in at least this comment.

I'm a little worried that we're adding more and more special cases to the code instead of carefully designing and implementing a general algorithm. I'm disturbed by how long this function is even before this change. Is it possible to state the current consensus algorithm in simple terms? How far has it strayed from the description in the whitepapers? I don't think I'm concerned enough right now to try to get in the way of this change, but I'd like to hear opinions from others. How much confidence do we have in the convergence of consensus at this point?

I observed the impasse behavior in testing, and commented on how specifically it occurred:
// Here's how the impasse could occur:
// Imagine 5 validators. 3 have close time t1, and 2 have t2.
// As consensus time increases, the threshVote threshold also increases.
// Once threshVote exceeds 60%, no members of either set of validators
// will change their close times.

This is a point fix to a problem observed in testing that also potentially affects the live network. It is not a holistic refactor of consensus.

As duration increases, threshVote gets higher. threshVote is the criterion for whether or not to change one's close time. So if it's >60% and only 60% of peers have the other close time, then none will ever change their close time. That's the impasse that was observed in testing. The original fix led to gratuitous position changes, which were also observed in testing. This change fixes both issues, and no problems have been observed in testing.
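A minimal sketch of the arithmetic behind this impasse, not the rippled implementation. The helper `wouldSwitch` and the example percentages (50 and 65) are assumptions chosen for illustration; the actual threshold computation in `Consensus.h` may round or compare differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not rippled's code): a validator adopts another close
// time only if the share of votes for that time reaches threshPct percent.
bool
wouldSwitch(
    std::size_t votesForOther,
    std::size_t totalVotes,
    std::size_t threshPct)
{
    assert(totalVotes > 0);
    // Integer form of: votesForOther / totalVotes >= threshPct / 100
    return votesForOther * 100 >= totalVotes * threshPct;
}
```

With 5 validators split 3 to 2, the two t2 validators see 60% support for t1. `wouldSwitch(3, 5, 50)` is true, so they would move early in the round, but `wouldSwitch(3, 5, 65)` is false, and the three t1 validators never see more than 40% for t2. Once the threshold has risen past 60%, neither set moves.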

// Here's how the impasse could occur:
// Imagine 5 validators. 3 have close time t1, and 2 have t2.
// As consensus time increases, the threshVote threshold also increases.
// Once threshVote exceeds 60%, no members of either set of validators
// will change their close times.

If t2 > t1, then:

the validators in t1 will see two times tied with two votes each, and switch to the greater of the two, which is t2

the validators in t2 will see t1 with 3 votes and t2 with 1 vote, and switch to the one with more votes, which is t1

After, t2 will have 3 votes and t1 will have 2 votes, and there will be one more flip as everyone converges on the time that is both greater and has more votes, which is t2. Correct? Can that be done in one step instead of two? Are there chances for more discord with 35 validators and more than two close times? Does this feel hacky to anyone else?

What is the logic behind the threshVote restriction anyway? Why is switching halted after all times fall below threshVote? Why not just remove that restriction? Could the fix be as simple as that?

@mtrippled How is the impasse that #4505 addresses possible for any transaction volume? The protocol has been working well with 1950ms as the min_duration for the establish phase and with how the countdown starts. Again, if we want to change how the close_time is established, we need to test it under a more realistic environment and test case. I totally agree with proactive testing, but the testing has to be valid in terms of how the problem can appear in production.

The original impasse description is commented: https://github.com/XRPLF/rippled/pull/4760/files#diff-742221137528fb505ea44914e395d7576a77eb3d97e32ced8565b7a6f1c9ff7bR1822

And simply making a test environment more like production doesn't necessarily cause the bug to occur, at least immediately. As in the description, the impasse potential is latent at any volume, having to do with decisions about when to change a close time.

The code is fine, but this comment is wrong. Clearing the impasse takes one more round.

// Avoiding the impasse means that validators should not necessarily
// count their own close time towards the total until they know
// the most popular among their peers, and then change their vote
// to that of the most popular. In this case, the validators in set
// t2 from the example would have switched to close time t1 and ended
// the network impasse.

Assume all these validators are "stuck", i.e. they all perceive a long consensus time. Our starting condition is this:

| validator | voting for |
| --- | --- |
| A | t1 |
| B | t1 |
| C | t1 |
| D | t2 |
| E | t2 |

Every validator looks at the count, but excludes their own vote. Here is how they see the votes:

| validator | votes for t1 | votes for t2 |
| --- | --- | --- |
| A | 2 | 2 |
| B | 2 | 2 |
| C | 2 | 2 |
| D | 3 | 1 |
| E | 3 | 1 |

The t1 validators A, B, C decide to switch to t2 because they perceive a tie between t1 and t2 and t2 is the later value. The t2 validators D, E decide to switch because they perceive a sole winning time of t1. All validators flip. Here is the next state:

| validator | voting for |
| --- | --- |
| A | t2 |
| B | t2 |
| C | t2 |
| D | t1 |
| E | t1 |

The impasse remains. Now they count votes again, excluding their own vote. Here is what they see:

| validator | votes for t1 | votes for t2 |
| --- | --- | --- |
| A | 2 | 2 |
| B | 2 | 2 |
| C | 2 | 2 |
| D | 1 | 3 |
| E | 1 | 3 |

The validators A, B, C, which are now voting for t2, still perceive a tie between t1 and t2, but they do not switch their vote, because t2 is the later value. The validators D, E, which are now voting for t1, do switch their vote because they perceive a sole winning time of t2. Now the impasse is cleared. Here is the next state:

| validator | voting for |
| --- | --- |
| A | t2 |
| B | t2 |
| C | t2 |
| D | t2 |
| E | t2 |
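The walk-through above can be replayed with a small simulation. This is a sketch of the rule as described in this comment (tally peers only, most votes wins, ties go to the later time), not the code under review. `CloseTime` as an `int` and the function name `stepExcludingOwnVote` are assumptions made for the example; it needs C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using CloseTime = int;  // illustrative stand-in; a larger value is a later time

// One round of the rule from the walk-through: every validator tallies its
// peers' votes (excluding its own) and adopts the time with the most votes,
// breaking ties in favor of the later time.
std::vector<CloseTime>
stepExcludingOwnVote(std::vector<CloseTime> const& votes)
{
    // Each validator needs at least one peer to tally.
    assert(votes.size() >= 2);

    std::vector<CloseTime> next(votes);
    for (std::size_t self = 0; self < votes.size(); ++self)
    {
        std::map<CloseTime, std::size_t> tally;
        for (std::size_t peer = 0; peer < votes.size(); ++peer)
            if (peer != self)
                ++tally[votes[peer]];

        // std::map iterates in ascending key order, so comparing with >=
        // lets the later time win a tie.
        std::size_t bestCount = 0;
        for (auto const& [time, count] : tally)
        {
            if (count >= bestCount)
            {
                bestCount = count;
                next[self] = time;
            }
        }
    }
    return next;
}
```

Starting from `{1, 1, 1, 2, 2}` (A, B, C on t1 = 1 and D, E on t2 = 2), one call yields `{2, 2, 2, 1, 1}` and a second call yields `{2, 2, 2, 2, 2}`, matching the two rounds in the tables.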

I modified the comment to reflect the current behavior.

Regarding the table you made: the new behavior differs from what is described there, because of the problem you pointed out. Validators no longer revoke their vote in the first step after being "stuck". Instead, they all vote as they always have and tally the votes for each close time. Now, however, they determine which close time is best based on these criteria: most votes, then, if tied, the latest close time. In this case, t1 starts with 3 votes and t2 with 2. Only the t2 set will change their position and vote for t1. The t1 set know theirs is the best and will not change their votes. The impasse is resolved in the next iteration.

In the case of the problem with t1 having 2 votes and t2 having 2 votes: that was possible before my previous commit. What happens now is that each validator first counts its own close time towards the tallies of each possible close time. They then see that there is a tie between t1 and t2. However, t2 is greater. Therefore, the set in t1 will switch their close time to t2, and the t2 set will keep theirs. Resolved in the next iteration.
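The selection rule described in these two paragraphs (count every vote including one's own, prefer the most votes, then the latest close time) can be sketched as follows. This is an illustration under assumptions, not the code in `Consensus.h`: `CloseTime` as an `int` and the name `bestCloseTime` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

using CloseTime = int;  // illustrative stand-in; a larger value is a later time

// Pick the "best" close time from a tally that includes the validator's own
// vote: the most votes wins, and a tie goes to the later close time.
CloseTime
bestCloseTime(std::map<CloseTime, std::size_t> const& tally)
{
    assert(!tally.empty());
    auto best = tally.begin();
    // Keys are visited in ascending order, so >= moves to the later time
    // whenever the counts are equal.
    for (auto it = tally.begin(); it != tally.end(); ++it)
        if (it->second >= best->second)
            best = it;
    return best->first;
}
```

For the 3 to 2 split, `bestCloseTime({{1, 3}, {2, 2}})` returns 1, so only the t2 set changes position. For a 2 to 2 tie, `bestCloseTime({{1, 2}, {2, 2}})` returns 2, so only the t1 set changes. Because every validator computes the same tally, they all agree on the target in a single step.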

HowardHinnant

Left a couple of optional changes.

HowardHinnant · 2023-11-01T01:02:49Z

LGTM.

thejohnfreeman

I'll leave it up to Mark to decide how to fix the portion of the comment that talks through the example.

scottschurr

This all looks good as far as I can see. I left a few comments about things that might be a little more efficient. But nothing that I think should stop this pull request. It's also possible I misunderstand something, so be cautious. Don't let my "efficiencies" introduce bugs.

I do have one concern, however. There are a few places that simply assume that STL containers have at least one entry. I'm guessing that's a legit assumption. If it's not, then this code could lead to occasional crashes when some of these containers are empty.

Personally, I wish the code were handling the empty container cases a bit more defensively. But it may not be necessary.
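For illustration, here is one way the empty-container case could be handled defensively, alongside the assert-based alternative. This is a sketch only; `mostPopular`, `mostPopularChecked`, and the `CloseTime` alias are hypothetical names, not functions from `Consensus.h`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>

using CloseTime = int;  // illustrative stand-in for a close time

// Returns the close time with the most votes, or std::nullopt when the tally
// is empty, instead of dereferencing an end() iterator. Ties resolve to the
// first (earliest) entry here; tie-breaking is not the point of this sketch.
std::optional<CloseTime>
mostPopular(std::map<CloseTime, std::size_t> const& tally)
{
    auto const it = std::max_element(
        tally.begin(), tally.end(), [](auto const& a, auto const& b) {
            return a.second < b.second;
        });
    if (it == tally.end())
        return std::nullopt;
    return it->first;
}

// Alternative: keep the non-empty assumption, but state and check it.
CloseTime
mostPopularChecked(std::map<CloseTime, std::size_t> const& tally)
{
    auto const best = mostPopular(tally);
    assert(best.has_value());
    return *best;
}
```

The `std::optional` version pushes the decision to the caller, while the checked version documents the invariant and fails loudly in debug builds if it is ever violated.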

mtrippled · 2023-11-17T03:07:59Z

I commented above those optionals, and also put an assert() there. It's safe.

Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes.

ed8da55

mtrippled force-pushed the close-time branch from 605610c to ed8da55 Compare October 12, 2023 17:15

intelliot changed the title ~~Optimize calculation of close time to avoid~~ Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes Oct 12, 2023

intelliot requested review from HowardHinnant and ximinez October 12, 2023 20:39

intelliot added the Performance/Resource Improvement label Oct 12, 2023

intelliot added this to the TPS milestone Oct 12, 2023

intelliot requested review from thejohnfreeman and removed request for ximinez October 17, 2023 02:47

intelliot added 2 commits October 16, 2023 22:28

git apply clang-format.patch

6eeb8d3

Merge branch 'develop' into close-time

6659d94

thejohnfreeman reviewed Oct 30, 2023

intelliot requested a review from scottschurr October 31, 2023 17:28

HowardHinnant approved these changes Oct 31, 2023

Review (Howard) fixes.

51c6c91

Review fix for impasse discovered by John.

abb8cab

HowardHinnant self-requested a review November 16, 2023 02:28

Merge branch 'develop' into close-time

7e0ec0e

thejohnfreeman requested changes Nov 16, 2023

Review fixes (comments) from John.

4ad9cbf

thejohnfreeman approved these changes Nov 16, 2023

scottschurr approved these changes Nov 17, 2023

Scott S review fixes. Also clang-format.

f4f0391

mtrippled force-pushed the close-time branch from 4c54ff3 to f4f0391 Compare November 17, 2023 03:06

mtrippled added the "Ready to merge (Author)" label (passed code review and the PR owner thinks it's ready to merge; perf sign-off may still be required) Nov 17, 2023

manojsdoshi merged commit 8ce85a9 into XRPLF:develop Nov 17, 2023
16 checks passed

sophiax851 added a commit to sophiax851/rippled that referenced this pull request Nov 30, 2023

Revert "Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes (XRPLF#4760)" This reverts commit 8ce85a9.

f9ea98d

intelliot added the "Reverted" (changes which should still be considered for re-merging) and "Perf Attn Needed" (attention needed from RippleX Performance Team) labels and removed the "Performance/Resource Improvement" label Jan 10, 2024

intelliot mentioned this pull request Jan 24, 2024

Revert 4505 4760 #4842

Merged

13 tasks
