Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes #4760
Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes #4760
Changes from 3 commits
ed8da55
6eeb8d3
6659d94
51c6c91
abb8cab
7e0ec0e
4ad9cbf
f4f0391
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the validators in t1 do not count their own close times, then they perceive two equal sets of 2 validators each for different close times. Are we sure they won't switch, even if t2 is greater? Perhaps the code handles this correctly (I have not yet analyzed it), but it seems ambiguous in at least this comment.
I'm a little worried that we're adding more and more special cases to the code instead of carefully designing and implementing a general algorithm. I'm disturbed by how long this function is even before this change. Is it possible to state the current consensus algorithm in simple terms? How far has it strayed from the description in the whitepapers? I don't think I'm concerned enough right now to try to get in the way of this change, but I'd like to hear opinions from others. How much confidence do we have in the convergence of consensus at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I observed the impasse behavior in testing, and commented on how specifically it occurred:
// Here's how the impasse could occur:
// Imagine 5 validators. 3 have close time t1, and 2 have t2.
// As consensus time increases, the threshVote threshold also increases.
// Once threshVote exceeds 60%, no members of either set of validators
// will change their close times.
This is a point fix to a problem observed in testing that also potentially affects the live network. It is not a holistic refactor of consensus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As duration increases, threshVote gets higher. threshVote is the criteria for whether or not to change one's close time. So if it's >60% and only 60% of peers have the other close time, then none will ever change their close time. That's the impasse which was observed in testing. The original fix led to gratuitous position changes, which were again observed in testing. This fixes both issues, and no problems are observed in testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If t2 > t1, then:
After, t2 will have 3 votes and t1 will have 2 votes, and there will be one more flip as everyone converges on the time that is both greater and has more votes, which is t2. Correct? Can that be done in one step instead of two? Are there chances for more discord with 35 validators and more than two close times? Does this feel hacky to anyone else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the logic behind the
threshVote
restriction anyway? Why is switching halted after all times fall belowthreshVote
? Why not just remove that restriction? Could the fix be as simple as that?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mtrippled How is the impasse for which #4505 addresses possible for any transaction volume? The protocol has been working well with the 1950ms being the min_duration for the establish phase and how the countdown starts, again, if we want to change how the close_time should be established, we need to test it under a more realistic env. and test case. I agree totally with the proactive testing, but the testing has to be valid in terms of how it can appear in production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original impasse description is commented: https://github.com/XRPLF/rippled/pull/4760/files#diff-742221137528fb505ea44914e395d7576a77eb3d97e32ced8565b7a6f1c9ff7bR1822
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And simply making a test environment more like production doesn't necessarily cause the bug to occur, at least immediately. As in the description, the impasse potential is latent at any volume, having to do with decisions about when to change a close time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is fine, but this comment is wrong. Clearing the impasse takes one more round.
Assume all these validators are "stuck", i.e. they all perceive a long consensus time. Our starting condition is this:
Every validator looks at the count, but excludes their own vote. Here is how they see the votes:
The t1 validators A, B, C decide to switch to t2 because they perceive a tie between t1 and t2 and t2 is the later value. The t2 validators D, E decide to switch because they perceive a sole winning time of t1. All validators flip. Here is the next state:
The impasse remains. Now they count votes again, excluding their own vote. Here is what they see:
The validators A, B, C, which are now voting for t2, still perceive a tie between t1 and t2, but they do not switch their vote, because t2 is the later value. The validators D, E, which are now voting for t1, do switch their vote because they perceive a sole winning time of t2. Now the impasse is cleared. Here is the next state:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified the comment to reflect the current behavior.
Regarding the table you made--the new behavior is different than described based on the problem you pointed out. No longer do validators revoke their vote in the first step after being "stuck". Instead, they all vote as they always have and tally for each. Now, however, they determine which is the best close time based on the criteria: most votes, then if tie, latest close time. In this case, t1 will have 3 and t2 start with 2 votes. Only the t2 set will change their position and vote to t1. The t1 set know they are the best, and will not change their votes. Impasse resolved next iteration.
In the case of the problem with t1 having 2 votes and t2 having 2 votes--that was possible with before my previous commit. What happens now is that each validator first counts their own close time towards the tallies of each possible close time. They they see that there is a tie between t1 and t2. However, t2 is greater. Therefore, the set in t1 will switch their close time to t2, and t2 will keep the same. Resolved in the next iteration.