-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chain stalls while scaling out from 1 to 4 nodes, changing quorum size fixes things #796
Conversation
As additional background, besides ceil(2N/3), we have also evaluated three other formulae: ceil((N+f+1)/2), floor(2N/3)+1 and N-f. It turns out ceil(2N/3) and ceil((N+f+1)/2) are equivalent, while floor(2N/3)+1 and N-f are equivalent:
We felt ceil(2N/3) already guarantees super majority in all cases, whereas floor(2N/3)+1 or N-f might be overly strict in some cases (N=3, 6, 9, etc.) In our testing we have run the new formula ceil(2N/3) alongside the original in our test pipelines (where we stand up and tear down Quorum networks constantly throughout the day). While the original formula produced a few stalled chains during the add-one-node-at-a-time ramp up, all attributed to the roundchange loop, the new formula has not failed at all. We couldn't find existing literature that explains if ceil(2N/3) and floor(2N/3)+1 would cause IBFT to behave differently. But empirical data definitely points to a positive enhancement over the existing formula. |
A question we should consider is whether there's value in making the formula switchable. We could adopt a default, between ceil(2N/3) and floor(2N/3)+1. But have the other non-default switched on via an optional command line switch. This may allow the technical community to experiment with both and collect empirical data. If that's considered useful, @panghalamit I do want to make a suggestion, to remove the formula parameter from the function |
@panghalamit @jimthematrix thank you for your contribution and for your feedback on the formula that is being used in round change. We have been actively testing both formulas for all phases of IBFT, not just the round change and more specifically in non 3f+1 networks with dynamic validators. On smaller non 3f+1 validator networks, n-f provides the best consistency in an network that should tolerate f failures but both formulas converge on a similar quorum for larger networks. To calculate n-f, we use floor(2n/3) + 1 so that we don't have to calculate f separately. Our intention is to make the update on all phases of IBFT, not just the round change. It would be helpful if you can provide your feedback on the below quorum branches based on the networks that you have been testing against. https://github.com/jbhurat/quorum/tree/update_consensus_floor (floor(2N/3)+1 ) https://github.com/jbhurat/quorum/tree/update_istanbul_consensus ceil(2N/3) We won't merge this PR with only the round change but we'd like to see it updated to include the other phases of IBFT to use a consistent formula. No need to make it a flag as most users won't control the entire network and that might lead to unintentional breaks. Users similar to you should be able to update the code to experiment with other formulas. |
@jpmsam thanks for the review and input Sam, we'll use our test pipeline to exercise the two formulae and report our findings. My prediction is that neither will turn up any failures given the conditions we run them in. Would love to see more details on On smaller non 3f+1 validator networks, n-f provides the best consistency to help us better understand the difference. |
… to floor(2*N/3)+1
In a 3 node validator network, f should be 0. In using ceil(2N/3), you are tolerating f=1. n-f guarantees that f is 0 by requiring a quorum of 3. Similar logic applies to 6,9,12... validator networks. |
Updated code to add the formula everywhere it is used. The formula used is (floor(2N/3) +1). |
Hi @panghalamit @jimthematrix. I'm sorry about the back and forth. @jpmsam and I actually discussed this today and we think that ceil(2N/3) is the most appropriate formula, precisely for the reasons you identified. It matches the lower bound for safety, while not being as strict as n-f, thus ensuring the best performance. Thank you for this contribution! (We are happy to make the updates too.) |
@hmoniz thanks for your input, no trouble at all, we totally understand the complexity of this and are happy to help in different ways to get it right. The latest updates from Amit have been running in our health check part of the pipeline since two weeks ago and haven’t caused any failures (again the deliberate hostility of the environment has reliably produced failures in the past with the old formula). Please let us know if any further changes are needed. |
…talls node scale out
… to floor(2*N/3)+1
`op.isMutating()` was added for checking mutating VM operation, but `operation.writes` should be used now as it registers all mutating ops and is consistent with upstream. Additionally, there is a new mutating opcode `CREATE2`, but `CREATE2` has not been added to the `op.isMutating()` set. The VM is in read-only mode when a private contract calls a public contract.
Tessera 0.10.0 Release updates
…ceil(2n/3), tests updated based on suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I just have a few more comments. Thank you for incorporating all the suggestions!
@@ -289,8 +289,8 @@ func (sb *backend) verifyCommittedSeals(chain consensus.ChainReader, header *typ | |||
} | |||
} | |||
|
|||
// The length of validSeal should be larger than number of faulty node + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the master the condition is
if validSeal < 2*snap.ValSet.F()
shouldn't it be this instead?
if validSeal <= snap.ValSet.F()
…to ibft-quorum-formula
…onfirmations required to move between states to Ceil(2n/3)
Hi @jimthematrix and @panghalamit, to make a controlled transition to the new formula for existing chains we have made an enhancement where the change in formula only happen after |
@jbhurat sorry for the delayed response, the additional changes in the just to confirm my understanding, we expect the procedure to migrate an existing branch to be:
|
Hi @jimthematrix, |
I saw in |
In The reason we went with the approach of applying the flag, for |
thanks for explaining that. The mechanism looks good to us 👍 |
…rn nil pointer dereference when istanbul config section is missing from genesis
Hi @jimthematrix and @panghalamit, do you guys want to merge those changes in this PR and we will go ahead and merge this PR |
@jbhurat just finished merging from your branch to this PR, please double check the result. thanks! fyi @panghalamit |
Thanks @jimthematrix and @panghalamit for your contribution and for incorporating all the feedback. |
Scenario:
While scaling out nodes by voting them one by one in successive blocks, if a roundchange timer is triggered when validator set size <=3, the nodes keep on timing out on roundchange, without exchanging any new block msgs. The chain stalls and nodes keep on timing out forever.
On careful inspection of logs from multiple runs and the code, issue is in the roundchange msg handling logic in roundchange.go
https://github.com/jpmorganchase/quorum/blob/4c74f47525a8db6ffaf7a539c875a8c71236f15c/consensus/istanbul/core/roundchange.go#L96-L109
the if (line 96) and else if (line 101) condition are same for N<=3, with f = 0. The code in line 103 is never executed, which has logic for current proposer to either send locked proposal, or pendingRequest. Hence, nodes keep on sending roundchange messages and timing out and chain doesn't make any progress.
Changing condition in line 101, from num == 2f+1 to num == ceil(2N/3) or num == N-f. fixes this failure scenario.
Please find attached logs for 4 quorum nodes in failure scenario.
27728.zip
The change includes adding QuorumSize() function to validator set interface, which returns quorum size specific to formulae used. This change only updates the number of confirmations (quorum size) in roundchange.go but it could be added at other places where 2f+1 is used.