-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stream Raft Messages and Fix Check Quorum #3138
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 4 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @manishrjain)
conn/node.go, line 379 at r2 (raw file):
/ Exit after a thousand tries
"Exit after at least a thousand tries or ten seconds" to match what the code is actually doing.
conn/node.go, line 386 at r2 (raw file):
So that we print error only a few times
nit: "Update lastLog so that we print the error ... "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 5 files reviewed, 2 unresolved discussions (waiting on @martinmr)
conn/node.go, line 379 at r2 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
/ Exit after a thousand tries
"Exit after at least a thousand tries or ten seconds" to match what the code is actually doing.
Done.
conn/node.go, line 386 at r2 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
So that we print error only a few times
nit: "Update lastLog so that we print the error ... "
Done.
Instead of sending the Raft messages, one message per gRPC call, this PR creates a one-way stream between the sender and the receiver. Each messages gets pushed to a channel. We use smart batching to pick up as many messages as we can and send them over the stream in order. If we see connection issues etc., there are mechanisms in place to recreate the stream. Another issue I saw was related to Zero being unable to maintain quorum. It was because of an unbuffered channel in checkQuorum asking for read index, which didn't allow multiple requests to be pushed into one batch causing check quorum to fail even with one second timeout. After allocating a buffered channel, all the check quorum requests finish within a millisecond, rarely going above 7ms in my tests. Changes: * Stream raft messages instead of sending them one by one. * Set duration to 10s * Zero checkQuorum works well now * Martin's comments * Adjust timeouts in contexts, so the deeper one has shorter timeout and the outer one has longer. * Batch up multiple Raft messages from channel and send them in one request.
Instead of sending the Raft messages, one message per gRPC call, this PR creates a one-way stream between the sender and the receiver. Each messages gets pushed to a channel. We use smart batching to pick up as many messages as we can and send them over the stream in order. If we see connection issues etc., there are mechanisms in place to recreate the stream. Another issue I saw was related to Zero being unable to maintain quorum. It was because of an unbuffered channel in checkQuorum asking for read index, which didn't allow multiple requests to be pushed into one batch causing check quorum to fail even with one second timeout. After allocating a buffered channel, all the check quorum requests finish within a millisecond, rarely going above 7ms in my tests. Changes: * Stream raft messages instead of sending them one by one. * Set duration to 10s * Zero checkQuorum works well now * Martin's comments * Adjust timeouts in contexts, so the deeper one has shorter timeout and the outer one has longer. * Batch up multiple Raft messages from channel and send them in one request.
Instead of sending the Raft messages, one message per gRPC call, this PR creates a one-way stream between the sender and the receiver. Each messages gets pushed to a channel. We use smart batching to pick up as many messages as we can and send them over the stream in order. If we see connection issues etc., there are mechanisms in place to recreate the stream.
Another issue I saw was related to Zero being unable to maintain quorum. It was because of an unbuffered channel in checkQuorum asking for read index, which didn't allow multiple requests to be pushed into one batch causing check quorum to fail even with one second timeout. After allocating a buffered channel, all the check quorum requests finish within a millisecond, rarely going above 7ms in my tests.
This change is