
buffer applyCh with up to conf.MaxAppendEntries #124

Closed
wants to merge 1 commit

Conversation

stapelberg
Contributor

This change improves throughput in busy Raft clusters.
By buffering applyCh, individual AppendEntries RPCs carry more Raft log entries.
In my tests, this improves throughput from about 4.5 kqps to about 5.5 kqps.

As-is:
n1-standard-8-c8deaa9d333f69fb56c8935036e7ca5c-no-buffer

With my change:
n1-standard-8-f2fed4ebe05df23eb322c805b503a144

(Both tests were performed with 3 n1-standard-8 nodes on Google Compute Engine in the europe-west1-d region.)

@stapelberg
Contributor Author

The Travis build failed because it ran into a timeout; I think restarting it will help.

@ongardie
Contributor

ongardie commented Jun 25, 2016

With the disclaimer that I might be completely wrong about Go's scheduler, this PR makes sense to me. The loop within leaderLoop tries to drain up to conf.MaxAppendEntries entries from the channel because it's more efficient to send a large batch of AppendEntries. With a buffered channel, it's easy to see that the leaderLoop will drain many entries from the channel. With an unbuffered channel, there could be multiple goroutines blocked trying to send on the channel, but I don't think Go makes any guarantee that a non-blocking receive in a loop will consume all of them; I think that depends on getting lucky with the scheduler.
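
For anyone not familiar with that code, here is a minimal sketch of the drain pattern being described (an illustration only, not the library's actual leaderLoop; drainBatch is an invented name, while logFuture and MaxAppendEntries come from the codebase):

    // After one blocking receive, opportunistically pull more futures without
    // blocking, up to maxAppendEntries per batch.
    func drainBatch(applyCh <-chan *logFuture, first *logFuture, maxAppendEntries int) []*logFuture {
        batch := []*logFuture{first}
    GROUP:
        for len(batch) < maxAppendEntries {
            select {
            case f := <-applyCh:
                batch = append(batch, f)
            default:
                // With an unbuffered applyCh there may be goroutines blocked on
                // send, but the scheduler gives no guarantee any of them are seen
                // here before the default case fires; a buffered channel makes
                // the queued work visible to this non-blocking receive.
                break GROUP
            }
        }
        return batch
    }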

There's some related discussion that I found helpful in this thread: https://groups.google.com/d/msg/golang-nuts/UnzE5vgyzqw/eZdhrhoWJQUJ

So +1 for merging.

@ongardie-sfdc

And I'm second-guessing myself on my previous comment.

cc @superfell

@schristoff
Contributor

Hey @stapelberg - thank you for this. Is there any way you could rebase onto master so CircleCI can take a stab at this?

@superfell
Contributor

One side effect of this change relates to how the timeout works in ApplyLog. The timeout covers putting the item on the channel, so with a buffered channel timeouts won't start happening until the channel buffer is full.

@stapelberg
Contributor Author

Rebased!

@hanshasselberg
Member

@superfell thanks for helping out! I don't understand which timeouts you are referring to. Could you try to explain that again? Thanks a lot!

@hanshasselberg
Member

@superfell All good now, I figured out which timeout you mean:

raft/api.go

Lines 656 to 684 in db5ceea

    // ApplyLog performs Apply but takes in a Log directly. The only values
    // currently taken from the submitted Log are Data and Extensions.
    func (r *Raft) ApplyLog(log Log, timeout time.Duration) ApplyFuture {
        metrics.IncrCounter([]string{"raft", "apply"}, 1)
        var timer <-chan time.Time
        if timeout > 0 {
            timer = time.After(timeout)
        }

        // Create a log future, no index or term yet
        logFuture := &logFuture{
            log: Log{
                Type:       LogCommand,
                Data:       log.Data,
                Extensions: log.Extensions,
            },
        }
        logFuture.init()

        select {
        case <-timer:
            return errorFuture{ErrEnqueueTimeout}
        case <-r.shutdownCh:
            return errorFuture{ErrRaftShutdown}
        case r.applyCh <- logFuture:
            return logFuture
        }
    }

But I still don't understand how the semantics are changing. Right now we start the timeout before we write to the channel, which means the time spent waiting for the leader to read it is counted.
That seems to be the same after this change: we can write to the channel, but we are still waiting, in much the same way, for the leader to read it.

Thanks!

@superfell
Contributor

It's timing how long it takes to write to the channel. If you change it to a buffered channel, the write can succeed immediately (if there's space) even though nothing has processed the future yet, and when it will get processed is still some indeterminate time in the future. With the unbuffered channel as it currently is, Apply can't write to the channel until the leader loop is ready to read from it. So what the timeout is measuring is different, and it gets more different the larger the channel buffer is.
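
A tiny self-contained illustration of that difference (a toy program, not library code): a send on a buffered channel returns as soon as there is space, while a send on an unbuffered channel returns only once a receiver takes the value, which is what the Apply timeout effectively measures today.

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        buffered := make(chan int, 8)
        unbuffered := make(chan int)

        // Succeeds immediately even though nothing has processed the value,
        // so a timeout raced against this send would effectively never fire.
        buffered <- 1
        fmt.Println("buffered send returned with no receiver")

        // Blocks until a receiver is ready, so a timeout raced against this
        // send measures how long the consumer took to pick the value up.
        done := make(chan struct{})
        go func() {
            time.Sleep(50 * time.Millisecond)
            fmt.Println("received:", <-unbuffered)
            close(done)
        }()
        unbuffered <- 2
        <-done
        fmt.Println("unbuffered send returned only after the receive")
    }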

@superfell
Contributor

ApplyLog doesn't wait for the leaderLoop to read the item from the channel; it only waits until it has put the future on the channel. It currently appears to wait for the leaderLoop to read from the channel only because the channel is unbuffered.

@banks
Member


Yeah @superfell is exactly right.

Off the top of my head, we'd need to do something like this to restore that behaviour (a rough sketch follows the list):

  1. Add a context (or at least a done chan) to logFuture.
  2. When the leader loop receives the future, it needs to check whether it has already timed out (chan closed, context done, etc.).
  3. If it hasn't, the future gets included in the batch and we somehow signal back to the waiter that it has been acknowledged by the leader.
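
A rough sketch of what that could look like (pendingApply, leaderTake, and the field names below are invented for illustration, not a worked-out API):

    // The caller closes done when it gives up waiting; the leader closes
    // accepted once the entry joins a batch, and skips entries whose caller
    // has already gone away.
    type pendingApply struct {
        done     chan struct{} // closed by the caller on timeout
        accepted chan struct{} // closed by the leader when it takes the entry
    }

    func leaderTake(p *pendingApply) bool {
        select {
        case <-p.done:
            return false // caller already timed out; drop the entry
        default:
            close(p.accepted) // ack back to the waiting caller
            return true
        }
    }

The gap between the <-p.done check and close(p.accepted) is exactly the race window discussed next.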

The really subtle bit is making this not racy: right now, if the timeout timer fires, the select exits and we can be sure the logFuture never reaches the leader loop.

In the above proposal, there is a chance of a race between the leader checking that the future hasn't timed out, acking it, and then including it in the batch. In between those steps, the caller may have actually timed out and gone away, but the leader will still process the log and make the write.

On its own that's not necessarily terrible - in general, timeouts are not a guarantee that the operation failed. But it is a change of behaviour from the current API which maybe needs some thought?

@stapelberg I know this is an ancient PR so I'm not necessarily expecting you to pick this up again (although you are welcome), but I wanted to use GitHub's review signals to mark that there is an issue blocking merge here, for anyone who wants to pick this up in the future.

Thanks again for the discussions and contributions here, folks, even if the timeline is rather long!

@stapelberg
Contributor Author

I know this is an ancient PR so I'm not necessarily expecting you to pick this up again (although you are welcome)

I don’t have the time or motivation right now to pick this up again, so if anyone else wants to take it over, feel free to :)

@alecjen
Contributor

alecjen commented Feb 2, 2021

@banks I found that the proposed solution will never buffer applyCh for Apply operations issued from a single goroutine. Because Apply waits for an ack on the done channel, it is effectively blocked until the leaderLoop completes the current dispatchLogs operation.

You may have already known this, but I think we can consider at least exposing the option to buffer the applyCh, keeping in mind that we would be re-working the timeout assumptions in that case.
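
A hypothetical sketch of such an opt-in (newApplyCh and the bufferApply flag are invented names for illustration; this is not the shipped API):

    // When buffering is enabled, the old timeout semantics are traded for
    // throughput: the Apply timeout then only covers waiting for buffer space,
    // not the leader actually picking the entry up.
    func newApplyCh(bufferApply bool, maxAppendEntries int) chan *logFuture {
        if bufferApply {
            return make(chan *logFuture, maxAppendEntries)
        }
        return make(chan *logFuture) // unbuffered: the timeout covers leader pickup
    }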

@ncabatoff
Contributor

Closing as this was implemented in #445.

@ncabatoff closed this Jun 21, 2021