
buffer applyCh with up to conf.MaxAppendEntries #124

Closed
wants to merge 1 commit

Conversation

stapelberg
Contributor

This change improves throughput in busy Raft clusters.
By buffering applyCh, individual AppendEntries RPCs carry more Raft log entries.
In my tests, this improves throughput from about 4.5 kqps to about 5.5 kqps.

As-is:
n1-standard-8-c8deaa9d333f69fb56c8935036e7ca5c-no-buffer

With my change:
n1-standard-8-f2fed4ebe05df23eb322c805b503a144

(Both tests were performed with 3 n1-standard-8 nodes on Google Compute Engine in the europe-west1-d region.)

@stapelberg
Contributor Author

The Travis build failed because it ran into a timeout; I think restarting it will help.

@ongardie
Contributor

ongardie commented Jun 25, 2016

With the disclaimer that I might be completely wrong about Go's scheduler, this PR makes sense to me. The loop within leaderLoop tries to drain up to conf.MaxAppendEntries entries from the channel because it's more efficient to send a large batch of AppendEntries. With a buffered channel, it's easy to see that the leaderLoop will drain many entries from the channel. With an unbuffered channel, there could be multiple goroutines blocked trying to send on the channel, but I don't think Go makes any guarantee that a non-blocking receive in a loop will consume all of them; I think that depends on getting lucky with the scheduler.
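
For anyone not familiar with that code, here is a minimal sketch of the drain pattern being described (an illustration only, not the library's actual leaderLoop; drainBatch is an invented name, while logFuture and MaxAppendEntries come from the codebase):

    // After one blocking receive, opportunistically pull more futures without
    // blocking, up to maxAppendEntries per batch.
    func drainBatch(applyCh <-chan *logFuture, first *logFuture, maxAppendEntries int) []*logFuture {
        batch := []*logFuture{first}
    GROUP:
        for len(batch) < maxAppendEntries {
            select {
            case f := <-applyCh:
                batch = append(batch, f)
            default:
                // With an unbuffered applyCh there may be goroutines blocked on
                // send, but the scheduler gives no guarantee any of them are seen
                // here before the default case fires; a buffered channel makes
                // the queued work visible to this non-blocking receive.
                break GROUP
            }
        }
        return batch
    }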

There's some related discussion that I found helpful in this thread: https://groups.google.com/d/msg/golang-nuts/UnzE5vgyzqw/eZdhrhoWJQUJ

So +1 for merging.

@ongardie-sfdc

And I'm second-guessing myself on my previous comment.

cc @superfell

@schristoff
Contributor

Hey @stapelberg - thank you for this. Is there any way you could rebase onto master so CircleCI can take a stab at this?

@superfell
Contributor

One side effect of this change relates to how the timeout works in ApplyLog. The timeout covers putting the item on the channel, so with a buffered channel timeouts won't start happening until the channel buffer is full.

@stapelberg
Contributor Author

Rebased!

@hanshasselberg
Member

@superfell thanks for helping out! I don't understand which timeouts you are referring to. Could you try to explain that again? Thanks a lot!

@hanshasselberg
Member

@superfell All good now, I figured out which timeout you mean:

raft/api.go

Lines 656 to 684 in db5ceea

    // ApplyLog performs Apply but takes in a Log directly. The only values
    // currently taken from the submitted Log are Data and Extensions.
    func (r *Raft) ApplyLog(log Log, timeout time.Duration) ApplyFuture {
        metrics.IncrCounter([]string{"raft", "apply"}, 1)
        var timer <-chan time.Time
        if timeout > 0 {
            timer = time.After(timeout)
        }

        // Create a log future, no index or term yet
        logFuture := &logFuture{
            log: Log{
                Type:       LogCommand,
                Data:       log.Data,
                Extensions: log.Extensions,
            },
        }
        logFuture.init()

        select {
        case <-timer:
            return errorFuture{ErrEnqueueTimeout}
        case <-r.shutdownCh:
            return errorFuture{ErrRaftShutdown}
        case r.applyCh <- logFuture:
            return logFuture
        }
    }

But I still don't understand how the semantics are changing. Right now we start the timeout before we write to the channel, which means the time spent waiting for the leader to read it is counted.
That seems to be the same after this change: we can write to the channel, but we are still waiting, in much the same way, for the leader to read it.

Thanks!

@superfell
Contributor

It's timing how long it takes to write to the channel. If you change it to a buffered channel, the write can succeed immediately (if there's space) even though nothing has processed the future yet, and when it will get processed is still some indeterminate time in the future. With the unbuffered channel as it currently is, Apply can't write to the channel until the leader loop is ready to read from it. So what the timeout is measuring is different, and it gets more different the larger the channel buffer is.
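
A tiny self-contained illustration of that difference (a toy program, not library code): a send on a buffered channel returns as soon as there is space, while a send on an unbuffered channel returns only once a receiver takes the value, which is what the Apply timeout effectively measures today.

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        buffered := make(chan int, 8)
        unbuffered := make(chan int)

        // Succeeds immediately even though nothing has processed the value,
        // so a timeout raced against this send would effectively never fire.
        buffered <- 1
        fmt.Println("buffered send returned with no receiver")

        // Blocks until a receiver is ready, so a timeout raced against this
        // send measures how long the consumer took to pick the value up.
        done := make(chan struct{})
        go func() {
            time.Sleep(50 * time.Millisecond)
            fmt.Println("received:", <-unbuffered)
            close(done)
        }()
        unbuffered <- 2
        <-done
        fmt.Println("unbuffered send returned only after the receive")
    }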

@superfell
Contributor

ApplyLog doesn't wait for the leaderLoop to read the item from the channel; it only waits until it has put the future on the channel. It currently appears to wait for the leaderLoop to read from the channel only because the channel is unbuffered.

@banks
Member


Yeah @superfell is exactly right.

Off the top of my head, we'd need to do something like this to restore that behaviour (a rough sketch follows the list):

  1. Add a context (or at least a done chan) to logFuture.
  2. When the leader loop receives the future, it needs to check whether it has already timed out (chan closed, context done, etc.).
  3. If it hasn't, the future gets included in the batch and we somehow signal back to the waiter that it has been acknowledged by the leader.
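
A rough sketch of what that could look like (pendingApply, leaderTake, and the field names below are invented for illustration, not a worked-out API):

    // The caller closes done when it gives up waiting; the leader closes
    // accepted once the entry joins a batch, and skips entries whose caller
    // has already gone away.
    type pendingApply struct {
        done     chan struct{} // closed by the caller on timeout
        accepted chan struct{} // closed by the leader when it takes the entry
    }

    func leaderTake(p *pendingApply) bool {
        select {
        case <-p.done:
            return false // caller already timed out; drop the entry
        default:
            close(p.accepted) // ack back to the waiting caller
            return true
        }
    }

The gap between the <-p.done check and close(p.accepted) is exactly the race window discussed next.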

The really subtle bit is making this not racy: right now, if the timeout timer fires, the select exits and we can be sure the logFuture never reaches the leader loop.

In the above proposal, there is a chance of a race between the leader checking that the future hasn't timed out, acking it, and then including it in the batch. In between those steps, the caller may have actually timed out and gone away, but the leader will still process the log and make the write.

On its own that's not necessarily terrible - in general, timeouts are not a guarantee that the operation failed. But it is a change of behaviour from the current API which maybe needs some thought?

@stapelberg I know this is an ancient PR so I'm not necessarily expecting you to pick this up again (although you are welcome), but I wanted to use GitHub's review signals to mark that there is an issue blocking merge here, for anyone who wants to pick this up in the future.

Thanks again for the discussions and contributions here, folks, even if the timeline is rather long!

@stapelberg
Contributor Author

I know this is an ancient PR so I'm not necessarily expecting you to pick this up again (although you are welcome)

I don’t have the time or motivation right now to pick this up again, so if anyone else wants to take it over, feel free to :)

@alecjen
Contributor

alecjen commented Feb 2, 2021

@banks I found that the proposed solution will never buffer applyCh for Apply operations issued from a single goroutine. Because Apply waits for an ack on the done channel, it is effectively blocked until the leaderLoop completes the current dispatchLogs operation.

You may have already known this, but I think we can consider at least exposing the option to buffer the applyCh, keeping in mind that we would be re-working the timeout assumptions in that case.
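
A hypothetical sketch of such an opt-in (newApplyCh and the bufferApply flag are invented names for illustration; this is not the shipped API):

    // When buffering is enabled, the old timeout semantics are traded for
    // throughput: the Apply timeout then only covers waiting for buffer space,
    // not the leader actually picking the entry up.
    func newApplyCh(bufferApply bool, maxAppendEntries int) chan *logFuture {
        if bufferApply {
            return make(chan *logFuture, maxAppendEntries)
        }
        return make(chan *logFuture) // unbuffered: the timeout covers leader pickup
    }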

@ncabatoff
Contributor

Closing as this was implemented in #445.

@ncabatoff closed this Jun 21, 2021