Sending data to new nodes is very slow #1180

Closed
srh opened this issue Jul 12, 2017 · 10 comments

srh commented Jul 12, 2017

There's a chance this is caused by the same underlying problem as #1168.

(Edit: Actually, this might not involve sending a "snapshot" -- I haven't verified that it actually updated using the snapshot mechanism.)

Anyway, if you make a 3-node cluster, kill a node, load a bunch of data, and then revive the node (as described in #1169 (comment)), it takes a very long time to bring the revived node up to speed.

It takes something on the order of 1m to 1m30s to load the data initially, and then 15 minutes to catch the revived node up.

While it's catching up, the CPU usage on any given core is at most 5%, maybe 10%. It seems like the slowness comes from too much idling.

Another peculiar thing is the high disk usage of the revived node (svr3) -- the dataset was 1million.rdf.gz:

60M	/home/srh/blah/svr1/w
195M	/home/srh/blah/svr1/p
255M	/home/srh/blah/svr1
60M	/home/srh/blah/svr2/w
181M	/home/srh/blah/svr2/p
241M	/home/srh/blah/svr2
60M	/home/srh/blah/svr3/w
380M	/home/srh/blah/svr3/p
440M	/home/srh/blah/svr3

The actual contents:

../svr1/p:
total 199048
-rw-r--r-- 1 srh srh 203817056 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh         0 Jul 11 23:46 clog

../svr1/w:
total 61304
-rw-r--r-- 1 srh srh 62770134 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh        0 Jul 11 23:46 clog

../svr2/p:
total 184968
-rw-r--r-- 1 srh srh 189401412 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh         0 Jul 11 23:46 clog

../svr2/w:
total 61304
-rw-r--r-- 1 srh srh 62769560 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh        0 Jul 11 23:46 clog

../svr3/p:
total 388188
-rw-r--r-- 1 srh srh 397493236 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh       165 Jul 11 23:46 000001.sst
-rw-r--r-- 1 srh srh         0 Jul 11 23:49 clog

../svr3/w:
total 61304
-rw-r--r-- 1 srh srh 62770533 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh        0 Jul 11 23:49 clog

Then (after idling for a while) upon killing the processes, each server dumped a 38239904-byte sst file into p/.

Edit: I'll investigate whether it's specific to idling.

@srh srh self-assigned this Jul 12, 2017
@srh srh changed the title Sending snapshot to new nodes is very slow Sending data to new nodes is very slow Jul 12, 2017
@manishrjain manishrjain added the investigate Requires further investigation label Jul 13, 2017
@manishrjain manishrjain added this to the v0.8 milestone Jul 13, 2017

srh commented Jul 13, 2017

It's not sending a snapshot; it's sending individual Raft log entries. That's because a snapshot was never created: the default value of MaxPendingCount is 1000, but we only went through 850 or so Raft indexes while loading the data (each index batches multiple writes).

It is sending Raft log entries much more slowly from server 1 -> 3 than it did from server 1 -> 2 when initially loading the data.

By setting MaxPendingCount to 50, we can trigger the actual Raft snapshot retrieval code. It does its job very quickly by comparison.

Right now we create a snapshot at 1-minute intervals, but only if 1000 Raft log entries have accumulated. I guess the idea is that 1000 log entries should be a relatively small amount to catch up with; in practice it turns out to be quite large. But I still need to figure out why catch-up is so much slower.
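
For concreteness, here is a minimal sketch of that policy (the names, the loop, and the callbacks are hypothetical, not Dgraph's actual code): a snapshot is attempted once a minute, and skipped unless enough entries have accumulated since the last one.

```go
package snapshots

import "time"

const MaxPendingCount = 1000 // default threshold discussed in this issue

// snapshotLoop calls snapshotTo whenever enough new Raft entries have piled
// up. With a 1000-entry threshold, a load that only produces ~850 Raft
// indexes (each batching multiple writes) never triggers a snapshot at all.
func snapshotLoop(appliedIndex func() uint64, snapshotTo func(idx uint64)) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	var lastSnapshotIndex uint64
	for range ticker.C {
		idx := appliedIndex()
		if idx-lastSnapshotIndex >= MaxPendingCount {
			snapshotTo(idx)
			lastSnapshotIndex = idx
		}
	}
}
```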

srh commented Jul 13, 2017

RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:37 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:37 draft.go:771: Committed entries (group 1) index 39
2017/07/13 14:32:38 draft.go:766: CommittedEntries (group 0) size 0
2017/07/13 14:32:38 draft.go:766: CommittedEntries (group 1) size 0
RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:38 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:38 draft.go:771: Committed entries (group 1) index 40
2017/07/13 14:32:39 draft.go:766: CommittedEntries (group 0) size 0
2017/07/13 14:32:39 draft.go:766: CommittedEntries (group 1) size 0
RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:39 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:39 draft.go:771: Committed entries (group 1) index 41
2017/07/13 14:32:40 draft.go:766: CommittedEntries (group 1) size 0
2017/07/13 14:32:40 draft.go:766: CommittedEntries (group 0) size 0
RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:40 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:40 draft.go:771: Committed entries (group 1) index 42
2017/07/13 14:32:41 draft.go:766: CommittedEntries (group 0) size 0
2017/07/13 14:32:41 draft.go:766: CommittedEntries (group 1) size 0
RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:41 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:41 draft.go:771: Committed entries (group 1) index 43
2017/07/13 14:32:42 draft.go:766: CommittedEntries (group 1) size 0
2017/07/13 14:32:42 draft.go:766: CommittedEntries (group 0) size 0
RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:42 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:42 draft.go:771: Committed entries (group 1) index 44
2017/07/13 14:32:43 draft.go:766: CommittedEntries (group 1) size 0
2017/07/13 14:32:43 draft.go:766: CommittedEntries (group 0) size 0

It seems like it's sending exactly one Raft index per second.

I think the most likely culprit is that we call .Tick() once per second, and etcd for whatever reason can't send old log entries more than once per tick.

Edit: Confirmed it's tied to the ticker frequency (by reducing the tick interval to 1/10 of a second and seeing an exactly 10x speedup).
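
Here's a sketch of the tick loop in question, assuming the usual etcd/raft pattern rather than Dgraph's exact code: with a 1-second ticker (and a heartbeat of one tick), a follower stuck in probe state gets at most one MsgApp per tick, which matches the one-index-per-second log above.

```go
package raftticker

import (
	"time"

	"github.com/coreos/etcd/raft"
)

// runTicker drives the raft state machine's clock. Heartbeats, election
// timeouts, and (for a follower in probe state) MsgApp resends are all paced
// by this interval, so shrinking it from 1s to 100ms speeds catch-up ~10x.
func runTicker(n raft.Node, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		n.Tick()
	}
}
```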

srh commented Jul 13, 2017

Possibly it's because the heartbeat interval was 1 ticker unit, and the follower was in Probe state the entire time. (See etcd/raft/design.md) Will investigate further.
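
For reference, the knobs involved live in etcd's raft.Config; here is a sketch with illustrative values (not necessarily what Dgraph sets):

```go
package raftconfig

import "github.com/coreos/etcd/raft"

// HeartbeatTick and ElectionTick are counted in Tick() calls, so with a
// 1-second ticker and HeartbeatTick=1, heartbeats fire once per second, and a
// follower held in probe state receives at most one MsgApp (of at most
// MaxSizePerMsg bytes) per heartbeat.
func newConfig(id uint64, storage *raft.MemoryStorage) *raft.Config {
	return &raft.Config{
		ID:              id,
		ElectionTick:    10, // election timeout, in ticks
		HeartbeatTick:   1,  // heartbeat every tick
		Storage:         storage,
		MaxSizePerMsg:   1 << 20, // the "maxmsgsize" discussed below
		MaxInflightMsgs: 256,     // the MaxInflight mentioned later in the thread
	}
}
```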

srh commented Jul 14, 2017

It is indeed sending once per heartbeat interval, not once per tick interval. So it looks like the sender is in probe state.

By tweaking log messages I can see that each MsgAppResp comes with Reject = false, and that message is definitely getting passed to the Raft node's Step function. So etcd should be switching from probe to replicate state. Edit: to be clear, apparently it isn't.
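
One way to check the leader's view of the follower is to inspect Progress via Status() (a diagnostic sketch against the coreos/etcd raft API, not Dgraph code); State should flip from ProgressStateProbe to ProgressStateReplicate once a non-rejecting MsgAppResp has been stepped in:

```go
package raftdebug

import (
	"log"

	"github.com/coreos/etcd/raft"
)

// logFollowerProgress logs the leader's bookkeeping for one follower. Progress
// is only populated on the leader.
func logFollowerProgress(n raft.Node, followerID uint64) {
	st := n.Status()
	if pr, ok := st.Progress[followerID]; ok {
		log.Printf("follower %d: state=%v match=%d next=%d",
			followerID, pr.State, pr.Match, pr.Next)
	}
}
```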

srh commented Jul 14, 2017

This behavior is also present in release/v0.7.7.

manishrjain commented Jul 14, 2017

It would make sense to file an issue against coreos/etcd and mention this behavior of Raft. They have probably seen this problem already.

Most likely, we do need to tick more often (I think they tick every 10ms or so). IIRC, each tick is one heartbeat.

janardhan1993 commented

@srh, @manishrjain The maxmsgsize is causing this issue.

srh commented Jul 17, 2017

maxmsgsize, combined with etcd only sending one MsgApp every heartbeat, which is currently once per second.

Generally speaking, it looks like etcd assumes that heartbeats happen much faster than Raft operations do. It seems like it's not designed for applications that frequently create Raft log entries. Or it's designed for more frequent snapshots.

An example of a very simple change to etcd which brings peers up much, much faster (with no tie to heartbeat rate) is e41e005. That is still terrible; it's just an example. (It's actually quite a bad patch -- it ping-pongs messages, so each time a heartbeat happens, a new MsgApp/MsgAppResp/MsgApp/MsgAppResp ping-pong chain gets initiated, until we reach the limit of MaxInflight, which has the value 256. It works poorly if your latency is high.)

So, what's the right thing to do here? Making the tick interval and heartbeat interval 5ms seems okay in general, if you only have a few Raft groups (so the heartbeat message overhead stays low), and we could make the leadership election timeout a larger number of ticks.

Now suppose we increase MaxMsgSize to 10 MB. You'd think: ah, 10 MB * 200 messages/second = 2000 MB/second, so etcd's loop is no longer arbitrarily limiting our bandwidth, right? We could max out a 20Gbit/second network interface.

The problem is that every 5 ms we'll run go n.sendMessage(to, data). Every 5 ms we'll start a new RaftMessage gRPC call, which makes a new HTTP2 stream and begins sending the message over that stream. (Messages for a given stream get cut up into separate frames of at most 16KB each.) So on network interfaces with limited bandwidth, we'd end up sending a whole ton of 10MB RaftMessages simultaneously. Presumably this gets limited to MaxInflight, which is 256, so we'd spend up to 2.56 GB of memory (for each Raft group) buffering these messages, and then try to apply them quickly while more come in. On top of that, doSendMessage has a 1-second timeout, so we'll get a bunch of aborted messages.
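
The back-of-the-envelope numbers above, spelled out (nothing here is Dgraph-specific; it's just the arithmetic):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		tickInterval = 5 * time.Millisecond
		maxMsgSize   = 10 << 20 // 10 MB
		maxInflight  = 256
	)
	msgsPerSec := int(time.Second / tickInterval) // 200 MsgApps/second
	fmt.Printf("peak send rate: %d MB/s\n", msgsPerSec*maxMsgSize/(1<<20)) // 2000 MB/s
	fmt.Printf("max buffered:   %d MB\n", maxInflight*maxMsgSize/(1<<20))  // 2560 MB per group
}
```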

So what should we do?

Option 1: Snapshot more frequently. Our snapshots are cheap to make but super-expensive to send. (And we don't really send the snapshot; we send the latest state of the db, on top of which we can replay a sequence of older log entries and still converge to the up-to-date state.) But snapshots are still faster to send than having etcd send log entries.

Option 2: Support incremental snapshots. (You tell a peer, "I am up to date for Raft index N, please send info to update me to the latest Raft index.") Right now we don't have that, but we could; there are lots of ways to do it. In the long run, that might happen.

Option 3: Change etcd's sending of log entries so that it's what you'd get if you decided to write a bunch of log entries to a stream in a loop. In other words, don't send one entry per heartbeat; instead write them in a loop on a gRPC stream (which gets flow control from gRPC and TCP), as in the sketch below. We just have to make sure we don't concurrently access things in etcd that we shouldn't be accessing.
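
A rough sketch of the Option 3 idea, with a hypothetical RaftEntryStream interface standing in for the actual gRPC stream (this is not Dgraph's real RPC surface): push pending entries in a loop and let gRPC/TCP flow control pace the transfer instead of the heartbeat timer.

```go
package streaming

import "github.com/coreos/etcd/raft/raftpb"

// RaftEntryStream stands in for a gRPC client stream's Send method
// (hypothetical; named here only for illustration).
type RaftEntryStream interface {
	Send(*raftpb.Entry) error
}

func streamEntries(stream RaftEntryStream, entries []raftpb.Entry) error {
	for i := range entries {
		// Send blocks when the stream's flow-control window is full, so this
		// loop naturally adapts to the receiver and the network.
		if err := stream.Send(&entries[i]); err != nil {
			return err
		}
	}
	return nil
}
```

The point is that backpressure comes from Send blocking, not from heartbeat pacing or from buffering up to MaxInflight messages.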

srh commented Jul 19, 2017

I'm going with option 3. Part of the reason is that we will need this anyway: we need feedback on our rate of writes across the cluster, so that we can stop clients from sending us a storm of writes and keep write buffering (and its memory usage) to a minimum. This would be just one part of that feedback chain.

srh commented Jul 28, 2017

Never mind, we aren't doing option 3 now. We alleviated this problem with #1269.

@srh srh closed this as completed Jul 28, 2017
@manishrjain manishrjain added the investigate Requires further investigation label Mar 22, 2018