Sending data to new nodes is very slow #1180

Closed
srh opened this issue Jul 12, 2017 · 10 comments

srh commented Jul 12, 2017

There's a chance this is caused by the same underlying problem as #1168.

(Edit: Actually, this might not involve sending a "snapshot" -- I haven't verified that it actually updated using the snapshot mechanism.)

Anyway, if you make a 3-node cluster, kill a node, load a bunch of data, and then revive the node (as described in #1169 (comment)), it takes a very long time to bring the revived node up to speed.

It takes something on the order of 1m to 1m30s to load the data initially, and then 15 minutes to catch the revived node up.

While it's catching up, the CPU usage on any given core is at most 5%, maybe 10%. It seems like the slowness comes from too much idling.

Another peculiar thing is the high disk usage of the revived node (svr3) -- the dataset was 1million.rdf.gz:

60M	/home/srh/blah/svr1/w
195M	/home/srh/blah/svr1/p
255M	/home/srh/blah/svr1
60M	/home/srh/blah/svr2/w
181M	/home/srh/blah/svr2/p
241M	/home/srh/blah/svr2
60M	/home/srh/blah/svr3/w
380M	/home/srh/blah/svr3/p
440M	/home/srh/blah/svr3

The actual contents:

../svr1/p:
total 199048
-rw-r--r-- 1 srh srh 203817056 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh         0 Jul 11 23:46 clog

../svr1/w:
total 61304
-rw-r--r-- 1 srh srh 62770134 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh        0 Jul 11 23:46 clog

../svr2/p:
total 184968
-rw-r--r-- 1 srh srh 189401412 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh         0 Jul 11 23:46 clog

../svr2/w:
total 61304
-rw-r--r-- 1 srh srh 62769560 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh        0 Jul 11 23:46 clog

../svr3/p:
total 388188
-rw-r--r-- 1 srh srh 397493236 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh       165 Jul 11 23:46 000001.sst
-rw-r--r-- 1 srh srh         0 Jul 11 23:49 clog

../svr3/w:
total 61304
-rw-r--r-- 1 srh srh 62770533 Jul 12 00:05 000000.vlog
-rw-r--r-- 1 srh srh        0 Jul 11 23:49 clog

Then (after idling for a while) upon killing the processes, each server dumped a 38239904-byte sst file into p/.

Edit: I'll investigate whether it's specific to idling.

@srh srh self-assigned this Jul 12, 2017
@srh srh changed the title Sending snapshot to new nodes is very slow Sending data to new nodes is very slow Jul 12, 2017
@manishrjain manishrjain added the investigate Requires further investigation label Jul 13, 2017
@manishrjain manishrjain added this to the v0.8 milestone Jul 13, 2017

srh commented Jul 13, 2017

It's not sending a snapshot; it's sending individual Raft log entries. That's because a snapshot was never created: the default value of MaxPendingCount is 1000, but we only went through 850 or so Raft indexes while loading the data (each index batches multiple writes).

It is sending Raft log entries much more slowly from server 1 -> 3 than it did from server 1 -> 2 when initially loading the data.

By setting MaxPendingCount to 50, we can trigger the actual Raft snapshot retrieval code. It does its job very quickly by comparison.

Right now we create a snapshot at 1-minute intervals, but only if 1000 Raft log entries have accumulated. I guess the idea is that 1000 log entries should be a relatively small amount to catch up with; in practice it turns out to be quite large. But I still need to figure out why catch-up is so much slower.
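
For concreteness, here is a minimal sketch of that policy (the names, the loop, and the callbacks are hypothetical, not Dgraph's actual code): a snapshot is attempted once a minute, and skipped unless enough entries have accumulated since the last one.

```go
package snapshots

import "time"

const MaxPendingCount = 1000 // default threshold discussed in this issue

// snapshotLoop calls snapshotTo whenever enough new Raft entries have piled
// up. With a 1000-entry threshold, a load that only produces ~850 Raft
// indexes (each batching multiple writes) never triggers a snapshot at all.
func snapshotLoop(appliedIndex func() uint64, snapshotTo func(idx uint64)) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	var lastSnapshotIndex uint64
	for range ticker.C {
		idx := appliedIndex()
		if idx-lastSnapshotIndex >= MaxPendingCount {
			snapshotTo(idx)
			lastSnapshotIndex = idx
		}
	}
}
```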

srh commented Jul 13, 2017

RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:37 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:37 draft.go:771: Committed entries (group 1) index 39
2017/07/13 14:32:38 draft.go:766: CommittedEntries (group 0) size 0
2017/07/13 14:32:38 draft.go:766: CommittedEntries (group 1) size 0
RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:38 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:38 draft.go:771: Committed entries (group 1) index 40
2017/07/13 14:32:39 draft.go:766: CommittedEntries (group 0) size 0
2017/07/13 14:32:39 draft.go:766: CommittedEntries (group 1) size 0
RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:39 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:39 draft.go:771: Committed entries (group 1) index 41
2017/07/13 14:32:40 draft.go:766: CommittedEntries (group 1) size 0
2017/07/13 14:32:40 draft.go:766: CommittedEntries (group 0) size 0
RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:40 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:40 draft.go:771: Committed entries (group 1) index 42
2017/07/13 14:32:41 draft.go:766: CommittedEntries (group 0) size 0
2017/07/13 14:32:41 draft.go:766: CommittedEntries (group 1) size 0
RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:41 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:41 draft.go:771: Committed entries (group 1) index 43
2017/07/13 14:32:42 draft.go:766: CommittedEntries (group 1) size 0
2017/07/13 14:32:42 draft.go:766: CommittedEntries (group 0) size 0
RECEIVED: MsgApp 1-->3
		SENDING: MsgAppResp 3-->1
2017/07/13 14:32:42 draft.go:766: CommittedEntries (group 1) size 1
2017/07/13 14:32:42 draft.go:771: Committed entries (group 1) index 44
2017/07/13 14:32:43 draft.go:766: CommittedEntries (group 1) size 0
2017/07/13 14:32:43 draft.go:766: CommittedEntries (group 0) size 0

It seems like it's sending exactly one Raft index per second.

I think the most likely culprit is that we call .Tick() once per second, and etcd for whatever reason can't send old log entries more than once per tick.

Edit: Confirmed it's tied to the ticker frequency (by reducing the tick interval to 1/10 of a second and seeing an exactly 10x speedup).
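
Here's a sketch of the tick loop in question, assuming the usual etcd/raft pattern rather than Dgraph's exact code: with a 1-second ticker (and a heartbeat of one tick), a follower stuck in probe state gets at most one MsgApp per tick, which matches the one-index-per-second log above.

```go
package raftticker

import (
	"time"

	"github.com/coreos/etcd/raft"
)

// runTicker drives the raft state machine's clock. Heartbeats, election
// timeouts, and (for a follower in probe state) MsgApp resends are all paced
// by this interval, so shrinking it from 1s to 100ms speeds catch-up ~10x.
func runTicker(n raft.Node, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		n.Tick()
	}
}
```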

srh commented Jul 13, 2017

Possibly it's because the heartbeat interval was 1 ticker unit, and the follower was in Probe state the entire time. (See etcd/raft/design.md) Will investigate further.
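
For reference, the knobs involved live in etcd's raft.Config; here is a sketch with illustrative values (not necessarily what Dgraph sets):

```go
package raftconfig

import "github.com/coreos/etcd/raft"

// HeartbeatTick and ElectionTick are counted in Tick() calls, so with a
// 1-second ticker and HeartbeatTick=1, heartbeats fire once per second, and a
// follower held in probe state receives at most one MsgApp (of at most
// MaxSizePerMsg bytes) per heartbeat.
func newConfig(id uint64, storage *raft.MemoryStorage) *raft.Config {
	return &raft.Config{
		ID:              id,
		ElectionTick:    10, // election timeout, in ticks
		HeartbeatTick:   1,  // heartbeat every tick
		Storage:         storage,
		MaxSizePerMsg:   1 << 20, // the "maxmsgsize" discussed below
		MaxInflightMsgs: 256,     // the MaxInflight mentioned later in the thread
	}
}
```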

srh commented Jul 14, 2017

It is indeed sending once per heartbeat interval, not once per tick interval. So it looks like the sender is in probe state.

By tweaking log messages I can see that each MsgAppResp comes with Reject = false, and that message is definitely getting passed to the Raft node's Step function. So etcd should be switching from probe to replicate state. Edit: to be clear, apparently it isn't.
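
One way to check the leader's view of the follower is to inspect Progress via Status() (a diagnostic sketch against the coreos/etcd raft API, not Dgraph code); State should flip from ProgressStateProbe to ProgressStateReplicate once a non-rejecting MsgAppResp has been stepped in:

```go
package raftdebug

import (
	"log"

	"github.com/coreos/etcd/raft"
)

// logFollowerProgress logs the leader's bookkeeping for one follower. Progress
// is only populated on the leader.
func logFollowerProgress(n raft.Node, followerID uint64) {
	st := n.Status()
	if pr, ok := st.Progress[followerID]; ok {
		log.Printf("follower %d: state=%v match=%d next=%d",
			followerID, pr.State, pr.Match, pr.Next)
	}
}
```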

srh commented Jul 14, 2017

This behavior is also present in release/v0.7.7.

manishrjain commented Jul 14, 2017

It would make sense to file an issue against coreos/etcd and mention this behavior of Raft. They have probably seen this problem already.

Most likely, we do need to tick more often (I think they tick every 10ms or so). IIRC, each tick is one heartbeat.

janardhan1993 commented

@srh, @manishrjain The maxmsgsize is causing this issue.

srh commented Jul 17, 2017

maxmsgsize, combined with etcd only sending one MsgApp every heartbeat, which is currently once per second.

Generally speaking, it looks like etcd assumes that heartbeats happen much faster than Raft operations do. It seems like it's not designed for applications that frequently create Raft log entries. Or it's designed for more frequent snapshots.

An example of a very simple change to etcd which brings peers up much, much faster (with no tie to heartbeat rate) is e41e005. That is still terrible; it's just an example. (It's actually quite a bad patch -- it ping-pongs messages, so each time a heartbeat happens, a new MsgApp/MsgAppResp/MsgApp/MsgAppResp ping-pong chain gets initiated, until we reach the limit of MaxInflight, which has the value 256. It works poorly if your latency is high.)

So, what's the right thing to do here? Making the tick interval and heartbeat interval 5ms seems okay in general, if you only have a few Raft groups (so the heartbeat message overhead stays low), and we could make the leadership election timeout a larger number of ticks.

Now suppose we increase MaxMsgSize to 10 MB. You'd think: ah, 10 MB * 200 messages/second = 2000 MB/second, so etcd's loop is no longer arbitrarily limiting our bandwidth, right? We could max out a 20Gbit/second network interface.

The problem is that every 5 ms we'll run go n.sendMessage(to, data). Every 5 ms we'll start a new RaftMessage gRPC call, which makes a new HTTP2 stream and begins sending the message over that stream. (Messages for a given stream get cut up into separate frames of at most 16KB each.) So on network interfaces with limited bandwidth, we'd end up sending a whole ton of 10MB RaftMessages simultaneously. Presumably this gets limited to MaxInflight, which is 256, so we'd spend up to 2.56 GB of memory (for each Raft group) buffering these messages, and then try to apply them quickly while more come in. On top of that, doSendMessage has a 1-second timeout, so we'll get a bunch of aborted messages.
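
The back-of-the-envelope numbers above, spelled out (nothing here is Dgraph-specific; it's just the arithmetic):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		tickInterval = 5 * time.Millisecond
		maxMsgSize   = 10 << 20 // 10 MB
		maxInflight  = 256
	)
	msgsPerSec := int(time.Second / tickInterval) // 200 MsgApps/second
	fmt.Printf("peak send rate: %d MB/s\n", msgsPerSec*maxMsgSize/(1<<20)) // 2000 MB/s
	fmt.Printf("max buffered:   %d MB\n", maxInflight*maxMsgSize/(1<<20))  // 2560 MB per group
}
```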

So what should we do?

Option 1: Snapshot more frequently. Our snapshots are cheap to make but super-expensive to send. (And we don't really send the snapshot; we send the latest state of the db, on top of which we can replay a sequence of older log entries and still converge to the up-to-date state.) But snapshots are still faster to send than having etcd send log entries.

Option 2: Support incremental snapshots. (You tell a peer, "I am up to date for Raft index N, please send info to update me to the latest Raft index.") Right now we don't have that, but we could; there are lots of ways to do it. In the long run, that might happen.

Option 3: Change etcd's sending of log entries so that it's what you'd get if you decided to write a bunch of log entries to a stream in a loop. In other words, don't send one entry per heartbeat; instead write them in a loop on a gRPC stream (which gets flow control from gRPC and TCP), as in the sketch below. We just have to make sure we don't concurrently access things in etcd that we shouldn't be accessing.
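
A rough sketch of the Option 3 idea, with a hypothetical RaftEntryStream interface standing in for the actual gRPC stream (this is not Dgraph's real RPC surface): push pending entries in a loop and let gRPC/TCP flow control pace the transfer instead of the heartbeat timer.

```go
package streaming

import "github.com/coreos/etcd/raft/raftpb"

// RaftEntryStream stands in for a gRPC client stream's Send method
// (hypothetical; named here only for illustration).
type RaftEntryStream interface {
	Send(*raftpb.Entry) error
}

func streamEntries(stream RaftEntryStream, entries []raftpb.Entry) error {
	for i := range entries {
		// Send blocks when the stream's flow-control window is full, so this
		// loop naturally adapts to the receiver and the network.
		if err := stream.Send(&entries[i]); err != nil {
			return err
		}
	}
	return nil
}
```

The point is that backpressure comes from Send blocking, not from heartbeat pacing or from buffering up to MaxInflight messages.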

srh commented Jul 19, 2017

I'm going with option 3. Part of the reason is that we will need this anyway: we need feedback on our rate of writes across the cluster, so that we can stop clients from sending us a storm of writes and keep write buffering (and its memory usage) to a minimum. This would be just one part of that feedback chain.

srh commented Jul 28, 2017

Never mind, we aren't doing option 3 now. We alleviated this problem with #1269.

@srh srh closed this as completed Jul 28, 2017
@manishrjain manishrjain added the investigate Requires further investigation label Mar 22, 2018