Sending data to new nodes is very slow #1180
It's not sending a snapshot; it's sending individual Raft log entries. No snapshot was created because the default value of MaxPendingCount is 1000, and we only went through 850 or so Raft indexes while loading the data (with multiple writes per index). Sending Raft log entries from server 1 -> 3 is much slower than the initial data load from server 1 -> 2 was. With MaxPendingCount set to 50, we can trigger the actual Raft snapshot retrieval code, which does its job very quickly by comparison. Right now we create a snapshot at 1-minute intervals, but only if 1000 Raft log entries have been created. I guess the idea is that 1000 log entries is a relatively small number to catch up with; it turns out to be a relatively large number. But I still need to figure out why catch-up is so much slower.
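As a minimal sketch of the trigger described above (assuming a time-plus-threshold check), the Go snippet below shows the shape of the logic: snapshot at most once per minute, and only once MaxPendingCount entries have accumulated. All names here (raftState, Applied, SnapshotIndex, CreateSnapshot, runSnapshotLoop) are hypothetical illustrations, not Dgraph's actual API.

```go
package snapshotter

import (
	"log"
	"time"
)

// raftState is a hypothetical stand-in for the node state the trigger needs;
// none of these names are Dgraph's actual API.
type raftState interface {
	Applied() uint64                // highest applied Raft index
	SnapshotIndex() uint64          // Raft index covered by the last snapshot
	CreateSnapshot(at uint64) error // snapshot everything up to the given index
}

// Lowering this (e.g. to 50) makes the snapshot path fire much sooner.
const maxPendingCount = 1000

// runSnapshotLoop mirrors the trigger described above: at most one snapshot
// per minute, and only if at least maxPendingCount entries have accumulated
// since the last snapshot. Otherwise a lagging peer is caught up by replaying
// individual log entries, which is the slow path observed in this issue.
func runSnapshotLoop(s raftState) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		if s.Applied()-s.SnapshotIndex() < maxPendingCount {
			continue
		}
		if err := s.CreateSnapshot(s.Applied()); err != nil {
			log.Printf("snapshot at index %d failed: %v", s.Applied(), err)
		}
	}
}
```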
It seems like it's sending exactly one Raft index per second. I think the most likely culprit is that we call .Tick() once per second, and etcd, for whatever reason, can't send old log entries more than once per tick. Edit: Confirmed it's tied to the ticker frequency (reducing the tick interval to 1/10 of a second gives an exactly 10x speedup).
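A minimal sketch of the tick-driving loop in question, assuming the usual time.Ticker arrangement around etcd's raft.Node (a real loop would also handle the Ready() channel and message sending, omitted here). It illustrates why the ticker interval puts a ceiling on how fast a lagging peer can be caught up.

```go
package raftclock

import (
	"time"

	"github.com/coreos/etcd/raft"
)

// driveTicks advances etcd's Raft clock. etcd's raft.Node only moves its
// time-based logic forward (heartbeats, elections, and, for a follower in
// probe state, the next MsgApp) when Tick() is called, so this interval
// bounds the catch-up rate. A 1-second interval matches the behavior above;
// 100ms gave roughly a 10x speedup.
func driveTicks(n raft.Node, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		n.Tick()
	}
}
```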
Possibly it's because the heartbeat interval is one tick, and the follower was in probe state the entire time (see etcd/raft/design.md). Will investigate further.
It is indeed sending once per heartbeat interval, not once per tick interval. So it looks like the sender is in probe state. By tweaking log messages I can see that each MsgAppResp comes with Reject = false. That message is definitely getting passed to the Raft node's …
This behavior is also present in …
It would make sense to file an issue against coreos/etcd and mention this Raft catch-up behavior; they have probably seen this problem already. Most likely, we do need to tick more often (I think they're probably ticking every 10ms or something). IIRC, each tick is one heartbeat.
@srh, @manishrjain The maxmsgsize is causing this issue.
maxmsgsize, combined with etcd only sending one MsgApp per heartbeat, which is currently once per second. Generally speaking, it looks like etcd assumes that heartbeats happen much faster than Raft operations do. It seems like it's not designed for applications that frequently create Raft log entries, or it's designed for more frequent snapshots.

An example of a very simple change to etcd which brings peers up much, much faster (with no tie to the heartbeat rate) is e41e005. That is still terrible; it's just an example. (It's actually quite a bad patch -- it ping-pongs messages, so each time a heartbeat happens, a new MsgApp/MsgAppResp/MsgApp/MsgAppResp ping-pong chain gets initiated, until we reach the limit of MaxInflight, which has the value 256. It works poorly if your latency is high.)

So, what's the right thing to do here? Making the tick interval and heartbeat timeout 5ms seems okay in general, if you only have a few Raft groups (so low heartbeat message overhead), and we could make the leadership election timeout a bigger number of ticks. Now suppose we increase MaxMsgSize to 10 MB. You'd think: 10 MB * 200 messages/second = 2000 MB/second, so etcd's loop is no longer arbitrarily limiting our bandwidth, right? We could max out a 20 Gbit/second network interface. The problem here is what happens every 5 ms, when we'll run …

So what should we do?

Option 1: Snapshot more frequently. Our snapshots are cheap to make but super-expensive to send. (And we don't really send the snapshot; we send the latest state of the db, on top of which we can replay a sequence of older log entries and still converge to the up-to-date state.) But snapshots are faster to send than having etcd send log entries.

Option 2: Let us send incremental snapshots. (You tell a peer, "I am up to date through Raft index N, please send info to update me to the latest Raft index.") Right now we don't have that, but we could; there are lots of ways we could do it. In the long run, that might happen.

Option 3: Change etcd's sending of log entries so that it's what you'd get if you decided to write a bunch of log entries on a stream, in a loop. In other words, don't send one per heartbeat; basically write them in a loop on a gRPC stream (which undergoes flow control through gRPC and TCP). We just have to make sure we don't concurrently access things in etcd we shouldn't be accessing.
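To make the knobs concrete, here is a hedged sketch of an etcd raft.Config with the settings discussed above (5ms ticks assumed to be driven externally, heartbeat every tick, a large ElectionTick, MaxSizePerMsg as the "maxmsgsize", and MaxInflightMsgs = 256). The values are illustrative only, not a recommendation or Dgraph's actual configuration.

```go
package raftconfig

import "github.com/coreos/etcd/raft"

// newRaftConfig shows roughly where the knobs discussed above live.
// Illustrative values only; it assumes Tick() is driven every 5ms elsewhere.
func newRaftConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:            id,
		Storage:       storage,
		HeartbeatTick: 1,   // heartbeat on every tick; with 5ms ticks that's ~200 heartbeats/s
		ElectionTick:  200, // keep the election timeout near 1s of wall-clock time despite fast ticks
		// The "maxmsgsize" above: cap on entry bytes packed into a single MsgApp.
		MaxSizePerMsg: 10 << 20, // 10 MB; 10 MB * 200 msgs/s is ~2000 MB/s on paper
		// Cap on unacknowledged MsgApps outstanding to one follower
		// (the MaxInflight = 256 mentioned above).
		MaxInflightMsgs: 256,
	}
}
```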
I'm going with option 3. Part of the reason is that we will need this anyway: we need feedback on our rate of writes across the cluster, so that we can stop clients from sending us a storm of writes and keep write buffering and its memory usage to a minimum. This would be just one part of that chain of feedback.
Never mind, we aren't doing option 3 now. We alleviated this problem with #1269.
There's a chance this is caused by the same underlying problem as #1168.
(Edit: Actually, this might not involve sending a "snapshot" -- I haven't verified that it actually updated using the snapshot mechanism.)
Anyway, if you make a 3-node cluster, kill a node, load a bunch of data, and then revive the node (as described in #1169 (comment)), it takes a very long time to bring the revived node up to speed.
The initial data load takes on the order of 1m to 1m30s; catching the revived node up then takes about 15 minutes.
While it's catching up, the CPU usage on any given core is at most 5%, maybe 10%. It seems like the slowness is from too much idling.
Another peculiar thing is the high disk usage of the node (svr3) -- the dataset was 1million.rdf.gz:
The actual contents:
Then (after idling for a while) upon killing the processes, each server dumped a 38239904-byte sst file into p/.
Edit: I'll investigate whether it's specific to idling.