storage: Raft log catching up extremely slowly #27983
One theory: Raft has a flow control mechanism which limits the number of outstanding unacknowledged MsgApps, but also optimistically opens up one slot in the flow control window for every MsgHeartbeatResp received. Some quick math lines up: if we could only send one (16KB) message per (500ms) heartbeat interval, our throughput would be about 2.7GB/day. The commit index is only advanced for MsgAppResp, not MsgHeartbeatResp, so this would also explain the lack of advancement. For this to happen, the follower would need to be in ProgressStateReplicate with its flow control window full, and its MsgAppResps would need to be getting dropped.
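To make the "quick math" concrete, here is a minimal sketch of the arithmetic; the 16KB message size and 500ms heartbeat interval are the figures assumed in the theory above, not measured values:

```go
package main

import "fmt"

func main() {
	const (
		msgBytes         = 16 * 1024 // one MsgApp payload, per the assumed 16KB cap
		heartbeatSeconds = 0.5       // one flow-control slot opened per heartbeat
		secondsPerDay    = 24 * 60 * 60
	)
	// One message released per heartbeat interval.
	bytesPerDay := float64(msgBytes) / heartbeatSeconds * secondsPerDay
	fmt.Printf("throughput ≈ %.1f GB/day\n", bytesPerDay/1e9) // ≈ 2.8 GB/day, in line with the ~2.7GB/day estimate
}
```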
The ProgressStateReplicate requirement is tricky. Followers start in ProgressStateProbe and only enter ProgressStateReplicate when the leader receives an MsgAppResp. So we had at least one successful MsgAppResp, then no more. This special case is one way in which MsgAppResp is different from other messages: see `cockroach/pkg/storage/replica.go`, lines 4796 to 4817 at 2ecc022.
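As a rough illustration of the transitions described here, a minimal sketch of the probe/replicate state machine; the identifiers are made up for clarity and are not the actual etcd/raft names:

```go
// Illustrative sketch of the probe/replicate progress transitions.
package raftsketch

type progressState int

const (
	stateProbe     progressState = iota // one MsgApp at a time, wait for the ack
	stateReplicate                      // stream MsgApps, bounded by the inflights window
)

type progress struct {
	state progressState
	match uint64 // highest log index known to be replicated on the follower
}

// onAppResp models the leader handling a successful MsgAppResp: this is the
// only event that promotes a follower out of probe mode, which is why at
// least one MsgAppResp must have made it through.
func (p *progress) onAppResp(index uint64) {
	if index > p.match {
		p.match = index
	}
	if p.state == stateProbe {
		p.state = stateReplicate
	}
}

// onAppReject models a rejected MsgApp (log mismatch): fall back to probing.
func (p *progress) onAppReject() {
	p.state = stateProbe
}
```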
The "dropped messages" stat was also zero during this time, so the messages were not getting dropped at the raft transport layer. |
The fact that the commit index does not advance is normal: a new leader cannot commit any entries from past terms until it has committed an entry from its own term (which requires that a quorum of followers is fully up to date). This is described in section 3.6.2 of the raft dissertation. The only problem, then, is that the log replication is proceeding extremely slowly. I still think the theory in the second comment above is the best one. The fact that entries cannot be committed makes this more plausible, since successful MsgAppResps would not bump the commit index and invalidate the theory. It is enough for most MsgAppResps to be dropped, instead of requiring all of them to fail. Alternately, this could reflect a node alternating between ProgressStateProbe and ProgressStateReplicate. I had dismissed this theory because I think it requires some MsgAppResps to succeed, but that is now possible.
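For reference, a minimal sketch of the commit rule from §3.6.2; this is illustrative code, not the etcd/raft implementation. A quorum-replicated index only becomes the commit index if the entry at that index is from the leader's current term:

```go
package raftsketch

import "sort"

// maybeCommit returns the new commit index given each replica's match index.
// Entries from earlier terms are never committed directly; they commit only
// once an entry from the leader's current term reaches a quorum.
func maybeCommit(matchIndexes []uint64, commitIndex, currentTerm uint64,
	termAt func(index uint64) uint64) uint64 {

	sorted := append([]uint64(nil), matchIndexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorumIndex := sorted[len(sorted)/2] // highest index replicated on a majority

	if quorumIndex > commitIndex && termAt(quorumIndex) == currentTerm {
		return quorumIndex
	}
	return commitIndex
}
```

In the scenario described in this issue (two replicas, one far behind), the quorum index is pinned at the lagging follower's match index, so nothing commits until that follower has caught up to an entry from the current term.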
OK, I've figured it out. There are two issues with raft's flow control, one of which is exacerbated by our MsgAppResp-coalescing optimization (and perhaps by our configuration that allows for a large number of small messages in flight, instead of the reverse).

Raft's flow control mostly tries to operate on a "one in, one out" principle. After receiving a MsgAppResp, multiple slots may free up in the flow control window, but only one MsgApp will be sent. The only thing that can lead to multiple in-flight messages is incoming traffic (which is rare here, since the only incoming traffic is the occasional lease request as the range tries to reestablish itself). And when we do manage to get multiple messages in flight at once, the fact that we coalesce redundant MsgAppResps means that we will tend to fall back into the one-at-a-time mode.

Even worse, there's a bug in the "one in, one out" implementation: if the flow control window was not full when the MsgAppResp was received, no new message is sent. This degrades to "one message per heartbeat interval", which matches the measured transfer rates. Instead of attempting to send exactly the right number of messages, raft should be trying to fill up the flow control window whenever it sends MsgApp.
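A minimal sketch of the difference, using simplified stand-in types; the real fix is in etcd-io/etcd#9982, and this is not the actual code:

```go
package raftsketch

// inflights is a stand-in for raft's flow-control window of unacknowledged MsgApps.
type inflights struct{ count, capacity int }

func (in *inflights) full() bool { return in.count >= in.capacity }

// onAppRespOld: at most one new MsgApp per MsgAppResp, and only when the
// window had been full when the response arrived. With coalesced MsgAppResps
// this degrades to roughly one MsgApp per heartbeat interval.
func onAppRespOld(in *inflights, windowWasFull bool, sendAppend func() bool) {
	if windowWasFull {
		sendAppend()
	}
}

// onAppRespFixed: refill the flow-control window whenever there is room.
// sendAppend is assumed to send one MsgApp, add it to the window, and return
// false when the follower has nothing more to receive.
func onAppRespFixed(in *inflights, sendAppend func() bool) {
	for !in.full() {
		if !sendAppend() {
			break
		}
	}
}
```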
We allow multiple in-flight append messages, but prior to this change the only way we'd ever send them is if there is a steady stream of new proposals. Catching up a follower that is far behind would be unnecessarily slow (this is exacerbated by a quirk of CockroachDB's use of raft which limits our ability to catch up via snapshot in some cases). See cockroachdb/cockroach#27983
28511: vendor: Update etcd r=tschottdorf a=bdarnell Picks up etcd-io/etcd#9982 and etcd-io/etcd#9985 (and no other changes to packages we use). Fixes #27983 Fixes #27804 Release note (bug fix): Additional fixes for out-of-memory errors caused by very large raft logs. Release note (performance improvement): Greatly improved performance when catching up followers that are behind when raft logs are large. Co-authored-by: Ben Darnell <[email protected]>
Why can the live node become leader? A majority of two replicas is two; with one down, how can the other one become leader?
@DorianZheng I think what happens is that the leader was elected before quorum was lost. A leader steps down if it hears about a higher term, and followers campaign if they don't receive heartbeats. This means that a sole leader will remain in this state; neither of these two mechanisms will make it step down. Of course it won't be able to make any progress, but it can append to the uncommitted Raft log.
Thanks @tschottdorf for answering. So will the leader use heartbeats to detect whether it is still the leader? If not, why not?
We don't currently use heartbeats to detect when we can no longer be leader because our followers have gone away. We used to in 1.1, but we turned that off in 2.0 because it was expensive to do this work for every replica even when they were idle. (We could probably do this more cheaply if we wanted to put more work into it, but we didn't expect it to be an issue.)
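For context, etcd/raft exposes this kind of self-check as the CheckQuorum option: when enabled, a leader steps down if it has not heard from a quorum within an election timeout. Below is a minimal configuration sketch; the field values, and the assumption that this is the specific knob CockroachDB toggled between 1.1 and 2.0, are mine, not stated above:

```go
package main

import "github.com/coreos/etcd/raft"

// newRaftConfig is an illustrative configuration; the values here are
// examples, not CockroachDB's actual settings.
func newRaftConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:              id,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   16 << 10, // many small messages in flight, as described earlier
		MaxInflightMsgs: 64,
		// CheckQuorum: the leader steps down if it hasn't heard from a quorum
		// of followers within an election timeout. This costs heartbeat
		// bookkeeping per replica even when the range is idle.
		CheckQuorum: true,
	}
}
```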
A cluster that got into a weird state has revealed something that doesn't appear to be working correctly in raft. The cluster was running 2.0.4 with the patch in #27804 (comment)
A range had gotten into a state with only two replicas, and one of them had been down for a long time. The live node was left as leader and continually added to its raft log. After the downed follower came back online, that multi-GB raft log had to be copied to the follower before new progress could be made (including establishing a new lease for the range or adding a third replica).
This process went extremely slowly, taking 30 hours in one case to transfer a few GB. While this was happening, the range status page showed that the follower's Last Index was increasing steadily, but the Commit and Applied indexes remained constant until the follower caught up completely. This is unexpected; the Commit and Applied indexes should increase as the follower progresses, tracking only a few message round trips behind the Last index.
Note that raft leadership was stable during this time (term number was only 141), so it was not the case that elections were being called frequently (which could slow down the leader and prevent it from committing entries as it processes MsgAppResps).