Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Fix: Avoid Dgraph cluster getting stuck in infinite leader election #3391

Merged
merged 6 commits into from
May 9, 2019

Conversation

manishrjain
Copy link
Contributor

@manishrjain manishrjain commented May 9, 2019

Dgraph Alphas were calculating snapshots and checkpoints in the main Raft loop, which depending upon disk speed caused Ticks to not be done for seconds. This caused the followers to assume that the leader is unavailable, triggering an election. Because the checkpoint and snapshot calculation happens every 30s, the election was happening every 30s as well.

This PR moves both of those outside the main loop into their own goroutine (colocated with the code which shuts down Raft node). Tested successfully with a live cluster which was exhibiting these symptoms.

This PR also tracks how many heartbeats have come in and gone out from each node and prints them out under V(3). Useful for debugging.

The PR improves upon and uses x.Timer to track Raft.Ready components' latencies and report them in both Alphas and Zeros. This fixes the incorrect statement we were making about disk latency being the primary cause of Raft.Ready being slow.


This change is Reviewable

@manishrjain manishrjain requested a review from a team as a code owner May 9, 2019 02:21
@danielmai
Copy link
Contributor


conn/node.go, line 253 at r1 (raw file):

		switch msg.Type {
		case raftpb.MsgHeartbeat, raftpb.MsgHeartbeatResp:
			atomic.AddInt64(&n.heartbeatsOut, 1)

This happens in glog.V(2) but the ticker in ReportRaftComms only zeroes out heartbeats in glog.V(3). Is that OK?

@danielmai
Copy link
Contributor


conn/raft_server.go, line 232 at r1 (raw file):

				switch msg.Type {
				case raftpb.MsgHeartbeat, raftpb.MsgHeartbeatResp:
					atomic.AddInt64(&n.heartbeatsIn, 1)

Same question here too about incrementing in V(2) and glogging/zero-ing in V(3).

Copy link
Contributor Author

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 2 unresolved discussions (waiting on @danielmai and @manishrjain)


conn/node.go, line 253 at r1 (raw file):

Previously, danielmai (Daniel Mai) wrote…

This happens in glog.V(2) but the ticker in ReportRaftComms only zeroes out heartbeats in glog.V(3). Is that OK?

Yeah, that's alright. Even if uint64 overflows (??) and goes back to zero, it's not a big deal.

Copy link
Contributor

@danielmai danielmai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @danielmai and @manishrjain)

Copy link
Contributor

@danielmai danielmai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @manishrjain)


conn/node.go, line 253 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Yeah, that's alright. Even if uint64 overflows (??) and goes back to zero, it's not a big deal.

Sounds good.

Copy link
Contributor

@danielmai danielmai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @manishrjain)

@manishrjain manishrjain merged commit bfcf784 into master May 9, 2019
@manishrjain manishrjain deleted the mrjn/raft-comms branch May 9, 2019 02:44
manishrjain added a commit that referenced this pull request May 9, 2019
…on (#3391)

Dgraph Alphas were calculating snapshots and checkpoints in the main Raft loop, which depending upon disk speed caused Ticks to not be done for seconds. This caused the followers to assume that the leader is unavailable, triggering an election. Because the checkpoint and snapshot calculation happens every 30s, the election was happening every 30s as well.

This PR moves both of those outside the main loop into their own goroutine (colocated with the code which shuts down Raft node). Tested successfully with a live cluster which was exhibiting these symptoms.

This PR also tracks how many heartbeats have come in and gone out from each node and prints them out under V(3). Useful for debugging.

The PR improves upon and uses x.Timer to track Raft.Ready components' latencies and report them in both Alphas and Zeros. This fixes the incorrect statement we were making about disk latency being the primary cause of Raft.Ready being slow.

Changes:
* Report Heartbeat comms
* Add logs around heartbeats.
* Move snapshot and checkpoint calculation outside of the main Raft loop. Capture the latency of individual components in Raft.Ready better.
* Add timer to Zero as well. Fix a bug: Use a for loop when going over slow ticker.
* Move num pending txns to V(2).
* Move the checkpointing code outside of the Run func.
@@ -72,6 +72,9 @@ type Node struct {
// The stages are proposed -> committed (accepted by cluster) ->
// applied (to PL) -> synced (to BadgerDB).
Applied y.WaterMark

heartbeatsOut int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atomic package issues here? This is 64 bit. https://golang.org/pkg/sync/atomic/#pkg-note-BUG

@@ -155,6 +158,20 @@ func NewNode(rc *pb.RaftContext, store *raftwal.DiskStorage) *Node {
return n
}

func (n *Node) ReportRaftComms() {
if !glog.V(3) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this work for level > 3 too?

@@ -964,14 +983,13 @@ func (n *node) Run() {
span.End()
ostats.RecordWithTags(context.Background(),
[]tag.Mutator{tag.Upsert(x.KeyMethod, "alpha.RunLoop")},
x.LatencyMs.M(x.SinceMs(start)))
x.LatencyMs.M(float64(timer.Total())/1e6))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could divide by time.MilliSecond instead here

func (t *Timer) All() []time.Duration {
return t.records
func (t *Timer) String() string {
sort.Slice(t.records, func(i, j int) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is sorting helpful in viewing the results?

dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
…on (hypermodeinc#3391)

Dgraph Alphas were calculating snapshots and checkpoints in the main Raft loop, which depending upon disk speed caused Ticks to not be done for seconds. This caused the followers to assume that the leader is unavailable, triggering an election. Because the checkpoint and snapshot calculation happens every 30s, the election was happening every 30s as well.

This PR moves both of those outside the main loop into their own goroutine (colocated with the code which shuts down Raft node). Tested successfully with a live cluster which was exhibiting these symptoms.

This PR also tracks how many heartbeats have come in and gone out from each node and prints them out under V(3). Useful for debugging.

The PR improves upon and uses x.Timer to track Raft.Ready components' latencies and report them in both Alphas and Zeros. This fixes the incorrect statement we were making about disk latency being the primary cause of Raft.Ready being slow.

Changes:
* Report Heartbeat comms
* Add logs around heartbeats.
* Move snapshot and checkpoint calculation outside of the main Raft loop. Capture the latency of individual components in Raft.Ready better.
* Add timer to Zero as well. Fix a bug: Use a for loop when going over slow ticker.
* Move num pending txns to V(2).
* Move the checkpointing code outside of the Run func.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants