
[GOAL2-731] Better slow peers disconnection logic#15

Merged
zeldovich merged 8 commits into algorand:master from tsachiherman:tsachi/disconnectslowpeers
Jun 19, 2019

Conversation

@tsachiherman
Contributor

Our existing FPR could generate a large quantity of messages.
These, in turn, could cause the internal output buffer to overflow, triggering a peer disconnection.
That's not the desired behavior; instead, we want to disconnect the peer only if the messages being written to it are too old.

@CLAassistant

CLAassistant commented Jun 13, 2019

CLA assistant check
All committers have signed the CLA.

Comment thread network/wsNetwork.go Outdated
@tsachiherman tsachiherman requested a review from zeldovich June 13, 2019 19:34
Comment thread network/wsPeer.go Outdated
zeldovich
zeldovich previously approved these changes Jun 14, 2019
@tsachiherman tsachiherman changed the title [GOAL2-720] Better slow peers disconnection logic [GOAL2-731] Better slow peers disconnection logic Jun 17, 2019
Contributor

@algobolson algobolson left a comment


I think there's one hole in checking for slow peers that we can fix.

Comment thread network/wsPeer.go
Comment thread network/wsNetwork.go Outdated
config.Consensus[protocol.ConsensusCurrentVersion].SoftCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].CertCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].NextCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].LateCommitteeSize)
Contributor


As a small nit, I would change "single round" to "single period" (and say that this is the total number of messages sent at once).

I don't think it makes a big difference here as it's a heuristic, but I would also add RedoCommitteeSize and DownCommitteeSize. In particular, the committee for the down votes is the largest, at 6000 possible votes. (It needs to be large because it intersects with the cert votes, which are the key committing votes.)

We also pipeline (relaying) all of these votes from the next round and the next period, so it's possible that this number should be 3x as big (as in, we might pipeline 3 periods' worth of votes). On the other hand, this is a pretty unlikely situation and means that the network is experiencing extreme congestion. I think with the current committee sizes, the sum of all our committee sizes is about 20000 messages, which would make 3x about 60000 messages (so with 0.5KB votes this is 30MB).

Contributor Author


I'll increase the size of the buffer by RedoCommitteeSize + DownCommitteeSize.

Comment thread network/wsNetwork.go
var networkBroadcastSendMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcast_send_micros_total", Description: "microseconds spent broadcasting"})
var networkBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_broadcasts_dropped_total", Description: "number of broadcast messages not sent to any peer"})
var networkPeerBroadcastDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_peer_broadcast_dropped_total", Description: "number of broadcast messages not sent to some peer"})
Contributor


Could we have separate metrics for drops of high-priority messages and low-priority messages? It seems that high-priority drops would be much more alarming than low-priority drops (a lot of low-priority drops means that we might have a ping-pong script bug; a lot of high-priority drops means that the network could be about to stall).

Contributor Author


That's a good idea. I'll defer this to a separate PR. Opened a JIRA issue to track this:
https://algorand.atlassian.net/browse/GOAL2-790

Comment thread network/wsNetwork.go Outdated
@derbear
Contributor

derbear commented Jun 18, 2019

To clarify, this commit also changes behavior for high-priority broadcasts, removing the distinction between high- and low-priority broadcasts, right?

Specifically, for any message, drop a send (1) to a certain peer if the non-blocking send fails and (2) to all peers if the message took too long to leave the queue.

I think this change is probably a good idea. The interesting thing is how it'll affect the agreement protocol given that before this change, vote delivery was "guaranteed" on a persistent connection. It might be a good idea to stress-test this change somehow with a private network.

@tsachiherman
Contributor Author

> To clarify, this commit also changes behavior for high-priority broadcasts, removing the distinction between high- and low-priority broadcasts, right?
>
> Specifically, for any message, drop a send (1) to a certain peer if the non-blocking send fails and (2) to all peers if the message took too long to leave the queue.
>
> I think this change is probably a good idea. The interesting thing is how it'll affect the agreement protocol given that before this change, vote delivery was "guaranteed" on a persistent connection. It might be a good idea to stress-test this change somehow with a private network.

Your observations are correct. If the peer is too slow to process messages, that peer is going to start losing messages. That's a connection-dependent message drop.
If the message is too old and gets eliminated for all the peers, it means we're sending messages faster than the broadcastThread can process them.

@zeldovich zeldovich merged commit ea85541 into algorand:master Jun 19, 2019
pzbitskiy pushed a commit to pzbitskiy/go-algorand that referenced this pull request Mar 19, 2020
shiqizng pushed a commit to shiqizng/go-algorand that referenced this pull request Apr 7, 2022


6 participants