
[GOAL2-731] Better slow peers disconnection logic#15

Merged
zeldovich merged 8 commits into algorand:master from tsachiherman:tsachi/disconnectslowpeers
Jun 19, 2019

Conversation

@tsachiherman
Contributor

Our existing FPR could generate a large quantity of messages.
These, in turn, could cause the internal output buffer to overflow, triggering a peer disconnection.
That's not the desired behavior; instead, we want to disconnect the peer only if the messages being written to it are too old.

@CLAassistant

CLAassistant commented Jun 13, 2019

CLA assistant check
All committers have signed the CLA.

Comment thread network/wsNetwork.go Outdated
@tsachiherman tsachiherman requested a review from zeldovich June 13, 2019 19:34
Comment thread network/wsPeer.go Outdated
zeldovich
zeldovich previously approved these changes Jun 14, 2019
@tsachiherman tsachiherman changed the title [GOAL2-720] Better slow peers disconnection logic [GOAL2-731] Better slow peers disconnection logic Jun 17, 2019
Contributor

@algobolson algobolson left a comment


I think there's one hole in checking for slow peers that we can fix.

Comment thread network/wsPeer.go
Comment thread network/wsNetwork.go Outdated
config.Consensus[protocol.ConsensusCurrentVersion].SoftCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].CertCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].NextCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].LateCommitteeSize)
Contributor


As a small nit, I would change "single round" to "single period" (and say that this is the total number of messages sent at once).

I don't think it makes a big difference here as it's a heuristic, but I would also add RedoCommitteeSize and DownCommitteeSize. In particular, the committee for the down votes is the largest, at 6000 possible votes. (It needs to be large because it intersects with the cert votes, which are the key committing votes.)

We also pipeline (relaying) all of these votes from the next round and the next period, so it's possible that this number should be 3x as big (as in, we might pipeline 3 periods' worth of votes). On the other hand, this is a pretty unlikely situation and means that the network is experiencing extreme congestion. I think with the current committee sizes, the sum of all our committee sizes is about 20000 messages, which would make 3x about 60000 messages (so with 0.5KB votes this is 30MB).

Contributor Author


I'll increase the size of the buffer by RedoCommitteeSize + DownCommitteeSize.

Comment thread network/wsNetwork.go
var networkBroadcastSendMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcast_send_micros_total", Description: "microseconds spent broadcasting"})
var networkBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_broadcasts_dropped_total", Description: "number of broadcast messages not sent to any peer"})
var networkPeerBroadcastDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_peer_broadcast_dropped_total", Description: "number of broadcast messages not sent to some peer"})
Contributor


Could we have separate metrics for drops of high-priority messages and low-priority messages? It seems that high-priority drops would be much more alarming than low-priority drops (a lot of low-priority drops means that we might have a ping-pong script bug; a lot of high-priority drops means that the network could be about to stall).

Contributor Author


That's a good idea. I'll defer this to a separate PR. Opened a JIRA issue to track this:
https://algorand.atlassian.net/browse/GOAL2-790

Comment thread network/wsNetwork.go Outdated
@derbear
Contributor

derbear commented Jun 18, 2019

To clarify, this commit also changes behavior for high-priority broadcasts, removing the distinction between high- and low-priority broadcasts, right?

Specifically, for any message, drop a send (1) to a certain peer if the non-blocking send fails and (2) to all peers if the message took too long to leave the queue.

I think this change is probably a good idea. The interesting thing is how it'll affect the agreement protocol given that before this change, vote delivery was "guaranteed" on a persistent connection. It might be a good idea to stress-test this change somehow with a private network.

@tsachiherman
Contributor Author

> To clarify, this commit also changes behavior for high-priority broadcasts, removing the distinction between high- and low-priority broadcasts, right?
>
> Specifically, for any message, drop a send (1) to a certain peer if the non-blocking send fails and (2) to all peers if the message took too long to leave the queue.
>
> I think this change is probably a good idea. The interesting thing is how it'll affect the agreement protocol given that before this change, vote delivery was "guaranteed" on a persistent connection. It might be a good idea to stress-test this change somehow with a private network.

Your observations are correct. If the peer is too slow to process messages, that peer is going to start losing messages. That's a connection-dependent message drop.
If the message is too old and gets eliminated for all the peers, it means we're sending messages faster than the broadcastThread can process them.

@zeldovich zeldovich merged commit ea85541 into algorand:master Jun 19, 2019
pzbitskiy pushed a commit to pzbitskiy/go-algorand that referenced this pull request Mar 19, 2020
shiqizng pushed a commit to shiqizng/go-algorand that referenced this pull request Apr 7, 2022


6 participants