Add a Prometheus alert on no incoming connection #7517
      message: 'The node {{ $labels.instance }} has less than 3 peers for more
        than 15 minutes'
    - alert: NoIncomingConnection
      expr: increase(polkadot_sub_libp2p_incoming_connections_total[20m]) == 0
Without a `for` clause, wouldn't this fire each time a new node starts? The first value would be 0, and the time series has no other values in the past 20m.
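To make the question concrete, here is a rough Python sketch of how the alert expression behaves right after a node starts. It assumes a simplified model of `increase()` (last sample minus first sample in the window); real Prometheus also extrapolates and handles counter resets, which this ignores. The names `simple_increase` and `alert_fires` are hypothetical helpers, not part of any Prometheus API. What it does reflect is that `increase()` yields no result when fewer than two samples fall in the range.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_increase(samples: Sequence[Sample], now: float, window: float) -> Optional[float]:
    """Simplified stand-in for PromQL increase(): last minus first sample in the window.

    Assumption: no extrapolation and no counter-reset handling.
    Returns None when fewer than two samples fall in the window,
    mirroring the fact that Prometheus returns no result in that case.
    """
    in_window = [value for (ts, value) in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        return None
    return in_window[-1] - in_window[0]


def alert_fires(samples: Sequence[Sample], now: float, window: float = 20 * 60) -> bool:
    """Model of `increase(...[20m]) == 0` with no `for` clause."""
    return simple_increase(samples, now, window) == 0


# A freshly started node: only one scrape so far, counter still at 0.
first_scrape = [(0.0, 0.0)]
# One minute later, a second scrape, still no incoming connections.
second_scrape = [(0.0, 0.0), (60.0, 0.0)]
# Same node, but one connection arrived before the second scrape.
with_connection = [(0.0, 0.0), (60.0, 1.0)]

print(alert_fires(first_scrape, now=0.0))      # single sample: no result, no alert
print(alert_fires(second_scrape, now=60.0))    # two flat samples: increase is 0
print(alert_fires(with_connection, now=60.0))  # counter moved: no alert
```

Under this model the very first sample does not trigger the alert, but a node that has received no incoming connection by its second scrape would, which is the case a `for` clause would guard against.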
Tested this with my own server, where my node was offline for a few hours:
Instead of 20m I put 6h. In the period before I started the node again, the values were simply missing, and I assume missing values wouldn't trigger the alert.
Interestingly, however, between 1am (when my node crashed) and 7am, values were still present and decreasing linearly.
> Interestingly, however, between 1am (when my node crashed) and 7am, values were still present and decreasing linearly.
Just to make sure there is no confusion, the increase rate decreased.
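One plausible reading of that observation, sketched in Python: after the crash, the 6h range still contains the pre-crash samples, so the computed increase shrinks as the window slides forward and only disappears once fewer than two samples remain. The scrape interval, growth rate, and crash time below are made-up illustration values, and `windowed_increase` is a hypothetical simplification (last minus first sample, no extrapolation), not Prometheus's actual implementation.

```python
def windowed_increase(samples, now, window):
    """Last minus first sample inside [now - window, now]; None if fewer than two samples."""
    values = [value for (ts, value) in samples if now - window <= ts <= now]
    return values[-1] - values[0] if len(values) >= 2 else None


# Hypothetical node: scraped once per minute, counter grows by 10 per minute,
# and the node crashes at minute 600 (no samples after that).
CRASH_MINUTE = 600
WINDOW_MINUTES = 360  # the 6h range used in the test above
samples = [(minute, 10 * minute) for minute in range(CRASH_MINUTE + 1)]

# Evaluate the expression at the crash and at several points afterwards.
readings = {
    minute: windowed_increase(samples, minute, WINDOW_MINUTES)
    for minute in (600, 720, 840, 959, 960)
}

for minute, value in readings.items():
    print(f"minute {minute}: increase over 6h = {value}")
```

In this sketch the value falls from 3600 at the crash to 2400, 1200, and finally 10, then becomes missing exactly one window length after the crash, which matches the "still present and decreasing linearly" pattern followed by a gap.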
For what it's worth, this PR is ready to be merged (assuming it gets a second approval), but we're reluctant to merge it yet because we don't want to spam the devops team with alerts.
Approved it :D
Since Max's approval no longer counts, I need a third approval 😬
Yep, looks good to me 👍
I'm assuming your initial comment regarding validators + sentries is also no longer relevant?
Well, right now it would trigger alerts on our (Parity) nodes, since the change hasn't been done for Polkadot nodes yet as far as I know. Since #8079 removes sentry nodes altogether, we can merge this close to the next release.
Apparently our devops have finished the work.
bot merge
Waiting for commit status.
Checks failed; merge aborted.

Note that this will trigger for all validators right now, as they have stable connections to their sentries.
These alerts assume that there exists a background noise of connections.
In practice, all of our nodes except validators see an increase of 7000 to 24000 per 20 minutes, far from the 0 at which the alert would trigger.