collator-proto/metrics: Fix blindspot in collation fetch latency metrics #11600

Merged
lexnv merged 6 commits into master from lexnv/fix-metrics
Apr 3, 2026

Conversation

lexnv commented Apr 1, 2026

The current implementation of the collation_fetch_latency metric contains a critical observability blindspot due to insufficient histogram bucket resolution.

Currently, the collation fetch latency histogram is capped at an upper bound of 5 seconds. This effectively creates a black box for investigating latency events: any fetch operation exceeding the 5s threshold is aggregated into the final bucket, regardless of whether the fetch took 30s or 1h. This obscures the true distribution of network delays and prevents accurate performance profiling for high-latency scenarios.
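
To make the aggregation effect concrete, here is a minimal, dependency-free sketch of how an observation is assigned to a histogram bucket. The helper names (`bucket_index`, `bucket_label`) and the `capped` layout are hypothetical and only for illustration; they are not the code in `metrics.rs`. The `extended` bounds are the tail of the bucket list quoted in the review thread.

```rust
/// Returns the index of the first bucket whose upper bound is >= `value`,
/// or `bounds.len()` for the implicit overflow (+Inf) bucket.
fn bucket_index(bounds: &[f64], value: f64) -> usize {
    bounds
        .iter()
        .position(|&le| value <= le)
        .unwrap_or(bounds.len())
}

/// Human-readable label of the bucket an observation lands in.
fn bucket_label(bounds: &[f64], value: f64) -> String {
    match bounds.get(bucket_index(bounds, value)) {
        Some(le) => format!("le={le}"),
        None => "le=+Inf".to_string(),
    }
}

fn main() {
    // Hypothetical layout whose highest finite bound is 5s (assumption).
    let capped = [0.5, 1.0, 2.0, 5.0];
    // Tail of the extended layout quoted in the review thread.
    let extended = [0.5, 1.0, 2.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 60.0];

    for secs in [4.0, 8.0, 30.0, 3600.0] {
        println!(
            "{secs}s -> capped: {}, extended: {}",
            bucket_label(&capped, secs),
            bucket_label(&extended, secs)
        );
    }
}
```

With the capped layout, the 8s, 30s and 1h observations all fall into the same overflow bucket. With the extended layout, only observations above 60s do, so a 1h fetch would still be indistinguishable from a 2min one.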

The discrepancy was identified with https://github.com/lexnv/block-confidence-monitor and confirmed via manual analysis of the logs. Without granular visibility into these outliers, we cannot effectively measure the success of our block confidence work or debug bottlenecks in validator-side protocols.

While at it, I have increased the granularity of other buckets that might be relevant.

Part of the block confidence work:

cc @sandreim @skunert

lexnv added 2 commits April 1, 2026 09:57
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv self-assigned this Apr 1, 2026
@lexnv lexnv added the T0-node label Apr 1, 2026
Comment thread polkadot/node/network/collator-protocol/src/collator_side/metrics.rs Outdated
Comment thread polkadot/node/network/collator-protocol/src/collator_side/metrics.rs Outdated
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

skunert commented Apr 1, 2026

So with @sandreim's comments, does this change even make sense? If the candidates drop anyway, we should never see entries in the buckets > 10, right?

lexnv commented Apr 1, 2026

Screenshot 2026-04-01 at 15 14 48

This is mostly needed for collation_fetch_latency, i.e. how much time collations spend waiting to be fetched. We can already see this on polkadot-yap 3428: the two spikes to 10s would have been capped at 5s (i.e. the blindspot is real here):

https://grafana.teleport.parity.io/goto/aQvXSItDg?orgId=1
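
As a rough illustration of why a 10s spike shows up as 5s on a dashboard: Prometheus' `histogram_quantile` interpolates linearly inside the bucket that contains the requested rank, and when that rank falls into the +Inf bucket it reports the upper bound of the highest finite bucket. The sketch below mimics that behaviour in plain Rust. The function name `estimate_quantile`, the bucket layouts and the sample counts are made up for the example and are not taken from the node or from the graph above.

```rust
/// Estimates quantile `q` from cumulative bucket counts, in the style of
/// Prometheus' `histogram_quantile`. `bounds` holds the finite upper bounds;
/// `cumulative` holds one count per bound plus a final entry for +Inf.
fn estimate_quantile(q: f64, bounds: &[f64], cumulative: &[u64]) -> Option<f64> {
    let total = *cumulative.last()?;
    if total == 0 || cumulative.len() != bounds.len() + 1 {
        return None;
    }
    let rank = q * total as f64;
    // First bucket whose cumulative count reaches the requested rank.
    let idx = cumulative.iter().position(|&c| c as f64 >= rank)?;
    if idx == bounds.len() {
        // Rank lies in the +Inf bucket: only the highest finite bound is known.
        return bounds.last().copied();
    }
    let lower = if idx == 0 { 0.0 } else { bounds[idx - 1] };
    let prev = if idx == 0 { 0 } else { cumulative[idx - 1] };
    let in_bucket = (cumulative[idx] - prev) as f64;
    if in_bucket == 0.0 {
        return Some(lower);
    }
    // Linear interpolation inside the selected bucket.
    Some(lower + (bounds[idx] - lower) * (rank - prev as f64) / in_bucket)
}

fn main() {
    // Made-up sample: 98 fetches finish within 1s, 2 fetches take about 10s.
    let capped_bounds = [1.0, 2.0, 5.0];
    let capped_counts = [98, 98, 98, 100];
    let extended_bounds = [1.0, 2.0, 5.0, 10.0, 15.0];
    let extended_counts = [98, 98, 98, 100, 100, 100];

    println!("p99 capped:   {:?}", estimate_quantile(0.99, &capped_bounds, &capped_counts));
    println!("p99 extended: {:?}", estimate_quantile(0.99, &extended_bounds, &extended_counts));
}
```

In this sample the capped layout can never report more than 5s for the p99, while the extended layout places it between 5s and 10s. The estimate is still an interpolation, so the bucket bounds limit how precisely a spike can be located.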

> So with @sandreim's comments, does this change even make sense? If the candidates drop anyway, we should never see entries in the buckets > 10, right?

Yep, I have reverted the metrics where I bumped the block times.

)
.buckets(vec![
0.001, 0.01, 0.025, 0.05, 0.1, 0.15, 0.25, 0.35, 0.5, 0.75, 1.0, 2.0, 5.0,
10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 60.0,

I'd remove some of the lower-end buckets; anything below 250ms is excellent, assuming reasonable latency between validator and collator.

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv commented Apr 2, 2026

Screenshot 2026-04-02 at 12 56 02

The graph shows the latency spikes to 30s that were captured by the block-confidence-monitoring tool. With this change, we should have better insight into these issues.

We have also deployed a trace version with extra monitoring on validators, which should give us more detail on what is happening on the validator side of the code 🙏

Thanks a lot for the review 🙏

@lexnv lexnv enabled auto-merge April 2, 2026 09:59

lexnv commented Apr 2, 2026

/cmd prdoc --audience node_dev --bump patch

@lexnv lexnv added this pull request to the merge queue Apr 3, 2026
Merged via the queue into master with commit b77d5ec Apr 3, 2026
268 of 278 checks passed
@lexnv lexnv deleted the lexnv/fix-metrics branch April 3, 2026 10:27
Labels

T0-node This PR/Issue is related to the topic “node”.

3 participants