collator-proto/metrics: Fix blindspot in collation fetch latency metrics by lexnv · Pull Request #11600 · paritytech/polkadot-sdk

lexnv · 2026-04-01T10:05:27Z

The current implementation of the collation_fetch_latency metric contains a critical observability blindspot due to insufficient histogram bucket resolution.

Currently, the collation fetch histogram is capped at an upper bound of 5 seconds. This effectively creates a black box for investigating latency events: any fetch operation exceeding the 5s threshold is aggregated into the final bucket, regardless of whether the fetch took 30s or 1h. This obscures the true distribution of network delays and prevents accurate performance profiling for high-latency scenarios.
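To illustrate the capping, here is a minimal standalone sketch that does not use the real metrics code. The bounds are copied from the diff context quoted in the review thread; that they belong to collation_fetch_latency specifically is an assumption, and `smallest_bucket` is a hypothetical helper that mimics how a cumulative histogram picks the first upper bound that fits an observation.

```rust
/// Upper bounds in seconds, copied from the diff context in this PR.
/// Assumption: this is the layout that was capped at 5 s.
const OLD_BUCKETS: [f64; 13] = [
    0.001, 0.01, 0.025, 0.05, 0.1, 0.15, 0.25, 0.35, 0.5, 0.75, 1.0, 2.0, 5.0,
];

/// Upper bounds in seconds appended by the diff.
const ADDED_BUCKETS: [f64; 7] = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 60.0];

/// Hypothetical helper: returns the smallest finite upper bound that
/// contains `value`, or `None` when only the implicit +Inf bucket fits.
/// `bounds` must be sorted in ascending order.
fn smallest_bucket(bounds: &[f64], value: f64) -> Option<f64> {
    bounds.iter().copied().find(|&le| value <= le)
}

fn main() {
    let new_buckets: Vec<f64> = OLD_BUCKETS
        .iter()
        .chain(ADDED_BUCKETS.iter())
        .copied()
        .collect();

    // With the old layout, 10 s, 30 s and 1 h are indistinguishable:
    // all of them only fit the +Inf bucket (printed as `None`).
    for secs in [3.0, 10.0, 30.0, 3600.0] {
        println!(
            "{secs}s -> old: {:?}, new: {:?}",
            smallest_bucket(&OLD_BUCKETS, secs),
            smallest_bucket(&new_buckets, secs),
        );
    }
}
```

With the extended bounds, a 30s fetch gets its own finite bucket, while anything above 60s still only fits the +Inf bucket.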

The discrepancy was identified with https://github.com/lexnv/block-confidence-monitor and confirmed via manual analysis of the logs. Without granular visibility into these outliers, we cannot effectively measure the success of our block confidence work or debug bottlenecks in the validator-side protocols.

While at it, I have also increased the granularity of other buckets that might be relevant.

Part of the block confidence work:

Lower Block Confidence on Polkadot 96% vs Kusama parachains 99% #11377

cc @sandreim @skunert

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

skunert · 2026-04-01T12:09:53Z

So given @sandreim's comments, does this change even make sense? If the candidates are dropped anyway, we should never see entries in the buckets > 10, right?

lexnv · 2026-04-01T12:16:52Z

This is mostly needed for collation_fetch_latency, i.e. how much time collations spend waiting to be fetched. We can already see this on polkadot-yap 3428: the 2 spikes to 10s would have been capped at 5s (i.e. the blindspot is real here):

https://grafana.teleport.parity.io/goto/aQvXSItDg?orgId=1

> So given @sandreim's comments, does this change even make sense? If the candidates are dropped anyway, we should never see entries in the buckets > 10, right?

Yep, I have reverted the metrics where I bumped the block times.

sandreim · 2026-04-01T13:10:48Z

 					)
 					.buckets(vec![
 						0.001, 0.01, 0.025, 0.05, 0.1, 0.15, 0.25, 0.35, 0.5, 0.75, 1.0, 2.0, 5.0,
+						10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 60.0,


I'd remove some of the lower-end buckets; anything lower than 250ms is excellent, assuming reasonable latency between validator and collator.
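A rough sketch of what that trimming could look like, assuming the extended list from the diff above as the starting point. `trim_below` and `PROPOSED_BUCKETS` are hypothetical names used only for illustration, and the exact bucket layout that was eventually merged is not shown in this thread.

```rust
/// Upper bounds in seconds: the diff context plus the added lines.
const PROPOSED_BUCKETS: [f64; 20] = [
    0.001, 0.01, 0.025, 0.05, 0.1, 0.15, 0.25, 0.35, 0.5, 0.75, 1.0, 2.0, 5.0,
    10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 60.0,
];

/// Hypothetical helper: keeps only the bounds at or above `floor`.
/// The first kept bound then absorbs every faster observation.
fn trim_below(bounds: &[f64], floor: f64) -> Vec<f64> {
    bounds.iter().copied().filter(|&le| le >= floor).collect()
}

fn main() {
    // Everything faster than 250 ms collapses into the 0.25 bucket.
    let trimmed = trim_below(&PROPOSED_BUCKETS, 0.25);
    println!("{} buckets: {:?}", trimmed.len(), trimmed);
}
```

The trade-off is fewer time series per histogram at the cost of no longer distinguishing, say, 10ms from 200ms fetches, which the comment above considers equally good.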

lexnv · 2026-04-02T09:58:56Z

The graph shows the latency spikes to 30s that were captured by the block-confidence-monitoring tool; with this change we should have better insight into these issues.

We have also deployed a trace version with extra monitoring on validators, which should give us more detail on what is happening in the validator-side code 🙏

Thanks a lot for the review 🙏

lexnv · 2026-04-02T10:16:00Z

/cmd prdoc --audience node_dev --bump patch

lexnv added 2 commits April 1, 2026 09:57

collator-proto/metrics: Fix blindspot in collation fetch latency metrics

ab2a221

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

metrics: Bump some other buckets

61d37db

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv self-assigned this Apr 1, 2026

lexnv added the T0-node This PR/Issue is related to the topic “node”. label Apr 1, 2026

sandreim reviewed Apr 1, 2026

Comment thread polkadot/node/network/collator-protocol/src/collator_side/metrics.rs Outdated

Comment thread polkadot/node/network/collator-protocol/src/collator_side/metrics.rs Outdated

Adjust metrics

1b3ad0b

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv mentioned this pull request Apr 1, 2026

[DNM/noreview] Deploy latest patches on YAP Polkadot w 3 cores fixed #11598

Open

sandreim approved these changes Apr 1, 2026

metrics: Trim buckets lower than 250ms

5c055ae

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv mentioned this pull request Apr 1, 2026

Lower Block Confidence on Polkadot 96% vs Kusama parachains 99% #11377

Open

skunert approved these changes Apr 1, 2026

lexnv enabled auto-merge April 2, 2026 09:59

github-actions Bot and others added 2 commits April 2, 2026 10:19

Update from github-actions[bot] running command 'prdoc --audience nod…

ee40243

…e_dev --bump patch'

Merge branch 'master' into lexnv/fix-metrics

a9aa946

lexnv added this pull request to the merge queue Apr 3, 2026

Merged via the queue into master with commit b77d5ec Apr 3, 2026
268 of 278 checks passed

lexnv deleted the lexnv/fix-metrics branch April 3, 2026 10:27

lexnv merged 6 commits into master from lexnv/fix-metrics
