Merged
@@ -172,7 +172,8 @@ impl metrics::Metrics for Metrics {
"How much time collations spend waiting to be fetched",
)
.buckets(vec![
-0.001, 0.01, 0.025, 0.05, 0.1, 0.15, 0.25, 0.35, 0.5, 0.75, 1.0, 2.0, 5.0,
+0.25, 0.35, 0.5, 0.75, 1.0, 2.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0,
+60.0,
]),
)?,
registry,
@@ -215,7 +216,9 @@ impl metrics::Metrics for Metrics {
"polkadot_parachain_collation_expired",
"How many collations expired (not backed or not included)",
)
-.buckets(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]),
+.buckets(vec![
+1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 15.0, 20.0, 25.0, 30.0,
+]),
&["state"],
)?,
registry,
21 changes: 21 additions & 0 deletions prdoc/pr_11600.prdoc
@@ -0,0 +1,21 @@
title: 'collator-proto/metrics: Fix blindspot in collation fetch latency metrics'
doc:
- audience: Node Dev
description: |-
The current implementation of the `collation_fetch_latency` metric contains a critical observability blindspot due to insufficient histogram bucket resolution.


Currently, the histogram buckets for collation fetch latency are capped at an upper bound of 5 seconds. This effectively creates a black box for investigating latency events: any fetch operation exceeding the 5s threshold is aggregated into the final bucket, regardless of whether the fetch took 30s or 1h. This obscures the true distribution of network delays and prevents accurate performance profiling for high-latency scenarios.

The discrepancy was identified with https://github.com/lexnv/block-confidence-monitor and confirmed via manual analysis of the logs. Without granular visibility into these outliers, we cannot effectively measure the success of the block confidence work or debug bottlenecks in validator-side protocols.

While at it, the buckets of other potentially relevant metrics have been extended as well.

Part of the block confidence work:
- https://github.com/paritytech/polkadot-sdk/issues/11377


cc @sandreim @skunert
crates:
- name: polkadot-collator-protocol
bump: patch
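
To illustrate the blindspot described in the prdoc, here is a minimal, dependency-free sketch of how an observation maps to a histogram bucket. It assumes Prometheus-style buckets, where each bound is an inclusive upper limit and anything above the last bound falls into an implicit `+Inf` bucket. The helper `bucket_index` and the constants `OLD_BUCKETS` / `NEW_BUCKETS` are hypothetical names for this example only; the bucket values are copied from the diff above, and the sketch does not use the node's actual metrics code.

```rust
/// Bucket bounds for the fetch latency histogram before the change (seconds).
const OLD_BUCKETS: &[f64] = &[
    0.001, 0.01, 0.025, 0.05, 0.1, 0.15, 0.25, 0.35, 0.5, 0.75, 1.0, 2.0, 5.0,
];

/// Bucket bounds for the fetch latency histogram after the change (seconds).
const NEW_BUCKETS: &[f64] = &[
    0.25, 0.35, 0.5, 0.75, 1.0, 2.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 60.0,
];

/// Returns the index of the smallest bucket whose upper bound is >= `value`.
/// `None` means the value exceeds every finite bound and is only counted
/// in the implicit `+Inf` bucket.
fn bucket_index(upper_bounds: &[f64], value: f64) -> Option<usize> {
    upper_bounds.iter().position(|&bound| value <= bound)
}

fn main() {
    // Sample latencies in seconds: a fast fetch, two slow ones, and an extreme outlier.
    for latency in [0.3, 7.0, 30.0, 3600.0] {
        println!(
            "latency {latency}s -> old: {:?}, new: {:?}",
            bucket_index(OLD_BUCKETS, latency),
            bucket_index(NEW_BUCKETS, latency),
        );
    }
}
```

With the old bounds, a 7s, 30s, and 1h fetch are indistinguishable because all three land in the overflow bucket. With the new bounds, fetches up to 60s are resolved into separate buckets, while anything slower than 60s is still only visible in `+Inf`.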