feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels #1716

Merged
yuranich merged 3 commits into
tetherto:feature-qvac-lib-registry-server-metrics-monitoring from
yuranich:feature-registry-dashboard-expansion
Apr 23, 2026

@yuranich

Contributor

What problem does this PR solve?

The Grafana dashboard visualizes only a small subset of what the /metrics endpoint exposes. A full scrape includes 12 QVAC metrics, 20 Hypercore metrics, 12 Hyperswarm metrics, and ~25 DHT/UDX metrics, but the existing dashboard covers only 6 QVAC metrics and no Holepunch P2P layer metrics. Operators therefore have no visibility into swarm connectivity, replication RTT, wire throughput, DHT health, firewall state, or protocol-level anomalies (invalid data/requests, packet drops). Registry-specific gaps remain as well: the blob core's raw replication numbers are collapsed into a single boolean, per-node on-disk bytes are not plotted over time, and full-replica seeder counts are not displayed.
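One way to reproduce this kind of coverage audit is to compare the metric names in a scrape against the expressions the dashboard queries. The following is a minimal sketch, not the method used for this PR: the helper names are hypothetical, `hyperswarm_example_peers` is a made-up series, and the dashboard dict is a stripped-down stand-in for the real JSON.

```python
import re

# Matches the metric name at the start of a sample line in the Prometheus text format.
SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def scraped_metric_names(exposition_text):
    """Return the set of metric names found in a /metrics scrape body."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def dashboard_expressions(dashboard):
    """Collect target expressions from panels, including panels nested in rows."""
    exprs = []
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            if "expr" in target:
                exprs.append(target["expr"])
        # Collapsed rows carry their children in a nested "panels" list.
        exprs.extend(dashboard_expressions({"panels": panel.get("panels", [])}))
    return exprs


def uncovered_metrics(exposition_text, dashboard):
    """Return scraped metric names that no dashboard expression mentions."""
    exprs = dashboard_expressions(dashboard)

    def used(name):
        # Require non-identifier characters on both sides so a name does not
        # match inside a longer metric name.
        pattern = re.compile(
            r"(?<![a-zA-Z0-9_:])" + re.escape(name) + r"(?![a-zA-Z0-9_:])"
        )
        return any(pattern.search(expr) for expr in exprs)

    return sorted(n for n in scraped_metric_names(exposition_text) if not used(n))


# Illustrative inputs only; the real scrape and dashboard are much larger.
sample_scrape = """\
# TYPE qvac_registry_view_core_seeders gauge
qvac_registry_view_core_seeders{vm_name="node-a"} 3
qvac_registry_blob_core_seeders{vm_name="node-a"} 2
hyperswarm_example_peers 7
"""
sample_dashboard = {
    "panels": [
        {"targets": [{"expr": 'qvac_registry_view_core_seeders{vm_name=~"$vm"}'}]}
    ]
}
missing = uncovered_metrics(sample_scrape, sample_dashboard)
```

Histogram and summary series appear under their suffixed sample names (`_bucket`, `_sum`, `_count`), so a real audit would likely normalise those before comparing.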

How does it solve it?

Adds 17 panels grouped into two blocks.

Registry block (4 panels):

  • Blob Core Replication timeseries (length / contiguous / gap) — mirrors the existing View Core Replication panel.
  • Core Seeders (full replicas) timeseries combining qvac_registry_view_core_seeders and qvac_registry_blob_core_seeders per instance.
  • Blob Core Bytes full-width timeseries for per-instance on-disk growth.
  • Totals Refresh Age stat (unit s, red at >600s, -1 mapped to "never") — catches a stalled background refresh that would silently freeze model_count / total_blob_bytes.
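As an illustration of the Totals Refresh Age stat described above, the sketch below builds a Grafana stat panel with the stated unit, threshold, and value mapping. It is an approximation under assumptions: the metric name `qvac_registry_totals_refresh_age_seconds`, the `max(...)` aggregation, and the panel size are hypothetical, while the `vm_name=~"$vm"` filter follows the follow-up commits on this PR.

```python
import json

# Hypothetical metric name; the PR names the panel but not the series behind it.
REFRESH_AGE_METRIC = "qvac_registry_totals_refresh_age_seconds"


def totals_refresh_age_panel(panel_id, x, y):
    """Build a stat panel dict in Grafana's dashboard JSON shape."""
    return {
        "id": panel_id,
        "type": "stat",
        "title": "Totals Refresh Age",
        # Width and height are assumptions; Grafana's grid is 24 columns wide.
        "gridPos": {"h": 4, "w": 4, "x": x, "y": y},
        "targets": [
            {"refId": "A", "expr": f'max({REFRESH_AGE_METRIC}{{vm_name=~"$vm"}})'}
        ],
        "fieldConfig": {
            "defaults": {
                "unit": "s",
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "green", "value": None},  # base step
                        # Grafana applies a step from its value upward,
                        # so the stat turns red at 600 s.
                        {"color": "red", "value": 600},
                    ],
                },
                # Value mapping: the -1 sentinel is displayed as "never".
                "mappings": [
                    {"type": "value", "options": {"-1": {"text": "never", "index": 0}}}
                ],
            }
        },
    }


def describe_refresh_age(age_seconds):
    """Mirror the panel's display rules in plain Python, for reasoning or tests."""
    if age_seconds == -1:
        return "never"
    return "red" if age_seconds >= 600 else "green"


# The PR places the stat in the x=20 slot of the y=37 row; the ID is illustrative.
panel = totals_refresh_age_panel(panel_id=26, x=20, y=37)
panel_json = json.dumps(panel, indent=2)
```

Keeping the display rules in a small function like `describe_refresh_age` makes it easy to check that the sentinel and the threshold behave as intended before importing the JSON into Grafana.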

New "Holepunch P2P Metrics" section:

  • Stat row: Swarm Peers, Firewalled Nodes, Hypercore Invalid Data, Hypercore Invalid Requests, UDX Packet Drops, Avg Congestion Window.
  • Timeseries row: Swarm Peers Over Time, Replication RTT.
  • Timeseries row: Swarm Connection Churn (client/server opened/closed rates), Hypercore Wire Data Throughput (Bps).
  • Timeseries row: UDX Bytes (Bps), DHT Query & Request Rate.
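The rate-based timeseries panels listed above could be generated along the following lines. This is a sketch under assumptions: every metric name below is hypothetical (the PR does not list the underlying series), and the helper functions are illustrative. `$__rate_interval` is Grafana's built-in rate window variable, and `Bps` is its bytes-per-second unit.

```python
def rate_expr(metric, window="$__rate_interval", vm_var="$vm"):
    """Per-second rate of a counter, filtered by the dashboard's $vm variable."""
    return f'rate({metric}{{vm_name=~"{vm_var}"}}[{window}])'


def rate_timeseries_panel(panel_id, title, series, unit):
    """Build a timeseries panel with one rate() target per counter.

    series maps a legend suffix to a counter metric name.
    """
    targets = []
    for index, (legend, metric) in enumerate(series.items()):
        targets.append(
            {
                "refId": chr(ord("A") + index),
                "expr": rate_expr(metric),
                # Grafana substitutes {{vm_name}} with the label value.
                "legendFormat": "{{vm_name}} " + legend,
            }
        )
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "targets": targets,
        "fieldConfig": {"defaults": {"unit": unit}},
    }


# Hypothetical counter names, chosen only to illustrate the panel shape.
churn = rate_timeseries_panel(
    panel_id=35,
    title="Swarm Connection Churn",
    series={
        "client opened": "hyperswarm_example_client_connections_opened",
        "client closed": "hyperswarm_example_client_connections_closed",
        "server opened": "hyperswarm_example_server_connections_opened",
        "server closed": "hyperswarm_example_server_connections_closed",
    },
    unit="ops",
)
udx_bytes = rate_timeseries_panel(
    panel_id=37,
    title="UDX Bytes",
    series={
        "received": "udx_example_bytes_received",
        "transmitted": "udx_example_bytes_transmitted",
    },
    unit="Bps",
)
```

Wrapping counters in `rate()` keeps the panels meaningful across process restarts, since a raw counter would drop to zero and distort the graph.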

The two Loki log panels shift from y=65/75 to y=102/112 to make room; the new Totals Refresh Age stat fills the previously empty x=20 slot in the existing y=37 stat row. Panel IDs are allocated from 26 upward to avoid collisions. Protocol internals (per-type wire counters, DHT ping/find-node/down-hint, hypercore sessions/hotswaps/cache) are deliberately not shown to keep the dashboard focused on actionable signals.
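The layout bookkeeping described here (moving the log panels down by 37 rows and picking unused panel IDs) can be sketched as below. The helpers are hypothetical and the existing panel IDs, titles, and sizes are invented for illustration; only the y positions and the starting ID of 26 come from the description.

```python
def shift_panels_down(panels, from_y, delta):
    """Return copies of panels, moving every panel at or below from_y down by delta rows."""
    shifted = []
    for panel in panels:
        pos = dict(panel["gridPos"])  # copy so the input is left untouched
        if pos["y"] >= from_y:
            pos["y"] += delta
        shifted.append({**panel, "gridPos": pos})
    return shifted


def next_free_ids(panels, count, start=26):
    """Return `count` panel IDs at or above `start` that no existing panel uses."""
    used = {panel["id"] for panel in panels}
    ids = []
    candidate = start
    while len(ids) < count:
        if candidate not in used:
            ids.append(candidate)
        candidate += 1
    return ids


# Illustrative stand-ins for the two Loki log panels (IDs and sizes assumed).
existing = [
    {"id": 24, "title": "Logs A", "gridPos": {"x": 0, "y": 65, "w": 24, "h": 10}},
    {"id": 25, "title": "Logs B", "gridPos": {"x": 0, "y": 75, "w": 24, "h": 10}},
]
moved = shift_panels_down(existing, from_y=65, delta=37)
new_ids = next_free_ids(existing, count=3)
```

Returning copies rather than mutating in place makes it straightforward to diff the old and new layouts before committing the dashboard JSON.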

Breaking changes

None. Dashboard-only change; no metric additions, removals, or renames.

@yuranich yuranich requested review from a team as code owners April 23, 2026 07:30
@yuranich yuranich merged commit 4a19b32 into tetherto:feature-qvac-lib-registry-server-metrics-monitoring Apr 23, 2026
6 checks passed
yuranich added a commit that referenced this pull request Apr 24, 2026
…stry server (#1724)

* QVAC-17131 feat: add Prometheus metrics monitoring to registry server (#1600)

* feat: add Prometheus metrics monitoring to registry server

* fix: restrict registry ping RPC to role and timestamp to avoid exposing operational data

* fix: make metrics bind host configurable and move off port 9090

* feat: replace per-model size gauge with view-derived total blob bytes (#1689)

* feat[bc]: rename gauges, add seeder metrics, and eagerly open blob core on indexers (#1692)

* feat[bc]: rename gauge metrics off _total suffix and pre-initialise rpc counters

* feat: add core seeder metrics and eagerly open blob core on indexers

* style: drop eslint-disable directives via helper function for gauge registration

* refactor[bc]: drop core_name label from blob core metrics and use median for view-derived stat panels

* style: drop noisy comment above registerGauge helper

* feat[bc]: replace blob_core_fully_downloaded with length/contiguous_length pair and drop blind-peer metrics (#1702)

* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels (#1716)

* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels

* fix: use vm_name label in QVAC and Holepunch panel legends instead of raw instance IP:port

* fix: apply $vm template filter to QVAC and Holepunch selectors for consistent per-node filtering


* chore[docs]: tighten registry Grafana dashboard panels based on staging review (#1718)

* chore[docs]: tighten registry Grafana dashboard panels based on staging review

* chore[docs]: drop redundant Blob Core Contiguous stat, cluster blob panels near the top

* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section (#1719)

* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section

* chore[docs]: split View Core Replication into length, contiguous, and gap panels

* chore: remove dead blind-peer helpers and fix stale metrics docs

- Drop unreferenced getConnectedBlindPeerKeys / getConfiguredBlindPeerKeys /
  isBlindPeerConnected chain and the _peerConnectionCounts map that only
  existed to back isBlindPeerConnected. Left over from the dropped
  blob_core_blind_peers gauge (1de851b).
- Fix DEPLOYMENT_GUIDE.md: default metrics port is 9210, not 9090; drop
  the hypermetrics reference since it is not a dependency (abandoned,
  incompatible with Hypercore v11) and per-core visibility is provided
  by the registry_blob_core_* / registry_view_core_* gauges.
Proletter pushed a commit that referenced this pull request May 24, 2026
…stry server (#1724)
