QVAC-17131 feat: add Prometheus metrics monitoring to registry server #1600

Merged
yuranich merged 3 commits into tetherto:feature-qvac-lib-registry-server-metrics-monitoring from yuranich:feat/QVAC-17131-registry-server-metrics-monitoring on Apr 20, 2026

yuranich (Contributor) commented Apr 15, 2026

What problem does this PR solve?

The registry server has no operational visibility beyond pino logs and PM2 process stats. Operators cannot answer questions about model availability, download throughput, replication health, or per-model storage without SSH and ad-hoc scripts.

How does it solve it?

Implements 4-layer Prometheus metrics monitoring (pitch: april_pitches/Registry/Registry Server Metrics Monitoring.md):

  • Layer 1: In-process /metrics HTTP endpoint (default port 9210, loopback-bound) wiring hypercore-stats, hyperswarm-stats, hypermetrics, plus 13 QVAC-specific gauges/counters (view core length, model count, RPC request/error rates, blob core health, per-model size). Bind address is configurable via --metrics-host; port 9210 is chosen to avoid clashing with Prometheus's own 9090 and to sit next to pm2-prometheus-exporter on 9209.
  • Layer 2: hyper-health-check sidecar dependency for external core discoverability validation.
  • Layer 3: PM2 ecosystem.config.js standardizing the ad-hoc process setup.
  • Layer 4: Grafana dashboard JSON (Holepunch baseline + QVAC panels) and deployment docs.
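To make Layer 1 concrete, here is a minimal sketch of a loopback-bound `/metrics` endpoint that serves a gauge in the Prometheus text exposition format. It uses only Node's built-in `http` module and is not the PR's actual wiring. The helper names (`renderMetrics`, `startMetricsServer`), the `gauges` map, and the metric name `registry_view_core_length` are assumptions made for illustration.

```javascript
'use strict'
const http = require('node:http')

// Hypothetical in-memory gauges; a real server would read these from its cores.
const gauges = new Map([
  ['registry_view_core_length', { help: 'Length of the view core (illustrative name)', value: 0 }]
])

// Render every gauge in the Prometheus text exposition format.
function renderMetrics (registry) {
  const lines = []
  for (const [name, { help, value }] of registry) {
    lines.push(`# HELP ${name} ${help}`)
    lines.push(`# TYPE ${name} gauge`)
    lines.push(`${name} ${value}`)
  }
  return lines.join('\n') + '\n'
}

// Serve GET /metrics only, bound to loopback on port 9210 unless overridden.
function startMetricsServer ({ host = '127.0.0.1', port = 9210, registry = gauges } = {}) {
  const server = http.createServer((req, res) => {
    if (req.method !== 'GET' || req.url !== '/metrics') {
      res.writeHead(404)
      res.end()
      return
    }
    res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4; charset=utf-8' })
    res.end(renderMetrics(registry))
  })
  return new Promise((resolve, reject) => {
    server.once('error', reject)
    server.listen(port, host, () => resolve(server))
  })
}
```

With the server started via `await startMetricsServer()`, `curl http://127.0.0.1:9210/metrics` on the same host should return the rendered gauges. Because the listener is bound to `127.0.0.1`, a remote Prometheus cannot scrape it unless the bind host is changed.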

RPC handlers are instrumented with request/error counters. All existing tests pass (37 unit + 13 integration), and 7 new metrics integration tests are added.
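One way to instrument RPC handlers with request/error counters is to wrap each handler, as sketched below. The names `createRpcCounters` and `instrument`, and the plain `Map` used as a counter store, are hypothetical and do not come from the PR. The counters start at zero for every known method so that each one exists before its first request.

```javascript
'use strict'

// Hypothetical counter store: one { requests, errors } pair per RPC method,
// initialised to zero up front.
function createRpcCounters (methods) {
  const counters = new Map()
  for (const method of methods) counters.set(method, { requests: 0, errors: 0 })
  return counters
}

// Wrap a handler so that every call bumps `requests` and every failure bumps `errors`.
function instrument (counters, method, handler) {
  const counter = counters.get(method)
  if (!counter) throw new Error(`unknown RPC method: ${method}`)
  return async (...args) => {
    counter.requests++
    try {
      return await handler(...args)
    } catch (err) {
      counter.errors++
      throw err // keep the original failure visible to the caller
    }
  }
}
```

The wrapper rethrows the error, so the caller sees the same result as before and only the counters change. Request and error rates would then be computed on the Prometheus side from these monotonically increasing values.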

Breaking changes

None. Metrics endpoint remains opt-in and the default bind host is 127.0.0.1 (no behaviour change for unconfigured deployments). Default port moved from 9090 to 9210; any operator who was already scraping 9090 must update their Prometheus scrape config or pass --metrics-port 9090 explicitly.
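The defaults described above can be sketched as a small option resolver. The flags `--metrics-host` and `--metrics-port` come from this description. The `--metrics` opt-in flag and the `resolveMetricsOptions` helper are assumptions, since the PR text does not say how the endpoint is enabled.

```javascript
'use strict'
const { parseArgs } = require('node:util')

// Resolve metrics settings from CLI arguments, falling back to the
// documented defaults (127.0.0.1 and 9210).
function resolveMetricsOptions (argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      metrics: { type: 'boolean' }, // hypothetical opt-in switch
      'metrics-host': { type: 'string' },
      'metrics-port': { type: 'string' }
    },
    strict: false // tolerate the server's other flags
  })
  const host = values['metrics-host'] ?? '127.0.0.1'
  const rawPort = values['metrics-port'] ?? '9210'
  if (typeof host !== 'string' || typeof rawPort !== 'string') {
    throw new TypeError('--metrics-host and --metrics-port require a value')
  }
  const port = Number(rawPort)
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`invalid metrics port: ${rawPort}`)
  }
  return { enabled: values.metrics === true, host, port }
}
```

Under these assumptions, an operator who wants to keep an existing scrape target would pass `--metrics-port 9090`, and `resolveMetricsOptions` would return port 9090 in place of the 9210 default.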

Proletter previously approved these changes Apr 20, 2026
yuranich changed the base branch from main to feature-qvac-lib-registry-server-metrics-monitoring April 20, 2026 13:56
yuranich merged commit 51c199a into tetherto:feature-qvac-lib-registry-server-metrics-monitoring Apr 20, 2026
10 of 11 checks passed
yuranich added a commit that referenced this pull request Apr 24, 2026
…stry server (#1724)

* QVAC-17131 feat: add Prometheus metrics monitoring to registry server (#1600)

* feat: add Prometheus metrics monitoring to registry server

* fix: restrict registry ping RPC to role and timestamp to avoid exposing operational data

* fix: make metrics bind host configurable and move off port 9090

* feat: replace per-model size gauge with view-derived total blob bytes (#1689)

* feat[bc]: rename gauges, add seeder metrics, and eagerly open blob core on indexers (#1692)

* feat[bc]: rename gauge metrics off _total suffix and pre-initialise rpc counters

* feat: add core seeder metrics and eagerly open blob core on indexers

* style: drop eslint-disable directives via helper function for gauge registration

* refactor[bc]: drop core_name label from blob core metrics and use median for view-derived stat panels

* style: drop noisy comment above registerGauge helper

* feat[bc]: replace blob_core_fully_downloaded with length/contiguous_length pair and drop blind-peer metrics (#1702)

* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels (#1716)

* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels

* fix: use vm_name label in QVAC and Holepunch panel legends instead of raw instance IP:port

* fix: apply $vm template filter to QVAC and Holepunch selectors for consistent per-node filtering


* chore[docs]: tighten registry Grafana dashboard panels based on staging review (#1718)

* chore[docs]: tighten registry Grafana dashboard panels based on staging review

* chore[docs]: drop redundant Blob Core Contiguous stat, cluster blob panels near the top

* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section (#1719)

* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section

* chore[docs]: split View Core Replication into length, contiguous, and gap panels

* chore: remove dead blind-peer helpers and fix stale metrics docs

- Drop unreferenced getConnectedBlindPeerKeys / getConfiguredBlindPeerKeys /
  isBlindPeerConnected chain and the _peerConnectionCounts map that only
  existed to back isBlindPeerConnected. Left over from the dropped
  blob_core_blind_peers gauge (1de851b).
- Fix DEPLOYMENT_GUIDE.md: default metrics port is 9210, not 9090; drop
  the hypermetrics reference since it is not a dependency (abandoned,
  incompatible with Hypercore v11) and per-core visibility is provided
  by the registry_blob_core_* / registry_view_core_* gauges.
Proletter pushed a commit that referenced this pull request May 24, 2026
…stry server (#1724)