feat: add Prometheus metrics monitoring and Grafana dashboard to registry server #1724

Merged
yuranich merged 9 commits into main from feature-qvac-lib-registry-server-metrics-monitoring
Apr 24, 2026

yuranich commented Apr 23, 2026

Contributor

What problem does this PR solve?

The registry server had no operational visibility beyond pino logs and PM2 process stats. Operators could not answer questions about model availability, replication progress, per-core contiguous length vs total length, peer counts, full-replica (seeder) counts, RPC request/error rates, or total blob storage without SSH-ing into each node and running ad-hoc scripts. There was also no checked-in PM2 ecosystem config, no reference Grafana dashboard, and no standard way to scrape the registry process.

This branch implements the full monitoring stack pitched in QVAC-17131 and merges it back to main after staging verification.

How does it solve it?

Four layers, all scoped to packages/qvac-lib-registry-server/:

  • In-process Prometheus /metrics endpoint on a dedicated HTTP server. It defaults to port 9210 bound to 127.0.0.1 and is configurable via --metrics-host / --metrics-port; 9210 avoids clashing with Prometheus's own 9090 and sits next to pm2-prometheus-exporter on 9209. The endpoint wires the Holepunch baseline (hypercore-stats, hyperswarm-stats) for aggregate swarm/DHT stats and adds QVAC-specific instrumentation for per-core visibility. hypermetrics is not wired because it is abandoned upstream and incompatible with Hypercore v11, so per-core gauges are exposed directly:
    • view-core health: registry_view_core_length, registry_view_core_contiguous_length, registry_totals_refresh_age_seconds
    • blob-core replication: registry_blob_core_length, registry_blob_core_contiguous_length, registry_blob_core_peers, registry_blob_core_seeders (peers advertising full replicas), registry_blob_core_bytes
    • catalog + RPC: registry_model_count, registry_rpc_requests_total, registry_rpc_errors_total
    • Blob core is eagerly opened on indexers so length/contiguous_length are populated before traffic arrives.
  • External hyper-health-check sidecar as a runtime dependency for independent core discoverability validation.
  • PM2 ecosystem.config.js checked into the repo, replacing the ad-hoc per-server setups described in the deployment guide. Its defaults bind to loopback only, which is the safe choice for a public open-source repo.
  • Grafana dashboard (docs/grafana/REGISTRY_DASHBOARD.json) covering the Holepunch baseline plus QVAC panels — view-core replication, blob-core length/contiguous/gap, seeders, peers, bytes, RPC request/error rates — with View Core Replication and Blob Core Bytes promoted to the top of the QVAC section as the first signals operators should read. Deployment guide updated with scrape config and dashboard import instructions.
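
To make the first layer concrete, here is a minimal sketch of a loopback-bound /metrics endpoint that serves a few of the gauges listed above in the Prometheus text exposition format. It uses only Node's built-in http module and is not the code in this PR. The snapshot shape, the help strings, the `method` label, and the function names `renderMetrics` and `startMetricsServer` are assumptions made for illustration.

```javascript
const http = require('node:http')

// Render a snapshot of values as Prometheus text exposition format.
// Hypothetical snapshot shape:
// { viewCoreLength, viewCoreContiguousLength, modelCount, rpcRequests: { method: count } }
function renderMetrics (snapshot) {
  const lines = []

  const gauge = (name, help, value) => {
    lines.push(`# HELP ${name} ${help}`)
    lines.push(`# TYPE ${name} gauge`)
    lines.push(`${name} ${value}`)
  }

  // Metric names come from the PR description; help strings are illustrative.
  gauge('registry_view_core_length', 'Total blocks in the view core', snapshot.viewCoreLength)
  gauge('registry_view_core_contiguous_length', 'Blocks available contiguously from index 0', snapshot.viewCoreContiguousLength)
  gauge('registry_model_count', 'Models in the catalog', snapshot.modelCount)

  // The "method" label is an assumption; the PR does not list label names.
  lines.push('# HELP registry_rpc_requests_total RPC requests handled')
  lines.push('# TYPE registry_rpc_requests_total counter')
  for (const [method, count] of Object.entries(snapshot.rpcRequests)) {
    lines.push(`registry_rpc_requests_total{method="${method}"} ${count}`)
  }

  return lines.join('\n') + '\n'
}

// Serve /metrics on a dedicated server. Defaults mirror the PR: 127.0.0.1:9210.
function startMetricsServer ({ host = '127.0.0.1', port = 9210, getSnapshot }) {
  const server = http.createServer((req, res) => {
    if (req.method !== 'GET' || req.url !== '/metrics') {
      res.writeHead(404).end()
      return
    }
    res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4; charset=utf-8' })
    res.end(renderMetrics(getSnapshot()))
  })
  server.listen(port, host)
  return server
}
```

Calling `startMetricsServer({ getSnapshot })` with a function that reads current values would expose them at http://127.0.0.1:9210/metrics. Passing a snapshot whose `rpcRequests` entries start at 0 exposes each counter before its first request, which is the idea behind the "pre-initialise rpc counters" commit in this branch.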

Net diff against main: 12 files changed, +4161 / −103.
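
For readers unfamiliar with PM2 ecosystem files, the sketch below shows roughly what a loopback-bound entry for the checked-in ecosystem.config.js could look like. It is an illustration only, not the file added by this PR: the process name and script path are hypothetical, and only the --metrics-host / --metrics-port flags and their default values come from the description above.

```javascript
// ecosystem.config.js (illustrative sketch, not the checked-in file)
const config = {
  apps: [
    {
      name: 'registry-server', // hypothetical process name
      script: './bin/registry-server.js', // hypothetical entry point
      // Keep the metrics endpoint on loopback; a local Prometheus agent or
      // reverse proxy is then responsible for any remote access.
      args: '--metrics-host 127.0.0.1 --metrics-port 9210',
      autorestart: true,
      max_restarts: 10
    }
  ]
}

module.exports = config
```

With a file like this, `pm2 start ecosystem.config.js` launches the process with the metrics flags applied, so every node starts from the same loopback-only defaults instead of a hand-written per-server command line.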

Verification

Deployed to the staging cluster (registry-stg-01/02/03); Grafana dashboard is live and reporting the expected signals. Screenshot from staging attached below.
[Screenshot: staging Grafana dashboard, 2026-04-23 17:48:33]

Test suites on the feature branch:

  • Registry-server unit tests: pass
  • Registry-server integration tests, including 7 new metrics.integration.test.js cases: pass

Breaking changes

None

…#1600)

* feat: add Prometheus metrics monitoring to registry server

* fix: restrict registry ping RPC to role and timestamp to avoid exposing operational data

* fix: make metrics bind host configurable and move off port 9090
…re on indexers (#1692)

* feat[bc]: rename gauge metrics off _total suffix and pre-initialise rpc counters

* feat: add core seeder metrics and eagerly open blob core on indexers

* style: drop eslint-disable directives via helper function for gauge registration

* refactor[bc]: drop core_name label from blob core metrics and use median for view-derived stat panels

* style: drop noisy comment above registerGauge helper
…nd Holepunch P2P panels (#1716)

* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels

Made-with: Cursor

* fix: use vm_name label in QVAC and Holepunch panel legends instead of raw instance IP:port

Made-with: Cursor

* fix: apply $vm template filter to QVAC and Holepunch selectors for consistent per-node filtering

Made-with: Cursor
…ng review (#1718)

* chore[docs]: tighten registry Grafana dashboard panels based on staging review

* chore[docs]: drop redundant Blob Core Contiguous stat, cluster blob panels near the top
… top of the metrics section (#1719)

* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section

* chore[docs]: split View Core Replication into length, contiguous, and gap panels
- Drop unreferenced getConnectedBlindPeerKeys / getConfiguredBlindPeerKeys /
  isBlindPeerConnected chain and the _peerConnectionCounts map that only
  existed to back isBlindPeerConnected. Left over from the dropped
  blob_core_blind_peers gauge (1de851b).
- Fix DEPLOYMENT_GUIDE.md: default metrics port is 9210, not 9090; drop
  the hypermetrics reference since it is not a dependency (abandoned,
  incompatible with Hypercore v11) and per-core visibility is provided
  by the registry_blob_core_* / registry_view_core_* gauges.

Made-with: Cursor

github-actions Bot commented Apr 23, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ❌ (0/1)
- 1 Team Lead OR Management approval ✅ (2/1)

**Bypass rule:** Triggered (2+ Team Lead approvals (Tier 1 exception)). This PR is approved regardless of tier.

---
*This comment is automatically updated when reviews change.*

@yuranich

Contributor Author

/review

yuranich merged commit 2b44bd6 into main Apr 24, 2026
49 of 79 checks passed
yuranich deleted the feature-qvac-lib-registry-server-metrics-monitoring branch April 24, 2026 09:06
Proletter pushed a commit that referenced this pull request May 24, 2026
…stry server (#1724)

* QVAC-17131 feat: add Prometheus metrics monitoring to registry server (#1600)

* feat: add Prometheus metrics monitoring to registry server

* fix: restrict registry ping RPC to role and timestamp to avoid exposing operational data

* fix: make metrics bind host configurable and move off port 9090

* feat: replace per-model size gauge with view-derived total blob bytes (#1689)

* feat[bc]: rename gauges, add seeder metrics, and eagerly open blob core on indexers (#1692)

* feat[bc]: rename gauge metrics off _total suffix and pre-initialise rpc counters

* feat: add core seeder metrics and eagerly open blob core on indexers

* style: drop eslint-disable directives via helper function for gauge registration

* refactor[bc]: drop core_name label from blob core metrics and use median for view-derived stat panels

* style: drop noisy comment above registerGauge helper

* feat[bc]: replace blob_core_fully_downloaded with length/contiguous_length pair and drop blind-peer metrics (#1702)

* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels (#1716)

* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels

* fix: use vm_name label in QVAC and Holepunch panel legends instead of raw instance IP:port

* fix: apply $vm template filter to QVAC and Holepunch selectors for consistent per-node filtering


* chore[docs]: tighten registry Grafana dashboard panels based on staging review (#1718)

* chore[docs]: tighten registry Grafana dashboard panels based on staging review

* chore[docs]: drop redundant Blob Core Contiguous stat, cluster blob panels near the top

* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section (#1719)

* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section

* chore[docs]: split View Core Replication into length, contiguous, and gap panels

* chore: remove dead blind-peer helpers and fix stale metrics docs

- Drop unreferenced getConnectedBlindPeerKeys / getConfiguredBlindPeerKeys /
  isBlindPeerConnected chain and the _peerConnectionCounts map that only
  existed to back isBlindPeerConnected. Left over from the dropped
  blob_core_blind_peers gauge (1de851b).
- Fix DEPLOYMENT_GUIDE.md: default metrics port is 9210, not 9090; drop
  the hypermetrics reference since it is not a dependency (abandoned,
  incompatible with Hypercore v11) and per-core visibility is provided
  by the registry_blob_core_* / registry_view_core_* gauges.
3 participants