feat: add Prometheus metrics monitoring and Grafana dashboard to registry server #1724
Merged
yuranich merged 9 commits on Apr 24, 2026
…#1600)
* feat: add Prometheus metrics monitoring to registry server
* fix: restrict registry ping RPC to role and timestamp to avoid exposing operational data
* fix: make metrics bind host configurable and move off port 9090

…re on indexers (#1692)
* feat[bc]: rename gauge metrics off _total suffix and pre-initialise rpc counters
* feat: add core seeder metrics and eagerly open blob core on indexers
* style: drop eslint-disable directives via helper function for gauge registration
* refactor[bc]: drop core_name label from blob core metrics and use median for view-derived stat panels
* style: drop noisy comment above registerGauge helper

…ength pair and drop blind-peer metrics (#1702)

…nd Holepunch P2P panels (#1716)
* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels
  Made-with: Cursor
* fix: use vm_name label in QVAC and Holepunch panel legends instead of raw instance IP:port
  Made-with: Cursor
* fix: apply $vm template filter to QVAC and Holepunch selectors for consistent per-node filtering
  Made-with: Cursor

…ng review (#1718)
* chore[docs]: tighten registry Grafana dashboard panels based on staging review
* chore[docs]: drop redundant Blob Core Contiguous stat, cluster blob panels near the top

… top of the metrics section (#1719)
* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section
* chore[docs]: split View Core Replication into length, contiguous, and gap panels

- Drop unreferenced getConnectedBlindPeerKeys / getConfiguredBlindPeerKeys / isBlindPeerConnected chain and the _peerConnectionCounts map that only existed to back isBlindPeerConnected. Left over from the dropped blob_core_blind_peers gauge (1de851b).
- Fix DEPLOYMENT_GUIDE.md: default metrics port is 9210, not 9090; drop the hypermetrics reference since it is not a dependency (abandoned, incompatible with Hypercore v11) and per-core visibility is provided by the registry_blob_core_* / registry_view_core_* gauges.
Made-with: Cursor
NamelsKing approved these changes on Apr 23, 2026
Proletter approved these changes on Apr 23, 2026
Proletter pushed a commit that referenced this pull request on May 24, 2026
…stry server (#1724)
* QVAC-17131 feat: add Prometheus metrics monitoring to registry server (#1600)
* feat: add Prometheus metrics monitoring to registry server
* fix: restrict registry ping RPC to role and timestamp to avoid exposing operational data
* fix: make metrics bind host configurable and move off port 9090
* feat: replace per-model size gauge with view-derived total blob bytes (#1689)
* feat[bc]: rename gauges, add seeder metrics, and eagerly open blob core on indexers (#1692)
* feat[bc]: rename gauge metrics off _total suffix and pre-initialise rpc counters
* feat: add core seeder metrics and eagerly open blob core on indexers
* style: drop eslint-disable directives via helper function for gauge registration
* refactor[bc]: drop core_name label from blob core metrics and use median for view-derived stat panels
* style: drop noisy comment above registerGauge helper
* feat[bc]: replace blob_core_fully_downloaded with length/contiguous_length pair and drop blind-peer metrics (#1702)
* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels (#1716)
* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels
* fix: use vm_name label in QVAC and Holepunch panel legends instead of raw instance IP:port
* fix: apply $vm template filter to QVAC and Holepunch selectors for consistent per-node filtering
* chore[docs]: tighten registry Grafana dashboard panels based on staging review (#1718)
* chore[docs]: tighten registry Grafana dashboard panels based on staging review
* chore[docs]: drop redundant Blob Core Contiguous stat, cluster blob panels near the top
* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section (#1719)
* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section
* chore[docs]: split View Core Replication into length, contiguous, and gap panels
* chore: remove dead blind-peer helpers and fix stale metrics docs
- Drop unreferenced getConnectedBlindPeerKeys / getConfiguredBlindPeerKeys / isBlindPeerConnected chain and the _peerConnectionCounts map that only existed to back isBlindPeerConnected. Left over from the dropped blob_core_blind_peers gauge (1de851b).
- Fix DEPLOYMENT_GUIDE.md: default metrics port is 9210, not 9090; drop the hypermetrics reference since it is not a dependency (abandoned, incompatible with Hypercore v11) and per-core visibility is provided by the registry_blob_core_* / registry_view_core_* gauges.
What problem does this PR solve?
The registry server had no operational visibility beyond pino logs and PM2 process stats. Operators could not answer questions about model availability, replication progress, per-core contiguous length vs total length, peer counts, full-replica (seeder) counts, RPC request/error rates, or total blob storage without SSH-ing into each node and running ad-hoc scripts. There was also no checked-in PM2 ecosystem config, no reference Grafana dashboard, and no standard way to scrape the registry process.
This branch implements the full monitoring stack pitched in QVAC-17131 and merges it back to main after staging verification.

How does it solve it?
Four layers, all scoped to packages/qvac-lib-registry-server/:

- /metrics endpoint on a dedicated HTTP server (default port 9210, bound to 127.0.0.1, configurable via --metrics-host / --metrics-port; 9210 avoids clashing with Prometheus's own 9090 and sits next to pm2-prometheus-exporter on 9209). Wires the Holepunch baseline (hypercore-stats, hyperswarm-stats) for aggregate swarm/DHT stats and adds QVAC-specific instrumentation for per-core visibility (hypermetrics is not wired: it is abandoned upstream and incompatible with Hypercore v11, so per-core gauges are exposed directly):
  - registry_view_core_length, registry_view_core_contiguous_length, registry_totals_refresh_age_seconds
  - registry_blob_core_length, registry_blob_core_contiguous_length, registry_blob_core_peers, registry_blob_core_seeders (peers advertising full replicas), registry_blob_core_bytes
  - registry_model_count, registry_rpc_requests_total, registry_rpc_errors_total
- hyper-health-check sidecar as a runtime dependency for independent core discoverability validation.
- ecosystem.config.js checked into the repo, replacing the ad-hoc per-server setups described in the deployment guide. Defaults are loopback-safe for public-OSS reasons.
- Grafana dashboard (docs/grafana/REGISTRY_DASHBOARD.json) covering the Holepunch baseline plus QVAC panels (view-core replication, blob-core length/contiguous/gap, seeders, peers, bytes, RPC request/error rates), with View Core Replication and Blob Core Bytes promoted to the top of the QVAC section as the first signals operators should read. Deployment guide updated with scrape config and dashboard import instructions.

Net diff against main: 12 files changed, +4161 / −103.

Verification
Deployed to the staging cluster (registry-stg-01/02/03); Grafana dashboard is live and reporting the expected signals. Screenshot from staging attached below.

Test suites on the feature branch:

- metrics.integration.test.js cases: pass

Breaking changes
None
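
Illustrative sketch (not code from this PR)
For readers unfamiliar with the shape of the dedicated /metrics endpoint described under "How does it solve it?", the following is a minimal sketch using only Node's built-in http module. The helper names (addGauge, renderMetrics, createMetricsServer, startMetricsServer) and the in-memory blobCore object are hypothetical stand-ins; the actual implementation may well rely on a metrics library and real Hypercore instances. Only the two gauge names, the 127.0.0.1 default host, and the 9210 default port are taken from the description.

```javascript
// Minimal sketch: Prometheus text exposition served from a loopback-only HTTP server.
// Assumption: hand-rolled rendering with node:http; the PR itself may use a library.
const http = require('node:http')

// name -> { help, read }, where read() returns the current numeric value.
const gauges = new Map()

// Hypothetical helper: register a gauge whose value is read lazily at scrape time.
function addGauge (name, help, read) {
  gauges.set(name, { help, read })
}

// Render every registered gauge in the Prometheus text exposition format.
function renderMetrics () {
  const lines = []
  for (const [name, { help, read }] of gauges) {
    lines.push(`# HELP ${name} ${help}`)
    lines.push(`# TYPE ${name} gauge`)
    lines.push(`${name} ${read()}`)
  }
  return lines.join('\n') + '\n'
}

// Serve GET /metrics and answer 404 for everything else.
function createMetricsServer () {
  return http.createServer((req, res) => {
    if (req.method === 'GET' && req.url === '/metrics') {
      res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4; charset=utf-8' })
      res.end(renderMetrics())
      return
    }
    res.writeHead(404)
    res.end()
  })
}

// Hypothetical in-memory state standing in for a real blob core.
const blobCore = { length: 120, contiguousLength: 90 }
addGauge('registry_blob_core_length', 'Total blocks in the blob core', () => blobCore.length)
addGauge('registry_blob_core_contiguous_length', 'Contiguous blocks available locally', () => blobCore.contiguousLength)

// Defaults mirror the PR description: loopback host, port 9210.
function startMetricsServer ({ host = '127.0.0.1', port = 9210 } = {}) {
  const server = createMetricsServer()
  return new Promise((resolve, reject) => {
    server.once('error', reject)
    server.listen(port, host, () => resolve(server))
  })
}
```

Under these assumptions, awaiting startMetricsServer() and requesting http://127.0.0.1:9210/metrics would return both gauges as plain text. Reading values lazily at scrape time keeps the sketch free of timers, and binding to loopback by default matches the loopback-safe defaults the description calls for.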