QVAC-17131 feat: add Prometheus metrics monitoring to registry server #1600
Merged
yuranich merged 3 commits on Apr 20, 2026
Made-with: Cursor
Proletter previously approved these changes on Apr 20, 2026
…ng operational data 51c199a into tetherto:feature-qvac-lib-registry-server-metrics-monitoring
10 of 11 checks passed
yuranich added a commit that referenced this pull request on Apr 24, 2026
…stry server (#1724)
* QVAC-17131 feat: add Prometheus metrics monitoring to registry server (#1600)
* feat: add Prometheus metrics monitoring to registry server
* fix: restrict registry ping RPC to role and timestamp to avoid exposing operational data
* fix: make metrics bind host configurable and move off port 9090
* feat: replace per-model size gauge with view-derived total blob bytes (#1689)
* feat[bc]: rename gauges, add seeder metrics, and eagerly open blob core on indexers (#1692)
* feat[bc]: rename gauge metrics off _total suffix and pre-initialise rpc counters
* feat: add core seeder metrics and eagerly open blob core on indexers
* style: drop eslint-disable directives via helper function for gauge registration
* refactor[bc]: drop core_name label from blob core metrics and use median for view-derived stat panels
* style: drop noisy comment above registerGauge helper
* feat[bc]: replace blob_core_fully_downloaded with length/contiguous_length pair and drop blind-peer metrics (#1702)
* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels (#1716)
* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels
* fix: use vm_name label in QVAC and Holepunch panel legends instead of raw instance IP:port
* fix: apply $vm template filter to QVAC and Holepunch selectors for consistent per-node filtering
* chore[docs]: tighten registry Grafana dashboard panels based on staging review (#1718)
* chore[docs]: tighten registry Grafana dashboard panels based on staging review
* chore[docs]: drop redundant Blob Core Contiguous stat, cluster blob panels near the top
* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section (#1719)
* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section
* chore[docs]: split View Core Replication into length, contiguous, and gap panels
* chore: remove dead blind-peer helpers and fix stale metrics docs
- Drop unreferenced getConnectedBlindPeerKeys / getConfiguredBlindPeerKeys / isBlindPeerConnected chain and the _peerConnectionCounts map that only existed to back isBlindPeerConnected. Left over from the dropped blob_core_blind_peers gauge (1de851b).
- Fix DEPLOYMENT_GUIDE.md: default metrics port is 9210, not 9090; drop the hypermetrics reference since it is not a dependency (abandoned, incompatible with Hypercore v11) and per-core visibility is provided by the registry_blob_core_* / registry_view_core_* gauges.
Proletter pushed a commit that referenced this pull request on May 24, 2026
…stry server (#1724)
What problem does this PR solve?
The registry server has no operational visibility beyond pino logs and PM2 process stats. Operators cannot answer questions about model availability, download throughput, replication health, or per-model storage without SSH and ad-hoc scripts.
How does it solve it?
Implements 4-layer Prometheus metrics monitoring (pitch: april_pitches/Registry/Registry Server Metrics Monitoring.md):
- /metrics HTTP endpoint (default port 9210, loopback-bound) wiring hypercore-stats, hyperswarm-stats, hypermetrics, plus 13 QVAC-specific gauges/counters (view core length, model count, RPC request/error rates, blob core health, per-model size). Bind address is configurable via --metrics-host; port 9210 is chosen to avoid clashing with Prometheus's own 9090 and to sit next to pm2-prometheus-exporter on 9209.
- RPC handlers instrumented with request/error counters.
All existing tests pass (37 unit + 13 integration), 7 new metrics integration tests added.
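
For reviewers who want a concrete picture of what "RPC handlers instrumented with request/error counters" can look like, here is a minimal, dependency-free sketch. It is not the code from this PR: the `Counter` class, the `instrument` wrapper, and the metric names `registry_rpc_requests_total` / `registry_rpc_errors_total` are all hypothetical, and a real server would more likely use an existing Prometheus client library. The sketch also touches each series at zero on registration, in the spirit of the "pre-initialise rpc counters" follow-up commit.

```typescript
// Illustrative only: names below are assumptions, not the PR's identifiers.
type Labels = Record<string, string>;

class Counter {
  private values = new Map<string, number>();

  constructor(readonly name: string, readonly help: string) {}

  // Add `by` to the series identified by `labels`; inc(labels, 0) creates it at 0.
  inc(labels: Labels = {}, by = 1): void {
    const key = Counter.key(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  get(labels: Labels = {}): number {
    return this.values.get(Counter.key(labels)) ?? 0;
  }

  // Render this counter in the Prometheus text exposition format.
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    this.values.forEach((value, key) => {
      lines.push(key ? `${this.name}{${key}} ${value}` : `${this.name} ${value}`);
    });
    return lines.join('\n') + '\n';
  }

  // Build a stable label string, escaping backslash, quote, and newline.
  private static key(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map((k) => {
        const v = labels[k].replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
        return `${k}="${v}"`;
      })
      .join(',');
  }
}

const rpcRequests = new Counter('registry_rpc_requests_total', 'RPC requests received');
const rpcErrors = new Counter('registry_rpc_errors_total', 'RPC requests that failed');

// Wrap an async RPC handler so every call and every failure is counted.
function instrument<A, R>(
  method: string,
  handler: (arg: A) => Promise<R>,
): (arg: A) => Promise<R> {
  // Create both series at 0 so they are exported before the first call.
  rpcRequests.inc({ method }, 0);
  rpcErrors.inc({ method }, 0);
  return async (arg: A): Promise<R> => {
    rpcRequests.inc({ method });
    try {
      return await handler(arg);
    } catch (err) {
      rpcErrors.inc({ method });
      throw err; // counting must not change the handler's error behaviour
    }
  };
}
```

Wrapping at registration time keeps the handlers themselves free of metrics code, and exporting the zero-valued series up front means rate queries do not start from a missing series after a restart.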
Breaking changes
None. Metrics endpoint remains opt-in and the default bind host is 127.0.0.1 (no behaviour change for unconfigured deployments). Default port moved from 9090 to 9210; any operator who was already scraping 9090 must update their Prometheus scrape config or pass --metrics-port 9090 explicitly.
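
The defaults described above can be summarised in a small sketch of option resolution. This is an illustration under assumptions, not the server's actual argument parser: `parseMetricsOptions` and `MetricsOptions` are hypothetical names, and the `--metrics` switch used here to model "opt-in" is an assumption, since the description only names `--metrics-host` and `--metrics-port`.

```typescript
// Illustrative only: the PR names --metrics-host and --metrics-port;
// the `--metrics` opt-in switch below is an assumption.
interface MetricsOptions {
  enabled: boolean;
  host: string;
  port: number;
}

const DEFAULT_METRICS_HOST = '127.0.0.1'; // loopback-bound by default
const DEFAULT_METRICS_PORT = 9210; // moved off 9090, next to 9209

function parseMetricsOptions(argv: string[]): MetricsOptions {
  // Return the token following `flag`, if the flag is present and has a value.
  const valueOf = (flag: string): string | undefined => {
    const i = argv.indexOf(flag);
    return i >= 0 && i + 1 < argv.length ? argv[i + 1] : undefined;
  };

  const portText = valueOf('--metrics-port');
  const port = portText === undefined ? DEFAULT_METRICS_PORT : Number(portText);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid --metrics-port: ${portText}`);
  }

  return {
    enabled: argv.indexOf('--metrics') >= 0,
    host: valueOf('--metrics-host') ?? DEFAULT_METRICS_HOST,
    port,
  };
}
```

With no flags the result is a disabled endpoint on 127.0.0.1:9210; an operator who still scrapes the old port would pass `--metrics-port 9090`, as the description notes.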