feat[bc]: rename gauges, add seeder metrics, and eagerly open blob core on indexers #1692

Merged
yuranich merged 5 commits into
tetherto:feature-qvac-lib-registry-server-metrics-monitoring from
yuranich:feat/registry-server-metric-naming
Apr 22, 2026

@yuranich yuranich commented Apr 21, 2026

Contributor

What problem does this PR solve?

Four observability gaps that surfaced on the first staging scrapes after #1689 landed:

  1. _total suffix on gauges violates Prometheus / OpenMetrics naming conventions. qvac_registry_models_total and qvac_registry_blob_cores_total are gauges (values go up and down). Scrape linters flag them and OpenMetrics parsers can reject them.
  2. RPC counters export no series before the first RPC call, so rate() returns NaN on fresh dashboards — panels look broken on cold start.
  3. qvac_registry_blob_core_* metrics are empty on indexer nodes that don't use QVAC_BLIND_PEER_KEYS mirror replication — the blob core is populated into blobsCores lazily (only on addModel or _setupBlindPeering), so on an indexer that runs the topic-pull flow it stays Map(0) indefinitely.
  4. No visibility into replication durability. Operators have no way to see how many full replicas of the view core or any blob core exist in the swarm. hypercore_total_peers is too coarse (unions all unique UDX streams across every hypercore, including RPC-only clients).
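Gap 2 is easiest to see with a toy labelled counter. The sketch below is a dependency-free illustration, not the PR's QvacMetrics class or any real client library; the metric name `example_rpc_requests_total` and the `LabelledCounter` class are hypothetical. It shows that a counter created lazily exports no samples until a label value is first observed, while one seeded with the known methods at 0 exports a full set of series from the first scrape.

```javascript
// Toy labelled counter (hypothetical; not QvacMetrics and not a real client library).
class LabelledCounter {
  constructor (name, labelName, knownValues = []) {
    this.name = name
    this.labelName = labelName
    this.values = new Map()
    // Seeding every known label value at 0 makes the series exist before any call.
    for (const v of knownValues) this.values.set(v, 0)
  }

  inc (labelValue, by = 1) {
    if (by < 0) throw new Error('counters only go up')
    this.values.set(labelValue, (this.values.get(labelValue) || 0) + by)
  }

  // Render the TYPE line plus one sample line per observed (or seeded) label value.
  render () {
    const lines = [`# TYPE ${this.name} counter`]
    for (const [v, n] of this.values) {
      lines.push(`${this.name}{${this.labelName}="${v}"} ${n}`)
    }
    return lines.join('\n')
  }
}

// The five RPC methods listed in this PR description.
const METHODS = ['add-model', 'put-license', 'update-model-metadata', 'delete-model', 'ping']

const lazy = new LabelledCounter('example_rpc_requests_total', 'method')
const seeded = new LabelledCounter('example_rpc_requests_total', 'method', METHODS)

console.log(lazy.render()) // TYPE line only, no samples to compute a rate over
console.log(seeded.render()) // TYPE line plus five samples, all at 0
```

With the seeded variant, a scraper sees a 0-valued sample per method immediately, so a rate over those series has data to work with on a cold start.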

How does it solve it?

  • Renames qvac_registry_models_total → qvac_registry_model_count and qvac_registry_blob_cores_total → qvac_registry_blob_core_count. RPC counters keep _total because they are actual counters.
  • Pre-initialises both RPC counters with the five known methods (add-model, put-license, update-model-metadata, delete-model, ping) at 0 in the QvacMetrics constructor, so rate() returns 0 from the first scrape.
  • Eagerly opens the blob core at the end of _open() when base.isIndexer || base.localWriter, so blob_core_peers, blob_core_byte_length, blob_core_fully_downloaded, etc. populate on indexers that use the topic-pull blind-peer flow. Reader-only nodes skip this (their writable: true would create a local core with the wrong key).
  • Adds replication-durability gauges:
    • qvac_registry_view_core_seeders — peers that hold the view core fully and advertise remoteUploading. View is small (a few MB of autobase metadata), so this converges to connected-peer count within an RTT — no separate raw-peers metric needed.
    • qvac_registry_blob_core_seeders{core_name} — same signal per blob core. Paired with the existing qvac_registry_blob_core_peers, the peers - seeders gap exposes peers currently downloading vs. serving.
  • Updates DEPLOYMENT_GUIDE.md — metric table, replication-durability alerting guidance, note that _seeders metrics use p.remoteOpened && p.remoteUploading && p.remoteContiguousLength >= core.length so partial/mid-handshake peers aren't counted.
  • Updates the in-tree Grafana dashboard — two panels retargeted at the new metric names.
  • Extends the metrics integration test — asserts new names exist, legacy names are gone, RPC counter series are pre-initialised to 0, and view_core_seeders is exported as a single series that is 0 with no connected peers.
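The seeder predicate quoted above can be sketched in isolation. This is a minimal illustration, assuming plain-object stand-ins for a core and its peers that carry the fields named in the description (`remoteOpened`, `remoteUploading`, `remoteContiguousLength`, `length`); the helper names `isSeeder` and `replicationStats` are hypothetical and are not taken from the PR's code.

```javascript
// Predicate as quoted in the PR description. The peer/core shapes are
// plain-object stand-ins, not real Hypercore instances.
function isSeeder (peer, core) {
  return Boolean(
    peer.remoteOpened &&
    peer.remoteUploading &&
    peer.remoteContiguousLength >= core.length
  )
}

// Returns the connected-peer count, the seeder count, and the gap between them.
function replicationStats (core) {
  const peers = core.peers.length
  const seeders = core.peers.filter((p) => isSeeder(p, core)).length
  return { peers, seeders, nonSeeders: peers - seeders }
}

const core = {
  length: 100,
  peers: [
    { remoteOpened: true, remoteUploading: true, remoteContiguousLength: 100 }, // full replica
    { remoteOpened: true, remoteUploading: true, remoteContiguousLength: 40 }, // partial copy
    { remoteOpened: false, remoteUploading: false, remoteContiguousLength: 0 } // mid-handshake
  ]
}

console.log(replicationStats(core))
```

Only the first peer counts as a seeder here; the partial and mid-handshake peers fall into the peers minus seeders gap. A core with no connected peers yields 0 seeders, which is the case the integration test asserts for view_core_seeders.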

Verified with npm run lint, npm run test:unit (37/37), and npm run test:integration (30/30, 146/146 asserts).

Breaking changes

  • qvac_registry_models_total → qvac_registry_model_count
  • qvac_registry_blob_cores_total → qvac_registry_blob_core_count

Anyone scraping these series must update dashboards / alerts. The in-tree Grafana dashboard is updated in this PR; no other consumers are known.
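For out-of-tree dashboards or alert rules, the rename can be applied mechanically. The helper below is a hypothetical sketch, not part of this PR: it rewrites the two legacy names inside a query string and matches whole metric names only, so longer names that merely contain a legacy name are left alone. The `job="registry"` selector in the usage line is an assumed label for illustration.

```javascript
// Old name -> new name, as listed under "Breaking changes".
const RENAMES = new Map([
  ['qvac_registry_models_total', 'qvac_registry_model_count'],
  ['qvac_registry_blob_cores_total', 'qvac_registry_blob_core_count']
])

// Rewrites legacy metric names in a query expression (hypothetical helper).
function migrateExpr (expr) {
  let out = expr
  for (const [oldName, newName] of RENAMES) {
    // Whole-name match: not preceded or followed by a metric-name character.
    const re = new RegExp(`(?<![A-Za-z0-9_:])${oldName}(?![A-Za-z0-9_:])`, 'g')
    out = out.replace(re, newName)
  }
  return out
}

console.log(migrateExpr('sum(qvac_registry_models_total{job="registry"})'))
```

Running this over exported dashboard JSON or rule files should be followed by a manual review, since the sketch knows nothing about query syntax beyond name boundaries.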

@yuranich yuranich requested review from a team as code owners April 21, 2026 14:33
@yuranich yuranich changed the title feat[bc]: rename gauge metrics off _total suffix and pre-initialise rpc counters feat[bc]: rename gauges, add seeder metrics, and eagerly open blob core on indexers Apr 22, 2026
@yuranich yuranich merged commit ad15947 into tetherto:feature-qvac-lib-registry-server-metrics-monitoring Apr 22, 2026
6 checks passed
yuranich added a commit that referenced this pull request Apr 24, 2026
…stry server (#1724)

* QVAC-17131 feat: add Prometheus metrics monitoring to registry server (#1600)

* feat: add Prometheus metrics monitoring to registry server

* fix: restrict registry ping RPC to role and timestamp to avoid exposing operational data

* fix: make metrics bind host configurable and move off port 9090

* feat: replace per-model size gauge with view-derived total blob bytes (#1689)

* feat[bc]: rename gauges, add seeder metrics, and eagerly open blob core on indexers (#1692)

* feat[bc]: rename gauge metrics off _total suffix and pre-initialise rpc counters

* feat: add core seeder metrics and eagerly open blob core on indexers

* style: drop eslint-disable directives via helper function for gauge registration

* refactor[bc]: drop core_name label from blob core metrics and use median for view-derived stat panels

* style: drop noisy comment above registerGauge helper

* feat[bc]: replace blob_core_fully_downloaded with length/contiguous_length pair and drop blind-peer metrics (#1702)

* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels (#1716)

* feat: expand Grafana dashboard with blob-core replication, seeders, and Holepunch P2P panels

* fix: use vm_name label in QVAC and Holepunch panel legends instead of raw instance IP:port

* fix: apply $vm template filter to QVAC and Holepunch selectors for consistent per-node filtering


* chore[docs]: tighten registry Grafana dashboard panels based on staging review (#1718)

* chore[docs]: tighten registry Grafana dashboard panels based on staging review

* chore[docs]: drop redundant Blob Core Contiguous stat, cluster blob panels near the top

* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section (#1719)

* chore[docs]: promote View Core Replication and Blob Core Bytes to the top of the metrics section

* chore[docs]: split View Core Replication into length, contiguous, and gap panels

* chore: remove dead blind-peer helpers and fix stale metrics docs

- Drop unreferenced getConnectedBlindPeerKeys / getConfiguredBlindPeerKeys /
  isBlindPeerConnected chain and the _peerConnectionCounts map that only
  existed to back isBlindPeerConnected. Left over from the dropped
  blob_core_blind_peers gauge (1de851b).
- Fix DEPLOYMENT_GUIDE.md: default metrics port is 9210, not 9090; drop
  the hypermetrics reference since it is not a dependency (abandoned,
  incompatible with Hypercore v11) and per-core visibility is provided
  by the registry_blob_core_* / registry_view_core_* gauges.
Proletter pushed a commit that referenced this pull request May 24, 2026
…stry server (#1724)
