OTel receiver for CockroachDB performance monitoring #155197

@npcomplete777

Is your feature request related to a problem? Please describe.

Currently, there is no official OpenTelemetry receiver for CockroachDB that provides query-level performance observability. While CockroachDB exposes rich metrics through crdb_internal.* tables, users must build custom solutions to collect and export these metrics to modern observability platforms. This creates friction for teams wanting to monitor query performance, index usage, contention, and transaction statistics using OpenTelemetry-based tooling.

The problem becomes particularly acute when trying to:

  • Track actual SQL query text alongside performance metrics (not just fingerprint IDs)
  • Monitor statement-level latencies (parse, plan, run) for query optimization
  • Identify contentious indexes and tables affecting application performance
  • Integrate CockroachDB metrics into existing OpenTelemetry pipelines

Describe the solution you'd like

I've developed a production-ready OpenTelemetry receiver for CockroachDB that addresses these gaps:
https://github.com/npcomplete777/cockroachdbreceiver

Key capabilities:

  • Collects 40+ metrics from crdb_internal.* tables covering statements, transactions, indexes, contention, sessions, and cluster health
  • Preserves actual SQL query text in metric attributes for human-readable observability
  • Implements cardinality control via configurable query limits (default: top 20 queries by execution count)
  • Distinguishes between production-safe and expensive metrics with clear documentation
  • Supports both CockroachDB Serverless and self-hosted deployments
  • Includes comprehensive configuration examples for production and non-production environments
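
To illustrate the cardinality control mentioned above, here is a minimal sketch of a top-N selection by execution count. It is not taken from the receiver's source; the `StatementStat` type and the `TopNByExecCount` function are hypothetical names used only for illustration, and the real receiver may apply the limit in SQL rather than in Go.

```go
package main

import "sort"

// StatementStat is a hypothetical, simplified view of one aggregated
// statement-statistics row (query text plus execution count).
type StatementStat struct {
	Query     string
	ExecCount int64
}

// TopNByExecCount returns at most n statements with the highest execution
// counts, in descending order. The input slice is left unmodified.
func TopNByExecCount(stats []StatementStat, n int) []StatementStat {
	if n <= 0 {
		return nil
	}
	// Sort a copy so the caller's slice keeps its original order.
	sorted := make([]StatementStat, len(stats))
	copy(sorted, stats)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].ExecCount > sorted[j].ExecCount
	})
	if len(sorted) > n {
		sorted = sorted[:n]
	}
	return sorted
}
```

Calling `TopNByExecCount(stats, 20)` would mirror the default limit described above: only the 20 most frequently executed statements would produce metric series, which bounds the number of distinct `query` attribute values.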

The request:
Would Cockroach Labs consider one of the following:

  1. Taking ownership of this receiver as an official community contribution
  2. Providing a code review and guidance on best practices for querying crdb_internal.* tables
  3. Endorsing this as a community solution if it meets CockroachDB's standards for observability tooling

This would help the CockroachDB community adopt OpenTelemetry more easily and provide a reference implementation for database observability patterns.

Describe alternatives you've considered

Alternative 1: Prometheus Exporter

  • Exists but doesn't integrate with OpenTelemetry pipelines
  • Requires separate infrastructure and configuration
  • Doesn't provide the same level of query-level granularity

Alternative 2: Manual Queries

  • Users write custom scripts to query crdb_internal.* tables
  • Lacks standardization across organizations
  • Requires maintaining custom code and handling schema changes

Alternative 3: Datadog/New Relic Native Integrations

  • Vendor-locked solutions
  • Don't work for teams using vendor-agnostic OpenTelemetry
  • Limited customization of collected metrics

Alternative 4: Using Built-in Metrics Endpoint

  • CockroachDB's /_status/vars endpoint provides Prometheus-format metrics
  • But doesn't include query-level observability (actual SQL text, per-query latencies)
  • Misses contention, index usage, and transaction-level insights

Additional context

Code maturity:

  • Apache 2.0 licensed
  • Includes unit tests, validation, and error handling
  • Production configuration examples with security best practices
  • Clear documentation distinguishing safe vs. expensive metrics

Technical implementation:

  • Queries aggregated statistics tables (not raw data) to minimize overhead
  • Implements connection pooling and configurable timeouts
  • Uses OpenTelemetry Collector framework v0.136.0
  • Compatible with CockroachDB v22.1+
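
As a rough sketch of what connection pooling and configurable timeouts can look like with Go's standard `database/sql` package: the `PoolConfig` type, its field names, and its validation rules below are assumptions made for illustration, not the receiver's actual configuration schema.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// PoolConfig is a hypothetical set of pool and timeout settings.
type PoolConfig struct {
	MaxOpenConns    int           // upper bound on open connections
	MaxIdleConns    int           // idle connections kept for reuse
	ConnMaxLifetime time.Duration // 0 means connections are not aged out
	QueryTimeout    time.Duration // per-query deadline
}

// Validate rejects settings that would make the pool unusable.
func (c PoolConfig) Validate() error {
	if c.MaxOpenConns <= 0 {
		return errors.New("MaxOpenConns must be positive")
	}
	if c.MaxIdleConns < 0 || c.MaxIdleConns > c.MaxOpenConns {
		return errors.New("MaxIdleConns must be between 0 and MaxOpenConns")
	}
	if c.QueryTimeout <= 0 {
		return errors.New("QueryTimeout must be positive")
	}
	return nil
}

// Apply configures the pool limits on an already opened *sql.DB.
func (c PoolConfig) Apply(db *sql.DB) {
	db.SetMaxOpenConns(c.MaxOpenConns)
	db.SetMaxIdleConns(c.MaxIdleConns)
	db.SetConnMaxLifetime(c.ConnMaxLifetime)
}

// WithQueryTimeout derives a context that cancels a query once the
// configured timeout elapses. The caller must invoke the cancel function.
func (c PoolConfig) WithQueryTimeout(parent context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(parent, c.QueryTimeout)
}
```

A scrape loop would typically call `Apply` once after `sql.Open`, then wrap each `db.QueryContext` call in a context from `WithQueryTimeout` so that a slow query against crdb_internal.* tables cannot block the collection cycle indefinitely.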

Community value:
This receiver would benefit teams running CockroachDB who want to:

  • Monitor database performance in Grafana/Prometheus/Dynatrace/Datadog using OpenTelemetry
  • Correlate database metrics with application traces
  • Implement query-level SLOs based on actual statement performance
  • Troubleshoot contention and lock issues proactively

Questions for the Cockroach Labs team:

  1. Are there any concerns about query patterns used against crdb_internal.* tables?
  2. Would you recommend any changes to make this more robust or performant?
  3. Are there upcoming schema changes to crdb_internal.* that should be accounted for?
  4. Would Cockroach Labs be interested in maintaining this as an official receiver or community project?

Example metric output:

cockroachdb.statement.execution.count{query="SELECT * FROM users WHERE id = $1", app_name="api-server", database="production"} 15420
cockroachdb.statement.latency.service.mean{query="SELECT * FROM users WHERE id = $1"} 0.0023
cockroachdb.index.contention.events{database="production", table="orders", index="idx_user_id"} 47
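
For readers who want to reproduce this notation when debugging, the following is a small sketch of a formatter that renders a metric name, its attributes, and a value in the same shape as the lines above. `FormatDataPoint` is a hypothetical helper, not part of the receiver or of the OpenTelemetry Collector API; it sorts attribute keys alphabetically so the output is deterministic, which means the attribute order can differ from the hand-written example.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// FormatDataPoint renders name{key="value", ...} value with attribute
// keys in alphabetical order. Values are quoted with Go's %q verb.
func FormatDataPoint(name string, attrs map[string]string, value float64) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, attrs[k]))
	}
	// 'f' with precision -1 prints the shortest exact decimal form.
	return fmt.Sprintf("%s{%s} %s", name, strings.Join(pairs, ", "),
		strconv.FormatFloat(value, 'f', -1, 64))
}
```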

I'm happy to collaborate on this and contribute it to the CockroachDB ecosystem if there is interest.

Jira issue: CRDB-55312

Labels

C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-community: Originated from the community
T-observability
X-blathers-triaged: blathers was able to find an owner
