feat: api sends metrics and traces to local otel #2961
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
1 Skipped Deployment
Caution: Review failed. The pull request is closed.

📝 Walkthrough

This pull request introduces several new deployment configurations and service definitions aimed at enhancing observability and tracing. New services such as Tempo, OpenTelemetry Collector, Prometheus, and Loki are defined in the docker-compose file along with updates to environment variables and volumes. Additional configuration files for Grafana datasources, Loki, Prometheus, Tempo, and the OpenTelemetry Collector are provided. The PR also refines the initialization and metrics registration in various Go modules by updating method signatures, error handling, and introducing new metrics interfaces and observable types, while removing deprecated configurations.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Telemetry Client
    participant Collector as OTLP Collector
    participant Batch as Batch Processor
    participant Tempo as Tempo Exporter
    participant Prom as Prometheus Exporter
    participant Debug as Debug Exporter
    Client->>Collector: Send Telemetry Data (traces/metrics/logs)
    Collector->>Batch: Receive via OTLP (HTTP/gRPC)
    Batch->>Collector: Process data in batches
    alt Traces Pipeline
        Collector->>Tempo: Export traces
        Collector->>Debug: Export traces for debugging
    else Metrics Pipeline
        Collector->>Prom: Export metrics
        Collector->>Debug: Export metrics for debugging
    else Logs Pipeline
        Collector->>Debug: Export logs for debugging
    end
```

```mermaid
sequenceDiagram
    participant API as API Runner
    participant Otel as otel.InitGrafana
    participant Resource as Resource Setup
    participant Exporter as Exporter Setup
    participant Shutdown as Shutdown Manager
    API->>Otel: Initialize Grafana with config (including NodeID, CloudRegion)
    Otel->>Resource: Create and configure resource attributes
    Otel->>Exporter: Set up trace and metric exporters (with compression, insecure options)
    Exporter->>Shutdown: Register shutdown callbacks
    Otel-->>API: Return error status (if any)
```
Possibly related PRs
Suggested reviewers
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (1.62.2)
Error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.0)

Tip: ⚡🧪 Multi-step agentic review comment chat (experimental)
📜 Recent review details

Configuration used: CodeRabbit UI

📒 Files selected for processing (1)
✨ Finishing Touches
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
Thank you for following the naming conventions for pull request titles! 🙏
Actionable comments posted: 9
🧹 Nitpick comments (11)
deployment/prometheus/config.yaml (1)
1-20: Configure scrape settings based on application needs.

The configuration looks good for setting up Prometheus to scrape metrics from the different services. However, consider if the default 15s scrape interval is appropriate for your use case - shorter intervals provide more granular data but increase load.
Also consider adding job-specific scrape intervals if different components have different requirements:
```diff
  - job_name: "prometheus"
+   scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "tempo"
+   scrape_interval: 30s # Adjust based on your needs
    static_configs:
      - targets: ["tempo:3200"]
```

go/pkg/membership/serf.go (1)
101-113: Well-implemented hostname resolution function

The `parseHost` function correctly resolves hostnames to IP addresses and includes proper error handling. Nice work!

One minor consideration: the function always selects the first IP address from the resolved list. In environments where a host resolves to multiple IP addresses, it might not always be the correct one to use (though this is typically fine for Docker Compose environments).
You might consider adding a comment explaining why you're selecting the first address or enhancing the function to be more selective about which IP address to use if that becomes necessary in the future.
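For illustration only, here is a self-contained sketch of this resolve-first-address pattern using just the standard library. The function name `resolveFirstIP` and the IP-literal shortcut are assumptions for the example, not the actual `parseHost` implementation:

```go
package main

import (
	"fmt"
	"net"
)

// resolveFirstIP mimics the reviewed behavior: resolve a hostname and return
// the first address from the list. Hypothetical standalone version.
func resolveFirstIP(host string) (string, error) {
	// If it's already an IP literal, return it unchanged.
	if ip := net.ParseIP(host); ip != nil {
		return host, nil
	}
	addrs, err := net.LookupHost(host)
	if err != nil {
		return "", fmt.Errorf("resolving %q: %w", host, err)
	}
	if len(addrs) == 0 {
		return "", fmt.Errorf("no addresses found for %q", host)
	}
	// Always picks the first address; multi-homed hosts may need smarter
	// selection, as the review comment points out.
	return addrs[0], nil
}

func main() {
	ip, err := resolveFirstIP("localhost")
	if err != nil {
		panic(err)
	}
	fmt.Println(ip)
}
```

In a Docker Compose network this is usually sufficient, since service names resolve to a single container address.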
go/pkg/otel/metrics/observable.go (1)
1-17: Clean implementation of the Int64ObservableGauge interface

The implementation is well-structured and follows OpenTelemetry's patterns for observable metrics. The type assertion `var _ Int64ObservableGauge = (*int64ObservableGauge)(nil)` is a good practice to ensure compile-time verification of interface implementation.

Since the `int64ObservableGauge` struct is unexported but implements an exported interface, consider adding a factory function to create instances of this type, like:

```go
// NewInt64ObservableGauge creates a new Int64ObservableGauge with the given meter, name, and options
func NewInt64ObservableGauge(meter metric.Meter, name string, opts ...metric.Int64ObservableGaugeOption) Int64ObservableGauge {
	return &int64ObservableGauge{
		m:    meter,
		name: name,
		opts: opts,
	}
}
```

go/pkg/otel/grafana.go (2)
93-97: Consider making TLS verification configurable.

The code uses `WithInsecure()`, which disables TLS verification. While this is commented as being for local development, it could pose a security risk in production environments.

```diff
- otlptracehttp.WithInsecure(), // For local development
+ otlptracehttp.WithInsecure(config.GrafanaEndpoint == "localhost" || strings.HasPrefix(config.GrafanaEndpoint, "127.0.0.1")), // Only disable TLS for local endpoints
```

Alternatively, add a boolean flag to the Config struct to explicitly control this behavior.
120-124: Consider making TLS verification configurable.

Similar to the trace exporter, the metric exporter uses `WithInsecure()`, which should be made configurable based on the environment.

```diff
- otlpmetrichttp.WithInsecure(), // For local development
+ otlpmetrichttp.WithInsecure(config.GrafanaEndpoint == "localhost" || strings.HasPrefix(config.GrafanaEndpoint, "127.0.0.1")), // Only disable TLS for local endpoints
```

deployment/docker-compose.yaml (6)
173-186: Pin the OpenTelemetry Collector version.

Using `otel/opentelemetry-collector-contrib:latest` may introduce unpredictability if the image changes upstream. Pin the image to a specific version to ensure consistent builds.

```diff
- image: otel/opentelemetry-collector-contrib:latest
+ image: otel/opentelemetry-collector-contrib:<pinned-version>
```
187-199: Consider pinned Prometheus version and retention settings.

"latest" can cause unexpected upgrades. Also, you might want to specify a retention period (e.g., `--storage.tsdb.retention.time=7d`) to manage disk usage.
200-212: Secure your Grafana admin credentials and pin its version.

Storing admin credentials in plain text is fine for local testing, but for production, consider environment variable management or secrets. Also, pinning Grafana to a known version prevents unexpected upgrades.
213-223: Pin the Tempo version.

Using `grafana/tempo:latest` may introduce instability when new versions roll out. Pin a specific version to maintain consistent behavior.
224-233: Pin the Loki version.

Similarly, consider pinning `grafana/loki:latest` to a stable version to avoid unintended breakage.
237-238: Review volumes for data retention strategy.

The `tempo` and `loki` volumes will grow over time. Confirm whether you need a retention policy or an automatic cleanup approach in your local environment for these new volumes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- deployment/docker-compose.yaml (3 hunks)
- deployment/grafana/provisioning/datasources/datasources.yaml (1 hunks)
- deployment/loki/config.yaml (1 hunks)
- deployment/otel-collector-config.yaml (1 hunks)
- deployment/prometheus/config.yaml (1 hunks)
- deployment/prometheus/prometheus.yaml (0 hunks)
- deployment/tempo/config.yaml (1 hunks)
- go/apps/api/run.go (1 hunks)
- go/pkg/cache/cache.go (2 hunks)
- go/pkg/cache/cache_test.go (6 hunks)
- go/pkg/cache/simulation_test.go (1 hunks)
- go/pkg/clickhouse/client.go (1 hunks)
- go/pkg/cluster/cluster.go (3 hunks)
- go/pkg/membership/serf.go (3 hunks)
- go/pkg/otel/grafana.go (3 hunks)
- go/pkg/otel/metrics/interface.go (1 hunks)
- go/pkg/otel/metrics/metrics.go (9 hunks)
- go/pkg/otel/metrics/observable.go (1 hunks)
- go/pkg/zen/middleware_metrics.go (3 hunks)
💤 Files with no reviewable changes (1)
- deployment/prometheus/prometheus.yaml
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Test Go API Local / Test
- GitHub Check: Test API / API Test Local
- GitHub Check: Test Packages / Test ./internal/clickhouse
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Test Agent Local / test_agent_local
- GitHub Check: Build / Build
- GitHub Check: autofix
🔇 Additional comments (31)
deployment/grafana/provisioning/datasources/datasources.yaml (1)
1-20: LGTM! Good observability stack setup.

The Grafana datasources configuration correctly sets up Prometheus, Tempo, and Loki as data sources, which is essential for the metrics and traces implementation. Prometheus is appropriately set as the default datasource.
go/pkg/cache/simulation_test.go (1)
108-116: LGTM! Improved error handling.

Good improvement to handle errors from the cache initialization rather than letting potential errors propagate silently. This makes the test more robust.
deployment/tempo/config.yaml (1)
1-24: Nice configuration for local Tempo setup!

The configuration is well structured with appropriate sections for server, distributor, storage, ingester, and compactor. The OTLP receiver endpoints are properly configured for both HTTP and gRPC protocols on standard ports.
Note that using `/tmp/tempo` for storage means trace data will be lost on system restarts. While this is acceptable for a local development environment, a more persistent storage location would be needed for any production-like environments.

go/pkg/otel/metrics/interface.go (1)
1-15: Well-designed interfaces for OpenTelemetry metrics

These interfaces provide a clean abstraction over the OpenTelemetry metrics API. The `Int64Counter` and `Int64ObservableGauge` interfaces are minimal and focused on their specific responsibilities, which follows good interface design principles.

go/pkg/membership/serf.go (1)
49-53: Good improvement for hostname resolution

Adding hostname resolution is a good enhancement for Docker Compose environments where containers often have dynamic hostnames that need to be resolved to IP addresses.
go/pkg/cluster/cluster.go (3)
10-13: LGTM! Appropriate imports for the metrics system.

The added imports provide all the necessary dependencies for the metrics implementation, including the custom metrics package and OpenTelemetry's attribute and metric packages.
55-58: Good error handling for metrics registration.

Registering metrics during cluster initialization with proper error propagation ensures that metrics are correctly set up before the cluster begins operating.
92-109: Well-implemented metrics registration.

The `registerMetrics` method:
- Properly registers a callback with the metrics system
- Reports the cluster size by counting members
- Includes the node ID as an attribute for better observability
- Handles errors appropriately both in registration and during execution
This implementation aligns well with the PR objective of sending metrics to OpenTelemetry.
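To make the callback-registration idea concrete, here is a stdlib-only toy analogue. The `gaugeRegistry` type is hypothetical; the real code registers callbacks through the OpenTelemetry metric API, which invokes them at collection time rather than on a ticker:

```go
package main

import (
	"fmt"
	"sync"
)

// gaugeRegistry is a toy stand-in for the SDK's callback registration:
// observers run only when a collection happens, not on a polling loop.
type gaugeRegistry struct {
	mu        sync.Mutex
	callbacks []func() (name string, value int64)
}

func (r *gaugeRegistry) Register(cb func() (string, int64)) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.callbacks = append(r.callbacks, cb)
}

// Collect runs every registered callback, as the SDK would on scrape.
func (r *gaugeRegistry) Collect() map[string]int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make(map[string]int64)
	for _, cb := range r.callbacks {
		name, v := cb()
		out[name] = v
	}
	return out
}

func main() {
	members := []string{"node-a", "node-b", "node-c"}
	reg := &gaugeRegistry{}
	// Like the reviewed registerMetrics: report cluster size lazily,
	// at collection time, with the current membership list.
	reg.Register(func() (string, int64) {
		return "cluster_size", int64(len(members))
	})
	fmt.Println(reg.Collect()["cluster_size"]) // prints 3
}
```

The key property is that the value is read when the backend scrapes, so the gauge always reflects current membership without a background goroutine.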
deployment/loki/config.yaml (1)
1-62: LGTM! Properly configured Loki for local development.

The configuration is suitable for a local development environment with:
- Authentication disabled for simplicity
- Server listening on standard ports (HTTP 3100, gRPC 9096)
- Debug logging enabled
- Local filesystem storage
- Single-node configuration (replication factor 1)
- In-memory key-value store for the ring
- Embedded cache for query results
- TSDB storage with a v13 schema
The anonymous usage reporting is properly documented with clear instructions on how to disable it if needed.
deployment/otel-collector-config.yaml (1)
1-44: LGTM! Well-configured OpenTelemetry Collector.

The configuration establishes a complete observability pipeline:
- Receivers: Properly configured to accept both HTTP and gRPC OTLP on standard ports
- Processors: Batch processing configured with appropriate limits
- Exporters: Well-defined exporters for:
- Traces to Tempo
- Metrics to Prometheus
- Debug output for troubleshooting
- Pipelines: Properly structured for traces, metrics, and logs
The insecure TLS settings are acceptable for a local development environment.
go/pkg/zen/middleware_metrics.go (2)
10-13: LGTM! Appropriate imports for metrics integration.

The added imports provide all the necessary dependencies for implementing HTTP request metrics with OpenTelemetry.
65-70: LGTM! Well-implemented HTTP metrics collection.

The implementation:
- Counts HTTP requests with the OpenTelemetry counter
- Attaches useful attributes:
- Host
- HTTP method
- Path
- Status code
These metrics will provide valuable insights into API performance and usage patterns.
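Conceptually, the middleware behaves like a counter keyed by request attributes. A stdlib-only sketch follows; the `requestKey`/`requestCounter` names are illustrative, since the real code uses an OpenTelemetry `Int64Counter` with attribute pairs:

```go
package main

import (
	"fmt"
	"sync"
)

// requestKey mirrors the attributes attached in the reviewed middleware.
type requestKey struct {
	Host   string
	Method string
	Path   string
	Status int
}

// requestCounter is a toy, in-memory analogue of an OTel Int64Counter
// partitioned by attribute set.
type requestCounter struct {
	mu     sync.Mutex
	counts map[requestKey]int64
}

func newRequestCounter() *requestCounter {
	return &requestCounter{counts: make(map[requestKey]int64)}
}

// Add increments the count for one attribute combination, the moral
// equivalent of counter.Add(ctx, 1, withAttributes...).
func (c *requestCounter) Add(k requestKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[k]++
}

func main() {
	c := newRequestCounter()
	k := requestKey{Host: "api.local", Method: "GET", Path: "/v1/keys", Status: 200}
	c.Add(k)
	c.Add(k)
	fmt.Println(c.counts[k]) // prints 2
}
```

Each distinct (host, method, path, status) combination becomes its own time series in the backend, which is why keeping the path cardinality bounded matters.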
go/apps/api/run.go (1)
65-73: Good implementation of new OTel configuration fields.

The updated call to `InitGrafana` correctly implements the new function signature by:

- Adding the new `NodeID` and `CloudRegion` fields to enhance observability
- Passing the `shutdowns` instance directly, which aligns with the new approach to resource cleanup

This change improves service identification in metrics and traces, making it easier to troubleshoot issues across multiple service instances.
go/pkg/cache/cache_test.go (4)
18-26: Proper error handling for cache initialization.

The test has been correctly updated to capture and verify errors from the `cache.New` function.

36-45: Proper error handling for cache initialization.

The test has been correctly updated to capture and verify errors from the `cache.New` function.

59-68: Proper error handling for cache initialization.

The test has been correctly updated to capture and verify errors from the `cache.New` function.

83-91: Proper error handling for cache initialization.

The test has been correctly updated to capture and verify errors from the `cache.New` function.

go/pkg/cache/cache.go (2)
58-95: Improved error handling in cache initialization.

The `New` function now properly returns errors instead of panicking, which is a significant improvement in error handling. This change allows calling code to gracefully handle initialization failures, making the application more robust.
98-144: Enhanced metrics registration using callbacks.

The refactored `registerMetrics` method (previously `collectMetrics`) improves the metrics collection approach by:
- Using callback registration instead of a ticker-based approach, which is more efficient
- Providing proper error handling for each metric registration
- Following OpenTelemetry best practices for observability
This implementation will better integrate with the new OpenTelemetry Collector service introduced in this PR.
go/pkg/otel/grafana.go (3)
25-31: Good addition of service identification fields.

The new `NodeID` and `CloudRegion` fields enhance observability by providing better context for metrics and traces. This makes it easier to:
- Distinguish between multiple instances of the same service
- Identify regional performance patterns or issues
- Correlate metrics across distributed services
78-90: Improved resource creation with enriched attributes.

The creation of a resource with common attributes including service name, version, instance ID, and region follows OpenTelemetry best practices. This provides richer context for all telemetry data.
104-146: Well-structured shutdown management.

The code correctly registers shutdown handlers for all created resources using the provided `shutdowns` instance. This is a cleaner approach than returning a slice of functions and ensures proper cleanup of resources during application termination.

deployment/docker-compose.yaml (2)
55-55: Confirm dependency on Tempo service.

Adding `tempo` as a dependency to `apiv2` can block `apiv2` from starting if Tempo is unavailable. Consider whether a hard dependency is needed or if a soft/fallback approach is possible.

Please verify whether `apiv2` truly requires Tempo to be running at startup or if you can safely allow it to continue without blocking.
65-65: Clarify endpoint vs. comment.

The comment says "Point directly to Tempo," but the value references `otel-collector:4318`. Verify whether the comment is correct or if the endpoint should indeed point to the collector.
24-24: No substantive changes detected.
39-39: Transition to custom Int64Counter looks fine.

The `Requests` metric remains a counter, consistent with measuring event frequency.
133-133: Appropriate use of gauge for "Size".

Cache size is a point-in-time measurement that can fluctuate, making a gauge a suitable choice.

135-149: Appropriate introduction of a gauge for "Capacity".

"Capacity" is also an instantaneous measurement, so using a gauge is appropriate for monitoring current available space.

151-167: New "Cluster" struct and gauge for "Size".

Using a gauge to observe cluster node count is sensible for monitoring dynamic scaling.

240-246: Gauge usage for "cache_size".

A dynamic measure of current cache occupancy is an appropriate gauge.

248-254: Gauge usage for "cluster_size".

Measuring the number of nodes is a point-in-time value, so a gauge is suitable.
```go
// if opts.TLS == nil {
//
// 	opts.TLS = new(tls.Config)
// }
```
Security concern: TLS configuration has been commented out.
The code that initializes TLS configuration for the ClickHouse connection has been commented out. While this might be intentional for a local development environment where TLS isn't required, it could pose security risks in production deployments.
Consider making this configuration conditional based on the environment rather than commenting it out:
```diff
- // if opts.TLS == nil {
- //
- //	opts.TLS = new(tls.Config)
- // }
+ if env != "local" && opts.TLS == nil {
+ 	opts.TLS = new(tls.Config)
+ }
```

Don't forget to add the `crypto/tls` import back if you restore this code.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if env != "local" && opts.TLS == nil {
	opts.TLS = new(tls.Config)
}
```
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- go/pkg/otel/metrics/doc.go (1 hunks)
- go/pkg/otel/metrics/interface.go (1 hunks)
- go/pkg/otel/metrics/metrics.go (9 hunks)
- go/pkg/otel/metrics/observable.go (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- go/pkg/otel/metrics/doc.go
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Test Packages / Test ./internal/clickhouse
- GitHub Check: Build / Build
- GitHub Check: Test API / API Test Local
- GitHub Check: Test Go API Local / Test
- GitHub Check: Test Agent Local / test_agent_local
- GitHub Check: autofix
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (18)
go/pkg/otel/metrics/interface.go (2)
9-21: Interfaces for counters look solid.

The documentation is thorough and properly clarifies concurrency safety for the `Add` method.
23-38: Observable interface design is clear.

The interface reflects typical usage in OpenTelemetry for registering callbacks. The concurrency note is helpful for guiding initialization usage.
go/pkg/otel/metrics/observable.go (2)
5-34: Gauge observable implementation looks correct.

The design matches the `Int64Observable` interface. The method doc clarifies concurrency constraints and intended usage, which is good practice.
36-66: Counter observable implementation is well structured.

Likewise, the counter version adequately ensures the callback is registered. The monotonic behavior note aligns with OpenTelemetry guidelines.
go/pkg/otel/metrics/metrics.go (14)
24-24: No change to review.
39-39: No significant comment.
51-57: No further concerns.
66-72: No further concerns.
81-87: No further concerns.
98-104: No further concerns.
127-133: No further concerns.
151-167: No issues identified.
200-205: Correct approach to use a counter for cache hits.
208-214: Correct approach to use a counter for cache misses.
216-222: Good to track cache writes with a counter.
224-230: Using a counter for evictions is appropriate.
240-245: Gauge usage for cache size is suitably chosen.
248-254: Gauge for cluster size is a good fit.
Summary by CodeRabbit
New Features
Refactor