
Conversation

@piyush-zlai
Contributor

@piyush-zlai piyush-zlai commented May 2, 2025

Summary

PR to swap our metrics reporter from StatsD to OpenTelemetry metrics. We need OTel to let us capture metrics at Etsy without the Prometheus statsd-exporter sidecar that they've occasionally seen issues with. OTel in general is a popular metrics ingestion interface with a number of supported backends (e.g. Prometheus / Datadog / Google Cloud / AWS CloudWatch). Wiring up OTel also enables us to set up traces and spans in the repo in the future.
Broad changes:

  • Decouple the bulk of the metrics reporting logic from Metrics.Context. The metrics reporter we use is pluggable: currently this is just OpenTelemetry, but in principle we can support others in the future (see the sketch after this list).
  • The Online module creates the appropriate OTel SDK - either the HTTP provider or the Prometheus HTTP server. We need the HTTP provider to plug into Vert.x, as their Micrometer integration works with that. The Prometheus HTTP server is what Etsy is keen for us to use.
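
A minimal sketch of what such a pluggable reporter interface could look like; the method names here are illustrative and may not match the PR's actual MetricsReporter:

```scala
// Illustrative sketch only - not the PR's exact interface.
trait MetricsReporter {
  def count(metric: String, value: Long, tags: Map[String, String] = Map.empty): Unit
  def longGauge(metric: String, value: Long, tags: Map[String, String] = Map.empty): Unit
  def distribution(metric: String, value: Long, tags: Map[String, String] = Map.empty): Unit
}

// A no-op implementation is handy for tests or when metrics export is disabled.
object NoopMetricsReporter extends MetricsReporter {
  def count(metric: String, value: Long, tags: Map[String, String]): Unit = ()
  def longGauge(metric: String, value: Long, tags: Map[String, String]): Unit = ()
  def distribution(metric: String, value: Long, tags: Map[String, String]): Unit = ()
}
```

With the OTel implementation wired in behind an interface like this, Metrics.Context stays the entry point for callers while the concrete backend remains swappable.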

Checklist

  • [ ] Added Unit Tests
  • [x] Covered by existing CI
  • [x] Integration tested
  • [ ] Documentation update

Tested via a Docker container and a local instance of the OpenTelemetry collector.
Start up the fetcher Docker service:

docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json  -p 9000:9000  -e "GCP_PROJECT_ID=canary-443022"  -e "GOOGLE_CLOUD_PROJECT=canary-443022"  -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance"  -e "EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318"  -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json  zipline-fetcher:latest

And then otel:

./otelcol --config otel-collector-config.yaml
...

We see:

2025-04-18T17:35:37.351-0400	info	ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> service.name: Str(ai.chronon)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.49.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope ai.chronon 3.7.0-M11
Metric #0
Descriptor:
     -> Name: kv_store.bigtable.cache.insert
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Value: 1
Metric #1
Descriptor:
     -> Name: kv_store.bigtable.multiGet.latency
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Count: 1
Sum: 229.000000
Min: 229.000000
Max: 229.000000
ExplicitBounds #0: 0.000000
...
Buckets #0, Count: 0
...

Summary by CodeRabbit

  • New Features

    • Introduced OpenTelemetry-based metrics reporting throughout the platform, replacing the previous StatsD approach.
    • Added a Dockerfile and startup script for a new Fetcher service, supporting both AWS and GCP integrations with configurable metrics export.
    • Enhanced thread pool monitoring with a new executor that provides detailed metrics on task execution and queue status.
  • Improvements

    • Metrics tags are now structured as key-value maps, improving clarity and flexibility.
    • Metrics reporting is now context-aware, supporting per-dataset and per-table metrics.
    • Increased thread pool queue capacity for better throughput under load.
    • Replaced StatsD metrics configuration with OpenTelemetry OTLP in service launcher and build configurations.
  • Bug Fixes

    • Improved error handling and logging in metrics reporting and thread pool management.
  • Chores

    • Updated dependencies to include OpenTelemetry, Micrometer OTLP registry, Prometheus, OkHttp, and Kotlin libraries.
    • Refactored build and test configurations to support new telemetry libraries and remove deprecated dependencies.

@coderabbitai
Contributor

coderabbitai bot commented May 2, 2025

Walkthrough

This update introduces a comprehensive migration from StatsD-based metrics to OpenTelemetry (OTel) metrics reporting across the codebase. It adds a pluggable MetricsReporter interface, a new OtelMetricsReporter implementation, and updates all metrics-related code to use structured tags and OTel APIs. The build system and dependencies are updated to include OTel and Micrometer OTLP registry libraries. Metrics context handling is improved with per-table caching and structured tagging. Docker and service launcher scripts are added or modified to support the new metrics configuration. The thread pool executor is now instrumented for metrics collection.

Changes

Files/Groups and change summaries:

  • online/src/main/scala/ai/chronon/online/metrics/Metrics.scala, MetricsReporter.scala, OtelMetricsReporter.scala: Replaced StatsD metrics with pluggable MetricsReporter; implemented OTel metrics reporter; changed tags to Map; added dataset tag; refactored all metric reporting methods.
  • cloud_gcp/.../BigTableKVStoreImpl.scala: Introduced per-table metrics context caching using TrieMap; switched to structured tags (Map); updated metrics calls to use cached contexts and new tag format.
  • online/src/main/scala/ai/chronon/online/metrics/FlexibleExecutionContext.scala, InstrumentedThreadPoolExecutor.scala: Added InstrumentedThreadPoolExecutor for periodic metrics reporting and task timing; increased executor queue size.
  • online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala: Removed StatsD event reporting for fetcher version; deleted related imports and method.
  • online/src/main/java/ai/chronon/online/JavaFetcher.java: Updated Metrics.Context instantiation to include new dataset parameter.
  • online/src/test/scala/ai/chronon/online/test/TagsTest.scala: Updated tests to use OTel tag caching from OtelMetricsReporter instead of custom TTLCache.
  • online/BUILD.bazel: Added OTel dependencies via new OTEL_DEPS variable; removed direct StatsD dependency; updated deps in metrics and core libraries.
  • service_commons/BUILD.bazel: Switched Micrometer dependency from StatsD to OTLP registry.
  • service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java: Replaced StatsD metrics setup with OTel OTLP-based configuration; added initializeMetrics method.
  • docker/fetcher/Dockerfile, docker/fetcher/start.sh: Added new Dockerfile and startup script for Fetcher, with OTel configuration support and environment validation.
  • maven_install.json, tools/build_rules/dependencies/maven_repository.bzl: Added/updated Maven artifacts for OTel, Micrometer OTLP, Prometheus, OkHttp, Kotlin stdlib; expanded dependency mappings and SPI registrations.

Sequence Diagram(s)

sequenceDiagram
    participant Service
    participant MetricsContext
    participant MetricsReporter
    participant OTelExporter

    Service->>MetricsContext: Create or retrieve context (with dataset)
    Service->>MetricsReporter: increment/count/gauge/distribution(metric, value, tags)
    MetricsReporter->>OTelExporter: Record metric (with attributes)

Suggested reviewers

  • nikhil-zlai
  • david-zlai

Poem

Metrics now travel on OTel's wings,
StatsD's old songs replaced by new things.
Contexts are cached, tags mapped with care,
Executors report as they churn and they share.
From Docker to Java, the changes are deep—
Let OpenTelemetry measure while services leap!
🚀📊

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between b8f7931 and 6df3660.

📒 Files selected for processing (1)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (14 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala
⏰ Context from checks skipped due to timeout of 90000ms (18)
  • GitHub Check: service_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: service_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: online_tests
  • GitHub Check: api_tests
  • GitHub Check: flink_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: bazel_config_tests
  • GitHub Check: api_tests
  • GitHub Check: online_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: aggregator_tests
  • GitHub Check: flink_tests
  • GitHub Check: enforce_triggered_workflows


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (2)
online/BUILD.bazel (1)

64-64: 💡 Verification agent

🧩 Analysis chain

Remove unnecessary Datadog dependency

Since you're migrating to OpenTelemetry, the Datadog StatsD client may no longer be needed.


🏁 Script executed:

#!/bin/bash
# Check if the Datadog client is still used in the codebase
rg -A 3 -B 3 "com\.datadoghq\..*\.client" --type scala

Length of output: 56


🏁 Script executed:

#!/bin/bash
# search StatsD pkg
rg -A3 -B3 "com\.timgroup\.statsd"
# search StatsDClient
rg -A3 -B3 "StatsDClient"

Length of output: 390


🏁 Script executed:

#!/bin/bash
# search imports of Datadog StatsD in code
rg -A3 -B3 "import com\.datadoghq"
# search Datadog client class references
rg -A3 -B3 "NonBlockingStatsDClient"

Length of output: 73


Remove unused Datadog StatsD client dependency

No code imports or references to com.datadoghq:java-dogstatsd-client were found.

  • File: online/BUILD.bazel at line 64 – delete
    - maven_artifact("com.datadoghq:java-dogstatsd-client"),
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1)

94-99: 🛠️ Refactor suggestion

Cache can grow unbounded.

tableToContext retains one Context per Bigtable table name and never evicts.
Today we only have a handful of tables (see learning), but if that assumption ever changes this will leak memory across the JVM lifetime.
Consider an LRU/TTL cache or a WeakReference-based map.
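
A rough sketch of the suggested direction, assuming a simple size-bounded LRU wrapper (the class name and maxEntries are made up for illustration, not part of the PR):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Size-bounded LRU: evicts the least-recently-accessed entry once maxEntries is exceeded.
class LruCache[K, V](maxEntries: Int) {
  private val underlying = new JLinkedHashMap[K, V](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean = size() > maxEntries
  }

  def getOrElseUpdate(key: K, compute: => V): V = underlying.synchronized {
    val cached = underlying.get(key)
    if (cached != null) cached
    else {
      val value = compute
      underlying.put(key, value)
      value
    }
  }
}
```

Given today's handful of tables this is optional, but a bound like this keeps the worst case predictable.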

♻️ Duplicate comments (18)
maven_install.json (18)

24457-24462: Add OkHttp to sources list.


24623-24624: Add Micrometer OTLP to sources.


24729-24730: Add OpenTelemetry proto to sources.


24739-24748: Add exporter-common & OTLP to sources.


24765-24776: Add Prometheus config & exporters to sources.


25298-25301: Add Kotlin stdlib-jdk7/jdk8 to sources.


25925-25928: Add OkHttp to sources mapping.


26091-26094: Add Micrometer OTLP to sources mapping.


26197-26198: Add OpenTelemetry proto to sources mapping.


26207-26216: Add exporter-common & OTLP to sources mapping.


26233-26244: Add Prometheus config & exporters to sources mapping.


26766-26769: Add Kotlin stdlib-jdk7/jdk8 sources mapping.


27393-27398: Add OkHttp to sources mapping.


27559-27562: Add Micrometer OTLP to sources mapping.


27665-27666: Add OpenTelemetry proto to sources mapping.


27675-27684: Add exporter-common & OTLP to sources mapping.


27698-27704: Add Prometheus config & exporters to sources mapping.


28234-28237: Add Kotlin stdlib-jdk7/jdk8 sources mapping.

🧹 Nitpick comments (16)
service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (1)

36-56: Good extraction of metrics initialization.

OTLP configuration properly set up with required parameters.

Consider extracting "ai.chronon" as a constant rather than hardcoding it.

-    private void initializeMetrics(VertxOptions options) {
-        String serviceName = "ai.chronon";
+    private static final String SERVICE_NAME = "ai.chronon";
+    
+    private void initializeMetrics(VertxOptions options) {
docker/fetcher/start.sh (1)

32-41: Java command properly configured.

All required parameters passed with good error handling.

Consider adding file existence checks for JAR files before execution.

+if [ ! -f "$FETCHER_JAR" ]; then
+  echo "Error: Fetcher JAR $FETCHER_JAR not found"
+  exit 1
+fi
+
+if [ ! -f "$ONLINE_JAR" ]; then
+  echo "Error: Online JAR $ONLINE_JAR not found"
+  exit 1
+fi
+
docker/fetcher/Dockerfile (3)

15-22: Consider using package manager for Python tools

Using pip with system Python can cause conflicts. Consider using a virtual environment.


26-28: Use apt repository for Scala installation

Direct .deb download bypasses package management. Consider using official repositories.


47-47: Health check could be more robust

Current health check only verifies HTTP availability. Consider deeper health verification.

online/src/main/scala/ai/chronon/online/metrics/MetricsReporter.scala (1)

8-17: Well-designed metrics interface

Clean abstraction for metrics reporting with appropriate methods for different metric types.

Consider adding:

  1. Direct timing metric method (a rough sketch follows after this list)
  2. Unit documentation for values
  3. Batch operation support
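
For the first suggestion, a timing method could simply wrap the existing distribution call. A hedged sketch, reusing the illustrative MetricsReporter trait sketched in the PR summary above (names are assumptions, not the real API):

```scala
object MetricTiming {
  // Times a block and reports the elapsed milliseconds as a distribution metric.
  def timed[T](reporter: MetricsReporter, metric: String, tags: Map[String, String] = Map.empty)(block: => T): T = {
    val start = System.nanoTime()
    try block
    finally reporter.distribution(metric, (System.nanoTime() - start) / 1000000L, tags)
  }
}
```

Callers would then write MetricTiming.timed(reporter, "multiGet.latency") { ... } instead of doing timestamp bookkeeping by hand.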
online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (3)

20-20: Make metrics interval more configurable

Consider making this configurable via environment variable for runtime tuning.


63-66: Enhance error handling

Add specific exception handling for different error types instead of catching all exceptions.


77-100: Consider higher precision timing

For very fast operations, millisecond precision may be insufficient. Consider using System.nanoTime().

-    val submitTime = System.currentTimeMillis()
+    val submitTime = System.nanoTime()

     val instrumentedTask = new Runnable {
       override def run(): Unit = {
-        val startTime = System.currentTimeMillis()
+        val startTime = System.nanoTime()
         val waitTime = startTime - submitTime

         // Record wait time
-        metricsContext.distribution("wait_time_ms", waitTime)
+        metricsContext.distribution("wait_time_ns", waitTime)
+        metricsContext.distribution("wait_time_ms", waitTime / 1_000_000)

         command.run()
-        val endTime = System.currentTimeMillis()
+        val endTime = System.nanoTime()
         val execTime = endTime - startTime
         val totalTime = endTime - submitTime

         // Record timing metrics
-        metricsContext.distribution("execution_time_ms", execTime)
-        metricsContext.distribution("total_time_ms", totalTime)
+        metricsContext.distribution("execution_time_ns", execTime)
+        metricsContext.distribution("execution_time_ms", execTime / 1_000_000)
+        metricsContext.distribution("total_time_ns", totalTime)
+        metricsContext.distribution("total_time_ms", totalTime / 1_000_000)
       }
     }
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (3)

196-199: Prefer System.nanoTime() for latency.

System.currentTimeMillis() is wall-clock and can jump; nanoTime() gives monotonic duration.


218-220: Missing latency metric on failure.

On error you increment multiGet.bigtable_errors but skip a latency distribution, losing SLO visibility. Record the duration before returning.


333-336: DRY: context lookup duplicated.

The tableToContext lookup pattern appears in multiGet, list, and multiPut. Extract a small helper (getDatasetCtx) to avoid repetition.
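
A possible shape for that helper, sketched with a stand-in Context class since the real Metrics.Context has more fields (all names here are illustrative):

```scala
import scala.collection.concurrent.TrieMap

object DatasetContexts {
  // Stand-in for the real Metrics.Context; only the fields needed for the sketch.
  case class Context(environment: String, dataset: String = "")

  private val base = Context(environment = "kv_store")
  private val tableToContext = TrieMap.empty[String, Context]

  // Single lookup shared by multiGet / list / multiPut instead of repeating
  // the getOrElseUpdate pattern in each method.
  def getDatasetCtx(tableId: String): Context =
    tableToContext.getOrElseUpdate(tableId, base.copy(dataset = tableId))
}
```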

online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (2)

176-201: Duplicate tag keys silently overwrite.

joinNames.foreach(addTag(Tag.Join, _)) keeps only the last join when multiple are present. If multi-join tagging is important, change value to a concatenated string.


205-235: Metric name already contains environment.

prefix(metric) prepends environment. Passing the same value in tag Environment duplicates information; consider dropping one for cardinality.

online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (2)

46-50: Attribute merge allocates each call.

attributes.toBuilder creates a new builder per record. Cache common attribute sets (ctx + no extra tags) separately to reduce alloc pressure on hot paths.
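
One cheap variant of that idea is to skip the builder entirely when no per-call tags are supplied, which is the common hot path. A sketch against the OpenTelemetry Attributes API (the mergeAttributes name mirrors this review comment, not necessarily the PR's code):

```scala
import io.opentelemetry.api.common.Attributes

object AttributeMerge {
  // Reuses the cached per-context Attributes when there are no extra tags,
  // allocating a builder only when a merge is actually needed.
  def mergeAttributes(base: Attributes, tags: Map[String, String]): Attributes =
    if (tags.isEmpty) base
    else {
      val builder = base.toBuilder
      tags.foreach { case (k, v) => builder.put(k, v) }
      builder.build()
    }
}
```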


98-116: Prometheus & HTTP are mutually exclusive.

buildOtelMetricReader() ignores the case where both reader types are desired (e.g., HTTP for prod, Prometheus for local). Consider supporting dual registration or documenting the limitation.
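
If dual registration is worth supporting, the SDK does allow multiple metric readers on one SdkMeterProvider. A hedged sketch with placeholder endpoint and port values:

```scala
import io.opentelemetry.exporter.otlp.http.metrics.OtlpHttpMetricExporter
import io.opentelemetry.exporter.prometheus.PrometheusHttpServer
import io.opentelemetry.sdk.metrics.SdkMeterProvider
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader

object DualMetricReaders {
  def build(): SdkMeterProvider =
    SdkMeterProvider.builder()
      // OTLP HTTP push path (e.g. for the Vert.x / Micrometer integration).
      .registerMetricReader(
        PeriodicMetricReader.create(
          OtlpHttpMetricExporter.builder()
            .setEndpoint("http://localhost:4318/v1/metrics") // placeholder endpoint
            .build()))
      // Prometheus pull path; the server itself acts as a MetricReader.
      .registerMetricReader(PrometheusHttpServer.builder().setPort(9464).build())
      .build()
}
```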

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 9499ce5 and 7f9b653.

📒 Files selected for processing (16)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (14 hunks)
  • docker/fetcher/Dockerfile (1 hunks)
  • docker/fetcher/start.sh (1 hunks)
  • maven_install.json (37 hunks)
  • online/BUILD.bazel (4 hunks)
  • online/src/main/java/ai/chronon/online/JavaFetcher.java (1 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/FlexibleExecutionContext.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (6 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/MetricsReporter.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (1 hunks)
  • online/src/test/scala/ai/chronon/online/test/TagsTest.scala (2 hunks)
  • service_commons/BUILD.bazel (1 hunks)
  • service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (1 hunks)
  • tools/build_rules/dependencies/maven_repository.bzl (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (2)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#657
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala:93-97
Timestamp: 2025-04-21T15:10:40.819Z
Learning: The BigTableKVStoreImpl in the chronon codebase only interacts with 4 BigTable tables total, so unbounded caching in tableToContext is not a concern.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#47
File: online/src/main/scala/ai/chronon/online/MetadataStore.scala:232-0
Timestamp: 2024-10-17T00:12:09.763Z
Learning: In the `KVStore` trait located at `online/src/main/scala/ai/chronon/online/KVStore.scala`, there are two `create` methods: `def create(dataset: String): Unit` and `def create(dataset: String, props: Map[String, Any]): Unit`. The version with `props` ignores the `props` parameter, and the simpler version without `props` is appropriate when `props` are not needed.
⏰ Context from checks skipped due to timeout of 90000ms (17)
  • GitHub Check: service_tests
  • GitHub Check: service_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: api_tests
  • GitHub Check: online_tests
  • GitHub Check: online_tests
  • GitHub Check: flink_tests
  • GitHub Check: flink_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: bazel_config_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: aggregator_tests
  • GitHub Check: api_tests
🔇 Additional comments (42)
service_commons/BUILD.bazel (1)

12-12: Dependency updated for OpenTelemetry metrics.

Changed from StatsD to OTLP registry to support the metrics provider migration.

online/src/main/java/ai/chronon/online/JavaFetcher.java (2)

202-202: Constructor call updated with additional parameter.

Added null parameter to align with the Metrics.Context constructor change in the Scala codebase that added an optional dataset field.


206-206: Constructor call updated with additional parameter.

Added null parameter to align with the Metrics.Context constructor change in the Scala codebase that added an optional dataset field.

online/src/main/scala/ai/chronon/online/metrics/FlexibleExecutionContext.scala (1)

38-45:

✅ Verification successful

Thread pool executor instrumented with metrics.

Replaced ThreadPoolExecutor with InstrumentedThreadPoolExecutor to collect metrics on the thread pool. Also increased queue capacity from 1000 to 10000.


🏁 Script executed:

#!/bin/bash
# Checking for the InstrumentedThreadPoolExecutor class implementation
rg -A 5 "class InstrumentedThreadPoolExecutor"

Length of output: 924


Instrumentation confirmed; ready to merge

  • InstrumentedThreadPoolExecutor found in online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala
  • Queue capacity increased from 1,000 to 10,000
online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala (2)

437-440: Metrics.Context constructor formatting adjusted.

Reformatted the constructor call for better readability while maintaining the same functionality.


494-495: Metrics.Context constructor formatting adjusted.

Reformatted the constructor call for better readability while maintaining the same functionality.

tools/build_rules/dependencies/maven_repository.bzl (2)

73-73: LGTM - Micrometer OTLP registry added.

Version matches existing Micrometer dependencies.


226-232: OpenTelemetry dependencies look complete.

Comprehensive OTel dependency set with consistent versioning.

online/src/test/scala/ai/chronon/online/test/TagsTest.scala (4)

21-22: Appropriate imports for OTel integration.


30-30: Good test approach using no-op OpenTelemetry.

Using noop OpenTelemetry avoids external dependencies.


55-59: Tag cache verification logic properly updated.

Test logic updated to use OtelMetricsReporter's cache.


61-63: Appropriate assertion maintained.

Still verifies tag string equivalence with new implementation.

service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (3)

3-8: Imports properly updated for OTel.

StatsD imports replaced with OpenTelemetry equivalents.


18-20: Documentation correctly updated.

Javadoc reflects the migration to OpenTelemetry.


27-34: Better metrics enablement logic.

Using Optional for null-safe property parsing is a good practice.

docker/fetcher/start.sh (3)

4-11: Good environment variable validation.

Important to fail fast if required vars are missing.


13-19: Clear cloud provider selection logic.

Clean conditional for AWS/GCP configuration.


21-26: Appropriate metrics enablement check.

Using environment variable presence for configuration is a good pattern.

docker/fetcher/Dockerfile (1)

1-53: Dockerfile follows best practices

Environment variables are well-defined, dependencies are properly installed, and service configuration is clear.

online/BUILD.bazel (1)

1-9: Comprehensive OpenTelemetry dependencies

All necessary OpenTelemetry components are included.

online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (1)

1-113: Well-implemented thread pool instrumentation

The implementation properly captures execution metrics and reports thread pool state.

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1)

143-148: Tag value is table, not dataset.

metricsContext.copy(dataset = targetId.toString) records the physical table id in the dataset tag.
Down-stream dashboards may expect the logical dataset name. Double-check the intended semantics.

online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (1)

130-146: Static client selection locks system properties.

client is initialised once; later tests or sub-modules changing System.setProperty won’t affect reporter choice. If hot-reloading matters, lazy-create per call.

maven_install.json (19)

3-4: Skip autogenerated metadata update.


1197-1203: Add OkHttp & Okio artifacts.


1778-1784: Add Micrometer OTLP registry dependency.


2131-2137: Add OpenTelemetry proto (alpha). Confirm alpha stability before promotion.


2147-2151: Add opentelemetry-api dependency.


2161-2242: Bulk add OpenTelemetry SDK & exporters. Validate version alignment across all modules.


2257-2284: Add Prometheus client libraries.


4172-4184: Add Kotlin stdlib JDK7/JDK8.


6420-6426: Update dependency graph for OkHttp/Okio.


6738-6741: Update Micrometer OTLP registry graph.


7007-7024: Update OpenTelemetry dependency graph. Verify no missing transitive deps.


8476-8481: Update Kotlin stdlib graph.


11312-11330: Update OkHttp internal packages mapping.


12166-12168: Update Micrometer OTLP graph.


12540-12549: Update OpenTelemetry proto graph.


12578-12600: Update exporter-common & OTLP provider graph.


12628-12631: Update autoconfigure SPI graph.


12678-12699: Update Prometheus metrics graph.


30256-30329: Add autoconfigure providers for OTLP & Prometheus.

Comment on lines +59 to +71
override def longGauge(metric: String, value: Long, tags: Map[String, String] = Map.empty)(implicit
context: Context): Unit = {
val gauge = longGauges.getOrElseUpdate(metric, meter.gaugeBuilder(metric).ofLongs().build())
val mergedAttributes = mergeAttributes(tagCache(context), tags)
gauge.set(value, mergedAttributes)
}

override def doubleGauge(metric: String, value: Double, tags: Map[String, String] = Map.empty)(implicit
context: Context): Unit = {
val gauge = doubleGauges.getOrElseUpdate(metric, meter.gaugeBuilder(metric).build())
val mergedAttributes = mergeAttributes(tagCache(context), tags)
gauge.set(value, mergedAttributes)
}
Contributor

⚠️ Potential issue

LongGauge / DoubleGauge misuse – won’t compile.

OpenTelemetry gauge instruments are observable; gaugeBuilder(...).build() expects a callback and returns Observable*Gauge, which has no set method.
Current code will not compile. Use an UpDownCounter or asynchronous gauge with a polling callback.

-val gauge = longGauges.getOrElseUpdate(metric, meter.gaugeBuilder(metric).ofLongs().build())
-gauge.set(value, mergedAttributes)
+val counter = longGauges.getOrElseUpdate(metric, meter.upDownCounterBuilder(metric).ofLongs().build())
+counter.add(value, mergedAttributes)

}),
visibility = ["//visibility:public"],
deps = [
deps = OTEL_DEPS + [
Collaborator
ooc why do we need this here? could we just depend on the metrics_lib ?

Contributor Author
No - it seems that since the lib target includes all sources, we need to list the otel libs here too (the deps on metrics_lib don't seem to get included by default).

maven_artifact("net.bytebuddy:byte-buddy-agent"),
maven_artifact("org.apache.hadoop:hadoop-common"),
maven_artifact("org.apache.hadoop:hadoop-client-api"),
maven_artifact("io.opentelemetry:opentelemetry-api"),
Collaborator
ditto

@piyush-zlai piyush-zlai merged commit 2ff8387 into main May 2, 2025
21 checks passed
@piyush-zlai piyush-zlai deleted the piyush/metrics_reporter branch May 2, 2025 19:44
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
…r Fetcher threadpool (#726)

## Summary
PR to swap our metrics reporter from statsd to open telemetry metrics.
We need otel to allow us to capture metrics in our clients without the need of
a prometheus statsd exporter sidecar that they've seen issues with
occasionally. Otel in general is a popular metrics ingestion interface
with a number of supported backends (e.g. prom / datadog / gcloud / aws
cloudwatch). Wiring up Otel also enables us to set up traces and spans
in the repo in the future.
Broad changes:
- Decouple bulk of the metrics reporting logic from the Metrics.Context.
The metrics reporter we use is pluggable. Currently this is just the
OpenTelemetry but in principle we can support others in the future.
- Online module creates the appropriate [otel
SDK](https://opentelemetry.io/docs/languages/java/sdk/) - either we use
the [Http provider or the Prometheus Http
server](https://opentelemetry.io/docs/languages/java/configuration/#properties-exporters).
We need the Http provider to plug into Vert.x as their Micrometer
integration works with that. The Prom http server is what our clients are keen
we use.

## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

Tested via docker container and a local instance of open telemetry:
Start up fetcher docker svc
```
docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json  -p 9000:9000  -e "GCP_PROJECT_ID=canary-443022"  -e "GOOGLE_CLOUD_PROJECT=canary-443022"  -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance"  -e "EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318"  -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json  zipline-fetcher:latest
```

And then otel:
```
./otelcol --config otel-collector-config.yaml
...
```

We see:
```
2025-04-18T17:35:37.351-0400	info	ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> service.name: Str(ai.chronon)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.49.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope ai.chronon 3.7.0-M11
Metric #0
Descriptor:
     -> Name: kv_store.bigtable.cache.insert
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Value: 1
Metric #1
Descriptor:
     -> Name: kv_store.bigtable.multiGet.latency
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Count: 1
Sum: 229.000000
Min: 229.000000
Max: 229.000000
ExplicitBounds #0: 0.000000
...
Buckets #0, Count: 0
...
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced OpenTelemetry-based metrics reporting throughout the
platform, replacing the previous StatsD approach.
- Added a Dockerfile and startup script for a new Fetcher service,
supporting both AWS and GCP integrations with configurable metrics
export.
- Enhanced thread pool monitoring with a new executor that provides
detailed metrics on task execution and queue status.

- **Improvements**
- Metrics tags are now structured as key-value maps, improving clarity
and flexibility.
- Metrics reporting is now context-aware, supporting per-dataset and
per-table metrics.
- Increased thread pool queue capacity for better throughput under load.
- Replaced StatsD metrics configuration with OpenTelemetry OTLP in
service launcher and build configurations.

- **Bug Fixes**
- Improved error handling and logging in metrics reporting and thread
pool management.

- **Chores**
- Updated dependencies to include OpenTelemetry, Micrometer OTLP
registry, Prometheus, OkHttp, and Kotlin libraries.
- Refactored build and test configurations to support new telemetry
libraries and remove deprecated dependencies.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
…r Fetcher threadpool (#726)

chewy-zlai pushed a commit that referenced this pull request May 16, 2025
…r Fetcher threadpool (#726)

