
Conversation

@piyush-zlai
Contributor

@piyush-zlai piyush-zlai commented May 2, 2025

Summary

PR to swap our metrics reporter from StatsD to OpenTelemetry metrics. We need OTel to let us capture metrics at Etsy without the Prometheus statsd-exporter sidecar that they've occasionally seen issues with. OTel in general is a popular metrics ingestion interface with a number of supported backends (e.g. Prometheus / Datadog / Google Cloud / AWS CloudWatch). Wiring up OTel also enables us to set up traces and spans in the repo in the future.
Broad changes:

  • Decouple the bulk of the metrics reporting logic from Metrics.Context. The metrics reporter we use is pluggable: currently this is just OpenTelemetry, but in principle we can support others in the future (see the sketch after this list).
  • The Online module creates the appropriate OTel SDK - either the HTTP provider or the Prometheus HTTP server. We need the HTTP provider to plug into Vert.x, as their Micrometer integration works with that. The Prometheus HTTP server is what Etsy is keen for us to use.
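
A minimal sketch of what such a pluggable reporter interface could look like; the method names here are illustrative and may not match the PR's actual MetricsReporter:

```scala
// Illustrative sketch only - not the PR's exact interface.
trait MetricsReporter {
  def count(metric: String, value: Long, tags: Map[String, String] = Map.empty): Unit
  def longGauge(metric: String, value: Long, tags: Map[String, String] = Map.empty): Unit
  def distribution(metric: String, value: Long, tags: Map[String, String] = Map.empty): Unit
}

// A no-op implementation is handy for tests or when metrics export is disabled.
object NoopMetricsReporter extends MetricsReporter {
  def count(metric: String, value: Long, tags: Map[String, String]): Unit = ()
  def longGauge(metric: String, value: Long, tags: Map[String, String]): Unit = ()
  def distribution(metric: String, value: Long, tags: Map[String, String]): Unit = ()
}
```

With the OTel implementation wired in behind an interface like this, Metrics.Context stays the entry point for callers while the concrete backend remains swappable.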

Checklist

  • [ ] Added Unit Tests
  • [x] Covered by existing CI
  • [x] Integration tested
  • [ ] Documentation update

Tested via a Docker container and a local instance of the OpenTelemetry collector.
Start up the fetcher Docker service:

docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json  -p 9000:9000  -e "GCP_PROJECT_ID=canary-443022"  -e "GOOGLE_CLOUD_PROJECT=canary-443022"  -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance"  -e "EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318"  -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json  zipline-fetcher:latest

And then otel:

./otelcol --config otel-collector-config.yaml
...

We see:

2025-04-18T17:35:37.351-0400	info	ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> service.name: Str(ai.chronon)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.49.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope ai.chronon 3.7.0-M11
Metric #0
Descriptor:
     -> Name: kv_store.bigtable.cache.insert
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Value: 1
Metric #1
Descriptor:
     -> Name: kv_store.bigtable.multiGet.latency
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Count: 1
Sum: 229.000000
Min: 229.000000
Max: 229.000000
ExplicitBounds #0: 0.000000
...
Buckets #0, Count: 0
...

Summary by CodeRabbit

  • New Features

    • Introduced OpenTelemetry-based metrics reporting throughout the platform, replacing the previous StatsD approach.
    • Added a Dockerfile and startup script for a new Fetcher service, supporting both AWS and GCP integrations with configurable metrics export.
    • Enhanced thread pool monitoring with a new executor that provides detailed metrics on task execution and queue status.
  • Improvements

    • Metrics tags are now structured as key-value maps, improving clarity and flexibility.
    • Metrics reporting is now context-aware, supporting per-dataset and per-table metrics.
    • Increased thread pool queue capacity for better throughput under load.
    • Replaced StatsD metrics configuration with OpenTelemetry OTLP in service launcher and build configurations.
  • Bug Fixes

    • Improved error handling and logging in metrics reporting and thread pool management.
  • Chores

    • Updated dependencies to include OpenTelemetry, Micrometer OTLP registry, Prometheus, OkHttp, and Kotlin libraries.
    • Refactored build and test configurations to support new telemetry libraries and remove deprecated dependencies.

@coderabbitai
Contributor

coderabbitai bot commented May 2, 2025

Walkthrough

This update introduces a comprehensive migration from StatsD-based metrics to OpenTelemetry (OTel) metrics reporting across the codebase. It adds a pluggable MetricsReporter interface, a new OtelMetricsReporter implementation, and updates all metrics-related code to use structured tags and OTel APIs. The build system and dependencies are updated to include OTel and Micrometer OTLP registry libraries. Metrics context handling is improved with per-table caching and structured tagging. Docker and service launcher scripts are added or modified to support the new metrics configuration. The thread pool executor is now instrumented for metrics collection.

Changes

Files/Groups and change summaries:

  • online/src/main/scala/ai/chronon/online/metrics/Metrics.scala, MetricsReporter.scala, OtelMetricsReporter.scala: Replaced StatsD metrics with pluggable MetricsReporter; implemented OTel metrics reporter; changed tags to Map; added dataset tag; refactored all metric reporting methods.
  • cloud_gcp/.../BigTableKVStoreImpl.scala: Introduced per-table metrics context caching using TrieMap; switched to structured tags (Map); updated metrics calls to use cached contexts and new tag format.
  • online/src/main/scala/ai/chronon/online/metrics/FlexibleExecutionContext.scala, InstrumentedThreadPoolExecutor.scala: Added InstrumentedThreadPoolExecutor for periodic metrics reporting and task timing; increased executor queue size.
  • online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala: Removed StatsD event reporting for fetcher version; deleted related imports and method.
  • online/src/main/java/ai/chronon/online/JavaFetcher.java: Updated Metrics.Context instantiation to include new dataset parameter.
  • online/src/test/scala/ai/chronon/online/test/TagsTest.scala: Updated tests to use OTel tag caching from OtelMetricsReporter instead of custom TTLCache.
  • online/BUILD.bazel: Added OTel dependencies via new OTEL_DEPS variable; removed direct StatsD dependency; updated deps in metrics and core libraries.
  • service_commons/BUILD.bazel: Switched Micrometer dependency from StatsD to OTLP registry.
  • service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java: Replaced StatsD metrics setup with OTel OTLP-based configuration; added initializeMetrics method.
  • docker/fetcher/Dockerfile, docker/fetcher/start.sh: Added new Dockerfile and startup script for Fetcher, with OTel configuration support and environment validation.
  • maven_install.json, tools/build_rules/dependencies/maven_repository.bzl: Added/updated Maven artifacts for OTel, Micrometer OTLP, Prometheus, OkHttp, Kotlin stdlib; expanded dependency mappings and SPI registrations.

Sequence Diagram(s)

sequenceDiagram
    participant Service
    participant MetricsContext
    participant MetricsReporter
    participant OTelExporter

    Service->>MetricsContext: Create or retrieve context (with dataset)
    Service->>MetricsReporter: increment/count/gauge/distribution(metric, value, tags)
    MetricsReporter->>OTelExporter: Record metric (with attributes)

Suggested reviewers

  • nikhil-zlai
  • david-zlai

Poem

Metrics now travel on OTel's wings,
StatsD's old songs replaced by new things.
Contexts are cached, tags mapped with care,
Executors report as they churn and they share.
From Docker to Java, the changes are deep—
Let OpenTelemetry measure while services leap!
🚀📊

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between b8f7931 and 6df3660.

📒 Files selected for processing (1)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (14 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala
⏰ Context from checks skipped due to timeout of 90000ms (18)
  • GitHub Check: service_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: service_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: online_tests
  • GitHub Check: api_tests
  • GitHub Check: flink_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: bazel_config_tests
  • GitHub Check: api_tests
  • GitHub Check: online_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: aggregator_tests
  • GitHub Check: flink_tests
  • GitHub Check: enforce_triggered_workflows


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (2)
online/BUILD.bazel (1)

64-64: 💡 Verification agent

🧩 Analysis chain

Remove unnecessary Datadog dependency

Since you're migrating to OpenTelemetry, the Datadog StatsD client may no longer be needed.


🏁 Script executed:

#!/bin/bash
# Check if the Datadog client is still used in the codebase
rg -A 3 -B 3 "com\.datadoghq\..*\.client" --type scala

Length of output: 56


🏁 Script executed:

#!/bin/bash
# search StatsD pkg
rg -A3 -B3 "com\.timgroup\.statsd"
# search StatsDClient
rg -A3 -B3 "StatsDClient"

Length of output: 390


🏁 Script executed:

#!/bin/bash
# search imports of Datadog StatsD in code
rg -A3 -B3 "import com\.datadoghq"
# search Datadog client class references
rg -A3 -B3 "NonBlockingStatsDClient"

Length of output: 73


Remove unused Datadog StatsD client dependency

No code imports or references to com.datadoghq:java-dogstatsd-client were found.

  • File: online/BUILD.bazel at line 64 – delete
    - maven_artifact("com.datadoghq:java-dogstatsd-client"),
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1)

94-99: 🛠️ Refactor suggestion

Cache can grow unbounded.

tableToContext retains one Context per Bigtable table name and never evicts.
Today we only have a handful of tables (see learning), but if that assumption ever changes this will leak memory across the JVM lifetime.
Consider an LRU/TTL cache or a WeakReference-based map.
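
A rough sketch of the suggested direction, assuming a simple size-bounded LRU wrapper (the class name and maxEntries are made up for illustration, not part of the PR):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Size-bounded LRU: evicts the least-recently-accessed entry once maxEntries is exceeded.
class LruCache[K, V](maxEntries: Int) {
  private val underlying = new JLinkedHashMap[K, V](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean = size() > maxEntries
  }

  def getOrElseUpdate(key: K, compute: => V): V = underlying.synchronized {
    val cached = underlying.get(key)
    if (cached != null) cached
    else {
      val value = compute
      underlying.put(key, value)
      value
    }
  }
}
```

Given today's handful of tables this is optional, but a bound like this keeps the worst case predictable.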

♻️ Duplicate comments (18)
maven_install.json (18)

24457-24462: Add OkHttp to sources list.


24623-24624: Add Micrometer OTLP to sources.


24729-24730: Add OpenTelemetry proto to sources.


24739-24748: Add exporter-common & OTLP to sources.


24765-24776: Add Prometheus config & exporters to sources.


25298-25301: Add Kotlin stdlib-jdk7/jdk8 to sources.


25925-25928: Add OkHttp to sources mapping.


26091-26094: Add Micrometer OTLP to sources mapping.


26197-26198: Add OpenTelemetry proto to sources mapping.


26207-26216: Add exporter-common & OTLP to sources mapping.


26233-26244: Add Prometheus config & exporters to sources mapping.


26766-26769: Add Kotlin stdlib-jdk7/jdk8 sources mapping.


27393-27398: Add OkHttp to sources mapping.


27559-27562: Add Micrometer OTLP to sources mapping.


27665-27666: Add OpenTelemetry proto to sources mapping.


27675-27684: Add exporter-common & OTLP to sources mapping.


27698-27704: Add Prometheus config & exporters to sources mapping.


28234-28237: Add Kotlin stdlib-jdk7/jdk8 sources mapping.

🧹 Nitpick comments (16)
service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (1)

36-56: Good extraction of metrics initialization.

OTLP configuration properly set up with required parameters.

Consider extracting "ai.chronon" as a constant rather than hardcoding it.

-    private void initializeMetrics(VertxOptions options) {
-        String serviceName = "ai.chronon";
+    private static final String SERVICE_NAME = "ai.chronon";
+    
+    private void initializeMetrics(VertxOptions options) {
docker/fetcher/start.sh (1)

32-41: Java command properly configured.

All required parameters passed with good error handling.

Consider adding file existence checks for JAR files before execution.

+if [ ! -f "$FETCHER_JAR" ]; then
+  echo "Error: Fetcher JAR $FETCHER_JAR not found"
+  exit 1
+fi
+
+if [ ! -f "$ONLINE_JAR" ]; then
+  echo "Error: Online JAR $ONLINE_JAR not found"
+  exit 1
+fi
+
docker/fetcher/Dockerfile (3)

15-22: Consider using package manager for Python tools

Using pip with system Python can cause conflicts. Consider using a virtual environment.


26-28: Use apt repository for Scala installation

Direct .deb download bypasses package management. Consider using official repositories.


47-47: Health check could be more robust

Current health check only verifies HTTP availability. Consider deeper health verification.

online/src/main/scala/ai/chronon/online/metrics/MetricsReporter.scala (1)

8-17: Well-designed metrics interface

Clean abstraction for metrics reporting with appropriate methods for different metric types.

Consider adding:

  1. Direct timing metric method (a rough sketch follows after this list)
  2. Unit documentation for values
  3. Batch operation support
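
For the first suggestion, a timing method could simply wrap the existing distribution call. A hedged sketch, reusing the illustrative MetricsReporter trait sketched in the PR summary above (names are assumptions, not the real API):

```scala
object MetricTiming {
  // Times a block and reports the elapsed milliseconds as a distribution metric.
  def timed[T](reporter: MetricsReporter, metric: String, tags: Map[String, String] = Map.empty)(block: => T): T = {
    val start = System.nanoTime()
    try block
    finally reporter.distribution(metric, (System.nanoTime() - start) / 1000000L, tags)
  }
}
```

Callers would then write MetricTiming.timed(reporter, "multiGet.latency") { ... } instead of doing timestamp bookkeeping by hand.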
online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (3)

20-20: Make metrics interval more configurable

Consider making this configurable via environment variable for runtime tuning.


63-66: Enhance error handling

Add specific exception handling for different error types instead of catching all exceptions.


77-100: Consider higher precision timing

For very fast operations, millisecond precision may be insufficient. Consider using System.nanoTime().

-    val submitTime = System.currentTimeMillis()
+    val submitTime = System.nanoTime()

     val instrumentedTask = new Runnable {
       override def run(): Unit = {
-        val startTime = System.currentTimeMillis()
+        val startTime = System.nanoTime()
         val waitTime = startTime - submitTime

         // Record wait time
-        metricsContext.distribution("wait_time_ms", waitTime)
+        metricsContext.distribution("wait_time_ns", waitTime)
+        metricsContext.distribution("wait_time_ms", waitTime / 1_000_000)

         command.run()
-        val endTime = System.currentTimeMillis()
+        val endTime = System.nanoTime()
         val execTime = endTime - startTime
         val totalTime = endTime - submitTime

         // Record timing metrics
-        metricsContext.distribution("execution_time_ms", execTime)
-        metricsContext.distribution("total_time_ms", totalTime)
+        metricsContext.distribution("execution_time_ns", execTime)
+        metricsContext.distribution("execution_time_ms", execTime / 1_000_000)
+        metricsContext.distribution("total_time_ns", totalTime)
+        metricsContext.distribution("total_time_ms", totalTime / 1_000_000)
       }
     }
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (3)

196-199: Prefer System.nanoTime() for latency.

System.currentTimeMillis() is wall-clock and can jump; nanoTime() gives monotonic duration.


218-220: Missing latency metric on failure.

On error you increment multiGet.bigtable_errors but skip a latency distribution, losing SLO visibility. Record the duration before returning.


333-336: DRY: context lookup duplicated.

The tableToContext lookup pattern appears in multiGet, list, and multiPut. Extract a small helper (getDatasetCtx) to avoid repetition.
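
A possible shape for that helper, sketched with a stand-in Context class since the real Metrics.Context has more fields (all names here are illustrative):

```scala
import scala.collection.concurrent.TrieMap

object DatasetContexts {
  // Stand-in for the real Metrics.Context; only the fields needed for the sketch.
  case class Context(environment: String, dataset: String = "")

  private val base = Context(environment = "kv_store")
  private val tableToContext = TrieMap.empty[String, Context]

  // Single lookup shared by multiGet / list / multiPut instead of repeating
  // the getOrElseUpdate pattern in each method.
  def getDatasetCtx(tableId: String): Context =
    tableToContext.getOrElseUpdate(tableId, base.copy(dataset = tableId))
}
```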

online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (2)

176-201: Duplicate tag keys silently overwrite.

joinNames.foreach(addTag(Tag.Join, _)) keeps only the last join when multiple are present. If multi-join tagging is important, change value to a concatenated string.


205-235: Metric name already contains environment.

prefix(metric) prepends environment. Passing the same value in tag Environment duplicates information; consider dropping one for cardinality.

online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (2)

46-50: Attribute merge allocates each call.

attributes.toBuilder creates a new builder per record. Cache common attribute sets (ctx + no extra tags) separately to reduce alloc pressure on hot paths.
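
One cheap variant of that idea is to skip the builder entirely when no per-call tags are supplied, which is the common hot path. A sketch against the OpenTelemetry Attributes API (the mergeAttributes name mirrors this review comment, not necessarily the PR's code):

```scala
import io.opentelemetry.api.common.Attributes

object AttributeMerge {
  // Reuses the cached per-context Attributes when there are no extra tags,
  // allocating a builder only when a merge is actually needed.
  def mergeAttributes(base: Attributes, tags: Map[String, String]): Attributes =
    if (tags.isEmpty) base
    else {
      val builder = base.toBuilder
      tags.foreach { case (k, v) => builder.put(k, v) }
      builder.build()
    }
}
```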


98-116: Prometheus & HTTP are mutually exclusive.

buildOtelMetricReader() ignores the case where both reader types are desired (e.g., HTTP for prod, Prometheus for local). Consider supporting dual registration or documenting the limitation.
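
If dual registration is worth supporting, the SDK does allow multiple metric readers on one SdkMeterProvider. A hedged sketch with placeholder endpoint and port values:

```scala
import io.opentelemetry.exporter.otlp.http.metrics.OtlpHttpMetricExporter
import io.opentelemetry.exporter.prometheus.PrometheusHttpServer
import io.opentelemetry.sdk.metrics.SdkMeterProvider
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader

object DualMetricReaders {
  def build(): SdkMeterProvider =
    SdkMeterProvider.builder()
      // OTLP HTTP push path (e.g. for the Vert.x / Micrometer integration).
      .registerMetricReader(
        PeriodicMetricReader.create(
          OtlpHttpMetricExporter.builder()
            .setEndpoint("http://localhost:4318/v1/metrics") // placeholder endpoint
            .build()))
      // Prometheus pull path; the server itself acts as a MetricReader.
      .registerMetricReader(PrometheusHttpServer.builder().setPort(9464).build())
      .build()
}
```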

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 9499ce5 and 7f9b653.

📒 Files selected for processing (16)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (14 hunks)
  • docker/fetcher/Dockerfile (1 hunks)
  • docker/fetcher/start.sh (1 hunks)
  • maven_install.json (37 hunks)
  • online/BUILD.bazel (4 hunks)
  • online/src/main/java/ai/chronon/online/JavaFetcher.java (1 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/FlexibleExecutionContext.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (6 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/MetricsReporter.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (1 hunks)
  • online/src/test/scala/ai/chronon/online/test/TagsTest.scala (2 hunks)
  • service_commons/BUILD.bazel (1 hunks)
  • service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (1 hunks)
  • tools/build_rules/dependencies/maven_repository.bzl (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (2)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#657
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala:93-97
Timestamp: 2025-04-21T15:10:40.819Z
Learning: The BigTableKVStoreImpl in the chronon codebase only interacts with 4 BigTable tables total, so unbounded caching in tableToContext is not a concern.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#47
File: online/src/main/scala/ai/chronon/online/MetadataStore.scala:232-0
Timestamp: 2024-10-17T00:12:09.763Z
Learning: In the `KVStore` trait located at `online/src/main/scala/ai/chronon/online/KVStore.scala`, there are two `create` methods: `def create(dataset: String): Unit` and `def create(dataset: String, props: Map[String, Any]): Unit`. The version with `props` ignores the `props` parameter, and the simpler version without `props` is appropriate when `props` are not needed.
⏰ Context from checks skipped due to timeout of 90000ms (17)
  • GitHub Check: service_tests
  • GitHub Check: service_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: api_tests
  • GitHub Check: online_tests
  • GitHub Check: online_tests
  • GitHub Check: flink_tests
  • GitHub Check: flink_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: bazel_config_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: aggregator_tests
  • GitHub Check: api_tests
🔇 Additional comments (42)
service_commons/BUILD.bazel (1)

12-12: Dependency updated for OpenTelemetry metrics.

Changed from StatsD to OTLP registry to support the metrics provider migration.

online/src/main/java/ai/chronon/online/JavaFetcher.java (2)

202-202: Constructor call updated with additional parameter.

Added null parameter to align with the Metrics.Context constructor change in the Scala codebase that added an optional dataset field.


206-206: Constructor call updated with additional parameter.

Added null parameter to align with the Metrics.Context constructor change in the Scala codebase that added an optional dataset field.

online/src/main/scala/ai/chronon/online/metrics/FlexibleExecutionContext.scala (1)

38-45:

✅ Verification successful

Thread pool executor instrumented with metrics.

Replaced ThreadPoolExecutor with InstrumentedThreadPoolExecutor to collect metrics on the thread pool. Also increased queue capacity from 1000 to 10000.


🏁 Script executed:

#!/bin/bash
# Checking for the InstrumentedThreadPoolExecutor class implementation
rg -A 5 "class InstrumentedThreadPoolExecutor"

Length of output: 924


Instrumentation confirmed; ready to merge

  • InstrumentedThreadPoolExecutor found in online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala
  • Queue capacity increased from 1,000 to 10,000
online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala (2)

437-440: Metrics.Context constructor formatting adjusted.

Reformatted the constructor call for better readability while maintaining the same functionality.


494-495: Metrics.Context constructor formatting adjusted.

Reformatted the constructor call for better readability while maintaining the same functionality.

tools/build_rules/dependencies/maven_repository.bzl (2)

73-73: LGTM - Micrometer OTLP registry added.

Version matches existing Micrometer dependencies.


226-232: OpenTelemetry dependencies look complete.

Comprehensive OTel dependency set with consistent versioning.

online/src/test/scala/ai/chronon/online/test/TagsTest.scala (4)

21-22: Appropriate imports for OTel integration.


30-30: Good test approach using no-op OpenTelemetry.

Using noop OpenTelemetry avoids external dependencies.


55-59: Tag cache verification logic properly updated.

Test logic updated to use OtelMetricsReporter's cache.


61-63: Appropriate assertion maintained.

Still verifies tag string equivalence with new implementation.

service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (3)

3-8: Imports properly updated for OTel.

StatsD imports replaced with OpenTelemetry equivalents.


18-20: Documentation correctly updated.

Javadoc reflects the migration to OpenTelemetry.


27-34: Better metrics enablement logic.

Using Optional for null-safe property parsing is a good practice.

docker/fetcher/start.sh (3)

4-11: Good environment variable validation.

Important to fail fast if required vars are missing.


13-19: Clear cloud provider selection logic.

Clean conditional for AWS/GCP configuration.


21-26: Appropriate metrics enablement check.

Using environment variable presence for configuration is a good pattern.

docker/fetcher/Dockerfile (1)

1-53: Dockerfile follows best practices

Environment variables are well-defined, dependencies are properly installed, and service configuration is clear.

online/BUILD.bazel (1)

1-9: Comprehensive OpenTelemetry dependencies

All necessary OpenTelemetry components are included.

online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (1)

1-113: Well-implemented thread pool instrumentation

The implementation properly captures execution metrics and reports thread pool state.

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1)

143-148: Tag value is table, not dataset.

metricsContext.copy(dataset = targetId.toString) records the physical table id in the dataset tag.
Down-stream dashboards may expect the logical dataset name. Double-check the intended semantics.

online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (1)

130-146: Static client selection locks system properties.

client is initialised once; later tests or sub-modules changing System.setProperty won’t affect reporter choice. If hot-reloading matters, lazy-create per call.

maven_install.json (19)

3-4: Skip autogenerated metadata update.


1197-1203: Add OkHttp & Okio artifacts.


1778-1784: Add Micrometer OTLP registry dependency.


2131-2137: Add OpenTelemetry proto (alpha). Confirm alpha stability before promotion.


2147-2151: Add opentelemetry-api dependency.


2161-2242: Bulk add OpenTelemetry SDK & exporters. Validate version alignment across all modules.


2257-2284: Add Prometheus client libraries.


4172-4184: Add Kotlin stdlib JDK7/JDK8.


6420-6426: Update dependency graph for OkHttp/Okio.


6738-6741: Update Micrometer OTLP registry graph.


7007-7024: Update OpenTelemetry dependency graph. Verify no missing transitive deps.


8476-8481: Update Kotlin stdlib graph.


11312-11330: Update OkHttp internal packages mapping.


12166-12168: Update Micrometer OTLP graph.


12540-12549: Update OpenTelemetry proto graph.


12578-12600: Update exporter-common & OTLP provider graph.


12628-12631: Update autoconfigure SPI graph.


12678-12699: Update Prometheus metrics graph.


30256-30329: Add autoconfigure providers for OTLP & Prometheus.

Comment on lines +59 to +71
override def longGauge(metric: String, value: Long, tags: Map[String, String] = Map.empty)(implicit
context: Context): Unit = {
val gauge = longGauges.getOrElseUpdate(metric, meter.gaugeBuilder(metric).ofLongs().build())
val mergedAttributes = mergeAttributes(tagCache(context), tags)
gauge.set(value, mergedAttributes)
}

override def doubleGauge(metric: String, value: Double, tags: Map[String, String] = Map.empty)(implicit
context: Context): Unit = {
val gauge = doubleGauges.getOrElseUpdate(metric, meter.gaugeBuilder(metric).build())
val mergedAttributes = mergeAttributes(tagCache(context), tags)
gauge.set(value, mergedAttributes)
}
Contributor

⚠️ Potential issue

LongGauge / DoubleGauge misuse – won’t compile.

OpenTelemetry gauge instruments are observable; gaugeBuilder(...).build() expects a callback and returns Observable*Gauge, which has no set method.
Current code will not compile. Use an UpDownCounter or asynchronous gauge with a polling callback.

-val gauge = longGauges.getOrElseUpdate(metric, meter.gaugeBuilder(metric).ofLongs().build())
-gauge.set(value, mergedAttributes)
+val counter = longGauges.getOrElseUpdate(metric, meter.upDownCounterBuilder(metric).ofLongs().build())
+counter.add(value, mergedAttributes)

}),
visibility = ["//visibility:public"],
deps = [
deps = OTEL_DEPS + [
Collaborator
ooc why do we need this here? could we just depend on the metrics_lib ?

Contributor Author
No - it seems that since the lib target includes all sources, we need to list the otel libs here too (the deps on metrics_lib don't seem to get included by default).

maven_artifact("net.bytebuddy:byte-buddy-agent"),
maven_artifact("org.apache.hadoop:hadoop-common"),
maven_artifact("org.apache.hadoop:hadoop-client-api"),
maven_artifact("io.opentelemetry:opentelemetry-api"),
Collaborator
ditto

@piyush-zlai piyush-zlai merged commit 2ff8387 into main May 2, 2025
21 checks passed
@piyush-zlai piyush-zlai deleted the piyush/metrics_reporter branch May 2, 2025 19:44
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
…r Fetcher threadpool (#726)

## Summary
PR to swap our metrics reporter from statsd to open telemetry metrics.
We need otel to allow us to capture metrics in our clients without the need of
a prometheus statsd exporter sidecar that they've seen issues with
occasionally. Otel in general is a popular metrics ingestion interface
with a number of supported backends (e.g. prom / datadog / gcloud / aws
cloudwatch). Wiring up Otel also enables us to set up traces and spans
in the repo in the future.
Broad changes:
- Decouple bulk of the metrics reporting logic from the Metrics.Context.
The metrics reporter we use is pluggable. Currently this is just the
OpenTelemetry but in principle we can support others in the future.
- Online module creates the appropriate [otel
SDK](https://opentelemetry.io/docs/languages/java/sdk/) - either we use
the [Http provider or the Prometheus Http
server](https://opentelemetry.io/docs/languages/java/configuration/#properties-exporters).
We need the Http provider to plug into Vert.x as their Micrometer
integration works with that. The Prom http server is what our clients are keen
we use.

## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

Tested via docker container and a local instance of open telemetry:
Start up fetcher docker svc
```
docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json  -p 9000:9000  -e "GCP_PROJECT_ID=canary-443022"  -e "GOOGLE_CLOUD_PROJECT=canary-443022"  -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance"  -e "EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318"  -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json  zipline-fetcher:latest
```

And then otel:
```
./otelcol --config otel-collector-config.yaml
...
```

We see:
```
2025-04-18T17:35:37.351-0400	info	ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> service.name: Str(ai.chronon)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.49.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope ai.chronon 3.7.0-M11
Metric #0
Descriptor:
     -> Name: kv_store.bigtable.cache.insert
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Value: 1
Metric #1
Descriptor:
     -> Name: kv_store.bigtable.multiGet.latency
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Count: 1
Sum: 229.000000
Min: 229.000000
Max: 229.000000
ExplicitBounds #0: 0.000000
...
Buckets #0, Count: 0
...
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced OpenTelemetry-based metrics reporting throughout the
platform, replacing the previous StatsD approach.
- Added a Dockerfile and startup script for a new Fetcher service,
supporting both AWS and GCP integrations with configurable metrics
export.
- Enhanced thread pool monitoring with a new executor that provides
detailed metrics on task execution and queue status.

- **Improvements**
- Metrics tags are now structured as key-value maps, improving clarity
and flexibility.
- Metrics reporting is now context-aware, supporting per-dataset and
per-table metrics.
- Increased thread pool queue capacity for better throughput under load.
- Replaced StatsD metrics configuration with OpenTelemetry OTLP in
service launcher and build configurations.

- **Bug Fixes**
- Improved error handling and logging in metrics reporting and thread
pool management.

- **Chores**
- Updated dependencies to include OpenTelemetry, Micrometer OTLP
registry, Prometheus, OkHttp, and Kotlin libraries.
- Refactored build and test configurations to support new telemetry
libraries and remove deprecated dependencies.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
…r Fetcher threadpool (#726)

chewy-zlai pushed a commit that referenced this pull request May 16, 2025
…r Fetcher threadpool (#726)

