Swap our metrics provider from statsd to otel-metrics + instrument our Fetcher threadpool #726
Conversation
Walkthrough

This update introduces a comprehensive migration from StatsD-based metrics to OpenTelemetry (OTel) metrics reporting across the codebase. It adds a pluggable metrics reporter abstraction.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Service
    participant MetricsContext
    participant MetricsReporter
    participant OTelExporter
    Service->>MetricsContext: Create or retrieve context (with dataset)
    Service->>MetricsReporter: increment/count/gauge/distribution(metric, value, tags)
    MetricsReporter->>OTelExporter: Record metric (with attributes)
```
Actionable comments posted: 2
🔭 Outside diff range comments (2)
online/BUILD.bazel (1)
64-64: 💡 Verification agent / 🧩 Analysis chain
Remove unnecessary Datadog dependency
Since you're migrating to OpenTelemetry, the Datadog StatsD client may no longer be needed.
🏁 Script executed:
```shell
#!/bin/bash
# Check if the Datadog client is still used in the codebase
rg -A 3 -B 3 "com\.datadoghq\..*\.client" --type scala
```

Length of output: 56
🏁 Script executed:
```shell
#!/bin/bash
# search StatsD pkg
rg -A3 -B3 "com\.timgroup\.statsd"
# search StatsDClient
rg -A3 -B3 "StatsDClient"
```

Length of output: 390
🏁 Script executed:
```shell
#!/bin/bash
# search imports of Datadog StatsD in code
rg -A3 -B3 "import com\.datadoghq"
# search Datadog client class references
rg -A3 -B3 "NonBlockingStatsDClient"
```

Length of output: 73
Remove unused Datadog StatsD client dependency
No code imports or references to com.datadoghq:java-dogstatsd-client were found.

- File: online/BUILD.bazel, line 64 – delete maven_artifact("com.datadoghq:java-dogstatsd-client"),

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1)
94-99: 🛠️ Refactor suggestion
Cache can grow unbounded.
tableToContext retains one Context per Bigtable table name and never evicts. Today we only have a handful of tables (see learning), but if that assumption ever changes this will leak memory across the JVM lifetime. Consider an LRU/TTL cache or a WeakReference-based map.
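The bounded-cache suggestion can be sketched in plain Java (the class name and capacity here are illustrative, not code from this PR):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: once maxEntries is exceeded, the least-recently-accessed
// entry is evicted on the next put. A per-table Context cache built this way
// cannot grow without bound over the JVM lifetime.
class LruTableCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruTableCache(int maxEntries) {
        // accessOrder=true makes iteration order least-recently-accessed first
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

A TTL or `WeakReference`-based map trades differently: it bounds staleness rather than size, which matters more if contexts hold heavy resources.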
♻️ Duplicate comments (18)
maven_install.json (18)
24457-24462: Add OkHttp to sources list.
24623-24624: Add Micrometer OTLP to sources.
24729-24730: Add OpenTelemetry proto to sources.
24739-24748: Add exporter-common & OTLP to sources.
24765-24776: Add Prometheus config & exporters to sources.
25298-25301: Add Kotlin stdlib-jdk7/jdk8 to sources.
25925-25928: Add OkHttp to sources mapping.
26091-26094: Add Micrometer OTLP to sources mapping.
26197-26198: Add OpenTelemetry proto to sources mapping.
26207-26216: Add exporter-common & OTLP to sources mapping.
26233-26244: Add Prometheus config & exporters to sources mapping.
26766-26769: Add Kotlin stdlib-jdk7/jdk8 sources mapping.
27393-27398: Add OkHttp to sources mapping.
27559-27562: Add Micrometer OTLP to sources mapping.
27665-27666: Add OpenTelemetry proto to sources mapping.
27675-27684: Add exporter-common & OTLP to sources mapping.
27698-27704: Add Prometheus config & exporters to sources mapping.
28234-28237: Add Kotlin stdlib-jdk7/jdk8 sources mapping.
🧹 Nitpick comments (16)
service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (1)
36-56: Good extraction of metrics initialization. OTLP configuration properly set up with required parameters.
Consider extracting "ai.chronon" as a constant rather than hardcoding it.
```diff
- private void initializeMetrics(VertxOptions options) {
-   String serviceName = "ai.chronon";
+ private static final String SERVICE_NAME = "ai.chronon";
+
+ private void initializeMetrics(VertxOptions options) {
```

docker/fetcher/start.sh (1)
32-41: Java command properly configured. All required parameters passed with good error handling.
Consider adding file existence checks for JAR files before execution.
```diff
+if [ ! -f "$FETCHER_JAR" ]; then
+  echo "Error: Fetcher JAR $FETCHER_JAR not found"
+  exit 1
+fi
+
+if [ ! -f "$ONLINE_JAR" ]; then
+  echo "Error: Online JAR $ONLINE_JAR not found"
+  exit 1
+fi
+
```

docker/fetcher/Dockerfile (3)
15-22: Consider using a package manager for Python tools. Using pip with system Python can cause conflicts. Consider using a virtual environment.
26-28: Use an apt repository for Scala installation. Direct .deb download bypasses package management. Consider using official repositories.
47-47: Health check could be more robust. The current health check only verifies HTTP availability. Consider deeper health verification.
online/src/main/scala/ai/chronon/online/metrics/MetricsReporter.scala (1)
8-17: Well-designed metrics interface. Clean abstraction for metrics reporting with appropriate methods for different metric types.
Consider adding:
- Direct timing metric method
- Unit documentation for values
- Batch operation support
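To make the interface shape concrete, here is an illustrative reporter in plain Java; the method names mirror the ones discussed in this review, but the signatures are assumptions rather than the actual Scala trait:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative reporter abstraction: counters and distributions keyed by
// metric name, tagged with free-form key/value pairs.
interface SimpleMetricsReporter {
    void count(String metric, long value, Map<String, String> tags);
    void distribution(String metric, long value, Map<String, String> tags);
}

// In-memory implementation, useful for tests; a real backend (OTel, StatsD)
// would forward to its SDK instead of a HashMap.
class InMemoryReporter implements SimpleMetricsReporter {
    final Map<String, Long> counters = new HashMap<>();

    @Override
    public void count(String metric, long value, Map<String, String> tags) {
        counters.merge(metric, value, Long::sum);
    }

    @Override
    public void distribution(String metric, long value, Map<String, String> tags) {
        // a real reporter would feed a histogram; here we just track the sum
        counters.merge(metric + ".sum", value, Long::sum);
    }
}
```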
online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (3)
20-20: Make the metrics interval more configurable. Consider making this configurable via an environment variable for runtime tuning.
63-66: Enhance error handling. Add specific exception handling for different error types instead of catching all exceptions.
77-100: Consider higher precision timing. For very fast operations, millisecond precision may be insufficient. Consider using System.nanoTime().
```diff
-        val submitTime = System.currentTimeMillis()
+        val submitTime = System.nanoTime()
         val instrumentedTask = new Runnable {
           override def run(): Unit = {
-            val startTime = System.currentTimeMillis()
+            val startTime = System.nanoTime()
             val waitTime = startTime - submitTime
             // Record wait time
-            metricsContext.distribution("wait_time_ms", waitTime)
+            metricsContext.distribution("wait_time_ns", waitTime)
+            metricsContext.distribution("wait_time_ms", waitTime / 1_000_000)
             command.run()
-            val endTime = System.currentTimeMillis()
+            val endTime = System.nanoTime()
             val execTime = endTime - startTime
             val totalTime = endTime - submitTime
             // Record timing metrics
-            metricsContext.distribution("execution_time_ms", execTime)
-            metricsContext.distribution("total_time_ms", totalTime)
+            metricsContext.distribution("execution_time_ns", execTime)
+            metricsContext.distribution("execution_time_ms", execTime / 1_000_000)
+            metricsContext.distribution("total_time_ns", totalTime)
+            metricsContext.distribution("total_time_ms", totalTime / 1_000_000)
           }
         }
```

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (3)
196-199: Prefer System.nanoTime() for latency.
System.currentTimeMillis() is wall-clock and can jump; nanoTime() gives a monotonic duration.
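The suggested pattern can be sketched as follows (a minimal illustration, not the PR's code):

```java
// Measure with the monotonic clock and convert to milliseconds only at the
// reporting boundary; currentTimeMillis() can move backwards under NTP steps.
class LatencyTimer {
    static long timeMillis(Runnable task) {
        long start = System.nanoTime(); // monotonic, immune to clock jumps
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```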
218-220: Missing latency metric on failure.
On error you increment multiGet.bigtable_errors but skip a latency distribution, losing SLO visibility. Record the duration before returning.
333-336: DRY: context lookup duplicated.
The tableToContext lookup pattern appears in multiGet, list, and multiPut. Extract a small helper (getDatasetCtx) to avoid repetition.

online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (2)
176-201: Duplicate tag keys silently overwrite.
joinNames.foreach(addTag(Tag.Join, _)) keeps only the last join when multiple are present. If multi-join tagging is important, change the value to a concatenated string.
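A tiny illustration of the two behaviours (the helper names are hypothetical, not code from this PR):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinTags {
    // Last-write-wins: repeated puts under one tag key keep only the final join.
    static String lastWins(List<String> joins) {
        Map<String, String> tags = new HashMap<>();
        joins.forEach(j -> tags.put("join", j));
        return tags.get("join");
    }

    // Alternative from the review: fold all joins into one tag value so
    // no join name is silently dropped.
    static String concatenated(List<String> joins) {
        return String.join(",", joins);
    }
}
```

The concatenated form preserves information at the cost of higher tag cardinality, which is the trade-off the comment is pointing at.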
205-235: Metric name already contains environment.
prefix(metric) prepends environment. Passing the same value in the Environment tag duplicates information; consider dropping one for cardinality.

online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (2)
46-50: Attribute merge allocates on each call.
attributes.toBuilder creates a new builder per record. Cache common attribute sets (context plus no extra tags) separately to reduce allocation pressure on hot paths.
98-116: Prometheus & HTTP are mutually exclusive.
buildOtelMetricReader() ignores the case where both reader types are desired (e.g., HTTP for prod, Prometheus for local). Consider supporting dual registration or documenting the limitation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (16)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (14 hunks)
- docker/fetcher/Dockerfile (1 hunks)
- docker/fetcher/start.sh (1 hunks)
- maven_install.json (37 hunks)
- online/BUILD.bazel (4 hunks)
- online/src/main/java/ai/chronon/online/JavaFetcher.java (1 hunks)
- online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala (2 hunks)
- online/src/main/scala/ai/chronon/online/metrics/FlexibleExecutionContext.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (6 hunks)
- online/src/main/scala/ai/chronon/online/metrics/MetricsReporter.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/TagsTest.scala (2 hunks)
- service_commons/BUILD.bazel (1 hunks)
- service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (1 hunks)
- tools/build_rules/dependencies/maven_repository.bzl (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (2)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#657
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala:93-97
Timestamp: 2025-04-21T15:10:40.819Z
Learning: The BigTableKVStoreImpl in the chronon codebase only interacts with 4 BigTable tables total, so unbounded caching in tableToContext is not a concern.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#47
File: online/src/main/scala/ai/chronon/online/MetadataStore.scala:232-0
Timestamp: 2024-10-17T00:12:09.763Z
Learning: In the `KVStore` trait located at `online/src/main/scala/ai/chronon/online/KVStore.scala`, there are two `create` methods: `def create(dataset: String): Unit` and `def create(dataset: String, props: Map[String, Any]): Unit`. The version with `props` ignores the `props` parameter, and the simpler version without `props` is appropriate when `props` are not needed.
⏰ Context from checks skipped due to timeout of 90000ms (17)
- GitHub Check: service_tests
- GitHub Check: service_tests
- GitHub Check: cloud_gcp_tests
- GitHub Check: cloud_gcp_tests
- GitHub Check: service_commons_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: api_tests
- GitHub Check: online_tests
- GitHub Check: online_tests
- GitHub Check: flink_tests
- GitHub Check: flink_tests
- GitHub Check: aggregator_tests
- GitHub Check: bazel_config_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: aggregator_tests
- GitHub Check: api_tests
🔇 Additional comments (42)
service_commons/BUILD.bazel (1)
12-12: Dependency updated for OpenTelemetry metrics. Changed from StatsD to the OTLP registry to support the metrics provider migration.
online/src/main/java/ai/chronon/online/JavaFetcher.java (2)
202-202: Constructor call updated with additional parameter. Added a null parameter to align with the Metrics.Context constructor change in the Scala codebase that added an optional dataset field.
206-206: Constructor call updated with additional parameter. Added a null parameter to align with the Metrics.Context constructor change in the Scala codebase that added an optional dataset field.
online/src/main/scala/ai/chronon/online/metrics/FlexibleExecutionContext.scala (1)
38-45: ✅ Verification successful
Thread pool executor instrumented with metrics.
Replaced ThreadPoolExecutor with InstrumentedThreadPoolExecutor to collect metrics on the thread pool. Also increased queue capacity from 1000 to 10000.
🏁 Script executed:
```shell
#!/bin/bash
# Checking for the InstrumentedThreadPoolExecutor class implementation
rg -A 5 "class InstrumentedThreadPoolExecutor"
```

Length of output: 924
Instrumentation confirmed; ready to merge
- InstrumentedThreadPoolExecutor found in online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala
- Queue capacity increased from 1,000 to 10,000
online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala (2)
437-440: Metrics.Context constructor formatting adjusted. Reformatted the constructor call for better readability while maintaining the same functionality.
494-495: Metrics.Context constructor formatting adjusted. Reformatted the constructor call for better readability while maintaining the same functionality.
tools/build_rules/dependencies/maven_repository.bzl (2)
73-73: LGTM - Micrometer OTLP registry added. Version matches existing Micrometer dependencies.
226-232: OpenTelemetry dependencies look complete. Comprehensive OTel dependency set with consistent versioning.
online/src/test/scala/ai/chronon/online/test/TagsTest.scala (4)
21-22: Appropriate imports for OTel integration.
30-30: Good test approach using no-op OpenTelemetry. Using the noop OpenTelemetry avoids external dependencies.
55-59: Tag cache verification logic properly updated. Test logic updated to use OtelMetricsReporter's cache.
61-63: Appropriate assertion maintained. Still verifies tag string equivalence with the new implementation.
service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (3)
3-8: Imports properly updated for OTel. StatsD imports replaced with OpenTelemetry equivalents.
18-20: Documentation correctly updated. Javadoc reflects the migration to OpenTelemetry.
27-34: Better metrics enablement logic. Using Optional for null-safe property parsing is a good practice.
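The Optional-based enablement check can be sketched like this (a plain-Java illustration; the helper name and property handling are assumptions, not the launcher's actual code):

```java
import java.util.Optional;

class MetricsFlag {
    // Null-safe enablement check: an absent or non-"true" property value
    // disables metrics without any risk of a NullPointerException.
    static boolean metricsEnabled(String rawProperty) {
        return Optional.ofNullable(rawProperty)
                .map(String::trim)
                .map(Boolean::parseBoolean)
                .orElse(false);
    }
}
```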
docker/fetcher/start.sh (3)
4-11: Good environment variable validation. Important to fail fast if required vars are missing.
13-19: Clear cloud provider selection logic. Clean conditional for AWS/GCP configuration.
21-26: Appropriate metrics enablement check. Using environment variable presence for configuration is a good pattern.
docker/fetcher/Dockerfile (1)
1-53: Dockerfile follows best practices. Environment variables are well-defined, dependencies are properly installed, and service configuration is clear.
online/BUILD.bazel (1)
1-9: Comprehensive OpenTelemetry dependencies. All necessary OpenTelemetry components are included.
online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (1)
1-113: Well-implemented thread pool instrumentation. The implementation properly captures execution metrics and reports thread pool state.
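A stripped-down analogue of the instrumentation pattern (an illustrative sketch, not the actual InstrumentedThreadPoolExecutor; metric sinks are plain counters here, where the real class reports through the OTel-backed context):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Wraps each submitted task to record queue-wait time and execution time.
class TimedExecutor {
    private final ExecutorService delegate;
    final AtomicLong totalWaitMs = new AtomicLong();
    final AtomicLong totalExecMs = new AtomicLong();
    final AtomicLong completed = new AtomicLong();

    TimedExecutor(ExecutorService delegate) {
        this.delegate = delegate;
    }

    void execute(Runnable command) {
        long submit = System.nanoTime();
        delegate.execute(() -> {
            long start = System.nanoTime();
            totalWaitMs.addAndGet((start - submit) / 1_000_000); // time spent queued
            command.run();
            totalExecMs.addAndGet((System.nanoTime() - start) / 1_000_000);
            completed.incrementAndGet();
        });
    }

    void shutdownAndWait() {
        delegate.shutdown();
        try {
            delegate.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```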
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1)
143-148: Tag value is table, not dataset.
metricsContext.copy(dataset = targetId.toString) records the physical table id in the dataset tag.
Downstream dashboards may expect the logical dataset name. Double-check the intended semantics.

online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (1)
130-146: Static client selection locks system properties.
client is initialised once; later tests or sub-modules changing System.setProperty won't affect the reporter choice. If hot-reloading matters, lazy-create per call.

maven_install.json (19)
3-4: Skip autogenerated metadata update.
1197-1203: Add OkHttp & Okio artifacts.
1778-1784: Add Micrometer OTLP registry dependency.
2131-2137: Add OpenTelemetry proto (alpha). Confirm alpha stability before promotion.
2147-2151: Add opentelemetry-api dependency.
2161-2242: Bulk add OpenTelemetry SDK & exporters. Validate version alignment across all modules.
2257-2284: Add Prometheus client libraries.
4172-4184: Add Kotlin stdlib JDK7/JDK8.
6420-6426: Update dependency graph for OkHttp/Okio.
6738-6741: Update Micrometer OTLP registry graph.
7007-7024: Update OpenTelemetry dependency graph. Verify no missing transitive deps.
8476-8481: Update Kotlin stdlib graph.
11312-11330: Update OkHttp internal packages mapping.
12166-12168: Update Micrometer OTLP graph.
12540-12549: Update OpenTelemetry proto graph.
12578-12600: Update exporter-common & OTLP provider graph.
12628-12631: Update autoconfigure SPI graph.
12678-12699: Update Prometheus metrics graph.
30256-30329: Add autoconfigure providers for OTLP & Prometheus.
```scala
override def longGauge(metric: String, value: Long, tags: Map[String, String] = Map.empty)(implicit
    context: Context): Unit = {
  val gauge = longGauges.getOrElseUpdate(metric, meter.gaugeBuilder(metric).ofLongs().build())
  val mergedAttributes = mergeAttributes(tagCache(context), tags)
  gauge.set(value, mergedAttributes)
}

override def doubleGauge(metric: String, value: Double, tags: Map[String, String] = Map.empty)(implicit
    context: Context): Unit = {
  val gauge = doubleGauges.getOrElseUpdate(metric, meter.gaugeBuilder(metric).build())
  val mergedAttributes = mergeAttributes(tagCache(context), tags)
  gauge.set(value, mergedAttributes)
}
```
LongGauge / DoubleGauge misuse – won’t compile.
OpenTelemetry gauge instruments are observable; gaugeBuilder(...).build() expects a callback and returns Observable*Gauge, which has no set method.
Current code will not compile. Use an UpDownCounter or asynchronous gauge with a polling callback.
```diff
-val gauge = longGauges.getOrElseUpdate(metric, meter.gaugeBuilder(metric).ofLongs().build())
-gauge.set(value, mergedAttributes)
+val counter = longGauges.getOrElseUpdate(metric, meter.upDownCounterBuilder(metric).build())
+counter.add(value, mergedAttributes)
```
+counter.add(value, mergedAttributes)| }), | ||
| visibility = ["//visibility:public"], | ||
| deps = [ | ||
| deps = OTEL_DEPS + [ |
ooc why do we need this here? could we just depend on the metrics_lib ?
no it seems like as lib includes all sources, we need to list otel libs here too (the deps on metrics_lib don't seem to get included by default)
```diff
   maven_artifact("net.bytebuddy:byte-buddy-agent"),
   maven_artifact("org.apache.hadoop:hadoop-common"),
   maven_artifact("org.apache.hadoop:hadoop-client-api"),
+  maven_artifact("io.opentelemetry:opentelemetry-api"),
```
ditto
…r Fetcher threadpool (#726)

## Summary

PR to swap our metrics reporter from statsd to open telemetry metrics. We need otel to allow us to capture metrics in our clients without the need of a prometheus statsd exporter sidecar that they've seen issues with occasionally. Otel in general is a popular metrics ingestion interface with a number of supported backends (e.g. prom / datadog / gcloud / aws cloudwatch). Wiring up Otel also enables us to set up traces and spans in the repo in the future.

Broad changes:
- Decouple the bulk of the metrics reporting logic from the Metrics.Context. The metrics reporter we use is pluggable. Currently this is just OpenTelemetry, but in principle we can support others in the future.
- The online module creates the appropriate [otel SDK](https://opentelemetry.io/docs/languages/java/sdk/) - either we use the [Http provider or the Prometheus Http server](https://opentelemetry.io/docs/languages/java/configuration/#properties-exporters). We need the Http provider to plug into Vert.x as their Micrometer integration works with that. The Prom http server is what our clients are keen we use.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update

Tested via docker container and a local instance of open telemetry.

Start up fetcher docker svc:

```
docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json -p 9000:9000 -e "GCP_PROJECT_ID=canary-443022" -e "GOOGLE_CLOUD_PROJECT=canary-443022" -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance" -e "EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318" -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json zipline-fetcher:latest
```

And then otel:

```
./otelcol --config otel-collector-config.yaml
...
```

We see:

```
2025-04-18T17:35:37.351-0400 info ResourceMetrics #0
Resource SchemaURL:
Resource attributes:
     -> service.name: Str(ai.chronon)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.49.0)
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope ai.chronon 3.7.0-M11
Metric #0
Descriptor:
     -> Name: kv_store.bigtable.cache.insert
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Value: 1
Metric #1
Descriptor:
     -> Name: kv_store.bigtable.multiGet.latency
     -> Description:
     -> Unit:
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Count: 1
Sum: 229.000000
Min: 229.000000
Max: 229.000000
ExplicitBounds #0: 0.000000
...
Buckets #0, Count: 0
...
```

## Summary by CodeRabbit

- **New Features**
  - Introduced OpenTelemetry-based metrics reporting throughout the platform, replacing the previous StatsD approach.
  - Added a Dockerfile and startup script for a new Fetcher service, supporting both AWS and GCP integrations with configurable metrics export.
  - Enhanced thread pool monitoring with a new executor that provides detailed metrics on task execution and queue status.
- **Improvements**
  - Metrics tags are now structured as key-value maps, improving clarity and flexibility.
  - Metrics reporting is now context-aware, supporting per-dataset and per-table metrics.
  - Increased thread pool queue capacity for better throughput under load.
  - Replaced StatsD metrics configuration with OpenTelemetry OTLP in the service launcher and build configurations.
- **Bug Fixes**
  - Improved error handling and logging in metrics reporting and thread pool management.
- **Chores**
  - Updated dependencies to include OpenTelemetry, Micrometer OTLP registry, Prometheus, OkHttp, and Kotlin libraries.
  - Refactored build and test configurations to support new telemetry libraries and remove deprecated dependencies.