Emit health metrics for OTel SDK metric collection and export#145179
Emit health metrics for OTel SDK metric collection and export#145179mamazzol merged 24 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-core-infra (Team:Core/Infra) |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR updates OpenTelemetry dependencies and switches runtime telemetry module usage to the generic runtime_telemetry. Verification metadata and Gradle build scripts are updated. Code changes add a static helper to merge ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java (1)
35-49:⚠️ Potential issue | 🔴 CriticalComplete the
meterProvider→resourcesmigration.This refactor removes the standalone provider state, but
attemptFlushMetrics()still referencesmeterProviderat Line 115 and Line 117. That leaves the class uncompilable. Please switch that method to guard onresourcesand flushresources.meterProvider()instead so the pre-init path remains a no-op.Suggested fix
`@Override` public void attemptFlushMetrics() { synchronized (mutex) { - if (meterProvider != null) { + if (resources != null) { // If the timeout expires, this quietly returns, which is ok in this context. - meterProvider.forceFlush().join(10, TimeUnit.SECONDS); + resources.meterProvider().forceFlush().join(10, TimeUnit.SECONDS); } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java` around lines 35 - 49, The attemptFlushMetrics() method in OTelSdkMeterSupplier still references the removed meterProvider field causing compilation errors; update attemptFlushMetrics() to check if resources is non-null before proceeding (guard on resources) and call resources.meterProvider().forceFlush() (or the existing flush call) instead of meterProvider so the pre-init path is a no-op and flushing uses the new resources state.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In
`@modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java`:
- Around line 35-49: The attemptFlushMetrics() method in OTelSdkMeterSupplier
still references the removed meterProvider field causing compilation errors;
update attemptFlushMetrics() to check if resources is non-null before proceeding
(guard on resources) and call resources.meterProvider().forceFlush() (or the
existing flush call) instead of meterProvider so the pre-init path is a no-op
and flushing uses the new resources state.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: d0cc285b-4719-4b75-9da8-2149686d02ef
📒 Files selected for processing (1)
modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java
There was a problem hiding this comment.
🧹 Nitpick comments (1)
modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java (1)
112-120: Minor:attemptFlushMetricsonly flushes main provider
meterHealthMeterProviderisn't flushed here, so health metrics about the exporter won't be immediately available after this call. Sinceclose()properly shuts down both providers (which triggers their final flush), this isn't data loss—just a timing gap.If immediate visibility of health metrics matters at flush time, consider:
public void attemptFlushMetrics() { synchronized (mutex) { if (resources != null) { // If the timeout expires, this quietly returns, which is ok in this context. resources.meterProvider.forceFlush().join(10, TimeUnit.SECONDS); + resources.meterHealthMeterProvider.forceFlush().join(10, TimeUnit.SECONDS); } } }Not blocking—current design is functional.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java` around lines 112 - 120, attemptFlushMetrics currently only forces a flush on resources.meterProvider and omits flushing meterHealthMeterProvider, causing health/exporter metrics to lag; update attemptFlushMetrics to also call meterHealthMeterProvider.forceFlush().join(10, TimeUnit.SECONDS) (or equivalent non-blocking invocation if preferred), keeping both calls inside the synchronized(mutex) block and following the same timeout/semantics used for resources so health metrics are flushed at the same time as the main provider (see attemptFlushMetrics, resources, meterHealthMeterProvider, and close()).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java`:
- Around line 112-120: attemptFlushMetrics currently only forces a flush on
resources.meterProvider and omits flushing meterHealthMeterProvider, causing
health/exporter metrics to lag; update attemptFlushMetrics to also call
meterHealthMeterProvider.forceFlush().join(10, TimeUnit.SECONDS) (or equivalent
non-blocking invocation if preferred), keeping both calls inside the
synchronized(mutex) block and following the same timeout/semantics used for
resources so health metrics are flushed at the same time as the main provider
(see attemptFlushMetrics, resources, meterHealthMeterProvider, and close()).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 9ca8e794-6131-4b62-822b-a4414ed8e1a2
📒 Files selected for processing (1)
modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java
| .setMeterProvider(() -> meterProvider) | ||
| .setAggregationTemporalitySelector(AggregationTemporalitySelector.deltaPreferred()); | ||
| .setAggregationTemporalitySelector(AggregationTemporalitySelector.deltaPreferred()) | ||
| .setInternalTelemetryVersion(InternalTelemetryVersion.LATEST); |
There was a problem hiding this comment.
(This line here seems to be the real change in this commit; the rest is refactoring.)
mosche
left a comment
There was a problem hiding this comment.
how about adding a Java REST test to check for the presence of the health metrics?
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java`:
- Around line 115-120: The duplicate post-flush calls cause a second collection
cycle; in OTelSdkMeterSupplier (look for resources.systemMeterProvider and
resources.meterHealthMeterProvider and their forceFlush().join(...) calls)
remove the second pair of forceFlush().join(10, TimeUnit.SECONDS) invocations so
only the initial systemMeterProvider.forceFlush().join(...) and
meterHealthMeterProvider.forceFlush().join(...) remain, preventing duplicate
exported samples on repeated /_flush_telemetry calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 10bd325d-b9c2-433b-9255-6f0d42e94e3c
📒 Files selected for processing (5)
build-tools/src/main/java/org/elasticsearch/gradle/testclusters/RunTask.javalibs/core/src/main/java/org/elasticsearch/core/TimeValue.javamodules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.javatest/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/OtelMetricsIT.javax-pack/extras/plugins/microsoft-graph-authz/src/main/java/org/elasticsearch/xpack/security/authz/microsoft/MicrosoftGraphAuthzRealm.java
prdoyle
left a comment
There was a problem hiding this comment.
The flush duplication seems harmless to me. I can fix that in my next PR assuming the duplication is actually incorrect.
|
Sadly, the test failure looks real: |
Relevant changes in the PR:
The metrics we get with this PR are:
otel.sdk.metric_reader.collection.durationotel.sdk.exporter.metric_data_point.exportedotel.sdk.exporter.metric_data_point.inflightotel.sdk.exporter.operation.durationThese new metrics are described here
--
The PR also contains a small refactor to add a
toDurationmethod toTimeValue.ES-14162