
Emit health metrics for OTel SDK metric collection and export#145179

Merged
mamazzol merged 24 commits into elastic:main from mamazzol:otel-health
Apr 3, 2026

@mamazzol
Contributor

@mamazzol mamazzol commented Mar 30, 2026

Relevant changes in the PR:

  • Bumped the OTEL SDK version so that we get health metrics from PeriodicMetricReader, which were only added in 1.60.x.
    • Bumped the telemetry library accordingly, as it is developed against a specific SDK version.
  • Passed a dedicated health MeterProvider to the OTLPExporter so that metrics related to the export can be collected.
  • Further disabled APM Agent code injection when the OTEL SDK is in use, as it was conflicting with the metric export. This will probably never be executed in production, since we will cleanly replace both metrics and traces, but it is still good to have in case we ever decide to run metrics-only for some time.
  • Refactored things a bit to make the code cleaner.

The metrics we get with this PR are:

  • otel.sdk.metric_reader.collection.duration
  • otel.sdk.exporter.metric_data_point.exported
  • otel.sdk.exporter.metric_data_point.inflight
  • otel.sdk.exporter.operation.duration
    These new metrics are described here

--

The PR also contains a small refactor to add a toDuration method to TimeValue.
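
For readers unfamiliar with the conversion, here is a minimal sketch of what a toDuration method could look like. It assumes a simplified, hypothetical stand-in named SimpleTimeValue that holds only an amount and a TimeUnit; it is not the actual TimeValue class from the PR, whose implementation is not shown in this conversation.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified stand-in for TimeValue; only the two fields
// needed to illustrate the conversion are modelled here.
record SimpleTimeValue(long duration, TimeUnit timeUnit) {

    // Converts to java.time.Duration using the unit the value was created with.
    Duration toDuration() {
        return Duration.of(duration, timeUnit.toChronoUnit());
    }

    public static void main(String[] args) {
        SimpleTimeValue timeout = new SimpleTimeValue(10, TimeUnit.SECONDS);
        System.out.println(timeout.toDuration()); // prints "PT10S"
    }
}
```

A helper like this would avoid repeating unit arithmetic at call sites that need a java.time.Duration.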

ES-14162

@mamazzol mamazzol requested review from a team as code owners March 30, 2026 10:48
@elasticsearchmachine elasticsearchmachine added the v9.4.0 and needs:triage (Requires assignment of a team area label) labels Mar 30, 2026
@mamazzol mamazzol added the >refactoring, Team:Core/Infra (Meta label for core/infra team), and :Core/Infra/Metrics (Metrics and metering infrastructure) labels and removed the needs:triage label Mar 30, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@coderabbitai
Contributor

coderabbitai Bot commented Mar 30, 2026

Walkthrough

This PR updates OpenTelemetry dependencies and switches runtime telemetry module usage to the generic runtime_telemetry. Verification metadata and Gradle build scripts are updated. Code changes add a static helper to merge disable_instrumentations for APM JVM options, refactor meter initialization into a lazily-created resources record in OTelSdkMeterSupplier, and adjust RunTask to enable OTEL metrics when requested. It also adds TimeValue.toDuration(), a new OtelMetricsIT test, entitlement-policy mapping tweaks, and a minor OkHttp timeout change.
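
The walkthrough mentions a static helper that merges disable_instrumentations values but does not show it. The sketch below illustrates one plausible way to merge two comma-separated lists. The class name, method signature, the deduplication and ordering behavior, and the placeholder group names are all assumptions made for illustration, not the PR's actual code.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; names and behavior are assumptions, not the PR's code.
final class DisableInstrumentationsMerge {

    // Merges two comma-separated lists: trims entries, drops blanks and
    // duplicates, and keeps first-seen order.
    static String merge(String existing, String additional) {
        Set<String> merged = new LinkedHashSet<>();
        for (String list : new String[] { existing, additional }) {
            if (list == null) {
                continue; // treat a missing setting as an empty list
            }
            for (String entry : list.split(",")) {
                String trimmed = entry.trim();
                if (!trimmed.isEmpty()) {
                    merged.add(trimmed);
                }
            }
        }
        return String.join(",", merged);
    }

    public static void main(String[] args) {
        // "group-a" etc. are placeholder names, not real instrumentation groups.
        System.out.println(merge("group-a, group-b", "group-b,group-c")); // prints "group-a,group-b,group-c"
    }
}
```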

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java (1)

35-49: ⚠️ Potential issue | 🔴 Critical

Complete the meterProvider → resources migration.

This refactor removes the standalone provider state, but attemptFlushMetrics() still references meterProvider at Line 115 and Line 117. That leaves the class uncompilable. Please switch that method to guard on resources and flush resources.meterProvider() instead so the pre-init path remains a no-op.

Suggested fix
     @Override
     public void attemptFlushMetrics() {
         synchronized (mutex) {
-            if (meterProvider != null) {
+            if (resources != null) {
                 // If the timeout expires, this quietly returns, which is ok in this context.
-                meterProvider.forceFlush().join(10, TimeUnit.SECONDS);
+                resources.meterProvider().forceFlush().join(10, TimeUnit.SECONDS);
             }
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java`
around lines 35 - 49, The attemptFlushMetrics() method in OTelSdkMeterSupplier
still references the removed meterProvider field causing compilation errors;
update attemptFlushMetrics() to check if resources is non-null before proceeding
(guard on resources) and call resources.meterProvider().forceFlush() (or the
existing flush call) instead of meterProvider so the pre-init path is a no-op
and flushing uses the new resources state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: d0cc285b-4719-4b75-9da8-2149686d02ef

📥 Commits

Reviewing files that changed from the base of the PR and between a215113 and 0fb8b2f.

📒 Files selected for processing (1)
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java (1)

112-120: Minor: attemptFlushMetrics only flushes main provider

meterHealthMeterProvider isn't flushed here, so health metrics about the exporter won't be immediately available after this call. Since close() properly shuts down both providers (which triggers their final flush), this isn't data loss—just a timing gap.

If immediate visibility of health metrics matters at flush time, consider:

 public void attemptFlushMetrics() {
     synchronized (mutex) {
         if (resources != null) {
             // If the timeout expires, this quietly returns, which is ok in this context.
             resources.meterProvider.forceFlush().join(10, TimeUnit.SECONDS);
+            resources.meterHealthMeterProvider.forceFlush().join(10, TimeUnit.SECONDS);
         }
     }
 }

Not blocking—current design is functional.
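
To make the two review points concrete (guarding on a lazily created resources holder, and flushing both providers), here is a self-contained sketch. It swaps in a hypothetical FakeProvider backed by CompletableFuture in place of the OTel SDK provider types, so the class names, the init() method, and the future-based waiting are assumptions for illustration and differ from the real OTelSdkMeterSupplier.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

final class LazyFlushSketch {

    // Hypothetical stand-in for an SDK meter provider; it only counts flushes.
    static final class FakeProvider {
        final AtomicInteger flushes = new AtomicInteger();

        CompletableFuture<Void> forceFlush() {
            flushes.incrementAndGet();
            return CompletableFuture.completedFuture(null);
        }
    }

    // Both providers are created together and published as one reference.
    record Resources(FakeProvider meterProvider, FakeProvider healthMeterProvider) {}

    private final Object mutex = new Object();
    private Resources resources; // null until init() has been called

    Resources init() {
        synchronized (mutex) {
            if (resources == null) {
                resources = new Resources(new FakeProvider(), new FakeProvider());
            }
            return resources;
        }
    }

    void attemptFlushMetrics() {
        synchronized (mutex) {
            if (resources == null) {
                return; // the pre-init path stays a no-op
            }
            FakeProvider[] providers = { resources.meterProvider(), resources.healthMeterProvider() };
            for (FakeProvider provider : providers) {
                try {
                    provider.forceFlush().get(10, TimeUnit.SECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                } catch (ExecutionException | TimeoutException e) {
                    // Flushing is best effort here; a timeout or failure is tolerated.
                }
            }
        }
    }
}
```

Holding both providers in a single record means one null check covers the whole pre-init state, which is the property the first review comment asks to preserve.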

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java`
around lines 112 - 120, attemptFlushMetrics currently only forces a flush on
resources.meterProvider and omits flushing meterHealthMeterProvider, causing
health/exporter metrics to lag; update attemptFlushMetrics to also call
meterHealthMeterProvider.forceFlush().join(10, TimeUnit.SECONDS) (or equivalent
non-blocking invocation if preferred), keeping both calls inside the
synchronized(mutex) block and following the same timeout/semantics used for
resources so health metrics are flushed at the same time as the main provider
(see attemptFlushMetrics, resources, meterHealthMeterProvider, and close()).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 9ca8e794-6131-4b62-822b-a4414ed8e1a2

📥 Commits

Reviewing files that changed from the base of the PR and between 0fb8b2f and 66006de.

📒 Files selected for processing (1)
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java

Comment thread modules/apm/build.gradle
 .setMeterProvider(() -> meterProvider)
-.setAggregationTemporalitySelector(AggregationTemporalitySelector.deltaPreferred());
+.setAggregationTemporalitySelector(AggregationTemporalitySelector.deltaPreferred())
+.setInternalTelemetryVersion(InternalTelemetryVersion.LATEST);
Contributor

(This line here seems to be the real change in this commit; the rest is refactoring.)

Contributor

@mosche mosche left a comment

How about adding a Java REST test to check for the presence of the health metrics?

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java`:
- Around line 115-120: The duplicate post-flush calls cause a second collection
cycle; in OTelSdkMeterSupplier (look for resources.systemMeterProvider and
resources.meterHealthMeterProvider and their forceFlush().join(...) calls)
remove the second pair of forceFlush().join(10, TimeUnit.SECONDS) invocations so
only the initial systemMeterProvider.forceFlush().join(...) and
meterHealthMeterProvider.forceFlush().join(...) remain, preventing duplicate
exported samples on repeated /_flush_telemetry calls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 10bd325d-b9c2-433b-9255-6f0d42e94e3c

📥 Commits

Reviewing files that changed from the base of the PR and between 66006de and 9d6a989.

📒 Files selected for processing (5)
  • build-tools/src/main/java/org/elasticsearch/gradle/testclusters/RunTask.java
  • libs/core/src/main/java/org/elasticsearch/core/TimeValue.java
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/OtelMetricsIT.java
  • x-pack/extras/plugins/microsoft-graph-authz/src/main/java/org/elasticsearch/xpack/security/authz/microsoft/MicrosoftGraphAuthzRealm.java

Contributor

@prdoyle prdoyle left a comment

The flush duplication seems harmless to me. I can fix that in my next PR assuming the duplication is actually incorrect.

@mamazzol mamazzol enabled auto-merge (squash) April 1, 2026 16:12
@prdoyle
Contributor

prdoyle commented Apr 1, 2026

Sadly, the test failure looks real:


> Task :test:external-modules:test-apm-integration:javaRestTest
--
 
REPRODUCE WITH: ./gradlew ":test:external-modules:test-apm-integration:javaRestTest" --tests "org.elasticsearch.test.apmintegration.OtelMetricsIT.testOTelHealthMetrics" -Dtests.seed=B2A156C70A19C801 -Dtests.locale=kok-Latn-IN -Dtests.timezone=Indian/Cocos -Druntime.java=25
 
OtelMetricsIT > testOTelHealthMetrics FAILED
java.lang.AssertionError: Timeout waiting for OTel SDK health metrics. Missing: otel.sdk.metric_reader.collection.duration
    at __randomizedtesting.SeedInfo.seed([B2A156C70A19C801:9017BA491968E53A]:0)
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.assertTrue(Assert.java:42)
    at org.elasticsearch.test.apmintegration.OtelMetricsIT.testOTelHealthMetrics(OtelMetricsIT.java:74)
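
The failure message above suggests the test waits for a set of expected metric names and reports the ones that never arrived. The following is a rough sketch of that polling pattern only. AwaitMetricsSketch, its method signature, and the 50 ms poll interval are hypothetical and are not taken from OtelMetricsIT, whose source is not shown in this conversation.

```java
import java.time.Duration;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

// Hypothetical polling helper; not the code used by OtelMetricsIT.
final class AwaitMetricsSketch {

    // Polls seenNames until every expected name has appeared or the timeout
    // elapses. Returns the names still missing (empty on success).
    static Set<String> awaitMetrics(Set<String> expected, Supplier<Set<String>> seenNames, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        Set<String> missing = new TreeSet<>(expected);
        while (true) {
            missing.removeAll(seenNames.get());
            if (missing.isEmpty() || System.nanoTime() - deadline >= 0) {
                return missing;
            }
            try {
                Thread.sleep(50); // assumed poll interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return missing;
            }
        }
    }
}
```

Returning the missing names, rather than a boolean, lets the caller build a message of the form "Missing: ..." like the one in the report above.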


@mamazzol mamazzol merged commit ac64506 into elastic:main Apr 3, 2026
35 checks passed
@mamazzol mamazzol deleted the otel-health branch April 7, 2026 07:41
mromaios pushed a commit to mromaios/elasticsearch that referenced this pull request Apr 9, 2026

Labels

:Core/Infra/Metrics (Metrics and metering infrastructure), >refactoring, Team:Core/Infra (Meta label for core/infra team), v9.4.0

4 participants