Emit JVM metrics with OTEL SDK by mamazzol · Pull Request #143059 · elastic/elasticsearch

mamazzol · 2026-02-25T13:37:45Z

In this PR, we introduce OTEL SDK and use it to emit the registered metrics using the OTEL SDK directly.

Summary

OTEL Version bump: this is extracted to Bump OpenTelemetry dependencies #142323 to separate the possible side effect investigation. I will merge main back when that is merged.
The switch between APM Agent and OTEL is driven by a sysprop that enables the OTEL SDK initialisation and disables the APM Agent metrics via setting. This allows to still attach the agent for traces, if needed.
APMMeterService is agnostic after creation, as it just has a Supplier. The new OTelSdkMeterSupplier will be responsible for the initialisation and setup. For now this is minimal and mostly default.
SystemMetrics class exists to close the gap between the two metrics set. Eventually, we can consider what we want to support, but for migration purposes we are emitting all the JMX metrics emitted by APM Agent.
The new OTEL Semconv for JXX metrics is covered by using the runtime library, currently relying on JMX features instead of JFR. This means we will have duplicate metrics until the dashboards will be migrated to the OTEL versions.

APM Agent vs OTEL SDK metrics

I collected the APM Agent emitted metrics by using --with-apm-agent but a list is also available here

system.process.cpu.total.norm.pct: system CPU time in non-idle/non-iowait states, normalized by number of cores.
system.cpu.total.norm.pct: system CPU time in non-idle/non-iowait states, normalized by number of cores.
system.process.memory.size: total virtual memory size of the process.
system.memory.actual.free: “actual free” memory
system.memory.total: total physical memory.
jvm.thread.count: current number of live JVM threads (daemon + non-daemon).
jvm.fd.used / jvm.fd.max: current / maximum number of open file descriptors.
jvm.gc.alloc: approximation of total heap bytes allocated.
jvm.gc.count: total number of GC collections that have occurred (by GC name).
- attribute : name="G1 Concurrent GC"
jvm.gc.time: approximate accumulated GC elapsed time in milliseconds (by GC name).
- attribute : name="G1 Concurrent GC"
jvm.memory.heap.used / jvm.memory.heap.committed / jvm.memory.heap.max: used / committed / max heap memory.
jvm.memory.non_heap.used / jvm.memory.non_heap.committed / jvm.memory.non_heap.max
jvm.memory.heap.pool.used / jvm.memory.heap.pool.committed / jvm.memory.heap.pool.max: used / committed / max heap memory in a heap pool (by pool name)
- attribute : name="G1 Survivor Space"
jvm.memory.non_heap.pool.used / jvm.memory.non_heap.pool.committed / jvm.memory.non_heap.pool.max: used / committed / max non-heap memory in a heap pool (by pool name)
- attribute : name="Metaspace" / name="CodeHeap 'profiled nmethods'"

The OTEL SDK with the code currently added, emits the following metrics.

Manually added in SystemMetrics:
All the above, with the same attribute structure to migrate to OTEL SDK without breaking anything.
Emitted by the JMX from the RuntimeMetrics library:

jvm.class.count: Number of classes currently loaded.
jvm.class.loaded: Number of classes loaded since JVM start.
jvm.class.unloaded: Number of classes unloaded since JVM start.
jvm.cpu.count: Number of processors available to the Java virtual machine.
jvm.cpu.recent_utilization: Recent CPU utilization for the process as reported by the JVM.
jvm.cpu.time: CPU time used by the process as reported by the JVM.
jvm.gc.alloc: An approximation of the total amount of memory, in bytes, allocated in heap memory.
jvm.gc.duration: Duration of JVM garbage collection actions.
jvm.memory.committed: Measure of memory committed.
jvm.memory.limit: Measure of max obtainable memory.
jvm.memory.used: Measure of memory used.
jvm.memory.used_after_last_gc: Measure of memory used, as measured after the most recent garbage collection event on this pool.
jvm.thread.count: Number of executing platform threads.

ES-14013

elasticsearchmachine · 2026-02-25T13:38:56Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

rjernst

This is looking pretty good.

Broadly, can you please ensure the changes here that can be tested have appropriate tests?

server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java

rjernst · 2026-02-25T22:36:54Z

server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java

+            long free = parseMeminfoKbToBytes(meminfoLines, "MemFree:");
+            long buffers = parseMeminfoKbToBytes(meminfoLines, "Buffers:");
+            long cached = parseMeminfoKbToBytes(meminfoLines, "Cached:");
+            if (free > 0 && buffers > 0 && cached > 0) {


Should these be >= 0?

0 is used as a failure state in parseMeminfoKbToBytes and other places where a result of 0 would indicate a critical system situation. If that doesn't seem right, I can change that to be negative value so 0 can be an actual value to be used.

0 seems like a valid possible value for eg buffers or cached, so a negative error seems better.

...or just stop catching the NumberFormatException and IndexOutOfoundsException. It's already called in a try-catch block after all. Catching these exceptions, discarding them, returning a special value, and then checking for that special value in the caller seems like a rather complicated way to discard useful troubleshooting information.

server/src/main/java/org/elasticsearch/monitor/metrics/SystemMetrics.java

modules/apm/build.gradle

modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkSettings.java

distribution/tools/server-cli/src/main/java/org/elasticsearch/server/cli/APMJvmOptions.java

modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/APMMeterService.java

modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java

mamazzol · 2026-02-26T14:50:10Z

I added the tests I was initially developing as a different PR since I think it makes sense to have them all together.
It's mainly one integration test to verify that the metrics we want are emitted and then a change to allow running ES locally with a mock APM Server that can process OTLP.

mosche

a few comments from my side

mosche · 2026-03-02T10:24:31Z

build-tools/build.gradle

    implementation buildLibs.asm
    implementation buildLibs.jackson.core
    implementation buildLibs.jackson.databind
+    implementation "io.opentelemetry.proto:opentelemetry-proto:1.9.0-alpha"


are we using experimental alpha versions for the production code as well?

I think the underlying code is stable, the artifact is published as alpha but I see the otel-data plugin manually compiles the classes to avoid consuming this artifact directly (see). It's also an older version.

We could do that here, and find a smart Gradle project structure to share the task.
What do you think? Is this currently problematic?

The "alpha" dependencies look funny, but apparently they are all alpha and there's no non-alpha version of this. That's super weird.

https://mvnrepository.com/artifact/io.opentelemetry.proto/opentelemetry-proto/versions

It's also an older version.
What do you think? Is this currently problematic?

No, I don't think this matters much... just make sure to have them on a consistent version initially at least

mosche · 2026-03-02T10:38:13Z

modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/MetricValidator.java

+        // APM Java agent-compatible metric names (see https://www.elastic.co/docs/reference/apm/agents/java/metrics#metrics-jvm)
+        "system.cpu.*",
+        "system.memory.*",
+        "system.process.*",
+        "jvm.file_descriptor.*",
+        "jvm.gc.*",
+        "jvm.memory.*"


Wondering if we better avoid using our own abstractions for JVM metrics. Our abstraction doesn't provide any value here, actually worse - we know these are not going to pass validation and have to permit everything anyways.

Hey @mosche I'm not sure what you mean. What abstraction are you referring to?

We have to add an exception for every combination of metric name and attribute for validation purposes. Reason being that these don't match our ES specific naming conventions, because - of course - these are not ES specific. For that reason, it doesn't seem to make sense to use the ES wrapper around OTEL (with our MetricRegistry being the entry point). Instead, we could directly register these metrics with OTEL, which is fine here, but not exposed anywhere else.

That sounds like a good idea. I wonder if we could do it as a follow-on item, given that this PR did not introduce this scheme.

I don't have a strong opinion on this, we can absolutely do as a follow up (or decide to not do it).
However, in either case I suggest to move the JVM metric related exclusions somewhere else and not mix them with current invalid, but permitted for bwc exclusions.
There could be something like permittedSemanticConventions or similar ...

I'm not sure there's a distinction there: the metrics Matteo added also use names that are not compliant with the OTel naming conventions, right? And we're adding them only for BWC with existing dashboards.

Fair point! Happy to keep as is in that case 👍

modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/APMMeterService.java

mosche · 2026-03-02T15:12:55Z

modules/apm/build.gradle


 def otelVersion = '1.57.0'
-def otelSemconvVersion = '1.21.0-alpha'
+def otelInstrumentationVersion = '2.23.0-alpha'


This is for the JVM metrics only, right? Do we know if we can expect this to be stable soonish?

Yes it's only for the JVM metrics. I don't have a clear horizon on when this will become stable but the instrumentation itself is quite mature and I think we can keep it as is until we are confident in an upgrade.

Are the respective semantic conventions stable? Or do we risk migrating to new conventions that might be broken due to changes there?

...ntegration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/MetricsOTelSDKIT.java

prdoyle

I think your changes to RecordingApmServer will be incompatible with mine, but that's ok: I'm happy to resolve the conflicts.

I'm changing that class from capturing just strings to capturing meaningful telemetry information in a form that works for both OTLP and APM "intake" NDJSON formats so that tests can make format-independent assertions about the telemetry.

The version of RecordingApmServer I have in #143517 is not yet sufficient to capture OTLP metrics because it doesn't capture the "dimension" values, so I'll need to tweak it for that anyway.

prdoyle · 2026-03-04T13:53:45Z

...ntegration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/MetricsOTelSDKIT.java

+    );
+
+    @ClassRule
+    public static RecordingApmServer mockApmServer = new RecordingApmServer();


Perhaps consider calling this recordingApmServer because there's another unrelated class called MockApmServer.

prdoyle · 2026-03-04T13:54:06Z

...egration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/RecordingApmServer.java

+        for (ResourceMetrics resourceMetrics : request.getResourceMetricsList()) {
+            for (ScopeMetrics scopeMetrics : resourceMetrics.getScopeMetricsList()) {
+                for (Metric metric : scopeMetrics.getMetricsList()) {
+                    received.offer(metric.getName());


This is recording only the metric name?

prdoyle · 2026-03-04T13:59:21Z

build-tools/src/main/java/org/elasticsearch/gradle/testclusters/MockApmServer.java

 * The HTTP server used is the JDK embedded com.sun.net.httpserver
 */
 @NotThreadSafe
 public class MockApmServer {


I wasn't aware of this class. I wonder whether we need both this and RecordingApmServer?

According to the javadocs, this one seems to be looking only for "transactions" (aka spans with no parents). That's pretty easy to do for OTLP too. But I wonder if the javadocs are obsolete now? It seems like the main thing this guy does is a lot of logging. 🤔

This is for local dev support when running a local cluster (build-tools), @prdoyle.
Most of what this does is filtering and formatting what's received to only log to console what the dev wants to see.

Interesting, I see. So more of a LoggingApmServer perhaps.

mosche · 2026-03-04T14:44:33Z

modules/apm/src/main/java/org/elasticsearch/telemetry/apm/APM.java

            APMAgentSettings.TELEMETRY_API_KEY_SETTING,
            // Metrics
            APMAgentSettings.TELEMETRY_METRICS_ENABLED_SETTING,
+            OTelSdkSettings.TELEMETRY_OTEL_METRICS_ENDPOINT,


Wondering, should we "announce" these settings conditionally based on the sys property to enable them?

It's an interesting question.

I've run into multiple weeks of churn and toil caused by being churlish about conditional support for CPS configuration in serverless.

I think I'd lean toward adding the settings unconditionally. The downside is limited because this OTel work will become unconditional soon enough (🤞) so it will be moot after that; and the risk seems low that someone will be surprised and disappointed that these settings don't do anything for them in the mean time. The upside is that users will have a simpler time migrating to the OTel settings generally because there are fewer interdependencies between the various settings. For example, they could get the OTel settings established before they commit to enabling it; and they could quickly disable OTel without also having to delete all their settings.

I think I'd lean toward adding the settings unconditionally.

👌

mosche · 2026-03-04T14:52:52Z

modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/MetricValidator.java

+                // https://github.com/elastic/dev/issues/3436 remove the usage of percentile as attribute and move to metric name.
+                || (metricName.startsWith("es.thread_pool.") && attribute.equals("percentile"))
+                // ML metrics use dot-separated attribute key
+                || (metricName.startsWith("es.ml.") && attribute.equals("es.ml.is_master"))


hmmm, where's that coming from and why didn't that fail previously? This change doesn't seem to be related to otel jvm metrics at all.

This has always been failing, but emitting metric with OTEL SDK makes this appear frequently in the logs so I noticed. Here's how it works:
The assertion produces an Exception, that is caught in the lambda inside that is declared inside the metric registration. So this happens at metrics collection/emitting times, so it's quite frequent.
When this happens, OTEL catches it and logs it.

I think we should discuss if we like this behaviour or not and address it in a separate item.
I am not sure what happens at the moment if this exception is thrown when APM agent collects the metric but it's possible it's just logged in a different place that is not very visible.

mosche · 2026-03-04T15:54:04Z

distribution/tools/server-cli/src/main/java/org/elasticsearch/server/cli/APMJvmOptions.java

        final Path agentJar = findAgentJar();

-        if (agentJar == null) {
+        if (attachAgent == false || agentJar == null) {


For now we must attach the agent, tracing and metrics can be enabled dynamically

Should we disable the agent based solely on OTEL_METRICS_ENABLED_SYSTEM_PROPERTY then?

I don't think we can do that yet.
The agent has to remain until both traces and metrics are covered (and the feature toggle for OTEL is on).

Why is that? Aren't we the only ones using OTEL_METRICS_ENABLED_SYSTEM_PROPERTY right now?

mosche

This is looking fairly good already!
Some concerns if the semantic conventions around JVM metrics are stable, or if we have to expect breaking changes there.

My comments are mostly optional, expect:

we must continue to attach the agent
why the new exclusions for ML?

prdoyle · 2026-03-05T15:30:42Z

server/src/main/java/org/elasticsearch/telemetry/TelemetryProvider.java


 public interface TelemetryProvider {

+    String OTEL_METRICS_ENABLED_SYSTEM_PROPERTY = "telemetry.otel.metrics.enabled";


I wonder: if we call this something like es.experimental.telemetry.otel.metrics.enabled would that dissuade people from using it before it's ready?

prdoyle

A few suggested changes, but nothing major.

prdoyle · 2026-03-09T16:08:23Z

modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/APMMeterService.java

-            }
+        try {
+            otelMeterSupplier.close();
+        } catch (Exception e) {


Since MeterSupplier has no throws clause, do we still need this catch clause? Is it just to catch runtime exceptions now?

Yes, I think this would be the place where we can catch any runtime exception and adopt any solution to make sure everything is flushed safely.

prdoyle · 2026-03-09T16:10:50Z

server/src/main/java/org/elasticsearch/monitor/metrics/SystemMetrics.java

-            "The number of opened file descriptors.",
-            "{file_descriptor}",
-            ProcessProbe::getOpenFileDescriptorCount
+    // TODO: remove when dashboards are migrated to OTel SDK auto-emitted jvm.memory.used, jvm.memory.committed,


If we're going to merge this TODO we should probably include a Jira or GitHub Issue link.

prdoyle · 2026-03-09T16:11:50Z

server/src/main/java/org/elasticsearch/monitor/metrics/SystemMetrics.java

        );
    }

+    // TODO: jvm.gc.count and jvm.gc.time can be removed when dashboards are migrated to OTel SDK auto-emitted


prdoyle · 2026-03-09T16:14:58Z

server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java

    long getTotalMemFromProcMeminfo() throws IOException {
        List<String> meminfoLines = readProcMeminfo();
-        return parseMeminfoKbToBytes(meminfoLines, "MemTotal:");
+        return Math.max(0, parseMeminfoKbToBytes(meminfoLines, "MemTotal:"));


This max will replace a -1 with a 0. Is that the correct default?

Yes I wanted to keep what it was doing before even after the change to parseMeminfoKbToBytes now returning -1 on error

...pm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/MetricsApmIT.java

prdoyle · 2026-03-09T16:20:15Z

...egration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/RecordingApmServer.java

    }

+    /**
+     * Parses OTLP protobuf metrics and normalizes them into the same JSON shape that the APM agent produces.


Huh, interesting approach. That should be helpful when I combine this with #143517.

test/external-modules/apm-integration/build.gradle

Emit JVM metrics with OTEL SDK

4ca3130

mamazzol requested review from a team as code owners February 25, 2026 13:37

elasticsearchmachine added needs:triage Requires assignment of a team area label v9.4.0 labels Feb 25, 2026

mamazzol added >refactoring Team:Core/Infra Meta label for core/infra team :Core/Infra/Metrics Metrics and metering infrastructure labels Feb 25, 2026

elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Feb 25, 2026

mamazzol added 2 commits February 25, 2026 13:59

Fix third party audit

0befcf4

IT and Mock APM for OTEL SDK metrics

392e66f

rjernst reviewed Feb 26, 2026

View reviewed changes

mamazzol added 2 commits February 26, 2026 14:15

Address comments

1e4f3b7

Merge branch 'main' into otel-sdk-metrics

1eedc1f

mamazzol added 2 commits February 27, 2026 10:20

Merge branch 'main' into otel-sdk-metrics

bf3fdda

Merge branch 'main' into otel-sdk-metrics

acd8435

mosche reviewed Mar 2, 2026

View reviewed changes

mamazzol mentioned this pull request Mar 4, 2026

APM telemetry tests #143517

Merged

prdoyle reviewed Mar 4, 2026

View reviewed changes

mosche reviewed Mar 4, 2026

View reviewed changes

prdoyle reviewed Mar 5, 2026

View reviewed changes

mamazzol added 2 commits March 9, 2026 15:39

Address review comments, fully duplicate APM metrics

762fcf1

Merge branch 'main' into otel-sdk-metrics

f8377fe

prdoyle approved these changes Mar 9, 2026

View reviewed changes

Addressing review comment

f2b3be5

mamazzol added 3 commits March 10, 2026 09:31

Merge branch 'main' into otel-sdk-metrics

1d35828

Merge branch 'main' into otel-sdk-metrics

d8a9767

Merge branch 'main' into otel-sdk-metrics

cabd064

mamazzol merged commit 85fce73 into elastic:main Mar 11, 2026
36 checks passed

mark-vieira mentioned this pull request Mar 12, 2026

[CI] MetricsApmIT testApmIntegration {withOTel=true} failing #144072

Closed

michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026

Emit JVM metrics with OTEL SDK (elastic#143059)

33b35ba


		public interface TelemetryProvider {

		String OTEL_METRICS_ENABLED_SYSTEM_PROPERTY = "telemetry.otel.metrics.enabled";

Conversation

mamazzol commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

APM Agent vs OTEL SDK metrics

Uh oh!

elasticsearchmachine commented Feb 25, 2026

Uh oh!

rjernst left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mamazzol Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

prdoyle Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mamazzol commented Feb 26, 2026

Uh oh!

mosche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

prdoyle left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mosche Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

mamazzol commented Feb 25, 2026 •

edited

Loading

mamazzol Feb 26, 2026 •

edited

Loading

prdoyle Mar 4, 2026 •

edited

Loading

mosche Mar 4, 2026 •

edited

Loading

prdoyle Mar 4, 2026 •

edited

Loading

mosche Mar 4, 2026 •

edited

Loading