Skip to content

Emit JVM metrics with OTEL SDK#143059

Merged
mamazzol merged 13 commits intoelastic:mainfrom
mamazzol:otel-sdk-metrics
Mar 11, 2026
Merged

Emit JVM metrics with OTEL SDK#143059
mamazzol merged 13 commits intoelastic:mainfrom
mamazzol:otel-sdk-metrics

Conversation

@mamazzol
Copy link
Copy Markdown
Contributor

@mamazzol mamazzol commented Feb 25, 2026

In this PR, we introduce OTEL SDK and use it to emit the registered metrics using the OTEL SDK directly.

Summary

  • OTEL Version bump: this is extracted to Bump OpenTelemetry dependencies #142323 to separate the possible side effect investigation. I will merge main back when that is merged.
  • The switch between APM Agent and OTEL is driven by a sysprop that enables the OTEL SDK initialisation and disables the APM Agent metrics via setting. This allows to still attach the agent for traces, if needed.
  • APMMeterService is agnostic after creation, as it just has a Supplier. The new OTelSdkMeterSupplier will be responsible for the initialisation and setup. For now this is minimal and mostly default.
  • SystemMetrics class exists to close the gap between the two metrics set. Eventually, we can consider what we want to support, but for migration purposes we are emitting all the JMX metrics emitted by APM Agent.
  • The new OTEL Semconv for JXX metrics is covered by using the runtime library, currently relying on JMX features instead of JFR. This means we will have duplicate metrics until the dashboards will be migrated to the OTEL versions.

APM Agent vs OTEL SDK metrics

I collected the APM Agent emitted metrics by using --with-apm-agent but a list is also available here

  • system.process.cpu.total.norm.pct: system CPU time in non-idle/non-iowait states, normalized by number of cores.
  • system.cpu.total.norm.pct: system CPU time in non-idle/non-iowait states, normalized by number of cores.
  • system.process.memory.size: total virtual memory size of the process.
  • system.memory.actual.free: “actual free” memory
  • system.memory.total: total physical memory.
  • jvm.thread.count: current number of live JVM threads (daemon + non-daemon).
  • jvm.fd.used / jvm.fd.max: current / maximum number of open file descriptors.
  • jvm.gc.alloc: approximation of total heap bytes allocated.
  • jvm.gc.count: total number of GC collections that have occurred (by GC name).
    • attribute : name="G1 Concurrent GC"
  • jvm.gc.time: approximate accumulated GC elapsed time in milliseconds (by GC name).
    • attribute : name="G1 Concurrent GC"
  • jvm.memory.heap.used / jvm.memory.heap.committed / jvm.memory.heap.max: used / committed / max heap memory.
  • jvm.memory.non_heap.used / jvm.memory.non_heap.committed / jvm.memory.non_heap.max
  • jvm.memory.heap.pool.used / jvm.memory.heap.pool.committed / jvm.memory.heap.pool.max: used / committed / max heap memory in a heap pool (by pool name)
    • attribute : name="G1 Survivor Space"
  • jvm.memory.non_heap.pool.used / jvm.memory.non_heap.pool.committed / jvm.memory.non_heap.pool.max: used / committed / max non-heap memory in a heap pool (by pool name)
    • attribute : name="Metaspace" / name="CodeHeap 'profiled nmethods'"

The OTEL SDK with the code currently added, emits the following metrics.

  1. Manually added in SystemMetrics:
    All the above, with the same attribute structure to migrate to OTEL SDK without breaking anything.

  2. Emitted by the JMX from the RuntimeMetrics library:

  • jvm.class.count: Number of classes currently loaded.
  • jvm.class.loaded: Number of classes loaded since JVM start.
  • jvm.class.unloaded: Number of classes unloaded since JVM start.
  • jvm.cpu.count: Number of processors available to the Java virtual machine.
  • jvm.cpu.recent_utilization: Recent CPU utilization for the process as reported by the JVM.
  • jvm.cpu.time: CPU time used by the process as reported by the JVM.
  • jvm.gc.alloc: An approximation of the total amount of memory, in bytes, allocated in heap memory.
  • jvm.gc.duration: Duration of JVM garbage collection actions.
  • jvm.memory.committed: Measure of memory committed.
  • jvm.memory.limit: Measure of max obtainable memory.
  • jvm.memory.used: Measure of memory used.
  • jvm.memory.used_after_last_gc: Measure of memory used, as measured after the most recent garbage collection event on this pool.
  • jvm.thread.count: Number of executing platform threads.

ES-14013

@mamazzol mamazzol requested review from a team as code owners February 25, 2026 13:37
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.4.0 labels Feb 25, 2026
@mamazzol mamazzol added >refactoring Team:Core/Infra Meta label for core/infra team :Core/Infra/Metrics Metrics and metering infrastructure labels Feb 25, 2026
@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Feb 25, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

Copy link
Copy Markdown
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good.

Broadly, can you please ensure the changes here that can be tested have appropriate tests?

long free = parseMeminfoKbToBytes(meminfoLines, "MemFree:");
long buffers = parseMeminfoKbToBytes(meminfoLines, "Buffers:");
long cached = parseMeminfoKbToBytes(meminfoLines, "Cached:");
if (free > 0 && buffers > 0 && cached > 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be >= 0?

Copy link
Copy Markdown
Contributor Author

@mamazzol mamazzol Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 is used as a failure state in parseMeminfoKbToBytes and other places where a result of 0 would indicate a critical system situation. If that doesn't seem right, I can change that to be negative value so 0 can be an actual value to be used.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 seems like a valid possible value for eg buffers or cached, so a negative error seems better.

Copy link
Copy Markdown
Contributor

@prdoyle prdoyle Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...or just stop catching the NumberFormatException and IndexOutOfoundsException. It's already called in a try-catch block after all. Catching these exceptions, discarding them, returning a special value, and then checking for that special value in the caller seems like a rather complicated way to discard useful troubleshooting information.

@mamazzol
Copy link
Copy Markdown
Contributor Author

I added the tests I was initially developing as a different PR since I think it makes sense to have them all together.
It's mainly one integration test to verify that the metrics we want are emitted and then a change to allow running ES locally with a mock APM Server that can process OTLP.

Copy link
Copy Markdown
Contributor

@mosche mosche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few comments from my side

implementation buildLibs.asm
implementation buildLibs.jackson.core
implementation buildLibs.jackson.databind
implementation "io.opentelemetry.proto:opentelemetry-proto:1.9.0-alpha"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we using experimental alpha versions for the production code as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the underlying code is stable, the artifact is published as alpha but I see the otel-data plugin manually compiles the classes to avoid consuming this artifact directly (see). It's also an older version.

We could do that here, and find a smart Gradle project structure to share the task.
What do you think? Is this currently problematic?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "alpha" dependencies look funny, but apparently they are all alpha and there's no non-alpha version of this. That's super weird.

https://mvnrepository.com/artifact/io.opentelemetry.proto/opentelemetry-proto/versions

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also an older version.
What do you think? Is this currently problematic?

No, I don't think this matters much... just make sure to have them on a consistent version initially at least

Comment on lines +71 to +77
// APM Java agent-compatible metric names (see https://www.elastic.co/docs/reference/apm/agents/java/metrics#metrics-jvm)
"system.cpu.*",
"system.memory.*",
"system.process.*",
"jvm.file_descriptor.*",
"jvm.gc.*",
"jvm.memory.*"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we better avoid using our own abstractions for JVM metrics. Our abstraction doesn't provide any value here, actually worse - we know these are not going to pass validation and have to permit everything anyways.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mosche I'm not sure what you mean. What abstraction are you referring to?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to add an exception for every combination of metric name and attribute for validation purposes. Reason being that these don't match our ES specific naming conventions, because - of course - these are not ES specific. For that reason, it doesn't seem to make sense to use the ES wrapper around OTEL (with our MetricRegistry being the entry point). Instead, we could directly register these metrics with OTEL, which is fine here, but not exposed anywhere else.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good idea. I wonder if we could do it as a follow-on item, given that this PR did not introduce this scheme.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion on this, we can absolutely do as a follow up (or decide to not do it).
However, in either case I suggest to move the JVM metric related exclusions somewhere else and not mix them with current invalid, but permitted for bwc exclusions.
There could be something like permittedSemanticConventions or similar ...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure there's a distinction there: the metrics Matteo added also use names that are not compliant with the OTel naming conventions, right? And we're adding them only for BWC with existing dashboards.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point! Happy to keep as is in that case 👍


def otelVersion = '1.57.0'
def otelSemconvVersion = '1.21.0-alpha'
def otelInstrumentationVersion = '2.23.0-alpha'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the JVM metrics only, right? Do we know if we can expect this to be stable soonish?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's only for the JVM metrics. I don't have a clear horizon on when this will become stable but the instrumentation itself is quite mature and I think we can keep it as is until we are confident in an upgrade.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the respective semantic conventions stable? Or do we risk migrating to new conventions that might be broken due to changes there?

@mamazzol mamazzol mentioned this pull request Mar 4, 2026
Copy link
Copy Markdown
Contributor

@prdoyle prdoyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your changes to RecordingApmServer will be incompatible with mine, but that's ok: I'm happy to resolve the conflicts.

I'm changing that class from capturing just strings to capturing meaningful telemetry information in a form that works for both OTLP and APM "intake" NDJSON formats so that tests can make format-independent assertions about the telemetry.

The version of RecordingApmServer I have in #143517 is not yet sufficient to capture OTLP metrics because it doesn't capture the "dimension" values, so I'll need to tweak it for that anyway.

);

@ClassRule
public static RecordingApmServer mockApmServer = new RecordingApmServer();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps consider calling this recordingApmServer because there's another unrelated class called MockApmServer.

for (ResourceMetrics resourceMetrics : request.getResourceMetricsList()) {
for (ScopeMetrics scopeMetrics : resourceMetrics.getScopeMetricsList()) {
for (Metric metric : scopeMetrics.getMetricsList()) {
received.offer(metric.getName());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is recording only the metric name?

* The HTTP server used is the JDK embedded com.sun.net.httpserver
*/
@NotThreadSafe
public class MockApmServer {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware of this class. I wonder whether we need both this and RecordingApmServer?

According to the javadocs, this one seems to be looking only for "transactions" (aka spans with no parents). That's pretty easy to do for OTLP too. But I wonder if the javadocs are obsolete now? It seems like the main thing this guy does is a lot of logging. 🤔

Copy link
Copy Markdown
Contributor

@mosche mosche Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for local dev support when running a local cluster (build-tools), @prdoyle.
Most of what this does is filtering and formatting what's received to only log to console what the dev wants to see.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I see. So more of a LoggingApmServer perhaps.

APMAgentSettings.TELEMETRY_API_KEY_SETTING,
// Metrics
APMAgentSettings.TELEMETRY_METRICS_ENABLED_SETTING,
OTelSdkSettings.TELEMETRY_OTEL_METRICS_ENDPOINT,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering, should we "announce" these settings conditionally based on the sys property to enable them?

Copy link
Copy Markdown
Contributor

@prdoyle prdoyle Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an interesting question.

I've run into multiple weeks of churn and toil caused by being churlish about conditional support for CPS configuration in serverless.

I think I'd lean toward adding the settings unconditionally. The downside is limited because this OTel work will become unconditional soon enough (🤞) so it will be moot after that; and the risk seems low that someone will be surprised and disappointed that these settings don't do anything for them in the mean time. The upside is that users will have a simpler time migrating to the OTel settings generally because there are fewer interdependencies between the various settings. For example, they could get the OTel settings established before they commit to enabling it; and they could quickly disable OTel without also having to delete all their settings.

Copy link
Copy Markdown
Contributor

@mosche mosche Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd lean toward adding the settings unconditionally.

👌

// https://github.com/elastic/dev/issues/3436 remove the usage of percentile as attribute and move to metric name.
|| (metricName.startsWith("es.thread_pool.") && attribute.equals("percentile"))
// ML metrics use dot-separated attribute key
|| (metricName.startsWith("es.ml.") && attribute.equals("es.ml.is_master"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, where's that coming from and why didn't that fail previously? This change doesn't seem to be related to otel jvm metrics at all.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has always been failing, but emitting metric with OTEL SDK makes this appear frequently in the logs so I noticed. Here's how it works:
The assertion produces an Exception, that is caught in the lambda inside that is declared inside the metric registration. So this happens at metrics collection/emitting times, so it's quite frequent.
When this happens, OTEL catches it and logs it.

I think we should discuss if we like this behaviour or not and address it in a separate item.
I am not sure what happens at the moment if this exception is thrown when APM agent collects the metric but it's possible it's just logged in a different place that is not very visible.

final Path agentJar = findAgentJar();

if (agentJar == null) {
if (attachAgent == false || agentJar == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we must attach the agent, tracing and metrics can be enabled dynamically

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we disable the agent based solely on OTEL_METRICS_ENABLED_SYSTEM_PROPERTY then?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can do that yet.
The agent has to remain until both traces and metrics are covered (and the feature toggle for OTEL is on).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that? Aren't we the only ones using OTEL_METRICS_ENABLED_SYSTEM_PROPERTY right now?

Copy link
Copy Markdown
Contributor

@mosche mosche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking fairly good already!
Some concerns if the semantic conventions around JVM metrics are stable, or if we have to expect breaking changes there.

My comments are mostly optional, expect:

  • we must continue to attach the agent
  • why the new exclusions for ML?


public interface TelemetryProvider {

String OTEL_METRICS_ENABLED_SYSTEM_PROPERTY = "telemetry.otel.metrics.enabled";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder: if we call this something like es.experimental.telemetry.otel.metrics.enabled would that dissuade people from using it before it's ready?

Copy link
Copy Markdown
Contributor

@prdoyle prdoyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few suggested changes, but nothing major.

}
try {
otelMeterSupplier.close();
} catch (Exception e) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since MeterSupplier has no throws clause, do we still need this catch clause? Is it just to catch runtime exceptions now?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this would be the place where we can catch any runtime exception and adopt any solution to make sure everything is flushed safely.

"The number of opened file descriptors.",
"{file_descriptor}",
ProcessProbe::getOpenFileDescriptorCount
// TODO: remove when dashboards are migrated to OTel SDK auto-emitted jvm.memory.used, jvm.memory.committed,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to merge this TODO we should probably include a Jira or GitHub Issue link.

);
}

// TODO: jvm.gc.count and jvm.gc.time can be removed when dashboards are migrated to OTel SDK auto-emitted
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

long getTotalMemFromProcMeminfo() throws IOException {
List<String> meminfoLines = readProcMeminfo();
return parseMeminfoKbToBytes(meminfoLines, "MemTotal:");
return Math.max(0, parseMeminfoKbToBytes(meminfoLines, "MemTotal:"));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This max will replace a -1 with a 0. Is that the correct default?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I wanted to keep what it was doing before even after the change to parseMeminfoKbToBytes now returning -1 on error

}

/**
* Parses OTLP protobuf metrics and normalizes them into the same JSON shape that the APM agent produces.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, interesting approach. That should be helpful when I combine this with #143517.

@mamazzol mamazzol merged commit 85fce73 into elastic:main Mar 11, 2026
36 checks passed
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Core/Infra/Metrics Metrics and metering infrastructure >refactoring Team:Core/Infra Meta label for core/infra team v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants