Emit JVM metrics with OTEL SDK#143059
Conversation
|
Pinging @elastic/es-core-infra (Team:Core/Infra) |
rjernst
left a comment
There was a problem hiding this comment.
This is looking pretty good.
Broadly, can you please ensure the changes here that can be tested have appropriate tests?
| long free = parseMeminfoKbToBytes(meminfoLines, "MemFree:"); | ||
| long buffers = parseMeminfoKbToBytes(meminfoLines, "Buffers:"); | ||
| long cached = parseMeminfoKbToBytes(meminfoLines, "Cached:"); | ||
| if (free > 0 && buffers > 0 && cached > 0) { |
There was a problem hiding this comment.
0 is used as a failure state in parseMeminfoKbToBytes and other places where a result of 0 would indicate a critical system situation. If that doesn't seem right, I can change that to be negative value so 0 can be an actual value to be used.
There was a problem hiding this comment.
0 seems like a valid possible value for eg buffers or cached, so a negative error seems better.
There was a problem hiding this comment.
...or just stop catching the NumberFormatException and IndexOutOfoundsException. It's already called in a try-catch block after all. Catching these exceptions, discarding them, returning a special value, and then checking for that special value in the caller seems like a rather complicated way to discard useful troubleshooting information.
server/src/main/java/org/elasticsearch/monitor/metrics/SystemMetrics.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/monitor/metrics/SystemMetrics.java
Outdated
Show resolved
Hide resolved
modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkSettings.java
Outdated
Show resolved
Hide resolved
distribution/tools/server-cli/src/main/java/org/elasticsearch/server/cli/APMJvmOptions.java
Outdated
Show resolved
Hide resolved
modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/APMMeterService.java
Outdated
Show resolved
Hide resolved
modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java
Outdated
Show resolved
Hide resolved
|
I added the tests I was initially developing as a different PR since I think it makes sense to have them all together. |
mosche
left a comment
There was a problem hiding this comment.
a few comments from my side
build-tools/build.gradle
Outdated
| implementation buildLibs.asm | ||
| implementation buildLibs.jackson.core | ||
| implementation buildLibs.jackson.databind | ||
| implementation "io.opentelemetry.proto:opentelemetry-proto:1.9.0-alpha" |
There was a problem hiding this comment.
are we using experimental alpha versions for the production code as well?
There was a problem hiding this comment.
I think the underlying code is stable, the artifact is published as alpha but I see the otel-data plugin manually compiles the classes to avoid consuming this artifact directly (see). It's also an older version.
We could do that here, and find a smart Gradle project structure to share the task.
What do you think? Is this currently problematic?
There was a problem hiding this comment.
The "alpha" dependencies look funny, but apparently they are all alpha and there's no non-alpha version of this. That's super weird.
https://mvnrepository.com/artifact/io.opentelemetry.proto/opentelemetry-proto/versions
There was a problem hiding this comment.
It's also an older version.
What do you think? Is this currently problematic?
No, I don't think this matters much... just make sure to have them on a consistent version initially at least
| // APM Java agent-compatible metric names (see https://www.elastic.co/docs/reference/apm/agents/java/metrics#metrics-jvm) | ||
| "system.cpu.*", | ||
| "system.memory.*", | ||
| "system.process.*", | ||
| "jvm.file_descriptor.*", | ||
| "jvm.gc.*", | ||
| "jvm.memory.*" |
There was a problem hiding this comment.
Wondering if we better avoid using our own abstractions for JVM metrics. Our abstraction doesn't provide any value here, actually worse - we know these are not going to pass validation and have to permit everything anyways.
There was a problem hiding this comment.
Hey @mosche I'm not sure what you mean. What abstraction are you referring to?
There was a problem hiding this comment.
We have to add an exception for every combination of metric name and attribute for validation purposes. Reason being that these don't match our ES specific naming conventions, because - of course - these are not ES specific. For that reason, it doesn't seem to make sense to use the ES wrapper around OTEL (with our MetricRegistry being the entry point). Instead, we could directly register these metrics with OTEL, which is fine here, but not exposed anywhere else.
There was a problem hiding this comment.
That sounds like a good idea. I wonder if we could do it as a follow-on item, given that this PR did not introduce this scheme.
There was a problem hiding this comment.
I don't have a strong opinion on this, we can absolutely do as a follow up (or decide to not do it).
However, in either case I suggest to move the JVM metric related exclusions somewhere else and not mix them with current invalid, but permitted for bwc exclusions.
There could be something like permittedSemanticConventions or similar ...
There was a problem hiding this comment.
I'm not sure there's a distinction there: the metrics Matteo added also use names that are not compliant with the OTel naming conventions, right? And we're adding them only for BWC with existing dashboards.
There was a problem hiding this comment.
Fair point! Happy to keep as is in that case 👍
modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/APMMeterService.java
Outdated
Show resolved
Hide resolved
|
|
||
| def otelVersion = '1.57.0' | ||
| def otelSemconvVersion = '1.21.0-alpha' | ||
| def otelInstrumentationVersion = '2.23.0-alpha' |
There was a problem hiding this comment.
This is for the JVM metrics only, right? Do we know if we can expect this to be stable soonish?
There was a problem hiding this comment.
Yes it's only for the JVM metrics. I don't have a clear horizon on when this will become stable but the instrumentation itself is quite mature and I think we can keep it as is until we are confident in an upgrade.
There was a problem hiding this comment.
Are the respective semantic conventions stable? Or do we risk migrating to new conventions that might be broken due to changes there?
...ntegration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/MetricsOTelSDKIT.java
Outdated
Show resolved
Hide resolved
prdoyle
left a comment
There was a problem hiding this comment.
I think your changes to RecordingApmServer will be incompatible with mine, but that's ok: I'm happy to resolve the conflicts.
I'm changing that class from capturing just strings to capturing meaningful telemetry information in a form that works for both OTLP and APM "intake" NDJSON formats so that tests can make format-independent assertions about the telemetry.
The version of RecordingApmServer I have in #143517 is not yet sufficient to capture OTLP metrics because it doesn't capture the "dimension" values, so I'll need to tweak it for that anyway.
| ); | ||
|
|
||
| @ClassRule | ||
| public static RecordingApmServer mockApmServer = new RecordingApmServer(); |
There was a problem hiding this comment.
Perhaps consider calling this recordingApmServer because there's another unrelated class called MockApmServer.
| for (ResourceMetrics resourceMetrics : request.getResourceMetricsList()) { | ||
| for (ScopeMetrics scopeMetrics : resourceMetrics.getScopeMetricsList()) { | ||
| for (Metric metric : scopeMetrics.getMetricsList()) { | ||
| received.offer(metric.getName()); |
There was a problem hiding this comment.
This is recording only the metric name?
| * The HTTP server used is the JDK embedded com.sun.net.httpserver | ||
| */ | ||
| @NotThreadSafe | ||
| public class MockApmServer { |
There was a problem hiding this comment.
I wasn't aware of this class. I wonder whether we need both this and RecordingApmServer?
According to the javadocs, this one seems to be looking only for "transactions" (aka spans with no parents). That's pretty easy to do for OTLP too. But I wonder if the javadocs are obsolete now? It seems like the main thing this guy does is a lot of logging. 🤔
There was a problem hiding this comment.
This is for local dev support when running a local cluster (build-tools), @prdoyle.
Most of what this does is filtering and formatting what's received to only log to console what the dev wants to see.
There was a problem hiding this comment.
Interesting, I see. So more of a LoggingApmServer perhaps.
| APMAgentSettings.TELEMETRY_API_KEY_SETTING, | ||
| // Metrics | ||
| APMAgentSettings.TELEMETRY_METRICS_ENABLED_SETTING, | ||
| OTelSdkSettings.TELEMETRY_OTEL_METRICS_ENDPOINT, |
There was a problem hiding this comment.
Wondering, should we "announce" these settings conditionally based on the sys property to enable them?
There was a problem hiding this comment.
It's an interesting question.
I've run into multiple weeks of churn and toil caused by being churlish about conditional support for CPS configuration in serverless.
I think I'd lean toward adding the settings unconditionally. The downside is limited because this OTel work will become unconditional soon enough (🤞) so it will be moot after that; and the risk seems low that someone will be surprised and disappointed that these settings don't do anything for them in the mean time. The upside is that users will have a simpler time migrating to the OTel settings generally because there are fewer interdependencies between the various settings. For example, they could get the OTel settings established before they commit to enabling it; and they could quickly disable OTel without also having to delete all their settings.
There was a problem hiding this comment.
I think I'd lean toward adding the settings unconditionally.
👌
| // https://github.com/elastic/dev/issues/3436 remove the usage of percentile as attribute and move to metric name. | ||
| || (metricName.startsWith("es.thread_pool.") && attribute.equals("percentile")) | ||
| // ML metrics use dot-separated attribute key | ||
| || (metricName.startsWith("es.ml.") && attribute.equals("es.ml.is_master")) |
There was a problem hiding this comment.
hmmm, where's that coming from and why didn't that fail previously? This change doesn't seem to be related to otel jvm metrics at all.
There was a problem hiding this comment.
This has always been failing, but emitting metric with OTEL SDK makes this appear frequently in the logs so I noticed. Here's how it works:
The assertion produces an Exception, that is caught in the lambda inside that is declared inside the metric registration. So this happens at metrics collection/emitting times, so it's quite frequent.
When this happens, OTEL catches it and logs it.
I think we should discuss if we like this behaviour or not and address it in a separate item.
I am not sure what happens at the moment if this exception is thrown when APM agent collects the metric but it's possible it's just logged in a different place that is not very visible.
| final Path agentJar = findAgentJar(); | ||
|
|
||
| if (agentJar == null) { | ||
| if (attachAgent == false || agentJar == null) { |
There was a problem hiding this comment.
For now we must attach the agent, tracing and metrics can be enabled dynamically
There was a problem hiding this comment.
Should we disable the agent based solely on OTEL_METRICS_ENABLED_SYSTEM_PROPERTY then?
There was a problem hiding this comment.
I don't think we can do that yet.
The agent has to remain until both traces and metrics are covered (and the feature toggle for OTEL is on).
There was a problem hiding this comment.
Why is that? Aren't we the only ones using OTEL_METRICS_ENABLED_SYSTEM_PROPERTY right now?
mosche
left a comment
There was a problem hiding this comment.
This is looking fairly good already!
Some concerns if the semantic conventions around JVM metrics are stable, or if we have to expect breaking changes there.
My comments are mostly optional, expect:
- we must continue to attach the agent
- why the new exclusions for ML?
|
|
||
| public interface TelemetryProvider { | ||
|
|
||
| String OTEL_METRICS_ENABLED_SYSTEM_PROPERTY = "telemetry.otel.metrics.enabled"; |
There was a problem hiding this comment.
I wonder: if we call this something like es.experimental.telemetry.otel.metrics.enabled would that dissuade people from using it before it's ready?
prdoyle
left a comment
There was a problem hiding this comment.
A few suggested changes, but nothing major.
| } | ||
| try { | ||
| otelMeterSupplier.close(); | ||
| } catch (Exception e) { |
There was a problem hiding this comment.
Since MeterSupplier has no throws clause, do we still need this catch clause? Is it just to catch runtime exceptions now?
There was a problem hiding this comment.
Yes, I think this would be the place where we can catch any runtime exception and adopt any solution to make sure everything is flushed safely.
| "The number of opened file descriptors.", | ||
| "{file_descriptor}", | ||
| ProcessProbe::getOpenFileDescriptorCount | ||
| // TODO: remove when dashboards are migrated to OTel SDK auto-emitted jvm.memory.used, jvm.memory.committed, |
There was a problem hiding this comment.
If we're going to merge this TODO we should probably include a Jira or GitHub Issue link.
| ); | ||
| } | ||
|
|
||
| // TODO: jvm.gc.count and jvm.gc.time can be removed when dashboards are migrated to OTel SDK auto-emitted |
| long getTotalMemFromProcMeminfo() throws IOException { | ||
| List<String> meminfoLines = readProcMeminfo(); | ||
| return parseMeminfoKbToBytes(meminfoLines, "MemTotal:"); | ||
| return Math.max(0, parseMeminfoKbToBytes(meminfoLines, "MemTotal:")); |
There was a problem hiding this comment.
This max will replace a -1 with a 0. Is that the correct default?
There was a problem hiding this comment.
Yes I wanted to keep what it was doing before even after the change to parseMeminfoKbToBytes now returning -1 on error
...pm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/MetricsApmIT.java
Outdated
Show resolved
Hide resolved
...pm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/MetricsApmIT.java
Show resolved
Hide resolved
| } | ||
|
|
||
| /** | ||
| * Parses OTLP protobuf metrics and normalizes them into the same JSON shape that the APM agent produces. |
There was a problem hiding this comment.
Huh, interesting approach. That should be helpful when I combine this with #143517.
In this PR, we introduce OTEL SDK and use it to emit the registered metrics using the OTEL SDK directly.
Summary
APMMeterServiceis agnostic after creation, as it just has a Supplier. The newOTelSdkMeterSupplierwill be responsible for the initialisation and setup. For now this is minimal and mostly default.SystemMetricsclass exists to close the gap between the two metrics set. Eventually, we can consider what we want to support, but for migration purposes we are emitting all the JMX metrics emitted by APM Agent.APM Agent vs OTEL SDK metrics
I collected the APM Agent emitted metrics by using --with-apm-agent but a list is also available here
system.process.cpu.total.norm.pct: system CPU time in non-idle/non-iowait states, normalized by number of cores.system.cpu.total.norm.pct: system CPU time in non-idle/non-iowait states, normalized by number of cores.system.process.memory.size: total virtual memory size of the process.system.memory.actual.free: “actual free” memorysystem.memory.total: total physical memory.jvm.thread.count: current number of live JVM threads (daemon + non-daemon).jvm.fd.used/jvm.fd.max: current / maximum number of open file descriptors.jvm.gc.alloc: approximation of total heap bytes allocated.jvm.gc.count: total number of GC collections that have occurred (by GC name).attribute: name="G1 Concurrent GC"jvm.gc.time: approximate accumulated GC elapsed time in milliseconds (by GC name).attribute: name="G1 Concurrent GC"jvm.memory.heap.used/jvm.memory.heap.committed/jvm.memory.heap.max: used / committed / max heap memory.jvm.memory.non_heap.used/jvm.memory.non_heap.committed/jvm.memory.non_heap.maxjvm.memory.heap.pool.used/jvm.memory.heap.pool.committed/jvm.memory.heap.pool.max: used / committed / max heap memory in a heap pool (by pool name)attribute: name="G1 Survivor Space"jvm.memory.non_heap.pool.used/jvm.memory.non_heap.pool.committed/jvm.memory.non_heap.pool.max: used / committed / max non-heap memory in a heap pool (by pool name)attribute: name="Metaspace" / name="CodeHeap 'profiled nmethods'"The OTEL SDK with the code currently added, emits the following metrics.
Manually added in
SystemMetrics:All the above, with the same attribute structure to migrate to OTEL SDK without breaking anything.
Emitted by the JMX from the RuntimeMetrics library:
jvm.class.count: Number of classes currently loaded.jvm.class.loaded: Number of classes loaded since JVM start.jvm.class.unloaded: Number of classes unloaded since JVM start.jvm.cpu.count: Number of processors available to the Java virtual machine.jvm.cpu.recent_utilization: Recent CPU utilization for the process as reported by the JVM.jvm.cpu.time: CPU time used by the process as reported by the JVM.jvm.gc.alloc: An approximation of the total amount of memory, in bytes, allocated in heap memory.jvm.gc.duration: Duration of JVM garbage collection actions.jvm.memory.committed: Measure of memory committed.jvm.memory.limit: Measure of max obtainable memory.jvm.memory.used: Measure of memory used.jvm.memory.used_after_last_gc: Measure of memory used, as measured after the most recent garbage collection event on this pool.jvm.thread.count: Number of executing platform threads.ES-14013