APM telemetry tests #143517

Merged
prdoyle merged 17 commits into elastic:main from prdoyle:otel-tests
Mar 30, 2026

@prdoyle
Contributor

@prdoyle prdoyle commented Mar 3, 2026

This lays a foundation for removing the APM agent and exporting telemetry directly using the OTel API. The intent is to test the current behaviour in a way that we can continue testing once the APM agent is gone.

One hurdle is that the APM agent doesn't support the OpenTelemetry Protocol (OTLP) and instead uses Elastic's intake NDJSON format, so the tests need to be written in a way that is independent of the wire protocol. The existing mock server didn't work that way: it simply captured all the NDJSON text and let the test make assertions about the captured text.

To make this suitable for testing both before and after removing the APM agent, I've moved the parsing logic out of the test and into the mock server, which now parses incoming intake NDJSON into semantically meaningful data structures for each event. The tests now assert on those structures. The intent is that a future PR can add mock endpoints that accept OTLP and produce the same data structures, so we can test both protocols with the same assertions and gain high confidence that removing the APM agent hasn't made any meaningful difference to the telemetry stream.
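
As a rough illustration of protocol-independent assertions, here is a minimal sketch. It is not the PR's code: `TelemetryEvent`, `MetricSample`, `SpanEvent`, and `TelemetryAssertions` are hypothetical names, and the PR's actual `ReceivedTelemetry` type may be shaped quite differently. The point is only that the assertion helpers see decoded events and never touch NDJSON or protobuf bytes.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical protocol-neutral event model; the PR's ReceivedTelemetry may differ.
sealed interface TelemetryEvent permits MetricSample, SpanEvent {}

// One numeric observation of a named metric.
record MetricSample(String name, double value) implements TelemetryEvent {}

// One completed trace span.
record SpanEvent(String name, String traceId) implements TelemetryEvent {}

// Assertion helpers that only see decoded events, never the wire format.
final class TelemetryAssertions {
    private TelemetryAssertions() {}

    // Sums every sample of a metric, since an exporter may split one counter across batches.
    static double sumOf(List<TelemetryEvent> events, String metricName) {
        double total = 0;
        for (TelemetryEvent event : events) {
            if (event instanceof MetricSample sample && sample.name().equals(metricName)) {
                total += sample.value();
            }
        }
        return total;
    }

    // Finds the first span with the given name, if any was received.
    static Optional<SpanEvent> firstSpanNamed(List<TelemetryEvent> events, String spanName) {
        return events.stream()
            .filter(SpanEvent.class::isInstance)
            .map(SpanEvent.class::cast)
            .filter(span -> span.name().equals(spanName))
            .findFirst();
    }
}
```

With a model along these lines, an intake decoder and an OTLP decoder would both produce `List<TelemetryEvent>`, and a test would call `sumOf` or `firstSpanNamed` without knowing which decoder ran.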

I've also added tests for the metrics the APM agent already emits (e.g. JVM and system metrics), so we have coverage for those before switching.

Relates to ES-14012

@prdoyle prdoyle self-assigned this Mar 3, 2026
@prdoyle prdoyle added >non-issue :Core/Infra/Metrics Metrics and metering infrastructure v9.4.0 labels Mar 3, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label Mar 3, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@mamazzol
Contributor

mamazzol commented Mar 4, 2026

Just a passing comment, without having looked deeply at the code:

  1. Some of the names/attributes will be different in the OTel JVM metrics; will that break this?
  2. OTLP is meant to be sent to a different path, /v1/metrics; is this flexible enough to accommodate that?

In general, I found myself having the same issues in my PR when I tried to check the emitted metrics. I added another API to the server at the new path and added protobuf parsing.

This looks more flexible and longer-term, so I'm happy to make this work!

@prdoyle
Contributor Author

prdoyle commented Mar 4, 2026

Hey @mamazzol,

  1. This is meant to capture the existing metrics to ensure we don't break them. When we start emitting metrics ourselves, we will have to emit both the old and new ones. This test covers the old ones, and can be extended to cover the new ones once they are implemented, though the data model will need to change a little to capture the metric "dimensions" (which the old metrics just include in the name string).
  2. Yep, the plan is to add additional endpoints to the mock server so it can accept both NDJSON records and OTLP on different endpoints, and convert them both to the same data structures.

The protobuf parsing is pretty straightforward assuming we're allowed to pull in additional test dependencies. (Production dependencies are frowned upon, but test dependencies ought to be ok.)
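
To make point 2 concrete, here is a hedged sketch of a mock server core that picks a decoder by request path and stores everything in one shared list. All names are hypothetical, the HTTP layer is omitted, and the decoders are passed in as plain functions so that no JSON or protobuf parsing is implied. The `/v1/metrics` path comes from the discussion above; `/intake/v2/events` is an assumption about the intake endpoint the agent posts to.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: E stands in for the shared event type that both decoders produce.
final class DecodingEndpoints<E> {
    private final Map<String, Function<byte[], List<E>>> decodersByPath = new HashMap<>();
    private final List<E> received = new ArrayList<>();

    // Associates a request path with the decoder for the protocol served there.
    synchronized void register(String path, Function<byte[], List<E>> decoder) {
        decodersByPath.put(path, decoder);
    }

    // Decodes a request body with the decoder for its path; returns an HTTP-style status code.
    synchronized int handle(String path, byte[] body) {
        Function<byte[], List<E>> decoder = decodersByPath.get(path);
        if (decoder == null) {
            return 404; // no protocol is served at this path
        }
        received.addAll(decoder.apply(body));
        return 200;
    }

    // Returns an immutable copy of everything decoded so far, regardless of protocol.
    synchronized List<E> received() {
        return List.copyOf(received);
    }
}
```

A test would register one decoder under each path and then assert on `received()` alone, so the same assertions apply whichever endpoint the node under test happened to use.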


/**
* Parses a single line of APM intake NDJSON into a protocol-neutral {@link ReceivedTelemetry} event.
* Intake-specific; a future OTLP decoder will produce the same ADT from OTLP payloads.
Contributor Author

Heh Cursor can be so earnest about its intentions sometimes.

@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 147d65e and b55fe5d.

📒 Files selected for processing (1)
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/APMMeterService.java

📝 Walkthrough

Adds explicit flush hooks to telemetry: TelemetryProvider gains attemptFlushMetrics() and attemptFlushTraces(), implemented throughout test and APM code. APMMeterService now derives enabled from settings, introduces a DEFAULT_AGENT_INTERVAL constant, switches to a NoOpMeterSupplier, and exposes attemptFlushMetrics(), which delegates to the configured meter supplier. APMTracer adds attemptFlushTraces() with a bounded wait. Integration test infrastructure was overhauled with AbstractMetricsIT, OtelMetricsIT, ApmAgentMetricsIT, parsers, ReceivedTelemetry, RecordingApmServer, and a /_flush_telemetry REST handler.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/OtelMetricsIT.java`:
- Around line 21-24: The cluster builder currently hardcodes "127.0.0.1" for the
OTLP endpoint; change the endpoint construction in the static `cluster` (built
via `AbstractMetricsIT.baseClusterBuilder()`) to use the recorder's loopback
address from `recordingApmServer.getHttpAddress()` instead of a literal IPv4
address, e.g. use the `InetSocketAddress`/address's host string (via
`recordingApmServer.getHttpAddress().getHostString()` or equivalent) combined
with `recordingApmServer.getPort()` so the exporter hits the correct loopback
address (handles `::1` on IPv6-preferred systems).
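
One way to build such an endpoint string is sketched below. This is an illustration only: `LoopbackEndpoints.httpUrl` is a hypothetical helper, and it uses the numeric address from `InetSocketAddress.getAddress()` rather than the `getHostString()` call suggested above, so that IPv6 literals can be bracketed as URLs require.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.InetSocketAddress;

// Hypothetical helper, not part of the PR.
final class LoopbackEndpoints {
    private LoopbackEndpoints() {}

    // Builds an http:// URL for a bound socket address, bracketing IPv6 literals.
    static String httpUrl(InetSocketAddress address, String path) {
        InetAddress ip = address.getAddress();
        if (ip == null) {
            throw new IllegalArgumentException("unresolved address: " + address);
        }
        String host = ip.getHostAddress();
        if (ip instanceof Inet6Address) {
            host = "[" + host + "]"; // a bare IPv6 literal would be ambiguous with the port separator
        }
        return "http://" + host + ":" + address.getPort() + path;
    }
}
```

Called with the recorder's bound address and `"/v1/metrics"`, this yields a URL on whichever loopback address the server actually bound to.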

In
`@test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/RecordingApmServer.java`:
- Around line 37-42: The getReceivedMessages() implementation is draining the
ArrayBlockingQueue received (using poll) so it only returns leftovers; instead
maintain an append-only history list (e.g., a thread-safe
List<ReceivedTelemetry> retainedHistory) that the consumer thread and
addMessageConsumer(...) append to when they remove items from received, and
change getReceivedMessages() to return a snapshot of retainedHistory; update the
consumerThread() logic (referenced by messageConsumerThread and consumer) to add
each polled ReceivedTelemetry to retainedHistory before invoking the volatile
consumer, and ensure any existing code that currently drains received (the poll
calls) no longer discards items without recording them.
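
The retained-history idea can be sketched as follows. This is a simplified, hypothetical illustration rather than the PR's `RecordingApmServer`: it has no queue or consumer thread, and `T` stands in for `ReceivedTelemetry`. What it shows is that every event is appended to a history before any consumer sees it, and that reading the history never removes anything.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical recorder; T stands in for the PR's ReceivedTelemetry type.
final class RetainingRecorder<T> {
    private final List<T> history = new CopyOnWriteArrayList<>();
    private volatile Consumer<T> consumer = event -> {};

    // Called once for each decoded event.
    void record(T event) {
        history.add(event);     // retain first, so the event survives even if the consumer throws
        consumer.accept(event); // then notify whoever is currently listening
    }

    // Replaces the listener; events recorded earlier remain available through snapshot().
    void setConsumer(Consumer<T> consumer) {
        this.consumer = consumer;
    }

    // Returns an immutable copy of everything recorded so far; calling it repeatedly is safe.
    List<T> snapshot() {
        return List.copyOf(history);
    }
}
```

Because `snapshot()` copies instead of draining, two consecutive calls return the same events, which is the property the review comment asks for.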

📥 Commits

Reviewing files that changed from the base of the PR and between dee07c7 and b31b200.

📒 Files selected for processing (21)
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/APMMeterService.java
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/APMTelemetryProvider.java
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/MeterSupplier.java
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/tracing/APMTracer.java
  • server/src/main/java/org/elasticsearch/telemetry/TelemetryProvider.java
  • test/external-modules/apm-integration/build.gradle
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/AbstractMetricsIT.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/ApmAgentMetricsIT.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/ApmIntakeMessageParser.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/MetricsApmIT.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/OtelMetricsIT.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/OtlpMetricsParser.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/ReceivedTelemetry.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/RecordingApmServer.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/TracesApmIT.java
  • test/external-modules/apm-integration/src/main/java/org/elasticsearch/test/apmintegration/ApmIntegrationPlugin.java
  • test/external-modules/apm-integration/src/main/java/org/elasticsearch/test/apmintegration/FlushTelemetryRestHandler.java
  • test/framework/src/main/java/org/elasticsearch/telemetry/TestTelemetryPlugin.java
  • test/framework/src/main/java/org/elasticsearch/test/transport/MockTransportService.java
  • x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/transport/netty4/SecurityNetty4HttpServerTransportTlsHandshakeThrottleTests.java
💤 Files with no reviewable changes (1)
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/MetricsApmIT.java

* Parses a single line of APM intake NDJSON into a protocol-neutral {@link ReceivedTelemetry} event.
* Intake-specific; a future OTLP decoder will produce the same ADT from OTLP payloads.
*/
public final class ApmIntakeMessageParser {
Contributor

Can we use this to also assert metric attributes?

Contributor Author

I'm not sure what you mean. Are there existing assertions on attributes?

My intent was to generalize existing tests so they could be used as regression tests when we remove the APM agent. We could add more assertions in the future, of course, but I'd like to add them to this PR only if they already exist in the current tests.

Contributor

Something to keep in mind for a follow-up PR then! For complete coverage of metrics, attributes are important, so we need to find a way to assert on them as well.

Contributor Author

💯

import static org.elasticsearch.rest.RestRequest.Method.GET;

/**
* REST handler for tests that triggers a flush of all telemetry (traces, metrics) so tests can await export.
Contributor

What is this doing for the tests that wouldn't work on its own? Also, is this actually restricted to tests, or would it be available to call?

Contributor Author

The REST endpoint is just for tests. I added the flush methods because we're going to want them eventually anyway, but I don't think there's any need to expose them via a REST endpoint in production.

This endpoint provides a way for tests to cause metrics to be emitted to the RecordingApmServer at a predictable time, after which they can make assertions. Without this, we would just have to wait a while.

We could make the other test endpoints do flushes at the end automatically, but I opted for this because I thought in the future the tests would be more composable this way: they could call multiple REST endpoints in various combos and then do one flush.
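
The "several actions, then one flush" pattern might look like the sketch below. It is hypothetical and heavily simplified: `BatchingExporter` is not a class from the PR, measurements are plain strings, and the periodic export timer is left out entirely, which mimics an interval so long that only an explicit flush ever triggers an export.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical exporter that buffers measurements until flush() is called.
final class BatchingExporter {
    private final List<String> pending = new ArrayList<>();
    private final Consumer<List<String>> sink;

    // The sink receives each exported batch; in a test it would be the recording server.
    BatchingExporter(Consumer<List<String>> sink) {
        this.sink = sink;
    }

    // Buffers one measurement without exporting it.
    synchronized void record(String measurement) {
        pending.add(measurement);
    }

    // Exports everything buffered so far as a single batch and returns its size.
    synchronized int flush() {
        if (pending.isEmpty()) {
            return 0; // nothing to send, so the sink is not invoked
        }
        List<String> batch = List.copyOf(pending);
        pending.clear();
        sink.accept(batch);
        return batch.size();
    }
}
```

A test can call `record` any number of times, flush once, and then assert immediately on the single batch the sink received, instead of sleeping until an export interval elapses.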

public static ElasticsearchCluster cluster = AbstractMetricsIT.baseClusterBuilder()
    .systemProperty("telemetry.otel.metrics.enabled", "true")
    .setting("telemetry.otel.metrics.endpoint", () -> "http://127.0.0.1:" + recordingApmServer.getPort() + "/v1/metrics")
    .setting("telemetry.otel.metrics.interval", "10m") // one giant batch instead of multiple small ones with deltas we need to sum
Contributor

Was 10m a value you intended to use forever? Can it make suites go for way too long?

Contributor Author

It's essentially meant to be "forever" because a flush will happen first.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@test/external-modules/apm-integration/src/main/java/org/elasticsearch/test/apmintegration/FlushTelemetryRestHandler.java`:
- Around line 45-52: prepareRequest in FlushTelemetryRestHandler calls
telemetryProvider.get() without null-check, which can return null if the plugin
hasn't finished initialization; update prepareRequest to check the result of
telemetryProvider.get() before invoking
attemptFlushMetrics()/attemptFlushTraces() and handle the null case by sending
an appropriate error response (e.g., RestStatus.SERVICE_UNAVAILABLE with a short
message) instead of dereferencing null; reference the telemetryProvider.get()
call and the methods attemptFlushMetrics()/attemptFlushTraces() in the handler
to locate where to add the null-check and response path.
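
The suggested guard could be sketched like this. The types here are hypothetical stand-ins: the real handler works with Elasticsearch's REST classes and `TelemetryProvider`, whereas this sketch uses a plain `Supplier` and integer status codes so that it stays self-contained.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the two flush methods on TelemetryProvider.
interface FlushableTelemetry {
    void attemptFlushMetrics();

    void attemptFlushTraces();
}

// Hypothetical handler core; returns an HTTP status code instead of writing a REST response.
final class FlushHandlerSketch {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    private final Supplier<FlushableTelemetry> telemetryProvider;

    FlushHandlerSketch(Supplier<FlushableTelemetry> telemetryProvider) {
        this.telemetryProvider = telemetryProvider;
    }

    int handle() {
        FlushableTelemetry telemetry = telemetryProvider.get();
        if (telemetry == null) {
            return SERVICE_UNAVAILABLE; // provider not initialized yet: report it instead of dereferencing null
        }
        telemetry.attemptFlushMetrics();
        telemetry.attemptFlushTraces();
        return OK;
    }
}
```

Reading the supplier once into a local variable also avoids a second `get()` returning a different value between the check and the calls.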

📥 Commits

Reviewing files that changed from the base of the PR and between b31b200 and 147d65e.

✅ Files skipped from review due to trivial changes (2)
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/MeterSupplier.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/ReceivedTelemetry.java
🚧 Files skipped from review as they are similar to previous changes (10)
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/APMTelemetryProvider.java
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/OTelSdkMeterSupplier.java
  • test/framework/src/main/java/org/elasticsearch/telemetry/TestTelemetryPlugin.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/OtelMetricsIT.java
  • test/external-modules/apm-integration/src/main/java/org/elasticsearch/test/apmintegration/ApmIntegrationPlugin.java
  • x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/transport/netty4/SecurityNetty4HttpServerTransportTlsHandshakeThrottleTests.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/ApmAgentMetricsIT.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/ApmIntakeMessageParser.java
  • test/external-modules/apm-integration/src/javaRestTest/java/org/elasticsearch/test/apmintegration/RecordingApmServer.java
  • modules/apm/src/main/java/org/elasticsearch/telemetry/apm/internal/APMMeterService.java

// from the code, but we're using the APM agent (instead of the OTel SDK) to export it.
// That's why this "else" branch, where otelMetricsEnabled is false, is still using OpenTelemetry.

/*
Contributor

nit: perhaps this would be best placed near the method that flushes, so it isn't confused with the comment above.

Contributor Author

My intent was to comment why we're doing a wait, so I put the comment on agentFlushWaitMs, but I see your point. Let me rearrange things a little.

Contributor

@mamazzol mamazzol left a comment

LGTM

@prdoyle prdoyle enabled auto-merge (squash) March 27, 2026 13:40
@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Mar 27, 2026
@prdoyle prdoyle merged commit 53355a6 into elastic:main Mar 30, 2026
36 checks passed
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026
* Fix tests so I can run them locally

* Flush interfaces

* Cursor refactor: two IT subclasses instead of one.

This way, instead of mutating the mode, we can have a separate test for each mode.

* Long interval, not short!

* Refactor APM ITs to use protocol-independent assertions

* TELEMETRY_TIMEOUT

* Tidy: use addAll instead of forEach(::add)

* Logging in APMMeterService

* Cleanup APMMeterService.attemptFlushMetrics

* Minor changes based on coderabbit review

* Rearrange comments per PR feedback
mouhc1ne pushed a commit to shmuelhanoch/elasticsearch that referenced this pull request Mar 31, 2026