@afharo afharo commented Jul 28, 2025

Summary

Resolves #229933
Partially addresses #224860
Partially addresses #230002

Notable changes in this PR:

  • Adds instrumentation for the OTel Metrics client, accepting 2 configurations for the exporters: gRPC and HTTP.
  • Extends the OTel resource definition with additional properties, and makes them shared between tracing and metrics instrumentations.
  • Improves the telemetry config schema definition.
  • Applies validation to the tracing and metrics configuration so that config-schema-defined defaults can be used.
  • Conditionally registers the metrics provider in the monitoring_collection plugin so that, when it has no exporters to set up, it does not replace the newly registered global metrics provider.
  • Registers cherry-picked metric-relevant EDOT-provided autoinstrumentations only when metric collection is enabled.

Small demo of the collected metrics

I've created a small dashboard to demo the metrics that are automatically collected by the registered instrumentation:

[Image: two screenshots of the demo dashboard]

If you want to see it live, download this file, unzip it, and import the resulting export.ndjson file in a Serverless Observability project.

Then, click on "Add data" > "Application" > "OpenTelemetry" > "Managed OTLP Endpoint", and copy the URL and API Key

[Image: screenshot]

Then configure your local Kibana with the following:

telemetry.metrics:
  enabled: true
  interval: 10s
  exporters:
    - grpc:
        url: <the URL you copied>
        headers:
          authorization: "ApiKey <the API key that was generated>"

@opentelemetry/exporter-metrics-otlp-http

  • Purpose: What is this dependency used for? Briefly explain its role in your changes.

This is used to set up the OTel Metrics OTLP exporter using the HTTP protocol. At the moment, we are only capable of shipping the OTel metrics using the gRPC protocol, but we'd like to be able to enable support for the HTTP exporter as well.

  • Justification: Why is adding this dependency the best approach?

It's the official and recommended exporter. When using the OTel/EDOT SDKs, if the process is run with the env var OTEL_EXPORTER_OTLP_ENDPOINT, the SDK automatically sets up this exporter. We need to import it programmatically because we want to use the settings from kibana.yml instead, and we're not allowed to use dependencies that are not listed in our package.json (this library was already installed by the SDK, as can be seen in the yarn.lock file).

  • Alternatives explored: Were other options considered (e.g., using existing internal libraries/utilities, implementing the functionality directly)? If so, why was this dependency chosen over them?

We didn't consider any other alternatives, since this is the official (and already installed by the SDK) OTel HTTP metrics exporter.

  • Existing dependencies: Does Kibana have a dependency providing similar functionality? If so, why is the new one preferred?

Yes, this library is already installed by the SDK for auto-configuration when the env var OTEL_EXPORTER_OTLP_ENDPOINT is set for the process.

Similarly, we already use OTel's Traces (as opposed to Metrics) OTLP exporter using HTTP (link). This is its Metrics counterpart.


Checklist

Check that the PR satisfies the following conditions.

Reviewers should verify this PR satisfies this list as well.

  • Documentation was added for features that require explanation or tutorials
  • Unit or functional tests were updated or added to match the most common scenarios
  • If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
  • This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The release_note:breaking label should be applied in these situations.
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines
  • Review the backport guidelines and apply applicable backport:* labels.

Identify risks

Does this PR introduce any risks? For example, consider risks like hard-to-test bugs, performance regressions, or potential data loss.

Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

@afharo afharo changed the title from "[OTel] Instrument metrics collection" to "[OTel] Setup OTel's metrics client" Jul 30, 2025
    return response;
  })
  .finally(() => {
    originalExit(exitCode);
Member Author

We shouldn't call process.exit here. Kibana has a graceful shutdown that could be affected by this if the OTel shutdown is completed faster.

Contributor

WDYT of calling it only when process.exit was called?

Member Author

According to the Node.js documentation, if you want to queue anything before exiting the process, you need to add it to the beforeExit listener.

Contributor

that doesn't help when the process itself calls process.exit():

The 'beforeExit' event is not emitted for conditions causing explicit termination, such as calling process.exit() or uncaught exceptions.

Contributor

process.on('exit') only allows sync operations while beforeExit allows async work. Reacting to SIGINT and SIGTERM is probably sufficient though.

Member Author

I added a listener to uncaughtExceptionMonitor as well. This way, we will react to uncaught exceptions (which end up crashing the process, and for which we typically want to flush the events).

The reason for using uncaughtExceptionMonitor instead of uncaughtException is that it doesn't stop the process from crashing.

Regarding explicit calls to process.exit(): they only occur in bootstrap.ts, when there's a fatal error (which is logged). Fatal errors are either config or migration errors.
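
To make the thread above concrete, here is a minimal sketch of the listener set it converges on (beforeExit, SIGINT, SIGTERM, and uncaughtExceptionMonitor). The names `registerFlushOnExit` and `shutdown` are hypothetical and this is not the PR's actual code. The helper accepts any EventEmitter so it can be exercised without touching the real `process` object.

```typescript
import { EventEmitter } from 'node:events';

type ShutdownFn = () => Promise<void>;

/**
 * Hypothetical helper: runs `shutdown` at most once, on whichever of the
 * events discussed in this thread fires first.
 */
export function registerFlushOnExit(proc: EventEmitter, shutdown: ShutdownFn): () => Promise<void> {
  let pending: Promise<void> | undefined;

  const flushOnce = (): Promise<void> => {
    if (!pending) {
      // Errors are swallowed here for brevity; real code would log them.
      pending = shutdown().catch(() => {});
    }
    return pending;
  };

  // Emitted when the event loop drains; async work may be scheduled here.
  proc.once('beforeExit', flushOnce);
  // Termination signals.
  proc.once('SIGINT', flushOnce);
  proc.once('SIGTERM', flushOnce);
  // Observes uncaught exceptions without preventing the crash.
  proc.on('uncaughtExceptionMonitor', flushOnce);

  return flushOnce;
}
```

In a real process you would pass `process` and the SDK's own shutdown function. Two caveats apply: attaching SIGINT/SIGTERM listeners replaces Node's default exit-on-signal behavior, and an asynchronous flush started from uncaughtExceptionMonitor is not guaranteed to finish before the process crashes.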

@afharo afharo self-assigned this Jul 30, 2025
@afharo afharo added the Team:Core, Team:Security, Team:Monitoring, Team:Obs AI Assistant, Team:AI Infra, release_note:skip, and backport:skip labels Jul 30, 2025
@afharo afharo marked this pull request as ready for review July 31, 2025 07:02
@afharo afharo requested review from a team and vigneshshanmugam as code owners July 31, 2025 07:02
@afharo afharo added the ci:project-deploy-elasticsearch and ci:project-persist-deployment labels Jul 31, 2025
  readers: meterReaders,
});

api.metrics.setGlobalMeterProvider(meterProvider);
Member Author

This essentially makes our Prometheus endpoint incompatible with the Core OTel metrics exporters.

IMO, it's OK for now (I didn't want to create more changes in this PR). But we'll need a follow-up PR to address this incompatibility.

Contributor

Can you help me understand the impact of this a bit better? AFAICT If someone is exporting metrics to Prometheus this call to setGlobalMeterProvider will override the one in @kbn/tracing. So it's kind of an edge case possibility that someone configures opentelemetry.metrics.prometheus.enabled: true AND telemetry.metrics.enabled: true?

In that case, is it possible/preferable to throw an unhandled exception that says "global meter provider already registered... you cannot set both x and y"? Otherwise this approach LGTM.

Member Author

Your understanding is correct! (the only nit: it will override the one in @kbn/metrics 😬).

I've created this issue to address it #230184

In that case, is it possible/preferable to throw an unhandled exception that says "global meter provider already registered... you cannot set both x and y"?

I tried figuring out a way to do that, but I couldn't find any way to detect that a GlobalMeterProvider was already registered (OTel Metrics always has one, at least a NoopMeterProvider). We could check that the registered one is not an instance of the NoopMeterProvider, and warn/throw the error.

However, I wish that we could come up with a way to add the Prometheus exporter to the Core-registered MeterProvider (similar to what we did with tracing for the langfuse and phoenix exporters defined in the inference plugin).
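
The "you cannot set both x and y" idea from the review could also be enforced at the config level, before any provider is registered. Here is a minimal sketch, assuming a hypothetical `assertSingleMeterProviderOwner` helper and a flattened view of the two flags named in this thread. It deliberately does not inspect the OTel global, which the comment above notes is hard to do.

```typescript
/** Flattened, hypothetical view of the two settings discussed in this thread. */
interface MeterProviderFlags {
  /** Mirrors opentelemetry.metrics.prometheus.enabled (as written in the review). */
  prometheusEnabled: boolean;
  /** Mirrors telemetry.metrics.enabled (Core OTel metrics). */
  coreMetricsEnabled: boolean;
}

/** Throws when both owners of the global MeterProvider are enabled at once. */
export function assertSingleMeterProviderOwner(flags: MeterProviderFlags): void {
  if (flags.prometheusEnabled && flags.coreMetricsEnabled) {
    throw new Error(
      'Global meter provider conflict: you cannot set both ' +
        '"opentelemetry.metrics.prometheus.enabled" and "telemetry.metrics.enabled".'
    );
  }
}
```

Whether to throw or only log a warning is a design choice left open in the discussion; the sketch throws to match the reviewer's suggestion.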

Contributor

@jloleysens jloleysens left a comment

Nice work @afharo !

@afharo afharo removed the ci:project-deploy-elasticsearch and ci:project-persist-deployment labels Aug 1, 2025
@@ -0,0 +1,90 @@
# @kbn/metrics

This package includes the logic to initialize the OpenTelemetry Metrics client and its exporters.
Contributor

Why is this a private package? @kbn/tracing isn't.

Member Author

It's not used outside platform. I'm planning to reorg the packages and will move @kbn/tracing and @kbn/tracing-config to the private end as well. But I didn't want to increase this PR's scope.

Contributor

What does "platform" even mean here? What does "private" mean? Why would something like a Discover plugin be able to use it, but APM wouldn't? What about scripts that want to setup tracing?

Member Author

What does "platform" even mean here? What does "private" mean?

Those are SKA concepts that have been discussed many times.

In my (oversimplified) mental model:

  • "platform": not solution-specific code
  • "private": code that cannot be imported outside "platform" (cannot be imported by solution-specific code).

Why would something like a Discover plugin be able to use it, but APM wouldn't? What about scripts that want to setup tracing?

AFAIK, any scripts (solution's or not) setting up the OTel tracing should use @kbn/telemetry (which will remain as "shared") and not the internal @kbn/tracing or @kbn/metrics (the one that's currently private in this PR).

Re Discover plugin vs. APM: TBH, I would have loved to have a "core" group that would have restricted the "platform" plugins. But we discovered this during the migration and it was a risk to add it that late based on our tight deadlines. We might introduce that split in the future if we notice that "platform" plugins import core-internal packages.

NOTE: Declaring a package private is not set in stone. If we identify a use case where it's needed, we can move it to shared. However, starting it as private makes the public surface more manageable for everyone (maintainers and consumers).

Contributor

Yes we are confusing core with Platform, which is my point. This setup is silly, and in general I don't see why we should default to private.

Member Author

I agree with your point re SKA. Do you consider this a blocker in this PR? I don't see how initMetrics from @kbn/metrics would be used in isolation (it would replace the global MeterProvider potentially set up in initTelemetry), and that's why I consider it "private" (too bad that it's still exposed to all platform packages and plugins).

If you think that it should be "shared", happy to move it to unblock the PR.

Contributor

No, it's not a blocker. I just think it's a useless concept, and I hope we fix it soon (either everything as public, which would be my preference, or "scoped" private packages, or maybe just sub-packages of packages).

switch (variant.type) {
  case 'grpc': {
    const metadata = new Metadata();
    Object.entries(variant.value.headers || {}).forEach(([key, value]) => {
Contributor

should we not call this metadata in the case of grpc?

Member Author

The env var is called OTEL_EXPORTER_OTLP_HEADERS, and its entries look like headers (Authorization=...). This is why I chose to keep headers in the config as well. The fact that it's passed as metadata is an implementation detail, IMO.
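
A minimal sketch of the implementation detail mentioned above: the config keeps the name `headers`, and the gRPC branch copies each entry into a metadata object. `MetadataLike` is a hypothetical structural stand-in for the `Metadata` class from `@grpc/grpc-js` (it mirrors that class's `set(key, value)` method), and `applyHeaders` and `InMemoryMetadata` are illustrative names. The body of the PR's own loop is not shown in the snippet under review, so the `set` call here is an assumption.

```typescript
/** Structural stand-in for the part of gRPC Metadata used here (assumption). */
interface MetadataLike {
  set(key: string, value: string): void;
}

/** Copies every configured header into the given metadata object and returns it. */
export function applyHeaders<T extends MetadataLike>(
  metadata: T,
  headers?: Record<string, string>
): T {
  Object.entries(headers ?? {}).forEach(([key, value]) => {
    metadata.set(key, value);
  });
  return metadata;
}

/** Tiny in-memory implementation, only so the sketch runs without gRPC installed. */
class InMemoryMetadata implements MetadataLike {
  readonly entries = new Map<string, string>();

  set(key: string, value: string): void {
    this.entries.set(key, value);
  }
}

// Usage: the placeholder value mirrors the YAML example in the PR description.
const exampleMetadata = applyHeaders(new InMemoryMetadata(), {
  authorization: 'ApiKey <the API key that was generated>',
});
console.log(exampleMetadata.entries.has('authorization'));
```

Keeping the user-facing name as `headers` and translating at the exporter boundary means the gRPC and HTTP variants can share the same config shape.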

/**
 * Global toggle for telemetry. It disables all form of telemetry: product analytics, OTel tracing and OTel metrics.
 */
enabled: schema.boolean({ defaultValue: true }),
Contributor

should this be true?

Contributor

Oh, I got thrown off by the fact it's called telemetryTracingSchemaProps. Should it just be telemetrySchemaProps?

Member Author

Yes. It is the current default. Otherwise, the "product telemetry" would be disabled by default.

Contributor

but do you want to change the name of the var?

Member Author

Oh! I missed the 2nd comment (network issues the other day). I'll rename the var.

Member Author

Renamed in fa33d15
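
As a small illustration of the global toggle described in the doc comment above: a signal is only initialized when both the top-level `enabled` flag and its own `enabled` flag are true. The `TelemetryFlags` shape and the `signalsToInit` helper are hypothetical simplifications, not the actual `@kbn/telemetry-config` schema, and product analytics is left out.

```typescript
type Signal = 'tracing' | 'metrics';

/** Simplified, hypothetical view of the telemetry config flags. */
interface TelemetryFlags {
  enabled: boolean;
  tracing: { enabled: boolean };
  metrics: { enabled: boolean };
}

/** Returns the OTel signals that should be initialized for a given config. */
export function signalsToInit(config: TelemetryFlags): Signal[] {
  // The global toggle wins: nothing is initialized when it is off.
  if (!config.enabled) {
    return [];
  }
  const signals: Signal[] = [];
  if (config.tracing.enabled) {
    signals.push('tracing');
  }
  if (config.metrics.enabled) {
    signals.push('metrics');
  }
  return signals;
}
```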

  return async () => {};
}
if (telemetryConfig.tracing.enabled) {
  initTracing({ resource, tracingConfig: telemetryConfig.tracing });
Contributor

is it safe to call this before registering instrumentations?

Member Author

I think so.

If an instrumentation calls trace.getTracer(), it'll get the tracer in the TracerProvider registered by initTracing.

If the instrumentations are registered before calling initTracing, trace.getTracer() gets its tracer from the NoopTracerProvider, essentially discarding the traces.

@elasticmachine
Contributor

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/telemetry | 3 | 0 | -3 |
| @kbn/telemetry-config | 1 | 0 | -1 |
| @kbn/tracing | 18 | 17 | -1 |
| total | | | -5 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/metrics | - | 5 | +5 |
| @kbn/metrics-config | - | 6 | +6 |
| @kbn/telemetry-config | 4 | 5 | +1 |
| total | | | +12 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/tracing | 2 | 1 | -1 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/tracing | 2 | 1 | -1 |

cc @afharo

Contributor

@dgieselaar dgieselaar left a comment

Thanks!

Member

@pickypg pickypg left a comment

LGTM as long as #230184 is a fast-follow.

@afharo afharo merged commit d9b3db0 into elastic:main Aug 4, 2025
14 checks passed
@afharo afharo deleted the otel/instrument-metrics branch August 4, 2025 14:24
szaffarano pushed a commit to szaffarano/kibana that referenced this pull request Aug 5, 2025
## Summary

Resolves elastic#229933
Partially addresses elastic#224860
Partially addresses elastic#230002

Notable changes in this PR:

* Adds instrumentation for the OTel Metrics client, accepting 2
configurations for the exporters: gRPC and HTTP.
* Extends the OTel resource definition with additional properties, and
makes them shared between tracing and metrics instrumentations.
* Improves `telemetry` config schema definition
* Applies validation to the tracing and metrics configuration so that
config-schema-defined defaults can be used.
* Conditionally register the metrics provider in the plugin
`monitoring_collection` to stop it from replacing the newly registered
global metrics provider if it has no exporters to set up.
* Registers cherry-picked metric-relevant EDOT-provided
autoinstrumentations only when metric collection is enabled.

---

### Small demo of the collected metrics

I've created a small dashboard to demo the metrics that are
automatically collected by the registered instrumentation:

<img width="2310" height="884" alt="image"
src="https://github.com/user-attachments/assets/9b3ebea4-b45c-4f33-a05f-c9f2ac7c3175"
/>
<img width="2305" height="839" alt="image"
src="https://github.com/user-attachments/assets/fcb77d72-38e9-494f-a164-736bbc5fef05"
/>

If you want to see it live, download [this
file](https://github.com/user-attachments/files/21534687/export.ndjson.zip),
unzip it, and import the resulting `export.ndjson` file in a Serverless
Observability project.

Then, click on "Add data" > "Application" > "OpenTelemetry" > "Managed
OTLP Endpoint", and copy the URL and API Key

<img width="1727" height="991" alt="image"
src="https://github.com/user-attachments/assets/60d94e92-ca6d-4002-a6cd-b4c6e3b85c3f"
/>

Then configure your local Kibana with the following:

```yaml
telemetry.metrics:
  enabled: true
  interval: 10s
  exporters:
    - grpc:
        url: <the URL you copied>
        headers:
          authorization: "ApiKey <the API key that was generated>"
```

---

### `@opentelemetry/exporter-metrics-otlp-http`

- [x] **Purpose:** What is this dependency used for? Briefly explain its
role in your changes.

This is used to set up the OTel Metrics OTLP exporter using the HTTP
protocol. At the moment, we are only capable of shipping the OTel
metrics using the gRPC protocol, but we'd like to be able to enable
support for the HTTP exporter as well.

- [x] **Justification:** Why is adding this dependency the best
approach?

It's the official and recommended exporter. When using the OTel/EDOT
SDKs, if the process is run with the env var
`OTEL_EXPORTER_OTLP_ENDPOINT`, it automatically instruments this
exporter. We need to programmatically import it because we want to use
the settings coming in `kibana.yml` instead, and we're not allowed to
use dependencies that are not listed in our `package.json` (but this
library was already installed by the SDK, as can be seen in the
[yarn.lock
file](https://github.com/elastic/kibana/pull/229696/files#diff-51e4f558fae534656963876761c95b83b6ef5da5103c4adef6768219ed76c2deR9483)).

- [x] **Alternatives explored:** Were other options considered (e.g.,
using existing internal libraries/utilities, implementing the
functionality directly)? If so, why was this dependency chosen over
them?

We didn't consider any other alternatives, since this is the official
(and already installed by the SDK) OTel HTTP metrics exporter.

- [x] **Existing dependencies:** Does Kibana have a dependency providing
similar functionality? If so, why is the new one preferred?

Yes, this library is already installed by the SDK for auto-configuration
in case the env var `OTEL_EXPORTER_OTLP_ENDPOINT` is attached to the
process.

Similarly, we already use OTel's **Traces** (as opposed to Metrics) OTLP
exporter using HTTP
([link](https://github.com/elastic/kibana/blob/6d55439bd795b6c8f01084dc4ce8b0e2cb7eb0a1/package.json#L1135)).
This is its Metrics counterpart.

---

### Checklist

Check the PR satisfies following conditions. 

Reviewers should verify this PR satisfies this list as well.

- [x]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [x] This was checked for breaking HTTP API changes, and any breaking
changes have been approved by the breaking-change committee. The
`release_note:breaking` label should be applied in these situations.
- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
- [x] Review the [backport
guidelines](https://docs.google.com/document/d/1VyN5k91e5OVumlc0Gb9RPa3h1ewuPE705nRtioPiTvY/edit?usp=sharing)
and apply applicable `backport:*` labels.

### Identify risks

Does this PR introduce any risks? For example, consider risks like hard
to test bugs, performance regression, potential of data loss.

Describe the risk, its severity, and mitigation for each identified
risk. Invite stakeholders and evaluate how to proceed before merging.

- [ ] [See some risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx)
- [ ] ...

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
delanni pushed a commit to delanni/kibana that referenced this pull request Aug 5, 2025
@wildemat wildemat mentioned this pull request Aug 7, 2025
10 tasks
NicholasPeretti pushed a commit to NicholasPeretti/kibana that referenced this pull request Aug 18, 2025
## Summary

Resolves elastic#229933
Partially addresses elastic#224860
Partially addresses elastic#230002

Notable changes in this PR:

* Adds instrumentation for the OTel Metrics client, accepting 2
configurations for the exporters: gRPC and HTTP.
* Extends the OTel resource definition with additional properties, and
makes them shared between tracing and metrics instrumentations.
* Improves `telemetry` config schema definition
* Applies validation to the tracing and metrics configuration so that
config-schema-defined defaults can be used.
* Conditionally register the metrics provider in the plugin
`monitoring_collection` to stop it from replacing the newly registered
global metrics provider if it has no exporters to set up.
* Registers cherry-picked metric-relevant EDOT-provided
autoinstrumentations only when metric collection is enabled.

---

### Small demo of the collected metrics

I've created a small dashboard to demo the metrics that are
automatically collected by the registered instrumentation:

<img width="2310" height="884" alt="image"
src="https://github.com/user-attachments/assets/9b3ebea4-b45c-4f33-a05f-c9f2ac7c3175"
/>
<img width="2305" height="839" alt="image"
src="https://github.com/user-attachments/assets/fcb77d72-38e9-494f-a164-736bbc5fef05"
/>

If you want to see it live, download [this
file](https://github.com/user-attachments/files/21534687/export.ndjson.zip),
unzip it, and import the resulting `export.ndjson` file in a Serverless
Observability project.

Then, click on "Add data" > "Application" > "OpenTelemetry" > "Managed
OTLP Endpoint", and copy the URL and API key.

<img width="1727" height="991" alt="image"
src="https://github.com/user-attachments/assets/60d94e92-ca6d-4002-a6cd-b4c6e3b85c3f"
/>

Then configure your local Kibana with the following:

```yaml
telemetry.metrics:
  enabled: true
  interval: 10s
  exporters:
    - grpc:
        url: <the URL you copied>
        headers:
          authorization: "ApiKey <the API key that was generated>"
```
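
Since the exporters accept both gRPC and HTTP configurations, an HTTP
variant might look like the sketch below. The `http` key and its `url` and
`headers` fields are assumptions made by analogy with the `grpc` example
above, not something confirmed in this description; check the `telemetry`
config schema for the exact names.

```yaml
# Assumed shape: mirrors the gRPC example, with `http` in place of `grpc`.
telemetry.metrics:
  enabled: true
  interval: 10s
  exporters:
    - http:
        url: <the OTLP HTTP URL>
        headers:
          authorization: "ApiKey <the API key that was generated>"
```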

---

### `@opentelemetry/exporter-metrics-otlp-http`

- [x] **Purpose:** What is this dependency used for? Briefly explain its
role in your changes.

This is used to set up the OTel Metrics OTLP exporter using the HTTP
protocol. At the moment, we can only ship the OTel metrics using the
gRPC protocol, but we'd like to enable support for the HTTP exporter
as well.

- [x] **Justification:** Why is adding this dependency the best
approach?

It's the official and recommended exporter. When using the OTel/EDOT
SDKs, if the process is run with the env var
`OTEL_EXPORTER_OTLP_ENDPOINT`, the SDK automatically sets up this
exporter. We need to programmatically import it because we want to use
the settings coming in `kibana.yml` instead, and we're not allowed to
use dependencies that are not listed in our `package.json` (but this
library was already installed by the SDK, as can be seen in the
[yarn.lock
file](https://github.com/elastic/kibana/pull/229696/files#diff-51e4f558fae534656963876761c95b83b6ef5da5103c4adef6768219ed76c2deR9483)).

- [x] **Alternatives explored:** Were other options considered (e.g.,
using existing internal libraries/utilities, implementing the
functionality directly)? If so, why was this dependency chosen over
them?

We didn't consider any other alternatives, since this is the official
(and already installed by the SDK) OTel HTTP metrics exporter.

- [x] **Existing dependencies:** Does Kibana have a dependency providing
similar functionality? If so, why is the new one preferred?

Yes, this library is already installed by the SDK for auto-configuration
when the env var `OTEL_EXPORTER_OTLP_ENDPOINT` is set for the process.

Similarly, we already use OTel's **Traces** (as opposed to Metrics) OTLP
exporter using HTTP
([link](https://github.com/elastic/kibana/blob/6d55439bd795b6c8f01084dc4ce8b0e2cb7eb0a1/package.json#L1135)).
This is its Metrics counterpart.

---

### Checklist

Check that the PR satisfies the following conditions.

Reviewers should verify this PR satisfies this list as well.

- [x]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [x] This was checked for breaking HTTP API changes, and any breaking
changes have been approved by the breaking-change committee. The
`release_note:breaking` label should be applied in these situations.
- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
- [x] Review the [backport
guidelines](https://docs.google.com/document/d/1VyN5k91e5OVumlc0Gb9RPa3h1ewuPE705nRtioPiTvY/edit?usp=sharing)
and apply applicable `backport:*` labels.

### Identify risks

Does this PR introduce any risks? For example, consider risks like
hard-to-test bugs, performance regressions, or potential data loss.

Describe the risk, its severity, and mitigation for each identified
risk. Invite stakeholders and evaluate how to proceed before merging.

- [ ] [See some risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx)
- [ ] ...

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>

Labels

backport:skip This PR does not require backporting
release_note:skip Skip the PR/issue when compiling release notes
Team:AI Infra Platform AppEx AI Infrastructure Team
Team:Core Platform Core services: plugins, logging, config, saved objects, http, ES client, i18n, etc
Team:Monitoring Stack Monitoring team
Team:Obs AI Assistant Observability AI Assistant
Team:Security Platform Security: Auth, Users, Roles, Spaces, Audit Logging, etc
v9.2.0

Development

Successfully merging this pull request may close these issues.

[Server-side OTel] [Metrics] Client setup

8 participants