Switch to OTel metrics #348
    metricTypes[md.ID] = md.Type
    switch typ := md.Type; typ {
    case MetricTypeCounter:
        counter, err := meter.Int64Counter(md.Name,
I'm using the name field from metrics.json here. Another option would be to use the field entry, which is "namespaced" and may be easier to parse. Both options abide by the OTel instrument naming requirements.
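As an illustration of the naming requirement mentioned in this comment, here is a minimal sketch of a name check. It assumes the rule as I understand it from the OpenTelemetry metrics API specification (ASCII only, first character a letter, then letters, digits, `_`, `.`, `-` or `/`, at most 255 characters). `validInstrumentName` and the sample names are hypothetical and not part of this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// instrumentNameRE encodes the assumed naming rule: one leading ASCII letter
// followed by up to 254 letters, digits, '_', '.', '/' or '-'.
var instrumentNameRE = regexp.MustCompile(`^[A-Za-z][A-Za-z0-9_./-]{0,254}$`)

// validInstrumentName reports whether name satisfies the assumed rule.
// It is a hypothetical helper, not a function from the OTel SDK.
func validInstrumentName(name string) bool {
	return instrumentNameRE.MatchString(name)
}

func main() {
	// Hypothetical sample values for the two candidate fields.
	for _, name := range []string{"UnwindSuccess", "bpf.unwind.success", "9lives", ""} {
		fmt.Printf("%q valid: %v\n", name, validInstrumentName(name))
	}
}
```

Under this rule, both a CamelCase name and a dotted, namespaced name pass, which is consistent with the comment that either field would be acceptable.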
I pushed #350, which should fix the ARM64 test failures.
    values := make([]int64, nMetrics)

    ctx := context.Background()
    for i := 0; i < nMetrics; i++ {
I measured elapsed time for this loop and it's in the tens-of-microseconds range, which means that performance is not an issue.
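A measurement like the one described here can be taken with the standard library alone. The sketch below is an assumption about how such a measurement might look, not the code used in this PR: `sumValues` stands in for the real reporting loop, and `timed` and the buffer size of 512 are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// sumValues is a stand-in for the reporting loop; the real loop records
// each value on an OTel instrument instead of summing.
func sumValues(values []int64) int64 {
	var total int64
	for _, v := range values {
		total += v
	}
	return total
}

// timed runs fn once and returns the wall-clock time it took.
func timed(fn func()) time.Duration {
	start := time.Now()
	fn()
	return time.Since(start)
}

func main() {
	values := make([]int64, 512) // hypothetical number of metrics
	for i := range values {
		values[i] = int64(i)
	}

	var total int64
	elapsed := timed(func() { total = sumValues(values) })
	fmt.Printf("total=%d elapsed=%s\n", total, elapsed)
}
```

A single run like this is noisy. For a firmer number, a `testing.B` benchmark would be the usual choice.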
    metricTypes map[MetricID]MetricType

    // OTel metric instrumentation
    meter = otel.Meter("go.opentelemetry.io/ebpf-profiler")
Should there be a way to disable this via configuration?
Configuration normally takes place at the meter provider level. The meter provider is either part of the OTel collector (if the agent runs inside the OTel collector) or, if we introduce a meter provider and an OTLP metrics exporter in the otlp_reporter, that is where we'd add it.
If a meter provider is not configured, every metering operation is a NOP.
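The no-op default described above can be pictured with a small standard-library model. Everything in this sketch (`Counter`, `Provider`, `SetProvider`, `recordingProvider`) is hypothetical and deliberately simplified. It is not the OTel API, it is not concurrency-safe, and unlike the real global provider it does not re-route instruments created before a provider was installed.

```go
package main

import "fmt"

// Counter is a hypothetical stand-in for an instrument; it is NOT the OTel API.
type Counter interface {
	Add(n int64)
}

// Provider is a hypothetical stand-in for a meter provider.
type Provider interface {
	Counter(name string) Counter
}

// noopCounter silently discards every measurement.
type noopCounter struct{}

func (noopCounter) Add(int64) {}

// noopProvider is the default: it only hands out no-op counters.
type noopProvider struct{}

func (noopProvider) Counter(string) Counter { return noopCounter{} }

// sumCounter accumulates the values added to it.
type sumCounter struct{ total int64 }

func (c *sumCounter) Add(n int64) { c.total += n }

// recordingProvider keeps one sumCounter per name.
type recordingProvider struct {
	counters map[string]*sumCounter
}

func newRecordingProvider() *recordingProvider {
	return &recordingProvider{counters: make(map[string]*sumCounter)}
}

func (p *recordingProvider) Counter(name string) Counter {
	c, ok := p.counters[name]
	if !ok {
		c = &sumCounter{}
		p.counters[name] = c
	}
	return c
}

// Total returns the accumulated value for name, or 0 if nothing was recorded.
func (p *recordingProvider) Total(name string) int64 {
	if c, ok := p.counters[name]; ok {
		return c.total
	}
	return 0
}

// global starts out as the no-op provider, so instrumented code needs no
// "is metrics enabled?" checks.
var global Provider = noopProvider{}

// SetProvider installs the provider used by later Counter lookups.
func SetProvider(p Provider) { global = p }

func main() {
	global.Counter("example.events").Add(5) // discarded: no provider configured

	rec := newRecordingProvider()
	SetProvider(rec)
	global.Counter("example.events").Add(5) // recorded
	fmt.Println(rec.Total("example.events"))
}
```

The design point is that disabling metrics becomes a deployment decision (install a provider or not) rather than a flag threaded through the instrumented code.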
    ids := make([]uint32, nMetrics)
    values := make([]int64, nMetrics)

    ctx := context.Background()
To be able to cancel the operation and not get blocked, it might be best to propagate the context to the function and use it.
    - ctx := context.Background()
    + func report(ctx context.Context)
If we add a context, it will propagate out of report and into every Add/AddSlice operation we do in the agent, which is what I was trying to avoid as it could get ugly (e.g. ProcessManager is caching AddSlice). Based on my initial research, it didn't seem possible for a metering operation to block (blocking would also be incompatible with metering being performant enough to be inserted in hot loops), so propagating a context just to avoid blocking would be unnecessary. I'll dig some more.
    if reporterImpl != nil {
        reporterImpl.ReportMetrics(uint32(prevTimestamp), ids, values)
    metric := metricsBuffer[i]
Could we simplify this loop and just do a for _, metric := range metricsBuffer { instead?
I roughly agree with Florian here.
for _, metric := range metricsBuffer[0:nMetrics] { would do it. Avoiding i here increases readability - otherwise when reading the code the first time, one tries to find if i is used in the loop body, which can be avoided.
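For comparison, the suggested loop shape might look like the sketch below. The `metric` struct, its fields, and `collect` are hypothetical stand-ins, and only the `range metricsBuffer[0:nMetrics]` form is taken from the comment above.

```go
package main

import "fmt"

// metric is a hypothetical stand-in for the buffered metric type.
type metric struct {
	id    uint32
	value int64
}

// collect copies the first nMetrics entries of metricsBuffer into parallel
// slices. Ranging over the sub-slice means no index variable is needed, and
// entries past nMetrics (stale data in a reused buffer) are never visited.
func collect(metricsBuffer []metric, nMetrics int) ([]uint32, []int64) {
	ids := make([]uint32, 0, nMetrics)
	values := make([]int64, 0, nMetrics)
	for _, m := range metricsBuffer[0:nMetrics] {
		ids = append(ids, m.id)
		values = append(values, m.value)
	}
	return ids, values
}

func main() {
	buf := []metric{{1, 10}, {2, 20}, {3, 30}} // third entry is stale
	ids, values := collect(buf, 2)
	fmt.Println(ids, values)
}
```

Note that ranging over the whole `metricsBuffer` would only be equivalent if the buffer is always completely filled, which is why the sub-slice form is the safer rewrite.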
Treat missing metric definitions as a hard error
Will switch to OTel metrics, MetricsReporter is no longer needed.
No longer used
No longer used
Summary
This PR adds OTel metrics instrumentation and cleans up the relevant reporter interfaces. My initial implementation kept the MetricsReporter abstraction, but I then decided to remove all the previous metric reporting logic (which was not OTel-compliant) in order to simplify the code and keep OTel metrics instrumentation in a single place.
This is an initial attempt at introducing OTel instrumentation; deeper architectural restructuring could be made, but I think this is a good first step. Note that I'm not introducing a meter provider for OTLP metrics: if the agent is running as an OTel collector receiver, the expectation is that the global meter provider that the OTel collector configures will be used. If the agent is running standalone and reporting via OTLP, we could introduce a meter provider and an OTLP metrics exporter in a follow-up PR (assuming we think this is needed).
Reviewing commit-by-commit might be easier. The last commit, which contains a meter provider and a stdout exporter for testing, will be removed before merging. We'd follow a similar route if we wanted to add an OTLP metrics exporter to the OTLP reporter.
TODO:
- Exporter example for testing (stdout)
- Mark unused metrics as obsolete