Switch to OTel metrics#348

Merged
christos68k merged 9 commits into main from ck/otel-metrics
Feb 24, 2025

christos68k (Member) commented Feb 11, 2025

Summary

This PR adds OTel metrics instrumentation and cleans up the relevant reporter interfaces. My initial implementation kept the MetricsReporter abstraction but I then decided to remove all the previous metric reporting logic (that was not OTel-compliant) in order to simplify the code and keep OTel metrics instrumentation in a single place.

This is an initial attempt at introducing OTel instrumentation. Deeper architectural restructuring could be done, but I think this is a good first step. Note that I'm not introducing a meter provider for OTLP metrics: if the agent is running as an OTel collector receiver, the expectation is that the global meter provider that the OTel collector configures will be used. If the agent is running standalone and reporting via OTLP, we could introduce a meter provider and an OTLP metrics exporter in a follow-up PR (assuming we think this is needed).

Reviewing commit by commit might be easier. The last commit will be removed before merging; it contains a meter provider and a stdout exporter for testing. We'd follow a similar route if we wanted to add an OTLP metrics exporter to the OTLP reporter.

TODO:

  • Exporter example for testing (stdout)
  • Mark unused metrics as obsolete

christos68k requested review from a team as code owners February 11, 2025 15:58
christos68k marked this pull request as draft February 11, 2025 16:12
christos68k changed the title from "WIP: Switch to OTel metrics" to "Switch to OTel metrics" Feb 11, 2025
christos68k marked this pull request as ready for review February 11, 2025 22:40
christos68k force-pushed the ck/otel-metrics branch 2 times, most recently from f7d1475 to e8bf800 February 11, 2025 22:50
Comment thread metrics/metrics.go
metricTypes[md.ID] = md.Type
switch typ := md.Type; typ {
case MetricTypeCounter:
counter, err := meter.Int64Counter(md.Name,
Member Author

I'm using the name field from metrics.json here. Another option would be to use field, which is "namespaced" and maybe easier to parse. Both options abide by the OTel instrument naming requirements.

christos68k commented Feb 11, 2025

I pushed #350 which should fix the ARM64 test failures.

Comment thread metrics/metrics.go
values := make([]int64, nMetrics)

ctx := context.Background()
for i := 0; i < nMetrics; i++ {
Member Author

I measured elapsed time for this loop and it's in the tens-of-microseconds range, which means that performance is not an issue.
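
A measurement like the one described can be taken with the standard library alone. The following is a minimal sketch, not the code used in the PR: timeLoop and the slice it operates on are hypothetical stand-ins for the real loop body.

```go
package main

import (
	"fmt"
	"time"
)

// timeLoop runs fn n times and returns the total elapsed time.
// time.Since relies on the monotonic clock reading stored by time.Now,
// so wall-clock adjustments do not distort the result.
func timeLoop(n int, fn func(i int)) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		fn(i)
	}
	return time.Since(start)
}

func main() {
	// Hypothetical stand-in for the per-metric work done in the loop.
	values := make([]int64, 256)
	elapsed := timeLoop(len(values), func(i int) {
		values[i] += int64(i)
	})
	fmt.Printf("loop over %d metrics took %s\n", len(values), elapsed)
}
```

A single run is noisy; for a number worth quoting, a benchmark via the testing package would average over many iterations.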

Comment thread metrics/metrics.go Outdated
Comment thread metrics/metrics.go Outdated
metricTypes map[MetricID]MetricType

// OTel metric instrumentation
meter = otel.Meter("go.opentelemetry.io/ebpf-profiler")
Member

Should there be a way so that this can be disabled by configuration?

Member Author

Configuration normally takes place at the meter provider level. The meter provider is either part of the OTel collector (if the agent runs inside the OTel collector) or, if we introduce a meter provider and an OTLP metrics exporter in the otlp_reporter, that is where we'd add it.

If a meter provider is not configured, every metering operation is a NOP.

Comment thread metrics/metrics.go
ids := make([]uint32, nMetrics)
values := make([]int64, nMetrics)

ctx := context.Background()
Member

To be able to cancel the operation and not get blocked, it might be best to propagate the context to the function and use it.

Suggested change
- ctx := context.Background()
+ func report(ctx context.Context)

Member Author

christos68k commented Feb 12, 2025

If we add a context, it will propagate out of report and into every Add/AddSlice operation we do in the agent, which is what I was trying to avoid because it could get ugly (e.g. ProcessManager is caching AddSlice). Based on my initial research, it didn't seem possible for a metering operation to block (blocking would also be incompatible with metering being performant enough to be inserted in hot loops), so propagating a context just to avoid blocking would be unnecessary. I'll dig some more.
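
The trade-off can be sketched with a small stand-alone model. This is an illustration only, assuming (as the reply implies) that report is reached from Add; the buffer, recorder, and countingRecorder names are hypothetical and are not the agent's types or the OTel API.

```go
package main

import (
	"context"
	"fmt"
)

// metric is a hypothetical (ID, value) pair.
type metric struct {
	ID    uint32
	Value int64
}

// recorder is a hypothetical stand-in for an instrument's recording call,
// which takes a context like OTel instruments do.
type recorder func(ctx context.Context, m metric)

// buffer collects metrics and flushes them once limit entries are pending.
type buffer struct {
	limit   int
	pending []metric
	record  recorder
}

// Add takes no context. If report required one, Add (and every caller
// of Add) would have to accept and forward it as well.
func (b *buffer) Add(id uint32, value int64) {
	b.pending = append(b.pending, metric{ID: id, Value: value})
	if len(b.pending) >= b.limit {
		b.report()
	}
}

// report creates its own background context, keeping call sites context-free.
func (b *buffer) report() {
	ctx := context.Background()
	for _, m := range b.pending {
		b.record(ctx, m)
	}
	b.pending = b.pending[:0]
}

// countingRecorder prints each metric and counts how many were recorded.
func countingRecorder(n *int) recorder {
	return func(_ context.Context, m metric) {
		*n++
		fmt.Printf("metric %d = %d\n", m.ID, m.Value)
	}
}

func main() {
	recorded := 0
	b := &buffer{limit: 2, record: countingRecorder(&recorded)}
	b.Add(1, 10) // buffered only
	b.Add(2, 20) // reaches the limit and triggers report
	fmt.Println("recorded:", recorded)
}
```

Changing report to accept a context in this model forces a signature change on Add too, which is the ripple effect the reply describes.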

Comment thread metrics/metrics.go

if reporterImpl != nil {
reporterImpl.ReportMetrics(uint32(prevTimestamp), ids, values)
metric := metricsBuffer[i]
Member

Could we simplify this loop and just do a for _, metric := range metricsBuffer { instead?

Member Author

christos68k commented Feb 12, 2025

metricsBuffer is statically sized by worst-case (IDMax) and the number of metrics it actually contains is given by nMetrics, so we can't iterate over the entire slice (we could check against Metric{} or Metric.ID == 0 during the iteration but I think the current approach is cleaner).

Contributor

I roughly agree with Florian here.
for _, metric := range metricsBuffer[0:nMetrics] { would do it. Avoiding i here increases readability - otherwise when reading the code the first time, one tries to find if i is used in the loop body, which can be avoided.
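
The sub-slice form suggested here can be tried in isolation. A minimal sketch, assuming a hypothetical Metric type and helper; only the names metricsBuffer and nMetrics come from the discussion.

```go
package main

import "fmt"

// Metric is a hypothetical stand-in for the real metric type.
type Metric struct {
	ID    uint32
	Value int64
}

// sumFilled adds up the values of the first n entries of buf.
// Slicing with buf[:n] limits the range loop to the populated prefix,
// so the loop body needs no index variable.
func sumFilled(buf []Metric, n int) int64 {
	var total int64
	for _, m := range buf[:n] {
		total += m.Value
	}
	return total
}

func main() {
	// The buffer is sized for the worst case; only the first nMetrics
	// entries hold data, the rest are zero values.
	metricsBuffer := make([]Metric, 8)
	metricsBuffer[0] = Metric{ID: 1, Value: 5}
	metricsBuffer[1] = Metric{ID: 2, Value: 7}
	nMetrics := 2
	fmt.Println(sumFilled(metricsBuffer, nMetrics)) // prints 12
}
```

Slicing does not copy the underlying array, so this form costs no more than the indexed loop.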

Comment thread reporter/otlp_reporter.go Outdated
christos68k merged commit c6cb9d4 into main Feb 24, 2025
christos68k deleted the ck/otel-metrics branch February 24, 2025 13:52
bhavnajindal added a commit to instana/opentelemetry-ebpf-profiler that referenced this pull request Mar 12, 2025
nsavoire added a commit to DataDog/dd-otel-host-profiler that referenced this pull request Mar 31, 2025
nsavoire added a commit to DataDog/dd-otel-host-profiler that referenced this pull request Apr 1, 2025
nsavoire added a commit to DataDog/dd-otel-host-profiler that referenced this pull request Apr 1, 2025
nsavoire added a commit to DataDog/dd-otel-host-profiler that referenced this pull request Apr 1, 2025
nsavoire added a commit to DataDog/dd-otel-host-profiler that referenced this pull request Apr 16, 2025
nsavoire added a commit to DataDog/dd-otel-host-profiler that referenced this pull request Apr 18, 2025
nsavoire added a commit to DataDog/dd-otel-host-profiler that referenced this pull request Apr 30, 2025
nsavoire added a commit to DataDog/dd-otel-host-profiler that referenced this pull request May 5, 2025
nsavoire added a commit to DataDog/dd-otel-host-profiler that referenced this pull request May 5, 2025
@florianl florianl mentioned this pull request Aug 13, 2025
4 participants