Skip to content

Amend -off-cpu-threshold value#316

Merged
rockdaboot merged 9 commits intoopen-telemetry:mainfrom
rockdaboot:off-cpu-threshold
Jan 22, 2025
Merged

Amend -off-cpu-threshold value#316
rockdaboot merged 9 commits intoopen-telemetry:mainfrom
rockdaboot:off-cpu-threshold

Conversation

@rockdaboot
Copy link
Copy Markdown
Contributor

This amends the -off-cpu-threshold values to

  • 0 means off-cpu profiling is disabled (default)
  • [1..1000] is the %% probability for an off-cpu event to be reported: 1 -> one out of thousand; 1000 -> all.

Before, a value of 1000 disabled off-cpu profiling, thus enabling reporting of all events wasn't possible, which isn't what users might expect. The value of 0 was "valid" but caused a division-by-zero panic.

Additionally, when introducing new configuration variables, a default of non-zero (or empty string) may cause issues in other code bases. For example, our experimental OTEL collector/receiver stopped working, because the newly introduced off-cpu threshold value wasn't initialized with support.OffCPUThresholdMax (not easy to know when not closely following the PRs in this repo). A default of 0 for the off-cpu threshold naturally mitigates this.

@rockdaboot rockdaboot self-assigned this Jan 21, 2025
@rockdaboot rockdaboot requested review from a team as code owners January 21, 2025 08:43
Comment thread tracer/tracer.go
// Try to fit this PID space scaled down with cfg.OffCPUThreshold into
// this map.
adaption["sched_times"] = (4194304 / support.OffCPUThresholdMax) * cfg.OffCPUThreshold
adaption["sched_times"] = (4194304 * cfg.OffCPUThreshold) / support.OffCPUThresholdMax
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiply first to prevent rounding inaccuracies.

@dmathieu
Copy link
Copy Markdown
Member

We should have better unit testing to avoid this kind of regressions

Comment thread cli_flags.go Outdated
Comment thread cli_flags.go Outdated
@rockdaboot
Copy link
Copy Markdown
Contributor Author

We should have better unit testing to avoid this kind of regressions

We should definitely have better unit testing. The current test code coverage for this project is not impressive.

But in the case mentioned in the PR description, the error came from a Cilium ebpf function, hard to trigger/catch in a unit tests. Btw, this PR adds a check for the off-cpu threshold value in controller/config/Validate() - which simplifies creating a unit test for corner cases of configuration values.

Comment thread cli_flags.go Outdated
Comment thread cli_flags.go Outdated
Comment thread cli_flags.go Outdated
Comment thread cli_flags.go
Comment thread cli_flags.go
Comment thread tracer/tracer.go
rockdaboot and others added 2 commits January 22, 2025 08:36
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
@rockdaboot rockdaboot merged commit a7b433f into open-telemetry:main Jan 22, 2025
@rockdaboot rockdaboot deleted the off-cpu-threshold branch January 22, 2025 13:57
@felixge
Copy link
Copy Markdown
Member

felixge commented Feb 6, 2025

Sorry for the late comment on this. Wouldn't it be nicer to define this chance as a rate X where the probability is 1/X?

See https://pkg.go.dev/runtime#SetMutexProfileFraction for prior art that uses this approach.

@rockdaboot
Copy link
Copy Markdown
Contributor Author

Sorry for the late comment on this. Wouldn't it be nicer to define this chance as a rate X where the probability is 1/X?

See https://pkg.go.dev/runtime#SetMutexProfileFraction for prior art that uses this approach.

I agree that the current accepted values limit the user.

Why not directly using a probability 1/X? (For example 0.004)
That avoids explaining the user that X refers to a probability of 1/X.

bhavnajindal added a commit to instana/opentelemetry-ebpf-profiler that referenced this pull request Mar 12, 2025
Sync from upstream (2025-03-12)

Florian Lehner <florianl@users.noreply.github.com> symblib: expose API for single point lookups (open-telemetry#380)
Co-authored-by: GitHub <noreply@github.com>
Tolya Korniltsev <korniltsev.anatoly@gmail.com> chore: remove unused controller.Config fields (open-telemetry#387)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> libpf: drop unused code (open-telemetry#386)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> tracehandler: drop metadataWarnInhib (open-telemetry#385)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> Go: update to go.opentelemetry.io/otel@v1.35.0 (open-telemetry#383)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> processmanager: Don't synchronize a process that's waiting cleanup (open-telemetry#379)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> CI: use latest LTS kernel in tests (open-telemetry#382)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> Makefile: add cargo clean to target clean (open-telemetry#381)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> Switch semantics for process.executable.name (open-telemetry#306)
Co-authored-by: GitHub <noreply@github.com>
Tim Rühsen <tim.ruhsen@elastic.co> Stabilize CI / integration tests (open-telemetry#378)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> Docker fixup (open-telemetry#375)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> Docker: fix rust set up (open-telemetry#371)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> tracer: attach to all kprobes with prefix for off CPU profiling (open-telemetry#370)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> Go: update to Go 1.23 (open-telemetry#372)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> support: generate *ProcInfo types with cgo (open-telemetry#367)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> process: reuse and preallocate memory (open-telemetry#355)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> rust: preparations to integrate Rust (open-telemetry#360)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> Switch to OTel metrics (open-telemetry#348)
Co-authored-by: GitHub <noreply@github.com>
Tolya Korniltsev <korniltsev.anatoly@gmail.com> cargo: remove unused workspace dependency declarations (open-telemetry#364)
Co-authored-by: GitHub <noreply@github.com>
Tolya Korniltsev <korniltsev.anatoly@gmail.com> reporter: add custom gRPC dial options (open-telemetry#363)
Co-authored-by: GitHub <noreply@github.com>
umanwizard <brennan@umanwizard.com> Various fixes to node/V8 (open-telemetry#333)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> doc: fix path of tooling (open-telemetry#361)
Co-authored-by: GitHub <noreply@github.com>
OpenTelemetry Bot <107717825+opentelemetrybot@users.noreply.github.com> Add FOSSA scanning workflow (open-telemetry#357)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> rust: use macro for debug output (open-telemetry#356)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> symblib/gosym: add single point lookup (open-telemetry#346)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> README: provide devfiler v0.14.0 (open-telemetry#354)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> CI: skip environment setup (open-telemetry#353)
Co-authored-by: GitHub <noreply@github.com>
Richard Chukwu <79311274+RichardChukwu@users.noreply.github.com> Improve contributor guide (open-telemetry#349)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> Fix build (open-telemetry#350)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> processinfo: refactor process metadata (open-telemetry#344)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> reporter/pdata: do no generate profiles if there are no events (open-telemetry#347)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> README: provide devfiler v0.13.0 (open-telemetry#343)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> processmanager: Fix process exit regression (open-telemetry#337) (open-telemetry#338)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> libpf: drop Hash64 (open-telemetry#340)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> cargo: set license field (open-telemetry#336)
Co-authored-by: GitHub <noreply@github.com>
Damien Mathieu <42@dmathieu.com> Use dummy support for any non-arm64 and non-amd64 archs (open-telemetry#335)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> rust: drop anyhow dependency (open-telemetry#334)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> support: use cgo to generate Go constants from eBPF (open-telemetry#332)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> processmanager: Don't log inside critical areas (open-telemetry#328)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> CI: add test for Rust components (open-telemetry#326)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> processmanager: simplify API and return early (open-telemetry#325)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> Add Rust native symbolization library and C API wrapper (open-telemetry#267)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> Metrics for trace event perf event monitor (open-telemetry#322)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> Delayed processing for ProcessManager.pidToProcessInfo (open-telemetry#321)
Co-authored-by: GitHub <noreply@github.com>
Christos Kalkanis <christos.kalkanis@elastic.co> Rework SymbolizationComplete (open-telemetry#307)
Co-authored-by: GitHub <noreply@github.com>
Tim Rühsen <tim.ruhsen@elastic.co> Amend -off-cpu-threshold value (open-telemetry#316)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> reporter/collector: fix reporting issue (open-telemetry#319)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> reporter: move pkg samples from internal to public (open-telemetry#314)
Co-authored-by: GitHub <noreply@github.com>
Florian Lehner <florianl@users.noreply.github.com> README: provide devfiler v0.11.0 (open-telemetry#313)
Co-authored-by: GitHub <noreply@github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants