Delayed processing for ProcessManager.pidToProcessInfo #321
The old "symbolize now" mechanism is no longer needed.
| return | ||
| } | ||
|
|
||
| // Delete all entries we have for this particular PID from pid_page_to_mapping_info. |
I kept this cleanup here as there's no immediate need to postpone cleaning up the eBPF map until traceCaptureKTime >= pidExitKtime (unlike pidToProcessInfo). This also speeds up execution of ProcessedUntil compared to having the map cleanup take place there.
| // NOTE: Exported only for tracer. | ||
| func (pm *ProcessManager) ProcessPIDExit(pid libpf.PID) bool { | ||
| func (pm *ProcessManager) ProcessPIDExit(pid libpf.PID) { | ||
| exitKTime := times.GetKTime() |
Moved this outside the lock for improved accuracy (there's a debug log in ProcessedUntil that prints exit latency).
Uses ProcessedUntil mechanism to guarantee that process metadata is not discarded before all relevant trace events have been processed.
5629d4b to a8c3852
| return symbolize | ||
| return | ||
| } | ||
| if pidExited { |
We don't want to repeat the cleanup for a PID if we've already performed it.
| return serviceName | ||
| } | ||
|
|
||
| func (pm *ProcessManager) SymbolizationComplete(traceCaptureKTime times.KTime) { |
Moved to processinfo.go for consistency (all pidToProcessInfo accessors in one place), renamed to ProcessedUntil and updated to also clean up pidToProcessInfo.
| address, pid, err) | ||
| } | ||
| } | ||
| delete(pm.pidToProcessInfo, pid) |
This is now taking place in ProcessedUntil, delayed until traceCaptureKTime >= exitKTime.
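The delayed cleanup described here can be sketched roughly as follows. This is a simplified model, not the code from the PR: `PID` and `KTime` stand in for `libpf.PID` and `times.KTime`, the `processManager` struct and its `string` value type are hypothetical, and only the `traceCaptureKTime >= exitKTime` condition is taken from the discussion.

```go
package main

import (
	"fmt"
	"sync"
)

// PID and KTime are simplified stand-ins for libpf.PID and times.KTime.
type PID uint32
type KTime int64

// processManager is a hypothetical, stripped-down model of ProcessManager.
type processManager struct {
	mu               sync.Mutex
	pidToProcessInfo map[PID]string // value type is a placeholder
	exitEvents       map[PID]KTime  // exit timestamps awaiting cleanup
}

// ProcessedUntil drops metadata for every PID whose recorded exit time is
// not later than traceCaptureKTime, meaning all traces captured before the
// exit have already been handled.
func (pm *processManager) ProcessedUntil(traceCaptureKTime KTime) {
	pm.mu.Lock()
	defer pm.mu.Unlock()

	for pid, exitKTime := range pm.exitEvents {
		if exitKTime > traceCaptureKTime {
			continue // traces from before the exit may still be pending
		}
		delete(pm.pidToProcessInfo, pid)
		delete(pm.exitEvents, pid)
	}
}

func main() {
	pm := &processManager{
		pidToProcessInfo: map[PID]string{1: "early", 2: "late"},
		exitEvents:       map[PID]KTime{1: 100, 2: 200},
	}
	pm.ProcessedUntil(150)
	_, stillTracked := pm.pidToProcessInfo[2]
	fmt.Println("PID 2 metadata retained:", stillTracked)
}
```

Keeping the entry until the capture time passes the exit time means a trace captured just before the process exited can still be matched to its metadata.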
| return newTrace | ||
| } | ||
|
|
||
| // findMappingForTrace locates the mapping for a given host trace. |
Moved without changes to processinfo.go for consistency.
| if len(pm.interpreters[pid]) > 0 { | ||
| pidExited := false | ||
| info, pidExists := pm.pidToProcessInfo[pid] | ||
| if pidExists || (pm.interpreterTracerEnabled && |
Essentially the same logic as before, with these additions:
- Don't add exitKTime to pm.exitEvents if it already exists.
- Also add exitKTime to pm.exitEvents if pm.pidToProcessInfo[pid] exists, as we want to clean up the latter in a delayed fashion.
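The two additions above might look like this in isolation. It is a sketch under assumptions: `recordPIDExit` is a hypothetical helper (the real `ProcessPIDExit` no longer returns a value after this PR), the struct is reduced to the fields mentioned in the thread, and the placeholder value types are invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// PID and KTime are simplified stand-ins for libpf.PID and times.KTime.
type PID uint32
type KTime int64

// processManager is a hypothetical model holding only the fields named
// in the review thread; value types are placeholders.
type processManager struct {
	mu                       sync.Mutex
	interpreterTracerEnabled bool
	interpreters             map[PID][]string
	pidToProcessInfo         map[PID]string
	exitEvents               map[PID]KTime
}

// recordPIDExit (hypothetical) stores exitKTime for pid when there is state
// to clean up later and no exit is already pending. It reports whether a
// new exit event was recorded.
func (pm *processManager) recordPIDExit(pid PID, exitKTime KTime) bool {
	pm.mu.Lock()
	defer pm.mu.Unlock()

	_, pidExists := pm.pidToProcessInfo[pid]
	hasInterpreters := pm.interpreterTracerEnabled && len(pm.interpreters[pid]) > 0
	if !pidExists && !hasInterpreters {
		return false // nothing is tracked for this PID
	}
	if _, pidExited := pm.exitEvents[pid]; pidExited {
		return false // exit already recorded; keep the first timestamp
	}
	pm.exitEvents[pid] = exitKTime
	return true
}

func main() {
	pm := &processManager{
		interpreters:     map[PID][]string{},
		pidToProcessInfo: map[PID]string{10: "info"},
		exitEvents:       map[PID]KTime{},
	}
	fmt.Println("first exit recorded:", pm.recordPIDExit(10, 500))
	fmt.Println("repeat exit recorded:", pm.recordPIDExit(10, 600))
}
```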
| continue | ||
| } | ||
|
|
||
| delete(pm.pidToProcessInfo, pid) |
Same logic as before with this single-line addition.
florianl left a comment
first look with some comments
|
|
||
| info, ok := pm.pidToProcessInfo[pid] | ||
| if !ok { | ||
| if !pidExists { |
To keep the global read & write lock as short as possible, the if !pidExists {..} part should be moved before if pidExists || (pm.interpreterTracerEnabled && len(pm.interpreters[pid]) > 0) {..}.
That would prevent executing if _, pidExited = ... in case (pm.interpreterTracerEnabled && len(pm.interpreters[pid]) > 0) is true.
As Tim wrote, this would alter the logic. I tried to keep as much of the original semantics the same to avoid introducing new races. Maybe here it's possible to safely say that if !pidExists then it's OK not to write exitKTime in pm.exitEvents but we'd need to carefully examine all subsystem interactions, check for race conditions etc.
| return symbolize | ||
| return | ||
| } | ||
| if pidExited { |
I think pidExited should be renamed to pidExitProcessed or something similar; this would make it obvious that we want to avoid duplicate work.
| pm.mu.Lock() | ||
| defer pm.mu.Unlock() | ||
|
|
||
| nowKTime := times.GetKTime() | ||
| log.Debugf("ProcessedUntil captureKT: %v latency: %v ms", | ||
| traceCaptureKTime, (nowKTime-traceCaptureKTime)/1e6) | ||
|
|
Keep the lock held as briefly as possible by acquiring it after the debug log:
| pm.mu.Lock() | |
| defer pm.mu.Unlock() | |
| nowKTime := times.GetKTime() | |
| log.Debugf("ProcessedUntil captureKT: %v latency: %v ms", | |
| traceCaptureKTime, (nowKTime-traceCaptureKTime)/1e6) | |
| nowKTime := times.GetKTime() | |
| log.Debugf("ProcessedUntil captureKT: %v latency: %v ms", | |
| traceCaptureKTime, (nowKTime-traceCaptureKTime)/1e6) | |
| pm.mu.Lock() | |
| defer pm.mu.Unlock() |
This can affect the latency measurement, since we're timing before the lock.
Sync from upstream (2025-03-12)
Every entry below carries the trailer Co-authored-by: GitHub <noreply@github.com>.
- Florian Lehner <florianl@users.noreply.github.com> symblib: expose API for single point lookups (open-telemetry#380)
- Tolya Korniltsev <korniltsev.anatoly@gmail.com> chore: remove unused controller.Config fields (open-telemetry#387)
- Florian Lehner <florianl@users.noreply.github.com> libpf: drop unused code (open-telemetry#386)
- Florian Lehner <florianl@users.noreply.github.com> tracehandler: drop metadataWarnInhib (open-telemetry#385)
- Florian Lehner <florianl@users.noreply.github.com> Go: update to go.opentelemetry.io/otel@v1.35.0 (open-telemetry#383)
- Christos Kalkanis <christos.kalkanis@elastic.co> processmanager: Don't synchronize a process that's waiting cleanup (open-telemetry#379)
- Florian Lehner <florianl@users.noreply.github.com> CI: use latest LTS kernel in tests (open-telemetry#382)
- Florian Lehner <florianl@users.noreply.github.com> Makefile: add cargo clean to target clean (open-telemetry#381)
- Christos Kalkanis <christos.kalkanis@elastic.co> Switch semantics for process.executable.name (open-telemetry#306)
- Tim Rühsen <tim.ruhsen@elastic.co> Stabilize CI / integration tests (open-telemetry#378)
- Florian Lehner <florianl@users.noreply.github.com> Docker fixup (open-telemetry#375)
- Florian Lehner <florianl@users.noreply.github.com> Docker: fix rust set up (open-telemetry#371)
- Florian Lehner <florianl@users.noreply.github.com> tracer: attach to all kprobes with prefix for off CPU profiling (open-telemetry#370)
- Florian Lehner <florianl@users.noreply.github.com> Go: update to Go 1.23 (open-telemetry#372)
- Florian Lehner <florianl@users.noreply.github.com> support: generate *ProcInfo types with cgo (open-telemetry#367)
- Florian Lehner <florianl@users.noreply.github.com> process: reuse and preallocate memory (open-telemetry#355)
- Florian Lehner <florianl@users.noreply.github.com> rust: preparations to integrate Rust (open-telemetry#360)
- Christos Kalkanis <christos.kalkanis@elastic.co> Switch to OTel metrics (open-telemetry#348)
- Tolya Korniltsev <korniltsev.anatoly@gmail.com> cargo: remove unused workspace dependency declarations (open-telemetry#364)
- Tolya Korniltsev <korniltsev.anatoly@gmail.com> reporter: add custom gRPC dial options (open-telemetry#363)
- umanwizard <brennan@umanwizard.com> Various fixes to node/V8 (open-telemetry#333)
- Florian Lehner <florianl@users.noreply.github.com> doc: fix path of tooling (open-telemetry#361)
- OpenTelemetry Bot <107717825+opentelemetrybot@users.noreply.github.com> Add FOSSA scanning workflow (open-telemetry#357)
- Florian Lehner <florianl@users.noreply.github.com> rust: use macro for debug output (open-telemetry#356)
- Florian Lehner <florianl@users.noreply.github.com> symblib/gosym: add single point lookup (open-telemetry#346)
- Florian Lehner <florianl@users.noreply.github.com> README: provide devfiler v0.14.0 (open-telemetry#354)
- Florian Lehner <florianl@users.noreply.github.com> CI: skip environment setup (open-telemetry#353)
- Richard Chukwu <79311274+RichardChukwu@users.noreply.github.com> Improve contributor guide (open-telemetry#349)
- Christos Kalkanis <christos.kalkanis@elastic.co> Fix build (open-telemetry#350)
- Christos Kalkanis <christos.kalkanis@elastic.co> processinfo: refactor process metadata (open-telemetry#344)
- Florian Lehner <florianl@users.noreply.github.com> reporter/pdata: do no generate profiles if there are no events (open-telemetry#347)
- Florian Lehner <florianl@users.noreply.github.com> README: provide devfiler v0.13.0 (open-telemetry#343)
- Christos Kalkanis <christos.kalkanis@elastic.co> processmanager: Fix process exit regression (open-telemetry#337) (open-telemetry#338)
- Florian Lehner <florianl@users.noreply.github.com> libpf: drop Hash64 (open-telemetry#340)
- Florian Lehner <florianl@users.noreply.github.com> cargo: set license field (open-telemetry#336)
- Damien Mathieu <42@dmathieu.com> Use dummy support for any non-arm64 and non-amd64 archs (open-telemetry#335)
- Florian Lehner <florianl@users.noreply.github.com> rust: drop anyhow dependency (open-telemetry#334)
- Florian Lehner <florianl@users.noreply.github.com> support: use cgo to generate Go constants from eBPF (open-telemetry#332)
- Christos Kalkanis <christos.kalkanis@elastic.co> processmanager: Don't log inside critical areas (open-telemetry#328)
- Florian Lehner <florianl@users.noreply.github.com> CI: add test for Rust components (open-telemetry#326)
- Florian Lehner <florianl@users.noreply.github.com> processmanager: simplify API and return early (open-telemetry#325)
- Christos Kalkanis <christos.kalkanis@elastic.co> Add Rust native symbolization library and C API wrapper (open-telemetry#267)
- Christos Kalkanis <christos.kalkanis@elastic.co> Metrics for trace event perf event monitor (open-telemetry#322)
- Christos Kalkanis <christos.kalkanis@elastic.co> Delayed processing for ProcessManager.pidToProcessInfo (open-telemetry#321)
- Christos Kalkanis <christos.kalkanis@elastic.co> Rework SymbolizationComplete (open-telemetry#307)
- Tim Rühsen <tim.ruhsen@elastic.co> Amend -off-cpu-threshold value (open-telemetry#316)
- Florian Lehner <florianl@users.noreply.github.com> reporter/collector: fix reporting issue (open-telemetry#319)
- Florian Lehner <florianl@users.noreply.github.com> reporter: move pkg samples from internal to public (open-telemetry#314)
- Florian Lehner <florianl@users.noreply.github.com> README: provide devfiler v0.11.0 (open-telemetry#313)
Summary
- SymbolizationComplete renamed to ProcessedUntil and moved to processinfo.go
- Delayed ProcessManager.pidToProcessInfo cleanup

Leverages #307 to ensure that process metadata is not discarded before all relevant trace events have been processed.
Fixes #278.
You may find reviewing commit-by-commit to be simpler.