
Delayed processing for ProcessManager.pidToProcessInfo #321

Merged
christos68k merged 6 commits into main from ck/processed-until
Jan 24, 2025

Conversation

Member

@christos68k christos68k commented Jan 22, 2025

Summary

  • Renamed SymbolizationComplete to ProcessedUntil and moved to processinfo.go
  • Dropped no longer needed "symbolize now" remnants
  • Implemented delayed ProcessManager.pidToProcessInfo cleanup

Leverages #307 to ensure that process metadata is not discarded before all relevant trace events have been processed.

Fixes #278.

You may find reviewing commit-by-commit to be simpler.
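
The delayed cleanup described in the summary can be sketched roughly as follows. This is a toy model, not the code from this PR: the `manager` type, its fields, and the lowercase method names are hypothetical stand-ins, locking is omitted, and timestamps are plain integers instead of `times.KTime`.

```go
package main

import "fmt"

// KTime is a stand-in for a kernel monotonic timestamp (assumption).
type KTime int64

// manager is a hypothetical, simplified model of the process manager.
type manager struct {
	pidToInfo  map[int]string // process metadata keyed by PID
	exitEvents map[int]KTime  // exit timestamps of PIDs awaiting cleanup
}

func newManager() *manager {
	return &manager{pidToInfo: map[int]string{}, exitEvents: map[int]KTime{}}
}

// processPIDExit only records when the PID exited; the metadata stays
// available for trace events that were captured before the exit.
func (m *manager) processPIDExit(pid int, exitKTime KTime) {
	if _, ok := m.pidToInfo[pid]; ok {
		m.exitEvents[pid] = exitKTime
	}
}

// processedUntil drops metadata for every PID whose exit timestamp is not
// later than the capture time of the most recently processed trace event.
func (m *manager) processedUntil(traceCaptureKTime KTime) {
	for pid, exitKTime := range m.exitEvents {
		if traceCaptureKTime >= exitKTime {
			delete(m.pidToInfo, pid)
			delete(m.exitEvents, pid)
		}
	}
}

func main() {
	m := newManager()
	m.pidToInfo[42] = "/usr/bin/example"
	m.processPIDExit(42, 100)

	m.processedUntil(50) // traces up to KTime 50 processed: too early to clean up
	_, ok := m.pidToInfo[42]
	fmt.Println("metadata present after ProcessedUntil(50):", ok)

	m.processedUntil(100) // all traces up to the exit time have been processed
	_, ok = m.pidToInfo[42]
	fmt.Println("metadata present after ProcessedUntil(100):", ok)
}
```

The point of the sketch is only the ordering guarantee: metadata for an exited PID survives until the trace pipeline reports that it has caught up with the exit timestamp.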

@christos68k christos68k self-assigned this Jan 22, 2025
@christos68k christos68k requested review from a team as code owners January 22, 2025 23:45
return
}

// Delete all entries we have for this particular PID from pid_page_to_mapping_info.
Member Author

@christos68k christos68k Jan 22, 2025

I kept this cleanup here as there's no immediate need to postpone cleaning up the eBPF map until traceCaptureKTime >= pidExitKtime (unlike pidToProcessInfo). This also speeds up execution of ProcessedUntil compared to having the map cleanup take place there.

// NOTE: Exported only for tracer.
- func (pm *ProcessManager) ProcessPIDExit(pid libpf.PID) bool {
+ func (pm *ProcessManager) ProcessPIDExit(pid libpf.PID) {
exitKTime := times.GetKTime()
Member Author

Moved this outside the lock for improved accuracy (there's a debug log in ProcessedUntil that prints exit latency).

Uses ProcessedUntil mechanism to guarantee that process metadata
is not discarded before all relevant trace events have been
processed.
Comment thread processmanager/processinfo.go Outdated
- return symbolize
+ return
}
if pidExited {
Member Author

We don't want to repeat the cleanup for a PID if we've already performed it.

Comment thread processmanager/manager.go
return serviceName
}

func (pm *ProcessManager) SymbolizationComplete(traceCaptureKTime times.KTime) {
Member Author

@christos68k christos68k Jan 23, 2025

Moved to processinfo.go for consistency (all pidToProcessInfo accessors in one place), renamed to ProcessedUntil, and updated to also clean up pidToProcessInfo.

address, pid, err)
}
}
delete(pm.pidToProcessInfo, pid)
Member Author

@christos68k christos68k Jan 23, 2025

This is now taking place in ProcessedUntil, delayed until traceCaptureKTime >= exitKTime.

Comment thread processmanager/manager.go
return newTrace
}

// findMappingForTrace locates the mapping for a given host trace.
Member Author

Moved without changes to processinfo.go for consistency.

if len(pm.interpreters[pid]) > 0 {
pidExited := false
info, pidExists := pm.pidToProcessInfo[pid]
if pidExists || (pm.interpreterTracerEnabled &&
Member Author

Essentially the same logic as before, with these additions:

  1. Don't add exitKTime to pm.exitEvents if an entry already exists.
  2. Also add exitKTime to pm.exitEvents if pm.pidToProcessInfo[pid] exists, as we want to clean up the latter in a delayed fashion.

continue
}

delete(pm.pidToProcessInfo, pid)
Member Author

Same logic as before with this single-line addition.

Member

@florianl florianl left a comment

first look with some comments


- info, ok := pm.pidToProcessInfo[pid]
- if !ok {
+ if !pidExists {
Member

To keep the global read & write lock as short as possible, the if !pidExists {..} part should be moved before if pidExists || (pm.interpreterTracerEnabled && len(pm.interpreters[pid]) > 0) {..}.

Contributor

That would prevent executing if _, pidExited = ... when (pm.interpreterTracerEnabled && len(pm.interpreters[pid]) > 0) is true.

Member Author

@christos68k christos68k Jan 23, 2025

As Tim wrote, this would alter the logic. I tried to preserve as much of the original semantics as possible to avoid introducing new races. It may be safe here to skip writing exitKTime to pm.exitEvents when !pidExists, but we'd need to carefully examine all subsystem interactions, check for race conditions, etc.

Member

Follow-up is done in #325.

Comment thread processmanager/processinfo.go Outdated
- return symbolize
+ return
}
if pidExited {
Member

I think pidExited should be renamed to pidExitProcessed or something similar; this would make it obvious that we want to avoid duplicate work.

Member Author

Renamed

Comment on lines +704 to +710
pm.mu.Lock()
defer pm.mu.Unlock()

nowKTime := times.GetKTime()
log.Debugf("ProcessedUntil captureKT: %v latency: %v ms",
traceCaptureKTime, (nowKTime-traceCaptureKTime)/1e6)

Member

Keep the lock held for as short a time as possible:

Suggested change
- pm.mu.Lock()
- defer pm.mu.Unlock()
- nowKTime := times.GetKTime()
- log.Debugf("ProcessedUntil captureKT: %v latency: %v ms",
- traceCaptureKTime, (nowKTime-traceCaptureKTime)/1e6)
+ nowKTime := times.GetKTime()
+ log.Debugf("ProcessedUntil captureKT: %v latency: %v ms",
+ traceCaptureKTime, (nowKTime-traceCaptureKTime)/1e6)
+ pm.mu.Lock()
+ defer pm.mu.Unlock()

Member Author

This can affect the latency measurement, since we're timing before the lock.
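
A small, deterministic sketch of why the sampling position matters. Everything here is hypothetical scaffolding: `fakeClock` replaces `times.GetKTime`, and `slowLock` is a fake `sync.Locker` whose `Lock` simply advances the clock to simulate time spent waiting on a contended mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeClock is a deterministic stand-in for a monotonic clock.
type fakeClock struct{ now int64 }

func (c *fakeClock) Now() int64 { return c.now }

// slowLock simulates contention: acquiring it "takes" wait time units.
type slowLock struct {
	clock *fakeClock
	wait  int64
}

func (l *slowLock) Lock()   { l.clock.now += l.wait }
func (l *slowLock) Unlock() {}

// latencyInsideLock samples the clock after the lock is acquired, so the
// result includes the time spent waiting for the lock.
func latencyInsideLock(mu sync.Locker, now func() int64, capture int64) int64 {
	mu.Lock()
	defer mu.Unlock()
	return now() - capture
}

// latencyOutsideLock samples the clock before the lock is acquired, so the
// result excludes the time spent waiting for the lock.
func latencyOutsideLock(mu sync.Locker, now func() int64, capture int64) int64 {
	sampled := now()
	mu.Lock()
	defer mu.Unlock()
	return sampled - capture
}

func main() {
	const capture = 900

	c1 := &fakeClock{now: 1000}
	fmt.Println(latencyInsideLock(&slowLock{clock: c1, wait: 50}, c1.Now, capture)) // prints 150

	c2 := &fakeClock{now: 1000}
	fmt.Println(latencyOutsideLock(&slowLock{clock: c2, wait: 50}, c2.Now, capture)) // prints 100
}
```

Which of the two numbers is the "right" latency depends on what the debug log is meant to report; the sketch only shows that the two placements are not equivalent under contention.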

Comment thread processmanager/processinfo.go
@christos68k christos68k merged commit f094776 into main Jan 24, 2025
@christos68k christos68k deleted the ck/processed-until branch January 24, 2025 13:28
bhavnajindal added a commit to instana/opentelemetry-ebpf-profiler that referenced this pull request Mar 12, 2025
Sync from upstream (2025-03-12)

Florian Lehner <florianl@users.noreply.github.com> symblib: expose API for single point lookups (open-telemetry#380)
Tolya Korniltsev <korniltsev.anatoly@gmail.com> chore: remove unused controller.Config fields (open-telemetry#387)
Florian Lehner <florianl@users.noreply.github.com> libpf: drop unused code (open-telemetry#386)
Florian Lehner <florianl@users.noreply.github.com> tracehandler: drop metadataWarnInhib (open-telemetry#385)
Florian Lehner <florianl@users.noreply.github.com> Go: update to go.opentelemetry.io/otel@v1.35.0 (open-telemetry#383)
Christos Kalkanis <christos.kalkanis@elastic.co> processmanager: Don't synchronize a process that's waiting cleanup (open-telemetry#379)
Florian Lehner <florianl@users.noreply.github.com> CI: use latest LTS kernel in tests (open-telemetry#382)
Florian Lehner <florianl@users.noreply.github.com> Makefile: add cargo clean to target clean (open-telemetry#381)
Christos Kalkanis <christos.kalkanis@elastic.co> Switch semantics for process.executable.name (open-telemetry#306)
Tim Rühsen <tim.ruhsen@elastic.co> Stabilize CI / integration tests (open-telemetry#378)
Florian Lehner <florianl@users.noreply.github.com> Docker fixup (open-telemetry#375)
Florian Lehner <florianl@users.noreply.github.com> Docker: fix rust set up (open-telemetry#371)
Florian Lehner <florianl@users.noreply.github.com> tracer: attach to all kprobes with prefix for off CPU profiling (open-telemetry#370)
Florian Lehner <florianl@users.noreply.github.com> Go: update to Go 1.23 (open-telemetry#372)
Florian Lehner <florianl@users.noreply.github.com> support: generate *ProcInfo types with cgo (open-telemetry#367)
Florian Lehner <florianl@users.noreply.github.com> process: reuse and preallocate memory (open-telemetry#355)
Florian Lehner <florianl@users.noreply.github.com> rust: preparations to integrate Rust (open-telemetry#360)
Christos Kalkanis <christos.kalkanis@elastic.co> Switch to OTel metrics (open-telemetry#348)
Tolya Korniltsev <korniltsev.anatoly@gmail.com> cargo: remove unused workspace dependency declarations (open-telemetry#364)
Tolya Korniltsev <korniltsev.anatoly@gmail.com> reporter: add custom gRPC dial options (open-telemetry#363)
umanwizard <brennan@umanwizard.com> Various fixes to node/V8 (open-telemetry#333)
Florian Lehner <florianl@users.noreply.github.com> doc: fix path of tooling (open-telemetry#361)
OpenTelemetry Bot <107717825+opentelemetrybot@users.noreply.github.com> Add FOSSA scanning workflow (open-telemetry#357)
Florian Lehner <florianl@users.noreply.github.com> rust: use macro for debug output (open-telemetry#356)
Florian Lehner <florianl@users.noreply.github.com> symblib/gosym: add single point lookup (open-telemetry#346)
Florian Lehner <florianl@users.noreply.github.com> README: provide devfiler v0.14.0 (open-telemetry#354)
Florian Lehner <florianl@users.noreply.github.com> CI: skip environment setup (open-telemetry#353)
Richard Chukwu <79311274+RichardChukwu@users.noreply.github.com> Improve contributor guide (open-telemetry#349)
Christos Kalkanis <christos.kalkanis@elastic.co> Fix build (open-telemetry#350)
Christos Kalkanis <christos.kalkanis@elastic.co> processinfo: refactor process metadata (open-telemetry#344)
Florian Lehner <florianl@users.noreply.github.com> reporter/pdata: do no generate profiles if there are no events (open-telemetry#347)
Florian Lehner <florianl@users.noreply.github.com> README: provide devfiler v0.13.0 (open-telemetry#343)
Christos Kalkanis <christos.kalkanis@elastic.co> processmanager: Fix process exit regression (open-telemetry#337) (open-telemetry#338)
Florian Lehner <florianl@users.noreply.github.com> libpf: drop Hash64 (open-telemetry#340)
Florian Lehner <florianl@users.noreply.github.com> cargo: set license field (open-telemetry#336)
Damien Mathieu <42@dmathieu.com> Use dummy support for any non-arm64 and non-amd64 archs (open-telemetry#335)
Florian Lehner <florianl@users.noreply.github.com> rust: drop anyhow dependency (open-telemetry#334)
Florian Lehner <florianl@users.noreply.github.com> support: use cgo to generate Go constants from eBPF (open-telemetry#332)
Christos Kalkanis <christos.kalkanis@elastic.co> processmanager: Don't log inside critical areas (open-telemetry#328)
Florian Lehner <florianl@users.noreply.github.com> CI: add test for Rust components (open-telemetry#326)
Florian Lehner <florianl@users.noreply.github.com> processmanager: simplify API and return early (open-telemetry#325)
Christos Kalkanis <christos.kalkanis@elastic.co> Add Rust native symbolization library and C API wrapper (open-telemetry#267)
Christos Kalkanis <christos.kalkanis@elastic.co> Metrics for trace event perf event monitor (open-telemetry#322)
Christos Kalkanis <christos.kalkanis@elastic.co> Delayed processing for ProcessManager.pidToProcessInfo (open-telemetry#321)
Christos Kalkanis <christos.kalkanis@elastic.co> Rework SymbolizationComplete (open-telemetry#307)
Tim Rühsen <tim.ruhsen@elastic.co> Amend -off-cpu-threshold value (open-telemetry#316)
Florian Lehner <florianl@users.noreply.github.com> reporter/collector: fix reporting issue (open-telemetry#319)
Florian Lehner <florianl@users.noreply.github.com> reporter: move pkg samples from internal to public (open-telemetry#314)
Florian Lehner <florianl@users.noreply.github.com> README: provide devfiler v0.11.0 (open-telemetry#313)

Co-authored-by: GitHub <noreply@github.com>

Development

Successfully merging this pull request may close these issues.

Sending executable path for processes that have exited

3 participants