Recover python frames when BPF fails to read PyCodeObject by gnurizen · Pull Request #1278 · open-telemetry/opentelemetry-ebpf-profiler

gnurizen · 2026-03-23T13:41:50Z

When bpf_probe_read_user fails to read a PyCodeObject (e.g. page
swapped out), push the frame with codeobject_id=0 instead of aborting
the unwind. This preserves the rest of the stack trace.

On the agent side, handle ebpfChecksum=0 in getCodeObject by skipping
the LRU cache (no checksum to validate against) and the staleness
check (no BPF reference to compare). The agent reads the code object
via process_vm_readv which supports page faults, so it can succeed
where BPF could not. Store the calculated checksum in the cache so
subsequent frames with a real BPF checksum can match.
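The agent-side behaviour described above can be sketched roughly as follows. This is a minimal model, not the profiler's actual code: `codeObject`, `reader`, `cache`, and this `getCodeObject` signature are simplified stand-ins invented for illustration, and a plain map replaces the real LRU.

```go
package main

import (
	"errors"
	"fmt"
)

// codeObject is a hypothetical, simplified stand-in for the agent's
// cached Python code object metadata.
type codeObject struct {
	name     string
	checksum uint32
}

// reader abstracts the userspace read of a code object from the target
// process (assumed here; the real agent reads remote process memory).
type reader func(addr uint64) (codeObject, error)

// cache is a plain map standing in for the real LRU cache.
type cache map[uint64]codeObject

// getCodeObject models the lookup: ebpfChecksum == 0 means the kernel-side
// read failed, so there is no checksum to validate a cached entry against.
func getCodeObject(c cache, read reader, addr uint64, ebpfChecksum uint32) (codeObject, error) {
	if ebpfChecksum != 0 {
		if co, ok := c[addr]; ok && co.checksum == ebpfChecksum {
			return co, nil
		}
	}
	co, err := read(addr)
	if err != nil {
		return codeObject{}, err
	}
	// The staleness check only applies when BPF supplied a reference checksum.
	if ebpfChecksum != 0 && co.checksum != ebpfChecksum {
		return codeObject{}, errors.New("code object changed since BPF read it")
	}
	// Store the checksum calculated in userspace so later frames that carry
	// a real BPF checksum can hit the cache.
	c[addr] = co
	return co, nil
}

func main() {
	reads := 0
	read := func(addr uint64) (codeObject, error) {
		reads++
		return codeObject{name: "train_step", checksum: 0xabcd}, nil
	}
	c := cache{}
	co, _ := getCodeObject(c, read, 0x1000, 0) // BPF read failed: cache is skipped
	fmt.Println(co.name, reads)
	co, _ = getCodeObject(c, read, 0x1000, 0xabcd) // later frame with a real checksum
	fmt.Println(co.name, reads)
}
```

In this model the second lookup is served from the cache, because the first lookup stored the checksum computed in userspace.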

christos68k

LGTM

christos68k · 2026-03-31T23:13:57Z

-		if m.ebpfChecksum == ebpfChecksum {
-			return m, nil
+	if ebpfChecksum != 0 {
+		if value, ok := p.addrToCodeObject.Get(addr); ok {


Suggested change

-		if value, ok := p.addrToCodeObject.Get(addr); ok {
+		// A zero checksum indicates code object read failed during the kernel read attempt (e.g. paged out).
+		if value, ok := p.addrToCodeObject.Get(addr); ok {

Looks like the previous suggestion got wiped in the rebase?

Suggested change

-		if value, ok := p.addrToCodeObject.Get(addr); ok {
+		// A zero checksum indicates code object read failed in the kernel (e.g. paged out).
+		if value, ok := p.addrToCodeObject.Get(addr); ok {

christos68k · 2026-03-31T23:15:41Z

    increment_metric(metricID_UnwindPythonErrBadCodeObjectArgCountAddr);
-    return ERR_PYTHON_BAD_CODE_OBJECT_ADDR;
+    // Push the frame with the code object address so the agent can try to
+    // read it via /proc/pid/mem (which supports page faults unlike BPF).


Nit:

Suggested change

-    // read it via /proc/pid/mem (which supports page faults unlike BPF).
+    // read it in userspace (which can take page faults unlike BPF).

Same here:

Suggested change

-    // read it via /proc/pid/mem (which supports page faults unlike BPF).
+    // read it in userspace (which can take page faults unlike BPF).

christos68k · 2026-03-31T23:21:43Z

Doesn't this introduce the possibility of a TOCTOU race (leading to a stacktrace with wrong frames) if the codeobject we read in userspace isn't the one we failed to read in the kernel? Is this purely theoretical / highly infrequent?

If it can happen, the tradeoff is traces with wrong frames vs failed unwinds.

florianl

As @christos68k pointed out in #1278 (comment), I also think this opens the possibility of a TOCTOU race condition.
My personal preference is to report failed unwindings over traces with wrong frames.

gnurizen · 2026-04-01T11:15:06Z

To be fair, the old code had a TOCTOU race as well; the argcount/lineno/flags bits aren't a guarantee. The problem is that on 200-CPU servers with ridiculous amounts of RAM this problem (soft page fault) can be surprisingly common. We saw high occurrences of this unwinding error. If we terminate the unwind, the page is never faulted in and the read of the same address fails over and over (this is a long-running training process where frames in the middle of the stack can be sitting there for hours). This fix causes them to get faulted in, so future unwindings of the same stack succeed. I don't know how likely the TOCTOU hazard of Python GC'ing the activation object and a different function taking its place is, but I suspect it's a harmless anomaly that would happen exceedingly rarely. The problem I'm fixing was causing unwinding to fail upwards of 30% of the time on this workload.

Maybe we should just report the faulting address to the UA and have it clear the cache but preserve the current aborted unwind. That would work too, but I'm not convinced it's materially better.
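The fault-in argument above can be illustrated with a toy model. Everything here is an assumption made for illustration: `page`, `kernelRead`, `userRead`, and `sample` are hypothetical names, and the model ignores that a page could be evicted again between samples.

```go
package main

import "fmt"

// page models one page of target-process memory holding a code object.
type page struct {
	resident bool
	checksum uint32
}

// kernelRead models a BPF-side read: it cannot take page faults, so it
// fails whenever the page is not resident.
func kernelRead(p *page) (uint32, bool) {
	if !p.resident {
		return 0, false
	}
	return p.checksum, true
}

// userRead models the agent's userspace read: the access faults the page in.
func userRead(p *page) uint32 {
	p.resident = true
	return p.checksum
}

// sample unwinds the same long-lived frame n times and counts kernel read
// failures. With continueOnFail the frame is still pushed and the agent
// reads the code object in userspace; without it the unwind is aborted and
// nothing ever touches the page.
func sample(p *page, n int, continueOnFail bool) int {
	failures := 0
	for i := 0; i < n; i++ {
		if _, ok := kernelRead(p); !ok {
			failures++
			if continueOnFail {
				userRead(p)
			}
		}
	}
	return failures
}

func main() {
	fmt.Println(sample(&page{checksum: 1}, 100, false)) // abort on failure
	fmt.Println(sample(&page{checksum: 1}, 100, true))  // push frame and continue
}
```

Under these assumptions the aborting policy fails on all 100 samples, while the continuing policy fails only on the first one, because the userspace read makes the page resident for later kernel reads.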

christos68k · 2026-04-01T14:17:39Z

> Maybe we should just report the faulting address to the UA and have it clear the cache but preserve the current aborted unwind. That would work too, but I'm not convinced it's materially better.

I don't have a strong preference here, you've answered my main question re: how often does this happen so I'm fine with the current approach which doesn't deviate too much from the profiler's "eventual consistency" model.

fabled

I think the core changes look good. But the test suite extension could probably be done simpler by adding a new ebpf helper function. See comments.

gnurizen · 2026-04-08T18:52:38Z

Rebased to main with new simpler testing approach, RFAL.

fabled

Thanks! MUCH cleaner and thorough test set now. 💯
Seems the ebpf blobs need merge from master/rebuild, but approving.

gnurizen · 2026-04-09T15:08:59Z

Oops, I think I fixed everything, still not used to GH suggestions, sorry!

gnurizen marked this pull request as ready for review March 23, 2026 14:05

gnurizen requested review from a team as code owners March 23, 2026 14:05

christos68k approved these changes Mar 31, 2026

florianl reviewed Apr 1, 2026

fabled reviewed Apr 8, 2026

Comment thread tools/coredump/coredump_test.go Outdated

gnurizen force-pushed the python-read-fail-continue branch 3 times, most recently from d87d530 to 20cc7e2 Compare April 8, 2026 18:51

fabled approved these changes Apr 9, 2026

gnurizen added 2 commits April 9, 2026 11:04

Add testing for pycodeobject read failure to coredump harness

271922c

A new fault addresses element can be added to the test json which must be hit during the test via a call to bpf_probe_read_user_with_test_fault.
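One way to picture the test mechanism described in this commit message is the sketch below. It is written in Go purely as an illustration: the JSON key `fault-addresses`, the `faultInjector` type, and its methods are hypothetical, and the real harness routes reads through `bpf_probe_read_user_with_test_fault` rather than a Go method.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// testCase models the relevant part of a test JSON file. The key name
// "fault-addresses" is an assumption made for this sketch.
type testCase struct {
	FaultAddresses []uint64 `json:"fault-addresses"`
}

var errFault = errors.New("injected read fault")

// faultInjector fails reads at configured addresses and records which of
// them were actually hit during the test.
type faultInjector struct {
	faults map[uint64]bool
	hit    map[uint64]bool
}

func newFaultInjector(raw []byte) (*faultInjector, error) {
	var tc testCase
	if err := json.Unmarshal(raw, &tc); err != nil {
		return nil, err
	}
	fi := &faultInjector{faults: map[uint64]bool{}, hit: map[uint64]bool{}}
	for _, a := range tc.FaultAddresses {
		fi.faults[a] = true
	}
	return fi, nil
}

// read stands in for the fault-aware read helper: it returns an error for
// configured addresses and succeeds for everything else.
func (fi *faultInjector) read(addr uint64) error {
	if fi.faults[addr] {
		fi.hit[addr] = true
		return errFault
	}
	return nil
}

// unhit lists configured fault addresses that no read ever touched; a test
// would fail if this is non-empty, since every listed address must be hit.
func (fi *faultInjector) unhit() []uint64 {
	var out []uint64
	for a := range fi.faults {
		if !fi.hit[a] {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	fi, err := newFaultInjector([]byte(`{"fault-addresses": [4096]}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(fi.read(4096) != nil, len(fi.unhit()))
}
```

Tracking hits separately from the configured set lets a test distinguish "the fault was injected and handled" from "the fault address was never read", which is the property the commit message requires.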

gnurizen force-pushed the python-read-fail-continue branch from 20cc7e2 to 72a1b9f Compare April 9, 2026 15:07

Rebase, rebuild blobs and add back review suggestions

72a1b9f

christos68k merged commit 0320a2a into open-telemetry:main Apr 9, 2026
32 checks passed

This was referenced May 15, 2026

print fmt rodata parca-dev/opentelemetry-ebpf-profiler#271

Closed

bump ppc parca-dev/opentelemetry-ebpf-profiler#272

Closed
