Skip to content

Recover python frames when BPF fails to read PyCodeObject#1278

Merged
christos68k merged 3 commits into
open-telemetry:mainfrom
parca-dev:python-read-fail-continue
Apr 9, 2026
Merged

Recover python frames when BPF fails to read PyCodeObject#1278
christos68k merged 3 commits into
open-telemetry:mainfrom
parca-dev:python-read-fail-continue

Conversation

@gnurizen
Copy link
Copy Markdown
Contributor

When bpf_probe_read_user fails to read a PyCodeObject (e.g. page
swapped out), push the frame with codeobject_id=0 instead of aborting
the unwind. This preserves the rest of the stack trace.

On the agent side, handle ebpfChecksum=0 in getCodeObject by skipping
the LRU cache (no checksum to validate against) and the staleness
check (no BPF reference to compare). The agent reads the code object
via process_vm_readv which supports page faults, so it can succeed
where BPF could not. Store the calculated checksum in the cache so
subsequent frames with a real BPF checksum can match.

@gnurizen gnurizen marked this pull request as ready for review March 23, 2026 14:05
@gnurizen gnurizen requested review from a team as code owners March 23, 2026 14:05
Copy link
Copy Markdown
Member

@christos68k christos68k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

if m.ebpfChecksum == ebpfChecksum {
return m, nil
if ebpfChecksum != 0 {
if value, ok := p.addrToCodeObject.Get(addr); ok {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if value, ok := p.addrToCodeObject.Get(addr); ok {
// A zero checksum indicates code object read failed during the kernel read attempt (e.g. paged out).
if value, ok := p.addrToCodeObject.Get(addr); ok {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the previous suggestion got wiped in the rebase?

Suggested change
if value, ok := p.addrToCodeObject.Get(addr); ok {
// A zero checksum indicates code object read failed in the kernel (e.g. paged out).
if value, ok := p.addrToCodeObject.Get(addr); ok {

Comment thread support/ebpf/python_tracer.ebpf.c Outdated
increment_metric(metricID_UnwindPythonErrBadCodeObjectArgCountAddr);
return ERR_PYTHON_BAD_CODE_OBJECT_ADDR;
// Push the frame with the code object address so the agent can try to
// read it via /proc/pid/mem (which supports page faults unlike BPF).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
// read it via /proc/pid/mem (which supports page faults unlike BPF).
// read it in userspace (which can take page faults unlike BPF).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here:

Suggested change
// read it via /proc/pid/mem (which supports page faults unlike BPF).
// read it in userspace (which can take page faults unlike BPF).

@christos68k
Copy link
Copy Markdown
Member

christos68k commented Mar 31, 2026

Doesn't this introduce the possibility of a TOCTOU race (leading to a stacktrace with wrong frames) if the codeobject we read in userspace isn't the one we failed to read in the kernel? Is this purely theoretical / highly infrequent?

If it can happen, the tradeoff is traces with wrong frames vs failed unwinds.

Copy link
Copy Markdown
Member

@florianl florianl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @christos68k pointed out in #1278 (comment), I also think this opens the possibility of a TOCTOU race condition.
My personal preference is to report failed unwindings over traces with wrong frames.

@gnurizen
Copy link
Copy Markdown
Contributor Author

gnurizen commented Apr 1, 2026

To be fair the old code had a TOCTOU race as well, the argcount/lineno/flags bits aren't a guarantee. The problem is that in 200 CPU servers with ridiculous amounts of RAM this problem (soft page fault) can be surprisingly common. We saw high occurrences of this unwinding error. The problem is if we terminate the unwind the page is never faulted in and the read of the same address fails over and over (this is a long running training process where frames in the middle of the stack can be sitting there for hours). This fix causes them to get faulted in so future unwindings of the same stack succeed. I don't know how likely the TOCTOU hazard of python GC'ing the activation object and a different function taking its place is but I suspect its a harmless anomaly that would happen exceedingly rarely. The problem I'm fixing was causing unwinding to fail upwards of %30 of the time on this workload.

Maybe we should just report the faulting address to the UA and have it clear the cache but preserve the current aborted unwind. That would work too but I'm not convinced its materially better.

@christos68k
Copy link
Copy Markdown
Member

Maybe we should just report the faulting address to the UA and have it clear the cache but preserve the current aborted unwind. That would work too but I'm not convinced its materially better.

I don't have a strong preference here, you've answered my main question re: how often does this happen so I'm fine with the current approach which doesn't deviate too much from the profiler's "eventual consistency" model.

Copy link
Copy Markdown
Contributor

@fabled fabled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the core changes look good. But the test suite extension could probably be done simpler by adding a new ebpf helper function. See comments.

Comment thread tools/coredump/coredump_test.go Outdated
@gnurizen gnurizen force-pushed the python-read-fail-continue branch 3 times, most recently from d87d530 to 20cc7e2 Compare April 8, 2026 18:51
@gnurizen
Copy link
Copy Markdown
Contributor Author

gnurizen commented Apr 8, 2026

Rebased to main with new simpler testing approach, RFAL.

Copy link
Copy Markdown
Contributor

@fabled fabled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! MUCH cleaner and thorough test set now. 💯
Seems the ebpf blobs need merge from master/rebuild, but approving.

gnurizen added 2 commits April 9, 2026 11:04
When bpf_probe_read_user fails to read a PyCodeObject (e.g. page
swapped out), push the frame with codeobject_id=0 instead of aborting
the unwind. This preserves the rest of the stack trace.

On the agent side, handle ebpfChecksum=0 in getCodeObject by skipping
the LRU cache (no checksum to validate against) and the staleness
check (no BPF reference to compare). The agent reads the code object
via process_vm_readv which supports page faults, so it can succeed
where BPF could not. Store the calculated checksum in the cache so
subsequent frames with a real BPF checksum can match.
A new fault addresses element can be added to the test json which
must be hit during the test via a call to
bpf_probe_read_user_with_test_fault.
@gnurizen gnurizen force-pushed the python-read-fail-continue branch from 20cc7e2 to 72a1b9f Compare April 9, 2026 15:07
@gnurizen
Copy link
Copy Markdown
Contributor Author

gnurizen commented Apr 9, 2026

Oops, I think I fixed everything, still not used to GH suggestions, sorry!

@christos68k christos68k merged commit 0320a2a into open-telemetry:main Apr 9, 2026
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants