Recover Python frames when BPF fails to read PyCodeObject #1278
if m.ebpfChecksum == ebpfChecksum {
return m, nil
if ebpfChecksum != 0 {
if value, ok := p.addrToCodeObject.Get(addr); ok {
- if value, ok := p.addrToCodeObject.Get(addr); ok {
+ // A zero checksum indicates code object read failed during the kernel read attempt (e.g. paged out).
+ if value, ok := p.addrToCodeObject.Get(addr); ok {
Looks like the previous suggestion got wiped in the rebase?
- if value, ok := p.addrToCodeObject.Get(addr); ok {
+ // A zero checksum indicates code object read failed in the kernel (e.g. paged out).
+ if value, ok := p.addrToCodeObject.Get(addr); ok {
increment_metric(metricID_UnwindPythonErrBadCodeObjectArgCountAddr);
return ERR_PYTHON_BAD_CODE_OBJECT_ADDR;
// Push the frame with the code object address so the agent can try to
// read it via /proc/pid/mem (which supports page faults unlike BPF).
Nit:
- // read it via /proc/pid/mem (which supports page faults unlike BPF).
+ // read it in userspace (which can take page faults unlike BPF).
Same here:
- // read it via /proc/pid/mem (which supports page faults unlike BPF).
+ // read it in userspace (which can take page faults unlike BPF).
Doesn't this introduce the possibility of a TOCTOU race (leading to a stack trace with wrong frames) if the code object we read in userspace isn't the one we failed to read in the kernel? Is this purely theoretical or highly infrequent? If it can happen, the tradeoff is traces with wrong frames vs. failed unwinds.
florianl left a comment
As @christos68k pointed out in #1278 (comment), I also think this opens the possibility of a TOCTOU race condition.
My personal preference is to report failed unwindings over traces with wrong frames.
To be fair, the old code had a TOCTOU race as well: the argcount/lineno/flags bits aren't a guarantee. The problem is that on 200-CPU servers with ridiculous amounts of RAM, this failure (a soft page fault) can be surprisingly common, and we saw high occurrences of this unwinding error. If we terminate the unwind, the page is never faulted in, so the read of the same address fails over and over (this is a long-running training process where frames in the middle of the stack can sit there for hours). This fix causes the pages to get faulted in, so future unwinds of the same stack succeed. I don't know how likely the TOCTOU hazard is (Python GC'ing the activation object and a different function taking its place), but I suspect it's a harmless anomaly that would happen exceedingly rarely. The problem I'm fixing was causing unwinding to fail upwards of 30% of the time on this workload. Maybe we should just report the faulting address to the UA and have it clear the cache, but preserve the current aborted unwind. That would work too, but I'm not convinced it's materially better.
I don't have a strong preference here. You've answered my main question (how often does this happen), so I'm fine with the current approach, which doesn't deviate too much from the profiler's "eventual consistency" model.
fabled left a comment
I think the core changes look good. But the test suite extension could probably be done more simply by adding a new eBPF helper function. See comments.
Force-pushed from d87d530 to 20cc7e2
Rebased onto main with a new, simpler testing approach. RFAL.
fabled left a comment
Thanks! MUCH cleaner and more thorough test set now. 💯
Seems the eBPF blobs need a merge from master and a rebuild, but approving.
When bpf_probe_read_user fails to read a PyCodeObject (e.g. page swapped out), push the frame with codeobject_id=0 instead of aborting the unwind. This preserves the rest of the stack trace.

On the agent side, handle ebpfChecksum=0 in getCodeObject by skipping the LRU cache (no checksum to validate against) and the staleness check (no BPF reference to compare). The agent reads the code object via process_vm_readv which supports page faults, so it can succeed where BPF could not. Store the calculated checksum in the cache so subsequent frames with a real BPF checksum can match.
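The agent-side handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names getCodeObject, addrToCodeObject and ebpfChecksum come from the description, while the codeObject struct, the read callback (standing in for the userspace read of the target process), the plain map used in place of the LRU, and errStale are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// codeObject is a hypothetical stand-in for the agent's parsed PyCodeObject.
type codeObject struct {
	name         string
	ebpfChecksum uint32 // checksum calculated by the agent from the fields it read
}

// errStale is a hypothetical error for a code object that no longer matches
// what BPF saw at sample time.
var errStale = errors.New("code object does not match BPF checksum")

// pythonCache is a simplified stand-in; a plain map replaces the LRU.
type pythonCache struct {
	addrToCodeObject map[uint64]*codeObject
	// read is a hypothetical userspace read of the code object at addr.
	read func(addr uint64) (*codeObject, error)
}

func (p *pythonCache) getCodeObject(addr uint64, ebpfChecksum uint32) (*codeObject, error) {
	// A zero checksum means BPF could not read the code object, so there is
	// nothing to validate a cached entry against: skip the cache lookup.
	if ebpfChecksum != 0 {
		if m, ok := p.addrToCodeObject[addr]; ok && m.ebpfChecksum == ebpfChecksum {
			return m, nil
		}
	}

	// Read in userspace, where the access can take a page fault.
	m, err := p.read(addr)
	if err != nil {
		return nil, err
	}

	// The staleness check only makes sense with a real BPF checksum.
	if ebpfChecksum != 0 && m.ebpfChecksum != ebpfChecksum {
		return nil, errStale
	}

	// Store the calculated checksum so later frames with a real BPF
	// checksum can hit the cache.
	p.addrToCodeObject[addr] = m
	return m, nil
}

func main() {
	reads := 0
	p := &pythonCache{
		addrToCodeObject: map[uint64]*codeObject{},
		read: func(addr uint64) (*codeObject, error) {
			reads++
			return &codeObject{name: "train_step", ebpfChecksum: 0xabcd}, nil
		},
	}

	// First sample: BPF failed to read the code object (checksum 0).
	m, err := p.getCodeObject(0x7f00, 0)
	fmt.Println(m.name, err, reads)

	// Later sample: the page is resident, BPF supplies a real checksum.
	m, err = p.getCodeObject(0x7f00, 0xabcd)
	fmt.Println(m.name, err, reads)
}
```

In this sketch the second lookup is served from the cache without another read, which mirrors the "subsequent frames with a real BPF checksum can match" behaviour in the description.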
A new fault-addresses element can be added to the test JSON. Each listed address must be hit during the test via a call to bpf_probe_read_user_with_test_fault.
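To illustrate the idea behind the fault-address test mechanism, here is a small Go sketch of a fault injector that fails reads at configured addresses and reports any address that was never hit. It is only a model of the concept: the real helper, bpf_probe_read_user_with_test_fault, lives on the eBPF side, and the faultInjector type, its methods, and the map-backed memory are hypothetical names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errFault is a hypothetical error for an injected read failure.
var errFault = errors.New("injected read fault")

// faultInjector tracks the configured fault addresses and whether each
// one was hit by a read during the test.
type faultInjector struct {
	hit map[uint64]bool
}

func newFaultInjector(addrs []uint64) *faultInjector {
	f := &faultInjector{hit: make(map[uint64]bool, len(addrs))}
	for _, a := range addrs {
		f.hit[a] = false
	}
	return f
}

// read fails with errFault at configured addresses and otherwise returns
// the value stored in the simulated memory.
func (f *faultInjector) read(addr uint64, mem map[uint64]uint64) (uint64, error) {
	if _, faulty := f.hit[addr]; faulty {
		f.hit[addr] = true
		return 0, errFault
	}
	v, ok := mem[addr]
	if !ok {
		return 0, errors.New("unmapped address")
	}
	return v, nil
}

// unhit returns the configured fault addresses that no read touched,
// sorted for stable output. A test would fail if this is non-empty.
func (f *faultInjector) unhit() []uint64 {
	var out []uint64
	for a, h := range f.hit {
		if !h {
			out = append(out, a)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	f := newFaultInjector([]uint64{0x1000, 0x2000})
	mem := map[uint64]uint64{0x1000: 42, 0x3000: 7}

	_, err := f.read(0x1000, mem) // configured fault address: read fails
	fmt.Println(err)

	v, err := f.read(0x3000, mem) // ordinary address: read succeeds
	fmt.Println(v, err)

	fmt.Println(f.unhit()) // 0x2000 was never read
}
```

Requiring every configured address to be hit guards against a test that silently passes because the faulting code path was never exercised.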
Force-pushed from 20cc7e2 to 72a1b9f
Oops, I think I fixed everything. Still not used to GH suggestions, sorry!