python: combine python and native unwinder to avoid tail call limits by gnurizen · Pull Request #1288 · open-telemetry/opentelemetry-ebpf-profiler

gnurizen · 2026-03-26T16:17:18Z

python: combine python and native unwinder into single loop

Python programs, especially pytorch ones, can exhaust the tail call
limit by switching between the python and native unwinders more than
29 times. This happens because of eval/delegation patterns where each
python frame is interleaved with a couple of native frames.

In order to unwind these stacks successfully, fold the native unwinder
into the python unwinder so that at each step either a python or a
native frame can be unwound.

Replace the separate walk_python_stack inner loop and outer
transition loop with a single switch-in-loop structure using
step_python and step_native helper functions. This reduces
tail call usage from one per batch to one per loop budget
exhaustion.
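
To make the tail call accounting concrete, here is a minimal userspace model in C, not the actual eBPF code. It assumes a pre-labelled array of frame kinds, whereas the real unwinder decides per frame from registers and interpreter state. The names `count_separate`, `unwind_combined` and `struct unwind_stats` are hypothetical. Only `step_python` and `step_native` are named in the commit message, and their bodies here are placeholders.

```c
#include <stddef.h>

/* Hypothetical frame labels for the model. */
enum frame_kind { FRAME_PYTHON, FRAME_NATIVE };

struct unwind_stats {
  size_t python_frames;
  size_t native_frames;
};

/* Placeholder step helpers: the real ones would read interpreter or
 * stack-delta state and push a frame onto the trace. */
static void step_python(struct unwind_stats *s) { s->python_frames++; }
static void step_native(struct unwind_stats *s) { s->native_frames++; }

/* Model of separate unwinders: one program invocation handles a run of
 * frames of a single kind (up to per_program frames), so every
 * python<->native switch costs another invocation, i.e. a tail call. */
static size_t count_separate(const enum frame_kind *frames, size_t n,
                             size_t per_program)
{
  size_t pos = 0, programs = 0;
  while (pos < n) {
    enum frame_kind kind = frames[pos];
    size_t done = 0;
    programs++;
    while (pos < n && frames[pos] == kind && done < per_program) {
      pos++;
      done++;
    }
  }
  return programs;
}

/* Model of the combined loop: each iteration dispatches on the frame
 * kind, so a new invocation is needed only when the loop budget
 * (loop_iters) is exhausted. */
static size_t unwind_combined(const enum frame_kind *frames, size_t n,
                              size_t loop_iters, struct unwind_stats *s)
{
  size_t pos = 0, programs = 0;
  while (pos < n) {
    programs++;
    for (size_t i = 0; i < loop_iters && pos < n; i++, pos++) {
      switch (frames[pos]) {
      case FRAME_PYTHON:
        step_python(s);
        break;
      case FRAME_NATIVE:
        step_native(s);
        break;
      }
    }
  }
  return programs;
}
```

For a 120-frame stack that repeats one python frame followed by two native frames, `count_separate(stack, 120, 12)` yields 80 invocations in this model, well past a budget of 29, while `unwind_combined(stack, 120, 12, &stats)` yields 10.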

Move native unwinder map externs (exe_id_to_*_stack_deltas,
stack_delta_page_to_info, unwind_info_array) out of the
TESTING_COREDUMP guard in extmaps.h so python_tracer.ebpf.c
can include native_stack_trace.h.

Python loop iters is now a ro_vars entry so it can be set low by
default and raised when debug_prints are disabled, which allows for
much bigger stacks. 29 * 12 (348) remains the deepest python stack we
support, but it's 29 * 4 (116) with debug prints enabled.
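
The depth bound above is just the tail call budget multiplied by the per-program loop iterations. A small C sketch of that arithmetic follows. It assumes all 29 tail calls are spent in the combined loop and that every invocation unwinds a full batch of frames. `TAIL_CALL_BUDGET` and `max_unwound_frames` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Tail call budget quoted in the commit message. */
#define TAIL_CALL_BUDGET 29u

/* Upper bound on frames unwound: one batch of loop_iters frames per
 * tail call. Real stacks may stop earlier for other reasons. */
static uint32_t max_unwound_frames(uint32_t tail_calls, uint32_t loop_iters)
{
  return tail_calls * loop_iters;
}
```

With 12 iterations per program this gives 29 * 12 = 348 frames, and with 4 iterations (debug prints enabled) it gives 29 * 4 = 116.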

gnurizen · 2026-03-30T18:48:42Z

New coredump test: https://drive.google.com/file/d/1c6fCXz7UoMuPu-Gg-bHUluYucILv2gIb

gnurizen · 2026-03-31T15:13:04Z

@fabled @florianl tagging you guys for review consideration, no hurry just want to make sure this gets on the appropriate radars. Thanks!

fabled · 2026-04-08T18:38:16Z

Seems ruby has the same issue. See #1335.

I wonder if something more elaborate could be done. Or is it better to bundle native unwinder with the interpreters that need it due to mixing native/HLL frames every few frames.

gnurizen · 2026-04-08T19:15:22Z

Rebased onto PR #1286. Yeah, I'd love to get @dalehamel's thoughts on the applicability of this approach to the Ruby situation.

dalehamel · 2026-04-08T19:21:27Z

> Seems ruby has the same issue. See #1335.
>
> I wonder if something more elaborate could be done. Or is it better to bundle native unwinder with the interpreters that need it due to mixing native/HLL frames every few frames.

Yes, we see this problem especially in production. With yjit, the problem is masked by the fact that the jit is only the leaf frame, and we don't run with jit frame pointers in production for performance reasons.

However, the ruby unwinder is already quite instruction heavy, as we need to do the complex CME resolution for each ruby frame, so it might be hard to get all frames if we add the native unwinder into it too. In reality, the majority of the time we mostly care about the actual 'ruby' stack state (where it shouldn't really matter whether the frame is jit or interpreter backed, as it might be either with zjit) plus the jit leaf state, and that's been fine for our purposes.

If we could manage to actually continue the native unwinding without exhausting tail calls, that would certainly be the best of both worlds, but I wouldn't say it's the highest priority.

linux-foundation-easycla · 2026-04-28T14:15:35Z

The committers listed above are authorized under a signed CLA.

✅ login: gnurizen / name: Tommy Reilly (173671d, 23507cb, 501e196, f020348)

gnurizen · 2026-04-28T15:10:49Z

This has been rebased to main, is passing all the kernel tests and should be green when the new coredump test is uploaded. @florianl @fabled if you could give this another pass when you can that would be much appreciated!

gnurizen · 2026-05-05T14:19:35Z

@florianl @fabled what are your thoughts on splitting off the first commit here and trying to land that first?

gnurizen · 2026-05-14T14:31:53Z

@fabled @florianl friendly ping on this. What if this change was a runtime opt-in (or could be build time I guess) so we could land it w/o changing anything functionally and Parca could just switch it on for our agent? ebpf programs are small, no harm in having some extra ones in the binary. I haven't worked out exactly how that would work but seems doable.

fabled · 2026-05-15T06:03:31Z

I think this is probably the simple way forward to solve a real problem at hand. I'd prefer to do something else, but that is probably a bigger job and/or not feasible at this time. So I'm ok with doing this now. Let's just do this for everyone (no opt-in switch imho). The fewer configuration / build / runtime switches there are, the more maintainable it is.

@christos68k @florianl Thoughts?

christos68k · 2026-05-15T13:15:51Z

> I think this is probably the simple way forward to solve a real problem at hand. I'd prefer to do something else, but that is probably a bigger job and/or not feasible at this time. So I'm ok with doing this now. Let's just do this for everyone (no opt-in switch imho). The fewer configuration / build / runtime switches there are, the more maintainable it is.
>
> @christos68k @florianl Thoughts?

I'm fine for now with this PR as-is, the change is not that extensive and it's conceptually clean and fairly simple. The alternative @gnurizen proposed here is also fine, though I prefer the current PR.

gnurizen · 2026-05-16T18:25:20Z

I redid the math on the number of frames we do, #1422 really opens up the doors on the upper bound on the *_PER_PROGRAM frame loops! But alas we still need this patch to get around the tail call limit with pytorch's checkered zebra stacks.

@florianl can we make the next step uploading the coredump test so we can get a green CI here? See this comment: #1288 (comment)

fabled · 2026-05-18T07:55:47Z

> can we make the next step uploading the coredump test so we can get a green CI here? See this comment: #1288 (comment)

Uploaded and reran tests now. Can you merge with main and update ebpf blobs? Thanks!

This is a prep-the-patient PR to make room for a hybrid python/native unwinder, which we found necessary to unwind large pytorch stacks that go back and forth between python and native more times than the tail call limit allows. This change is pure code motion and changes nothing functionally.

Python programs, especially pytorch ones, can exhaust the tail call limit by switching between the python and native unwinders more than 29 times. This happens because of eval/delegation patterns where each python frame is interleaved with a couple of native frames. In order to unwind these stacks successfully, fold the native unwinder into the python unwinder so that at each step either a python or a native frame can be unwound.

Replace the separate walk_python_stack inner loop and outer transition loop with a single switch-in-loop structure using step_python and step_native helper functions. This reduces tail call usage from one per batch to one per loop budget exhaustion (PYTHON_NATIVE_LOOP_ITERS=9 iterations).

Move native unwinder map externs (exe_id_to_*_stack_deltas, stack_delta_page_to_info, unwind_info_array) out of the TESTING_COREDUMP guard in extmaps.h so python_tracer.ebpf.c can include native_stack_trace.h.

Python loop iters is now a ro_vars entry so it can be set low by default and raised when debug_prints are disabled, which allows for much bigger stacks.

Both the host agent (production, no verifier debug branches) and the coredump tool (no verifier at all) need the full 12 iterations to unwind deep Python+native stacks. Only VerboseMode=true in CI hits the 1M verifier instruction limit on kernel 6.18+ because DEBUG_PRINT roughly triples per-iter complexity. Previously the eBPF rodata default was 4 (the verifier-limited value) and systemconfig.go overrode UP to 12 for production. The coredump tool bypasses systemconfig.go and was stuck at 4, breaking the deep-python test. Flip the polarity: default to 12 in eBPF, override DOWN to 4 in systemconfig.go when VerboseMode is set. Coredump picks up 12 for free.
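
The polarity flip described above can be modelled in a few lines of C. This is only a sketch: the real override lives in Go (systemconfig.go) and the real default lives in the eBPF rodata. The struct layout, the field names and the functions `default_ro_vars` and `apply_system_config` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PYTHON_LOOP_ITERS_DEFAULT 12u /* full depth, used without debug prints */
#define PYTHON_LOOP_ITERS_VERBOSE 4u  /* verifier-limited value with debug prints */

/* Hypothetical stand-in for the read-only variables baked into the
 * eBPF object. */
struct ro_vars {
  uint32_t python_loop_iters;
  bool debug_prints;
};

/* What a loader sees if it never touches the value (the coredump tool
 * path, which bypasses the system config step). */
static struct ro_vars default_ro_vars(void)
{
  struct ro_vars v = {PYTHON_LOOP_ITERS_DEFAULT, false};
  return v;
}

/* Model of the host agent's config step: only VerboseMode lowers the
 * iteration count, to stay under the verifier instruction limit. */
static void apply_system_config(struct ro_vars *v, bool verbose_mode)
{
  v->debug_prints = verbose_mode;
  if (verbose_mode)
    v->python_loop_iters = PYTHON_LOOP_ITERS_VERBOSE;
}
```

With the default set to the larger value, any loader that skips the config step gets 12 iterations, and only the verbose path has to opt down to 4.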

gnurizen changed the title ~~python native hybrid~~ Combine python and native unwinder into single loop Mar 26, 2026

gnurizen force-pushed the python-native-hybrid branch 3 times, most recently from a83b6d6 to 365d706 Compare March 26, 2026 23:40

gnurizen marked this pull request as ready for review March 27, 2026 00:59

gnurizen requested review from a team as code owners March 27, 2026 00:59

gnurizen mentioned this pull request Mar 27, 2026

Move native unwinder impl to a .h #1280

Closed

gnurizen mentioned this pull request Apr 8, 2026

Move rt_regs from stack to record scratch #1286

Merged

gnurizen force-pushed the python-native-hybrid branch from 2489c65 to a4809c1 Compare April 8, 2026 19:07

florianl mentioned this pull request Apr 14, 2026

Reduce lock contention on inhibit_events map #1349

Merged

gnurizen force-pushed the python-native-hybrid branch from a4809c1 to f5a2fac Compare April 20, 2026 15:33

gnurizen changed the title ~~Combine python and native unwinder into single loop~~ python: combine python and native unwinder into single loop to support highly mixed stacks hitting tail call limits Apr 20, 2026

gnurizen changed the title ~~python: combine python and native unwinder into single loop to support highly mixed stacks hitting tail call limits~~ python: combine python and native unwinder to support deep mixed stacks hitting tail call limits Apr 20, 2026

gnurizen marked this pull request as draft April 20, 2026 15:40

gnurizen force-pushed the python-native-hybrid branch 3 times, most recently from 08dc71e to ef71a9a Compare April 28, 2026 14:15

gnurizen force-pushed the python-native-hybrid branch from ef71a9a to c0eed20 Compare April 28, 2026 14:22

gnurizen marked this pull request as ready for review April 28, 2026 15:09

dalehamel mentioned this pull request Apr 28, 2026

ruby: add skip_native_resume to avoid tail call exhaustion #1335

Open

gnurizen force-pushed the python-native-hybrid branch from c0eed20 to 583f6a0 Compare May 5, 2026 12:31

gnurizen changed the title ~~python: combine python and native unwinder to support deep mixed stacks hitting tail call limits~~ python: combine python and native unwinder to avoid tail call limits May 5, 2026

gnurizen force-pushed the python-native-hybrid branch from 583f6a0 to 23507cb Compare May 5, 2026 13:26

gnurizen added 4 commits May 18, 2026 12:43

Add coredump test for deep python stacks

b39f022

gnurizen force-pushed the python-native-hybrid branch from 23507cb to 7475c10 Compare May 18, 2026 19:08
