Fix off-by-one error in stack delta lookup #1027

Closed
umanwizard wants to merge 1 commit into open-telemetry:main from parca-dev:fix-offbyone-upstream

@umanwizard
Contributor

When the current address is a return address, we need to subtract one to get the real call instruction.

Anecdotally, fixing this bug seems to substantially improve unwinding for Rust aarch64 binaries compiled in debug mode. The reason for this is that our heuristic for getting the frame pointer doesn't work in function epilogues, and in debug builds, you often have a call instruction immediately followed by the function epilogue. So due to this off-by-one error, we'll think we're in the epilogue and not correctly unwind the fp.

It seems much less common in release builds, probably because it is much more unusual to have a call immediately followed by the function epilogue (because the optimizer will replace that sequence by a tail call).

@umanwizard umanwizard requested review from a team as code owners December 18, 2025 22:24
@umanwizard
Contributor Author

@christos68k could you help me upload the test case to the s3 bucket for the coredump test? Tommy told me you helped him do it in the past.

@umanwizard
Contributor Author

Hmmm... It seems to break some unrelated coredump tests. Looking into it.

@fabled
Contributor

fabled commented Dec 19, 2025

Several things for you to note:

  • The -1 generally applies when extracting the line number from the other DWARF places where the calling instruction is wanted.
  • Technically, the -1 is not wanted here. The call instruction does not change state; the first location where state changes is after the instruction that follows the call. Refer to DWARF .debug_frame / .eh_frame for the meaning of the address there.
  • Corner case: the first instruction of a function (where -1 will break things). This typically should not happen if the return_address flag is not set.
  • Corner case: CALL is the last instruction, and not doing -1 will not match the correct location. However, this should not happen, as the compiler should always generate something after CALL. I've seen some exceptions: hand-crafted assembly where the called function never returns.
  • Corner case: tail call. This typically converts CALL to JMP, and the whole frame is invisible in the trace.
  • Corner case: hot/cold parts of a function are split. There might be parts without a prologue/epilogue, but the above rules about not placing CALL last still hold.

You may want to read .eh_frame, the way we extract and generate the tables, and the way the binary search is done. But IIRC the code was crafted so that -1 is unneeded and would break things. But it's several years since I wrote it, so a post-mortem review and fixes are welcome.

Ideally, though, the conditional -1 is avoided, as it increases code size. Perhaps the issue could be fixed on the stack delta generation side instead?

Depending on the details, this could be a Rust .eh_frame generation issue, or a problem in the way we parse or match the data. Could you show some disassembly and the .eh_frame of the code you see?

@christos68k
Member

> @christos68k could you help me upload the test case to the s3 bucket for the coredump test? Tommy told me you helped him do it in the past.

Hi @umanwizard, I'm currently traveling and will be on PTO until January 3rd. When you're ready with the test case, ping us here again and if I'm still out either @fabled or @florianl can help with the upload.

@umanwizard
Contributor Author

@fabled On further thought, you are right. I misunderstood the meaning of the address here.

It does superficially appear to fix the issue, but that must be for a different reason. I suppose I need to investigate further ...

@umanwizard
Contributor Author

Actually, sorry for the confusion. I think I was originally correct that the current code is wrong and a -1 is necessary.

I have this Rust code:

fn g(n: usize) -> usize {
    unsafe {
        std::arch::asm!("brk #0");
    }
    n
}

fn f(n: usize) -> usize {
    g(n)
}

fn main() {
    let x = f(40);
    println!("{x}");
}

which, when compiled in debug mode, results in this disassembly (irrelevant stuff snipped):

target/debug/test_rust:     file format elf64-littleaarch64


0000000000012f50 <_ZN9test_rust1g17hb8e54390f643f8abE>:
   12f50:	d10083ff 	sub	sp, sp, #0x20
   12f54:	a9017bfd 	stp	x29, x30, [sp, #16]
   12f58:	910043fd 	add	x29, sp, #0x10
   12f5c:	f90007e0 	str	x0, [sp, #8]
   12f60:	d4200000 	brk	#0x0
   12f64:	a9417bfd 	ldp	x29, x30, [sp, #16]
   12f68:	910083ff 	add	sp, sp, #0x20
   12f6c:	d65f03c0 	ret

0000000000012f70 <_ZN9test_rust1f17h8160b349f1e03872E>:
   12f70:	d10083ff 	sub	sp, sp, #0x20
   12f74:	a9017bfd 	stp	x29, x30, [sp, #16]
   12f78:	910043fd 	add	x29, sp, #0x10
   12f7c:	f90007e0 	str	x0, [sp, #8]
   12f80:	97fffff4 	bl	12f50 <_ZN9test_rust1g17hb8e54390f643f8abE>
   12f84:	a9417bfd 	ldp	x29, x30, [sp, #16]
   12f88:	910083ff 	add	sp, sp, #0x20
   12f8c:	d65f03c0 	ret

0000000000012f90 <_ZN9test_rust4main17h04d996d1d90f88eaE>:
   12f90:	d101c3ff 	sub	sp, sp, #0x70
   12f94:	a9067bfd 	stp	x29, x30, [sp, #96]
   12f98:	910183fd 	add	x29, sp, #0x60
   12f9c:	52800508 	mov	w8, #0x28                  	// #40
   12fa0:	2a0803e0 	mov	w0, w8
   12fa4:	97fffff3 	bl	12f70 <_ZN9test_rust1f17h8160b349f1e03872E>
   12fa8:	aa0003e8 	mov	x8, x0
   12fac:	910023e0 	add	x0, sp, #0x8
   12fb0:	f90007e8 	str	x8, [sp, #8]
   12fb4:	d10043a8 	sub	x8, x29, #0x10
   12fb8:	97ffffdb 	bl	12f24 <_ZN4core3fmt2rt8Argument11new_display17h673c257ee5637ca9E>
   12fbc:	3cdf03a0 	ldur	q0, [x29, #-16]
   12fc0:	d10083a1 	sub	x1, x29, #0x20
   12fc4:	3c9e03a0 	stur	q0, [x29, #-32]
   12fc8:	910043e8 	add	x8, sp, #0x10
   12fcc:	f90003e8 	str	x8, [sp]
   12fd0:	f0000340 	adrp	x0, 7d000 <__abi_tag+0x1c420>
   12fd4:	9101e000 	add	x0, x0, #0x78
   12fd8:	94000021 	bl	1305c <_ZN4core3fmt2rt38_$LT$impl$u20$core..fmt..Arguments$GT$6new_v117h00250a48a45390a8E>
   12fdc:	f94003e0 	ldr	x0, [sp]
   12fe0:	94000f71 	bl	16da4 <_ZN3std2io5stdio6_print17h9918374270e24381E>
   12fe4:	a9467bfd 	ldp	x29, x30, [sp, #96]
   12fe8:	9101c3ff 	add	sp, sp, #0x70
   12fec:	d65f03c0 	ret

So, when we hit the breakpoint instruction in g, the stack will look like:

  • 0x12f60 (g)
  • 0x12f84 (f) -- this is a return address
  • 0x12fa8 (main) -- this is a return address
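
As a cross-check of the listing above, here is a small Rust sketch that decodes the two `bl` instructions from the disassembly and derives the return addresses. It relies only on the A64 `bl` encoding (opcode bits 0b100101 followed by a signed 26-bit word offset) and the fixed 4-byte instruction size; the function name `decode_bl` is made up for this illustration.

```rust
/// Decode an A64 `bl` located at `pc`.
/// Returns (call target, return address), or None if `insn` is not a `bl`.
/// Hypothetical helper for illustration only.
fn decode_bl(pc: u64, insn: u32) -> Option<(u64, u64)> {
    // `bl` has 0b100101 in bits 31..26, followed by a signed 26-bit word offset.
    if insn >> 26 != 0b100101 {
        return None;
    }
    // Sign-extend imm26: move it to the top of an i32, then shift back down.
    let imm26 = ((insn << 6) as i32) >> 6;
    // The offset is counted in 4-byte instructions, relative to the `bl` itself.
    let target = pc.wrapping_add((imm26 as i64 * 4) as u64);
    // The return address is the instruction right after the `bl`.
    Some((target, pc + 4))
}
```

Feeding it the `bl` at 0x12f80 in f (encoding 97fffff4) gives target 0x12f50 (g) and return address 0x12f84; the `bl` at 0x12fa4 in main (encoding 97fffff3) gives target 0x12f70 (f) and return address 0x12fa8.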

The eh_frame instructions corresponding to f look like this (dumped with readelf -wF):

   LOC           CFA      x29   ra    
0000000000012f70 sp+0     u     u     
0000000000012f74 sp+32    u     u     
0000000000012f7c x29+16   c-16  c-8   
0000000000012f84 sp+32    c-16  c-8   
0000000000012f8c sp+0     u     u     

Note that a new section (row) begins at 0x12f84. This section assumes that the instruction at 0x12f84, which clobbers x29, has already executed; that's why it no longer tries to unwind based on x29.

So the semantics of a section beginning at address "foo" in eh_frame seems to be that the instruction at foo has already executed, not that it's about to execute (which is what I assumed in the past).
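
To make the difference concrete, here is a minimal Rust sketch of a lookup over the table for f. The `Row` type and the `lookup` function are hypothetical names for this illustration, not the profiler's stack delta code, and the rule "take the last row whose LOC is at or below the address" is an assumption about how such a table is searched.

```rust
/// One row of the `readelf -wF` dump for `f` (hypothetical type, illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Row {
    loc: u64,          // first address the row applies to
    cfa: &'static str, // CFA rule, kept as text for readability
}

/// The rows for `f`, copied from the dump above.
const F_ROWS: [Row; 5] = [
    Row { loc: 0x12f70, cfa: "sp+0" },
    Row { loc: 0x12f74, cfa: "sp+32" },
    Row { loc: 0x12f7c, cfa: "x29+16" },
    Row { loc: 0x12f84, cfa: "sp+32" },
    Row { loc: 0x12f8c, cfa: "sp+0" },
];

/// Return the last row whose `loc` is <= `pc`, or None if `pc` precedes the table.
fn lookup(rows: &[Row], pc: u64) -> Option<Row> {
    // partition_point yields the index of the first row with loc > pc.
    let idx = rows.partition_point(|r| r.loc <= pc);
    idx.checked_sub(1).map(|i| rows[i])
}
```

Under these assumptions, looking up the return address 0x12f84 selects the `sp+32` row, while looking up 0x12f84 - 1 selects the `x29+16` row that covers the `bl` instruction.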

Just to confirm that this is right, I checked what GDB does. GDB also subtracts one from the pc before attempting to look up DWARF/eh_frame data and unwind. See here for the relevant line of the GDB source code.

The reason this current PR is wrong is that we should not do the adjustment for signal return frames; we should only do it for normal frames (again see the code in GDB above). But we don't detect that we are currently in such a frame (and appropriately set return_address = false) until later.

@fabled
Contributor

fabled commented Dec 19, 2025

> So the semantics of a section beginning at address "foo" in eh_frame seems to be that the instruction at foo has already executed, not that it's about to execute (which is what I assumed in the past).

I think this is not true. Can you show non-Rust code that does this, or the DWARF specification saying so? The DWARF version 4 figure 63 & 64 reads otherwise to me (at this late hour).

Also if it was "has executed" it would make it impossible to unwind the first instruction of a function from signal/async context.

Also, the entry eh_frame you show is:

   LOC           CFA      x29   ra
0000000000012f70 sp+0     u     u
0000000000012f74 sp+32    u     u

So this prologue portion seems to conflict with the epilogue portion. If what you say is true, the sp+32 should be at 0000000000012f70, because the sub that changes the CFA to sp+32 is at the very beginning. Right?

I think Rust has .eh_frame generation bug. Or what do you think?

I am not sure what the GDB function you refer to is used for, but there are scenarios where the -1 is valid. As mentioned, this is especially important for line number lookups. So just pointing at a random function is not helpful without the actual call site. Also note that in the GDB code you refer to, some GDB mangling seems to have been done already, as it distinguishes inlined frames from real frames. We are dealing with real frames only in the eBPF code.

> The reason this current PR is wrong is that we should not do the adjustment for signal return frames; we should only do it for normal frames (again see the code in GDB above). But we don't detect that we are currently in such a frame (and appropriately set return_address = false) until later.

We also have different interruption points than GDB. The kernel perf entry can happen signal-like, so we do have additional state. The return_address flag should correctly reflect this: it is determined on unwind entry (based on whether we interrupted user code or kernel code), and reset on signal handler frames.

But based on the assembly snippet and the corresponding .eh_frame alone, it looks conflicting. And I think there is a code/.eh_frame generation issue somewhere. Thoughts?

@fabled
Contributor

fabled commented Dec 19, 2025

The eh_frame dump interleaved with assembly:

   LOC           CFA      x29   ra
0000000000012f70 sp+0     u     u
   12f70: d10083ff  sub sp, sp, #0x20
0000000000012f74 sp+32    u     u
   12f74: a9017bfd  stp x29, x30, [sp, #16]
   12f78: 910043fd  add x29, sp, #0x10
0000000000012f7c x29+16   c-16  c-8
   12f7c: f90007e0  str x0, [sp, #8]
   12f80: 97fffff4  bl 12f50 <_ZN9test_rust1g17hb8e54390f643f8abE>
0000000000012f84 sp+32    c-16  c-8
   12f84: a9417bfd  ldp x29, x30, [sp, #16]
   12f88: 910083ff  add sp, sp, #0x20
0000000000012f8c sp+0     u     u
   12f8c: d65f03c0  ret

This suggests that the address corresponds to the state before the instruction executes, except for 0000000000012f84, where it's off by one.

File an issue with Rust?

If we need to work around this in our code, we should have a heuristic to determine that it's Rust, then have code to detect whether the bug manifests (always or on a per-instruction basis), and fix up the address at the stack delta generation stage.

@umanwizard
Contributor Author

umanwizard commented Dec 20, 2025

The issue isn't Rust-specific; I just used Rust arbitrarily.

This C program, when compiled with clang, gives very similar results:

#include <stdio.h>

size_t g(size_t n) {
    __asm__("brk #0");
    return n;
}

size_t f(size_t n) {
    return g(n);
}

int main() {
    size_t x = f(40);
    printf("%zu\n", x);
    return 0;
}

So I think it's an issue for all LLVM-based compilers, not just rustc.

Now, there are a couple things to note here:

  1. Technically, the .eh_frame unwinding info produced by LLVM here isn't wrong. It's unusual, and perhaps unexpected, for it to change one instruction before the boundary between the function body and the epilogue, rather than at the boundary itself, but note that computing the CFA from sp is still valid, even during the function body, in our examples, so technically we should still be able to unwind that way.
  2. The reason that we can't properly unwind with that entry seems to be that on aarch64, we are using a heuristic for unwinding to guess where the frame pointer is:
  // Try to resolve frame pointer
  // simple heuristic for FP based frames
  // the GCC compiler usually generates stack frame records in such a way,
  // so that FP/RA pair is at the bottom of a stack frame (stack frame
  // record at lower addresses is followed by stack vars at higher ones)
  // this implies that if no other changes are applied to the stack such
  // as alloca(), following the prolog SP/FP points to the frame record
  // itself, in such a case FP offset will be equal to 8
  if (info->fpParam == 8) {
    // we can assume the presence of frame pointers
    if (info->fpOpcode != UNWIND_OPCODE_BASE_LR) {
      // FP precedes the RA on the stack (Aarch64 ABI requirement)
      bpf_probe_read_user(&state->fp, sizeof(state->fp), (void *)(ra - 8));
    }
  }

This if block is not entered in the scenarios we're discussing, so we do not update fp.

  3. The line I linked in gdb is definitely used for unwinding, not just for symbolization (maybe it's also used for symbolization). I found it while stepping through gdb (using another instance of gdb) and running the backtrace command.

In gdb 16.3, the DWARF unwinding code is executed with this stack trace:

#0  dwarf2_frame_sniffer (self=0xf1e580 <dwarf2_frame_unwind>, this_frame=..., 
    this_cache=0x172b078) at dwarf2/frame.c:1320
#1  0x00000000007b8e78 in frame_unwind_try_unwinder (this_frame=..., this_cache=0x172b078, 
    unwinder=0xf1e580 <dwarf2_frame_unwind>) at frame-unwind.c:138
#2  0x00000000007b914c in frame_unwind_find_by_frame (this_frame=..., this_cache=0x172b078)
    at frame-unwind.c:209
#3  0x00000000007bb3f4 in compute_frame_id (fi=...) at frame.c:606
#4  0x00000000007bfb14 in get_prev_frame_maybe_check_cycle (this_frame=...) at frame.c:2215
#5  0x00000000007c056c in get_prev_frame_always_1 (this_frame=...) at frame.c:2476
#6  0x00000000007c071c in get_prev_frame_always (this_frame=...) at frame.c:2492
#7  0x00000000007c0f6c in get_prev_frame (this_frame=...) at frame.c:2756
#8  0x0000000000aa4790 in backtrace_command_1 (fp_opts=..., bt_opts=..., count_exp=0x0, 
    from_tty=1) at stack.c:2056
#9  0x0000000000aa4bd4 in backtrace_command (arg=0x0, from_tty=1) at stack.c:2171
(snip)

This makes it clear that dwarf2_frame_sniffer is called as part of backtracing. That function then finds the unwinding FDE with code like this:

  CORE_ADDR block_addr = get_frame_address_in_block (this_frame);
  struct dwarf2_fde *fde = dwarf2_frame_find_fde (&block_addr, NULL);

and get_frame_address_in_block contains the pc - 1 logic I linked above.
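
The general idea behind that adjustment can be sketched in Rust as follows. This is not GDB's code: `lookup_address` and `STACK` are hypothetical names, and the boolean mirrors the return_address flag discussed in this thread (true for frames reached through a call, false for the interrupted leaf frame or a signal frame).

```rust
/// Address to use for an unwind-table lookup (hypothetical helper).
/// A return address points just past the call, so step back by one
/// to land inside the calling instruction.
fn lookup_address(pc: u64, return_address: bool) -> u64 {
    if return_address {
        // saturating_sub avoids underflow on a bogus pc of 0.
        pc.saturating_sub(1)
    } else {
        pc
    }
}

/// The three frames from the example above: (pc, is it a return address?).
const STACK: [(u64, bool); 3] = [
    (0x12f60, false), // g: stopped at the brk instruction
    (0x12f84, true),  // f: return address after `bl g`
    (0x12fa8, true),  // main: return address after `bl f`
];
```

Applied to the example stack, the lookup addresses become 0x12f60, 0x12f83 and 0x12fa7. Whether such an adjustment is appropriate depends on how the table being searched defines its row boundaries, which is exactly the question under discussion here.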

So, I'm really not sure what's going on -- is GDB wrong, but it doesn't matter because this logic is only executed for leaf frames?

Perhaps the correct fix is to improve our "heuristic" for finding the frame pointer on aarch64.

@fabled
Contributor

fabled commented Dec 21, 2025

> Now, there are a couple things to note here:
>
> 1. Technically, the .eh_frame unwinding info produced by LLVM here isn't wrong. It's unusual, and perhaps unexpected, for it to change one instruction before the boundary between the function body and the epilogue, rather than at the boundary itself, but note that computing the CFA from sp is still valid, even during the function body, in our examples, so technically we should still be able to unwind that way.
>
> 2. The reason that we can't properly unwind with that entry seems to be that on aarch64, we are using a heuristic for unwinding to guess where the frame pointer is:

I think the above two are correct. And they are completely different from what you've been trying to prove so far (that we look up the wrong unwind opcode).

I think the failure is indeed an arm64-specific issue: the .eh_frame entry

LOC           CFA      x29   ra    
0000000000012f84 sp+32    c-16  c-8   

is not unwound correctly.

I think the heuristic is just wrong. We should have a new opcode to unwind both FP+RA, and have the stack delta generator recognize when FP and RA are both defined and consecutive in memory. This would get rid of the heuristic and allow this case to work.
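
A rough Rust sketch of what such a check and the matching unwind step could look like. Everything here is hypothetical and only illustrates the proposal: the `Saved` type models the x29/ra columns of the `readelf -wF` output, and `restore_pair` takes a caller-supplied `read` closure in place of a real memory read.

```rust
/// Where a register was saved, as in the `readelf -wF` columns (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Saved {
    Undefined,      // "u"
    CfaOffset(i64), // "c-16" is CfaOffset(-16)
}

/// Generator side: true when x29 and the return address are both saved
/// relative to the CFA and the RA slot sits 8 bytes above the x29 slot.
fn is_frame_record(fp: Saved, ra: Saved) -> bool {
    match (fp, ra) {
        (Saved::CfaOffset(f), Saved::CfaOffset(r)) => r == f + 8,
        _ => false,
    }
}

/// Unwinder side: restore (fp, ra) from the two consecutive slots.
/// `read` stands in for a user-memory read and returns None on failure.
fn restore_pair(cfa: u64, fp_off: i64, read: impl Fn(u64) -> Option<u64>) -> Option<(u64, u64)> {
    // A negative offset wraps around correctly under two's complement.
    let fp_addr = cfa.wrapping_add(fp_off as u64);
    let fp = read(fp_addr)?;
    let ra = read(fp_addr.wrapping_add(8))?;
    Some((fp, ra))
}
```

For the row at 0x12f84 (x29 at c-16, ra at c-8), `is_frame_record` returns true, so both registers could be restored from the CFA without guessing where the frame record is.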

Can you close this PR, and create a new issue or PR with the above approach?

@fabled
Contributor

fabled commented Dec 21, 2025

> and get_frame_address_in_block contains the pc - 1 logic I linked above.

Note that this also depends on what kind of lookup function it is passed to. Did you check that the compare function is the same as in our binary search?

@umanwizard umanwizard closed this Dec 22, 2025
@umanwizard
Contributor Author

> I think the above two are correct. And they are completely different from what you've been trying to prove so far (that we look up the wrong unwind opcode).

Yes. Just to be clear, I'm not arguing for that anymore. I'm now convinced that my initial analysis was wrong.


3 participants