Fix for "bt" command printing "bogus exception frame" warning #14

Merged 1 commit on Feb 14, 2023

fengjixuchui
Owner

Currently, the "bt" command may print a bogus exception frame
and truncate the remaining frames on x86_64 when the
"virsh send-key <kvm guest> KEY_LEFTALT KEY_SYSRQ KEY_C" command
is used to trigger a panic from the KVM host. For example:

  crash> bt
  PID: 0        TASK: ffff9e7a47e32f00  CPU: 3    COMMAND: "swapper/3"
   #0 [ffffba7900118bb8] machine_kexec at ffffffff87e5c2c7
   #1 [ffffba7900118c08] __crash_kexec at ffffffff87f9500d
   #2 [ffffba7900118cd0] panic at ffffffff87edfff9
   #3 [ffffba7900118d50] sysrq_handle_crash at ffffffff883ce2c1
   ...
   #16 [ffffba7900118fd8] handle_edge_irq at ffffffff87f559f2
   #17 [ffffba7900118ff0] asm_call_on_stack at ffffffff88800fa2
   --- <IRQ stack> ---
   #18 [ffffba790008bda0] asm_call_on_stack at ffffffff88800fa2
       RIP: ffffffffffffffff  RSP: 0000000000000124  RFLAGS: 00000003
       RAX: 0000000000000000  RBX: 0000000000000001  RCX: 0000000000000000
       RDX: ffffffff88800c1e  RSI: 0000000000000000  RDI: 0000000000000000
       RBP: 0000000000000001   R8: 0000000000000000   R9: 0000000000000000
       R10: 0000000000000000  R11: ffffffff88760555  R12: ffffba790008be08
       R13: ffffffff87f18002  R14: ffff9e7a47e32f00  R15: ffff9e7bb6198e00
       ORIG_RAX: 0000000000000000  CS: 0003  SS: 0000
  bt: WARNING: possibly bogus exception frame
  crash>

The kernel commits listed below cause this issue, so crash needs to
adjust the value of irq_eframe_link.

Related kernel commits:
[1] v5.8: 931b94145981 ("x86/entry: Provide helpers for executing on the irqstack")
[2] v5.8: fa5e5c409213 ("x86/entry: Use idtentry for interrupts")
[3] v5.12: 52d743f3b712 ("x86/softirq: Remove indirection in do_softirq_own_stack()")

Signed-off-by: Lianbo Jiang <[email protected]>
Signed-off-by: Kazuhito Hagio <[email protected]>
@fengjixuchui fengjixuchui merged commit 99ee376 into fengjixuchui:master Feb 14, 2023
fengjixuchui pushed a commit that referenced this pull request Feb 27, 2023
Kernel commit 7d65f4a65532 ("irq: Consolidate do_softirq() arch overriden
implementations") renamed call_softirq to do_softirq_own_stack, and
there is likewise no exception frame when coming from do_softirq_own_stack.
Without the patch, crash may unnecessarily output an exception frame with
a warning, as shown below:

  crash> foreach bt
  ...
  PID: 0        TASK: ffff914f820a8000  CPU: 25   COMMAND: "swapper/25"
   #0 [fffffe0000504e48] crash_nmi_callback at ffffffffa665d763
   #1 [fffffe0000504e50] nmi_handle at ffffffffa662a423
   #2 [fffffe0000504ea8] default_do_nmi at ffffffffa6fe7dc9
   #3 [fffffe0000504ec8] do_nmi at ffffffffa662a97f
   #4 [fffffe0000504ef0] end_repeat_nmi at ffffffffa70015e8
      [exception RIP: clone_endio+172]
      RIP: ffffffffc005c1ec  RSP: ffffa1d403d08e98  RFLAGS: 00000246
      RAX: 0000000000000000  RBX: ffff915326fba230  RCX: 0000000000000018
      RDX: ffffffffc0075400  RSI: 0000000000000000  RDI: ffff915326fba230
      RBP: ffff915326fba1c0   R8: 0000000000001000   R9: ffff915308d6d2a0
      R10: 000000a97dfe5e10  R11: ffffa1d40038fe98  R12: ffff915302babc40
      R13: ffff914f94360000  R14: 0000000000000000  R15: 0000000000000000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #5 [ffffa1d403d08e98] clone_endio at ffffffffc005c1ec [dm_mod]
   #6 [ffffa1d403d08ed0] blk_update_request at ffffffffa6a96954
   #7 [ffffa1d403d08f10] scsi_end_request at ffffffffa6c9b968
   #8 [ffffa1d403d08f48] scsi_io_completion at ffffffffa6c9bb3e
   #9 [ffffa1d403d08f90] blk_complete_reqs at ffffffffa6aa0e95
   #10 [ffffa1d403d08fa0] __softirqentry_text_start at ffffffffa72000dc
   #11 [ffffa1d403d08ff0] do_softirq_own_stack at ffffffffa7000f9a
  --- <IRQ stack> ---
   #12 [ffffa1d40038fe70] do_softirq_own_stack at ffffffffa7000f9a
      [exception RIP: unknown or invalid address]
      RIP: 0000000000000000  RSP: 0000000000000000  RFLAGS: 00000000
      RAX: ffffffffa672eae5  RBX: ffffffffa83b34e0  RCX: ffffffffa672eb12
      RDX: 0000000000000010  RSI: 8b7d6c8869010c00  RDI: 0000000000000085
      RBP: 0000000000000286   R8: ffff914f820a8000   R9: ffffffffa67a94e0
      R10: 0000000000000286  R11: ffffffffa66fb4c5  R12: ffffffffa67a898b
      R13: 0000000000000000  R14: fffffffffffffff8  R15: ffffffffa67a1e68
      ORIG_RAX: 0000000000000000  CS: 0000  SS: ffffffffa672edff
   bt: WARNING: possibly bogus exception frame
   #13 [ffffa1d40038ff30] start_secondary at ffffffffa665fa2c
   #14 [ffffa1d40038ff50] secondary_startup_64_no_verify at ffffffffa6600116
   ...

Reported-by: Marco Patalano <[email protected]>
Signed-off-by: Lianbo Jiang <[email protected]>
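
The rule described in this commit can be pictured with a minimal C sketch. This is not the crash source: the table and the helper name `caller_has_no_eframe` are assumptions made for illustration, and only the two symbol names come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Symbols after which no exception frame follows on the stack.
 * Both names are taken from the commit message; the table and the
 * helper below are hypothetical, not crash's actual code. */
static const char *const no_eframe_callers[] = {
    "call_softirq",          /* name before kernel commit 7d65f4a65532 */
    "do_softirq_own_stack",  /* name after the rename */
};

/* Return true if the given symbol is one that does not leave an
 * exception frame behind when it switches to the IRQ stack. */
static bool caller_has_no_eframe(const char *sym)
{
    size_t n = sizeof(no_eframe_callers) / sizeof(no_eframe_callers[0]);

    if (sym == NULL)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(sym, no_eframe_callers[i]) == 0)
            return true;
    }
    return false;
}
```

With a check of this shape, frame #12 in the output above (do_softirq_own_stack) would be treated like the older call_softirq case, and no register dump or warning would be printed for it.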
fengjixuchui pushed a commit that referenced this pull request Mar 5, 2024
…usly

For kernel modules, "dis -rl" fails to display the module's source line
numbers after the "bt" command has been executed in crash.

Without the patch:
  crash> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   #8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp
  0xffffffffc0f60eb6 <lpfc_nlp_get+6>:    push   %rbx
  0xffffffffc0f60eb7 <lpfc_nlp_get+7>:    test   %rdi,%rdi

With the patch:
  crash> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   #8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp

The root cause is that, after a kernel module has been loaded by the mod
command, its symtables are not yet expanded on the gdb side. Either the bt or
the dis command will trigger the expansion, but the two commands expand
the symtables differently:

The stack trace of "dis -rl" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  #1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  #2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  #3  0x000000000077e8e9 in process_full_comp_unit ...
  #4  process_queue ...
  #5  dw2_do_instantiate_symtab ...
  #6  0x000000000077ed67 in dw2_instantiate_symtab ...
  #7  0x000000000077f75e in dw2_expand_all_symtabs ...
  #8  0x00000000008f254d in gdb_get_line_number ...
  #9  0x00000000008f22af in gdb_command_funnel_1 ...
  #10 0x00000000008f2003 in gdb_command_funnel ...
  #11 0x00000000005b7f02 in gdb_interface ...
  #12 0x00000000005f5bd8 in get_line_number ...
  #13 0x000000000059e574 in cmd_dis ...

The stack trace of "bt" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  #1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  #2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  #3  0x000000000077e8e9 in process_full_comp_unit ...
  #4  process_queue ...
  #5  dw2_do_instantiate_symtab ...
  #6  0x000000000077ed67 in dw2_instantiate_symtab ...
  #7  0x000000000077f8ed in dw2_lookup_symbol ...
  #8  0x00000000008e6d03 in lookup_symbol_via_quick_fns ...
  #9  0x00000000008e7153 in lookup_symbol_in_objfile ...
  #10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ...
  #11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ...
  #12 0x00000000008e754e in lookup_global_or_static_symbol ...
  #13 0x00000000008e75da in lookup_static_symbol ...
  #14 0x00000000008e632c in lookup_symbol_aux ...
  #15 0x00000000008e5a7a in lookup_symbol_in_language ...
  #16 0x00000000008e5b30 in lookup_symbol ...
  #17 0x00000000008f2a4a in gdb_get_datatype ...
  #18 0x00000000008f22c0 in gdb_command_funnel_1 ...
  #19 0x00000000008f2003 in gdb_command_funnel ...
  #20 0x00000000005b7f02 in gdb_interface ...
  #21 0x00000000005f8a9f in datatype_info ...
  #22 0x0000000000599947 in cpu_map_size ...
  #23 0x00000000005a975d in get_cpus_online ...
  #24 0x0000000000637a8b in diskdump_get_prstatus_percpu ...
  #25 0x000000000062f0e4 in get_netdump_regs_x86_64 ...
  #26 0x000000000059fe68 in back_trace ...
  #27 0x00000000005ab1cb in cmd_bt ...

In the "dis -rl" stack trace, dw2_expand_all_symtabs() is called to expand
all symtables of the objfile, which is "*.ko.debug" in our case. In the
"bt" stack trace, however, only the subset of symtables needed by
dw2_lookup_symbol() to find a symbol is expanded. As a result,
objfile->compunit_symtabs, the head of a singly linked list of
struct compunit_symtab, is not NULL but does not contain all symtables.
gdb_get_line_number(), called by "dis -rl", will not expand it again because
the !objfile_has_full_symbols(objfile) check fails, so the proper source
line numbers cannot be displayed.

Since the objfile_has_full_symbols(objfile) check cannot ensure that all
symbols have been expanded, this patch adds a new member to struct objfile
as a flag that records whether all symbols have been expanded. The flag is
set only after expand_all_symtabs has been called.

Signed-off-by: Tao Liu <[email protected]>
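
The difference between the two checks can be modelled with a small C sketch. It is a toy stand-in, not gdb code: the structs carry only the fields mentioned above, the flag name `all_symtabs_expanded` and the helper names are assumptions, and the "old" check simply mirrors the non-NULL list test described in the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for gdb's structures; only the fields needed here. */
struct compunit_symtab {
    struct compunit_symtab *next;
};

struct objfile {
    struct compunit_symtab *compunit_symtabs; /* head of the singly linked list */
    bool all_symtabs_expanded;                /* hypothetical name for the new flag */
};

/* Old-style check: a non-empty list is treated as "fully expanded",
 * which is wrong after a partial expansion triggered by a symbol lookup. */
static bool needs_expansion_old(const struct objfile *obj)
{
    return obj->compunit_symtabs == NULL;
}

/* New-style check: rely on the explicit flag instead of the list head. */
static bool needs_expansion_new(const struct objfile *obj)
{
    return !obj->all_symtabs_expanded;
}

/* Models a "bt"-style lookup: adds one symtab and leaves the flag clear. */
static void expand_one_symtab(struct objfile *obj, struct compunit_symtab *cu)
{
    cu->next = obj->compunit_symtabs;
    obj->compunit_symtabs = cu;
}

/* Models a full expansion: the only place where the flag is set. */
static void expand_all_symtabs(struct objfile *obj,
                               struct compunit_symtab *cus, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        cus[i].next = obj->compunit_symtabs;
        obj->compunit_symtabs = &cus[i];
    }
    obj->all_symtabs_expanded = true;
}
```

After a single `expand_one_symtab()` call the old check already reports that nothing more needs expanding, while the flag-based check still asks for a full expansion. That is the situation "dis -rl" runs into after "bt".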
fengjixuchui pushed a commit that referenced this pull request Mar 5, 2024
This patch introduces per-cpu IRQ stacks for RISCV64 so that
"bt" can backtrace through them and "bt -E" can search them for eframes;
the "help -m" command displays the address of each
per-cpu IRQ stack.

TEST: a vmcore dumped by hacking handle_irq_event_percpu().
(Why not use lkdtm INT_HW_IRQ_EN EXCEPTION?
  There is a deadlock[1] in the crash_kexec path if that is used.)

  crash> bt
  PID: 0        TASK: ffffffff8140db00  CPU: 0    COMMAND: "swapper/0"
   #0 [ff20000000003e60] __handle_irq_event_percpu at ffffffff8006462e
   #1 [ff20000000003ed0] handle_irq_event_percpu at ffffffff80064702
   #2 [ff20000000003ef0] handle_irq_event at ffffffff8006477c
   #3 [ff20000000003f20] handle_fasteoi_irq at ffffffff80068664
   #4 [ff20000000003f50] generic_handle_domain_irq at ffffffff80063988
   #5 [ff20000000003f60] plic_handle_irq at ffffffff8046633e
   #6 [ff20000000003fb0] generic_handle_domain_irq at ffffffff80063988
   #7 [ff20000000003fc0] riscv_intc_irq at ffffffff80465f8e
   #8 [ff20000000003fd0] handle_riscv_irq at ffffffff808361e8
       PC: ffffffff80837314  [default_idle_call+50]
       RA: ffffffff80837310  [default_idle_call+46]
       SP: ffffffff81403da0  CAUSE: 8000000000000009
  epc : ffffffff80837314 ra : ffffffff80837310 sp : ffffffff81403da0
   gp : ffffffff814ef848 tp : ffffffff8140db00 t0 : ff2000000004bb18
   t1 : 0000000000032c73 t2 : ffffffff81200a48 s0 : ffffffff81403db0
   s1 : 0000000000000000 a0 : 0000000000000004 a1 : 0000000000000000
   a2 : ff6000009f1e7000 a3 : 0000000000002304 a4 : ffffffff80c1c2d8
   a5 : 0000000000000000 a6 : ff6000001fe01958 a7 : 00002496ea89dbf1
   s2 : ffffffff814f0220 s3 : 0000000000000001 s4 : 000000000000003f
   s5 : ffffffff814f03d8 s6 : 0000000000000000 s7 : ffffffff814f00d0
   s8 : ffffffff81526f10 s9 : ffffffff80c1d880 s10: 0000000000000000
   s11: 0000000000000001 t3 : 0000000000003392 t4 : 0000000000000000
   t5 : 0000000000000000 t6 : 0000000000000040
   status: 0000000200000120 badaddr: 0000000000000000
    cause: 8000000000000009 orig_a0: ffffffff80837310
  --- <IRQ stack> ---
   #9 [ffffffff81403da0] default_idle_call at ffffffff80837314
   #10 [ffffffff81403db0] do_idle at ffffffff8004d0a0
   #11 [ffffffff81403e40] cpu_startup_entry at ffffffff8004d21e
   #12 [ffffffff81403e60] kernel_init at ffffffff8083746a
   #13 [ffffffff81403e70] arch_post_acpi_subsys_init at ffffffff80a006d8
   #14 [ffffffff81403e80] console_on_rootfs at ffffffff80a00c92
  crash>

  crash> bt -E
  CPU 0 IRQ STACK:
  KERNEL-MODE EXCEPTION FRAME AT: ff20000000003a48
       PC: ffffffff8006462e  [__handle_irq_event_percpu+30]
       RA: ffffffff80064702  [handle_irq_event_percpu+18]
       SP: ff20000000003e60  CAUSE: 000000000000000d
  epc : ffffffff8006462e ra : ffffffff80064702 sp : ff20000000003e60
   gp : ffffffff814ef848 tp : ffffffff8140db00 t0 : 0000000000046600
   t1 : ffffffff80836464 t2 : ffffffff81200a48 s0 : ff20000000003ed0
   s1 : 0000000000000000 a0 : 0000000000000000 a1 : 0000000000000118
   a2 : 0000000000000052 a3 : 0000000000000000 a4 : 0000000000000000
   a5 : 0000000000010001 a6 : ff6000001fe01958 a7 : 00002496ea89dbf1
   s2 : ff60000000941ab0 s3 : ffffffff814a0658 s4 : ff60000000089230
   s5 : ffffffff814a0518 s6 : ffffffff814a0620 s7 : ffffffff80e5f0f8
   s8 : ffffffff80fc50b0 s9 : ffffffff80c1d880 s10: 0000000000000000
   s11: 0000000000000001 t3 : 0000000000003392 t4 : 0000000000000000
   t5 : 0000000000000000 t6 : 0000000000000040
   status: 0000000200000100 badaddr: 0000000000000078
    cause: 000000000000000d orig_a0: ff20000000003ea0

  CPU 1 IRQ STACK: (none found)

  crash>

  crash> help -m
  <snip>
             machspec: ced1e0
          irq_stack_size: 16384
           irq_stacks[0]: ff20000000000000
           irq_stacks[1]: ff20000000008000
  crash>

[1]: https://lore.kernel.org/linux-riscv/[email protected]/

Signed-off-by: Song Shuai <[email protected]>
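
As an illustration of how the per-cpu values printed by "help -m" relate to the frame addresses in the backtrace, here is a minimal C sketch of a range check. The function name `irq_stack_cpu` and the fixed two-CPU table are assumptions; the base addresses and the stack size are copied from the output above. This is not the code added by the patch.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 2  /* hypothetical: two CPUs, as in the "help -m" output */

/* Values copied from the "help -m" output above. */
static const uint64_t irq_stack_size = 16384;
static const uint64_t irq_stacks[NR_CPUS_SKETCH] = {
    0xff20000000000000ULL,
    0xff20000000008000ULL,
};

/* Return the CPU whose IRQ stack contains addr, or -1 if none does. */
static int irq_stack_cpu(uint64_t addr)
{
    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
        uint64_t base = irq_stacks[cpu];

        if (addr >= base && addr < base + irq_stack_size)
            return cpu;
    }
    return -1;
}
```

Frame #0 at ff20000000003e60 falls inside CPU 0's range, whereas frame #9 at ffffffff81403da0 is outside both ranges, which is consistent with the position of the "--- <IRQ stack> ---" separator in the backtrace.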