Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

missing (un)xlate_dev_mem_ptr exports on s390x #19

Open
hramrach opened this issue Aug 16, 2017 · 3 comments
Open

missing (un)xlate_dev_mem_ptr exports on s390x #19

hramrach opened this issue Aug 16, 2017 · 3 comments

Comments

@hramrach
Copy link

Loading crash module on s390x these messages are seen:

[ 129.794298] crash: loading out-of-tree module taints kernel.
[ 129.795196] crash: Unknown symbol unxlate_dev_mem_ptr (err 0)
[ 129.795379] crash: Unknown symbol xlate_dev_mem_ptr (err 0)

On most architectures xlate_dev_mem_ptr is #defined to __va but on s390x it is implemented as actual function.

@crash-utility
Copy link
Collaborator

crash-utility commented Aug 16, 2017 via email

@hramrach
Copy link
Author

I can export the functions with no problem.

The point here is that either the driver should be fixed to work with non-RH kernels or the export submitted upstream.

@crash-utility
Copy link
Collaborator

crash-utility commented Aug 16, 2017 via email

k-hagio pushed a commit that referenced this issue Nov 29, 2023
…usly

There is an issue that, for kernel modules, "dis -rl" fails to display
modules code line number data after execute "bt" command in crash.

Without the patch:
  crsah> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   #8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp
  0xffffffffc0f60eb6 <lpfc_nlp_get+6>:    push   %rbx
  0xffffffffc0f60eb7 <lpfc_nlp_get+7>:    test   %rdi,%rdi

With the patch:
  crash> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   #8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp

The root cause is, after kernel module been loaded by mod command, the symtable
is not expanded in gdb side. crash bt or dis command will trigger such an
expansion. However the symtable expansion is different for the 2 commands:

The stack trace of "dis -rl" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  #1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  #2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  #3  0x000000000077e8e9 in process_full_comp_unit ...
  #4  process_queue ...
  #5  dw2_do_instantiate_symtab ...
  #6  0x000000000077ed67 in dw2_instantiate_symtab ...
  #7  0x000000000077f75e in dw2_expand_all_symtabs ...
  #8  0x00000000008f254d in gdb_get_line_number ...
  #9  0x00000000008f22af in gdb_command_funnel_1 ...
  #10 0x00000000008f2003 in gdb_command_funnel ...
  #11 0x00000000005b7f02 in gdb_interface ...
  #12 0x00000000005f5bd8 in get_line_number ...
  #13 0x000000000059e574 in cmd_dis ...

The stack trace of "bt" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  #1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  #2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  #3  0x000000000077e8e9 in process_full_comp_unit ...
  #4  process_queue ...
  #5  dw2_do_instantiate_symtab ...
  #6  0x000000000077ed67 in dw2_instantiate_symtab ...
  #7  0x000000000077f8ed in dw2_lookup_symbol ...
  #8  0x00000000008e6d03 in lookup_symbol_via_quick_fns ...
  #9  0x00000000008e7153 in lookup_symbol_in_objfile ...
  #10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ...
  #11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ...
  #12 0x00000000008e754e in lookup_global_or_static_symbol ...
  #13 0x00000000008e75da in lookup_static_symbol ...
  #14 0x00000000008e632c in lookup_symbol_aux ...
  #15 0x00000000008e5a7a in lookup_symbol_in_language ...
  #16 0x00000000008e5b30 in lookup_symbol ...
  #17 0x00000000008f2a4a in gdb_get_datatype ...
  #18 0x00000000008f22c0 in gdb_command_funnel_1 ...
  #19 0x00000000008f2003 in gdb_command_funnel ...
  #20 0x00000000005b7f02 in gdb_interface ...
  #21 0x00000000005f8a9f in datatype_info ...
  #22 0x0000000000599947 in cpu_map_size ...
  #23 0x00000000005a975d in get_cpus_online ...
  #24 0x0000000000637a8b in diskdump_get_prstatus_percpu ...
  #25 0x000000000062f0e4 in get_netdump_regs_x86_64 ...
  #26 0x000000000059fe68 in back_trace ...
  #27 0x00000000005ab1cb in cmd_bt ...

For the stacktrace of "dis -rl", it calls dw2_expand_all_symtabs() to expand
all symtable of the objfile, or "*.ko.debug" in our case. However for
the stacktrace of "bt", it doesn't expand all, but only a subset of symtable
which is enough to find a symbol by dw2_lookup_symbol(). As a result, the
objfile->compunit_symtabs, which is the head of a single linked list of
struct compunit_symtab, is not NULL but didn't contain all symtables. It
will not be reinitialized in gdb_get_line_number() by "dis -rl" because
!objfile_has_full_symbols(objfile) check will fail, so it cannot display
the proper code line number data.

Since objfile_has_full_symbols(objfile) check cannot ensure all symbols
been expanded, this patch add a new member as a flag for struct objfile
to record if all symbols have been expanded. The flag will be set only ofter
expand_all_symtabs been called.

Signed-off-by: Tao Liu <[email protected]>
liutgnu added a commit to liutgnu/crash-preview that referenced this issue Nov 30, 2023
…usly

There is an issue that, for kernel modules, "dis -rl" fails to display
modules code line number data after execute "bt" command in crash.

Without the patch:
  crsah> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp
  0xffffffffc0f60eb6 <lpfc_nlp_get+6>:    push   %rbx
  0xffffffffc0f60eb7 <lpfc_nlp_get+7>:    test   %rdi,%rdi

With the patch:
  crash> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp

The root cause is, after kernel module been loaded by mod command, the symtable
is not expanded in gdb side. crash bt or dis command will trigger such an
expansion. However the symtable expansion is different for the 2 commands:

The stack trace of "dis -rl" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  crash-utility#1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  crash-utility#2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  crash-utility#3  0x000000000077e8e9 in process_full_comp_unit ...
  crash-utility#4  process_queue ...
  crash-utility#5  dw2_do_instantiate_symtab ...
  crash-utility#6  0x000000000077ed67 in dw2_instantiate_symtab ...
  crash-utility#7  0x000000000077f75e in dw2_expand_all_symtabs ...
  crash-utility#8  0x00000000008f254d in gdb_get_line_number ...
  crash-utility#9  0x00000000008f22af in gdb_command_funnel_1 ...
  crash-utility#10 0x00000000008f2003 in gdb_command_funnel ...
  crash-utility#11 0x00000000005b7f02 in gdb_interface ...
  crash-utility#12 0x00000000005f5bd8 in get_line_number ...
  crash-utility#13 0x000000000059e574 in cmd_dis ...

The stack trace of "bt" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  crash-utility#1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  crash-utility#2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  crash-utility#3  0x000000000077e8e9 in process_full_comp_unit ...
  crash-utility#4  process_queue ...
  crash-utility#5  dw2_do_instantiate_symtab ...
  crash-utility#6  0x000000000077ed67 in dw2_instantiate_symtab ...
  crash-utility#7  0x000000000077f8ed in dw2_lookup_symbol ...
  crash-utility#8  0x00000000008e6d03 in lookup_symbol_via_quick_fns ...
  crash-utility#9  0x00000000008e7153 in lookup_symbol_in_objfile ...
  crash-utility#10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ...
  crash-utility#11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ...
  crash-utility#12 0x00000000008e754e in lookup_global_or_static_symbol ...
  crash-utility#13 0x00000000008e75da in lookup_static_symbol ...
  crash-utility#14 0x00000000008e632c in lookup_symbol_aux ...
  crash-utility#15 0x00000000008e5a7a in lookup_symbol_in_language ...
  crash-utility#16 0x00000000008e5b30 in lookup_symbol ...
  crash-utility#17 0x00000000008f2a4a in gdb_get_datatype ...
  crash-utility#18 0x00000000008f22c0 in gdb_command_funnel_1 ...
  crash-utility#19 0x00000000008f2003 in gdb_command_funnel ...
  crash-utility#20 0x00000000005b7f02 in gdb_interface ...
  crash-utility#21 0x00000000005f8a9f in datatype_info ...
  crash-utility#22 0x0000000000599947 in cpu_map_size ...
  crash-utility#23 0x00000000005a975d in get_cpus_online ...
  crash-utility#24 0x0000000000637a8b in diskdump_get_prstatus_percpu ...
  crash-utility#25 0x000000000062f0e4 in get_netdump_regs_x86_64 ...
  crash-utility#26 0x000000000059fe68 in back_trace ...
  crash-utility#27 0x00000000005ab1cb in cmd_bt ...

For the stacktrace of "dis -rl", it calls dw2_expand_all_symtabs() to expand
all symtable of the objfile, or "*.ko.debug" in our case. However for
the stacktrace of "bt", it doesn't expand all, but only a subset of symtable
which is enough to find a symbol by dw2_lookup_symbol(). As a result, the
objfile->compunit_symtabs, which is the head of a single linked list of
struct compunit_symtab, is not NULL but didn't contain all symtables. It
will not be reinitialized in gdb_get_line_number() by "dis -rl" because
!objfile_has_full_symbols(objfile) check will fail, so it cannot display
the proper code line number data.

Since objfile_has_full_symbols(objfile) check cannot ensure all symbols
been expanded, this patch add a new member as a flag for struct objfile
to record if all symbols have been expanded. The flag will be set only ofter
expand_all_symtabs been called.

Signed-off-by: Tao Liu <[email protected]>
liutgnu added a commit to liutgnu/crash-preview that referenced this issue Dec 1, 2023
…eviously

There is an issue that, for kernel modules, "dis -rl" fails to display
modules code line number data after execute "bt" command in crash.

Without the patch:
  crsah> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp
  0xffffffffc0f60eb6 <lpfc_nlp_get+6>:    push   %rbx
  0xffffffffc0f60eb7 <lpfc_nlp_get+7>:    test   %rdi,%rdi

With the patch:
  crash> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp

The root cause is, after kernel module been loaded by mod command, the symtable
is not expanded in gdb side. crash bt or dis command will trigger such an
expansion. However the symtable expansion is different for the 2 commands:

The stack trace of "dis -rl" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  crash-utility#1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  crash-utility#2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  crash-utility#3  0x000000000077e8e9 in process_full_comp_unit ...
  crash-utility#4  process_queue ...
  crash-utility#5  dw2_do_instantiate_symtab ...
  crash-utility#6  0x000000000077ed67 in dw2_instantiate_symtab ...
  crash-utility#7  0x000000000077f75e in dw2_expand_all_symtabs ...
  crash-utility#8  0x00000000008f254d in gdb_get_line_number ...
  crash-utility#9  0x00000000008f22af in gdb_command_funnel_1 ...
  crash-utility#10 0x00000000008f2003 in gdb_command_funnel ...
  crash-utility#11 0x00000000005b7f02 in gdb_interface ...
  crash-utility#12 0x00000000005f5bd8 in get_line_number ...
  crash-utility#13 0x000000000059e574 in cmd_dis ...

The stack trace of "bt" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  crash-utility#1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  crash-utility#2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  crash-utility#3  0x000000000077e8e9 in process_full_comp_unit ...
  crash-utility#4  process_queue ...
  crash-utility#5  dw2_do_instantiate_symtab ...
  crash-utility#6  0x000000000077ed67 in dw2_instantiate_symtab ...
  crash-utility#7  0x000000000077f8ed in dw2_lookup_symbol ...
  crash-utility#8  0x00000000008e6d03 in lookup_symbol_via_quick_fns ...
  crash-utility#9  0x00000000008e7153 in lookup_symbol_in_objfile ...
  crash-utility#10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ...
  crash-utility#11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ...
  crash-utility#12 0x00000000008e754e in lookup_global_or_static_symbol ...
  crash-utility#13 0x00000000008e75da in lookup_static_symbol ...
  crash-utility#14 0x00000000008e632c in lookup_symbol_aux ...
  crash-utility#15 0x00000000008e5a7a in lookup_symbol_in_language ...
  crash-utility#16 0x00000000008e5b30 in lookup_symbol ...
  crash-utility#17 0x00000000008f2a4a in gdb_get_datatype ...
  crash-utility#18 0x00000000008f22c0 in gdb_command_funnel_1 ...
  crash-utility#19 0x00000000008f2003 in gdb_command_funnel ...
  crash-utility#20 0x00000000005b7f02 in gdb_interface ...
  crash-utility#21 0x00000000005f8a9f in datatype_info ...
  crash-utility#22 0x0000000000599947 in cpu_map_size ...
  crash-utility#23 0x00000000005a975d in get_cpus_online ...
  crash-utility#24 0x0000000000637a8b in diskdump_get_prstatus_percpu ...
  crash-utility#25 0x000000000062f0e4 in get_netdump_regs_x86_64 ...
  crash-utility#26 0x000000000059fe68 in back_trace ...
  crash-utility#27 0x00000000005ab1cb in cmd_bt ...

For the stacktrace of "dis -rl", it calls dw2_expand_all_symtabs() to expand
all symtable of the objfile, or "*.ko.debug" in our case. However for
the stacktrace of "bt", it doesn't expand all, but only a subset of symtable
which is enough to find a symbol by dw2_lookup_symbol(). As a result, the
objfile->compunit_symtabs, which is the head of a single linked list of
struct compunit_symtab, is not NULL but didn't contain all symtables. It
will not be reinitialized in gdb_get_line_number() by "dis -rl" because
!objfile_has_full_symbols(objfile) check will fail, so it cannot display
the proper code line number data.

Since objfile_has_full_symbols(objfile) check cannot ensure all symbols
been expanded, this patch add a new member as a flag for struct objfile
to record if all symbols have been expanded. The flag will be set only ofter
expand_all_symtabs been called.

Signed-off-by: Tao Liu <[email protected]>
sugarfillet pushed a commit to sugarfillet/crash that referenced this issue Dec 13, 2023
The patch introduces per-cpu overflow stacks for RISCV64 to let
"bt" do backtrace on it and the 'help -m' command dispalys the
addresss of each per-cpu overflow stack.

TEST: a lkdtm DIRECT EXHAUST_STACK vmcore

```
crash> bt
PID: 1        TASK: ff600000000d8000  CPU: 1    COMMAND: "sh"
 #0 [ff6000001fc501c0] riscv_crash_save_regs at ffffffff8000a1dc
 crash-utility#1 [ff6000001fc50320] panic at ffffffff808773ec
 crash-utility#2 [ff6000001fc50380] walk_stackframe at ffffffff800056da
     PC: ffffffff80876a34  [memset+96]
     RA: ffffffff80563dc0  [recursive_loop+68]
     SP: ff2000000000fd50  CAUSE: 000000000000000f
epc : ffffffff80876a34 ra : ffffffff80563dc0 sp : ff2000000000fd50
 gp : ffffffff81515d38 tp : 0000000000000000 t0 : ff2000000000fd58
 t1 : ff600000000d88c8 t2 : 6143203a6d74646b s0 : ff20000000010190
 s1 : 0000000000000012 a0 : ff2000000000fd58 a1 : 1212121212121212
 a2 : 0000000000000400 a3 : ff20000000010158 a4 : 0000000000000000
 a5 : 725bedba92260900 a6 : 000000000130e0f0 a7 : 0000000000000000
 s2 : ff2000000000fd58 s3 : ffffffff815170d8 s4 : ff20000000013e60
 s5 : 000000000000000e s6 : ff20000000013e60 s7 : 0000000000000000
 s8 : ff60000000861000 s9 : 00007fffc3641694 s10: 00007fffc3641690
 s11: 00005555796ed240 t3 : 0000000000010297 t4 : ffffffff80c17810
 t5 : ffffffff8195e7b8 t6 : ff20000000013b18
 status: 0000000200000120 badaddr: ff2000000000fd58
  cause: 000000000000000f orig_a0: 0000000000000000
--- <OVERFLOW stack> ---
 crash-utility#3 [ff2000000000fd50] memset at ffffffff80876a34
 crash-utility#4 [ff20000000010190] recursive_loop at ffffffff80563e16
 crash-utility#5 [ff200000000105d0] recursive_loop at ffffffff80563e16
 < recursive_loop ...>
 crash-utility#16 [ff20000000013490] recursive_loop at ffffffff80563e16
 crash-utility#17 [ff200000000138d0] recursive_loop at ffffffff80563e16
 crash-utility#18 [ff20000000013d10] lkdtm_EXHAUST_STACK at ffffffff8088005e
 crash-utility#19 [ff20000000013d30] lkdtm_do_action at ffffffff80563292
 crash-utility#20 [ff20000000013d40] direct_entry at ffffffff80563474
 crash-utility#21 [ff20000000013d70] full_proxy_write at ffffffff8032fb3a
 crash-utility#22 [ff20000000013db0] vfs_write at ffffffff801d6414
 crash-utility#23 [ff20000000013e60] ksys_write at ffffffff801d67b8
 crash-utility#24 [ff20000000013eb0] __riscv_sys_write at ffffffff801d6832
 crash-utility#25 [ff20000000013ec0] do_trap_ecall_u at ffffffff80884a20
crash>

crash> help -m
<snip>
        irq_stack_size: 16384
         irq_stacks[0]: ff20000000000000
         irq_stacks[1]: ff20000000008000
        overflow_stack_size: 4096
         overflow_stacks[0]: ff6000001fa7a510
         overflow_stacks[1]: ff6000001fc4f510
crash>
```
Signed-off-by: Song Shuai <[email protected]>
k-hagio pushed a commit that referenced this issue Jan 11, 2024
The patch introduces per-cpu overflow stacks for RISCV64 to let
"bt" do backtrace on it and the 'help -m' command dispalys the
addresss of each per-cpu overflow stack.

TEST: a lkdtm DIRECT EXHAUST_STACK vmcore

  crash> bt
  PID: 1        TASK: ff600000000d8000  CPU: 1    COMMAND: "sh"
   #0 [ff6000001fc501c0] riscv_crash_save_regs at ffffffff8000a1dc
   #1 [ff6000001fc50320] panic at ffffffff808773ec
   #2 [ff6000001fc50380] walk_stackframe at ffffffff800056da
       PC: ffffffff80876a34  [memset+96]
       RA: ffffffff80563dc0  [recursive_loop+68]
       SP: ff2000000000fd50  CAUSE: 000000000000000f
  epc : ffffffff80876a34 ra : ffffffff80563dc0 sp : ff2000000000fd50
   gp : ffffffff81515d38 tp : 0000000000000000 t0 : ff2000000000fd58
   t1 : ff600000000d88c8 t2 : 6143203a6d74646b s0 : ff20000000010190
   s1 : 0000000000000012 a0 : ff2000000000fd58 a1 : 1212121212121212
   a2 : 0000000000000400 a3 : ff20000000010158 a4 : 0000000000000000
   a5 : 725bedba92260900 a6 : 000000000130e0f0 a7 : 0000000000000000
   s2 : ff2000000000fd58 s3 : ffffffff815170d8 s4 : ff20000000013e60
   s5 : 000000000000000e s6 : ff20000000013e60 s7 : 0000000000000000
   s8 : ff60000000861000 s9 : 00007fffc3641694 s10: 00007fffc3641690
   s11: 00005555796ed240 t3 : 0000000000010297 t4 : ffffffff80c17810
   t5 : ffffffff8195e7b8 t6 : ff20000000013b18
   status: 0000000200000120 badaddr: ff2000000000fd58
    cause: 000000000000000f orig_a0: 0000000000000000
  --- <OVERFLOW stack> ---
   #3 [ff2000000000fd50] memset at ffffffff80876a34
   #4 [ff20000000010190] recursive_loop at ffffffff80563e16
   #5 [ff200000000105d0] recursive_loop at ffffffff80563e16
   < recursive_loop ...>
   #16 [ff20000000013490] recursive_loop at ffffffff80563e16
   #17 [ff200000000138d0] recursive_loop at ffffffff80563e16
   #18 [ff20000000013d10] lkdtm_EXHAUST_STACK at ffffffff8088005e
   #19 [ff20000000013d30] lkdtm_do_action at ffffffff80563292
   #20 [ff20000000013d40] direct_entry at ffffffff80563474
   #21 [ff20000000013d70] full_proxy_write at ffffffff8032fb3a
   #22 [ff20000000013db0] vfs_write at ffffffff801d6414
   #23 [ff20000000013e60] ksys_write at ffffffff801d67b8
   #24 [ff20000000013eb0] __riscv_sys_write at ffffffff801d6832
   #25 [ff20000000013ec0] do_trap_ecall_u at ffffffff80884a20
  crash>

  crash> help -m
  <snip>
          irq_stack_size: 16384
           irq_stacks[0]: ff20000000000000
           irq_stacks[1]: ff20000000008000
          overflow_stack_size: 4096
           overflow_stacks[0]: ff6000001fa7a510
           overflow_stacks[1]: ff6000001fc4f510
  crash>

Signed-off-by: Song Shuai <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this issue Feb 21, 2024
The patch introduces per-cpu overflow stacks for RISCV64 to let
"bt" do backtrace on it and the 'help -m' command dispalys the
addresss of each per-cpu overflow stack.

TEST: a lkdtm DIRECT EXHAUST_STACK vmcore

  crash> bt
  PID: 1        TASK: ff600000000d8000  CPU: 1    COMMAND: "sh"
   #0 [ff6000001fc501c0] riscv_crash_save_regs at ffffffff8000a1dc
   crash-utility#1 [ff6000001fc50320] panic at ffffffff808773ec
   crash-utility#2 [ff6000001fc50380] walk_stackframe at ffffffff800056da
       PC: ffffffff80876a34  [memset+96]
       RA: ffffffff80563dc0  [recursive_loop+68]
       SP: ff2000000000fd50  CAUSE: 000000000000000f
  epc : ffffffff80876a34 ra : ffffffff80563dc0 sp : ff2000000000fd50
   gp : ffffffff81515d38 tp : 0000000000000000 t0 : ff2000000000fd58
   t1 : ff600000000d88c8 t2 : 6143203a6d74646b s0 : ff20000000010190
   s1 : 0000000000000012 a0 : ff2000000000fd58 a1 : 1212121212121212
   a2 : 0000000000000400 a3 : ff20000000010158 a4 : 0000000000000000
   a5 : 725bedba92260900 a6 : 000000000130e0f0 a7 : 0000000000000000
   s2 : ff2000000000fd58 s3 : ffffffff815170d8 s4 : ff20000000013e60
   s5 : 000000000000000e s6 : ff20000000013e60 s7 : 0000000000000000
   s8 : ff60000000861000 s9 : 00007fffc3641694 s10: 00007fffc3641690
   s11: 00005555796ed240 t3 : 0000000000010297 t4 : ffffffff80c17810
   t5 : ffffffff8195e7b8 t6 : ff20000000013b18
   status: 0000000200000120 badaddr: ff2000000000fd58
    cause: 000000000000000f orig_a0: 0000000000000000
  --- <OVERFLOW stack> ---
   crash-utility#3 [ff2000000000fd50] memset at ffffffff80876a34
   crash-utility#4 [ff20000000010190] recursive_loop at ffffffff80563e16
   crash-utility#5 [ff200000000105d0] recursive_loop at ffffffff80563e16
   < recursive_loop ...>
   crash-utility#16 [ff20000000013490] recursive_loop at ffffffff80563e16
   crash-utility#17 [ff200000000138d0] recursive_loop at ffffffff80563e16
   crash-utility#18 [ff20000000013d10] lkdtm_EXHAUST_STACK at ffffffff8088005e
   crash-utility#19 [ff20000000013d30] lkdtm_do_action at ffffffff80563292
   crash-utility#20 [ff20000000013d40] direct_entry at ffffffff80563474
   crash-utility#21 [ff20000000013d70] full_proxy_write at ffffffff8032fb3a
   crash-utility#22 [ff20000000013db0] vfs_write at ffffffff801d6414
   crash-utility#23 [ff20000000013e60] ksys_write at ffffffff801d67b8
   crash-utility#24 [ff20000000013eb0] __riscv_sys_write at ffffffff801d6832
   crash-utility#25 [ff20000000013ec0] do_trap_ecall_u at ffffffff80884a20
  crash>

  crash> help -m
  <snip>
          irq_stack_size: 16384
           irq_stacks[0]: ff20000000000000
           irq_stacks[1]: ff20000000008000
          overflow_stack_size: 4096
           overflow_stacks[0]: ff6000001fa7a510
           overflow_stacks[1]: ff6000001fc4f510
  crash>

Signed-off-by: Song Shuai <[email protected]>
k-hagio pushed a commit that referenced this issue Mar 28, 2024
…ame" warning

The "bogus exception frame" warning was observed again on a specific
vmcore, and the remaining frame was truncated on x86_64 machine, when
executing the "bt" command as below:

  crash> bt 0 -c 8
  PID: 0        TASK: ffff9948c08f5640  CPU: 8    COMMAND: "swapper/8"
   #0 [fffffe1788788e58] crash_nmi_callback at ffffffff972672bb
   #1 [fffffe1788788e68] nmi_handle at ffffffff9722eb8e
   #2 [fffffe1788788eb0] default_do_nmi at ffffffff97e51cd0
   #3 [fffffe1788788ed0] exc_nmi at ffffffff97e51ee1
   #4 [fffffe1788788ef0] end_repeat_nmi at ffffffff980015f9
      [exception RIP: __update_load_avg_se+13]
      RIP: ffffffff9736b16d  RSP: ffffbec3c08acc78  RFLAGS: 00000046
      RAX: 0000000000000000  RBX: ffff994c2f2b1a40  RCX: ffffbec3c08acdc0
      RDX: ffff9948e4fe1d80  RSI: ffff994c2f2b1a40  RDI: 0000001d7ad7d55d
      RBP: ffffbec3c08acc88   R8: 0000001d921fca6f   R9: ffff994c2f2b1328
      R10: 00000000fffd0010  R11: ffffffff98e060c0  R12: 0000001d7ad7d55d
      R13: 0000000000000005  R14: ffff994c2f2b19c0  R15: 0000000000000001
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #5 [ffffbec3c08acc78] __update_load_avg_se at ffffffff9736b16d
   #6 [ffffbec3c08acce0] enqueue_entity at ffffffff9735c9ab
   #7 [ffffbec3c08acd28] enqueue_task_fair at ffffffff9735cef8
  ...
  #18 [ffffbec3c08acf90] blk_complete_reqs at ffffffff977978d0
  #19 [ffffbec3c08acfa0] __do_softirq at ffffffff97e66f7a
  #20 [ffffbec3c08acff0] do_softirq at ffffffff9730f6ef
  --- <IRQ stack> ---
  #21 [ffffbec3c022ff18] do_idle at ffffffff97368288
      [exception RIP: unknown or invalid address]
      RIP: 0000000000000000  RSP: 0000000000000000  RFLAGS: 00000000
      RAX: 0000000000000000  RBX: 000000089726a2d0  RCX: 0000000000000000
      RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
      RBP: ffffffff9726a3dd   R8: 0000000000000000   R9: 0000000000000000
      R10: ffffffff9720015a  R11: e48885e126bc1600  R12: 0000000000000000
      R13: ffffffff973684a9  R14: 0000000000000094  R15: 0000000040000000
      ORIG_RAX: 0000000000000000  CS: 0000  SS: 0000
  bt: WARNING: possibly bogus exception frame
  crash>

Actually there is no exception frame, when called from do_softirq().
With the patch:

  crash> bt 0 -c 8
  ...
  #18 [ffffbec3c08acf90] blk_complete_reqs at ffffffff977978d0
  #19 [ffffbec3c08acfa0] __do_softirq at ffffffff97e66f7a
  #20 [ffffbec3c08acff0] do_softirq at ffffffff9730f6ef
  --- <IRQ stack> ---
  #21 [ffffbec3c022ff28] cpu_startup_entry at ffffffff973684a9
  #22 [ffffbec3c022ff38] start_secondary at ffffffff9726a3dd
  #23 [ffffbec3c022ff50] secondary_startup_64_no_verify at ffffffff9720015a
  crash>

Reported-by: Jie Li <[email protected]>
Signed-off-by: Lianbo Jiang <[email protected]>
k-hagio pushed a commit that referenced this issue Apr 17, 2024
On x86_64, there are cases where crash cannot handle IRQ exception
frames properly.  For example, with RHEL9.3 kernel, "bt" command fails
with with "WARNING possibly bogus exception frame":

  crash> bt -c 30
  PID: 2898241  TASK: ff4cb0ce0da0c680  CPU: 30   COMMAND: "star-ccm+"
   #0 [fffffe4658d88e58] crash_nmi_callback at ffffffffa00675e8
   #1 [fffffe4658d88e68] nmi_handle at ffffffffa002ebab
  ...
  --- <NMI exception stack> ---
  ...
  #13 [ff5eba269937cf90] __do_softirq at ffffffffa0c6c007
  #14 [ff5eba269937cfe0] __irq_exit_rcu at ffffffffa010ef61
  #15 [ff5eba269937cff0] sysvec_apic_timer_interrupt at ffffffffa0c58ca2
  --- <IRQ stack> ---
      RIP: 0000000000000010  RSP: 0000000000000018  RFLAGS: ff5eba26ddc9f7e8
      RAX: 0000000000000a20  RBX: ff5eba26ddc9f940  RCX: 0000000000001000
      RDX: ffffffb559980000  RSI: ff4cb12d67207400  RDI: ffffffffffffffff
      RBP: 0000000000001000   R8: ff5eba26ddc9f940   R9: ff5eba26ddc9f8af
      R10: 0000000000000003  R11: 0000000000000a20  R12: ff5eba26ddc9f8b0
      R13: 000000283c07f000  R14: ff4cb0f5a29a1c00  R15: 0000000000000001
      ORIG_RAX: ffffffffa07c4e60  CS: 0206  SS: 7000001cf0380001
  bt: WARNING: possibly bogus exception frame

Running "crash" with "--machdep irq_eframe_link=0xffffffffffffffe8"
option (i.e. thus irq_eframe_link = -24) works properly:

  PID: 2898241  TASK: ff4cb0ce0da0c680  CPU: 30   COMMAND: "star-ccm+"
   #0 [fffffe4658d88e58] crash_nmi_callback at ffffffffa00675e8
   #1 [fffffe4658d88e68] nmi_handle at ffffffffa002ebab
  ...
  --- <NMI exception stack> ---
  ...
  #13 [ff5eba269937cf90] __do_softirq at ffffffffa0c6c007
  #14 [ff5eba269937cfe0] __irq_exit_rcu at ffffffffa010ef61
  #15 [ff5eba269937cff0] sysvec_apic_timer_interrupt at ffffffffa0c58ca2
  --- <IRQ stack> ---
  #16 [ff5eba26ddc9f738] asm_sysvec_apic_timer_interrupt at ffffffffa0e00e06
      [exception RIP: alloc_pte.constprop.0+32]
      RIP: ffffffffa07c4e60  RSP: ff5eba26ddc9f7e8  RFLAGS: 00000206
      RAX: ff5eba26ddc9f940  RBX: 0000000000001000  RCX: 0000000000000a20
      RDX: 0000000000001000  RSI: ffffffb559980000  RDI: ff4cb12d67207400
      RBP: ff5eba26ddc9f8b0   R8: ff5eba26ddc9f8af   R9: 0000000000000003
      R10: 0000000000000a20  R11: ff5eba26ddc9f940  R12: 000000283c07f000
      R13: ff4cb0f5a29a1c00  R14: 0000000000000001  R15: ff4cb0f5a29a1bf8
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #17 [ff5eba26ddc9f830] iommu_v1_map_pages at ffffffffa07c5648
  #18 [ff5eba26ddc9f8f8] __iommu_map at ffffffffa07d7803
  #19 [ff5eba26ddc9f990] iommu_map_sg at ffffffffa07d7b71
  #20 [ff5eba26ddc9f9f0] iommu_dma_map_sg at ffffffffa07ddcc9
  #21 [ff5eba26ddc9fa90] __dma_map_sg_attrs at ffffffffa01b5205
  ...

Some background:

asm_common_interrupt:
      callq  error_entry
      movq   %rax,%rsp
      movq   %rsp,%rdi
      movq   0x78(%rsp),%rsi
      movq   $-0x1,0x78(%rsp)
      call common_interrupt    # rsp pointing to regs

common_interrupt:
      pushq  %r12
      pushq  %rbp
      pushq  %rbx
      [...]
      movq   hardirq_stack_ptr,%r11
      movq   %rsp,(%r11)
      movq   %r11,%rsp
      [...]
      call __common_interrupt  # rip:common_interrupt

So frame_size(rip:common_interrupt) = 32 (3 push + ret).

Hence "machdep->machspec->irq_eframe_link = -32;" (see x86_64_irq_eframe_link_init()).

Now:

asm_sysvec_apic_timer_interrupt:
      pushq  $-0x1
      callq  error_entry
      movq   %rax,%rsp
      movq   %rsp,%rdi
      callq  sysvec_apic_timer_interrupt

sysvec_apic_timer_interrupt:
      pushq  %r12
      pushq  %rbp
      [...]
      movq   hardirq_stack_ptr,%r11
      movq   %rsp,(%r11)
      movq   %r11,%rsp
      [...]
      call __sysvec_apic_timer_interrupt  # rip:sysvec_apic_timer_interrupt

Here frame_size(rip:sysvec_apic_timer_interrupt) = 24 (2 push + ret)

We should also notice that:

rip  = *(hardirq_stack_ptr - 8)
rsp  = *(hardirq_stack_ptr)
regs = rsp - frame_size(rip)

But x86_64_get_framesize() does not work with IRQ handlers (returns 0).
So not many options other than hardcoding the most likely value and
looking around it.  Actually x86_64_irq_eframe_link() was trying -32
(default), and then -40, but not -24.

Signed-off-by: Georges Aureau <[email protected]>
liutgnu added a commit to liutgnu/crash-preview that referenced this issue Dec 5, 2024
…usly

There is an issue that, for kernel modules, "dis -rl" fails to display
modules code line number data after execute "bt" command in crash.

Without the patch:
  crsah> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp
  0xffffffffc0f60eb6 <lpfc_nlp_get+6>:    push   %rbx
  0xffffffffc0f60eb7 <lpfc_nlp_get+7>:    test   %rdi,%rdi

With the patch:
  crash> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp

The root cause is, after kernel module been loaded by mod command, the symtable
is not expanded in gdb side. crash bt or dis command will trigger such an
expansion. However the symtable expansion is different for the 2 commands:

The stack trace of "dis -rl" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  crash-utility#1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  crash-utility#2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  crash-utility#3  0x000000000077e8e9 in process_full_comp_unit ...
  crash-utility#4  process_queue ...
  crash-utility#5  dw2_do_instantiate_symtab ...
  crash-utility#6  0x000000000077ed67 in dw2_instantiate_symtab ...
  crash-utility#7  0x000000000077f75e in dw2_expand_all_symtabs ...
  crash-utility#8  0x00000000008f254d in gdb_get_line_number ...
  crash-utility#9  0x00000000008f22af in gdb_command_funnel_1 ...
  crash-utility#10 0x00000000008f2003 in gdb_command_funnel ...
  crash-utility#11 0x00000000005b7f02 in gdb_interface ...
  crash-utility#12 0x00000000005f5bd8 in get_line_number ...
  crash-utility#13 0x000000000059e574 in cmd_dis ...

The stack trace of "bt" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  crash-utility#1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  crash-utility#2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  crash-utility#3  0x000000000077e8e9 in process_full_comp_unit ...
  crash-utility#4  process_queue ...
  crash-utility#5  dw2_do_instantiate_symtab ...
  crash-utility#6  0x000000000077ed67 in dw2_instantiate_symtab ...
  crash-utility#7  0x000000000077f8ed in dw2_lookup_symbol ...
  crash-utility#8  0x00000000008e6d03 in lookup_symbol_via_quick_fns ...
  crash-utility#9  0x00000000008e7153 in lookup_symbol_in_objfile ...
  crash-utility#10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ...
  crash-utility#11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ...
  crash-utility#12 0x00000000008e754e in lookup_global_or_static_symbol ...
  crash-utility#13 0x00000000008e75da in lookup_static_symbol ...
  crash-utility#14 0x00000000008e632c in lookup_symbol_aux ...
  crash-utility#15 0x00000000008e5a7a in lookup_symbol_in_language ...
  crash-utility#16 0x00000000008e5b30 in lookup_symbol ...
  crash-utility#17 0x00000000008f2a4a in gdb_get_datatype ...
  crash-utility#18 0x00000000008f22c0 in gdb_command_funnel_1 ...
  crash-utility#19 0x00000000008f2003 in gdb_command_funnel ...
  crash-utility#20 0x00000000005b7f02 in gdb_interface ...
  crash-utility#21 0x00000000005f8a9f in datatype_info ...
  crash-utility#22 0x0000000000599947 in cpu_map_size ...
  crash-utility#23 0x00000000005a975d in get_cpus_online ...
  crash-utility#24 0x0000000000637a8b in diskdump_get_prstatus_percpu ...
  crash-utility#25 0x000000000062f0e4 in get_netdump_regs_x86_64 ...
  crash-utility#26 0x000000000059fe68 in back_trace ...
  crash-utility#27 0x00000000005ab1cb in cmd_bt ...

For the stacktrace of "dis -rl", it calls dw2_expand_all_symtabs() to expand
all symtable of the objfile, or "*.ko.debug" in our case. However for
the stacktrace of "bt", it doesn't expand all, but only a subset of symtable
which is enough to find a symbol by dw2_lookup_symbol(). As a result, the
objfile->compunit_symtabs, which is the head of a single linked list of
struct compunit_symtab, is not NULL but didn't contain all symtables. It
will not be reinitialized in gdb_get_line_number() by "dis -rl" because
!objfile_has_full_symbols(objfile) check will fail, so it cannot display
the proper code line number data.

Since objfile_has_full_symbols(objfile) check cannot ensure all symbols
been expanded, this patch add a new member as a flag for struct objfile
to record if all symbols have been expanded. The flag will be set only ofter
expand_all_symtabs been called.

Signed-off-by: Tao Liu <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this issue Dec 5, 2024
The patch introduces per-cpu overflow stacks for RISCV64 to let
"bt" do backtrace on it and the 'help -m' command dispalys the
addresss of each per-cpu overflow stack.

TEST: a lkdtm DIRECT EXHAUST_STACK vmcore

  crash> bt
  PID: 1        TASK: ff600000000d8000  CPU: 1    COMMAND: "sh"
   #0 [ff6000001fc501c0] riscv_crash_save_regs at ffffffff8000a1dc
   crash-utility#1 [ff6000001fc50320] panic at ffffffff808773ec
   crash-utility#2 [ff6000001fc50380] walk_stackframe at ffffffff800056da
       PC: ffffffff80876a34  [memset+96]
       RA: ffffffff80563dc0  [recursive_loop+68]
       SP: ff2000000000fd50  CAUSE: 000000000000000f
  epc : ffffffff80876a34 ra : ffffffff80563dc0 sp : ff2000000000fd50
   gp : ffffffff81515d38 tp : 0000000000000000 t0 : ff2000000000fd58
   t1 : ff600000000d88c8 t2 : 6143203a6d74646b s0 : ff20000000010190
   s1 : 0000000000000012 a0 : ff2000000000fd58 a1 : 1212121212121212
   a2 : 0000000000000400 a3 : ff20000000010158 a4 : 0000000000000000
   a5 : 725bedba92260900 a6 : 000000000130e0f0 a7 : 0000000000000000
   s2 : ff2000000000fd58 s3 : ffffffff815170d8 s4 : ff20000000013e60
   s5 : 000000000000000e s6 : ff20000000013e60 s7 : 0000000000000000
   s8 : ff60000000861000 s9 : 00007fffc3641694 s10: 00007fffc3641690
   s11: 00005555796ed240 t3 : 0000000000010297 t4 : ffffffff80c17810
   t5 : ffffffff8195e7b8 t6 : ff20000000013b18
   status: 0000000200000120 badaddr: ff2000000000fd58
    cause: 000000000000000f orig_a0: 0000000000000000
  --- <OVERFLOW stack> ---
   crash-utility#3 [ff2000000000fd50] memset at ffffffff80876a34
   crash-utility#4 [ff20000000010190] recursive_loop at ffffffff80563e16
   crash-utility#5 [ff200000000105d0] recursive_loop at ffffffff80563e16
   < recursive_loop ...>
   crash-utility#16 [ff20000000013490] recursive_loop at ffffffff80563e16
   crash-utility#17 [ff200000000138d0] recursive_loop at ffffffff80563e16
   crash-utility#18 [ff20000000013d10] lkdtm_EXHAUST_STACK at ffffffff8088005e
   crash-utility#19 [ff20000000013d30] lkdtm_do_action at ffffffff80563292
   crash-utility#20 [ff20000000013d40] direct_entry at ffffffff80563474
   crash-utility#21 [ff20000000013d70] full_proxy_write at ffffffff8032fb3a
   crash-utility#22 [ff20000000013db0] vfs_write at ffffffff801d6414
   crash-utility#23 [ff20000000013e60] ksys_write at ffffffff801d67b8
   crash-utility#24 [ff20000000013eb0] __riscv_sys_write at ffffffff801d6832
   crash-utility#25 [ff20000000013ec0] do_trap_ecall_u at ffffffff80884a20
  crash>

  crash> help -m
  <snip>
          irq_stack_size: 16384
           irq_stacks[0]: ff20000000000000
           irq_stacks[1]: ff20000000008000
          overflow_stack_size: 4096
           overflow_stacks[0]: ff6000001fa7a510
           overflow_stacks[1]: ff6000001fc4f510
  crash>

Signed-off-by: Song Shuai <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this issue Dec 5, 2024
…ame" warning

The "bogus exception frame" warning was observed again on a specific
vmcore, and the remaining frame was truncated on x86_64 machine, when
executing the "bt" command as below:

  crash> bt 0 -c 8
  PID: 0        TASK: ffff9948c08f5640  CPU: 8    COMMAND: "swapper/8"
   #0 [fffffe1788788e58] crash_nmi_callback at ffffffff972672bb
   crash-utility#1 [fffffe1788788e68] nmi_handle at ffffffff9722eb8e
   crash-utility#2 [fffffe1788788eb0] default_do_nmi at ffffffff97e51cd0
   crash-utility#3 [fffffe1788788ed0] exc_nmi at ffffffff97e51ee1
   crash-utility#4 [fffffe1788788ef0] end_repeat_nmi at ffffffff980015f9
      [exception RIP: __update_load_avg_se+13]
      RIP: ffffffff9736b16d  RSP: ffffbec3c08acc78  RFLAGS: 00000046
      RAX: 0000000000000000  RBX: ffff994c2f2b1a40  RCX: ffffbec3c08acdc0
      RDX: ffff9948e4fe1d80  RSI: ffff994c2f2b1a40  RDI: 0000001d7ad7d55d
      RBP: ffffbec3c08acc88   R8: 0000001d921fca6f   R9: ffff994c2f2b1328
      R10: 00000000fffd0010  R11: ffffffff98e060c0  R12: 0000001d7ad7d55d
      R13: 0000000000000005  R14: ffff994c2f2b19c0  R15: 0000000000000001
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   crash-utility#5 [ffffbec3c08acc78] __update_load_avg_se at ffffffff9736b16d
   crash-utility#6 [ffffbec3c08acce0] enqueue_entity at ffffffff9735c9ab
   crash-utility#7 [ffffbec3c08acd28] enqueue_task_fair at ffffffff9735cef8
  ...
  crash-utility#18 [ffffbec3c08acf90] blk_complete_reqs at ffffffff977978d0
  crash-utility#19 [ffffbec3c08acfa0] __do_softirq at ffffffff97e66f7a
  crash-utility#20 [ffffbec3c08acff0] do_softirq at ffffffff9730f6ef
  --- <IRQ stack> ---
  crash-utility#21 [ffffbec3c022ff18] do_idle at ffffffff97368288
      [exception RIP: unknown or invalid address]
      RIP: 0000000000000000  RSP: 0000000000000000  RFLAGS: 00000000
      RAX: 0000000000000000  RBX: 000000089726a2d0  RCX: 0000000000000000
      RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
      RBP: ffffffff9726a3dd   R8: 0000000000000000   R9: 0000000000000000
      R10: ffffffff9720015a  R11: e48885e126bc1600  R12: 0000000000000000
      R13: ffffffff973684a9  R14: 0000000000000094  R15: 0000000040000000
      ORIG_RAX: 0000000000000000  CS: 0000  SS: 0000
  bt: WARNING: possibly bogus exception frame
  crash>

Actually there is no exception frame, when called from do_softirq().
With the patch:

  crash> bt 0 -c 8
  ...
  crash-utility#18 [ffffbec3c08acf90] blk_complete_reqs at ffffffff977978d0
  crash-utility#19 [ffffbec3c08acfa0] __do_softirq at ffffffff97e66f7a
  crash-utility#20 [ffffbec3c08acff0] do_softirq at ffffffff9730f6ef
  --- <IRQ stack> ---
  crash-utility#21 [ffffbec3c022ff28] cpu_startup_entry at ffffffff973684a9
  crash-utility#22 [ffffbec3c022ff38] start_secondary at ffffffff9726a3dd
  crash-utility#23 [ffffbec3c022ff50] secondary_startup_64_no_verify at ffffffff9720015a
  crash>

Reported-by: Jie Li <[email protected]>
Signed-off-by: Lianbo Jiang <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this issue Dec 5, 2024
On x86_64, there are cases where crash cannot handle IRQ exception
frames properly.  For example, with RHEL9.3 kernel, "bt" command fails
with with "WARNING possibly bogus exception frame":

  crash> bt -c 30
  PID: 2898241  TASK: ff4cb0ce0da0c680  CPU: 30   COMMAND: "star-ccm+"
   #0 [fffffe4658d88e58] crash_nmi_callback at ffffffffa00675e8
   crash-utility#1 [fffffe4658d88e68] nmi_handle at ffffffffa002ebab
  ...
  --- <NMI exception stack> ---
  ...
  crash-utility#13 [ff5eba269937cf90] __do_softirq at ffffffffa0c6c007
  crash-utility#14 [ff5eba269937cfe0] __irq_exit_rcu at ffffffffa010ef61
  crash-utility#15 [ff5eba269937cff0] sysvec_apic_timer_interrupt at ffffffffa0c58ca2
  --- <IRQ stack> ---
      RIP: 0000000000000010  RSP: 0000000000000018  RFLAGS: ff5eba26ddc9f7e8
      RAX: 0000000000000a20  RBX: ff5eba26ddc9f940  RCX: 0000000000001000
      RDX: ffffffb559980000  RSI: ff4cb12d67207400  RDI: ffffffffffffffff
      RBP: 0000000000001000   R8: ff5eba26ddc9f940   R9: ff5eba26ddc9f8af
      R10: 0000000000000003  R11: 0000000000000a20  R12: ff5eba26ddc9f8b0
      R13: 000000283c07f000  R14: ff4cb0f5a29a1c00  R15: 0000000000000001
      ORIG_RAX: ffffffffa07c4e60  CS: 0206  SS: 7000001cf0380001
  bt: WARNING: possibly bogus exception frame

Running "crash" with "--machdep irq_eframe_link=0xffffffffffffffe8"
option (i.e. thus irq_eframe_link = -24) works properly:

  PID: 2898241  TASK: ff4cb0ce0da0c680  CPU: 30   COMMAND: "star-ccm+"
   #0 [fffffe4658d88e58] crash_nmi_callback at ffffffffa00675e8
   crash-utility#1 [fffffe4658d88e68] nmi_handle at ffffffffa002ebab
  ...
  --- <NMI exception stack> ---
  ...
  crash-utility#13 [ff5eba269937cf90] __do_softirq at ffffffffa0c6c007
  crash-utility#14 [ff5eba269937cfe0] __irq_exit_rcu at ffffffffa010ef61
  crash-utility#15 [ff5eba269937cff0] sysvec_apic_timer_interrupt at ffffffffa0c58ca2
  --- <IRQ stack> ---
  crash-utility#16 [ff5eba26ddc9f738] asm_sysvec_apic_timer_interrupt at ffffffffa0e00e06
      [exception RIP: alloc_pte.constprop.0+32]
      RIP: ffffffffa07c4e60  RSP: ff5eba26ddc9f7e8  RFLAGS: 00000206
      RAX: ff5eba26ddc9f940  RBX: 0000000000001000  RCX: 0000000000000a20
      RDX: 0000000000001000  RSI: ffffffb559980000  RDI: ff4cb12d67207400
      RBP: ff5eba26ddc9f8b0   R8: ff5eba26ddc9f8af   R9: 0000000000000003
      R10: 0000000000000a20  R11: ff5eba26ddc9f940  R12: 000000283c07f000
      R13: ff4cb0f5a29a1c00  R14: 0000000000000001  R15: ff4cb0f5a29a1bf8
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  crash-utility#17 [ff5eba26ddc9f830] iommu_v1_map_pages at ffffffffa07c5648
  crash-utility#18 [ff5eba26ddc9f8f8] __iommu_map at ffffffffa07d7803
  crash-utility#19 [ff5eba26ddc9f990] iommu_map_sg at ffffffffa07d7b71
  crash-utility#20 [ff5eba26ddc9f9f0] iommu_dma_map_sg at ffffffffa07ddcc9
  crash-utility#21 [ff5eba26ddc9fa90] __dma_map_sg_attrs at ffffffffa01b5205
  ...

Some background:

asm_common_interrupt:
      callq  error_entry
      movq   %rax,%rsp
      movq   %rsp,%rdi
      movq   0x78(%rsp),%rsi
      movq   $-0x1,0x78(%rsp)
      call common_interrupt    # rsp pointing to regs

common_interrupt:
      pushq  %r12
      pushq  %rbp
      pushq  %rbx
      [...]
      movq   hardirq_stack_ptr,%r11
      movq   %rsp,(%r11)
      movq   %r11,%rsp
      [...]
      call __common_interrupt  # rip:common_interrupt

So frame_size(rip:common_interrupt) = 32 (3 push + ret).

Hence "machdep->machspec->irq_eframe_link = -32;" (see x86_64_irq_eframe_link_init()).

Now:

asm_sysvec_apic_timer_interrupt:
      pushq  $-0x1
      callq  error_entry
      movq   %rax,%rsp
      movq   %rsp,%rdi
      callq  sysvec_apic_timer_interrupt

sysvec_apic_timer_interrupt:
      pushq  %r12
      pushq  %rbp
      [...]
      movq   hardirq_stack_ptr,%r11
      movq   %rsp,(%r11)
      movq   %r11,%rsp
      [...]
      call __sysvec_apic_timer_interrupt  # rip:sysvec_apic_timer_interrupt

Here frame_size(rip:sysvec_apic_timer_interrupt) = 24 (2 push + ret)

We should also notice that:

rip  = *(hardirq_stack_ptr - 8)
rsp  = *(hardirq_stack_ptr)
regs = rsp - frame_size(rip)

But x86_64_get_framesize() does not work with IRQ handlers (returns 0).
So not many options other than hardcoding the most likely value and
looking around it.  Actually x86_64_irq_eframe_link() was trying -32
(default), and then -40, but not -24.

Signed-off-by: Georges Aureau <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant