memory corruption #20

leilchen · 2017-08-30T09:36:37Z

I'm building and running crash-7.1.9. But the tool itself crashes because of memory overflow. The core dump point is at free time, which makes it a little tricky to debug. The error message looks like this:

crash 7.1.9
Copyright (C) 2002-2016 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

*** Error in `/tmp/crash': free(): invalid pointer: 0x0000000001022270 ***
Aborted (core dumped)

The following patch can fix the problem:

`--- /tmp/crash/crash-7.1.9/filesys.c 2017-04-21 02:54:33.000000000 +0800
+++ crash-7.1.9/filesys.c 2017-08-31 06:18:32.000040000 +0800
@@ -308,6 +308,7 @@
#define CREATE 1
#define DESTROY 0
#define DEFAULT_SEARCHDIRS 5
+#define EXTRA_SEARCHDIRS 5

static char **
build_searchdirs(int create, int *preferred)
@@ -340,11 +341,11 @@
*preferred = 0;

/*

   *  Allow, at a minimum, the defaults plus an extra three directories

   *  Allow, at a minimum, the defaults plus EXTRA_SEARCHDIRS directories
   *  for the two possible /usr/src/redhat/BUILD/kernel-xxx locations
   *  plus the Red Hat debug directory.
   */

```
  cnt = DEFAULT_SEARCHDIRS + 3;
```

  cnt = DEFAULT_SEARCHDIRS + EXTRA_SEARCHDIRS;

   if ((dirp = opendir("/usr/src"))) {
           for (dp = readdir(dirp); dp != NULL; dp = readdir(dirp))

`

The text was updated successfully, but these errors were encountered:

crash-utility · 2017-08-30T13:26:19Z

----- Original Message -----

I'm building and running crash-7.1.9. But the tool itself crashes because of memory overflow. The core dump point is at free time, which makes it a little tricky to debug. The error message looks like this:

You should be able to run gdb on the core file and get a backtrace. But anyway, with your patch applied, can you show me what "crash -d1" shows? Thanks, Dave

…

crash 7.1.9 Copyright (C) 2002-2016 Red Hat, Inc. Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation Copyright (C) 1999-2006 Hewlett-Packard Co Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited Copyright (C) 2006, 2007 VA Linux Systems Japan K.K. Copyright (C) 2005, 2011 NEC Corporation Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc. Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc. This program is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Enter "help copying" to see the conditions. This program has absolutely no warranty. Enter "help warranty" for details. *** Error in `/tmp/crash': free(): invalid pointer: 0x0000000001022270 *** Aborted (core dumped) The following patch can fix the problem: `--- /tmp/crash/crash-7.1.9/filesys.c 2017-04-21 02:54:33.000000000 +0800 +++ crash-7.1.9/filesys.c 2017-08-31 06:18:32.000040000 +0800 @@ -308,6 +308,7 @@ #define CREATE 1 #define DESTROY 0 #define DEFAULT_SEARCHDIRS 5 +#define EXTRA_SEARCHDIRS 5 static char ** build_searchdirs(int create, int *preferred) @@ -340,11 +341,11 @@ *preferred = 0; /* - * Allow, at a minimum, the defaults plus an extra three directories + * Allow, at a minimum, the defaults plus EXTRA_SEARCHDIRS directories * for the two possible /usr/src/redhat/BUILD/kernel-xxx locations * plus the Red Hat debug directory. */ - cnt = DEFAULT_SEARCHDIRS + 3; + cnt = DEFAULT_SEARCHDIRS + EXTRA_SEARCHDIRS; if ((dirp = opendir("/usr/src"))) { for (dp = readdir(dirp); dp != NULL; dp = readdir(dirp)) ` -- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: #20

crash-utility · 2017-08-30T19:57:18Z

----- Original Message -----

----- Original Message ----- > I'm building and running crash-7.1.9. But the tool itself crashes because of > memory overflow. The core dump point is at free time, which makes it a > little tricky to debug. The error message looks like this: You should be able to run gdb on the core file and get a backtrace. But anyway, with your patch applied, can you show me what "crash -d1" shows? Thanks, Dave

Also, can you show me the output of "ls /usr/src" on your host machine where the crash occurs? I'm trying to figure out how the malloc'd searchdirs[] memory gets corrupted, and I'm thinking that just bumping the extras from 3 to 5 is papering over the problem. Thanks, Dave

…

> > crash 7.1.9 > Copyright (C) 2002-2016 Red Hat, Inc. > Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation > Copyright (C) 1999-2006 Hewlett-Packard Co > Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited > Copyright (C) 2006, 2007 VA Linux Systems Japan K.K. > Copyright (C) 2005, 2011 NEC Corporation > Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc. > Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc. > This program is free software, covered by the GNU General Public License, > and you are welcome to change it and/or distribute copies of it under > certain conditions. Enter "help copying" to see the conditions. > This program has absolutely no warranty. Enter "help warranty" for > details. > > *** Error in `/tmp/crash': free(): invalid pointer: 0x0000000001022270 *** > Aborted (core dumped) > > The following patch can fix the problem: > > `--- /tmp/crash/crash-7.1.9/filesys.c 2017-04-21 02:54:33.000000000 > +0800 > +++ crash-7.1.9/filesys.c 2017-08-31 06:18:32.000040000 +0800 > @@ -308,6 +308,7 @@ > #define CREATE 1 > #define DESTROY 0 > #define DEFAULT_SEARCHDIRS 5 > +#define EXTRA_SEARCHDIRS 5 > > static char ** > build_searchdirs(int create, int *preferred) > @@ -340,11 +341,11 @@ > *preferred = 0; > > /* > - * Allow, at a minimum, the defaults plus an extra three > directories > + * Allow, at a minimum, the defaults plus EXTRA_SEARCHDIRS > directories > * for the two possible /usr/src/redhat/BUILD/kernel-xxx locations > * plus the Red Hat debug directory. > */ > - cnt = DEFAULT_SEARCHDIRS + 3; > + cnt = DEFAULT_SEARCHDIRS + EXTRA_SEARCHDIRS; > > if ((dirp = opendir("/usr/src"))) { > for (dp = readdir(dirp); dp != NULL; dp = readdir(dirp)) > ` > > -- > You are receiving this because you are subscribed to this thread. > Reply to this email directly or view it on GitHub: > #20

crash-utility · 2017-08-30T20:19:21Z

----- Original Message -----

----- Original Message ----- > > > ----- Original Message ----- > > I'm building and running crash-7.1.9. But the tool itself crashes because > > of > > memory overflow. The core dump point is at free time, which makes it a > > little tricky to debug. The error message looks like this: > > You should be able to run gdb on the core file and get a backtrace. > But anyway, with your patch applied, can you show me what "crash -d1" > shows? > > Thanks, > Dave Also, can you show me the output of "ls /usr/src" on your host machine where the crash occurs? I'm trying to figure out how the malloc'd searchdirs[] memory gets corrupted, and I'm thinking that just bumping the extras from 3 to 5 is papering over the problem. Thanks, Dave

And I'm also wondering whether somehow the 1500 bytes in dirbuf[BUFSIZE] in build_searchdirs() is getting overrun and spilling into the searchdirs allocated memory? Instead of bumping the count from 3 to 5, what happens if you patch the dirbuf[] size, something like this: - char dirbuf[BUFSIZE]; + char dirbuf[BUFSIZE*2]; I just don't understand how you came to the conclusion to add the 3-to-5 change? That code hasn't changed in years, and I can't figure out what's different about your host machine setup. Dave

…

> > > > > > crash 7.1.9 > > Copyright (C) 2002-2016 Red Hat, Inc. > > Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation > > Copyright (C) 1999-2006 Hewlett-Packard Co > > Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited > > Copyright (C) 2006, 2007 VA Linux Systems Japan K.K. > > Copyright (C) 2005, 2011 NEC Corporation > > Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc. > > Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc. > > This program is free software, covered by the GNU General Public License, > > and you are welcome to change it and/or distribute copies of it under > > certain conditions. Enter "help copying" to see the conditions. > > This program has absolutely no warranty. Enter "help warranty" for > > details. > > > > *** Error in `/tmp/crash': free(): invalid pointer: 0x0000000001022270 > > *** > > Aborted (core dumped) > > > > The following patch can fix the problem: > > > > `--- /tmp/crash/crash-7.1.9/filesys.c 2017-04-21 02:54:33.000000000 > > +0800 > > +++ crash-7.1.9/filesys.c 2017-08-31 06:18:32.000040000 +0800 > > @@ -308,6 +308,7 @@ > > #define CREATE 1 > > #define DESTROY 0 > > #define DEFAULT_SEARCHDIRS 5 > > +#define EXTRA_SEARCHDIRS 5 > > > > static char ** > > build_searchdirs(int create, int *preferred) > > @@ -340,11 +341,11 @@ > > *preferred = 0; > > > > /* > > - * Allow, at a minimum, the defaults plus an extra three > > directories > > + * Allow, at a minimum, the defaults plus EXTRA_SEARCHDIRS > > directories > > * for the two possible /usr/src/redhat/BUILD/kernel-xxx > > locations > > * plus the Red Hat debug directory. > > */ > > - cnt = DEFAULT_SEARCHDIRS + 3; > > + cnt = DEFAULT_SEARCHDIRS + EXTRA_SEARCHDIRS; > > > > if ((dirp = opendir("/usr/src"))) { > > for (dp = readdir(dirp); dp != NULL; dp = readdir(dirp)) > > ` > > > > -- > > You are receiving this because you are subscribed to this thread. > > Reply to this email directly or view it on GitHub: > > #20 >

leilchen · 2017-08-31T00:44:03Z

Hi Dave,
I'm running a customized distro which doesn't have /usr/src. The rational behind EXTRA=5 is:

line 401, the cnt is set to DEFAULT_SEARCHDIRS
from line 404, kernel/redhat kernel v1/redhat kernel v2/redhat debug directories are added, i.e. 4 new items get added.
Besides, line 454 actually accesses DEFAULT_SEARCHDIRS+5 item.

But why there's no bug report on other distro? I have no idea.

The backtrace is as below:
(gdb) bt
#0 0x00007ffff6cd5da5 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff6cd7558 in __GI_abort () at abort.c:90
#2 0x00007ffff6d16d7b in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7ffff6e21de8 "*** Error in `%s': %s: 0x%s ***\n")
at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3 0x00007ffff6d1e192 in malloc_printerr (ptr=0xfca270, str=0x7ffff6e1f4de "free(): invalid pointer", action=3) at malloc.c:4964
#4 _int_free (av=0x7ffff7044720 <main_arena>, p=0xfca260, have_lock=0) at malloc.c:3796
#5 0x00000000004aa159 in ?? ()
#6 0x00000000004aa97a in fd_init ()
#7 0x0000000000464350 in main ()

Thanks,
Lei Chen

crash-utility · 2017-08-31T19:37:24Z

Ok thanks -- committed!

BTW, in the future, please post patches on the crash utility mailing list:
[email protected]
It's a moderated list, but if you choose not to enroll, I will see any legitimate patch postings among the spam, and I will accept it.

crash-utility · 2017-08-31T19:38:19Z

If you want to join the list, it's here:
https://www.redhat.com/mailman/listinfo/crash-utility

…usly There is an issue that, for kernel modules, "dis -rl" fails to display modules code line number data after execute "bt" command in crash. Without the patch: crsah> mod -S crash> bt PID: 1500 TASK: ff2bd8b093524000 CPU: 16 COMMAND: "lpfc_worker_0" #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3 ...snip... #8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc] ...snip... crash> dis -rl ffffffffc0f60f82 0xffffffffc0f60eb0 <lpfc_nlp_get>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffffc0f60eb5 <lpfc_nlp_get+5>: push %rbp 0xffffffffc0f60eb6 <lpfc_nlp_get+6>: push %rbx 0xffffffffc0f60eb7 <lpfc_nlp_get+7>: test %rdi,%rdi With the patch: crash> mod -S crash> bt PID: 1500 TASK: ff2bd8b093524000 CPU: 16 COMMAND: "lpfc_worker_0" #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3 ...snip... #8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc] ...snip... crash> dis -rl ffffffffc0f60f82 /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756 0xffffffffc0f60eb0 <lpfc_nlp_get>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759 0xffffffffc0f60eb5 <lpfc_nlp_get+5>: push %rbp The root cause is, after kernel module been loaded by mod command, the symtable is not expanded in gdb side. crash bt or dis command will trigger such an expansion. However the symtable expansion is different for the 2 commands: The stack trace of "dis -rl" for symtable expanding: #0 0x00000000008d8d9f in add_compunit_symtab_to_objfile ... #1 0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ... #2 0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ... #3 0x000000000077e8e9 in process_full_comp_unit ... #4 process_queue ... #5 dw2_do_instantiate_symtab ... #6 0x000000000077ed67 in dw2_instantiate_symtab ... #7 0x000000000077f75e in dw2_expand_all_symtabs ... #8 0x00000000008f254d in gdb_get_line_number ... #9 0x00000000008f22af in gdb_command_funnel_1 ... #10 0x00000000008f2003 in gdb_command_funnel ... #11 0x00000000005b7f02 in gdb_interface ... #12 0x00000000005f5bd8 in get_line_number ... #13 0x000000000059e574 in cmd_dis ... The stack trace of "bt" for symtable expanding: #0 0x00000000008d8d9f in add_compunit_symtab_to_objfile ... #1 0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ... #2 0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ... #3 0x000000000077e8e9 in process_full_comp_unit ... #4 process_queue ... #5 dw2_do_instantiate_symtab ... #6 0x000000000077ed67 in dw2_instantiate_symtab ... #7 0x000000000077f8ed in dw2_lookup_symbol ... #8 0x00000000008e6d03 in lookup_symbol_via_quick_fns ... #9 0x00000000008e7153 in lookup_symbol_in_objfile ... #10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ... #11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ... #12 0x00000000008e754e in lookup_global_or_static_symbol ... #13 0x00000000008e75da in lookup_static_symbol ... #14 0x00000000008e632c in lookup_symbol_aux ... #15 0x00000000008e5a7a in lookup_symbol_in_language ... #16 0x00000000008e5b30 in lookup_symbol ... #17 0x00000000008f2a4a in gdb_get_datatype ... #18 0x00000000008f22c0 in gdb_command_funnel_1 ... #19 0x00000000008f2003 in gdb_command_funnel ... #20 0x00000000005b7f02 in gdb_interface ... #21 0x00000000005f8a9f in datatype_info ... #22 0x0000000000599947 in cpu_map_size ... #23 0x00000000005a975d in get_cpus_online ... #24 0x0000000000637a8b in diskdump_get_prstatus_percpu ... #25 0x000000000062f0e4 in get_netdump_regs_x86_64 ... #26 0x000000000059fe68 in back_trace ... #27 0x00000000005ab1cb in cmd_bt ... For the stacktrace of "dis -rl", it calls dw2_expand_all_symtabs() to expand all symtable of the objfile, or "*.ko.debug" in our case. However for the stacktrace of "bt", it doesn't expand all, but only a subset of symtable which is enough to find a symbol by dw2_lookup_symbol(). As a result, the objfile->compunit_symtabs, which is the head of a single linked list of struct compunit_symtab, is not NULL but didn't contain all symtables. It will not be reinitialized in gdb_get_line_number() by "dis -rl" because !objfile_has_full_symbols(objfile) check will fail, so it cannot display the proper code line number data. Since objfile_has_full_symbols(objfile) check cannot ensure all symbols been expanded, this patch add a new member as a flag for struct objfile to record if all symbols have been expanded. The flag will be set only ofter expand_all_symtabs been called. Signed-off-by: Tao Liu <[email protected]>

…usly There is an issue that, for kernel modules, "dis -rl" fails to display modules code line number data after execute "bt" command in crash. Without the patch: crsah> mod -S crash> bt PID: 1500 TASK: ff2bd8b093524000 CPU: 16 COMMAND: "lpfc_worker_0" #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3 ...snip... crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc] ...snip... crash> dis -rl ffffffffc0f60f82 0xffffffffc0f60eb0 <lpfc_nlp_get>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffffc0f60eb5 <lpfc_nlp_get+5>: push %rbp 0xffffffffc0f60eb6 <lpfc_nlp_get+6>: push %rbx 0xffffffffc0f60eb7 <lpfc_nlp_get+7>: test %rdi,%rdi With the patch: crash> mod -S crash> bt PID: 1500 TASK: ff2bd8b093524000 CPU: 16 COMMAND: "lpfc_worker_0" #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3 ...snip... crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc] ...snip... crash> dis -rl ffffffffc0f60f82 /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756 0xffffffffc0f60eb0 <lpfc_nlp_get>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759 0xffffffffc0f60eb5 <lpfc_nlp_get+5>: push %rbp The root cause is, after kernel module been loaded by mod command, the symtable is not expanded in gdb side. crash bt or dis command will trigger such an expansion. However the symtable expansion is different for the 2 commands: The stack trace of "dis -rl" for symtable expanding: #0 0x00000000008d8d9f in add_compunit_symtab_to_objfile ... crash-utility#1 0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ... crash-utility#2 0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ... crash-utility#3 0x000000000077e8e9 in process_full_comp_unit ... crash-utility#4 process_queue ... crash-utility#5 dw2_do_instantiate_symtab ... crash-utility#6 0x000000000077ed67 in dw2_instantiate_symtab ... crash-utility#7 0x000000000077f75e in dw2_expand_all_symtabs ... crash-utility#8 0x00000000008f254d in gdb_get_line_number ... crash-utility#9 0x00000000008f22af in gdb_command_funnel_1 ... crash-utility#10 0x00000000008f2003 in gdb_command_funnel ... crash-utility#11 0x00000000005b7f02 in gdb_interface ... crash-utility#12 0x00000000005f5bd8 in get_line_number ... crash-utility#13 0x000000000059e574 in cmd_dis ... The stack trace of "bt" for symtable expanding: #0 0x00000000008d8d9f in add_compunit_symtab_to_objfile ... crash-utility#1 0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ... crash-utility#2 0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ... crash-utility#3 0x000000000077e8e9 in process_full_comp_unit ... crash-utility#4 process_queue ... crash-utility#5 dw2_do_instantiate_symtab ... crash-utility#6 0x000000000077ed67 in dw2_instantiate_symtab ... crash-utility#7 0x000000000077f8ed in dw2_lookup_symbol ... crash-utility#8 0x00000000008e6d03 in lookup_symbol_via_quick_fns ... crash-utility#9 0x00000000008e7153 in lookup_symbol_in_objfile ... crash-utility#10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ... crash-utility#11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ... crash-utility#12 0x00000000008e754e in lookup_global_or_static_symbol ... crash-utility#13 0x00000000008e75da in lookup_static_symbol ... crash-utility#14 0x00000000008e632c in lookup_symbol_aux ... crash-utility#15 0x00000000008e5a7a in lookup_symbol_in_language ... crash-utility#16 0x00000000008e5b30 in lookup_symbol ... crash-utility#17 0x00000000008f2a4a in gdb_get_datatype ... crash-utility#18 0x00000000008f22c0 in gdb_command_funnel_1 ... crash-utility#19 0x00000000008f2003 in gdb_command_funnel ... crash-utility#20 0x00000000005b7f02 in gdb_interface ... crash-utility#21 0x00000000005f8a9f in datatype_info ... crash-utility#22 0x0000000000599947 in cpu_map_size ... crash-utility#23 0x00000000005a975d in get_cpus_online ... crash-utility#24 0x0000000000637a8b in diskdump_get_prstatus_percpu ... crash-utility#25 0x000000000062f0e4 in get_netdump_regs_x86_64 ... crash-utility#26 0x000000000059fe68 in back_trace ... crash-utility#27 0x00000000005ab1cb in cmd_bt ... For the stacktrace of "dis -rl", it calls dw2_expand_all_symtabs() to expand all symtable of the objfile, or "*.ko.debug" in our case. However for the stacktrace of "bt", it doesn't expand all, but only a subset of symtable which is enough to find a symbol by dw2_lookup_symbol(). As a result, the objfile->compunit_symtabs, which is the head of a single linked list of struct compunit_symtab, is not NULL but didn't contain all symtables. It will not be reinitialized in gdb_get_line_number() by "dis -rl" because !objfile_has_full_symbols(objfile) check will fail, so it cannot display the proper code line number data. Since objfile_has_full_symbols(objfile) check cannot ensure all symbols been expanded, this patch add a new member as a flag for struct objfile to record if all symbols have been expanded. The flag will be set only ofter expand_all_symtabs been called. Signed-off-by: Tao Liu <[email protected]>

…eviously There is an issue that, for kernel modules, "dis -rl" fails to display modules code line number data after execute "bt" command in crash. Without the patch: crsah> mod -S crash> bt PID: 1500 TASK: ff2bd8b093524000 CPU: 16 COMMAND: "lpfc_worker_0" #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3 ...snip... crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc] ...snip... crash> dis -rl ffffffffc0f60f82 0xffffffffc0f60eb0 <lpfc_nlp_get>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffffc0f60eb5 <lpfc_nlp_get+5>: push %rbp 0xffffffffc0f60eb6 <lpfc_nlp_get+6>: push %rbx 0xffffffffc0f60eb7 <lpfc_nlp_get+7>: test %rdi,%rdi With the patch: crash> mod -S crash> bt PID: 1500 TASK: ff2bd8b093524000 CPU: 16 COMMAND: "lpfc_worker_0" #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3 ...snip... crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc] ...snip... crash> dis -rl ffffffffc0f60f82 /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756 0xffffffffc0f60eb0 <lpfc_nlp_get>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759 0xffffffffc0f60eb5 <lpfc_nlp_get+5>: push %rbp The root cause is, after kernel module been loaded by mod command, the symtable is not expanded in gdb side. crash bt or dis command will trigger such an expansion. However the symtable expansion is different for the 2 commands: The stack trace of "dis -rl" for symtable expanding: #0 0x00000000008d8d9f in add_compunit_symtab_to_objfile ... crash-utility#1 0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ... crash-utility#2 0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ... crash-utility#3 0x000000000077e8e9 in process_full_comp_unit ... crash-utility#4 process_queue ... crash-utility#5 dw2_do_instantiate_symtab ... crash-utility#6 0x000000000077ed67 in dw2_instantiate_symtab ... crash-utility#7 0x000000000077f75e in dw2_expand_all_symtabs ... crash-utility#8 0x00000000008f254d in gdb_get_line_number ... crash-utility#9 0x00000000008f22af in gdb_command_funnel_1 ... crash-utility#10 0x00000000008f2003 in gdb_command_funnel ... crash-utility#11 0x00000000005b7f02 in gdb_interface ... crash-utility#12 0x00000000005f5bd8 in get_line_number ... crash-utility#13 0x000000000059e574 in cmd_dis ... The stack trace of "bt" for symtable expanding: #0 0x00000000008d8d9f in add_compunit_symtab_to_objfile ... crash-utility#1 0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ... crash-utility#2 0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ... crash-utility#3 0x000000000077e8e9 in process_full_comp_unit ... crash-utility#4 process_queue ... crash-utility#5 dw2_do_instantiate_symtab ... crash-utility#6 0x000000000077ed67 in dw2_instantiate_symtab ... crash-utility#7 0x000000000077f8ed in dw2_lookup_symbol ... crash-utility#8 0x00000000008e6d03 in lookup_symbol_via_quick_fns ... crash-utility#9 0x00000000008e7153 in lookup_symbol_in_objfile ... crash-utility#10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ... crash-utility#11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ... crash-utility#12 0x00000000008e754e in lookup_global_or_static_symbol ... crash-utility#13 0x00000000008e75da in lookup_static_symbol ... crash-utility#14 0x00000000008e632c in lookup_symbol_aux ... crash-utility#15 0x00000000008e5a7a in lookup_symbol_in_language ... crash-utility#16 0x00000000008e5b30 in lookup_symbol ... crash-utility#17 0x00000000008f2a4a in gdb_get_datatype ... crash-utility#18 0x00000000008f22c0 in gdb_command_funnel_1 ... crash-utility#19 0x00000000008f2003 in gdb_command_funnel ... crash-utility#20 0x00000000005b7f02 in gdb_interface ... crash-utility#21 0x00000000005f8a9f in datatype_info ... crash-utility#22 0x0000000000599947 in cpu_map_size ... crash-utility#23 0x00000000005a975d in get_cpus_online ... crash-utility#24 0x0000000000637a8b in diskdump_get_prstatus_percpu ... crash-utility#25 0x000000000062f0e4 in get_netdump_regs_x86_64 ... crash-utility#26 0x000000000059fe68 in back_trace ... crash-utility#27 0x00000000005ab1cb in cmd_bt ... For the stacktrace of "dis -rl", it calls dw2_expand_all_symtabs() to expand all symtable of the objfile, or "*.ko.debug" in our case. However for the stacktrace of "bt", it doesn't expand all, but only a subset of symtable which is enough to find a symbol by dw2_lookup_symbol(). As a result, the objfile->compunit_symtabs, which is the head of a single linked list of struct compunit_symtab, is not NULL but didn't contain all symtables. It will not be reinitialized in gdb_get_line_number() by "dis -rl" because !objfile_has_full_symbols(objfile) check will fail, so it cannot display the proper code line number data. Since objfile_has_full_symbols(objfile) check cannot ensure all symbols been expanded, this patch add a new member as a flag for struct objfile to record if all symbols have been expanded. The flag will be set only ofter expand_all_symtabs been called. Signed-off-by: Tao Liu <[email protected]>

The patch introduces per-cpu overflow stacks for RISCV64 to let "bt" do backtrace on it and the 'help -m' command dispalys the addresss of each per-cpu overflow stack. TEST: a lkdtm DIRECT EXHAUST_STACK vmcore ``` crash> bt PID: 1 TASK: ff600000000d8000 CPU: 1 COMMAND: "sh" #0 [ff6000001fc501c0] riscv_crash_save_regs at ffffffff8000a1dc crash-utility#1 [ff6000001fc50320] panic at ffffffff808773ec crash-utility#2 [ff6000001fc50380] walk_stackframe at ffffffff800056da PC: ffffffff80876a34 [memset+96] RA: ffffffff80563dc0 [recursive_loop+68] SP: ff2000000000fd50 CAUSE: 000000000000000f epc : ffffffff80876a34 ra : ffffffff80563dc0 sp : ff2000000000fd50 gp : ffffffff81515d38 tp : 0000000000000000 t0 : ff2000000000fd58 t1 : ff600000000d88c8 t2 : 6143203a6d74646b s0 : ff20000000010190 s1 : 0000000000000012 a0 : ff2000000000fd58 a1 : 1212121212121212 a2 : 0000000000000400 a3 : ff20000000010158 a4 : 0000000000000000 a5 : 725bedba92260900 a6 : 000000000130e0f0 a7 : 0000000000000000 s2 : ff2000000000fd58 s3 : ffffffff815170d8 s4 : ff20000000013e60 s5 : 000000000000000e s6 : ff20000000013e60 s7 : 0000000000000000 s8 : ff60000000861000 s9 : 00007fffc3641694 s10: 00007fffc3641690 s11: 00005555796ed240 t3 : 0000000000010297 t4 : ffffffff80c17810 t5 : ffffffff8195e7b8 t6 : ff20000000013b18 status: 0000000200000120 badaddr: ff2000000000fd58 cause: 000000000000000f orig_a0: 0000000000000000 --- <OVERFLOW stack> --- crash-utility#3 [ff2000000000fd50] memset at ffffffff80876a34 crash-utility#4 [ff20000000010190] recursive_loop at ffffffff80563e16 crash-utility#5 [ff200000000105d0] recursive_loop at ffffffff80563e16 < recursive_loop ...> crash-utility#16 [ff20000000013490] recursive_loop at ffffffff80563e16 crash-utility#17 [ff200000000138d0] recursive_loop at ffffffff80563e16 crash-utility#18 [ff20000000013d10] lkdtm_EXHAUST_STACK at ffffffff8088005e crash-utility#19 [ff20000000013d30] lkdtm_do_action at ffffffff80563292 crash-utility#20 [ff20000000013d40] direct_entry at ffffffff80563474 crash-utility#21 [ff20000000013d70] full_proxy_write at ffffffff8032fb3a crash-utility#22 [ff20000000013db0] vfs_write at ffffffff801d6414 crash-utility#23 [ff20000000013e60] ksys_write at ffffffff801d67b8 crash-utility#24 [ff20000000013eb0] __riscv_sys_write at ffffffff801d6832 crash-utility#25 [ff20000000013ec0] do_trap_ecall_u at ffffffff80884a20 crash> crash> help -m <snip> irq_stack_size: 16384 irq_stacks[0]: ff20000000000000 irq_stacks[1]: ff20000000008000 overflow_stack_size: 4096 overflow_stacks[0]: ff6000001fa7a510 overflow_stacks[1]: ff6000001fc4f510 crash> ``` Signed-off-by: Song Shuai <[email protected]>

The patch introduces per-cpu overflow stacks for RISCV64 to let "bt" do backtrace on it and the 'help -m' command dispalys the addresss of each per-cpu overflow stack. TEST: a lkdtm DIRECT EXHAUST_STACK vmcore crash> bt PID: 1 TASK: ff600000000d8000 CPU: 1 COMMAND: "sh" #0 [ff6000001fc501c0] riscv_crash_save_regs at ffffffff8000a1dc #1 [ff6000001fc50320] panic at ffffffff808773ec #2 [ff6000001fc50380] walk_stackframe at ffffffff800056da PC: ffffffff80876a34 [memset+96] RA: ffffffff80563dc0 [recursive_loop+68] SP: ff2000000000fd50 CAUSE: 000000000000000f epc : ffffffff80876a34 ra : ffffffff80563dc0 sp : ff2000000000fd50 gp : ffffffff81515d38 tp : 0000000000000000 t0 : ff2000000000fd58 t1 : ff600000000d88c8 t2 : 6143203a6d74646b s0 : ff20000000010190 s1 : 0000000000000012 a0 : ff2000000000fd58 a1 : 1212121212121212 a2 : 0000000000000400 a3 : ff20000000010158 a4 : 0000000000000000 a5 : 725bedba92260900 a6 : 000000000130e0f0 a7 : 0000000000000000 s2 : ff2000000000fd58 s3 : ffffffff815170d8 s4 : ff20000000013e60 s5 : 000000000000000e s6 : ff20000000013e60 s7 : 0000000000000000 s8 : ff60000000861000 s9 : 00007fffc3641694 s10: 00007fffc3641690 s11: 00005555796ed240 t3 : 0000000000010297 t4 : ffffffff80c17810 t5 : ffffffff8195e7b8 t6 : ff20000000013b18 status: 0000000200000120 badaddr: ff2000000000fd58 cause: 000000000000000f orig_a0: 0000000000000000 --- <OVERFLOW stack> --- #3 [ff2000000000fd50] memset at ffffffff80876a34 #4 [ff20000000010190] recursive_loop at ffffffff80563e16 #5 [ff200000000105d0] recursive_loop at ffffffff80563e16 < recursive_loop ...> #16 [ff20000000013490] recursive_loop at ffffffff80563e16 #17 [ff200000000138d0] recursive_loop at ffffffff80563e16 #18 [ff20000000013d10] lkdtm_EXHAUST_STACK at ffffffff8088005e #19 [ff20000000013d30] lkdtm_do_action at ffffffff80563292 #20 [ff20000000013d40] direct_entry at ffffffff80563474 #21 [ff20000000013d70] full_proxy_write at ffffffff8032fb3a #22 [ff20000000013db0] vfs_write at ffffffff801d6414 #23 [ff20000000013e60] ksys_write at ffffffff801d67b8 #24 [ff20000000013eb0] __riscv_sys_write at ffffffff801d6832 #25 [ff20000000013ec0] do_trap_ecall_u at ffffffff80884a20 crash> crash> help -m <snip> irq_stack_size: 16384 irq_stacks[0]: ff20000000000000 irq_stacks[1]: ff20000000008000 overflow_stack_size: 4096 overflow_stacks[0]: ff6000001fa7a510 overflow_stacks[1]: ff6000001fc4f510 crash> Signed-off-by: Song Shuai <[email protected]>

…ss range Previously, to find a module symbol and its offset by an arbitrary address, all symbols within the module will be iterated by address ascending order until the last symbol with a smaller address been noticed. However if the address is not within the module address range, e.g. the address is higher than the module's last symbol's address, then the module can be surely skipped, because its symbol iteration is unnecessary. This can speed up the kernel module symbols finding and improve the overall performance. Without the patch: $ time echo "bt 8993" | ~/crash-dev/crash vmcore vmlinux crash> bt 8993 PID: 8993 TASK: ffff927569cc2100 CPU: 2 COMMAND: "WriterPool0" #0 [ffff927569cd76f0] __schedule at ffffffffb3db78d8 #1 [ffff927569cd7758] schedule_preempt_disabled at ffffffffb3db8bf9 #2 [ffff927569cd7768] __mutex_lock_slowpath at ffffffffb3db6ca7 #3 [ffff927569cd77c0] mutex_lock at ffffffffb3db602f #4 [ffff927569cd77d8] ucache_retrieve at ffffffffc0cf4409 [secfs2] ...snip the stacktrace of the same module... #11 [ffff927569cd7ba0] cskal_path_vfs_getattr_nosec at ffffffffc05cae76 [falcon_kal] ...snip... #13 [ffff927569cd7c40] _ZdlPv at ffffffffc086e751 [falcon_lsm_serviceable] ...snip... #20 [ffff927569cd7ef8] unload_network_ops_symbols at ffffffffc06f11c0 [falcon_lsm_pinned_14713] #21 [ffff927569cd7f50] system_call_fastpath at ffffffffb3dc539a RIP: 00007f2b28ed4023 RSP: 00007f2a45fe7f80 RFLAGS: 00000206 RAX: 0000000000000012 RBX: 00007f2a68302e00 RCX: 00007f2a682546d8 RDX: 0000000000000826 RSI: 00007eb57ea6a000 RDI: 00000000000000e3 RBP: 00007eb57ea6a000 R8: 0000000000000826 R9: 00000002670bdfd2 R10: 00000002670bdfd2 R11: 0000000000000293 R12: 00000002670bdfd2 R13: 00007f29d501a480 R14: 0000000000000826 R15: 00000002670bdfd2 ORIG_RAX: 0000000000000012 CS: 0033 SS: 002b crash> real 7m14.826s user 7m12.502s sys 0m1.091s With the patch: $ time echo "bt 8993" | ~/crash-dev/crash vmcore vmlinux crash> bt 8993 PID: 8993 TASK: ffff927569cc2100 CPU: 2 COMMAND: "WriterPool0" #0 [ffff927569cd76f0] __schedule at ffffffffb3db78d8 #1 [ffff927569cd7758] schedule_preempt_disabled at ffffffffb3db8bf9 ...snip the same output... crash> real 0m8.827s user 0m7.896s sys 0m0.938s Signed-off-by: Tao Liu <[email protected]>

…ss range Previously, to find a module symbol and its offset by an arbitrary address, all symbols within the module will be iterated by address ascending order until the last symbol with a smaller address been noticed. However if the address is not within the module address range, e.g. the address is higher than the module's last symbol's address, then the module can be surely skipped, because its symbol iteration is unnecessary. This can speed up the kernel module symbols finding and improve the overall performance. Without the patch: $ time echo "bt 8993" | ~/crash-dev/crash vmcore vmlinux crash> bt 8993 PID: 8993 TASK: ffff927569cc2100 CPU: 2 COMMAND: "WriterPool0" #0 [ffff927569cd76f0] __schedule at ffffffffb3db78d8 crash-utility#1 [ffff927569cd7758] schedule_preempt_disabled at ffffffffb3db8bf9 crash-utility#2 [ffff927569cd7768] __mutex_lock_slowpath at ffffffffb3db6ca7 crash-utility#3 [ffff927569cd77c0] mutex_lock at ffffffffb3db602f crash-utility#4 [ffff927569cd77d8] ucache_retrieve at ffffffffc0cf4409 [secfs2] ...snip the stacktrace of the same module... crash-utility#11 [ffff927569cd7ba0] cskal_path_vfs_getattr_nosec at ffffffffc05cae76 [falcon_kal] ...snip... crash-utility#13 [ffff927569cd7c40] _ZdlPv at ffffffffc086e751 [falcon_lsm_serviceable] ...snip... crash-utility#20 [ffff927569cd7ef8] unload_network_ops_symbols at ffffffffc06f11c0 [falcon_lsm_pinned_14713] crash-utility#21 [ffff927569cd7f50] system_call_fastpath at ffffffffb3dc539a RIP: 00007f2b28ed4023 RSP: 00007f2a45fe7f80 RFLAGS: 00000206 RAX: 0000000000000012 RBX: 00007f2a68302e00 RCX: 00007f2a682546d8 RDX: 0000000000000826 RSI: 00007eb57ea6a000 RDI: 00000000000000e3 RBP: 00007eb57ea6a000 R8: 0000000000000826 R9: 00000002670bdfd2 R10: 00000002670bdfd2 R11: 0000000000000293 R12: 00000002670bdfd2 R13: 00007f29d501a480 R14: 0000000000000826 R15: 00000002670bdfd2 ORIG_RAX: 0000000000000012 CS: 0033 SS: 002b crash> real 7m14.826s user 7m12.502s sys 0m1.091s With the patch: $ time echo "bt 8993" | ~/crash-dev/crash vmcore vmlinux crash> bt 8993 PID: 8993 TASK: ffff927569cc2100 CPU: 2 COMMAND: "WriterPool0" #0 [ffff927569cd76f0] __schedule at ffffffffb3db78d8 crash-utility#1 [ffff927569cd7758] schedule_preempt_disabled at ffffffffb3db8bf9 ...snip the same output... crash> real 0m8.827s user 0m7.896s sys 0m0.938s Signed-off-by: Tao Liu <[email protected]>

The patch introduces per-cpu overflow stacks for RISCV64 to let "bt" do backtrace on it and the 'help -m' command dispalys the addresss of each per-cpu overflow stack. TEST: a lkdtm DIRECT EXHAUST_STACK vmcore crash> bt PID: 1 TASK: ff600000000d8000 CPU: 1 COMMAND: "sh" #0 [ff6000001fc501c0] riscv_crash_save_regs at ffffffff8000a1dc crash-utility#1 [ff6000001fc50320] panic at ffffffff808773ec crash-utility#2 [ff6000001fc50380] walk_stackframe at ffffffff800056da PC: ffffffff80876a34 [memset+96] RA: ffffffff80563dc0 [recursive_loop+68] SP: ff2000000000fd50 CAUSE: 000000000000000f epc : ffffffff80876a34 ra : ffffffff80563dc0 sp : ff2000000000fd50 gp : ffffffff81515d38 tp : 0000000000000000 t0 : ff2000000000fd58 t1 : ff600000000d88c8 t2 : 6143203a6d74646b s0 : ff20000000010190 s1 : 0000000000000012 a0 : ff2000000000fd58 a1 : 1212121212121212 a2 : 0000000000000400 a3 : ff20000000010158 a4 : 0000000000000000 a5 : 725bedba92260900 a6 : 000000000130e0f0 a7 : 0000000000000000 s2 : ff2000000000fd58 s3 : ffffffff815170d8 s4 : ff20000000013e60 s5 : 000000000000000e s6 : ff20000000013e60 s7 : 0000000000000000 s8 : ff60000000861000 s9 : 00007fffc3641694 s10: 00007fffc3641690 s11: 00005555796ed240 t3 : 0000000000010297 t4 : ffffffff80c17810 t5 : ffffffff8195e7b8 t6 : ff20000000013b18 status: 0000000200000120 badaddr: ff2000000000fd58 cause: 000000000000000f orig_a0: 0000000000000000 --- <OVERFLOW stack> --- crash-utility#3 [ff2000000000fd50] memset at ffffffff80876a34 crash-utility#4 [ff20000000010190] recursive_loop at ffffffff80563e16 crash-utility#5 [ff200000000105d0] recursive_loop at ffffffff80563e16 < recursive_loop ...> crash-utility#16 [ff20000000013490] recursive_loop at ffffffff80563e16 crash-utility#17 [ff200000000138d0] recursive_loop at ffffffff80563e16 crash-utility#18 [ff20000000013d10] lkdtm_EXHAUST_STACK at ffffffff8088005e crash-utility#19 [ff20000000013d30] lkdtm_do_action at ffffffff80563292 crash-utility#20 [ff20000000013d40] direct_entry at ffffffff80563474 crash-utility#21 [ff20000000013d70] full_proxy_write at ffffffff8032fb3a crash-utility#22 [ff20000000013db0] vfs_write at ffffffff801d6414 crash-utility#23 [ff20000000013e60] ksys_write at ffffffff801d67b8 crash-utility#24 [ff20000000013eb0] __riscv_sys_write at ffffffff801d6832 crash-utility#25 [ff20000000013ec0] do_trap_ecall_u at ffffffff80884a20 crash> crash> help -m <snip> irq_stack_size: 16384 irq_stacks[0]: ff20000000000000 irq_stacks[1]: ff20000000008000 overflow_stack_size: 4096 overflow_stacks[0]: ff6000001fa7a510 overflow_stacks[1]: ff6000001fc4f510 crash> Signed-off-by: Song Shuai <[email protected]>

…ss range Previously, to find a module symbol and its offset by an arbitrary address, all symbols within the module will be iterated by address ascending order until the last symbol with a smaller address been noticed. However if the address is not within the module address range, e.g. the address is higher than the module's last symbol's address, then the module can be surely skipped, because its symbol iteration is unnecessary. This can speed up the kernel module symbols finding and improve the overall performance. Without the patch: $ time echo "bt 8993" | ~/crash-dev/crash vmcore vmlinux crash> bt 8993 PID: 8993 TASK: ffff927569cc2100 CPU: 2 COMMAND: "WriterPool0" #0 [ffff927569cd76f0] __schedule at ffffffffb3db78d8 crash-utility#1 [ffff927569cd7758] schedule_preempt_disabled at ffffffffb3db8bf9 crash-utility#2 [ffff927569cd7768] __mutex_lock_slowpath at ffffffffb3db6ca7 crash-utility#3 [ffff927569cd77c0] mutex_lock at ffffffffb3db602f crash-utility#4 [ffff927569cd77d8] ucache_retrieve at ffffffffc0cf4409 [secfs2] ...snip the stacktrace of the same module... crash-utility#11 [ffff927569cd7ba0] cskal_path_vfs_getattr_nosec at ffffffffc05cae76 [falcon_kal] ...snip... crash-utility#13 [ffff927569cd7c40] _ZdlPv at ffffffffc086e751 [falcon_lsm_serviceable] ...snip... crash-utility#20 [ffff927569cd7ef8] unload_network_ops_symbols at ffffffffc06f11c0 [falcon_lsm_pinned_14713] crash-utility#21 [ffff927569cd7f50] system_call_fastpath at ffffffffb3dc539a RIP: 00007f2b28ed4023 RSP: 00007f2a45fe7f80 RFLAGS: 00000206 RAX: 0000000000000012 RBX: 00007f2a68302e00 RCX: 00007f2a682546d8 RDX: 0000000000000826 RSI: 00007eb57ea6a000 RDI: 00000000000000e3 RBP: 00007eb57ea6a000 R8: 0000000000000826 R9: 00000002670bdfd2 R10: 00000002670bdfd2 R11: 0000000000000293 R12: 00000002670bdfd2 R13: 00007f29d501a480 R14: 0000000000000826 R15: 00000002670bdfd2 ORIG_RAX: 0000000000000012 CS: 0033 SS: 002b crash> real 7m14.826s user 7m12.502s sys 0m1.091s With the patch: $ time echo "bt 8993" | ~/crash-dev/crash vmcore vmlinux crash> bt 8993 PID: 8993 TASK: ffff927569cc2100 CPU: 2 COMMAND: "WriterPool0" #0 [ffff927569cd76f0] __schedule at ffffffffb3db78d8 crash-utility#1 [ffff927569cd7758] schedule_preempt_disabled at ffffffffb3db8bf9 ...snip the same output... crash> real 0m8.827s user 0m7.896s sys 0m0.938s Signed-off-by: Tao Liu <[email protected]>

…ame" warning The "bogus exception frame" warning was observed again on a specific vmcore, and the remaining frame was truncated on x86_64 machine, when executing the "bt" command as below: crash> bt 0 -c 8 PID: 0 TASK: ffff9948c08f5640 CPU: 8 COMMAND: "swapper/8" #0 [fffffe1788788e58] crash_nmi_callback at ffffffff972672bb #1 [fffffe1788788e68] nmi_handle at ffffffff9722eb8e #2 [fffffe1788788eb0] default_do_nmi at ffffffff97e51cd0 #3 [fffffe1788788ed0] exc_nmi at ffffffff97e51ee1 #4 [fffffe1788788ef0] end_repeat_nmi at ffffffff980015f9 [exception RIP: __update_load_avg_se+13] RIP: ffffffff9736b16d RSP: ffffbec3c08acc78 RFLAGS: 00000046 RAX: 0000000000000000 RBX: ffff994c2f2b1a40 RCX: ffffbec3c08acdc0 RDX: ffff9948e4fe1d80 RSI: ffff994c2f2b1a40 RDI: 0000001d7ad7d55d RBP: ffffbec3c08acc88 R8: 0000001d921fca6f R9: ffff994c2f2b1328 R10: 00000000fffd0010 R11: ffffffff98e060c0 R12: 0000001d7ad7d55d R13: 0000000000000005 R14: ffff994c2f2b19c0 R15: 0000000000000001 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #5 [ffffbec3c08acc78] __update_load_avg_se at ffffffff9736b16d #6 [ffffbec3c08acce0] enqueue_entity at ffffffff9735c9ab #7 [ffffbec3c08acd28] enqueue_task_fair at ffffffff9735cef8 ... #18 [ffffbec3c08acf90] blk_complete_reqs at ffffffff977978d0 #19 [ffffbec3c08acfa0] __do_softirq at ffffffff97e66f7a #20 [ffffbec3c08acff0] do_softirq at ffffffff9730f6ef --- <IRQ stack> --- #21 [ffffbec3c022ff18] do_idle at ffffffff97368288 [exception RIP: unknown or invalid address] RIP: 0000000000000000 RSP: 0000000000000000 RFLAGS: 00000000 RAX: 0000000000000000 RBX: 000000089726a2d0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffffff9726a3dd R8: 0000000000000000 R9: 0000000000000000 R10: ffffffff9720015a R11: e48885e126bc1600 R12: 0000000000000000 R13: ffffffff973684a9 R14: 0000000000000094 R15: 0000000040000000 ORIG_RAX: 0000000000000000 CS: 0000 SS: 0000 bt: WARNING: possibly bogus exception frame crash> Actually there is no exception frame, when called from do_softirq(). With the patch: crash> bt 0 -c 8 ... #18 [ffffbec3c08acf90] blk_complete_reqs at ffffffff977978d0 #19 [ffffbec3c08acfa0] __do_softirq at ffffffff97e66f7a #20 [ffffbec3c08acff0] do_softirq at ffffffff9730f6ef --- <IRQ stack> --- #21 [ffffbec3c022ff28] cpu_startup_entry at ffffffff973684a9 #22 [ffffbec3c022ff38] start_secondary at ffffffff9726a3dd #23 [ffffbec3c022ff50] secondary_startup_64_no_verify at ffffffff9720015a crash> Reported-by: Jie Li <[email protected]> Signed-off-by: Lianbo Jiang <[email protected]>

On x86_64, there are cases where crash cannot handle IRQ exception frames properly. For example, with RHEL9.3 kernel, "bt" command fails with with "WARNING possibly bogus exception frame": crash> bt -c 30 PID: 2898241 TASK: ff4cb0ce0da0c680 CPU: 30 COMMAND: "star-ccm+" #0 [fffffe4658d88e58] crash_nmi_callback at ffffffffa00675e8 #1 [fffffe4658d88e68] nmi_handle at ffffffffa002ebab ... --- <NMI exception stack> --- ... #13 [ff5eba269937cf90] __do_softirq at ffffffffa0c6c007 #14 [ff5eba269937cfe0] __irq_exit_rcu at ffffffffa010ef61 #15 [ff5eba269937cff0] sysvec_apic_timer_interrupt at ffffffffa0c58ca2 --- <IRQ stack> --- RIP: 0000000000000010 RSP: 0000000000000018 RFLAGS: ff5eba26ddc9f7e8 RAX: 0000000000000a20 RBX: ff5eba26ddc9f940 RCX: 0000000000001000 RDX: ffffffb559980000 RSI: ff4cb12d67207400 RDI: ffffffffffffffff RBP: 0000000000001000 R8: ff5eba26ddc9f940 R9: ff5eba26ddc9f8af R10: 0000000000000003 R11: 0000000000000a20 R12: ff5eba26ddc9f8b0 R13: 000000283c07f000 R14: ff4cb0f5a29a1c00 R15: 0000000000000001 ORIG_RAX: ffffffffa07c4e60 CS: 0206 SS: 7000001cf0380001 bt: WARNING: possibly bogus exception frame Running "crash" with "--machdep irq_eframe_link=0xffffffffffffffe8" option (i.e. thus irq_eframe_link = -24) works properly: PID: 2898241 TASK: ff4cb0ce0da0c680 CPU: 30 COMMAND: "star-ccm+" #0 [fffffe4658d88e58] crash_nmi_callback at ffffffffa00675e8 #1 [fffffe4658d88e68] nmi_handle at ffffffffa002ebab ... --- <NMI exception stack> --- ... #13 [ff5eba269937cf90] __do_softirq at ffffffffa0c6c007 #14 [ff5eba269937cfe0] __irq_exit_rcu at ffffffffa010ef61 #15 [ff5eba269937cff0] sysvec_apic_timer_interrupt at ffffffffa0c58ca2 --- <IRQ stack> --- #16 [ff5eba26ddc9f738] asm_sysvec_apic_timer_interrupt at ffffffffa0e00e06 [exception RIP: alloc_pte.constprop.0+32] RIP: ffffffffa07c4e60 RSP: ff5eba26ddc9f7e8 RFLAGS: 00000206 RAX: ff5eba26ddc9f940 RBX: 0000000000001000 RCX: 0000000000000a20 RDX: 0000000000001000 RSI: ffffffb559980000 RDI: ff4cb12d67207400 RBP: ff5eba26ddc9f8b0 R8: ff5eba26ddc9f8af R9: 0000000000000003 R10: 0000000000000a20 R11: ff5eba26ddc9f940 R12: 000000283c07f000 R13: ff4cb0f5a29a1c00 R14: 0000000000000001 R15: ff4cb0f5a29a1bf8 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #17 [ff5eba26ddc9f830] iommu_v1_map_pages at ffffffffa07c5648 #18 [ff5eba26ddc9f8f8] __iommu_map at ffffffffa07d7803 #19 [ff5eba26ddc9f990] iommu_map_sg at ffffffffa07d7b71 #20 [ff5eba26ddc9f9f0] iommu_dma_map_sg at ffffffffa07ddcc9 #21 [ff5eba26ddc9fa90] __dma_map_sg_attrs at ffffffffa01b5205 ... Some background: asm_common_interrupt: callq error_entry movq %rax,%rsp movq %rsp,%rdi movq 0x78(%rsp),%rsi movq $-0x1,0x78(%rsp) call common_interrupt # rsp pointing to regs common_interrupt: pushq %r12 pushq %rbp pushq %rbx [...] movq hardirq_stack_ptr,%r11 movq %rsp,(%r11) movq %r11,%rsp [...] call __common_interrupt # rip:common_interrupt So frame_size(rip:common_interrupt) = 32 (3 push + ret). Hence "machdep->machspec->irq_eframe_link = -32;" (see x86_64_irq_eframe_link_init()). Now: asm_sysvec_apic_timer_interrupt: pushq $-0x1 callq error_entry movq %rax,%rsp movq %rsp,%rdi callq sysvec_apic_timer_interrupt sysvec_apic_timer_interrupt: pushq %r12 pushq %rbp [...] movq hardirq_stack_ptr,%r11 movq %rsp,(%r11) movq %r11,%rsp [...] call __sysvec_apic_timer_interrupt # rip:sysvec_apic_timer_interrupt Here frame_size(rip:sysvec_apic_timer_interrupt) = 24 (2 push + ret) We should also notice that: rip = *(hardirq_stack_ptr - 8) rsp = *(hardirq_stack_ptr) regs = rsp - frame_size(rip) But x86_64_get_framesize() does not work with IRQ handlers (returns 0). So not many options other than hardcoding the most likely value and looking around it. Actually x86_64_irq_eframe_link() was trying -32 (default), and then -40, but not -24. Signed-off-by: Georges Aureau <[email protected]>

…usly There is an issue that, for kernel modules, "dis -rl" fails to display modules code line number data after execute "bt" command in crash. Without the patch: crsah> mod -S crash> bt PID: 1500 TASK: ff2bd8b093524000 CPU: 16 COMMAND: "lpfc_worker_0" #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3 ...snip... crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc] ...snip... crash> dis -rl ffffffffc0f60f82 0xffffffffc0f60eb0 <lpfc_nlp_get>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffffc0f60eb5 <lpfc_nlp_get+5>: push %rbp 0xffffffffc0f60eb6 <lpfc_nlp_get+6>: push %rbx 0xffffffffc0f60eb7 <lpfc_nlp_get+7>: test %rdi,%rdi With the patch: crash> mod -S crash> bt PID: 1500 TASK: ff2bd8b093524000 CPU: 16 COMMAND: "lpfc_worker_0" #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3 ...snip... crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc] ...snip... crash> dis -rl ffffffffc0f60f82 /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756 0xffffffffc0f60eb0 <lpfc_nlp_get>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759 0xffffffffc0f60eb5 <lpfc_nlp_get+5>: push %rbp The root cause is, after kernel module been loaded by mod command, the symtable is not expanded in gdb side. crash bt or dis command will trigger such an expansion. However the symtable expansion is different for the 2 commands: The stack trace of "dis -rl" for symtable expanding: #0 0x00000000008d8d9f in add_compunit_symtab_to_objfile ... crash-utility#1 0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ... crash-utility#2 0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ... crash-utility#3 0x000000000077e8e9 in process_full_comp_unit ... crash-utility#4 process_queue ... crash-utility#5 dw2_do_instantiate_symtab ... crash-utility#6 0x000000000077ed67 in dw2_instantiate_symtab ... crash-utility#7 0x000000000077f75e in dw2_expand_all_symtabs ... crash-utility#8 0x00000000008f254d in gdb_get_line_number ... crash-utility#9 0x00000000008f22af in gdb_command_funnel_1 ... crash-utility#10 0x00000000008f2003 in gdb_command_funnel ... crash-utility#11 0x00000000005b7f02 in gdb_interface ... crash-utility#12 0x00000000005f5bd8 in get_line_number ... crash-utility#13 0x000000000059e574 in cmd_dis ... The stack trace of "bt" for symtable expanding: #0 0x00000000008d8d9f in add_compunit_symtab_to_objfile ... crash-utility#1 0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ... crash-utility#2 0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ... crash-utility#3 0x000000000077e8e9 in process_full_comp_unit ... crash-utility#4 process_queue ... crash-utility#5 dw2_do_instantiate_symtab ... crash-utility#6 0x000000000077ed67 in dw2_instantiate_symtab ... crash-utility#7 0x000000000077f8ed in dw2_lookup_symbol ... crash-utility#8 0x00000000008e6d03 in lookup_symbol_via_quick_fns ... crash-utility#9 0x00000000008e7153 in lookup_symbol_in_objfile ... crash-utility#10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ... crash-utility#11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ... crash-utility#12 0x00000000008e754e in lookup_global_or_static_symbol ... crash-utility#13 0x00000000008e75da in lookup_static_symbol ... crash-utility#14 0x00000000008e632c in lookup_symbol_aux ... crash-utility#15 0x00000000008e5a7a in lookup_symbol_in_language ... crash-utility#16 0x00000000008e5b30 in lookup_symbol ... crash-utility#17 0x00000000008f2a4a in gdb_get_datatype ... crash-utility#18 0x00000000008f22c0 in gdb_command_funnel_1 ... crash-utility#19 0x00000000008f2003 in gdb_command_funnel ... crash-utility#20 0x00000000005b7f02 in gdb_interface ... crash-utility#21 0x00000000005f8a9f in datatype_info ... crash-utility#22 0x0000000000599947 in cpu_map_size ... crash-utility#23 0x00000000005a975d in get_cpus_online ... crash-utility#24 0x0000000000637a8b in diskdump_get_prstatus_percpu ... crash-utility#25 0x000000000062f0e4 in get_netdump_regs_x86_64 ... crash-utility#26 0x000000000059fe68 in back_trace ... crash-utility#27 0x00000000005ab1cb in cmd_bt ... For the stacktrace of "dis -rl", it calls dw2_expand_all_symtabs() to expand all symtable of the objfile, or "*.ko.debug" in our case. However for the stacktrace of "bt", it doesn't expand all, but only a subset of symtable which is enough to find a symbol by dw2_lookup_symbol(). As a result, the objfile->compunit_symtabs, which is the head of a single linked list of struct compunit_symtab, is not NULL but didn't contain all symtables. It will not be reinitialized in gdb_get_line_number() by "dis -rl" because !objfile_has_full_symbols(objfile) check will fail, so it cannot display the proper code line number data. Since objfile_has_full_symbols(objfile) check cannot ensure all symbols been expanded, this patch add a new member as a flag for struct objfile to record if all symbols have been expanded. The flag will be set only ofter expand_all_symtabs been called. Signed-off-by: Tao Liu <[email protected]>

The patch introduces per-cpu overflow stacks for RISCV64 to let "bt" do backtrace on it and the 'help -m' command dispalys the addresss of each per-cpu overflow stack. TEST: a lkdtm DIRECT EXHAUST_STACK vmcore crash> bt PID: 1 TASK: ff600000000d8000 CPU: 1 COMMAND: "sh" #0 [ff6000001fc501c0] riscv_crash_save_regs at ffffffff8000a1dc crash-utility#1 [ff6000001fc50320] panic at ffffffff808773ec crash-utility#2 [ff6000001fc50380] walk_stackframe at ffffffff800056da PC: ffffffff80876a34 [memset+96] RA: ffffffff80563dc0 [recursive_loop+68] SP: ff2000000000fd50 CAUSE: 000000000000000f epc : ffffffff80876a34 ra : ffffffff80563dc0 sp : ff2000000000fd50 gp : ffffffff81515d38 tp : 0000000000000000 t0 : ff2000000000fd58 t1 : ff600000000d88c8 t2 : 6143203a6d74646b s0 : ff20000000010190 s1 : 0000000000000012 a0 : ff2000000000fd58 a1 : 1212121212121212 a2 : 0000000000000400 a3 : ff20000000010158 a4 : 0000000000000000 a5 : 725bedba92260900 a6 : 000000000130e0f0 a7 : 0000000000000000 s2 : ff2000000000fd58 s3 : ffffffff815170d8 s4 : ff20000000013e60 s5 : 000000000000000e s6 : ff20000000013e60 s7 : 0000000000000000 s8 : ff60000000861000 s9 : 00007fffc3641694 s10: 00007fffc3641690 s11: 00005555796ed240 t3 : 0000000000010297 t4 : ffffffff80c17810 t5 : ffffffff8195e7b8 t6 : ff20000000013b18 status: 0000000200000120 badaddr: ff2000000000fd58 cause: 000000000000000f orig_a0: 0000000000000000 --- <OVERFLOW stack> --- crash-utility#3 [ff2000000000fd50] memset at ffffffff80876a34 crash-utility#4 [ff20000000010190] recursive_loop at ffffffff80563e16 crash-utility#5 [ff200000000105d0] recursive_loop at ffffffff80563e16 < recursive_loop ...> crash-utility#16 [ff20000000013490] recursive_loop at ffffffff80563e16 crash-utility#17 [ff200000000138d0] recursive_loop at ffffffff80563e16 crash-utility#18 [ff20000000013d10] lkdtm_EXHAUST_STACK at ffffffff8088005e crash-utility#19 [ff20000000013d30] lkdtm_do_action at ffffffff80563292 crash-utility#20 [ff20000000013d40] direct_entry at ffffffff80563474 crash-utility#21 [ff20000000013d70] full_proxy_write at ffffffff8032fb3a crash-utility#22 [ff20000000013db0] vfs_write at ffffffff801d6414 crash-utility#23 [ff20000000013e60] ksys_write at ffffffff801d67b8 crash-utility#24 [ff20000000013eb0] __riscv_sys_write at ffffffff801d6832 crash-utility#25 [ff20000000013ec0] do_trap_ecall_u at ffffffff80884a20 crash> crash> help -m <snip> irq_stack_size: 16384 irq_stacks[0]: ff20000000000000 irq_stacks[1]: ff20000000008000 overflow_stack_size: 4096 overflow_stacks[0]: ff6000001fa7a510 overflow_stacks[1]: ff6000001fc4f510 crash> Signed-off-by: Song Shuai <[email protected]>

…ss range Previously, to find a module symbol and its offset by an arbitrary address, all symbols within the module will be iterated by address ascending order until the last symbol with a smaller address been noticed. However if the address is not within the module address range, e.g. the address is higher than the module's last symbol's address, then the module can be surely skipped, because its symbol iteration is unnecessary. This can speed up the kernel module symbols finding and improve the overall performance. Without the patch: $ time echo "bt 8993" | ~/crash-dev/crash vmcore vmlinux crash> bt 8993 PID: 8993 TASK: ffff927569cc2100 CPU: 2 COMMAND: "WriterPool0" #0 [ffff927569cd76f0] __schedule at ffffffffb3db78d8 crash-utility#1 [ffff927569cd7758] schedule_preempt_disabled at ffffffffb3db8bf9 crash-utility#2 [ffff927569cd7768] __mutex_lock_slowpath at ffffffffb3db6ca7 crash-utility#3 [ffff927569cd77c0] mutex_lock at ffffffffb3db602f crash-utility#4 [ffff927569cd77d8] ucache_retrieve at ffffffffc0cf4409 [secfs2] ...snip the stacktrace of the same module... crash-utility#11 [ffff927569cd7ba0] cskal_path_vfs_getattr_nosec at ffffffffc05cae76 [falcon_kal] ...snip... crash-utility#13 [ffff927569cd7c40] _ZdlPv at ffffffffc086e751 [falcon_lsm_serviceable] ...snip... crash-utility#20 [ffff927569cd7ef8] unload_network_ops_symbols at ffffffffc06f11c0 [falcon_lsm_pinned_14713] crash-utility#21 [ffff927569cd7f50] system_call_fastpath at ffffffffb3dc539a RIP: 00007f2b28ed4023 RSP: 00007f2a45fe7f80 RFLAGS: 00000206 RAX: 0000000000000012 RBX: 00007f2a68302e00 RCX: 00007f2a682546d8 RDX: 0000000000000826 RSI: 00007eb57ea6a000 RDI: 00000000000000e3 RBP: 00007eb57ea6a000 R8: 0000000000000826 R9: 00000002670bdfd2 R10: 00000002670bdfd2 R11: 0000000000000293 R12: 00000002670bdfd2 R13: 00007f29d501a480 R14: 0000000000000826 R15: 00000002670bdfd2 ORIG_RAX: 0000000000000012 CS: 0033 SS: 002b crash> real 7m14.826s user 7m12.502s sys 0m1.091s With the patch: $ time echo "bt 8993" | ~/crash-dev/crash vmcore vmlinux crash> bt 8993 PID: 8993 TASK: ffff927569cc2100 CPU: 2 COMMAND: "WriterPool0" #0 [ffff927569cd76f0] __schedule at ffffffffb3db78d8 crash-utility#1 [ffff927569cd7758] schedule_preempt_disabled at ffffffffb3db8bf9 ...snip the same output... crash> real 0m8.827s user 0m7.896s sys 0m0.938s Signed-off-by: Tao Liu <[email protected]>

…ame" warning The "bogus exception frame" warning was observed again on a specific vmcore, and the remaining frame was truncated on x86_64 machine, when executing the "bt" command as below: crash> bt 0 -c 8 PID: 0 TASK: ffff9948c08f5640 CPU: 8 COMMAND: "swapper/8" #0 [fffffe1788788e58] crash_nmi_callback at ffffffff972672bb crash-utility#1 [fffffe1788788e68] nmi_handle at ffffffff9722eb8e crash-utility#2 [fffffe1788788eb0] default_do_nmi at ffffffff97e51cd0 crash-utility#3 [fffffe1788788ed0] exc_nmi at ffffffff97e51ee1 crash-utility#4 [fffffe1788788ef0] end_repeat_nmi at ffffffff980015f9 [exception RIP: __update_load_avg_se+13] RIP: ffffffff9736b16d RSP: ffffbec3c08acc78 RFLAGS: 00000046 RAX: 0000000000000000 RBX: ffff994c2f2b1a40 RCX: ffffbec3c08acdc0 RDX: ffff9948e4fe1d80 RSI: ffff994c2f2b1a40 RDI: 0000001d7ad7d55d RBP: ffffbec3c08acc88 R8: 0000001d921fca6f R9: ffff994c2f2b1328 R10: 00000000fffd0010 R11: ffffffff98e060c0 R12: 0000001d7ad7d55d R13: 0000000000000005 R14: ffff994c2f2b19c0 R15: 0000000000000001 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- crash-utility#5 [ffffbec3c08acc78] __update_load_avg_se at ffffffff9736b16d crash-utility#6 [ffffbec3c08acce0] enqueue_entity at ffffffff9735c9ab crash-utility#7 [ffffbec3c08acd28] enqueue_task_fair at ffffffff9735cef8 ... crash-utility#18 [ffffbec3c08acf90] blk_complete_reqs at ffffffff977978d0 crash-utility#19 [ffffbec3c08acfa0] __do_softirq at ffffffff97e66f7a crash-utility#20 [ffffbec3c08acff0] do_softirq at ffffffff9730f6ef --- <IRQ stack> --- crash-utility#21 [ffffbec3c022ff18] do_idle at ffffffff97368288 [exception RIP: unknown or invalid address] RIP: 0000000000000000 RSP: 0000000000000000 RFLAGS: 00000000 RAX: 0000000000000000 RBX: 000000089726a2d0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffffff9726a3dd R8: 0000000000000000 R9: 0000000000000000 R10: ffffffff9720015a R11: e48885e126bc1600 R12: 0000000000000000 R13: ffffffff973684a9 R14: 0000000000000094 R15: 0000000040000000 ORIG_RAX: 0000000000000000 CS: 0000 SS: 0000 bt: WARNING: possibly bogus exception frame crash> Actually there is no exception frame, when called from do_softirq(). With the patch: crash> bt 0 -c 8 ... crash-utility#18 [ffffbec3c08acf90] blk_complete_reqs at ffffffff977978d0 crash-utility#19 [ffffbec3c08acfa0] __do_softirq at ffffffff97e66f7a crash-utility#20 [ffffbec3c08acff0] do_softirq at ffffffff9730f6ef --- <IRQ stack> --- crash-utility#21 [ffffbec3c022ff28] cpu_startup_entry at ffffffff973684a9 crash-utility#22 [ffffbec3c022ff38] start_secondary at ffffffff9726a3dd crash-utility#23 [ffffbec3c022ff50] secondary_startup_64_no_verify at ffffffff9720015a crash> Reported-by: Jie Li <[email protected]> Signed-off-by: Lianbo Jiang <[email protected]>

On x86_64, there are cases where crash cannot handle IRQ exception frames properly. For example, with RHEL9.3 kernel, "bt" command fails with with "WARNING possibly bogus exception frame": crash> bt -c 30 PID: 2898241 TASK: ff4cb0ce0da0c680 CPU: 30 COMMAND: "star-ccm+" #0 [fffffe4658d88e58] crash_nmi_callback at ffffffffa00675e8 crash-utility#1 [fffffe4658d88e68] nmi_handle at ffffffffa002ebab ... --- <NMI exception stack> --- ... crash-utility#13 [ff5eba269937cf90] __do_softirq at ffffffffa0c6c007 crash-utility#14 [ff5eba269937cfe0] __irq_exit_rcu at ffffffffa010ef61 crash-utility#15 [ff5eba269937cff0] sysvec_apic_timer_interrupt at ffffffffa0c58ca2 --- <IRQ stack> --- RIP: 0000000000000010 RSP: 0000000000000018 RFLAGS: ff5eba26ddc9f7e8 RAX: 0000000000000a20 RBX: ff5eba26ddc9f940 RCX: 0000000000001000 RDX: ffffffb559980000 RSI: ff4cb12d67207400 RDI: ffffffffffffffff RBP: 0000000000001000 R8: ff5eba26ddc9f940 R9: ff5eba26ddc9f8af R10: 0000000000000003 R11: 0000000000000a20 R12: ff5eba26ddc9f8b0 R13: 000000283c07f000 R14: ff4cb0f5a29a1c00 R15: 0000000000000001 ORIG_RAX: ffffffffa07c4e60 CS: 0206 SS: 7000001cf0380001 bt: WARNING: possibly bogus exception frame Running "crash" with "--machdep irq_eframe_link=0xffffffffffffffe8" option (i.e. thus irq_eframe_link = -24) works properly: PID: 2898241 TASK: ff4cb0ce0da0c680 CPU: 30 COMMAND: "star-ccm+" #0 [fffffe4658d88e58] crash_nmi_callback at ffffffffa00675e8 crash-utility#1 [fffffe4658d88e68] nmi_handle at ffffffffa002ebab ... --- <NMI exception stack> --- ... crash-utility#13 [ff5eba269937cf90] __do_softirq at ffffffffa0c6c007 crash-utility#14 [ff5eba269937cfe0] __irq_exit_rcu at ffffffffa010ef61 crash-utility#15 [ff5eba269937cff0] sysvec_apic_timer_interrupt at ffffffffa0c58ca2 --- <IRQ stack> --- crash-utility#16 [ff5eba26ddc9f738] asm_sysvec_apic_timer_interrupt at ffffffffa0e00e06 [exception RIP: alloc_pte.constprop.0+32] RIP: ffffffffa07c4e60 RSP: ff5eba26ddc9f7e8 RFLAGS: 00000206 RAX: ff5eba26ddc9f940 RBX: 0000000000001000 RCX: 0000000000000a20 RDX: 0000000000001000 RSI: ffffffb559980000 RDI: ff4cb12d67207400 RBP: ff5eba26ddc9f8b0 R8: ff5eba26ddc9f8af R9: 0000000000000003 R10: 0000000000000a20 R11: ff5eba26ddc9f940 R12: 000000283c07f000 R13: ff4cb0f5a29a1c00 R14: 0000000000000001 R15: ff4cb0f5a29a1bf8 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 crash-utility#17 [ff5eba26ddc9f830] iommu_v1_map_pages at ffffffffa07c5648 crash-utility#18 [ff5eba26ddc9f8f8] __iommu_map at ffffffffa07d7803 crash-utility#19 [ff5eba26ddc9f990] iommu_map_sg at ffffffffa07d7b71 crash-utility#20 [ff5eba26ddc9f9f0] iommu_dma_map_sg at ffffffffa07ddcc9 crash-utility#21 [ff5eba26ddc9fa90] __dma_map_sg_attrs at ffffffffa01b5205 ... Some background: asm_common_interrupt: callq error_entry movq %rax,%rsp movq %rsp,%rdi movq 0x78(%rsp),%rsi movq $-0x1,0x78(%rsp) call common_interrupt # rsp pointing to regs common_interrupt: pushq %r12 pushq %rbp pushq %rbx [...] movq hardirq_stack_ptr,%r11 movq %rsp,(%r11) movq %r11,%rsp [...] call __common_interrupt # rip:common_interrupt So frame_size(rip:common_interrupt) = 32 (3 push + ret). Hence "machdep->machspec->irq_eframe_link = -32;" (see x86_64_irq_eframe_link_init()). Now: asm_sysvec_apic_timer_interrupt: pushq $-0x1 callq error_entry movq %rax,%rsp movq %rsp,%rdi callq sysvec_apic_timer_interrupt sysvec_apic_timer_interrupt: pushq %r12 pushq %rbp [...] movq hardirq_stack_ptr,%r11 movq %rsp,(%r11) movq %r11,%rsp [...] call __sysvec_apic_timer_interrupt # rip:sysvec_apic_timer_interrupt Here frame_size(rip:sysvec_apic_timer_interrupt) = 24 (2 push + ret) We should also notice that: rip = *(hardirq_stack_ptr - 8) rsp = *(hardirq_stack_ptr) regs = rsp - frame_size(rip) But x86_64_get_framesize() does not work with IRQ handlers (returns 0). So not many options other than hardcoding the most likely value and looking around it. Actually x86_64_irq_eframe_link() was trying -32 (default), and then -40, but not -24. Signed-off-by: Georges Aureau <[email protected]>

crash-utility closed this as completed Aug 31, 2017

JoeyJiao mentioned this issue Nov 29, 2019

Fix SIGSEGV when cur_req is null pointer #45

Closed

cjl20062529 mentioned this issue Jun 3, 2020

Error reading global variable value in module with p command #50

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

memory corruption #20

memory corruption #20

leilchen commented Aug 30, 2017

crash-utility commented Aug 30, 2017 via email

crash-utility commented Aug 30, 2017 via email

crash-utility commented Aug 30, 2017 via email

leilchen commented Aug 31, 2017

crash-utility commented Aug 31, 2017

crash-utility commented Aug 31, 2017

memory corruption #20

memory corruption #20

Comments

leilchen commented Aug 30, 2017

crash-utility commented Aug 30, 2017 via email

crash-utility commented Aug 30, 2017 via email

crash-utility commented Aug 30, 2017 via email

leilchen commented Aug 31, 2017

crash-utility commented Aug 31, 2017

crash-utility commented Aug 31, 2017