-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
problem with crash utility in ubuntu 16.10 #9
Comments
----- Original Message -----
Hi,
i was unable to build with `make` but i install with `apt install crash` on
ubuntu 16.10
```
uname -a
Linux raminfp 4.10.8-041008-generic #201703310531 SMP Fri Mar 31 09:33:56 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux
```
kernel version `4.10.8-041008-generic`
```
***@***.*** /h/r/D/crash# crash
crash 7.1.5
Copyright (C) 2002-2016 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
crash: cannot find booted kernel -- please enter namelist argument
Usage:
crash [OPTION]... NAMELIST ***@***.*** (dumpfile form)
crash [OPTION]... [NAMELIST] (live system form)
Enter "crash -h" for details.
```
i get error `crash: cannot find booted kernel -- please enter namelist
argument`
you say on docs :
>If the kernel file is stored in /boot, /, /boot/efi, or in any /usr/src
or /usr/lib/debug/lib/modules subdirectory, then no command line arguments
are required -- the first kernel found that matches /proc/version will be
used as the namelist.
```
***@***.*** /h/r/D/crash# cat /proc/version
Linux version 4.10.8-041008-generic ***@***.***) (gcc version 6.2.0
20161005 (Ubuntu 6.2.0-5ubuntu12) ) #201703310531 SMP Fri Mar 31 09:33:56
UTC 2017
***@***.*** /h/r/D/crash# ls /boot/
vmlinuz-4.10.8-041008-generic
```
crash not work on ubuntu 16.10, if i am mistake please help for resolve it,
Cheers,
Ramin
I am not familiar Ubuntu how packages their kernel debuginfo package, but
what is required is the vmlinux file containing full debuginfo data
(not the stripped /boot/vmlinuz) file. Typically it would be installed
in the /usr/lib/debug/lib/modules subdirectory, but again I am not sure
about Ubuntu.
Also, when you are able to locate/install the kernel debuginfo package,
you will probably find that crash-7.1.5 will not work with a 4.10 kernel,
and therefore you would need to build/install crash-7.1.8.
Dave
|
root@raminfp /h/r/D/crash# cd /usr/lib/debug/lib/modules
cd: The directory “/usr/lib/debug/lib/modules” does not exist |
@crash-utility it did not work, root@raminfp /h/r/D/c/crash-7.1.7# ./crash --version
crash 7.1.7
Copyright (C) 2002-2016 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu". please tell me,Which operating system do you do? Thanks, |
----- Original Message -----
@crash-utility
```bash
***@***.*** /h/r/D/crash# cd /usr/lib/debug/lib/modules
cd: The directory “/usr/lib/debug/lib/modules” does not exist
OK, like I said, I have *no* idea about how Ubuntu packages are handled.
If you google "ubuntu kernel debuginfo package", you should be
able to determine what you need to do to get the relevant package.
Dave
|
----- Original Message -----
@crash-utility not work,
```bash
***@***.*** /h/r/D/c/crash-7.1.7# ./crash --version
crash 7.1.7
Copyright (C) 2002-2016 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
```
please tell me,Which operating system do you do?
Thanks,
Ramin
Red Hat's RHEL kernels, and Fedora kernels. But it works with any
operating system as long as you can get the proper vmlinux debuginfo
file.
Dave
|
@crash-utility do you means this?
|
----- Original Message -----
@crash-utility
```
***@***.*** /h/r/D/c/crash-7.1.7# ./crash /boot/vmlinuz-4.8.0-46-generic
crash 7.1.7
Copyright (C) 2002-2016 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
crash: /boot/vmlinuz-4.8.0-46-generic: not a supported file format
Usage:
crash [OPTION]... NAMELIST ***@***.*** (dumpfile form)
crash [OPTION]... [NAMELIST] (live system form)
Enter "crash -h" for details.
```
The crash utility (similar to gdb) requires an ELF format file
that contains full debuginfo data. For example:
$ file /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux
/usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=d72f51bee55ee4a6ea6bdc37f3faeaa7393d006c, not stripped
The vmlinuz file (with a "z" at the end) is a stripped/compressed bzImage,
which is not appropriate for the crash utility:
$ file /boot/vmlinuz-3.10.0-514.el7.x86_64
/boot/vmlinuz-3.10.0-514.el7.x86_64: Linux kernel x86 boot executable bzImage, version 3.10.0-514.el7.x86_64 ([email protected], RO-rootFS, swap_dev 0x5, Normal VGA
$
Dave
|
Thanks, you'r right, i should decompress compressed kernel image, this is bad because kernel isn't running for use crash, |
Note that you cannot "decompress" a vmlinuz file and restore it into a vmlinux file. You have to usel the original vmlinux file from which the stripped/compressed image was generated. That's why you have to install the kernel debuginfo package that contains the original vmlinux file. |
@crash-utility Thanks a lot Dave, How to work on ubuntu 16.10: PG key import:
root@raminfp:# apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C8CAB6595FDFF622
Add repository config:
root@raminfp:# codename=$(lsb_release -c | awk '{print $2}')
root@raminfp:# tee /etc/apt/sources.list.d/ddebs.list
root@raminfp:# deb http://ddebs.ubuntu.com/ ${codename} main restricted universe multiverse
root@raminfp:# deb http://ddebs.ubuntu.com/ ${codename}-security main restricted universe multiverse
root@raminfp:# deb http://ddebs.ubuntu.com/ ${codename}-updates main restricted universe multiverse
root@raminfp:# deb http://ddebs.ubuntu.com/ ${codename}-proposed main restricted universe multiverse
root@raminfp:# apt-get update
root@raminfp:# uname -r
4.8.0-46-generic
root@raminfp:# apt-get install linux-image-4.8.0-46-generic-dbgsym Finally :root@raminfp:/home/raminfp/Desktop/crash/crash-7.1.8# ./crash /usr/lib/debug/boot/vmlinux-4.8.0-46-generic
crash 7.1.8
Copyright (C) 2002-2016 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [184MB]: patching 98263 gdb minimal_symbol values
crash: read error: kernel virtual address: ffffffff8d63b400 type: "page_offset_base"
crash: this kernel may be configured with CONFIG_STRICT_DEVMEM, which
renders /dev/mem unusable as a live memory source.
crash: trying /proc/kcore as an alternative to /dev/mem
KERNEL: /usr/lib/debug/boot/vmlinux-4.8.0-46-generic
DUMPFILE: /proc/kcore
CPUS: 4
DATE: Fri Apr 14 06:05:18 2017
UPTIME: 00:11:44
LOAD AVERAGE: 0.72, 0.58, 0.67
TASKS: 648
NODENAME: raminfp
RELEASE: 4.8.0-46-generic
VERSION: #49-Ubuntu SMP Fri Mar 31 13:57:14 UTC 2017
MACHINE: x86_64 (1795 Mhz)
MEMORY: 4 GB
PID: 6288
COMMAND: "crash"
TASK: ffff99e1c4cdb600 [THREAD_INFO: ffff99e1c2f84000]
CPU: 3
STATE: TASK_RUNNING (ACTIVE)
crash>
Thanks Again Dave, i have question of RedHat,i can send for you in email address ([email protected]) please? |
----- Original Message -----
@crash-utility Thanks a lot Dave,
How to work on ubuntu 16.10:
```bash
PG key import:
***@***.***:# apt-key adv --keyserver keyserver.ubuntu.com --recv-keys
C8CAB6595FDFF622
Add repository config:
***@***.***:# codename=$(lsb_release -c | awk '{print $2}')
***@***.***:# tee /etc/apt/sources.list.d/ddebs.list
***@***.***:# deb http://ddebs.ubuntu.com/ ${codename} main restricted
universe multiverse
***@***.***:# deb http://ddebs.ubuntu.com/ ${codename}-security main
restricted universe multiverse
***@***.***:# deb http://ddebs.ubuntu.com/ ${codename}-updates main
restricted universe multiverse
***@***.***:# deb http://ddebs.ubuntu.com/ ${codename}-proposed main
restricted universe multiverse
***@***.***:# apt-get update
***@***.***:# uname -r
4.8.0-46-generic
***@***.***:# apt-get install linux-image-4.8.0-46-generic-dbgsym
```
## Finally :
```bash
***@***.***:/home/raminfp/Desktop/crash/crash-7.1.8# ./crash
/usr/lib/debug/boot/vmlinux-4.8.0-46-generic
crash 7.1.8
Copyright (C) 2002-2016 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [184MB]: patching 98263 gdb minimal_symbol values
crash: read error: kernel virtual address: ffffffff8d63b400 type:
"page_offset_base"
crash: this kernel may be configured with CONFIG_STRICT_DEVMEM, which
renders /dev/mem unusable as a live memory source.
crash: trying /proc/kcore as an alternative to /dev/mem
KERNEL: /usr/lib/debug/boot/vmlinux-4.8.0-46-generic
DUMPFILE: /proc/kcore
CPUS: 4
DATE: Fri Apr 14 06:05:18 2017
UPTIME: 00:11:44
LOAD AVERAGE: 0.72, 0.58, 0.67
TASKS: 648
NODENAME: raminfp
RELEASE: 4.8.0-46-generic
VERSION: #49-Ubuntu SMP Fri Mar 31 13:57:14 UTC 2017
MACHINE: x86_64 (1795 Mhz)
MEMORY: 4 GB
PID: 6288
COMMAND: "crash"
TASK: ffff99e1c4cdb600 [THREAD_INFO: ffff99e1c2f84000]
CPU: 3
STATE: TASK_RUNNING (ACTIVE)
crash>
```
Thanks Again Dave,
i have question of RedHat,i can send for you in email address
***@***.***) please?
Glad you finally got it working! Yes, you can use [email protected].
Dave
…
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#9 (comment)
|
Thank You! |
Fix for 'bt' command and options on Linux 5.8-rc1 or later kernels that contain merge commit 076f14be7fc942e112c94c841baec44124275cd0. The merged patches changed the name of exception functions that have been used by the crash utility to check the exception frame. Without the patch, the command and options cannot display it. Before: crash> bt PID: 8752 TASK: ffff8f80cb244380 CPU: 2 COMMAND: "insmod" #0 [ffffa3e40187f9f8] machine_kexec at ffffffffab25d267 crash-utility#1 [ffffa3e40187fa48] __crash_kexec at ffffffffab38e2ed crash-utility#2 [ffffa3e40187fb10] crash_kexec at ffffffffab38f1dd crash-utility#3 [ffffa3e40187fb28] oops_end at ffffffffab222cbd crash-utility#4 [ffffa3e40187fb48] do_trap at ffffffffab21fea1 crash-utility#5 [ffffa3e40187fb90] do_error_trap at ffffffffab21ff75 crash-utility#6 [ffffa3e40187fbd0] exc_invalid_op at ffffffffabb76a2c crash-utility#7 [ffffa3e40187fbf0] asm_exc_invalid_op at ffffffffabc00a72 crash-utility#8 [ffffa3e40187fc78] init_module at ffffffffc042b018 [invalid] crash-utility#9 [ffffa3e40187fca0] init_module at ffffffffc042b018 [invalid] crash-utility#10 [ffffa3e40187fca8] do_one_initcall at ffffffffab202806 crash-utility#11 [ffffa3e40187fd18] do_init_module at ffffffffab3888ba crash-utility#12 [ffffa3e40187fd38] load_module at ffffffffab38afde After: crash> bt PID: 8752 TASK: ffff8f80cb244380 CPU: 2 COMMAND: "insmod" #0 [ffffa3e40187f9f8] machine_kexec at ffffffffab25d267 crash-utility#1 [ffffa3e40187fa48] __crash_kexec at ffffffffab38e2ed crash-utility#2 [ffffa3e40187fb10] crash_kexec at ffffffffab38f1dd crash-utility#3 [ffffa3e40187fb28] oops_end at ffffffffab222cbd crash-utility#4 [ffffa3e40187fb48] do_trap at ffffffffab21fea1 crash-utility#5 [ffffa3e40187fb90] do_error_trap at ffffffffab21ff75 crash-utility#6 [ffffa3e40187fbd0] exc_invalid_op at ffffffffabb76a2c crash-utility#7 [ffffa3e40187fbf0] asm_exc_invalid_op at ffffffffabc00a72 [exception RIP: init_module+24] RIP: ffffffffc042b018 RSP: ffffa3e40187fca8 RFLAGS: 00010246 RAX: 000000000000001c RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff8f80fbd18000 RDI: ffff8f80fbd18000 RBP: ffffffffc042b000 R8: 000000000000029d R9: 000000000000002c R10: 0000000000000000 R11: ffffa3e40187fb58 R12: ffffffffc042d018 R13: ffffa3e40187fdf0 R14: ffffffffc042d000 R15: ffffa3e40187fe90 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 crash-utility#8 [ffffa3e40187fca0] init_module at ffffffffc042b018 [invalid] crash-utility#9 [ffffa3e40187fca8] do_one_initcall at ffffffffab202806 crash-utility#10 [ffffa3e40187fd18] do_init_module at ffffffffab3888ba crash-utility#11 [ffffa3e40187fd38] load_module at ffffffffab38afde Signed-off-by: Kazuhito Hagio <[email protected]>
Fix for 'bt' command and options on Linux 5.8-rc1 and later kernels that contain merge commit 076f14be7fc942e112c94c841baec44124275cd0. The merged patches changed the name of exception functions that have been used by the crash utility to check the exception frame. Without the patch, the command and options cannot display it. Before: crash> bt PID: 8752 TASK: ffff8f80cb244380 CPU: 2 COMMAND: "insmod" #0 [ffffa3e40187f9f8] machine_kexec at ffffffffab25d267 #1 [ffffa3e40187fa48] __crash_kexec at ffffffffab38e2ed #2 [ffffa3e40187fb10] crash_kexec at ffffffffab38f1dd #3 [ffffa3e40187fb28] oops_end at ffffffffab222cbd #4 [ffffa3e40187fb48] do_trap at ffffffffab21fea1 #5 [ffffa3e40187fb90] do_error_trap at ffffffffab21ff75 #6 [ffffa3e40187fbd0] exc_invalid_op at ffffffffabb76a2c #7 [ffffa3e40187fbf0] asm_exc_invalid_op at ffffffffabc00a72 #8 [ffffa3e40187fc78] init_module at ffffffffc042b018 [invalid] #9 [ffffa3e40187fca0] init_module at ffffffffc042b018 [invalid] #10 [ffffa3e40187fca8] do_one_initcall at ffffffffab202806 #11 [ffffa3e40187fd18] do_init_module at ffffffffab3888ba #12 [ffffa3e40187fd38] load_module at ffffffffab38afde After: crash> bt PID: 8752 TASK: ffff8f80cb244380 CPU: 2 COMMAND: "insmod" #0 [ffffa3e40187f9f8] machine_kexec at ffffffffab25d267 #1 [ffffa3e40187fa48] __crash_kexec at ffffffffab38e2ed #2 [ffffa3e40187fb10] crash_kexec at ffffffffab38f1dd #3 [ffffa3e40187fb28] oops_end at ffffffffab222cbd #4 [ffffa3e40187fb48] do_trap at ffffffffab21fea1 #5 [ffffa3e40187fb90] do_error_trap at ffffffffab21ff75 #6 [ffffa3e40187fbd0] exc_invalid_op at ffffffffabb76a2c #7 [ffffa3e40187fbf0] asm_exc_invalid_op at ffffffffabc00a72 [exception RIP: init_module+24] RIP: ffffffffc042b018 RSP: ffffa3e40187fca8 RFLAGS: 00010246 RAX: 000000000000001c RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff8f80fbd18000 RDI: ffff8f80fbd18000 RBP: ffffffffc042b000 R8: 000000000000029d R9: 000000000000002c R10: 0000000000000000 R11: ffffa3e40187fb58 R12: ffffffffc042d018 R13: ffffa3e40187fdf0 R14: ffffffffc042d000 R15: ffffa3e40187fe90 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffffa3e40187fca0] init_module at ffffffffc042b018 [invalid] #9 [ffffa3e40187fca8] do_one_initcall at ffffffffab202806 #10 [ffffa3e40187fd18] do_init_module at ffffffffab3888ba #11 [ffffa3e40187fd38] load_module at ffffffffab38afde Signed-off-by: Kazuhito Hagio <[email protected]>
…'bt' command dumpfiles: (1) If the kernel's crash_notes are not available, read them from ELF notes. (2) If an online CPUs did not save its ELF notes, then adjust the mapping of each ELF note to its CPU accordingly. E.g. With this patch: crash> bt PID: 4768 TASK: 9800000243bcf200 CPU: 3 COMMAND: "bash" #0 [980000024291f930] __crash_kexec at ffffffff802fff84 crash-utility#1 [980000024291faa0] panic at ffffffff80248cac crash-utility#2 [980000024291fb40] die at ffffffff8021b338 crash-utility#3 [980000024291fb70] do_page_fault at ffffffff802315e0 crash-utility#4 [980000024291fbd0] tlb_do_page_fault_1 at ffffffff80239388 crash-utility#5 [980000024291fd00] sysrq_handle_crash at ffffffff8085d308 crash-utility#6 [980000024291fd10] __handle_sysrq at ffffffff8085d9e0 crash-utility#7 [980000024291fd60] write_sysrq_trigger at ffffffff8085e020 crash-utility#8 [980000024291fd80] proc_reg_write at ffffffff804762f0 crash-utility#9 [980000024291fda0] __vfs_write at ffffffff803f3138 Signed-off-by: Huacai Chen <[email protected]> Signed-off-by: Youling Tang <[email protected]>
…'bt' command dumpfiles: (1) If the kernel's crash_notes are not available, read them from ELF notes. (2) If an online CPUs did not save its ELF notes, then adjust the mapping of each ELF note to its CPU accordingly. E.g. With this patch: crash> bt PID: 4768 TASK: 9800000243bcf200 CPU: 3 COMMAND: "bash" #0 [980000024291f930] __crash_kexec at ffffffff802fff84 #1 [980000024291faa0] panic at ffffffff80248cac #2 [980000024291fb40] die at ffffffff8021b338 #3 [980000024291fb70] do_page_fault at ffffffff802315e0 #4 [980000024291fbd0] tlb_do_page_fault_1 at ffffffff80239388 #5 [980000024291fd00] sysrq_handle_crash at ffffffff8085d308 #6 [980000024291fd10] __handle_sysrq at ffffffff8085d9e0 #7 [980000024291fd60] write_sysrq_trigger at ffffffff8085e020 #8 [980000024291fd80] proc_reg_write at ffffffff804762f0 #9 [980000024291fda0] __vfs_write at ffffffff803f3138 Signed-off-by: Huacai Chen <[email protected]> Signed-off-by: Youling Tang <[email protected]>
Overflow stack supported since kernel 4.14 in commit 872d8327ce8, without this patch, bt command trigger a SIGSEGV fault due the SP pointed to the overflow stack which not yet loaded by crash. Before: KERNEL: ../vmlinux DUMPFILE: la_guestdump.gcore CPUS: 8 DATE: Tue Jul 13 19:59:44 CST 2021 UPTIME: 00:00:42 LOAD AVERAGE: 3.99, 1.13, 0.39 TASKS: 1925 NODENAME: localhost RELEASE: 4.14.156+ VERSION: crash-utility#1 SMP PREEMPT Tue Jul 13 10:37:23 UTC 2021 MACHINE: aarch64 (unknown Mhz) MEMORY: 8.7 GB PANIC: "Kernel panic - not syncing: kernel stack overflow" PID: 1969 COMMAND: "irq/139-0-0024" TASK: ffffffcc1a230000 [THREAD_INFO: ffffffcc1a230000] CPU: 0 STATE: TASK_RUNNING (PANIC) crash-7.3.0> bt PID: 1969 TASK: ffffffcc1a230000 CPU: 0 COMMAND: "irq/139-0-0024" Segmentation fault (core dumped) After: crash> bt PID: 1969 TASK: ffffffcc1a230000 CPU: 0 COMMAND: "irq/139-0-0024" #0 [ffffffcc7fd5cf50] __delay at ffffff8008c80774 crash-utility#1 [ffffffcc7fd5cf60] __const_udelay at ffffff8008c80864 crash-utility#2 [ffffffcc7fd5cf80] msm_trigger_wdog_bite at ffffff80084e9430 crash-utility#3 [ffffffcc7fd5cfa0] do_vm_restart at ffffff80087bc974 crash-utility#4 [ffffffcc7fd5cfc0] machine_restart at ffffff80080856fc crash-utility#5 [ffffffcc7fd5cfd0] emergency_restart at ffffff80080d49bc crash-utility#6 [ffffffcc7fd5d140] panic at ffffff80080af4c0 crash-utility#7 [ffffffcc7fd5d150] nmi_panic at ffffff80080af150 crash-utility#8 [ffffffcc7fd5d190] handle_bad_stack at ffffff800808b0b8 crash-utility#9 [ffffffcc7fd5d2d0] __bad_stack at ffffff800808285c --- <IRQ stack> --- crash-utility#10 [ffffff801187bc60] el1_error_invalid at ffffff8008082e7c crash-utility#11 [ffffff801187bcc0] cyttsp6_mt_attention at ffffff8000e8498c [cyttsp6] crash-utility#12 [ffffff801187bd20] call_atten_cb at ffffff8000e82030 [cyttsp6] crash-utility#13 [ffffff801187bdc0] cyttsp6_irq at ffffff8000e81e34 [cyttsp6] crash-utility#14 [ffffff801187bdf0] irq_thread_fn at ffffff8008128dd8 crash-utility#15 [ffffff801187be50] irq_thread at ffffff8008128ca4 crash-utility#16 [ffffff801187beb0] kthread at ffffff80080d2fc4 crash> Signed-off-by: Hong YANG <[email protected]>
Kernel commit <872d8327ce89> ("arm64: add VMAP_STACK overflow detection") has supported the overflow stack exception handling. Without the patch, the "bt" command will make crash generate a core dump because of segmentation fault. With the patch, the "bt" command can display the overflow stack. Before: crash> bt PID: 3607 TASK: ffffffcbf9a4da00 CPU: 2 COMMAND: "sh" Segmentation fault (core dumped) After: crash> bt PID: 3607 TASK: ffffffcbf9a4da00 CPU: 2 COMMAND: "sh" #0 [ffffffccbfd85f50] __delay at ffffff8008ceded8 ... #5 [ffffffccbfd85fd0] emergency_restart at ffffff80080d49fc #6 [ffffffccbfd86140] panic at ffffff80080af4c0 #7 [ffffffccbfd86150] nmi_panic at ffffff80080af150 #8 [ffffffccbfd86190] handle_bad_stack at ffffff800808b0b8 #9 [ffffffccbfd862d0] __bad_stack at ffffff800808285c PC: ffffff8008082e80 [el1_sync] LR: ffffff8000d6c214 [stack_overflow_demo+84] SP: ffffff1a79930070 PSTATE: 204003c5 X29: ffffff8011b03d00 X28: ffffffcbf9a4da00 X27: ffffff8008e02000 X26: 0000000000000040 X25: 0000000000000124 X24: ffffffcbf9a4da00 X23: 0000007daec2e288 X22: ffffffcbfe03b800 X21: 0000007daec2e288 X20: 0000000000000002 X19: 0000000000000002 X18: 0000000000000002 X17: 00000000000003e7 X16: 0000000000000000 X15: 0000000000000000 X14: ffffffcc17facb00 X13: ffffffccb4c25c00 X12: 0000000000000000 X11: ffffffcc17fad660 X10: 0000000000000af0 X9: 0000000000000000 X8: ffffff1a799334f0 X7: 0000000000000000 X6: 000000000000003f X5: 0000000000000040 X4: 0000000000000010 X3: 00000065981d07f0 X2: 00000065981d07f0 X1: 0000000000000000 X0: ffffff1a799334f0 Signed-off-by: Hong YANG <[email protected]>
Kernel commit <872d8327ce89> ("arm64: add VMAP_STACK overflow detection") has supported the overflow stack exception handling. Without the patch, the "bt" command will make crash generate a core dump because of segmentation fault. With the patch, the "bt" command can display the overflow stack. Before: crash> bt PID: 3607 TASK: ffffffcbf9a4da00 CPU: 2 COMMAND: "sh" Segmentation fault (core dumped) After: crash> bt PID: 3607 TASK: ffffffcbf9a4da00 CPU: 2 COMMAND: "sh" #0 [ffffffccbfd85f50] __delay at ffffff8008ceded8 ... #5 [ffffffccbfd85fd0] emergency_restart at ffffff80080d49fc #6 [ffffffccbfd86140] panic at ffffff80080af4c0 #7 [ffffffccbfd86150] nmi_panic at ffffff80080af150 #8 [ffffffccbfd86190] handle_bad_stack at ffffff800808b0b8 #9 [ffffffccbfd862d0] __bad_stack at ffffff800808285c PC: ffffff8008082e80 [el1_sync] LR: ffffff8000d6c214 [stack_overflow_demo+84] SP: ffffff1a79930070 PSTATE: 204003c5 X29: ffffff8011b03d00 X28: ffffffcbf9a4da00 X27: ffffff8008e02000 X26: 0000000000000040 X25: 0000000000000124 X24: ffffffcbf9a4da00 X23: 0000007daec2e288 X22: ffffffcbfe03b800 X21: 0000007daec2e288 X20: 0000000000000002 X19: 0000000000000002 X18: 0000000000000002 X17: 00000000000003e7 X16: 0000000000000000 X15: 0000000000000000 X14: ffffffcc17facb00 X13: ffffffccb4c25c00 X12: 0000000000000000 X11: ffffffcc17fad660 X10: 0000000000000af0 X9: 0000000000000000 X8: ffffff1a799334f0 X7: 0000000000000000 X6: 000000000000003f X5: 0000000000000040 X4: 0000000000000010 X3: 00000065981d07f0 X2: 00000065981d07f0 X1: 0000000000000000 X0: ffffff1a799334f0 Signed-off-by: Hong YANG <[email protected]>
When we use crash to troubleshoot softlockup and other problems, we often use the 'bt -a' command to print the stacks of running processes on all CPUs. But now some servers have hundreds of CPUs (such as AMD machines), which causes the 'bt -a' command to output a lot of process stacks. And many of these stacks are the stacks of the idle process, which are not needed by us. Therefore, in order to reduce this part of the interference information, this patch adds the -n option to the bt command. When we specify '-n idle' (meaning no idle), the stack of the idle process will be filtered out, thus speeding up our troubleshooting. And the option works only for crash dumps captured by kdump. The command output is as follows: crash> bt -a -n idle [...] PID: 0 TASK: ffff889ff8c34380 CPU: 8 COMMAND: "swapper/8" PID: 0 TASK: ffff889ff8c32d00 CPU: 9 COMMAND: "swapper/9" PID: 0 TASK: ffff889ff8c31680 CPU: 10 COMMAND: "swapper/10" PID: 0 TASK: ffff889ff8c35a00 CPU: 11 COMMAND: "swapper/11" PID: 0 TASK: ffff889ff8c3c380 CPU: 12 COMMAND: "swapper/12" PID: 150773 TASK: ffff889fe85a1680 CPU: 13 COMMAND: "bash" #0 [ffffc9000d35bcd0] machine_kexec at ffffffff8105a407 #1 [ffffc9000d35bd28] __crash_kexec at ffffffff8113033d #2 [ffffc9000d35bdf0] panic at ffffffff81081930 #3 [ffffc9000d35be70] sysrq_handle_crash at ffffffff814e38d1 #4 [ffffc9000d35be78] __handle_sysrq.cold.12 at ffffffff814e4175 #5 [ffffc9000d35bea8] write_sysrq_trigger at ffffffff814e404b #6 [ffffc9000d35beb8] proc_reg_write at ffffffff81330d86 #7 [ffffc9000d35bed0] vfs_write at ffffffff812a72d5 #8 [ffffc9000d35bf00] ksys_write at ffffffff812a7579 #9 [ffffc9000d35bf38] do_syscall_64 at ffffffff81004259 RIP: 00007fa7abcdc274 RSP: 00007fffa731f678 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa7abcdc274 RDX: 0000000000000002 RSI: 0000563ca51ee6d0 RDI: 0000000000000001 RBP: 0000563ca51ee6d0 R8: 000000000000000a R9: 00007fa7abd6be80 R10: 000000000000000a R11: 0000000000000246 R12: 00007fa7abdad760 R13: 0000000000000002 R14: 00007fa7abda8760 R15: 0000000000000002 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b [...] Signed-off-by: Qi Zheng <[email protected]> Acked-by: Kazuhito Hagio <[email protected]> Acked-by: Lianbo Jiang <[email protected]>
Kernel commit 7d65f4a65532 ("irq: Consolidate do_softirq() arch overriden implementations") renamed the call_softirq to do_softirq_own_stack, and there is no exception frame also when coming from do_softirq_own_stack. Without the patch, crash may unnecessarily output an exception frame with a warning as below: crash> foreach bt ... PID: 0 TASK: ffff914f820a8000 CPU: 25 COMMAND: "swapper/25" #0 [fffffe0000504e48] crash_nmi_callback at ffffffffa665d763 #1 [fffffe0000504e50] nmi_handle at ffffffffa662a423 #2 [fffffe0000504ea8] default_do_nmi at ffffffffa6fe7dc9 #3 [fffffe0000504ec8] do_nmi at ffffffffa662a97f #4 [fffffe0000504ef0] end_repeat_nmi at ffffffffa70015e8 [exception RIP: clone_endio+172] RIP: ffffffffc005c1ec RSP: ffffa1d403d08e98 RFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff915326fba230 RCX: 0000000000000018 RDX: ffffffffc0075400 RSI: 0000000000000000 RDI: ffff915326fba230 RBP: ffff915326fba1c0 R8: 0000000000001000 R9: ffff915308d6d2a0 R10: 000000a97dfe5e10 R11: ffffa1d40038fe98 R12: ffff915302babc40 R13: ffff914f94360000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #5 [ffffa1d403d08e98] clone_endio at ffffffffc005c1ec [dm_mod] #6 [ffffa1d403d08ed0] blk_update_request at ffffffffa6a96954 #7 [ffffa1d403d08f10] scsi_end_request at ffffffffa6c9b968 #8 [ffffa1d403d08f48] scsi_io_completion at ffffffffa6c9bb3e #9 [ffffa1d403d08f90] blk_complete_reqs at ffffffffa6aa0e95 #10 [ffffa1d403d08fa0] __softirqentry_text_start at ffffffffa72000dc #11 [ffffa1d403d08ff0] do_softirq_own_stack at ffffffffa7000f9a --- <IRQ stack> --- #12 [ffffa1d40038fe70] do_softirq_own_stack at ffffffffa7000f9a [exception RIP: unknown or invalid address] RIP: 0000000000000000 RSP: 0000000000000000 RFLAGS: 00000000 RAX: ffffffffa672eae5 RBX: ffffffffa83b34e0 RCX: ffffffffa672eb12 RDX: 0000000000000010 RSI: 8b7d6c8869010c00 RDI: 0000000000000085 RBP: 0000000000000286 R8: ffff914f820a8000 R9: ffffffffa67a94e0 R10: 0000000000000286 R11: ffffffffa66fb4c5 R12: ffffffffa67a898b R13: 0000000000000000 R14: fffffffffffffff8 R15: ffffffffa67a1e68 ORIG_RAX: 0000000000000000 CS: 0000 SS: ffffffffa672edff bt: WARNING: possibly bogus exception frame #13 [ffffa1d40038ff30] start_secondary at ffffffffa665fa2c #14 [ffffa1d40038ff50] secondary_startup_64_no_verify at ffffffffa6600116 ... Reported-by: Marco Patalano <[email protected]> Signed-off-by: Lianbo Jiang <[email protected]>
On kernels configured with CONFIG_RANDOMIZE_KSTACK_OFFSET=y and random_kstack_offset=on, a random offset is added to the stack with __kstack_alloca() at the beginning of do_syscall_64() and other syscall entry functions. This eventually does the following instruction. <do_syscall_64+32>: sub %rax,%rsp On the other hand, crash uses only a part of data for ORC unwinder to unwind stacks and if an ip value doesn't have a usable ORC data, it caluculates the frame size with parsing the assembly of the function. However, crash cannot calculate the frame size correctly with the instruction above, and prints stale return addresses like this: crash> bt 1 PID: 1 TASK: ffff9c250023b880 CPU: 0 COMMAND: "systemd" #0 [ffffb7e5c001fc80] __schedule at ffffffff91ae2b16 crash-utility#1 [ffffb7e5c001fd00] schedule at ffffffff91ae2ed3 crash-utility#2 [ffffb7e5c001fd18] schedule_hrtimeout_range_clock at ffffffff91ae7ed8 crash-utility#3 [ffffb7e5c001fda8] ep_poll at ffffffff913ef828 crash-utility#4 [ffffb7e5c001fe48] do_epoll_wait at ffffffff913ef943 crash-utility#5 [ffffb7e5c001fe80] __x64_sys_epoll_wait at ffffffff913f0130 crash-utility#6 [ffffb7e5c001fed0] do_syscall_64 at ffffffff91ad7169 crash-utility#7 [ffffb7e5c001fef0] do_syscall_64 at ffffffff91ad7179 << crash-utility#8 [ffffb7e5c001ff10] syscall_exit_to_user_mode at ffffffff91adaab2 << stale entries crash-utility#9 [ffffb7e5c001ff20] do_syscall_64 at ffffffff91ad7179 << crash-utility#10 [ffffb7e5c001ff50] entry_SYSCALL_64_after_hwframe at ffffffff91c0009b RIP: 00007f258d9427ae RSP: 00007fffda631d60 RFLAGS: 00000293 ... To fix this, enhance the usage of ORC data. The ORC unwinder often uses %rbp value, so keep it from exception frames and inactive task stacks. Signed-off-by: Kazuhito Hagio <[email protected]>
On kernels configured with CONFIG_RANDOMIZE_KSTACK_OFFSET=y and random_kstack_offset=on, a random offset is added to task stacks with __kstack_alloca() at the beginning of do_syscall_64() and other syscall entry functions. This eventually does the following instruction. <do_syscall_64+32>: sub %rax,%rsp On the other hand, crash uses only a part of data for ORC unwinder to unwind stacks and if an ip value doesn't have a usable ORC data, it caluculates the frame size with parsing the assembly of the function. However, crash cannot calculate the frame size correctly with the instruction above, and prints stale return addresses like this: crash> bt 1 PID: 1 TASK: ffff9c250023b880 CPU: 0 COMMAND: "systemd" #0 [ffffb7e5c001fc80] __schedule at ffffffff91ae2b16 #1 [ffffb7e5c001fd00] schedule at ffffffff91ae2ed3 #2 [ffffb7e5c001fd18] schedule_hrtimeout_range_clock at ffffffff91ae7ed8 #3 [ffffb7e5c001fda8] ep_poll at ffffffff913ef828 #4 [ffffb7e5c001fe48] do_epoll_wait at ffffffff913ef943 #5 [ffffb7e5c001fe80] __x64_sys_epoll_wait at ffffffff913f0130 #6 [ffffb7e5c001fed0] do_syscall_64 at ffffffff91ad7169 #7 [ffffb7e5c001fef0] do_syscall_64 at ffffffff91ad7179 << #8 [ffffb7e5c001ff10] syscall_exit_to_user_mode at ffffffff91adaab2 << stale entries #9 [ffffb7e5c001ff20] do_syscall_64 at ffffffff91ad7179 << #10 [ffffb7e5c001ff50] entry_SYSCALL_64_after_hwframe at ffffffff91c0009b RIP: 00007f258d9427ae RSP: 00007fffda631d60 RFLAGS: 00000293 ... To fix this, enhance the use of ORC data. The ORC unwinder often uses %rbp value, so keep it from exception frames and inactive task stacks. Signed-off-by: Kazuhito Hagio <[email protected]>
Previously we can only view the stack unwinding for the tasks which are running on each CPUs. This patch will enable the ability to view arbitrary tasks stack unwinding. After crash get initialized, "info threads" will output like the following: crash> info threads Id Target Id Frame 1 CPU 0 native_safe_halt () at arch/x86/include/asm/irqflags.h:54 ... * 8 CPU 7 blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640 ... 13 CPU 12 <unavailable> in ?? () 14 CPU 13 native_safe_halt () at arch/x86/include/asm/irqflags.h:54 ... crash> ps PID PPID CPU TASK ST %MEM VSZ RSS COMM > 0 0 0 ffffffff819f9480 RU 0.0 0 0 [swapper/0] > 0 0 1 ffff880169411fa0 RU 0.0 0 0 [swapper/1] ... 0 0 23 ffff8801694e0000 RU 0.0 0 0 [swapper/23] 1 0 13 ffff880169b30000 IN 0.0 193052 4180 systemd "info threads" show the tasks which are currently running on each CPU. If we'd like to view systemd task's stack unwinding, which is inactive status, we do the following: crash> set 1 or crash> set ffff880169b30000 Then the register cache of systemd will be swapped into CPU 13: crash> info threads Id Target Id Frame 1 CPU 0 native_safe_halt () at arch/x86/include/asm/irqflags.h:54 ... 8 CPU 7 blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640 ... 13 CPU 12 <unavailable> in ?? () * 14 CPU 13 0xffffffff816a8f65 in context_switch (rq=0x0, next=0x0, prev=0xffff880169b30000) at kernel/sched/core.c:2527 ... And we can view the stack unwinding of systemd: crash> bt PID: 1 TASK: ffff880169b30000 CPU: 13 COMMAND: "systemd" #0 [ffff880169b3bd58] __schedule at ffffffff816a8f65 crash-utility#1 [ffff880169b3bdc0] schedule at ffffffff816a94e9 crash-utility#2 [ffff880169b3bdd0] schedule_hrtimeout_range_clock at ffffffff816a86fd crash-utility#3 [ffff880169b3be68] schedule_hrtimeout_range at ffffffff816a8733 crash-utility#4 [ffff880169b3be78] ep_poll at ffffffff8124bb7e crash-utility#5 [ffff880169b3bf30] sys_epoll_wait at ffffffff8124d00d crash-utility#6 [ffff880169b3bf80] system_call_fastpath at ffffffff816b5009 RIP: 00007f0449407923 RSP: 00007ffc35a3c378 RFLAGS: 00010246 RAX: 00000000000000e8 RBX: ffffffff816b5009 RCX: 0000000000000071 RDX: 000000000000001d RSI: 00007ffc35a3d5a0 RDI: 0000000000000004 RBP: 00007ffc35a3d810 R8: 0000000000000000 R9: 0000000000000000 R10: 00000000ffffffff R11: 0000000000000293 R12: 0000563ca2ebe980 R13: 0000000000000003 R14: ffffffffffffffff R15: 0000000000000001 ORIG_RAX: 00000000000000e8 CS: 0033 SS: 002b crash> gdb bt #0 0xffffffff816a8f65 in context_switch (rq=0x0, next=0x0, prev=0xffff880169b30000) at kernel/sched/core.c:2527 crash-utility#1 __schedule () at kernel/sched/core.c:3540 crash-utility#2 0xffffffff816a94e9 in schedule () at kernel/sched/core.c:3577 crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock (expires=expires@entry=0x0, delta=delta@entry=0, mode=mode@entry=HRTIMER_MODE_ABS, clock=clock@entry=1) at kernel/hrtimer.c:1724 crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range (expires=expires@entry=0x0, delta=delta@entry=0, mode=mode@entry=HRTIMER_MODE_ABS) at kernel/hrtimer.c:1778 crash-utility#5 0xffffffff8124bb7e in ep_poll (ep=0xffff880fd861f8c0, events=events@entry=0x7ffc35a3d5a0, maxevents=maxevents@entry=29, timeout=timeout@entry=-1) at fs/eventpoll.c:1669 crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait (timeout=<optimized out>, maxevents=29, events=<optimized out>, epfd=<optimized out>) at fs/eventpoll.c:2043 crash-utility#7 SyS_epoll_wait (epfd=<optimized out>, events=140721208415648, maxevents=29, timeout=4294967295) at fs/eventpoll.c:2008 crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () Signed-off-by: Tao Liu <[email protected]> Signed-off-by: Aditya Gupta <[email protected]>
The stack unwinding is for kernel addresses only. If non-kernel address encountered, it is usually a user space address, or non-address value like a function call parameter. So stopping stack unwinding at non-kernel address will decrease the invalid unwind results. Before: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () crash-utility#10 0xffff880100000001 in ?? () crash-utility#11 0xffff880169b3c010 in ?? () crash-utility#12 0x0000000000000040 in irq_stack_union () crash-utility#13 0xffff880169b3c058 in ?? () crash-utility#14 0xffff880169b3c048 in ?? () crash-utility#15 0xffff880169b3c050 in ?? () crash-utility#16 0x0000000000000000 in ?? () After: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule () ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () Signed-off-by: Tao Liu <[email protected]>
Previously we can only view the stack unwinding for the tasks which are running on each CPUs. This patch will enable the ability to view arbitrary tasks stack unwinding. After crash get initialized, "info threads" will output like the following: crash> info threads Id Target Id Frame 1 CPU 0 native_safe_halt () at arch/x86/include/asm/irqflags.h:54 ... * 8 CPU 7 blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640 ... 13 CPU 12 <unavailable> in ?? () 14 CPU 13 native_safe_halt () at arch/x86/include/asm/irqflags.h:54 ... crash> ps PID PPID CPU TASK ST %MEM VSZ RSS COMM > 0 0 0 ffffffff819f9480 RU 0.0 0 0 [swapper/0] > 0 0 1 ffff880169411fa0 RU 0.0 0 0 [swapper/1] ... 0 0 23 ffff8801694e0000 RU 0.0 0 0 [swapper/23] 1 0 13 ffff880169b30000 IN 0.0 193052 4180 systemd "info threads" show the tasks which are currently running on each CPU. If we'd like to view systemd task's stack unwinding, which is inactive status, we do the following: crash> set 1 or crash> set ffff880169b30000 Then the register cache of systemd will be swapped into CPU 13: crash> info threads Id Target Id Frame 1 CPU 0 native_safe_halt () at arch/x86/include/asm/irqflags.h:54 ... 8 CPU 7 blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640 ... 13 CPU 12 <unavailable> in ?? () * 14 CPU 13 0xffffffff816a8f65 in context_switch (rq=0x0, next=0x0, prev=0xffff880169b30000) at kernel/sched/core.c:2527 ... And we can view the stack unwinding of systemd: crash> bt PID: 1 TASK: ffff880169b30000 CPU: 13 COMMAND: "systemd" #0 [ffff880169b3bd58] __schedule at ffffffff816a8f65 crash-utility#1 [ffff880169b3bdc0] schedule at ffffffff816a94e9 crash-utility#2 [ffff880169b3bdd0] schedule_hrtimeout_range_clock at ffffffff816a86fd crash-utility#3 [ffff880169b3be68] schedule_hrtimeout_range at ffffffff816a8733 crash-utility#4 [ffff880169b3be78] ep_poll at ffffffff8124bb7e crash-utility#5 [ffff880169b3bf30] sys_epoll_wait at ffffffff8124d00d crash-utility#6 [ffff880169b3bf80] system_call_fastpath at ffffffff816b5009 RIP: 00007f0449407923 RSP: 00007ffc35a3c378 RFLAGS: 00010246 RAX: 00000000000000e8 RBX: ffffffff816b5009 RCX: 0000000000000071 RDX: 000000000000001d RSI: 00007ffc35a3d5a0 RDI: 0000000000000004 RBP: 00007ffc35a3d810 R8: 0000000000000000 R9: 0000000000000000 R10: 00000000ffffffff R11: 0000000000000293 R12: 0000563ca2ebe980 R13: 0000000000000003 R14: ffffffffffffffff R15: 0000000000000001 ORIG_RAX: 00000000000000e8 CS: 0033 SS: 002b crash> gdb bt #0 0xffffffff816a8f65 in context_switch (rq=0x0, next=0x0, prev=0xffff880169b30000) at kernel/sched/core.c:2527 crash-utility#1 __schedule () at kernel/sched/core.c:3540 crash-utility#2 0xffffffff816a94e9 in schedule () at kernel/sched/core.c:3577 crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock (expires=expires@entry=0x0, delta=delta@entry=0, mode=mode@entry=HRTIMER_MODE_ABS, clock=clock@entry=1) at kernel/hrtimer.c:1724 crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range (expires=expires@entry=0x0, delta=delta@entry=0, mode=mode@entry=HRTIMER_MODE_ABS) at kernel/hrtimer.c:1778 crash-utility#5 0xffffffff8124bb7e in ep_poll (ep=0xffff880fd861f8c0, events=events@entry=0x7ffc35a3d5a0, maxevents=maxevents@entry=29, timeout=timeout@entry=-1) at fs/eventpoll.c:1669 crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait (timeout=<optimized out>, maxevents=29, events=<optimized out>, epfd=<optimized out>) at fs/eventpoll.c:2043 crash-utility#7 SyS_epoll_wait (epfd=<optimized out>, events=140721208415648, maxevents=29, timeout=4294967295) at fs/eventpoll.c:2008 crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () Signed-off-by: Tao Liu <[email protected]> Signed-off-by: Aditya Gupta <[email protected]>
The stack unwinding is for kernel addresses only. If non-kernel address encountered, it is usually a user space address, or non-address value like a function call parameter. So stopping stack unwinding at non-kernel address will decrease the invalid unwind results. Before: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () crash-utility#10 0xffff880100000001 in ?? () crash-utility#11 0xffff880169b3c010 in ?? () crash-utility#12 0x0000000000000040 in irq_stack_union () crash-utility#13 0xffff880169b3c058 in ?? () crash-utility#14 0xffff880169b3c048 in ?? () crash-utility#15 0xffff880169b3c050 in ?? () crash-utility#16 0x0000000000000000 in ?? () After: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule () ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () Signed-off-by: Tao Liu <[email protected]>
The stack unwinding is for kernel addresses only. If non-kernel address encountered, it is usually a user space address, or non-address value like a function call parameter. So stopping stack unwinding at non-kernel address will decrease the invalid unwind results. Before: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () crash-utility#10 0xffff880100000001 in ?? () crash-utility#11 0xffff880169b3c010 in ?? () crash-utility#12 0x0000000000000040 in irq_stack_union () crash-utility#13 0xffff880169b3c058 in ?? () crash-utility#14 0xffff880169b3c048 in ?? () crash-utility#15 0xffff880169b3c050 in ?? () crash-utility#16 0x0000000000000000 in ?? () After: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule () ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () Signed-off-by: Tao Liu <[email protected]>
The following segmentation fault occurred during session initialization: $ crash vmlinx vmcore ... please wait... (determining panic task)Segmentation fault Here is the backtrace of the crash-utility: (gdb) bt #0 value_search_module_6_4 (value=18446603338276298752, offset=0x7ffffffface0) at symbols.c:5564 #1 0x0000555555812bd0 in value_to_symstr (value=18446603338276298752, buf=buf@entry=0x7fffffffb9c0 "", radix=10, radix@entry=0) at symbols.c:5872 #2 0x00005555557694a2 in display_memory (addr=<optimized out>, count=2048, flag=208, memtype=memtype@entry=1, opt=opt@entry=0x0) at memory.c:1740 #3 0x0000555555769e1f in raw_stack_dump (stackbase=<optimized out>, size=<optimized out>) at memory.c:2194 #4 0x00005555557923ff in get_active_set_panic_task () at task.c:8639 #5 0x00005555557930d2 in get_dumpfile_panic_task () at task.c:7628 #6 0x00005555557a89d3 in panic_search () at task.c:7380 #7 get_panic_context () at task.c:6267 #8 task_init () at task.c:687 #9 0x00005555557305b3 in main_loop () at main.c:787 ... This is due to lack of existence check on module symbol table. Not all mod_mem_type will be existent for a module, e.g. in the following module case: (gdb) p lm->symtable[0] $1 = (struct syment *) 0x4dcbad0 (gdb) p lm->symtable[1] $2 = (struct syment *) 0x4dcbb70 (gdb) p lm->symtable[2] $3 = (struct syment *) 0x4dcbc10 (gdb) p lm->symtable[3] $4 = (struct syment *) 0x0 (gdb) p lm->symtable[4] $5 = (struct syment *) 0x4dcbcb0 (gdb) p lm->symtable[5] $6 = (struct syment *) 0x4dcbd00 (gdb) p lm->symtable[6] $7 = (struct syment *) 0x0 MOD_RO_AFTER_INIT(3) and MOD_INIT_RODATA(6) do not exist, which should be skipped, otherwise the segmentation fault will happen. Fixes: 7750e61 ("Support module memory layout change on Linux 6.4") Closes: #176 Reported-by: Naveen Chaudhary <[email protected]> Signed-off-by: Tao Liu <[email protected]>
The stack unwinding is for kernel addresses only. If non-kernel address encountered, it is usually a user space address, or non-address value like a function call parameter. So stopping stack unwinding at non-kernel address will decrease the invalid unwind results. Before: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () crash-utility#10 0xffff880100000001 in ?? () crash-utility#11 0xffff880169b3c010 in ?? () crash-utility#12 0x0000000000000040 in irq_stack_union () crash-utility#13 0xffff880169b3c058 in ?? () crash-utility#14 0xffff880169b3c048 in ?? () crash-utility#15 0xffff880169b3c050 in ?? () crash-utility#16 0x0000000000000000 in ?? () After: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule () ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () Cc: Sourabh Jain <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Mahesh J Salgaonkar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Lianbo Jiang <[email protected]> Cc: HAGIO KAZUHITO(萩尾 一仁) <[email protected]> Cc: Tao Liu <[email protected]> Cc: Alexey Makhalov <[email protected]> Signed-off-by: Tao Liu <[email protected]>
With Kernel commit 65c9cc9e2c14 ("x86/fred: Reserve space for the FRED stack frame") in Linux 6.9-rc1 and later, x86_64 will add extra padding ('TOP_OF_KERNEL_STACK_PADDING (2 * 8)', see: arch/x86/include/asm\ /thread_info.h,) for kernel stack when the CONFIG_X86_FRED is enabled. As a result, the pt_regs will be moved downwards due to the offset of padding, and the values of registers read from pt_regs will be incorrect as below. Without the patch: crash> bt PID: 2040 TASK: ffff969136fc4180 CPU: 16 COMMAND: "bash" #0 [ffffa996409aba38] machine_kexec at ffffffff9f881eb7 #1 [ffffa996409aba90] __crash_kexec at ffffffff9fa1e49e #2 [ffffa996409abb48] panic at ffffffff9f91a6cd #3 [ffffa996409abbc8] sysrq_handle_crash at ffffffffa0015076 #4 [ffffa996409abbd0] __handle_sysrq at ffffffffa0015640 #5 [ffffa996409abc00] write_sysrq_trigger at ffffffffa0015ce5 #6 [ffffa996409abc28] proc_reg_write at ffffffff9fd35bf5 #7 [ffffa996409abc40] vfs_write at ffffffff9fc8d462 #8 [ffffa996409abcd0] ksys_write at ffffffff9fc8dadf #9 [ffffa996409abd08] do_syscall_64 at ffffffffa0517429 #10 [ffffa996409abf40] entry_SYSCALL_64_after_hwframe at ffffffffa060012b [exception RIP: unknown or invalid address] RIP: 0000000000000246 RSP: 0000000000000000 RFLAGS: 0000002b RAX: 0000000000000002 RBX: 00007f9b9f5b13e0 RCX: 000055cee7486fb0 RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00007f9b9f4fda57 RBP: 0000000000000246 R8: 00007f9b9f4fda57 R9: ffffffffffffffda R10: 0000000000000000 R11: 00007f9b9f5b14e0 R12: 0000000000000002 R13: 000055cee7486fb0 R14: 0000000000000002 R15: 00007f9b9f5fb780 ORIG_RAX: 0000000000000033 CS: 7ffe65327978 SS: 0000 bt: WARNING: possibly bogus exception frame crash> With the patch: crash> bt PID: 2040 TASK: ffff969136fc4180 CPU: 16 COMMAND: "bash" #0 [ffffa996409aba38] machine_kexec at ffffffff9f881eb7 #1 [ffffa996409aba90] __crash_kexec at ffffffff9fa1e49e #2 [ffffa996409abb48] panic at ffffffff9f91a6cd #3 [ffffa996409abbc8] sysrq_handle_crash at ffffffffa0015076 #4 [ffffa996409abbd0] __handle_sysrq at ffffffffa0015640 #5 [ffffa996409abc00] write_sysrq_trigger at ffffffffa0015ce5 #6 [ffffa996409abc28] proc_reg_write at ffffffff9fd35bf5 #7 [ffffa996409abc40] vfs_write at ffffffff9fc8d462 #8 [ffffa996409abcd0] ksys_write at ffffffff9fc8dadf #9 [ffffa996409abd08] do_syscall_64 at ffffffffa0517429 #10 [ffffa996409abf40] entry_SYSCALL_64_after_hwframe at ffffffffa060012b RIP: 00007f9b9f4fda57 RSP: 00007ffe65327978 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f9b9f4fda57 RDX: 0000000000000002 RSI: 000055cee7486fb0 RDI: 0000000000000001 RBP: 000055cee7486fb0 R8: 0000000000000000 R9: 00007f9b9f5b14e0 R10: 00007f9b9f5b13e0 R11: 0000000000000246 R12: 0000000000000002 R13: 00007f9b9f5fb780 R14: 0000000000000002 R15: 00007f9b9f5f69e0 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b crash> Link: https://www.mail-archive.com/[email protected]/msg00754.html Signed-off-by: Lianbo Jiang <[email protected]> Signed-off-by: Tao Liu <[email protected]>
If we use crash to parse ramdump(Qcom phone device) rathen than vmcore. Start command should be like: crash vmlinux --kaslr=xxx DDRCS0_0.BIN@0x0000000080000000,... --machdep vabits_actual=39 Then We will see bt command show misleading backtrace information below: crash> bt 16930 PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 crash-utility#1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0 crash-utility#2 [ffffffc034c438b0] __kvm_nvhe_$d.2314 at 86c54c6004ceff80 crash-utility#3 [ffffffc034c43950] __kvm_nvhe_$d.2314 at 55d6f96003a7b120 crash-utility#4 [ffffffc034c439f0] __kvm_nvhe_$d.2314 at 9ccec46003a80a64 crash-utility#5 [ffffffc034c43ac0] __kvm_nvhe_$d.2314 at 8cf41e6003a945c4 crash-utility#6 [ffffffc034c43b10] __kvm_nvhe_$d.2314 at a8f181e00372c818 crash-utility#7 [ffffffc034c43b40] __kvm_nvhe_$d.2314 at 6dedde600372c0d0 crash-utility#8 [ffffffc034c43b90] __kvm_nvhe_$d.2314 at 62cc07e00373d0ac crash-utility#9 [ffffffc034c43c00] __kvm_nvhe_$d.2314 at 72fb1de00373bedc ... PC: 00000073f5294840 LR: 00000070d8f39ba4 SP: 00000070d4afd5d0 X29: 00000070d4afd600 X28: b4000071efcda7f0 X27: 00000070d4afe000 X26: 0000000000000000 X25: 00000070d9616000 X24: 0000000000000000 X23: 0000000000000000 X22: 0000000000000000 X21: 0000000000000000 X20: b40000728fd27520 X19: b40000728fd27550 X18: 000000702daba000 X17: 00000073f5294820 X16: 00000070d940f9d8 X15: 00000000000000bf X14: 0000000000000000 X13: 00000070d8ad2fac X12: b40000718fce5040 X11: 0000000000000000 X10: 0000000000000070 X9: 0000000000000001 X8: 0000000000000062 X7: 0000000000000020 X6: 0000000000000000 X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000 X2: 0000000000000002 X1: 0000000000000080 X0: b40000728fd27550 ORIG_X0: b40000728fd27550 SYSCALLNO: ffffffff PSTATE: 40001000 By checking the raw data below, will see the lr (fp+8) data show the pointer which already been replaced by PAC prefix. crash> bt -f PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 ffffffc034c437f0: ffffffc034c43850 6be732e004cf05a4 ffffffc034c43800: ffffffe006186108 a0ed07e004cf09c4 ffffffc034c43810: ffffff8a1a340000 ffffff8a8d343c00 ffffffc034c43820: ffffff89b3eada00 ffffff8b780db540 ffffffc034c43830: ffffff89b3eada00 0000000000000000 ffffffc034c43840: 0000000000000004 712b828118484a00 crash-utility#1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0 ffffffc034c43850: ffffffc034c438b0 86c54c6004ceff84 ffffffc034c43860: 000000708070f000 ffffffc034c43938 ffffffc034c43870: ffffff88bd822878 ffffff89b3eada00 ... So we check the CONFIG_ARM64_PTR_AUTH and CONFIG_ARM64_PTR_AUTH_KERNEL to double check if pac mechanism been enabled on this ramdump. Then we use vabits to figure it out. Fix then show the right backtrace below: crash> bt 16930 PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 crash-utility#1 [ffffffc034c43850] __schedule at ffffffe004cf05a0 crash-utility#2 [ffffffc034c438b0] preempt_schedule_common at ffffffe004ceff80 crash-utility#3 [ffffffc034c43950] unmap_page_range at ffffffe003a7b120 crash-utility#4 [ffffffc034c439f0] unmap_vmas at ffffffe003a80a64 crash-utility#5 [ffffffc034c43ac0] exit_mmap at ffffffe003a945c4 crash-utility#6 [ffffffc034c43b10] __mmput at ffffffe00372c818 crash-utility#7 [ffffffc034c43b40] mmput at ffffffe00372c0d0 crash-utility#8 [ffffffc034c43b90] exit_mm at ffffffe00373d0ac crash-utility#9 [ffffffc034c43c00] do_exit at ffffffe00373bedc PC: 00000073f5294840 LR: 00000070d8f39ba4 SP: 00000070d4afd5d0 X29: 00000070d4afd600 X28: b4000071efcda7f0 X27: 00000070d4afe000 X26: 0000000000000000 X25: 00000070d9616000 X24: 0000000000000000 X23: 0000000000000000 X22: 0000000000000000 X21: 0000000000000000 X20: b40000728fd27520 X19: b40000728fd27550 X18: 000000702daba000 X17: 00000073f5294820 X16: 00000070d940f9d8 X15: 00000000000000bf X14: 0000000000000000 X13: 00000070d8ad2fac X12: b40000718fce5040 X11: 0000000000000000 X10: 0000000000000070 X9: 0000000000000001 X8: 0000000000000062 X7: 0000000000000020 X6: 0000000000000000 X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000 X2: 0000000000000002 X1: 0000000000000080 X0: b40000728fd27550 ORIG_X0: b40000728fd27550 SYSCALLNO: ffffffff PSTATE: 40001000 Let's use GENMASK to replace the pac pointer to fix it. gki related commit url here: https://lore.kernel.org/all/[email protected]/
If we use crash to parse ramdump(Qcom phone device) rathen than vmcore. Start command should be like: crash vmlinux --kaslr=xxx DDRCS0_0.BIN@0x0000000080000000,... --machdep vabits_actual=39 Then We will see bt command show misleading backtrace information below: crash> bt 16930 PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 crash-utility#1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0 crash-utility#2 [ffffffc034c438b0] __kvm_nvhe_$d.2314 at 86c54c6004ceff80 crash-utility#3 [ffffffc034c43950] __kvm_nvhe_$d.2314 at 55d6f96003a7b120 crash-utility#4 [ffffffc034c439f0] __kvm_nvhe_$d.2314 at 9ccec46003a80a64 crash-utility#5 [ffffffc034c43ac0] __kvm_nvhe_$d.2314 at 8cf41e6003a945c4 crash-utility#6 [ffffffc034c43b10] __kvm_nvhe_$d.2314 at a8f181e00372c818 crash-utility#7 [ffffffc034c43b40] __kvm_nvhe_$d.2314 at 6dedde600372c0d0 crash-utility#8 [ffffffc034c43b90] __kvm_nvhe_$d.2314 at 62cc07e00373d0ac crash-utility#9 [ffffffc034c43c00] __kvm_nvhe_$d.2314 at 72fb1de00373bedc ... PC: 00000073f5294840 LR: 00000070d8f39ba4 SP: 00000070d4afd5d0 X29: 00000070d4afd600 X28: b4000071efcda7f0 X27: 00000070d4afe000 X26: 0000000000000000 X25: 00000070d9616000 X24: 0000000000000000 X23: 0000000000000000 X22: 0000000000000000 X21: 0000000000000000 X20: b40000728fd27520 X19: b40000728fd27550 X18: 000000702daba000 X17: 00000073f5294820 X16: 00000070d940f9d8 X15: 00000000000000bf X14: 0000000000000000 X13: 00000070d8ad2fac X12: b40000718fce5040 X11: 0000000000000000 X10: 0000000000000070 X9: 0000000000000001 X8: 0000000000000062 X7: 0000000000000020 X6: 0000000000000000 X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000 X2: 0000000000000002 X1: 0000000000000080 X0: b40000728fd27550 ORIG_X0: b40000728fd27550 SYSCALLNO: ffffffff PSTATE: 40001000 By checking the raw data below, will see the lr (fp+8) data show the pointer which already been replaced by PAC prefix. crash> bt -f PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 ffffffc034c437f0: ffffffc034c43850 6be732e004cf05a4 ffffffc034c43800: ffffffe006186108 a0ed07e004cf09c4 ffffffc034c43810: ffffff8a1a340000 ffffff8a8d343c00 ffffffc034c43820: ffffff89b3eada00 ffffff8b780db540 ffffffc034c43830: ffffff89b3eada00 0000000000000000 ffffffc034c43840: 0000000000000004 712b828118484a00 crash-utility#1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0 ffffffc034c43850: ffffffc034c438b0 86c54c6004ceff84 ffffffc034c43860: 000000708070f000 ffffffc034c43938 ffffffc034c43870: ffffff88bd822878 ffffff89b3eada00 ... So we check the CONFIG_ARM64_PTR_AUTH and CONFIG_ARM64_PTR_AUTH_KERNEL to double check if pac mechanism been enabled on this ramdump. Then we use vabits to figure it out. Fix then show the right backtrace below: crash> bt 16930 PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 crash-utility#1 [ffffffc034c43850] __schedule at ffffffe004cf05a0 crash-utility#2 [ffffffc034c438b0] preempt_schedule_common at ffffffe004ceff80 crash-utility#3 [ffffffc034c43950] unmap_page_range at ffffffe003a7b120 crash-utility#4 [ffffffc034c439f0] unmap_vmas at ffffffe003a80a64 crash-utility#5 [ffffffc034c43ac0] exit_mmap at ffffffe003a945c4 crash-utility#6 [ffffffc034c43b10] __mmput at ffffffe00372c818 crash-utility#7 [ffffffc034c43b40] mmput at ffffffe00372c0d0 crash-utility#8 [ffffffc034c43b90] exit_mm at ffffffe00373d0ac crash-utility#9 [ffffffc034c43c00] do_exit at ffffffe00373bedc PC: 00000073f5294840 LR: 00000070d8f39ba4 SP: 00000070d4afd5d0 X29: 00000070d4afd600 X28: b4000071efcda7f0 X27: 00000070d4afe000 X26: 0000000000000000 X25: 00000070d9616000 X24: 0000000000000000 X23: 0000000000000000 X22: 0000000000000000 X21: 0000000000000000 X20: b40000728fd27520 X19: b40000728fd27550 X18: 000000702daba000 X17: 00000073f5294820 X16: 00000070d940f9d8 X15: 00000000000000bf X14: 0000000000000000 X13: 00000070d8ad2fac X12: b40000718fce5040 X11: 0000000000000000 X10: 0000000000000070 X9: 0000000000000001 X8: 0000000000000062 X7: 0000000000000020 X6: 0000000000000000 X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000 X2: 0000000000000002 X1: 0000000000000080 X0: b40000728fd27550 ORIG_X0: b40000728fd27550 SYSCALLNO: ffffffff PSTATE: 40001000 Let's use GENMASK to replace the pac pointer to fix it. gki related commit url here: https://lore.kernel.org/all/[email protected]/ Signed-off-by: bevis_chen <[email protected]>
If we use crash to parse ramdump(Qcom phone device) rathen than vmcore. Start command should be like: crash vmlinux --kaslr=xxx DDRCS0_0.BIN@0x0000000080000000,... --machdep vabits_actual=39 Then We will see bt command show misleading backtrace information below: crash> bt 16930 PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 crash-utility#1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0 crash-utility#2 [ffffffc034c438b0] __kvm_nvhe_$d.2314 at 86c54c6004ceff80 crash-utility#3 [ffffffc034c43950] __kvm_nvhe_$d.2314 at 55d6f96003a7b120 crash-utility#4 [ffffffc034c439f0] __kvm_nvhe_$d.2314 at 9ccec46003a80a64 crash-utility#5 [ffffffc034c43ac0] __kvm_nvhe_$d.2314 at 8cf41e6003a945c4 crash-utility#6 [ffffffc034c43b10] __kvm_nvhe_$d.2314 at a8f181e00372c818 crash-utility#7 [ffffffc034c43b40] __kvm_nvhe_$d.2314 at 6dedde600372c0d0 crash-utility#8 [ffffffc034c43b90] __kvm_nvhe_$d.2314 at 62cc07e00373d0ac crash-utility#9 [ffffffc034c43c00] __kvm_nvhe_$d.2314 at 72fb1de00373bedc ... PC: 00000073f5294840 LR: 00000070d8f39ba4 SP: 00000070d4afd5d0 X29: 00000070d4afd600 X28: b4000071efcda7f0 X27: 00000070d4afe000 X26: 0000000000000000 X25: 00000070d9616000 X24: 0000000000000000 X23: 0000000000000000 X22: 0000000000000000 X21: 0000000000000000 X20: b40000728fd27520 X19: b40000728fd27550 X18: 000000702daba000 X17: 00000073f5294820 X16: 00000070d940f9d8 X15: 00000000000000bf X14: 0000000000000000 X13: 00000070d8ad2fac X12: b40000718fce5040 X11: 0000000000000000 X10: 0000000000000070 X9: 0000000000000001 X8: 0000000000000062 X7: 0000000000000020 X6: 0000000000000000 X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000 X2: 0000000000000002 X1: 0000000000000080 X0: b40000728fd27550 ORIG_X0: b40000728fd27550 SYSCALLNO: ffffffff PSTATE: 40001000 By checking the raw data below, will see the lr (fp+8) data show the pointer which already been replaced by PAC prefix. crash> bt -f PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 ffffffc034c437f0: ffffffc034c43850 6be732e004cf05a4 ffffffc034c43800: ffffffe006186108 a0ed07e004cf09c4 ffffffc034c43810: ffffff8a1a340000 ffffff8a8d343c00 ffffffc034c43820: ffffff89b3eada00 ffffff8b780db540 ffffffc034c43830: ffffff89b3eada00 0000000000000000 ffffffc034c43840: 0000000000000004 712b828118484a00 crash-utility#1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0 ffffffc034c43850: ffffffc034c438b0 86c54c6004ceff84 ffffffc034c43860: 000000708070f000 ffffffc034c43938 ffffffc034c43870: ffffff88bd822878 ffffff89b3eada00 ... So we check the CONFIG_ARM64_PTR_AUTH and CONFIG_ARM64_PTR_AUTH_KERNEL to double check if pac mechanism been enabled on this ramdump. Then we use vabits to figure it out. Fix then show the right backtrace below: crash> bt 16930 PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 crash-utility#1 [ffffffc034c43850] __schedule at ffffffe004cf05a0 crash-utility#2 [ffffffc034c438b0] preempt_schedule_common at ffffffe004ceff80 crash-utility#3 [ffffffc034c43950] unmap_page_range at ffffffe003a7b120 crash-utility#4 [ffffffc034c439f0] unmap_vmas at ffffffe003a80a64 crash-utility#5 [ffffffc034c43ac0] exit_mmap at ffffffe003a945c4 crash-utility#6 [ffffffc034c43b10] __mmput at ffffffe00372c818 crash-utility#7 [ffffffc034c43b40] mmput at ffffffe00372c0d0 crash-utility#8 [ffffffc034c43b90] exit_mm at ffffffe00373d0ac crash-utility#9 [ffffffc034c43c00] do_exit at ffffffe00373bedc PC: 00000073f5294840 LR: 00000070d8f39ba4 SP: 00000070d4afd5d0 X29: 00000070d4afd600 X28: b4000071efcda7f0 X27: 00000070d4afe000 X26: 0000000000000000 X25: 00000070d9616000 X24: 0000000000000000 X23: 0000000000000000 X22: 0000000000000000 X21: 0000000000000000 X20: b40000728fd27520 X19: b40000728fd27550 X18: 000000702daba000 X17: 00000073f5294820 X16: 00000070d940f9d8 X15: 00000000000000bf X14: 0000000000000000 X13: 00000070d8ad2fac X12: b40000718fce5040 X11: 0000000000000000 X10: 0000000000000070 X9: 0000000000000001 X8: 0000000000000062 X7: 0000000000000020 X6: 0000000000000000 X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000 X2: 0000000000000002 X1: 0000000000000080 X0: b40000728fd27550 ORIG_X0: b40000728fd27550 SYSCALLNO: ffffffff PSTATE: 40001000 Let's use GENMASK to replace the pac pointer to fix it. gki related commit url here: https://lore.kernel.org/all/[email protected]/ Signed-off-by: bevis_chen <[email protected]>
For ramdump(Qcom phone device) case with the kernel option CONFIG_ARM64_PTR_AUTH_KERNEL enabled, the bt command may print incorrect stacktrace as below: crash> bt 16930 PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 #1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0 #2 [ffffffc034c438b0] __kvm_nvhe_$d.2314 at 86c54c6004ceff80 #3 [ffffffc034c43950] __kvm_nvhe_$d.2314 at 55d6f96003a7b120 ... PC: 00000073f5294840 LR: 00000070d8f39ba4 SP: 00000070d4afd5d0 X29: 00000070d4afd600 X28: b4000071efcda7f0 X27: 00000070d4afe000 X26: 0000000000000000 X25: 00000070d9616000 X24: 0000000000000000 X23: 0000000000000000 X22: 0000000000000000 X21: 0000000000000000 X20: b40000728fd27520 X19: b40000728fd27550 X18: 000000702daba000 X17: 00000073f5294820 X16: 00000070d940f9d8 X15: 00000000000000bf X14: 0000000000000000 X13: 00000070d8ad2fac X12: b40000718fce5040 X11: 0000000000000000 X10: 0000000000000070 X9: 0000000000000001 X8: 0000000000000062 X7: 0000000000000020 X6: 0000000000000000 X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000 X2: 0000000000000002 X1: 0000000000000080 X0: b40000728fd27550 ORIG_X0: b40000728fd27550 SYSCALLNO: ffffffff PSTATE: 40001000 Crash tool can not get the KERNELPACMASK value from the vmcoreinfo, need to calculate its value based on the vabits. With the patch: crash> bt 16930 PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 #1 [ffffffc034c43850] __schedule at ffffffe004cf05a0 #2 [ffffffc034c438b0] preempt_schedule_common at ffffffe004ceff80 #3 [ffffffc034c43950] unmap_page_range at ffffffe003a7b120 #4 [ffffffc034c439f0] unmap_vmas at ffffffe003a80a64 #5 [ffffffc034c43ac0] exit_mmap at ffffffe003a945c4 #6 [ffffffc034c43b10] __mmput at ffffffe00372c818 #7 [ffffffc034c43b40] mmput at ffffffe00372c0d0 #8 [ffffffc034c43b90] exit_mm at ffffffe00373d0ac #9 [ffffffc034c43c00] do_exit at ffffffe00373bedc PC: 00000073f5294840 LR: 00000070d8f39ba4 SP: 00000070d4afd5d0 X29: 00000070d4afd600 X28: b4000071efcda7f0 X27: 00000070d4afe000 X26: 0000000000000000 X25: 00000070d9616000 X24: 0000000000000000 X23: 0000000000000000 X22: 0000000000000000 X21: 0000000000000000 X20: b40000728fd27520 X19: b40000728fd27550 X18: 000000702daba000 X17: 00000073f5294820 X16: 00000070d940f9d8 X15: 00000000000000bf X14: 0000000000000000 X13: 00000070d8ad2fac X12: b40000718fce5040 X11: 0000000000000000 X10: 0000000000000070 X9: 0000000000000001 X8: 0000000000000062 X7: 0000000000000020 X6: 0000000000000000 X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000 X2: 0000000000000002 X1: 0000000000000080 X0: b40000728fd27550 ORIG_X0: b40000728fd27550 SYSCALLNO: ffffffff PSTATE: 40001000 Related kernel commits: 689eae42afd7 ("arm64: mask PAC bits of __builtin_return_address") de1702f65feb ("arm64: move PAC masks to <asm/pointer_auth.h>") Signed-off-by: bevis_chen <[email protected]>
The stack unwinding is for kernel addresses only. If non-kernel address encountered, it is usually a user space address, or non-address value like a function call parameter. So stopping stack unwinding at non-kernel address will decrease the invalid unwind results. Before: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () crash-utility#10 0xffff880100000001 in ?? () crash-utility#11 0xffff880169b3c010 in ?? () crash-utility#12 0x0000000000000040 in irq_stack_union () crash-utility#13 0xffff880169b3c058 in ?? () crash-utility#14 0xffff880169b3c048 in ?? () crash-utility#15 0xffff880169b3c050 in ?? () crash-utility#16 0x0000000000000000 in ?? () After: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule () ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> Cc: Sourabh Jain <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Mahesh J Salgaonkar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Lianbo Jiang <[email protected]> Cc: HAGIO KAZUHITO(萩尾 一仁) <[email protected]> Cc: Tao Liu <[email protected]> Cc: Alexey Makhalov <[email protected]> Signed-off-by: Tao Liu <[email protected]>
See the following stack trace: (gdb) bt #0 0x00005635ac2b166b in arm64_unwind_frame (frame=0x7ffdaf35cb70, bt=0x7ffdaf35d430) at arm64.c:2821 #1 arm64_back_trace_cmd (bt=0x7ffdaf35d430) at arm64.c:3306 #2 0x00005635ac27b108 in back_trace (bt=bt@entry=0x7ffdaf35d430) at kernel.c:3239 #3 0x00005635ac2880ae in cmd_bt () at kernel.c:2863 #4 0x00005635ac1f16dc in exec_command () at main.c:893 #5 0x00005635ac1f192a in main_loop () at main.c:840 #6 0x00005635ac50df81 in captured_main (data=<optimized out>) at main.c:1284 #7 gdb_main (args=<optimized out>) at main.c:1313 #8 0x00005635ac50e000 in gdb_main_entry (argc=<optimized out>, argv=<optimized out>) at main.c:1338 #9 0x00005635ac1ea2a5 in main (argc=5, argv=0x7ffdaf35dde8) at main.c:721 The issue may be encountered when thread_union symbol not found in vmlinux due to compiling optimization. This patch will try the following 2 methods to get the irq_stack_size when thread_union symbol unavailable: 1. change the thread_shift when KASAN is enabled and with vmcoreinfo. In arm64/include/asm/memory.h: #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) ... #define IRQ_STACK_SIZE THREAD_SIZE Since enabling the KASAN will affect the final value, this patch reset IRQ_STACK_SIZE according to the calculation process in kernel code. 2. Try getting the value from kernel code disassembly, to get THREAD_SHIFT directly from tbnz instruction. In arch/arm64/kernel/entry.S: .macro kernel_ventry, el:req, ht:req, regsize:req, label:req ... add sp, sp, x0 sub x0, sp, x0 tbnz x0, #THREAD_SHIFT, 0f $ gdb vmlinux (gdb) disass vectors Dump of assembler code for function vectors: ... 0xffff800080010804 <+4>: add sp, sp, x0 0xffff800080010808 <+8>: sub x0, sp, x0 0xffff80008001080c <+12>: tbnz w0, #16, 0xffff80008001081c <vectors+28> Signed-off-by: yeping.zheng <[email protected]> Improved-by: Tao Liu <[email protected]>
See the following stack trace: (gdb) bt #0 0x00005635ac2b166b in arm64_unwind_frame (frame=0x7ffdaf35cb70, bt=0x7ffdaf35d430) at arm64.c:2821 #1 arm64_back_trace_cmd (bt=0x7ffdaf35d430) at arm64.c:3306 #2 0x00005635ac27b108 in back_trace (bt=bt@entry=0x7ffdaf35d430) at kernel.c:3239 #3 0x00005635ac2880ae in cmd_bt () at kernel.c:2863 #4 0x00005635ac1f16dc in exec_command () at main.c:893 #5 0x00005635ac1f192a in main_loop () at main.c:840 #6 0x00005635ac50df81 in captured_main (data=<optimized out>) at main.c:1284 #7 gdb_main (args=<optimized out>) at main.c:1313 #8 0x00005635ac50e000 in gdb_main_entry (argc=<optimized out>, argv=<optimized out>) at main.c:1338 #9 0x00005635ac1ea2a5 in main (argc=5, argv=0x7ffdaf35dde8) at main.c:721 The issue may be encountered when thread_union symbol not found in vmlinux due to compiling optimization. This patch will try the following 2 methods to get the irq_stack_size when thread_union symbol unavailable: 1. change the thread_shift when KASAN is enabled and with vmcoreinfo. In arm64/include/asm/memory.h: #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) ... #define IRQ_STACK_SIZE THREAD_SIZE Since enabling the KASAN will affect the final value, this patch reset IRQ_STACK_SIZE according to the calculation process in kernel code. 2. Try getting the value from kernel code disassembly, to get THREAD_SHIFT directly from tbnz instruction. In arch/arm64/kernel/entry.S: .macro kernel_ventry, el:req, ht:req, regsize:req, label:req ... add sp, sp, x0 sub x0, sp, x0 tbnz x0, #THREAD_SHIFT, 0f $ gdb vmlinux (gdb) disass vectors Dump of assembler code for function vectors: ... 0xffff800080010804 <+4>: add sp, sp, x0 0xffff800080010808 <+8>: sub x0, sp, x0 0xffff80008001080c <+12>: tbnz w0, #16, 0xffff80008001081c <vectors+28> Signed-off-by: yeping.zheng <[email protected]> Improved-by: Tao Liu <[email protected]>
The stack unwinding is for kernel addresses only. If non-kernel address encountered, it is usually a user space address, or non-address value like a function call parameter. So stopping stack unwinding at non-kernel address will decrease the invalid unwind results. Before: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> crash-utility#9 0x00007f0449407923 in ?? () crash-utility#10 0xffff880100000001 in ?? () crash-utility#11 0xffff880169b3c010 in ?? () crash-utility#12 0x0000000000000040 in irq_stack_union () crash-utility#13 0xffff880169b3c058 in ?? () crash-utility#14 0xffff880169b3c048 in ?? () crash-utility#15 0xffff880169b3c050 in ?? () crash-utility#16 0x0000000000000000 in ?? () After: crash> gdb bt #0 0xffffffff816a8f65 in context_switch ... crash-utility#1 __schedule () ... crash-utility#2 0xffffffff816a94e9 in schedule () ... crash-utility#3 0xffffffff816a86fd in schedule_hrtimeout_range_clock ... crash-utility#4 0xffffffff816a8733 in schedule_hrtimeout_range ... crash-utility#5 0xffffffff8124bb7e in ep_poll ... crash-utility#6 0xffffffff8124d00d in SYSC_epoll_wait ... crash-utility#7 SyS_epoll_wait ... crash-utility#8 <signal handler called> Cc: Sourabh Jain <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Mahesh J Salgaonkar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Lianbo Jiang <[email protected]> Cc: HAGIO KAZUHITO(萩尾 一仁) <[email protected]> Cc: Tao Liu <[email protected]> Cc: Alexey Makhalov <[email protected]> Signed-off-by: Tao Liu <[email protected]>
Previously, "retq" is used to determine the end of a function, so the end of framesize calculation. However "ret" might be outputted by gdb rather than "retq", as a result, the framesize is returned incorrectly, and bogus stack trace will be outputted. Without the patch: $ crash -d 3 vmcore vmlinux crash> bt 0xffffffff92da7545 <copy_process+5>: push %rbp [framesize: 8] ... 0xffffffff92da7561 <copy_process+33>: sub $0x238,%rsp [framesize: 624] ... 0xffffffff92da776a <copy_process+554>: pop %r15 [framesize: 8] 0xffffffff92da776c <copy_process+556>: pop %rbp [framesize: 0] 0xffffffff92da776d <copy_process+557>: ret crash> bt -D dump framesize_cache_entries: ... [ 3]: ffffffff92dadcbd 0 CF (copy_process+26493) crash> bt ... #9 [ffff888263157bc0] copy_process at ffffffff92dadcbd #10 [ffff888263157d20] __mutex_init at ffffffff92ed8dd5 #11 [ffff888263157d38] __alloc_file at ffffffff93458397 #12 [ffff888263157d60] alloc_empty_file at ffffffff934585d2 #13 [ffff888263157da8] __alloc_fd at ffffffff934b5ead #14 [ffff888263157e38] _do_fork at ffffffff92dae7a1 #15 [ffff888263157f28] do_syscall_64 at ffffffff92c085f4 Stack #10 ~ #13 are bogus and misleading. With the patch: ... 0xffffffff92da776d <copy_process+557>: ret [framesize restored to: 624] crash> bt -D dump ... [ 3]: ffffffff92dadcbd 624 CF (copy_process+26493) crash> bt ... #9 [ffff888263157bc0] copy_process at ffffffff92dadcbd #10 [ffff888263157e38] _do_fork at ffffffff92dae7a1 #11 [ffff888263157f28] do_syscall_64 at ffffffff92c085f4 Signed-off-by: Tao Liu <[email protected]>
…usly There is an issue that, for kernel modules, "dis -rl" fails to display modules code line number data after execute "bt" command in crash. Without the patch: crsah> mod -S crash> bt PID: 1500 TASK: ff2bd8b093524000 CPU: 16 COMMAND: "lpfc_worker_0" #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3 ...snip... crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc] ...snip... crash> dis -rl ffffffffc0f60f82 0xffffffffc0f60eb0 <lpfc_nlp_get>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffffc0f60eb5 <lpfc_nlp_get+5>: push %rbp 0xffffffffc0f60eb6 <lpfc_nlp_get+6>: push %rbx 0xffffffffc0f60eb7 <lpfc_nlp_get+7>: test %rdi,%rdi With the patch: crash> mod -S crash> bt PID: 1500 TASK: ff2bd8b093524000 CPU: 16 COMMAND: "lpfc_worker_0" #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3 ...snip... crash-utility#8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc] ...snip... crash> dis -rl ffffffffc0f60f82 /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756 0xffffffffc0f60eb0 <lpfc_nlp_get>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759 0xffffffffc0f60eb5 <lpfc_nlp_get+5>: push %rbp The root cause is, after kernel module been loaded by mod command, the symtable is not expanded in gdb side. crash bt or dis command will trigger such an expansion. However the symtable expansion is different for the 2 commands: The stack trace of "dis -rl" for symtable expanding: #0 0x00000000008d8d9f in add_compunit_symtab_to_objfile ... crash-utility#1 0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ... crash-utility#2 0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ... crash-utility#3 0x000000000077e8e9 in process_full_comp_unit ... crash-utility#4 process_queue ... crash-utility#5 dw2_do_instantiate_symtab ... crash-utility#6 0x000000000077ed67 in dw2_instantiate_symtab ... crash-utility#7 0x000000000077f75e in dw2_expand_all_symtabs ... crash-utility#8 0x00000000008f254d in gdb_get_line_number ... crash-utility#9 0x00000000008f22af in gdb_command_funnel_1 ... crash-utility#10 0x00000000008f2003 in gdb_command_funnel ... crash-utility#11 0x00000000005b7f02 in gdb_interface ... crash-utility#12 0x00000000005f5bd8 in get_line_number ... crash-utility#13 0x000000000059e574 in cmd_dis ... The stack trace of "bt" for symtable expanding: #0 0x00000000008d8d9f in add_compunit_symtab_to_objfile ... crash-utility#1 0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ... crash-utility#2 0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ... crash-utility#3 0x000000000077e8e9 in process_full_comp_unit ... crash-utility#4 process_queue ... crash-utility#5 dw2_do_instantiate_symtab ... crash-utility#6 0x000000000077ed67 in dw2_instantiate_symtab ... crash-utility#7 0x000000000077f8ed in dw2_lookup_symbol ... crash-utility#8 0x00000000008e6d03 in lookup_symbol_via_quick_fns ... crash-utility#9 0x00000000008e7153 in lookup_symbol_in_objfile ... crash-utility#10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ... crash-utility#11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ... crash-utility#12 0x00000000008e754e in lookup_global_or_static_symbol ... crash-utility#13 0x00000000008e75da in lookup_static_symbol ... crash-utility#14 0x00000000008e632c in lookup_symbol_aux ... crash-utility#15 0x00000000008e5a7a in lookup_symbol_in_language ... crash-utility#16 0x00000000008e5b30 in lookup_symbol ... crash-utility#17 0x00000000008f2a4a in gdb_get_datatype ... crash-utility#18 0x00000000008f22c0 in gdb_command_funnel_1 ... crash-utility#19 0x00000000008f2003 in gdb_command_funnel ... crash-utility#20 0x00000000005b7f02 in gdb_interface ... crash-utility#21 0x00000000005f8a9f in datatype_info ... crash-utility#22 0x0000000000599947 in cpu_map_size ... crash-utility#23 0x00000000005a975d in get_cpus_online ... crash-utility#24 0x0000000000637a8b in diskdump_get_prstatus_percpu ... crash-utility#25 0x000000000062f0e4 in get_netdump_regs_x86_64 ... crash-utility#26 0x000000000059fe68 in back_trace ... crash-utility#27 0x00000000005ab1cb in cmd_bt ... For the stacktrace of "dis -rl", it calls dw2_expand_all_symtabs() to expand all symtable of the objfile, or "*.ko.debug" in our case. However for the stacktrace of "bt", it doesn't expand all, but only a subset of symtable which is enough to find a symbol by dw2_lookup_symbol(). As a result, the objfile->compunit_symtabs, which is the head of a single linked list of struct compunit_symtab, is not NULL but didn't contain all symtables. It will not be reinitialized in gdb_get_line_number() by "dis -rl" because !objfile_has_full_symbols(objfile) check will fail, so it cannot display the proper code line number data. Since objfile_has_full_symbols(objfile) check cannot ensure all symbols been expanded, this patch add a new member as a flag for struct objfile to record if all symbols have been expanded. The flag will be set only ofter expand_all_symtabs been called. Signed-off-by: Tao Liu <[email protected]>
Previously the value of bt->bptr is not checked, which may led to a wrong prev_sp and framesize. As a result, bt->stackbuf[] will be accessed out of range, and segfault. Before: crash> set debug 1 crash> bt ...snip... --- <NMI exception stack> --- crash-utility#8 [ffffffff9a603e10] __switch_to_asm at ffffffff99800214 rsp: ffffffff9a603e10 textaddr: ffffffff99800214 -> spo: 0 bpo: 0 spr: 0 bpr: 0 type: 0 end: 0 crash-utility#9 [ffffffff9a603e40] __schedule at ffffffff9960dfb1 rsp: ffffffff9a603e40 textaddr: ffffffff9960dfb1 -> spo: 16 bpo: -16 spr: 4 bpr: 1 type: 0 end: 0 rsp: ffffffff9a603e40 rbp: ffffb9ca076e7ca8 prev_sp: ffffb9ca076e7cb8 framesize: 1829650024 Segmentation fault (core dumped) (gdb) p/x bt->stackbase $1 = 0xffffffff9a600000 (gdb) p/x bt->stacktop $2 = 0xffffffff9a604000 After: crash> set debug 1 crash> bt ...snip... --- <NMI exception stack> --- crash-utility#8 [ffffffff9a603e10] __switch_to_asm at ffffffff99800214 rsp: ffffffff9a603e10 textaddr: ffffffff99800214 -> spo: 0 bpo: 0 spr: 0 bpr: 0 type: 0 end: 0 crash-utility#9 [ffffffff9a603e40] __schedule at ffffffff9960dfb1 rsp: ffffffff9a603e40 textaddr: ffffffff9960dfb1 -> spo: 16 bpo: -16 spr: 4 bpr: 1 type: 0 end: 0 crash-utility#10 [ffffffff9a603e98] schedule_idle at ffffffff9960e87c rsp: ffffffff9a603e98 textaddr: ffffffff9960e87c -> spo: 8 bpo: 0 spr: 5 bpr: 0 type: 0 end: 0 rsp: ffffffff9a603e98 prev_sp: ffffffff9a603ea8 framesize: 0 ...snip... Check bt->bptr value before calculate framesize. Only bt->bptr within the range of bt->stackbase and bt->stacktop will be regarded as valid. Signed-off-by: Tao Liu <[email protected]>
This patch introduces per-cpu IRQ stacks for RISCV64 to let "bt" do backtrace on it and 'bt -E' search eframes on it, and the 'help -m' command displays the addresses of each per-cpu IRQ stack. TEST: a vmcore dumped via hacking the handle_irq_event_percpu() ( Why not using lkdtm INT_HW_IRQ_EN EXCEPTION ? There is a deadlock[1] in crash_kexec path if use that) crash> bt PID: 0 TASK: ffffffff8140db00 CPU: 0 COMMAND: "swapper/0" #0 [ff20000000003e60] __handle_irq_event_percpu at ffffffff8006462e crash-utility#1 [ff20000000003ed0] handle_irq_event_percpu at ffffffff80064702 crash-utility#2 [ff20000000003ef0] handle_irq_event at ffffffff8006477c crash-utility#3 [ff20000000003f20] handle_fasteoi_irq at ffffffff80068664 crash-utility#4 [ff20000000003f50] generic_handle_domain_irq at ffffffff80063988 crash-utility#5 [ff20000000003f60] plic_handle_irq at ffffffff8046633e crash-utility#6 [ff20000000003fb0] generic_handle_domain_irq at ffffffff80063988 crash-utility#7 [ff20000000003fc0] riscv_intc_irq at ffffffff80465f8e crash-utility#8 [ff20000000003fd0] handle_riscv_irq at ffffffff808361e8 PC: ffffffff80837314 [default_idle_call+50] RA: ffffffff80837310 [default_idle_call+46] SP: ffffffff81403da0 CAUSE: 8000000000000009 epc : ffffffff80837314 ra : ffffffff80837310 sp : ffffffff81403da0 gp : ffffffff814ef848 tp : ffffffff8140db00 t0 : ff2000000004bb18 t1 : 0000000000032c73 t2 : ffffffff81200a48 s0 : ffffffff81403db0 s1 : 0000000000000000 a0 : 0000000000000004 a1 : 0000000000000000 a2 : ff6000009f1e7000 a3 : 0000000000002304 a4 : ffffffff80c1c2d8 a5 : 0000000000000000 a6 : ff6000001fe01958 a7 : 00002496ea89dbf1 s2 : ffffffff814f0220 s3 : 0000000000000001 s4 : 000000000000003f s5 : ffffffff814f03d8 s6 : 0000000000000000 s7 : ffffffff814f00d0 s8 : ffffffff81526f10 s9 : ffffffff80c1d880 s10: 0000000000000000 s11: 0000000000000001 t3 : 0000000000003392 t4 : 0000000000000000 t5 : 0000000000000000 t6 : 0000000000000040 status: 0000000200000120 badaddr: 0000000000000000 cause: 8000000000000009 orig_a0: ffffffff80837310 --- <IRQ stack> --- crash-utility#9 [ffffffff81403da0] default_idle_call at ffffffff80837314 crash-utility#10 [ffffffff81403db0] do_idle at ffffffff8004d0a0 crash-utility#11 [ffffffff81403e40] cpu_startup_entry at ffffffff8004d21e crash-utility#12 [ffffffff81403e60] kernel_init at ffffffff8083746a crash-utility#13 [ffffffff81403e70] arch_post_acpi_subsys_init at ffffffff80a006d8 crash-utility#14 [ffffffff81403e80] console_on_rootfs at ffffffff80a00c92 crash> crash> bt -E CPU 0 IRQ STACK: KERNEL-MODE EXCEPTION FRAME AT: ff20000000003a48 PC: ffffffff8006462e [__handle_irq_event_percpu+30] RA: ffffffff80064702 [handle_irq_event_percpu+18] SP: ff20000000003e60 CAUSE: 000000000000000d epc : ffffffff8006462e ra : ffffffff80064702 sp : ff20000000003e60 gp : ffffffff814ef848 tp : ffffffff8140db00 t0 : 0000000000046600 t1 : ffffffff80836464 t2 : ffffffff81200a48 s0 : ff20000000003ed0 s1 : 0000000000000000 a0 : 0000000000000000 a1 : 0000000000000118 a2 : 0000000000000052 a3 : 0000000000000000 a4 : 0000000000000000 a5 : 0000000000010001 a6 : ff6000001fe01958 a7 : 00002496ea89dbf1 s2 : ff60000000941ab0 s3 : ffffffff814a0658 s4 : ff60000000089230 s5 : ffffffff814a0518 s6 : ffffffff814a0620 s7 : ffffffff80e5f0f8 s8 : ffffffff80fc50b0 s9 : ffffffff80c1d880 s10: 0000000000000000 s11: 0000000000000001 t3 : 0000000000003392 t4 : 0000000000000000 t5 : 0000000000000000 t6 : 0000000000000040 status: 0000000200000100 badaddr: 0000000000000078 cause: 000000000000000d orig_a0: ff20000000003ea0 CPU 1 IRQ STACK: (none found) crash> crash> help -m <snip> machspec: ced1e0 irq_stack_size: 16384 irq_stacks[0]: ff20000000000000 irq_stacks[1]: ff20000000008000 crash> [1]: https://lore.kernel.org/linux-riscv/[email protected]/ Signed-off-by: Song Shuai <[email protected]>
- Add basic support for the 'bt' command. - LooongArch64: Add 'bt -f' command support - LoongArch64: Add 'bt -l' command support E.g. With this patch: crash> bt PID: 1832 TASK: 900000009a552100 CPU: 11 COMMAND: "bash" #0 [900000009beffb60] __cpu_possible_mask at 90000000014168f0 crash-utility#1 [900000009beffb60] __crash_kexec at 90000000002e7660 crash-utility#2 [900000009beffcd0] panic at 9000000000f0ec28 crash-utility#3 [900000009beffd60] sysrq_handle_crash at 9000000000a2c188 crash-utility#4 [900000009beffd70] __handle_sysrq at 9000000000a2c85c crash-utility#5 [900000009beffdc0] write_sysrq_trigger at 9000000000a2ce10 crash-utility#6 [900000009beffde0] proc_reg_write at 90000000004ce454 crash-utility#7 [900000009beffe00] vfs_write at 900000000043e838 crash-utility#8 [900000009beffe40] ksys_write at 900000000043eb58 crash-utility#9 [900000009beffe80] do_syscall at 9000000000f2da54 crash-utility#10 [900000009beffea0] handle_syscall at 9000000000221440 crash> ... Co-developed-by: Youling Tang <[email protected]> Signed-off-by: Youling Tang <[email protected]> Signed-off-by: Ming Wang <[email protected]>
The following segmentation fault occurred during session initialization: $ crash vmlinx vmcore ... please wait... (determining panic task)Segmentation fault Here is the backtrace of the crash-utility: (gdb) bt #0 value_search_module_6_4 (value=18446603338276298752, offset=0x7ffffffface0) at symbols.c:5564 crash-utility#1 0x0000555555812bd0 in value_to_symstr (value=18446603338276298752, buf=buf@entry=0x7fffffffb9c0 "", radix=10, radix@entry=0) at symbols.c:5872 crash-utility#2 0x00005555557694a2 in display_memory (addr=<optimized out>, count=2048, flag=208, memtype=memtype@entry=1, opt=opt@entry=0x0) at memory.c:1740 crash-utility#3 0x0000555555769e1f in raw_stack_dump (stackbase=<optimized out>, size=<optimized out>) at memory.c:2194 crash-utility#4 0x00005555557923ff in get_active_set_panic_task () at task.c:8639 crash-utility#5 0x00005555557930d2 in get_dumpfile_panic_task () at task.c:7628 crash-utility#6 0x00005555557a89d3 in panic_search () at task.c:7380 crash-utility#7 get_panic_context () at task.c:6267 crash-utility#8 task_init () at task.c:687 crash-utility#9 0x00005555557305b3 in main_loop () at main.c:787 ... This is due to lack of existence check on module symbol table. Not all mod_mem_type will be existent for a module, e.g. in the following module case: (gdb) p lm->symtable[0] $1 = (struct syment *) 0x4dcbad0 (gdb) p lm->symtable[1] $2 = (struct syment *) 0x4dcbb70 (gdb) p lm->symtable[2] $3 = (struct syment *) 0x4dcbc10 (gdb) p lm->symtable[3] $4 = (struct syment *) 0x0 (gdb) p lm->symtable[4] $5 = (struct syment *) 0x4dcbcb0 (gdb) p lm->symtable[5] $6 = (struct syment *) 0x4dcbd00 (gdb) p lm->symtable[6] $7 = (struct syment *) 0x0 MOD_RO_AFTER_INIT(3) and MOD_INIT_RODATA(6) do not exist, which should be skipped, otherwise the segmentation fault will happen. Fixes: 7750e61 ("Support module memory layout change on Linux 6.4") Closes: crash-utility#176 Reported-by: Naveen Chaudhary <[email protected]> Signed-off-by: Tao Liu <[email protected]>
With Kernel commit 65c9cc9e2c14 ("x86/fred: Reserve space for the FRED stack frame") in Linux 6.9-rc1 and later, x86_64 will add extra padding ('TOP_OF_KERNEL_STACK_PADDING (2 * 8)', see: arch/x86/include/asm\ /thread_info.h,) for kernel stack when the CONFIG_X86_FRED is enabled. As a result, the pt_regs will be moved downwards due to the offset of padding, and the values of registers read from pt_regs will be incorrect as below. Without the patch: crash> bt PID: 2040 TASK: ffff969136fc4180 CPU: 16 COMMAND: "bash" #0 [ffffa996409aba38] machine_kexec at ffffffff9f881eb7 crash-utility#1 [ffffa996409aba90] __crash_kexec at ffffffff9fa1e49e crash-utility#2 [ffffa996409abb48] panic at ffffffff9f91a6cd crash-utility#3 [ffffa996409abbc8] sysrq_handle_crash at ffffffffa0015076 crash-utility#4 [ffffa996409abbd0] __handle_sysrq at ffffffffa0015640 crash-utility#5 [ffffa996409abc00] write_sysrq_trigger at ffffffffa0015ce5 crash-utility#6 [ffffa996409abc28] proc_reg_write at ffffffff9fd35bf5 crash-utility#7 [ffffa996409abc40] vfs_write at ffffffff9fc8d462 crash-utility#8 [ffffa996409abcd0] ksys_write at ffffffff9fc8dadf crash-utility#9 [ffffa996409abd08] do_syscall_64 at ffffffffa0517429 crash-utility#10 [ffffa996409abf40] entry_SYSCALL_64_after_hwframe at ffffffffa060012b [exception RIP: unknown or invalid address] RIP: 0000000000000246 RSP: 0000000000000000 RFLAGS: 0000002b RAX: 0000000000000002 RBX: 00007f9b9f5b13e0 RCX: 000055cee7486fb0 RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00007f9b9f4fda57 RBP: 0000000000000246 R8: 00007f9b9f4fda57 R9: ffffffffffffffda R10: 0000000000000000 R11: 00007f9b9f5b14e0 R12: 0000000000000002 R13: 000055cee7486fb0 R14: 0000000000000002 R15: 00007f9b9f5fb780 ORIG_RAX: 0000000000000033 CS: 7ffe65327978 SS: 0000 bt: WARNING: possibly bogus exception frame crash> With the patch: crash> bt PID: 2040 TASK: ffff969136fc4180 CPU: 16 COMMAND: "bash" #0 [ffffa996409aba38] machine_kexec at ffffffff9f881eb7 crash-utility#1 [ffffa996409aba90] __crash_kexec at ffffffff9fa1e49e crash-utility#2 [ffffa996409abb48] panic at ffffffff9f91a6cd crash-utility#3 [ffffa996409abbc8] sysrq_handle_crash at ffffffffa0015076 crash-utility#4 [ffffa996409abbd0] __handle_sysrq at ffffffffa0015640 crash-utility#5 [ffffa996409abc00] write_sysrq_trigger at ffffffffa0015ce5 crash-utility#6 [ffffa996409abc28] proc_reg_write at ffffffff9fd35bf5 crash-utility#7 [ffffa996409abc40] vfs_write at ffffffff9fc8d462 crash-utility#8 [ffffa996409abcd0] ksys_write at ffffffff9fc8dadf crash-utility#9 [ffffa996409abd08] do_syscall_64 at ffffffffa0517429 crash-utility#10 [ffffa996409abf40] entry_SYSCALL_64_after_hwframe at ffffffffa060012b RIP: 00007f9b9f4fda57 RSP: 00007ffe65327978 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f9b9f4fda57 RDX: 0000000000000002 RSI: 000055cee7486fb0 RDI: 0000000000000001 RBP: 000055cee7486fb0 R8: 0000000000000000 R9: 00007f9b9f5b14e0 R10: 00007f9b9f5b13e0 R11: 0000000000000246 R12: 0000000000000002 R13: 00007f9b9f5fb780 R14: 0000000000000002 R15: 00007f9b9f5f69e0 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b crash> Link: https://www.mail-archive.com/[email protected]/msg00754.html Signed-off-by: Lianbo Jiang <[email protected]> Signed-off-by: Tao Liu <[email protected]>
For ramdump(Qcom phone device) case with the kernel option CONFIG_ARM64_PTR_AUTH_KERNEL enabled, the bt command may print incorrect stacktrace as below: crash> bt 16930 PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 crash-utility#1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0 crash-utility#2 [ffffffc034c438b0] __kvm_nvhe_$d.2314 at 86c54c6004ceff80 crash-utility#3 [ffffffc034c43950] __kvm_nvhe_$d.2314 at 55d6f96003a7b120 ... PC: 00000073f5294840 LR: 00000070d8f39ba4 SP: 00000070d4afd5d0 X29: 00000070d4afd600 X28: b4000071efcda7f0 X27: 00000070d4afe000 X26: 0000000000000000 X25: 00000070d9616000 X24: 0000000000000000 X23: 0000000000000000 X22: 0000000000000000 X21: 0000000000000000 X20: b40000728fd27520 X19: b40000728fd27550 X18: 000000702daba000 X17: 00000073f5294820 X16: 00000070d940f9d8 X15: 00000000000000bf X14: 0000000000000000 X13: 00000070d8ad2fac X12: b40000718fce5040 X11: 0000000000000000 X10: 0000000000000070 X9: 0000000000000001 X8: 0000000000000062 X7: 0000000000000020 X6: 0000000000000000 X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000 X2: 0000000000000002 X1: 0000000000000080 X0: b40000728fd27550 ORIG_X0: b40000728fd27550 SYSCALLNO: ffffffff PSTATE: 40001000 Crash tool can not get the KERNELPACMASK value from the vmcoreinfo, need to calculate its value based on the vabits. With the patch: crash> bt 16930 PID: 16930 TASK: ffffff89b3eada00 CPU: 2 COMMAND: "Firebase Backgr" #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4 crash-utility#1 [ffffffc034c43850] __schedule at ffffffe004cf05a0 crash-utility#2 [ffffffc034c438b0] preempt_schedule_common at ffffffe004ceff80 crash-utility#3 [ffffffc034c43950] unmap_page_range at ffffffe003a7b120 crash-utility#4 [ffffffc034c439f0] unmap_vmas at ffffffe003a80a64 crash-utility#5 [ffffffc034c43ac0] exit_mmap at ffffffe003a945c4 crash-utility#6 [ffffffc034c43b10] __mmput at ffffffe00372c818 crash-utility#7 [ffffffc034c43b40] mmput at ffffffe00372c0d0 crash-utility#8 [ffffffc034c43b90] exit_mm at ffffffe00373d0ac crash-utility#9 [ffffffc034c43c00] do_exit at ffffffe00373bedc PC: 00000073f5294840 LR: 00000070d8f39ba4 SP: 00000070d4afd5d0 X29: 00000070d4afd600 X28: b4000071efcda7f0 X27: 00000070d4afe000 X26: 0000000000000000 X25: 00000070d9616000 X24: 0000000000000000 X23: 0000000000000000 X22: 0000000000000000 X21: 0000000000000000 X20: b40000728fd27520 X19: b40000728fd27550 X18: 000000702daba000 X17: 00000073f5294820 X16: 00000070d940f9d8 X15: 00000000000000bf X14: 0000000000000000 X13: 00000070d8ad2fac X12: b40000718fce5040 X11: 0000000000000000 X10: 0000000000000070 X9: 0000000000000001 X8: 0000000000000062 X7: 0000000000000020 X6: 0000000000000000 X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000 X2: 0000000000000002 X1: 0000000000000080 X0: b40000728fd27550 ORIG_X0: b40000728fd27550 SYSCALLNO: ffffffff PSTATE: 40001000 Related kernel commits: 689eae42afd7 ("arm64: mask PAC bits of __builtin_return_address") de1702f65feb ("arm64: move PAC masks to <asm/pointer_auth.h>") Signed-off-by: bevis_chen <[email protected]>
See the following stack trace: (gdb) bt #0 0x00005635ac2b166b in arm64_unwind_frame (frame=0x7ffdaf35cb70, bt=0x7ffdaf35d430) at arm64.c:2821 crash-utility#1 arm64_back_trace_cmd (bt=0x7ffdaf35d430) at arm64.c:3306 crash-utility#2 0x00005635ac27b108 in back_trace (bt=bt@entry=0x7ffdaf35d430) at kernel.c:3239 crash-utility#3 0x00005635ac2880ae in cmd_bt () at kernel.c:2863 crash-utility#4 0x00005635ac1f16dc in exec_command () at main.c:893 crash-utility#5 0x00005635ac1f192a in main_loop () at main.c:840 crash-utility#6 0x00005635ac50df81 in captured_main (data=<optimized out>) at main.c:1284 crash-utility#7 gdb_main (args=<optimized out>) at main.c:1313 crash-utility#8 0x00005635ac50e000 in gdb_main_entry (argc=<optimized out>, argv=<optimized out>) at main.c:1338 crash-utility#9 0x00005635ac1ea2a5 in main (argc=5, argv=0x7ffdaf35dde8) at main.c:721 The issue may be encountered when thread_union symbol not found in vmlinux due to compiling optimization. This patch will try the following 2 methods to get the irq_stack_size when thread_union symbol unavailable: 1. change the thread_shift when KASAN is enabled and with vmcoreinfo. In arm64/include/asm/memory.h: #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) ... #define IRQ_STACK_SIZE THREAD_SIZE Since enabling the KASAN will affect the final value, this patch reset IRQ_STACK_SIZE according to the calculation process in kernel code. 2. Try getting the value from kernel code disassembly, to get THREAD_SHIFT directly from tbnz instruction. In arch/arm64/kernel/entry.S: .macro kernel_ventry, el:req, ht:req, regsize:req, label:req ... add sp, sp, x0 sub x0, sp, x0 tbnz x0, #THREAD_SHIFT, 0f $ gdb vmlinux (gdb) disass vectors Dump of assembler code for function vectors: ... 0xffff800080010804 <+4>: add sp, sp, x0 0xffff800080010808 <+8>: sub x0, sp, x0 0xffff80008001080c <+12>: tbnz w0, crash-utility#16, 0xffff80008001081c <vectors+28> Signed-off-by: yeping.zheng <[email protected]> Improved-by: Tao Liu <[email protected]>
Previously, "retq" is used to determine the end of a function, so the end of framesize calculation. However "ret" might be outputted by gdb rather than "retq", as a result, the framesize is returned incorrectly, and bogus stack trace will be outputted. Without the patch: $ crash -d 3 vmcore vmlinux crash> bt 0xffffffff92da7545 <copy_process+5>: push %rbp [framesize: 8] ... 0xffffffff92da7561 <copy_process+33>: sub $0x238,%rsp [framesize: 624] ... 0xffffffff92da776a <copy_process+554>: pop %r15 [framesize: 8] 0xffffffff92da776c <copy_process+556>: pop %rbp [framesize: 0] 0xffffffff92da776d <copy_process+557>: ret crash> bt -D dump framesize_cache_entries: ... [ 3]: ffffffff92dadcbd 0 CF (copy_process+26493) crash> bt ... crash-utility#9 [ffff888263157bc0] copy_process at ffffffff92dadcbd crash-utility#10 [ffff888263157d20] __mutex_init at ffffffff92ed8dd5 crash-utility#11 [ffff888263157d38] __alloc_file at ffffffff93458397 crash-utility#12 [ffff888263157d60] alloc_empty_file at ffffffff934585d2 crash-utility#13 [ffff888263157da8] __alloc_fd at ffffffff934b5ead crash-utility#14 [ffff888263157e38] _do_fork at ffffffff92dae7a1 crash-utility#15 [ffff888263157f28] do_syscall_64 at ffffffff92c085f4 Stack crash-utility#10 ~ crash-utility#13 are bogus and misleading. With the patch: ... 0xffffffff92da776d <copy_process+557>: ret [framesize restored to: 624] crash> bt -D dump ... [ 3]: ffffffff92dadcbd 624 CF (copy_process+26493) crash> bt ... crash-utility#9 [ffff888263157bc0] copy_process at ffffffff92dadcbd crash-utility#10 [ffff888263157e38] _do_fork at ffffffff92dae7a1 crash-utility#11 [ffff888263157f28] do_syscall_64 at ffffffff92c085f4 Signed-off-by: Tao Liu <[email protected]>
Hi,
i was unable to build with
make
but i install withapt install crash
on ubuntu 16.10kernel version
4.10.8-041008-generic
i get error
crash: cannot find booted kernel -- please enter namelist argument
you say on docs :
crash not work on ubuntu 16.10, if i am mistake please help for resolve it,
Cheers,
Ramin
The text was updated successfully, but these errors were encountered: