Cleanup and update to gdb 7.8 #2

jeffmahoney

Hi Dave -

Our Hack Week at SUSE is next week, and I'm planning on continuing a project I've been working on to extend the Python functionality exported by GDB to be useful through crash. I'm aware of several other projects with similar goals. I've already gotten the needed bits integrated into our gdb, which is at version 7.8. When I updated the patches from an earlier version to 7.8, there were substantial differences. I didn't really want to maintain two sets of patches: one against gdb and the other against the gdb that crash uses. Since there's no real reason not to update crash's gdb version to 7.8, I went ahead and integrated it with crash.

This pull request covers a fair amount of ground.

  • I've removed support for gdb versions prior to 7.6
  • Cleaned up configuration of different gdb versions
  • Renamed read_string within crash to mem_read_string to avoid renaming gdb functions
  • Use pointers to minimal_symbols for patching rather than pointers to longs (this one is needed for 7.8)
  • The command hook has been removed in 7.8, so I've implemented an interpreter that just copies the cli interpreter during its own initialization and overrides the command function.
  • Constify parameters for a few functions
  • Eliminate a few global variables that can be replaced with cleaner options
  • Break out the gdb interface code from gdb-7.6.patch into a separate file
  • Add support for gdb 7.8.

I've done some light testing with the command loop, but if you have a test system, I'd be happy to drive it a bit more.

Petr Tesarik and others added 17 commits October 18, 2014 21:09
This fixes a compilation failure on ppc64.
([email protected])
The last release of gdb 5.x was in 2002. We can remove support for it,
cleaning up the code. Since setup_gdb_defaults is using its own strings,
we can leave the block for this version empty to avoid messing with the
version #defines that are used as indexes. We'll clean it up in a later
patch.
([email protected])
gdb 6.1 was released in April 2004. We can remove support for it,
cleaning up the code. Since setup_gdb_defaults is using its own strings,
we can leave the block for this version empty to avoid messing with the
version #defines that are used as indexes. We'll clean it up in a later
patch.
([email protected])
gdb 7.3.1 was released in September 2011. We can remove support for it,
cleaning up the code. Since setup_gdb_defaults is using its own strings,
we can leave the block for this version empty to avoid messing with the
version #defines that are used as indexes. We'll clean it up in a later
patch.
([email protected])
This patch cleans up the searching for a supported gdb, removing the
empty blocks and replacing the index with a simple pointer.
([email protected])
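
As a rough illustration of the change described in the commit above, the sketch below replaces a fixed index into a version table with a pointer to the matching entry. The struct, table, and function names are hypothetical and are not taken from configure.c; only the version and patch file names come from this pull request.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one supported gdb release. */
struct gdb_version_info {
    const char *version;   /* e.g. "7.6" */
    const char *patch;     /* patch file applied to that release */
};

/* Only supported versions are listed, so no empty placeholder slots remain. */
static const struct gdb_version_info supported_gdbs[] = {
    { "7.6", "gdb-7.6.patch" },
    { "7.8", "gdb-7.8.patch" },
};

/* Return a pointer to the matching entry, or NULL if the version is unsupported. */
static const struct gdb_version_info *find_gdb_version(const char *version)
{
    size_t i;

    assert(version != NULL);
    for (i = 0; i < sizeof(supported_gdbs) / sizeof(supported_gdbs[0]); i++) {
        if (strcmp(supported_gdbs[i].version, version) == 0)
            return &supported_gdbs[i];
    }
    return NULL;
}
```

Because callers hold a pointer rather than an index, removing or reordering table entries does not invalidate any numbering used elsewhere.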
With the goal of significantly shrinking the size of the gdb patch, we
rename crash's read_string to mem_read_string and remove the renaming
of gdb's read_string.
([email protected])
patch_kernel_symbol always deals with minimal symbols. Just pass the pointer
in and back out to a callback that uses the right macros. This isn't
entirely necessary for use with gdb 7.6 but will be for gdb 7.8 since
it doesn't provide a single value that can be used to generate a pointer.
([email protected])
The hook assignment method of providing an alternative interface has been
deprecated from gdb for some time. Gdb 7.7 or 7.8 removed the
deprecated_command_loop_hook callback. It's time to just use the
interpreter interface. Since all we really want to do is provide a
different event loop, we can copy the cli interpreter and override the
command_loop_proc there. In doing so, we can also let gdb initialize
it for us, get rid of the separate newline-and-exit callback for
version queries, and drop the update_gdb_hooks callback.
([email protected])
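
The copy-and-override idea in the commit above can be sketched with a simplified procedure table. This is not gdb's interpreter API: struct interp_ops, its two members, and all function names are hypothetical stand-ins chosen only to show the pattern of copying the CLI table and replacing one entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an interpreter's procedure table. */
struct interp_ops {
    int  (*init)(void);
    void (*command_loop)(void);
};

/* Call counters so the effect of the override can be observed. */
static int cli_init_calls, cli_loop_calls, crash_loop_calls;

static int  cli_init(void)           { cli_init_calls++; return 0; }
static void cli_command_loop(void)   { cli_loop_calls++; }
static void crash_command_loop(void) { crash_loop_calls++; }

/* The "cli" table that serves as the template. */
static const struct interp_ops cli_ops = { cli_init, cli_command_loop };

/* Start from a copy of the base table and replace only the command loop. */
static struct interp_ops make_crash_ops(const struct interp_ops *base)
{
    struct interp_ops ops;

    assert(base != NULL);
    ops = *base;                            /* inherit every other callback */
    ops.command_loop = crash_command_loop;  /* provide our own event loop */
    return ops;
}
```

Everything except the loop is inherited from the template, which is why the separate hook-update callbacks mentioned in the commit message become unnecessary in this pattern.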
same_file doesn't modify either argument, so constify them. This avoids
a warning when building against gdb 7.8, which must use objfile_name()
instead of objfile->name.
([email protected])
We don't change the module parameter but will pass it a const char *,
causing build warnings.
([email protected])
gdb 7.7 renamed prettyprint to prettyformat, so let's request that
option first. Failing that, fall back to the old prettyprint names.
([email protected])
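
A minimal sketch of the "new name first, old name as fallback" lookup described in the commit above. The lookup callback type and the option names ending in _structs are assumptions made for illustration; the commit message only states that prettyprint was renamed to prettyformat.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup: returns the option's address, or NULL if unknown. */
typedef int *(*option_lookup_fn)(const char *name);

/* Try the post-rename name first, then fall back to the older spelling. */
static int *find_pretty_option(option_lookup_fn lookup,
                               const char *new_name, const char *old_name)
{
    int *opt;

    assert(lookup != NULL);
    opt = lookup(new_name);
    if (opt == NULL)
        opt = lookup(old_name);
    return opt;
}

/* Stand-in for an older gdb that only knows the prettyprint name. */
static int old_structs_flag;
static int *old_gdb_lookup(const char *name)
{
    return strcmp(name, "prettyprint_structs") == 0 ? &old_structs_flag : NULL;
}

/* Stand-in for a newer gdb that only knows the prettyformat name. */
static int new_structs_flag;
static int *new_gdb_lookup(const char *name)
{
    return strcmp(name, "prettyformat_structs") == 0 ? &new_structs_flag : NULL;
}
```

The same call site then works against either gdb version without a version-specific #if.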
We don't change the filename but can pass it a const char *, causing a
build warning.
([email protected])
There are a number of locations in which we do a callsite prototype
declaration and there's no need to do that. Just put them in defs.h --
in the right place, since the prototypes are the same whether GDB_COMMON
is set or not.
([email protected])
gdb_kernel_objfile performs the same function as symfile_objfile, introduced
sometime in the 7.x release stream. Let's just use that.
([email protected])
We maintain a global variable that is set and cleared across a single
function call. It's much cleaner to just pass it as a parameter through
to the actual consumer.
([email protected])
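
The refactoring described in the commit above follows a common shape, sketched here under assumptions. The function names and the meaning given to the flag are hypothetical; the commit summary only indicates that the eliminated global was crash_from_tty.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Before (shape only): a global was set, one call was made, and the
 * global was cleared again:
 *
 *     crash_from_tty = 1;
 *     run_command(cmd);
 *     crash_from_tty = 0;
 */

/* After: the actual consumer receives the flag as an argument. */
static int echo_needed(const char *cmd, int from_tty)
{
    assert(cmd != NULL);
    /* Illustrative rule: echo only non-empty commands typed at a terminal. */
    return from_tty && cmd[0] != '\0';
}

/* Intermediate layers simply pass the flag through to the consumer. */
static int run_command(const char *cmd, int from_tty)
{
    return echo_needed(cmd, from_tty);
}
```

Passing the value down the call chain removes the window in which the global holds a stale value if the call exits early.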
The crash-specific gdb interface can be implemented in a separate C file
rather than patching gdb code. This is much easier to maintain in the long
run. A later patch can integrate crash.c and gdb_interface.c for further
simplification.
([email protected])
This patch adds support for gdb 7.8.
([email protected])
@jeffmahoney
Author

There seems to be an integration issue. 'dmesg' works but printing of the variables it uses for structured logging does not.

@crash-utility
Collaborator

----- Original Message -----

> Hi Dave -
>
> Our Hack Week at SUSE is next week, and I'm planning on continuing a
> project I've been working on to extend the Python functionality exported
> by GDB to be useful through crash. I'm aware of several other projects
> with similar goals. I've already gotten the needed bits integrated into
> our gdb, which is at version 7.8.

Let me get this straight -- you've had to make additional changes to gdb-7.8 for
this Python functionality? Or does gdb-7.8 work as-is for Python's purposes
(i.e., outside of crash)?

How exactly does this Python functionality get used in crash? I don't speak
Python, and the only users of it with respect to crash are those who use
the built-in Python interpreter offered by the pykdump extension module:

http://people.redhat.com/anderson/extensions.html#PYTHON

> When I updated the patches from an earlier version to 7.8, there were substantial
> differences. I didn't really want to maintain two sets of patches: one against
> gdb and the other against the gdb that crash uses. Since there's no real reason
> not to update crash's gdb version to 7.8, I went ahead and integrated it with crash.

That's debatable on this end. Historically the only reason to update the embedded
gdb was absolute necessity. Stability carries much more weight within Red Hat, so
I'm bound to that (at least as long as I'm employed here). And if you're a follower
of the crash-utility mailing list, you're probably aware of how conservative I am
w/respect to major changes. I've always maintained that whatever version of gdb
is currently in place should remain there forever -- until it has to be updated...

> This pull request covers a fair amount of ground.
>
>   • I've removed support for gdb versions prior to 7.6

Why? Removing support for earlier versions is something that I don't feel comfortable
changing just because you can. There is historical precedent, where the embedded
gdb was bumped to gdb-6.1, but due to debuginfo incompatibility with certain
versions of gcc, it had to be reverted back to gdb-6.0. Because of the way
things are set up, it was trivial making the reversion -- it's basically just
changing the "default_gdb" in configure.c.

>   • Cleaned up configuration of different gdb versions

I grant you that it is somewhat confusing, but the #ifdef lines w/respect to
gdb versions were changed recently such that for future updates, it should be a
relatively simple matter of searching for usages of "GDB_x_y" where x_y is the
most recent version -- and then applying any newly-required changes there, or
just appending the new version to the "#if" line.

>   • Renamed read_string within crash to mem_read_string to avoid renaming gdb
>     functions

Similar to our Red Hat "KABI" for changing kernel interfaces, Red Hat is also
very sensitive to changing user-space ABI's as well. The functions that are
exported in the top-level "defs.h" file shouldn't be changed because extension
modules may depend upon them -- as you saw with dminfo.c and trace.c. For example,
although trace.c does get captured as a sample module within the crash source
package, it is carried as a separate crash-trace-command package. So that
would break that package, and require an update to that package.

Given that there are only 2 (count 'em) callers of the read_string() function
in gdb's valprint.c, it's certainly not an onerous modification to the gdb
sources.

>   • Use pointers to minimal_symbols for patching rather than pointers to longs
>     (this one is needed for 7.8)
>   • The command hook has been removed in 7.8, so I've implemented an
>     interpreter that just copies the cli interpreter during its own
>     initialization and overrides the command function.

Damn -- I see they've removed the deprecated_command_hook, which I was hoping
they'd keep around. I don't quite understand what you mean above, but hopefully
it works as easily as you make it sound. As I recall, it wasn't so much the
replacement of the command hook that was tricky, but rather the error handling
for errors generated within the gdb code itself.

>   • Constify parameters for a few functions
>   • Eliminate a few global variables that can be replaced with cleaner options
>   • Break out the gdb interface code from gdb-7.6.patch into a separate file
>   • Add support for gdb 7.8.

It's not clear to me how this works -- it looks like you've kept gdb-7.6
in place? But I don't see any change in the Makefile to apply the gdb-7.8.patch?

Again, I would like to keep things working in a traditional manner. When a
new gdb-x.y version gets incorporated, there should be a new gdb-x.y.patch
file created, and the prior one removed from the Makefile. I may be misunderstanding
how this is being done.

> I've done some light testing with the command loop, but if you have a
> test system, I'd be happy to drive it a bit more.
> You can merge this Pull Request by running:
>
> git pull https://github.com/jeffmahoney/crash cleanup-and-update-to-gdb-7.8
>
> Or you can view, comment on it, or merge it online at:
>
> #2

I'm currently not using the pull functionality at github. Changes of this
magnitude (in fact any changes) need to be posted on the crash-utility mailing
list.

But as noted above, I'd rather see a "traditional" bumping of the embedded
gdb version the way it's always been done before.

Thanks,
Dave

-- Commit Summary --

  • cleanup: remove support for gdb 5.x
  • cleanup: remove support for gdb 6.0/6.1
  • cleanup: remove support for gdb 7.0/7.3.1
  • cleanup: remove static indexes for gdb version configuration
  • Add missing includes
  • cleanup: rename read_string to mem_read_string and leave gdb alone
  • cleanup: use minimal_symbol directly instead of pointers to longs
  • cleanup: use the gdb interpreter infrastructure
  • cleanup: constify arguments to same_file
  • cleanup: constify check_specified_module_tree
  • gdb: request prettyformat versions of options
  • cleanup: properly export prototypes in defs.h
  • cleanup: constify untrusted_file
  • cleanup: split crash-specific code out from gdb-7.6 and into crash.c
  • gdb: eliminate crash_from_tty
  • gdb: eliminate gdb_kernel_objfile
  • gdb: add support for gdb 7.8

-- File Changes --

M Makefile (73)
M alpha.c (7)
M configure.c (124)
A crash.c (722)
A crash.h (13)
M defs.h (119)
M dev.c (13)
M extensions/dminfo.c (4)
M extensions/trace.c (12)
M filesys.c (14)
M gdb-7.6.patch (369)
A gdb-7.8.patch (804)
M gdb_interface.c (93)
M help.c (8)
M kernel.c (32)
M main.c (8)
M memory.c (40)
M net.c (2)
M ppc.c (10)
M ppc64.c (8)
M symbols.c (48)
M task.c (12)
M unwind_x86_32_64.c (2)

-- Patch Links --

https://github.com/crash-utility/crash/pull/2.patch
https://github.com/crash-utility/crash/pull/2.diff



crash-utility pushed a commit that referenced this pull request Feb 14, 2017
Without the patch, the backtrace displays the "cannot resolve stack
trace" warning, dumps the backtrace, and then the text symbols:

  crash> bt
  PID: 0      TASK: f0962180  CPU: 6   COMMAND: "swapper/6"
  bt: cannot resolve stack trace:
   #0 [f095ff1c] __schedule at c0b6ef8d
   #1 [f095ff58] schedule at c0b6f4a9
   #2 [f095ff64] schedule_preempt_disabled at c0b6f728
   #3 [f095ff6c] cpu_startup_entry at c04b0310
   #4 [f095ff94] start_secondary at c04468c0
  bt: text symbols on stack:
      [f095ff1c] __schedule at c0b6ef8d
      [f095ff58] schedule at c0b6f4ae
      [f095ff64] schedule_preempt_disabled at c0b6f72d
      [f095ff6c] cpu_startup_entry at c04b0315
      [f095ff94] start_secondary at c04468c5
  crash>

The backtrace shown is actually correct.
([email protected])
@leilchen leilchen mentioned this pull request Aug 31, 2017
Leo-Yan pushed a commit to Leo-Yan/crash that referenced this pull request Nov 15, 2017
happybevis pushed a commit to happybevis/crash that referenced this pull request Jul 3, 2024
When crash is used to parse a ramdump (Qcom phone device) rather than a vmcore,
the start command looks like: crash vmlinux --kaslr=xxx DDRCS0_0.BIN@0x0000000080000000,... --machdep vabits_actual=39
The bt command then shows the misleading backtrace below:

crash> bt 16930
PID: 16930    TASK: ffffff89b3eada00  CPU: 2    COMMAND: "Firebase Backgr"
 #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4
 #1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0
 #2 [ffffffc034c438b0] __kvm_nvhe_$d.2314 at 86c54c6004ceff80
 #3 [ffffffc034c43950] __kvm_nvhe_$d.2314 at 55d6f96003a7b120
 #4 [ffffffc034c439f0] __kvm_nvhe_$d.2314 at 9ccec46003a80a64
 #5 [ffffffc034c43ac0] __kvm_nvhe_$d.2314 at 8cf41e6003a945c4
 #6 [ffffffc034c43b10] __kvm_nvhe_$d.2314 at a8f181e00372c818
 #7 [ffffffc034c43b40] __kvm_nvhe_$d.2314 at 6dedde600372c0d0
 #8 [ffffffc034c43b90] __kvm_nvhe_$d.2314 at 62cc07e00373d0ac
 #9 [ffffffc034c43c00] __kvm_nvhe_$d.2314 at 72fb1de00373bedc
...
     PC: 00000073f5294840   LR: 00000070d8f39ba4   SP: 00000070d4afd5d0
    X29: 00000070d4afd600  X28: b4000071efcda7f0  X27: 00000070d4afe000
    X26: 0000000000000000  X25: 00000070d9616000  X24: 0000000000000000
    X23: 0000000000000000  X22: 0000000000000000  X21: 0000000000000000
    X20: b40000728fd27520  X19: b40000728fd27550  X18: 000000702daba000
    X17: 00000073f5294820  X16: 00000070d940f9d8  X15: 00000000000000bf
    X14: 0000000000000000  X13: 00000070d8ad2fac  X12: b40000718fce5040
    X11: 0000000000000000  X10: 0000000000000070   X9: 0000000000000001
     X8: 0000000000000062   X7: 0000000000000020   X6: 0000000000000000
     X5: 0000000000000000   X4: 0000000000000000   X3: 0000000000000000
     X2: 0000000000000002   X1: 0000000000000080   X0: b40000728fd27550
    ORIG_X0: b40000728fd27550  SYSCALLNO: ffffffff  PSTATE: 40001000

Checking the raw stack data below shows that the lr (fp+8) value is a pointer whose upper bits have been replaced by a PAC prefix.

crash> bt -f
PID: 16930    TASK: ffffff89b3eada00  CPU: 2    COMMAND: "Firebase Backgr"
 #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4
    ffffffc034c437f0: ffffffc034c43850 6be732e004cf05a4
    ffffffc034c43800: ffffffe006186108 a0ed07e004cf09c4
    ffffffc034c43810: ffffff8a1a340000 ffffff8a8d343c00
    ffffffc034c43820: ffffff89b3eada00 ffffff8b780db540
    ffffffc034c43830: ffffff89b3eada00 0000000000000000
    ffffffc034c43840: 0000000000000004 712b828118484a00
 #1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0
    ffffffc034c43850: ffffffc034c438b0 86c54c6004ceff84
    ffffffc034c43860: 000000708070f000 ffffffc034c43938
    ffffffc034c43870: ffffff88bd822878 ffffff89b3eada00
...

So we check CONFIG_ARM64_PTR_AUTH and CONFIG_ARM64_PTR_AUTH_KERNEL to confirm that the PAC mechanism is enabled on this ramdump.
Then we use vabits to work out the PAC mask.
With the fix, bt shows the correct backtrace below:
crash> bt 16930
PID: 16930    TASK: ffffff89b3eada00  CPU: 2    COMMAND: "Firebase Backgr"
 #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4
 #1 [ffffffc034c43850] __schedule at ffffffe004cf05a0
 #2 [ffffffc034c438b0] preempt_schedule_common at ffffffe004ceff80
 #3 [ffffffc034c43950] unmap_page_range at ffffffe003a7b120
 #4 [ffffffc034c439f0] unmap_vmas at ffffffe003a80a64
 #5 [ffffffc034c43ac0] exit_mmap at ffffffe003a945c4
 #6 [ffffffc034c43b10] __mmput at ffffffe00372c818
 #7 [ffffffc034c43b40] mmput at ffffffe00372c0d0
 #8 [ffffffc034c43b90] exit_mm at ffffffe00373d0ac
 #9 [ffffffc034c43c00] do_exit at ffffffe00373bedc
     PC: 00000073f5294840   LR: 00000070d8f39ba4   SP: 00000070d4afd5d0
    X29: 00000070d4afd600  X28: b4000071efcda7f0  X27: 00000070d4afe000
    X26: 0000000000000000  X25: 00000070d9616000  X24: 0000000000000000
    X23: 0000000000000000  X22: 0000000000000000  X21: 0000000000000000
    X20: b40000728fd27520  X19: b40000728fd27550  X18: 000000702daba000
    X17: 00000073f5294820  X16: 00000070d940f9d8  X15: 00000000000000bf
    X14: 0000000000000000  X13: 00000070d8ad2fac  X12: b40000718fce5040
    X11: 0000000000000000  X10: 0000000000000070   X9: 0000000000000001
     X8: 0000000000000062   X7: 0000000000000020   X6: 0000000000000000
     X5: 0000000000000000   X4: 0000000000000000   X3: 0000000000000000
     X2: 0000000000000002   X1: 0000000000000080   X0: b40000728fd27550
    ORIG_X0: b40000728fd27550  SYSCALLNO: ffffffff  PSTATE: 40001000

Let's use GENMASK to build the mask that replaces the PAC bits of the pointer to fix it.
Related GKI commit URL:
https://lore.kernel.org/all/[email protected]/
lian-bo pushed a commit that referenced this pull request Jul 25, 2024
For ramdump(Qcom phone device) case with the kernel option
CONFIG_ARM64_PTR_AUTH_KERNEL enabled, the bt command may print
incorrect stacktrace as below:

  crash> bt 16930
  PID: 16930    TASK: ffffff89b3eada00  CPU: 2    COMMAND: "Firebase Backgr"
   #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4
   #1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0
   #2 [ffffffc034c438b0] __kvm_nvhe_$d.2314 at 86c54c6004ceff80
   #3 [ffffffc034c43950] __kvm_nvhe_$d.2314 at 55d6f96003a7b120
  ...
       PC: 00000073f5294840   LR: 00000070d8f39ba4   SP: 00000070d4afd5d0
      X29: 00000070d4afd600  X28: b4000071efcda7f0  X27: 00000070d4afe000
      X26: 0000000000000000  X25: 00000070d9616000  X24: 0000000000000000
      X23: 0000000000000000  X22: 0000000000000000  X21: 0000000000000000
      X20: b40000728fd27520  X19: b40000728fd27550  X18: 000000702daba000
      X17: 00000073f5294820  X16: 00000070d940f9d8  X15: 00000000000000bf
      X14: 0000000000000000  X13: 00000070d8ad2fac  X12: b40000718fce5040
      X11: 0000000000000000  X10: 0000000000000070   X9: 0000000000000001
       X8: 0000000000000062   X7: 0000000000000020   X6: 0000000000000000
       X5: 0000000000000000   X4: 0000000000000000   X3: 0000000000000000
       X2: 0000000000000002   X1: 0000000000000080   X0: b40000728fd27550
      ORIG_X0: b40000728fd27550  SYSCALLNO: ffffffff  PSTATE: 40001000

The crash tool cannot get the KERNELPACMASK value from the vmcoreinfo, so it needs
to calculate the value based on the vabits.

With the patch:

  crash> bt 16930
  PID: 16930    TASK: ffffff89b3eada00  CPU: 2    COMMAND: "Firebase Backgr"
   #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4
   #1 [ffffffc034c43850] __schedule at ffffffe004cf05a0
   #2 [ffffffc034c438b0] preempt_schedule_common at ffffffe004ceff80
   #3 [ffffffc034c43950] unmap_page_range at ffffffe003a7b120
   #4 [ffffffc034c439f0] unmap_vmas at ffffffe003a80a64
   #5 [ffffffc034c43ac0] exit_mmap at ffffffe003a945c4
   #6 [ffffffc034c43b10] __mmput at ffffffe00372c818
   #7 [ffffffc034c43b40] mmput at ffffffe00372c0d0
   #8 [ffffffc034c43b90] exit_mm at ffffffe00373d0ac
   #9 [ffffffc034c43c00] do_exit at ffffffe00373bedc
       PC: 00000073f5294840   LR: 00000070d8f39ba4   SP: 00000070d4afd5d0
      X29: 00000070d4afd600  X28: b4000071efcda7f0  X27: 00000070d4afe000
      X26: 0000000000000000  X25: 00000070d9616000  X24: 0000000000000000
      X23: 0000000000000000  X22: 0000000000000000  X21: 0000000000000000
      X20: b40000728fd27520  X19: b40000728fd27550  X18: 000000702daba000
      X17: 00000073f5294820  X16: 00000070d940f9d8  X15: 00000000000000bf
      X14: 0000000000000000  X13: 00000070d8ad2fac  X12: b40000718fce5040
      X11: 0000000000000000  X10: 0000000000000070   X9: 0000000000000001
       X8: 0000000000000062   X7: 0000000000000020   X6: 0000000000000000
       X5: 0000000000000000   X4: 0000000000000000   X3: 0000000000000000
       X2: 0000000000000002   X1: 0000000000000080   X0: b40000728fd27550
      ORIG_X0: b40000728fd27550  SYSCALLNO: ffffffff  PSTATE: 40001000

Related kernel commits:
689eae42afd7 ("arm64: mask PAC bits of __builtin_return_address")
de1702f65feb ("arm64: move PAC masks to <asm/pointer_auth.h>")

Signed-off-by: bevis_chen <[email protected]>
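
The mask calculation described in the commit above can be sketched as follows. This is an illustration under assumptions, not the code from the patch: the helper names are hypothetical, and the choice of bits 63 down to vabits is inferred from the example addresses in the commit message, which it reproduces for vabits_actual=39.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 64-bit mask with bits l..h (inclusive) set; requires l <= h <= 63. */
static uint64_t genmask64(unsigned h, unsigned l)
{
    assert(l <= h && h <= 63);
    return (~UINT64_C(0) << l) & (~UINT64_C(0) >> (63 - h));
}

/*
 * Recover a kernel return address from a PAC-signed value.
 * Assumption: the PAC occupies every bit from vabits up to bit 63, and a
 * valid kernel address has all of those bits set, so OR-ing the mask back
 * in restores the original pointer.
 */
static uint64_t strip_kernel_pac(uint64_t addr, unsigned vabits)
{
    return addr | genmask64(63, vabits);
}
```

With vabits set to 39 the mask is 0xffffff8000000000, so the frame #1 value 6be732e004cf05a0 from the broken backtrace becomes ffffffe004cf05a0, the __schedule address shown in the fixed output. An address that is already a plain kernel pointer is left unchanged.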
adi-g15-ibm pushed a commit to adi-g15-ibm/crash that referenced this pull request Jul 31, 2024
Stack unwinding is for kernel addresses only. If a non-kernel address is
encountered, it is usually a user space address or a non-address value
such as a function call parameter. So stopping stack unwinding at a non-kernel
address reduces the number of invalid unwind results.

Before:
crash> gdb bt
 #0  0xffffffff816a8f65 in context_switch ...
 #1  __schedule () ...
 #2  0xffffffff816a94e9 in schedule ...
 #3  0xffffffff816a86fd in schedule_hrtimeout_range_clock ...
 #4  0xffffffff816a8733 in schedule_hrtimeout_range ...
 #5  0xffffffff8124bb7e in ep_poll ...
 #6  0xffffffff8124d00d in SYSC_epoll_wait ...
 #7  SyS_epoll_wait ...
 #8  <signal handler called>
 #9  0x00007f0449407923 in ?? ()
 #10 0xffff880100000001 in ?? ()
 #11 0xffff880169b3c010 in ?? ()
 #12 0x0000000000000040 in irq_stack_union ()
 #13 0xffff880169b3c058 in ?? ()
 #14 0xffff880169b3c048 in ?? ()
 #15 0xffff880169b3c050 in ?? ()
 #16 0x0000000000000000 in ?? ()

After:
crash> gdb bt
 #0  0xffffffff816a8f65 in context_switch ...
 #1  __schedule () ...
 #2  0xffffffff816a94e9 in schedule () ...
 #3  0xffffffff816a86fd in schedule_hrtimeout_range_clock ...
 #4  0xffffffff816a8733 in schedule_hrtimeout_range ...
 #5  0xffffffff8124bb7e in ep_poll ...
 #6  0xffffffff8124d00d in SYSC_epoll_wait ...
 #7  SyS_epoll_wait ...
 #8  <signal handler called>
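
The stopping rule can be sketched in C as below. This is only an
illustration, not the code from the patch: is_kernel_address() and
count_kernel_frames() are hypothetical names, and the check assumes x86_64
with 48-bit virtual addresses, where kernel addresses sit in the canonical
upper half.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: x86_64, 48-bit virtual addresses. Kernel addresses then have
 * bits 63..47 all set, so the top 17 bits equal 0x1ffff. */
int is_kernel_address(uint64_t addr)
{
    return (addr >> 47) == 0x1ffffULL;
}

/* Return how many leading frames to keep: unwinding stops at the first
 * program counter that is not a kernel address. */
size_t count_kernel_frames(const uint64_t *pcs, size_t nframes)
{
    size_t i;

    for (i = 0; i < nframes; i++) {
        if (!is_kernel_address(pcs[i]))
            break;
    }
    return i;
}
```

With the addresses from the "Before" output, the user-space value
0x00007f0449407923 is the first one rejected, so everything from that frame
onwards is dropped.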

Cc: Sourabh Jain <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Mahesh J Salgaonkar <[email protected]>
Cc: Naveen N. Rao <[email protected]>
Cc: Lianbo Jiang <[email protected]>
Cc: HAGIO KAZUHITO(萩尾 一仁) <[email protected]>
Cc: Tao Liu <[email protected]>
Cc: Alexey Makhalov <[email protected]>
Signed-off-by: Tao Liu <[email protected]>
adi-g15-ibm added a commit to adi-g15-ibm/crash that referenced this pull request Jul 31, 2024
Currently, gdb passthroughs of 'bt', 'frame', 'up', 'down' and 'info
locals' don't work, because gdb does not know the register values it needs
to unwind the stack frames.

Every gdb passthrough goes through `gdb_interface`. gdb then expects
`crash_target::fetch_registers` to supply the register values, which in
turn depends on `machdep->get_cpu_reg` to read them for the specific
architecture.

                                      ----------------------------
           gdb passthrough (eg. "bt") |                          |
   crash   -------------------------> |                          |
                                      |      gdb_interface       |
                                      |                          |
                                      |                          |
                                      |  ----------------------  |
                 fetch_registers      |  |                    |  |
crash_target<-------------------------+--|        gdb         |  |
            --------------------------+->|                    |  |
              Registers (SP,NIP, etc.)|  |                    |  |
                                      |  |                    |  |
                                      |  ----------------------  |
                                      ----------------------------

Implement `machdep->get_cpu_reg` on PPC64, so that crash provides the
register values gdb needs to unwind stack frames properly.
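
The callback relationship in the diagram can be sketched as follows. All
names here (machdep_sketch, get_cpu_reg_fn, demo_get_cpu_reg,
fetch_register) are hypothetical and the signature is simplified; the real
crash and gdb interfaces may differ. The point is only that, without an
architecture callback, the register is reported as unavailable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified callback type: returns 1 on success, 0 if the
 * register value cannot be provided. */
typedef int (*get_cpu_reg_fn)(int cpu, int regno, void *value, size_t size);

struct machdep_sketch {
    get_cpu_reg_fn get_cpu_reg;   /* NULL when the architecture has no hook */
};

/* Stand-in for an architecture implementation: pretends that register N
 * holds 0x1000 + N and that only registers 0..31 exist. */
int demo_get_cpu_reg(int cpu, int regno, void *value, size_t size)
{
    uint64_t v;

    (void)cpu;
    if (regno < 0 || regno > 31 || size != sizeof(v))
        return 0;
    v = 0x1000u + (uint64_t)regno;
    memcpy(value, &v, sizeof(v));
    return 1;
}

/* Roughly the role of fetch_registers: ask the architecture callback and
 * report "unavailable" (0) when there is none or it fails. */
int fetch_register(const struct machdep_sketch *m, int cpu, int regno,
                   uint64_t *out)
{
    if (!m->get_cpu_reg)
        return 0;
    return m->get_cpu_reg(cpu, regno, out, sizeof(*out));
}
```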

With these changes, on powerpc, the 'bt' command output in gdb mode will look
like this:

    gdb> bt
    #0  0xc0000000002a53e8 in crash_setup_regs (oldregs=<optimized out>, newregs=0xc00000000486f8d8) at ./arch/powerpc/include/asm/kexec.h:69
    #1  __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:974
    #2  0xc000000000168918 in panic (fmt=<optimized out>) at kernel/panic.c:358
    #3  0xc000000000b735f8 in sysrq_handle_crash (key=<optimized out>) at drivers/tty/sysrq.c:155
    #4  0xc000000000b742cc in __handle_sysrq (key=key@entry=99, check_mask=check_mask@entry=false) at drivers/tty/sysrq.c:602
    #5  0xc000000000b7506c in write_sysrq_trigger (file=<optimized out>, buf=<optimized out>, count=2, ppos=<optimized out>) at drivers/tty/sysrq.c:1163
    #6  0xc00000000069a7bc in pde_write (ppos=<optimized out>, count=<optimized out>, buf=<optimized out>, file=<optimized out>, pde=0xc000000009ed3a80) at fs/proc/inode.c:340
    #7  proc_reg_write (file=<optimized out>, buf=<optimized out>, count=<optimized out>, ppos=<optimized out>) at fs/proc/inode.c:352
    #8  0xc0000000005b3bbc in vfs_write (file=file@entry=0xc00000009dda7d00, buf=buf@entry=0xebcfc7c6040 <error: Cannot access memory at address 0xebcfc7c6040>, count=count@entry=2, pos=pos@entry=0xc00000000486fda0) at fs/read_write.c:582

instead of earlier output without this patch:

    gdb> bt
    #0  <unavailable> in ?? ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Also, 'get_dumpfile_regs' has been introduced to get registers from the
multiple supported vmcore formats. Correspondingly, a flag 'BT_NO_PRINT_REGS'
has been introduced to tell the helper functions that fetch registers not to
print them on every backtrace call from gdb.

Note: This GDB unwinding support does not cover live debugging.

Cc: Sourabh Jain <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Mahesh J Salgaonkar <[email protected]>
Cc: Naveen N. Rao <[email protected]>
Cc: Lianbo Jiang <[email protected]>
Cc: HAGIO KAZUHITO(萩尾 一仁) <[email protected]>
Cc: Tao Liu <[email protected]>
Cc: Alexey Makhalov <[email protected]>
Improved-by: Tao Liu <[email protected]>
Signed-off-by: Aditya Gupta <[email protected]>
lian-bo pushed a commit that referenced this pull request Aug 14, 2024
See the following stack trace:
(gdb) bt
 #0  0x00005635ac2b166b in arm64_unwind_frame (frame=0x7ffdaf35cb70,
     bt=0x7ffdaf35d430) at arm64.c:2821
 #1  arm64_back_trace_cmd (bt=0x7ffdaf35d430) at arm64.c:3306
 #2  0x00005635ac27b108 in back_trace (bt=bt@entry=0x7ffdaf35d430) at
     kernel.c:3239
 #3  0x00005635ac2880ae in cmd_bt () at kernel.c:2863
 #4  0x00005635ac1f16dc in exec_command () at main.c:893
 #5  0x00005635ac1f192a in main_loop () at main.c:840
 #6  0x00005635ac50df81 in captured_main (data=<optimized out>) at main.c:1284
 #7  gdb_main (args=<optimized out>) at main.c:1313
 #8  0x00005635ac50e000 in gdb_main_entry (argc=<optimized out>,
     argv=<optimized out>) at main.c:1338
 #9  0x00005635ac1ea2a5 in main (argc=5, argv=0x7ffdaf35dde8) at main.c:721

The issue may be encountered when the thread_union symbol is not found in
vmlinux because of compiler optimization.

This patch tries the following two methods to get the irq_stack_size
when the thread_union symbol is unavailable:

1. Change the thread_shift when KASAN is enabled and vmcoreinfo is available.
   In arm64/include/asm/memory.h:

   #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
   ...
   #define IRQ_STACK_SIZE               THREAD_SIZE

   Since enabling KASAN affects the final value, this patch recalculates
   IRQ_STACK_SIZE following the same calculation as the kernel code.

2. Try getting the value from the kernel code disassembly, reading
   THREAD_SHIFT directly from the tbnz instruction.

   In arch/arm64/kernel/entry.S:
   .macro kernel_ventry, el:req, ht:req, regsize:req, label:req
   ...
         add     sp, sp, x0
         sub     x0, sp, x0
         tbnz    x0, #THREAD_SHIFT, 0f

   $ gdb vmlinux
   (gdb) disass vectors
   Dump of assembler code for function vectors:
      ...
      0xffff800080010804 <+4>:     add     sp, sp, x0
      0xffff800080010808 <+8>:     sub     x0, sp, x0
      0xffff80008001080c <+12>:    tbnz    w0, #16, 0xffff80008001081c <vectors+28>
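
A minimal sketch of the second method, assuming the disassembly is available
as text lines like the ones above. parse_tbnz_shift() and
stack_size_from_shift() are hypothetical helpers, not functions from crash.
In the kernel, THREAD_SIZE is 1 << THREAD_SHIFT, so the bit number tested by
tbnz gives the stack size directly.

```c
#include <stdlib.h>
#include <string.h>

/* Extract the bit number from a disassembly line containing
 * "tbnz <reg>, #<bit>, <target>". Returns -1 if the line is not a
 * tbnz instruction or the bit number is out of range. */
int parse_tbnz_shift(const char *line)
{
    const char *p = strstr(line, "tbnz");
    char *end;
    long bit;

    if (!p)
        return -1;
    p = strchr(p, '#');
    if (!p)
        return -1;
    bit = strtol(p + 1, &end, 10);
    if (end == p + 1 || bit < 0 || bit > 63)
        return -1;
    return (int)bit;
}

/* THREAD_SIZE = 1 << THREAD_SHIFT; shift must be in 0..63. */
unsigned long long stack_size_from_shift(int shift)
{
    return 1ULL << shift;
}
```

For the "tbnz w0, #16" line shown above this yields a shift of 16, that is, a
65536-byte stack.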

Signed-off-by: yeping.zheng <[email protected]>
Improved-by: Tao Liu <[email protected]>
lian-bo added a commit that referenced this pull request Aug 19, 2024
Sometimes, in production environments, vmcores are incomplete, for example
with a partial header or corrupted data. When the crash tool attempts to
parse such vmcores, it may fail as below:

  $ ./crash --osrelease vmcore
  Bus error (core dumped)

or

  $ crash vmlinux vmcore
  ...
  Bus error (core dumped)
 $

Gdb calltrace:

  $ gdb /home/lijiang/src/crash/crash /tmp/core.126301
  Core was generated by `./crash --osrelease /home/lijiang/src/39317/vmcore'.
  Program terminated with signal SIGBUS, Bus error.
  #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
  831             LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
  (gdb) bt
  #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
  #1  0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:820
  #2  0x0000000000651cf3 in is_diskdump (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:1042
  #3  0x0000000000502ac9 in get_osrelease (dumpfile=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at main.c:1938
  #4  0x00000000004fb2e8 in main (argc=3, argv=0x7ffc59dde3a8) at main.c:271
  (gdb) frame 1
  #1  0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:820
  820                   memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2,

This may happen on an attempt to access a page of the buffer that lies
beyond the end of the mapped file (see the mmap() man page).

Let's add a check to avoid such issues as far as possible, although it still
cannot guarantee correct behavior in every extreme situation.
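
The kind of check meant here can be sketched as below. This is an assumption
about the general shape of such a guard, not the code from the patch:
range_within_file() is a hypothetical helper that verifies a region lies
inside the file before it is read through the mapping.

```c
#include <stdint.h>

/* Return 1 if [offset, offset + len) lies entirely within a file of
 * file_size bytes. Written so that offset + len cannot overflow. */
int range_within_file(uint64_t offset, uint64_t len, uint64_t file_size)
{
    return offset <= file_size && len <= file_size - offset;
}
```

A caller would compare the offset and length of the region it is about to
memcpy() against the size reported by fstat() and bail out with an error
instead of touching pages past the end of the file.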

Fixes: a334423 ("diskdump: use mmap/madvise to improve the start-up")
Reported-by: Buland Kumar Singh <[email protected]>
Signed-off-by: Lianbo Jiang <[email protected]>
adi-g15-ibm pushed a commit to adi-g15-ibm/crash that referenced this pull request Aug 27, 2024
adi-g15-ibm added a commit to adi-g15-ibm/crash that referenced this pull request Aug 27, 2024
lian-bo pushed a commit that referenced this pull request Nov 4, 2024
Currently, gdb passthroughs of 'bt', 'frame', 'up', 'down' and 'info
locals' don't work, because gdb does not know the register values it needs
to unwind the stack frames.

Every gdb passthrough goes through `gdb_interface`. gdb then expects
`crash_target::fetch_registers` to supply the register values, which in
turn depends on `machdep->get_current_task_reg` to read them for the
specific architecture.

                                      ----------------------------
           gdb passthrough (eg. "bt") |                          |
   crash   -------------------------> |                          |
                                      |      gdb_interface       |
                                      |                          |
                                      |                          |
                                      |  ----------------------  |
                 fetch_registers      |  |                    |  |
crash_target<-------------------------+--|        gdb         |  |
            --------------------------+->|                    |  |
              Registers (SP,NIP, etc.)|  |                    |  |
                                      |  |                    |  |
                                      |  ----------------------  |
                                      ----------------------------

Implement `machdep->get_current_task_reg` on PPC64, so that crash provides the
register values gdb needs to unwind stack frames properly.

With these changes, on powerpc, the 'bt' command output in gdb mode will look
like this:

    gdb> bt
    #0  0xc0000000002a53e8 in crash_setup_regs (oldregs=<optimized out>, newregs=0xc00000000486f8d8) at ./arch/powerpc/include/asm/kexec.h:69
    #1  __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:974
    #2  0xc000000000168918 in panic (fmt=<optimized out>) at kernel/panic.c:358
    #3  0xc000000000b735f8 in sysrq_handle_crash (key=<optimized out>) at drivers/tty/sysrq.c:155
    #4  0xc000000000b742cc in __handle_sysrq (key=key@entry=99, check_mask=check_mask@entry=false) at drivers/tty/sysrq.c:602
    #5  0xc000000000b7506c in write_sysrq_trigger (file=<optimized out>, buf=<optimized out>, count=2, ppos=<optimized out>) at drivers/tty/sysrq.c:1163
    #6  0xc00000000069a7bc in pde_write (ppos=<optimized out>, count=<optimized out>, buf=<optimized out>, file=<optimized out>, pde=0xc000000009ed3a80) at fs/proc/inode.c:340
    #7  proc_reg_write (file=<optimized out>, buf=<optimized out>, count=<optimized out>, ppos=<optimized out>) at fs/proc/inode.c:352
    #8  0xc0000000005b3bbc in vfs_write (file=file@entry=0xc00000009dda7d00, buf=buf@entry=0xebcfc7c6040 <error: Cannot access memory at address 0xebcfc7c6040>, count=count@entry=2, pos=pos@entry=0xc00000000486fda0) at fs/read_write.c:582

instead of earlier output without this patch:

    gdb> bt
    #0  <unavailable> in ?? ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Also, 'get_dumpfile_regs' has been introduced to get registers from the
multiple supported vmcore formats. Correspondingly, a flag 'BT_NO_PRINT_REGS'
has been introduced to tell the helper functions that fetch registers not to
print them on every backtrace call from gdb.

Note: This GDB unwinding support does not cover live debugging.

[lijiang: squash these five patches(see the Link) into one patch]

Link: https://www.mail-archive.com/[email protected]/msg01084.html
Link: https://www.mail-archive.com/[email protected]/msg01083.html
Link: https://www.mail-archive.com/[email protected]/msg01089.html
Link: https://www.mail-archive.com/[email protected]/msg01090.html
Link: https://www.mail-archive.com/[email protected]/msg01091.html
Co-developed-by: Tao Liu <[email protected]>
Signed-off-by: Aditya Gupta <[email protected]>
liutgnu added a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
…usly

There is an issue where, for kernel modules, "dis -rl" fails to display the
modules' source line numbers after the "bt" command has been executed in crash.

Without the patch:
  crash> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   #8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp
  0xffffffffc0f60eb6 <lpfc_nlp_get+6>:    push   %rbx
  0xffffffffc0f60eb7 <lpfc_nlp_get+7>:    test   %rdi,%rdi

With the patch:
  crash> mod -S
  crash> bt
  PID: 1500     TASK: ff2bd8b093524000  CPU: 16   COMMAND: "lpfc_worker_0"
   #0 [ff2c9f725c39f9e0] machine_kexec at ffffffff8e0686d3
   ...snip...
   #8 [ff2c9f725c39fcc0] __lpfc_sli_release_iocbq_s4 at ffffffffc0f2f425 [lpfc]
   ...snip...
  crash> dis -rl ffffffffc0f60f82
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6756
  0xffffffffc0f60eb0 <lpfc_nlp_get>:      nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  /usr/src/debug/kernel-4.18.0-425.13.1.el8_7/linux-4.18.0-425.13.1.el8_7.x86_64/drivers/scsi/lpfc/lpfc_hbadisc.c: 6759
  0xffffffffc0f60eb5 <lpfc_nlp_get+5>:    push   %rbp

The root cause is that, after a kernel module has been loaded by the mod
command, its symtable is not expanded on the gdb side. The crash bt or dis
command will trigger such an expansion. However, the symtable expansion
differs between the two commands:

The stack trace of "dis -rl" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  #1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  #2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  #3  0x000000000077e8e9 in process_full_comp_unit ...
  #4  process_queue ...
  #5  dw2_do_instantiate_symtab ...
  #6  0x000000000077ed67 in dw2_instantiate_symtab ...
  #7  0x000000000077f75e in dw2_expand_all_symtabs ...
  #8  0x00000000008f254d in gdb_get_line_number ...
  #9  0x00000000008f22af in gdb_command_funnel_1 ...
  #10 0x00000000008f2003 in gdb_command_funnel ...
  #11 0x00000000005b7f02 in gdb_interface ...
  #12 0x00000000005f5bd8 in get_line_number ...
  #13 0x000000000059e574 in cmd_dis ...

The stack trace of "bt" for symtable expanding:

  #0  0x00000000008d8d9f in add_compunit_symtab_to_objfile ...
  #1  0x00000000006d3293 in buildsym_compunit::end_symtab_with_blockvector ...
  #2  0x00000000006d336a in buildsym_compunit::end_symtab_from_static_block ...
  #3  0x000000000077e8e9 in process_full_comp_unit ...
  #4  process_queue ...
  #5  dw2_do_instantiate_symtab ...
  #6  0x000000000077ed67 in dw2_instantiate_symtab ...
  #7  0x000000000077f8ed in dw2_lookup_symbol ...
  #8  0x00000000008e6d03 in lookup_symbol_via_quick_fns ...
  #9  0x00000000008e7153 in lookup_symbol_in_objfile ...
  #10 0x00000000008e73c6 in lookup_symbol_global_or_static_iterator_cb ...
  #11 0x00000000008b99c4 in svr4_iterate_over_objfiles_in_search_order ...
  #12 0x00000000008e754e in lookup_global_or_static_symbol ...
  #13 0x00000000008e75da in lookup_static_symbol ...
  #14 0x00000000008e632c in lookup_symbol_aux ...
  #15 0x00000000008e5a7a in lookup_symbol_in_language ...
  #16 0x00000000008e5b30 in lookup_symbol ...
  #17 0x00000000008f2a4a in gdb_get_datatype ...
  #18 0x00000000008f22c0 in gdb_command_funnel_1 ...
  #19 0x00000000008f2003 in gdb_command_funnel ...
  #20 0x00000000005b7f02 in gdb_interface ...
  #21 0x00000000005f8a9f in datatype_info ...
  #22 0x0000000000599947 in cpu_map_size ...
  #23 0x00000000005a975d in get_cpus_online ...
  #24 0x0000000000637a8b in diskdump_get_prstatus_percpu ...
  #25 0x000000000062f0e4 in get_netdump_regs_x86_64 ...
  #26 0x000000000059fe68 in back_trace ...
  #27 0x00000000005ab1cb in cmd_bt ...

In the "dis -rl" stack trace, dw2_expand_all_symtabs() is called to expand
all symtables of the objfile, "*.ko.debug" in our case. In the "bt" stack
trace, however, only the subset of symtables needed to find a symbol via
dw2_lookup_symbol() is expanded. As a result, objfile->compunit_symtabs,
the head of a singly linked list of struct compunit_symtab, is not NULL
but does not contain all symtables. A later "dis -rl" will not expand the
rest in gdb_get_line_number(), because the
!objfile_has_full_symbols(objfile) check fails, so it cannot display
the proper source line numbers.

Since the objfile_has_full_symbols(objfile) check cannot ensure that all
symbols have been expanded, this patch adds a new member to struct objfile
as a flag recording whether all symbols have been expanded. The flag is set
only after expand_all_symtabs has been called.
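
The difference between the two checks can be sketched as follows. The struct
and function names (objfile_sketch, all_symtabs_expanded, and so on) are
hypothetical stand-ins for illustration; they are not the gdb definitions or
the member name used by the patch.

```c
#include <stddef.h>

struct compunit_sketch {
    struct compunit_sketch *next;
};

struct objfile_sketch {
    struct compunit_sketch *compunit_symtabs; /* non-NULL after ANY expansion */
    int all_symtabs_expanded;                 /* hypothetical new flag */
};

/* Old-style test: a non-empty list is taken to mean "symbols are there",
 * which is also true after a partial expansion triggered by "bt". */
int has_some_symtabs(const struct objfile_sketch *o)
{
    return o->compunit_symtabs != NULL;
}

/* New-style test: only an explicit full expansion clears the need. */
int needs_full_expansion(const struct objfile_sketch *o)
{
    return !o->all_symtabs_expanded;
}

/* To be called once the expand-all step has completed. */
void mark_fully_expanded(struct objfile_sketch *o)
{
    o->all_symtabs_expanded = 1;
}
```

After a partial expansion, has_some_symtabs() already returns true while
needs_full_expansion() still reports that work remains, which is the case
the old check missed.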

Signed-off-by: Tao Liu <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
Same as the Linux commit f766f77a74f5 ("riscv/stacktrace: Fix
stack output without ra on the stack top").

When a function doesn't have a callee, it will not push ra onto the
stack (the lkdtm functions are an example), so correct the FP of the
second frame and use pt_regs to get the right PC of the second frame.

Before this patch, `bt -f` outputs only the first frame, with the
wrong PC and FP for the next frame:
```
crash> bt -f
PID: 1        TASK: ff600000000e0000  CPU: 1    COMMAND: "sh"
 #0 [ff20000000013cf0] lkdtm_EXCEPTION at ffffffff805303c0
    [PC: ffffffff805303c0 RA: ff20000000013d10 SP: ff20000000013cf0 SIZE: 16] <- wrong next PC
    ff20000000013cf0: 0000000000000001 ff20000000013d10 <- next FP
    ff20000000013d00: ff20000000013d40
crash>
```
After this patch, the `bt` outputs the full frames:
```
crash> bt
PID: 1        TASK: ff600000000e0000  CPU: 1    COMMAND: "sh"
 #0 [ff20000000013cf0] lkdtm_EXCEPTION at ffffffff805303c0
 #1 [ff20000000013d00] lkdtm_do_action at ffffffff8052fe36
 #2 [ff20000000013d10] direct_entry at ffffffff80530018
 #3 [ff20000000013d40] full_proxy_write at ffffffff80305044
 #4 [ff20000000013d80] vfs_write at ffffffff801b68b4
 #5 [ff20000000013e30] ksys_write at ffffffff801b6c4a
 #6 [ff20000000013e80] __riscv_sys_write at ffffffff801b6cc4
 #7 [ff20000000013e90] do_trap_ecall_u at ffffffff80836798
crash>
```
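
The correction can be sketched as below. This is an illustration modelled on
the description above, not the patch itself: the struct names and
next_frame() are hypothetical, and the alignment test used to detect a
missing ra is an assumption about how the leaf case is recognised.

```c
#include <stddef.h>
#include <stdint.h>

/* Two-word frame record stored just below fp (hypothetical layout name). */
struct stackframe_sketch {
    uint64_t fp;
    uint64_t ra;
};

/* The subset of pt_regs needed here. */
struct regs_sketch {
    uint64_t epc;
    uint64_t ra;
};

/* Compute the next frame's FP and PC. In the top frame (pc == epc) of a
 * function that never saved ra, the slot read as "fp" holds some other
 * value (assumed detectable because it is not 8-byte aligned) and the
 * slot read as "ra" really holds the caller's FP. The caller's PC then
 * has to come from the saved registers. */
void next_frame(const struct stackframe_sketch *frame,
                const struct regs_sketch *regs, uint64_t pc,
                uint64_t *next_fp, uint64_t *next_pc)
{
    if (regs != NULL && regs->epc == pc && (frame->fp & 0x7)) {
        *next_fp = frame->ra;
        *next_pc = regs->ra;
    } else {
        *next_fp = frame->fp;
        *next_pc = frame->ra;
    }
}
```

With the record from the `bt -f` output (0000000000000001,
ff20000000013d10) the first branch is taken, so ff20000000013d10 becomes the
next FP rather than the next PC.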

Acked-by: Kazuhito Hagio <[email protected]>
Signed-off-by: Song Shuai <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
This patch introduces per-cpu IRQ stacks for RISCV64 so that
"bt" can backtrace on them and 'bt -E' can search them for eframes;
the 'help -m' command displays the address of each per-cpu IRQ
stack.

TEST: a vmcore dumped by hacking handle_irq_event_percpu()
(Why not use lkdtm INT_HW_IRQ_EN EXCEPTION?
  There is a deadlock[1] in the crash_kexec path if that is used.)

  crash> bt
  PID: 0        TASK: ffffffff8140db00  CPU: 0    COMMAND: "swapper/0"
   #0 [ff20000000003e60] __handle_irq_event_percpu at ffffffff8006462e
   #1 [ff20000000003ed0] handle_irq_event_percpu at ffffffff80064702
   #2 [ff20000000003ef0] handle_irq_event at ffffffff8006477c
   #3 [ff20000000003f20] handle_fasteoi_irq at ffffffff80068664
   #4 [ff20000000003f50] generic_handle_domain_irq at ffffffff80063988
   #5 [ff20000000003f60] plic_handle_irq at ffffffff8046633e
   #6 [ff20000000003fb0] generic_handle_domain_irq at ffffffff80063988
   #7 [ff20000000003fc0] riscv_intc_irq at ffffffff80465f8e
   #8 [ff20000000003fd0] handle_riscv_irq at ffffffff808361e8
       PC: ffffffff80837314  [default_idle_call+50]
       RA: ffffffff80837310  [default_idle_call+46]
       SP: ffffffff81403da0  CAUSE: 8000000000000009
  epc : ffffffff80837314 ra : ffffffff80837310 sp : ffffffff81403da0
   gp : ffffffff814ef848 tp : ffffffff8140db00 t0 : ff2000000004bb18
   t1 : 0000000000032c73 t2 : ffffffff81200a48 s0 : ffffffff81403db0
   s1 : 0000000000000000 a0 : 0000000000000004 a1 : 0000000000000000
   a2 : ff6000009f1e7000 a3 : 0000000000002304 a4 : ffffffff80c1c2d8
   a5 : 0000000000000000 a6 : ff6000001fe01958 a7 : 00002496ea89dbf1
   s2 : ffffffff814f0220 s3 : 0000000000000001 s4 : 000000000000003f
   s5 : ffffffff814f03d8 s6 : 0000000000000000 s7 : ffffffff814f00d0
   s8 : ffffffff81526f10 s9 : ffffffff80c1d880 s10: 0000000000000000
   s11: 0000000000000001 t3 : 0000000000003392 t4 : 0000000000000000
   t5 : 0000000000000000 t6 : 0000000000000040
   status: 0000000200000120 badaddr: 0000000000000000
    cause: 8000000000000009 orig_a0: ffffffff80837310
  --- <IRQ stack> ---
   #9 [ffffffff81403da0] default_idle_call at ffffffff80837314
   #10 [ffffffff81403db0] do_idle at ffffffff8004d0a0
   #11 [ffffffff81403e40] cpu_startup_entry at ffffffff8004d21e
   #12 [ffffffff81403e60] kernel_init at ffffffff8083746a
   #13 [ffffffff81403e70] arch_post_acpi_subsys_init at ffffffff80a006d8
   #14 [ffffffff81403e80] console_on_rootfs at ffffffff80a00c92
  crash>

  crash> bt -E
  CPU 0 IRQ STACK:
  KERNEL-MODE EXCEPTION FRAME AT: ff20000000003a48
       PC: ffffffff8006462e  [__handle_irq_event_percpu+30]
       RA: ffffffff80064702  [handle_irq_event_percpu+18]
       SP: ff20000000003e60  CAUSE: 000000000000000d
  epc : ffffffff8006462e ra : ffffffff80064702 sp : ff20000000003e60
   gp : ffffffff814ef848 tp : ffffffff8140db00 t0 : 0000000000046600
   t1 : ffffffff80836464 t2 : ffffffff81200a48 s0 : ff20000000003ed0
   s1 : 0000000000000000 a0 : 0000000000000000 a1 : 0000000000000118
   a2 : 0000000000000052 a3 : 0000000000000000 a4 : 0000000000000000
   a5 : 0000000000010001 a6 : ff6000001fe01958 a7 : 00002496ea89dbf1
   s2 : ff60000000941ab0 s3 : ffffffff814a0658 s4 : ff60000000089230
   s5 : ffffffff814a0518 s6 : ffffffff814a0620 s7 : ffffffff80e5f0f8
   s8 : ffffffff80fc50b0 s9 : ffffffff80c1d880 s10: 0000000000000000
   s11: 0000000000000001 t3 : 0000000000003392 t4 : 0000000000000000
   t5 : 0000000000000000 t6 : 0000000000000040
   status: 0000000200000100 badaddr: 0000000000000078
    cause: 000000000000000d orig_a0: ff20000000003ea0

  CPU 1 IRQ STACK: (none found)

  crash>

  crash> help -m
  <snip>
             machspec: ced1e0
          irq_stack_size: 16384
           irq_stacks[0]: ff20000000000000
           irq_stacks[1]: ff20000000008000
  crash>

[1]: https://lore.kernel.org/linux-riscv/[email protected]/

Signed-off-by: Song Shuai <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
The patch introduces per-cpu overflow stacks for RISCV64 so that
"bt" can backtrace through them, and the 'help -m' command displays the
address of each per-cpu overflow stack.

TEST: a lkdtm DIRECT EXHAUST_STACK vmcore

  crash> bt
  PID: 1        TASK: ff600000000d8000  CPU: 1    COMMAND: "sh"
   #0 [ff6000001fc501c0] riscv_crash_save_regs at ffffffff8000a1dc
   #1 [ff6000001fc50320] panic at ffffffff808773ec
   #2 [ff6000001fc50380] walk_stackframe at ffffffff800056da
       PC: ffffffff80876a34  [memset+96]
       RA: ffffffff80563dc0  [recursive_loop+68]
       SP: ff2000000000fd50  CAUSE: 000000000000000f
  epc : ffffffff80876a34 ra : ffffffff80563dc0 sp : ff2000000000fd50
   gp : ffffffff81515d38 tp : 0000000000000000 t0 : ff2000000000fd58
   t1 : ff600000000d88c8 t2 : 6143203a6d74646b s0 : ff20000000010190
   s1 : 0000000000000012 a0 : ff2000000000fd58 a1 : 1212121212121212
   a2 : 0000000000000400 a3 : ff20000000010158 a4 : 0000000000000000
   a5 : 725bedba92260900 a6 : 000000000130e0f0 a7 : 0000000000000000
   s2 : ff2000000000fd58 s3 : ffffffff815170d8 s4 : ff20000000013e60
   s5 : 000000000000000e s6 : ff20000000013e60 s7 : 0000000000000000
   s8 : ff60000000861000 s9 : 00007fffc3641694 s10: 00007fffc3641690
   s11: 00005555796ed240 t3 : 0000000000010297 t4 : ffffffff80c17810
   t5 : ffffffff8195e7b8 t6 : ff20000000013b18
   status: 0000000200000120 badaddr: ff2000000000fd58
    cause: 000000000000000f orig_a0: 0000000000000000
  --- <OVERFLOW stack> ---
   #3 [ff2000000000fd50] memset at ffffffff80876a34
   #4 [ff20000000010190] recursive_loop at ffffffff80563e16
   #5 [ff200000000105d0] recursive_loop at ffffffff80563e16
   < recursive_loop ...>
   #16 [ff20000000013490] recursive_loop at ffffffff80563e16
   #17 [ff200000000138d0] recursive_loop at ffffffff80563e16
   #18 [ff20000000013d10] lkdtm_EXHAUST_STACK at ffffffff8088005e
   #19 [ff20000000013d30] lkdtm_do_action at ffffffff80563292
   #20 [ff20000000013d40] direct_entry at ffffffff80563474
   #21 [ff20000000013d70] full_proxy_write at ffffffff8032fb3a
   #22 [ff20000000013db0] vfs_write at ffffffff801d6414
   #23 [ff20000000013e60] ksys_write at ffffffff801d67b8
   #24 [ff20000000013eb0] __riscv_sys_write at ffffffff801d6832
   #25 [ff20000000013ec0] do_trap_ecall_u at ffffffff80884a20
  crash>

  crash> help -m
  <snip>
          irq_stack_size: 16384
           irq_stacks[0]: ff20000000000000
           irq_stacks[1]: ff20000000008000
          overflow_stack_size: 4096
           overflow_stacks[0]: ff6000001fa7a510
           overflow_stacks[1]: ff6000001fc4f510
  crash>
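
A minimal sketch of how an address can be matched against a per-cpu stack, using the overflow_stacks[1] base and overflow_stack_size values printed by 'help -m' above. The helper name addr_in_stack is hypothetical and is not taken from the crash sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr lies in [base, base + size). */
bool addr_in_stack(uint64_t addr, uint64_t base, uint64_t size)
{
    /* Compare the offset rather than base + size, so the sum cannot wrap. */
    return addr >= base && addr - base < size;
}
```

With base 0xff6000001fc4f510 and size 4096, frame #0 at ff6000001fc501c0 falls inside the CPU 1 overflow stack, while the faulting SP ff2000000000fd50 does not.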

Signed-off-by: Song Shuai <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
On recent x86_64 kernels, the check of the caller function (BT_CHECK_CALLER)
does not work correctly due to inappropriate direct_call_targets.  As a
result, the correct frame is ignored and the remaining frames are
truncated.

Skip the caller check if the ORC unwinder is available, as the check is not
necessary with it.

Without the patch:
  crash> bt 493113
  PID: 493113   TASK: ff2e34ecbd3ca2c0  CPU: 27   COMMAND: "sriov_fec_daemo"
   #0 [ff77abc4e81cfb08] __schedule at ffffffff81b239cb
   #1 [ff77abc4e81cfb70] schedule at ffffffff81b23e2d
   #2 [ff77abc4e81cfb88] schedule_timeout at ffffffff81b2c9e8
      RIP: 000000000047cdbb  RSP: 000000c0000975a8  RFLAGS: 00000216
      ...

With the patch:
  crash> bt 493113
  PID: 493113   TASK: ff2e34ecbd3ca2c0  CPU: 27   COMMAND: "sriov_fec_daemo"
   #0 [ff77abc4e81cfb08] __schedule at ffffffff81b239cb
   #1 [ff77abc4e81cfb70] schedule at ffffffff81b23e2d
   #2 [ff77abc4e81cfb88] schedule_timeout at ffffffff81b2c9e8
   #3 [ff77abc4e81cfbf0] __wait_for_common at ffffffff81b24abb
   #4 [ff77abc4e81cfc68] vfio_unregister_group_dev at ffffffffc10e76ae [vfio]
   #5 [ff77abc4e81cfca8] vfio_pci_core_unregister_device at ffffffffc11bb599 [vfio_pci_core]
   #6 [ff77abc4e81cfcc0] vfio_pci_remove at ffffffffc103e045 [vfio_pci]
   #7 [ff77abc4e81cfcd0] pci_device_remove at ffffffff815d7513
   ...

Reported-by: Crystal Wood <[email protected]>
Signed-off-by: Kazuhito Hagio <[email protected]>
liutgnu added a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
…ss range

Previously, to find a module symbol and its offset for an arbitrary address,
all symbols within the module were iterated in ascending address order
until the last symbol with a smaller address was found.

However, if the address is not within the module's address range, e.g.
the address is higher than the address of the module's last symbol, then
the module can safely be skipped, because iterating over its symbols is
unnecessary. This speeds up kernel module symbol lookup and improves
overall performance.
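
The idea can be sketched as follows. This is an illustration only: the structures and function names (sym_sketch, module_sketch, find_module_symbol) are hypothetical simplifications and do not mirror crash's real load_module or syment layouts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for crash's module bookkeeping. */
struct sym_sketch {
    uint64_t value;                 /* symbol address */
    const char *name;
};

struct module_sketch {
    uint64_t base;                  /* lowest address covered by the module */
    uint64_t size;                  /* range is [base, base + size) */
    const struct sym_sketch *syms;  /* sorted by ascending address */
    size_t nsyms;
};

/* Cheap range test that lets the caller skip a module outright. */
bool module_contains(const struct module_sketch *m, uint64_t addr)
{
    return addr >= m->base && addr - m->base < m->size;
}

/*
 * Return the last symbol whose address is <= addr, or NULL.
 * Modules whose range excludes addr are skipped without walking
 * their symbol arrays.
 */
const struct sym_sketch *find_module_symbol(const struct module_sketch *mods,
                                            size_t nmods, uint64_t addr,
                                            uint64_t *offset)
{
    for (size_t i = 0; i < nmods; i++) {
        const struct module_sketch *m = &mods[i];
        const struct sym_sketch *best = NULL;

        if (!module_contains(m, addr))
            continue;               /* no symbol iteration for this module */

        for (size_t j = 0; j < m->nsyms && m->syms[j].value <= addr; j++)
            best = &m->syms[j];

        if (best) {
            if (offset)
                *offset = addr - best->value;
            return best;
        }
    }
    return NULL;
}
```

The range test costs two comparisons per module, whereas the symbol walk is linear in the module's symbol count, which is where the time is saved when many modules are loaded.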

Without the patch:
  $ time echo "bt 8993" | ~/crash-dev/crash vmcore vmlinux
  crash> bt 8993
  PID: 8993     TASK: ffff927569cc2100  CPU: 2    COMMAND: "WriterPool0"
   #0 [ffff927569cd76f0] __schedule at ffffffffb3db78d8
   #1 [ffff927569cd7758] schedule_preempt_disabled at ffffffffb3db8bf9
   #2 [ffff927569cd7768] __mutex_lock_slowpath at ffffffffb3db6ca7
   #3 [ffff927569cd77c0] mutex_lock at ffffffffb3db602f
   #4 [ffff927569cd77d8] ucache_retrieve at ffffffffc0cf4409 [secfs2]
   ...snip the stacktrace of the same module...
   #11 [ffff927569cd7ba0] cskal_path_vfs_getattr_nosec at ffffffffc05cae76 [falcon_kal]
   ...snip...
   #13 [ffff927569cd7c40] _ZdlPv at ffffffffc086e751 [falcon_lsm_serviceable]
   ...snip...
   #20 [ffff927569cd7ef8] unload_network_ops_symbols at ffffffffc06f11c0 [falcon_lsm_pinned_14713]
   #21 [ffff927569cd7f50] system_call_fastpath at ffffffffb3dc539a
      RIP: 00007f2b28ed4023  RSP: 00007f2a45fe7f80  RFLAGS: 00000206
      RAX: 0000000000000012  RBX: 00007f2a68302e00  RCX: 00007f2a682546d8
      RDX: 0000000000000826  RSI: 00007eb57ea6a000  RDI: 00000000000000e3
      RBP: 00007eb57ea6a000   R8: 0000000000000826   R9: 00000002670bdfd2
      R10: 00000002670bdfd2  R11: 0000000000000293  R12: 00000002670bdfd2
      R13: 00007f29d501a480  R14: 0000000000000826  R15: 00000002670bdfd2
      ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b
  crash>
  real	7m14.826s
  user	7m12.502s
  sys	0m1.091s

With the patch:
  $ time echo "bt 8993" | ~/crash-dev/crash vmcore vmlinux
  crash> bt 8993
  PID: 8993     TASK: ffff927569cc2100  CPU: 2    COMMAND: "WriterPool0"
   #0 [ffff927569cd76f0] __schedule at ffffffffb3db78d8
   #1 [ffff927569cd7758] schedule_preempt_disabled at ffffffffb3db8bf9
   ...snip the same output...
  crash>
  real	0m8.827s
  user	0m7.896s
  sys	0m0.938s

Signed-off-by: Tao Liu <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
- Add basic support for the 'bt' command.
- LoongArch64: Add 'bt -f' command support
- LoongArch64: Add 'bt -l' command support

E.g. With this patch:
crash> bt
PID: 1832     TASK: 900000009a552100  CPU: 11   COMMAND: "bash"
 #0 [900000009beffb60] __cpu_possible_mask at 90000000014168f0
 #1 [900000009beffb60] __crash_kexec at 90000000002e7660
 #2 [900000009beffcd0] panic at 9000000000f0ec28
 #3 [900000009beffd60] sysrq_handle_crash at 9000000000a2c188
 #4 [900000009beffd70] __handle_sysrq at 9000000000a2c85c
 #5 [900000009beffdc0] write_sysrq_trigger at 9000000000a2ce10
 #6 [900000009beffde0] proc_reg_write at 90000000004ce454
 #7 [900000009beffe00] vfs_write at 900000000043e838
 #8 [900000009beffe40] ksys_write at 900000000043eb58
 #9 [900000009beffe80] do_syscall at 9000000000f2da54
 #10 [900000009beffea0] handle_syscall at 9000000000221440
crash>
...

Co-developed-by: Youling Tang <[email protected]>
Signed-off-by: Youling Tang <[email protected]>
Signed-off-by: Ming Wang <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
…ame" warning

The "bogus exception frame" warning was observed again on a specific
x86_64 vmcore, and the remaining frames were truncated, when
executing the "bt" command as below:

  crash> bt 0 -c 8
  PID: 0        TASK: ffff9948c08f5640  CPU: 8    COMMAND: "swapper/8"
   #0 [fffffe1788788e58] crash_nmi_callback at ffffffff972672bb
   #1 [fffffe1788788e68] nmi_handle at ffffffff9722eb8e
   #2 [fffffe1788788eb0] default_do_nmi at ffffffff97e51cd0
   #3 [fffffe1788788ed0] exc_nmi at ffffffff97e51ee1
   #4 [fffffe1788788ef0] end_repeat_nmi at ffffffff980015f9
      [exception RIP: __update_load_avg_se+13]
      RIP: ffffffff9736b16d  RSP: ffffbec3c08acc78  RFLAGS: 00000046
      RAX: 0000000000000000  RBX: ffff994c2f2b1a40  RCX: ffffbec3c08acdc0
      RDX: ffff9948e4fe1d80  RSI: ffff994c2f2b1a40  RDI: 0000001d7ad7d55d
      RBP: ffffbec3c08acc88   R8: 0000001d921fca6f   R9: ffff994c2f2b1328
      R10: 00000000fffd0010  R11: ffffffff98e060c0  R12: 0000001d7ad7d55d
      R13: 0000000000000005  R14: ffff994c2f2b19c0  R15: 0000000000000001
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #5 [ffffbec3c08acc78] __update_load_avg_se at ffffffff9736b16d
   #6 [ffffbec3c08acce0] enqueue_entity at ffffffff9735c9ab
   #7 [ffffbec3c08acd28] enqueue_task_fair at ffffffff9735cef8
  ...
  #18 [ffffbec3c08acf90] blk_complete_reqs at ffffffff977978d0
  #19 [ffffbec3c08acfa0] __do_softirq at ffffffff97e66f7a
  #20 [ffffbec3c08acff0] do_softirq at ffffffff9730f6ef
  --- <IRQ stack> ---
  #21 [ffffbec3c022ff18] do_idle at ffffffff97368288
      [exception RIP: unknown or invalid address]
      RIP: 0000000000000000  RSP: 0000000000000000  RFLAGS: 00000000
      RAX: 0000000000000000  RBX: 000000089726a2d0  RCX: 0000000000000000
      RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
      RBP: ffffffff9726a3dd   R8: 0000000000000000   R9: 0000000000000000
      R10: ffffffff9720015a  R11: e48885e126bc1600  R12: 0000000000000000
      R13: ffffffff973684a9  R14: 0000000000000094  R15: 0000000040000000
      ORIG_RAX: 0000000000000000  CS: 0000  SS: 0000
  bt: WARNING: possibly bogus exception frame
  crash>

Actually, there is no exception frame when called from do_softirq().
With the patch:

  crash> bt 0 -c 8
  ...
  #18 [ffffbec3c08acf90] blk_complete_reqs at ffffffff977978d0
  #19 [ffffbec3c08acfa0] __do_softirq at ffffffff97e66f7a
  #20 [ffffbec3c08acff0] do_softirq at ffffffff9730f6ef
  --- <IRQ stack> ---
  #21 [ffffbec3c022ff28] cpu_startup_entry at ffffffff973684a9
  #22 [ffffbec3c022ff38] start_secondary at ffffffff9726a3dd
  #23 [ffffbec3c022ff50] secondary_startup_64_no_verify at ffffffff9720015a
  crash>

Reported-by: Jie Li <[email protected]>
Signed-off-by: Lianbo Jiang <[email protected]>
liutgnu added a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
The following segmentation fault occurred during session initialization:

  $ crash vmlinux vmcore
  ...
  please wait... (determining panic task)Segmentation fault

Here is the backtrace of the crash-utility:

  (gdb) bt
  #0  value_search_module_6_4 (value=18446603338276298752, offset=0x7ffffffface0) at symbols.c:5564
  #1  0x0000555555812bd0 in value_to_symstr (value=18446603338276298752,
      buf=buf@entry=0x7fffffffb9c0 "", radix=10, radix@entry=0) at symbols.c:5872
  #2  0x00005555557694a2 in display_memory (addr=<optimized out>, count=2048, flag=208,
      memtype=memtype@entry=1, opt=opt@entry=0x0) at memory.c:1740
  #3  0x0000555555769e1f in raw_stack_dump (stackbase=<optimized out>, size=<optimized out>)
      at memory.c:2194
  #4  0x00005555557923ff in get_active_set_panic_task () at task.c:8639
  #5  0x00005555557930d2 in get_dumpfile_panic_task () at task.c:7628
  #6  0x00005555557a89d3 in panic_search () at task.c:7380
  #7  get_panic_context () at task.c:6267
  #8  task_init () at task.c:687
  #9  0x00005555557305b3 in main_loop () at main.c:787
  ...

This is due to a missing existence check on the module symbol table.  Not
every mod_mem_type exists for a module, e.g. in the following module
case:

  (gdb) p lm->symtable[0]
  $1 = (struct syment *) 0x4dcbad0
  (gdb) p lm->symtable[1]
  $2 = (struct syment *) 0x4dcbb70
  (gdb) p lm->symtable[2]
  $3 = (struct syment *) 0x4dcbc10
  (gdb) p lm->symtable[3]
  $4 = (struct syment *) 0x0
  (gdb) p lm->symtable[4]
  $5 = (struct syment *) 0x4dcbcb0
  (gdb) p lm->symtable[5]
  $6 = (struct syment *) 0x4dcbd00
  (gdb) p lm->symtable[6]
  $7 = (struct syment *) 0x0

MOD_RO_AFTER_INIT(3) and MOD_INIT_RODATA(6) do not exist and should
be skipped; otherwise the segmentation fault occurs.
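
A minimal sketch of the guard, assuming seven memory types (indexes 0..6, as in the gdb output above). The names syment_sketch and lowest_symbol_value are hypothetical and only stand in for the real per-type lookup in symbols.c.

```c
#include <stddef.h>

#define MOD_MEM_NUM_TYPES_SKETCH 7  /* assumed: indexes 0..6 as printed above */

/* Hypothetical stand-in for crash's struct syment. */
struct syment_sketch {
    unsigned long value;
};

/*
 * Return the lowest first-symbol address over all memory types that
 * actually have a symbol table, or 0 if none is present. NULL entries
 * (e.g. MOD_RO_AFTER_INIT and MOD_INIT_RODATA above) are skipped
 * instead of being dereferenced.
 */
unsigned long lowest_symbol_value(struct syment_sketch *const symtable[])
{
    unsigned long lowest = 0;
    int found = 0;

    for (size_t t = 0; t < MOD_MEM_NUM_TYPES_SKETCH; t++) {
        if (symtable[t] == NULL)
            continue;               /* this memory type does not exist */
        if (!found || symtable[t]->value < lowest) {
            lowest = symtable[t]->value;
            found = 1;
        }
    }
    return lowest;
}
```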

Fixes: 7750e61 ("Support module memory layout change on Linux 6.4")
Closes: crash-utility#176
Reported-by: Naveen Chaudhary <[email protected]>
Signed-off-by: Tao Liu <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
With kernel commit 65c9cc9e2c14 ("x86/fred: Reserve space for the FRED
stack frame") in Linux 6.9-rc1 and later, x86_64 adds extra padding
('TOP_OF_KERNEL_STACK_PADDING (2 * 8)', see
arch/x86/include/asm/thread_info.h) to the kernel stack when
CONFIG_X86_FRED is enabled.

As a result, the pt_regs is moved downwards by the size of the
padding, and the register values read from pt_regs are incorrect,
as shown below.
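
The arithmetic can be sketched as below. It assumes the x86_64 pt_regs holds 21 eight-byte registers (168 bytes) and that the stack top is known; the macro and function names are hypothetical, not crash's.

```c
#include <stdint.h>

#define PT_REGS_SIZE_SKETCH 168u        /* assumed: 21 * 8 bytes on x86_64 */
#define FRED_PADDING_SKETCH (2u * 8u)   /* TOP_OF_KERNEL_STACK_PADDING with FRED */

/* Address of the user-mode pt_regs saved at the top of a kernel stack. */
uint64_t user_pt_regs_addr(uint64_t stack_top, int fred_enabled)
{
    uint64_t padding = fred_enabled ? FRED_PADDING_SKETCH : 0;

    return stack_top - padding - PT_REGS_SIZE_SKETCH;
}
```

Reading pt_regs without the padding lands 16 bytes too high, so every register is taken from the slot two positions further up. That matches the output without the patch, where RIP shows the RFLAGS value (246) and CS shows the RSP value.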

Without the patch:
  crash> bt
  PID: 2040     TASK: ffff969136fc4180  CPU: 16   COMMAND: "bash"
   #0 [ffffa996409aba38] machine_kexec at ffffffff9f881eb7
   #1 [ffffa996409aba90] __crash_kexec at ffffffff9fa1e49e
   #2 [ffffa996409abb48] panic at ffffffff9f91a6cd
   #3 [ffffa996409abbc8] sysrq_handle_crash at ffffffffa0015076
   #4 [ffffa996409abbd0] __handle_sysrq at ffffffffa0015640
   #5 [ffffa996409abc00] write_sysrq_trigger at ffffffffa0015ce5
   #6 [ffffa996409abc28] proc_reg_write at ffffffff9fd35bf5
   #7 [ffffa996409abc40] vfs_write at ffffffff9fc8d462
   #8 [ffffa996409abcd0] ksys_write at ffffffff9fc8dadf
   #9 [ffffa996409abd08] do_syscall_64 at ffffffffa0517429
  #10 [ffffa996409abf40] entry_SYSCALL_64_after_hwframe at ffffffffa060012b
      [exception RIP: unknown or invalid address]
      RIP: 0000000000000246  RSP: 0000000000000000  RFLAGS: 0000002b
      RAX: 0000000000000002  RBX: 00007f9b9f5b13e0  RCX: 000055cee7486fb0
      RDX: 0000000000000001  RSI: 0000000000000001  RDI: 00007f9b9f4fda57
      RBP: 0000000000000246   R8: 00007f9b9f4fda57   R9: ffffffffffffffda
      R10: 0000000000000000  R11: 00007f9b9f5b14e0  R12: 0000000000000002
      R13: 000055cee7486fb0  R14: 0000000000000002  R15: 00007f9b9f5fb780
      ORIG_RAX: 0000000000000033  CS: 7ffe65327978  SS: 0000
  bt: WARNING: possibly bogus exception frame
  crash>

With the patch:

  crash> bt
  PID: 2040     TASK: ffff969136fc4180  CPU: 16   COMMAND: "bash"
   #0 [ffffa996409aba38] machine_kexec at ffffffff9f881eb7
   #1 [ffffa996409aba90] __crash_kexec at ffffffff9fa1e49e
   #2 [ffffa996409abb48] panic at ffffffff9f91a6cd
   #3 [ffffa996409abbc8] sysrq_handle_crash at ffffffffa0015076
   #4 [ffffa996409abbd0] __handle_sysrq at ffffffffa0015640
   #5 [ffffa996409abc00] write_sysrq_trigger at ffffffffa0015ce5
   #6 [ffffa996409abc28] proc_reg_write at ffffffff9fd35bf5
   #7 [ffffa996409abc40] vfs_write at ffffffff9fc8d462
   #8 [ffffa996409abcd0] ksys_write at ffffffff9fc8dadf
   #9 [ffffa996409abd08] do_syscall_64 at ffffffffa0517429
  #10 [ffffa996409abf40] entry_SYSCALL_64_after_hwframe at ffffffffa060012b
      RIP: 00007f9b9f4fda57  RSP: 00007ffe65327978  RFLAGS: 00000246
      RAX: ffffffffffffffda  RBX: 0000000000000002  RCX: 00007f9b9f4fda57
      RDX: 0000000000000002  RSI: 000055cee7486fb0  RDI: 0000000000000001
      RBP: 000055cee7486fb0   R8: 0000000000000000   R9: 00007f9b9f5b14e0
      R10: 00007f9b9f5b13e0  R11: 0000000000000246  R12: 0000000000000002
      R13: 00007f9b9f5fb780  R14: 0000000000000002  R15: 00007f9b9f5f69e0
      ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
  crash>

Link: https://www.mail-archive.com/[email protected]/msg00754.html
Signed-off-by: Lianbo Jiang <[email protected]>
Signed-off-by: Tao Liu <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
Commit 48764a1 may cause a regression when CONFIG_X86_FRED is not
enabled. This is because SIZE(fred_frame) calls SIZE_verify() to
determine whether fred_frame is valid, and emits an error if it is
not:

  crash> bt 1

  bt: invalid structure size: fred_frame
        FILE: x86_64.c  LINE: 4089  FUNCTION: x86_64_low_budget_back_trace_cmd()

  [/home/k-hagio/bin/crash] error trace: 588df3 => 5cbc72 => 5eb3e1 => 5eb366
  PID: 1        TASK: ffff9f94c024b980  CPU: 2    COMMAND: "systemd"
     #0 [ffffade44001bca8] __schedule at ffffffffb948ebbb
     #1 [ffffade44001bd10] schedule at ffffffffb948f04d
     #2 [ffffade44001bd20] schedule_hrtimeout_range_clock at ffffffffb9494fef
     #3 [ffffade44001bda8] ep_poll at ffffffffb8c91be8
     #4 [ffffade44001be48] do_epoll_wait at ffffffffb8c91d11
     #5 [ffffade44001be80] __x64_sys_epoll_wait at ffffffffb8c92590
     #6 [ffffade44001bed0] do_syscall_64 at ffffffffb947f459
     #7 [ffffade44001bf50] entry_SYSCALL_64_after_hwframe at ffffffffb96000ea

      5eb366: SIZE_verify.part.42+70
      5eb3e1: SIZE_verify+49
      5cbc72: x86_64_low_budget_back_trace_cmd+3010
      588df3: back_trace+1523

  bt: invalid structure size: fred_frame
        FILE: x86_64.c  LINE: 4089  FUNCTION: x86_64_low_budget_back_trace_cmd()

Let's replace the SIZE(fred_frame) with the VALID_SIZE(fred_frame) to
fix it.
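
The difference between a verifying accessor and a plain validity probe can be sketched like this. Everything here is a hypothetical simplification: the table layout, the -1 sentinel for "structure not present", and the use of the structure's presence as the signal that the 16-byte padding applies are assumptions, not crash's actual definitions.

```c
#include <stdbool.h>

/* Hypothetical size table: a negative entry means the structure is absent. */
struct size_table_sketch {
    long fred_frame;
};

/* Non-failing probe, in the spirit of VALID_SIZE(): never reports an error. */
bool valid_size(long size)
{
    return size >= 0;
}

/* Padding to subtract from the stack top: 16 bytes only when fred_frame exists. */
long top_of_stack_padding(const struct size_table_sketch *st)
{
    return valid_size(st->fred_frame) ? 16 : 0;
}
```

On a kernel built without CONFIG_X86_FRED the probe simply yields zero padding, instead of printing "invalid structure size" on every backtrace.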

Fixes: 48764a1 ("x86_64: fix for adding top_of_kernel_stack_padding for kernel stack")
Reported-by: Kazuhito Hagio <[email protected]>
Signed-off-by: Lianbo Jiang <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
For a ramdump (Qcom phone device) taken with the kernel option
CONFIG_ARM64_PTR_AUTH_KERNEL enabled, the bt command may print an
incorrect stack trace, as below:

  crash> bt 16930
  PID: 16930    TASK: ffffff89b3eada00  CPU: 2    COMMAND: "Firebase Backgr"
   #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4
   #1 [ffffffc034c43850] __kvm_nvhe_$d.2314 at 6be732e004cf05a0
   #2 [ffffffc034c438b0] __kvm_nvhe_$d.2314 at 86c54c6004ceff80
   #3 [ffffffc034c43950] __kvm_nvhe_$d.2314 at 55d6f96003a7b120
  ...
       PC: 00000073f5294840   LR: 00000070d8f39ba4   SP: 00000070d4afd5d0
      X29: 00000070d4afd600  X28: b4000071efcda7f0  X27: 00000070d4afe000
      X26: 0000000000000000  X25: 00000070d9616000  X24: 0000000000000000
      X23: 0000000000000000  X22: 0000000000000000  X21: 0000000000000000
      X20: b40000728fd27520  X19: b40000728fd27550  X18: 000000702daba000
      X17: 00000073f5294820  X16: 00000070d940f9d8  X15: 00000000000000bf
      X14: 0000000000000000  X13: 00000070d8ad2fac  X12: b40000718fce5040
      X11: 0000000000000000  X10: 0000000000000070   X9: 0000000000000001
       X8: 0000000000000062   X7: 0000000000000020   X6: 0000000000000000
       X5: 0000000000000000   X4: 0000000000000000   X3: 0000000000000000
       X2: 0000000000000002   X1: 0000000000000080   X0: b40000728fd27550
      ORIG_X0: b40000728fd27550  SYSCALLNO: ffffffff  PSTATE: 40001000

The crash tool cannot get the KERNELPACMASK value from the vmcoreinfo, so
it needs to calculate the value based on the vabits.
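
A minimal sketch of that calculation, assuming the mask covers bits 63 down to vabits (the kernel's GENMASK_ULL(63, vabits) form) and that a kernel address is restored by OR-ing the mask back in. The function names are hypothetical, and vabits = 39 in the usage note is an assumption chosen because it reproduces the addresses shown in this commit message.

```c
#include <stdint.h>

/* Mask with bits [63:vabits] set; vabits must be below 64. */
uint64_t kernel_pac_mask(unsigned int vabits)
{
    return ~UINT64_C(0) << vabits;
}

/* Restore a signed kernel return address by forcing the PAC bits to ones. */
uint64_t strip_kernel_pac(uint64_t addr, unsigned int vabits)
{
    return addr | kernel_pac_mask(vabits);
}
```

With vabits = 39 the mask is ffffff8000000000, and the signed value 6be732e004cf05a0 from frame #1 becomes ffffffe004cf05a0, which resolves to __schedule in the patched output.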

With the patch:

  crash> bt 16930
  PID: 16930    TASK: ffffff89b3eada00  CPU: 2    COMMAND: "Firebase Backgr"
   #0 [ffffffc034c437f0] __switch_to at ffffffe0036832d4
   #1 [ffffffc034c43850] __schedule at ffffffe004cf05a0
   #2 [ffffffc034c438b0] preempt_schedule_common at ffffffe004ceff80
   #3 [ffffffc034c43950] unmap_page_range at ffffffe003a7b120
   #4 [ffffffc034c439f0] unmap_vmas at ffffffe003a80a64
   #5 [ffffffc034c43ac0] exit_mmap at ffffffe003a945c4
   #6 [ffffffc034c43b10] __mmput at ffffffe00372c818
   #7 [ffffffc034c43b40] mmput at ffffffe00372c0d0
   #8 [ffffffc034c43b90] exit_mm at ffffffe00373d0ac
   #9 [ffffffc034c43c00] do_exit at ffffffe00373bedc
       PC: 00000073f5294840   LR: 00000070d8f39ba4   SP: 00000070d4afd5d0
      X29: 00000070d4afd600  X28: b4000071efcda7f0  X27: 00000070d4afe000
      X26: 0000000000000000  X25: 00000070d9616000  X24: 0000000000000000
      X23: 0000000000000000  X22: 0000000000000000  X21: 0000000000000000
      X20: b40000728fd27520  X19: b40000728fd27550  X18: 000000702daba000
      X17: 00000073f5294820  X16: 00000070d940f9d8  X15: 00000000000000bf
      X14: 0000000000000000  X13: 00000070d8ad2fac  X12: b40000718fce5040
      X11: 0000000000000000  X10: 0000000000000070   X9: 0000000000000001
       X8: 0000000000000062   X7: 0000000000000020   X6: 0000000000000000
       X5: 0000000000000000   X4: 0000000000000000   X3: 0000000000000000
       X2: 0000000000000002   X1: 0000000000000080   X0: b40000728fd27550
      ORIG_X0: b40000728fd27550  SYSCALLNO: ffffffff  PSTATE: 40001000

Related kernel commits:
689eae42afd7 ("arm64: mask PAC bits of __builtin_return_address")
de1702f65feb ("arm64: move PAC masks to <asm/pointer_auth.h>")

Signed-off-by: bevis_chen <[email protected]>
liutgnu added a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
See the following stack trace:
(gdb) bt
 #0  0x00005635ac2b166b in arm64_unwind_frame (frame=0x7ffdaf35cb70,
     bt=0x7ffdaf35d430) at arm64.c:2821
 #1  arm64_back_trace_cmd (bt=0x7ffdaf35d430) at arm64.c:3306
 #2  0x00005635ac27b108 in back_trace (bt=bt@entry=0x7ffdaf35d430) at
     kernel.c:3239
 #3  0x00005635ac2880ae in cmd_bt () at kernel.c:2863
 #4  0x00005635ac1f16dc in exec_command () at main.c:893
 #5  0x00005635ac1f192a in main_loop () at main.c:840
 #6  0x00005635ac50df81 in captured_main (data=<optimized out>) at main.c:1284
 #7  gdb_main (args=<optimized out>) at main.c:1313
 #8  0x00005635ac50e000 in gdb_main_entry (argc=<optimized out>,
     argv=<optimized out>) at main.c:1338
 #9  0x00005635ac1ea2a5 in main (argc=5, argv=0x7ffdaf35dde8) at main.c:721

The issue may be encountered when the thread_union symbol is not found in
vmlinux due to compiler optimization.

This patch tries the following two methods to get the irq_stack_size
when the thread_union symbol is unavailable:

1. Change the thread_shift when KASAN is enabled and vmcoreinfo is available.
   In arch/arm64/include/asm/memory.h:

   #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
   ...
   #define IRQ_STACK_SIZE               THREAD_SIZE

   Since enabling KASAN affects the final value,
   this patch resets IRQ_STACK_SIZE according to the calculation in the
   kernel code.

2. Try getting the value from the kernel code disassembly, reading
   THREAD_SHIFT directly from the tbnz instruction.

   In arch/arm64/kernel/entry.S:
   .macro kernel_ventry, el:req, ht:req, regsize:req, label:req
   ...
         add     sp, sp, x0
         sub     x0, sp, x0
         tbnz    x0, #THREAD_SHIFT, 0f

   $ gdb vmlinux
   (gdb) disass vectors
   Dump of assembler code for function vectors:
      ...
      0xffff800080010804 <+4>:     add     sp, sp, x0
      0xffff800080010808 <+8>:     sub     x0, sp, x0
      0xffff80008001080c <+12>:    tbnz    w0, #16, 0xffff80008001081c <vectors+28>
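
The second method can be sketched as a small parser over one line of disassembly text. This is a sketch under assumptions: the function names are hypothetical, the input is assumed to be gdb's textual form shown above, and the stack size is derived as 1 << THREAD_SHIFT, following the kernel's THREAD_SIZE definition.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Extract the tested bit number from a disassembly line such as
 *   "... tbnz w0, #16, 0xffff80008001081c <vectors+28>".
 * Returns -1 if the line is not a tbnz with an immediate bit number.
 */
int tbnz_bit_from_line(const char *line)
{
    const char *p = strstr(line, "tbnz");
    char *end;
    long bit;

    if (!p)
        return -1;
    p = strchr(p, '#');             /* the immediate follows the register */
    if (!p)
        return -1;
    bit = strtol(p + 1, &end, 10);
    if (end == p + 1 || bit < 0 || bit > 63)
        return -1;
    return (int)bit;
}

/* Stack size implied by a THREAD_SHIFT value. */
unsigned long stack_size_from_shift(int shift)
{
    return 1UL << shift;
}
```

For the vectors+12 line above, the parser yields 16, which corresponds to a 64 KiB stack.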

Signed-off-by: yeping.zheng <[email protected]>
Improved-by: Tao Liu <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
Sometimes, in production environments, vmcores are incomplete, e.g.
with a partial header or corrupted data. When the crash tool attempts
to parse such vmcores, it may fail as below:

  $ ./crash --osrelease vmcore
  Bus error (core dumped)

or

  $ crash vmlinux vmcore
  ...
  Bus error (core dumped)
 $

Gdb calltrace:

  $ gdb /home/lijiang/src/crash/crash /tmp/core.126301
  Core was generated by `./crash --osrelease /home/lijiang/src/39317/vmcore'.
  Program terminated with signal SIGBUS, Bus error.
  #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
  831             LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
  (gdb) bt
  #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
  #1  0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:820
  #2  0x0000000000651cf3 in is_diskdump (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:1042
  #3  0x0000000000502ac9 in get_osrelease (dumpfile=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at main.c:1938
  #4  0x00000000004fb2e8 in main (argc=3, argv=0x7ffc59dde3a8) at main.c:271
  (gdb) frame 1
  #1  0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:820
  820                   memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2,

This may happen when attempting to access a page of the buffer that lies
beyond the end of the mapped file (see the mmap() man page).

Let's add a check to avoid such issues as much as possible, though it
still cannot guarantee correct behavior in every extreme situation.
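
One way to express such a check is to confirm that the region about to be copied lies entirely inside the file before touching the mapping. The helper below is a hypothetical sketch and not the check added to diskdump.c; it assumes the file size was obtained beforehand (for example with fstat()).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if [offset, offset + len) lies entirely within a file of
 * file_size bytes. Written with a subtraction so that offset + len
 * cannot overflow.
 */
bool range_within_file(uint64_t file_size, uint64_t offset, uint64_t len)
{
    return offset <= file_size && len <= file_size - offset;
}
```

A caller would skip the memcpy() and report a truncated dump when the helper returns false, rather than faulting with SIGBUS on a page past the end of the file.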

Fixes: a334423 ("diskdump: use mmap/madvise to improve the start-up")
Reported-by: Buland Kumar Singh <[email protected]>
Signed-off-by: Lianbo Jiang <[email protected]>
liutgnu pushed a commit to liutgnu/crash-preview that referenced this pull request Dec 5, 2024
Currently, gdb passthroughs of 'bt', 'frame', 'up', 'down', 'info
locals' don't work. This is because gdb does not know the register values
needed to unwind the stack frames.

Every gdb passthrough goes through `gdb_interface`. gdb then expects
`crash_target::fetch_registers` to give it the register values, which in
turn depends on `machdep->get_current_task_reg` to read the register values
for the specific architecture.

                                      ----------------------------
           gdb passthrough (eg. "bt") |                          |
   crash   -------------------------> |                          |
                                      |      gdb_interface       |
                                      |                          |
                                      |                          |
                                      |  ----------------------  |
                 fetch_registers      |  |                    |  |
crash_target<-------------------------+--|        gdb         |  |
            --------------------------+->|                    |  |
              Registers (SP,NIP, etc.)|  |                    |  |
                                      |  |                    |  |
                                      |  ----------------------  |
                                      ----------------------------

Implement `machdep->get_current_task_reg` on PPC64, so that crash provides the
register values to gdb to unwind stack frames properly.

With these changes, on powerpc, the 'bt' command output in gdb mode looks
like this:

    gdb> bt
    #0  0xc0000000002a53e8 in crash_setup_regs (oldregs=<optimized out>, newregs=0xc00000000486f8d8) at ./arch/powerpc/include/asm/kexec.h:69
    #1  __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:974
    #2  0xc000000000168918 in panic (fmt=<optimized out>) at kernel/panic.c:358
    #3  0xc000000000b735f8 in sysrq_handle_crash (key=<optimized out>) at drivers/tty/sysrq.c:155
    #4  0xc000000000b742cc in __handle_sysrq (key=key@entry=99, check_mask=check_mask@entry=false) at drivers/tty/sysrq.c:602
    #5  0xc000000000b7506c in write_sysrq_trigger (file=<optimized out>, buf=<optimized out>, count=2, ppos=<optimized out>) at drivers/tty/sysrq.c:1163
    #6  0xc00000000069a7bc in pde_write (ppos=<optimized out>, count=<optimized out>, buf=<optimized out>, file=<optimized out>, pde=0xc000000009ed3a80) at fs/proc/inode.c:340
    #7  proc_reg_write (file=<optimized out>, buf=<optimized out>, count=<optimized out>, ppos=<optimized out>) at fs/proc/inode.c:352
    #8  0xc0000000005b3bbc in vfs_write (file=file@entry=0xc00000009dda7d00, buf=buf@entry=0xebcfc7c6040 <error: Cannot access memory at address 0xebcfc7c6040>, count=count@entry=2, pos=pos@entry=0xc00000000486fda0) at fs/read_write.c:582

instead of earlier output without this patch:

    gdb> bt
    #0  <unavailable> in ?? ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Also, 'get_dumpfile_regs' has been introduced to get registers from the
multiple supported vmcore formats. Correspondingly, a flag 'BT_NO_PRINT_REGS'
has been introduced to tell the helper functions that get registers not to
print them on every call to backtrace in gdb.

Note: This feature to support GDB unwinding doesn't support live debugging.

[lijiang: squash these five patches(see the Link) into one patch]

Link: https://www.mail-archive.com/[email protected]/msg01084.html
Link: https://www.mail-archive.com/[email protected]/msg01083.html
Link: https://www.mail-archive.com/[email protected]/msg01089.html
Link: https://www.mail-archive.com/[email protected]/msg01090.html
Link: https://www.mail-archive.com/[email protected]/msg01091.html
Co-developed-by: Tao Liu <[email protected]>
Signed-off-by: Aditya Gupta <[email protected]>
lian-bo pushed a commit that referenced this pull request Dec 11, 2024
Let gdb fetch_registers include the cpu context registers x19~x28, which
helps show more arguments when gdb unwinds the stack.

Without the patch:
    crash> gdb bt
    #0  __switch_to (prev=<unavailable>, prev@entry=0xffffff80025f0000, next=next@entry=<unavailable>) at arch/arm64/kernel/process.c:566
    #1  0xffffffc008f820b8 in context_switch (rq=0xffffff81fcf419c0, prev=0xffffff80025f0000, next=<unavailable>, rf=<optimized out>) at kernel/sched/core.c:5471
    #2  __schedule (sched_mode=<optimized out>, sched_mode@entry=168999904) at kernel/sched/core.c:6857
    #3  0xffffffc008f82514 in schedule () at kernel/sched/core.c:6933
    ...

With the patch:
    crash> gdb bt
    #0  __switch_to (prev=prev@entry=0xffffff80025f0000, next=next@entry=0xffffff80026092c0) at arch/arm64/kernel/process.c:566
    #1  0xffffffc008f820b8 in context_switch (rq=0xffffff81fcf419c0, prev=0xffffff80025f0000, next=0xffffff80026092c0, rf=<optimized out>) at kernel/sched/core.c:5471
    #2  __schedule (sched_mode=<optimized out>, sched_mode@entry=168999904) at kernel/sched/core.c:6857
    #3  0xffffffc008f82514 in schedule () at kernel/sched/core.c:6933
    ...

Signed-off-by: Guanyou.Chen <[email protected]>