Option to SIGQUIT or throw error during ESMF_Abort #296

Open
danrosen25 opened this issue Sep 11, 2024 · 8 comments · May be fixed by #361
Labels
feature/enhancement New feature or request

Comments

@danrosen25
Member

The current way to debug ESMF errors is to build a backtrace using ESMF_LogSetError and rc. This gives only limited information about the program state at the time of the error. I started investigating raising SIGQUIT, which can print a backtrace and dump core. The core dump can then be analyzed to inspect the state that caused the error.

diff --git a/src/Infrastructure/VM/src/ESMCI_VMKernel.C b/src/Infrastructure/VM/src/ESMCI_VMKernel.C
index 63b85ad0c3..43c85c5c5c 100644
--- a/src/Infrastructure/VM/src/ESMCI_VMKernel.C
+++ b/src/Infrastructure/VM/src/ESMCI_VMKernel.C
@@ -899,6 +899,7 @@ struct SpawnArg{
 void VMK::abort(){
   // abort default (all MPI) virtual machine
   int finalized;
+  raise (SIGQUIT);
   MPI_Finalized(&finalized);
   if (!finalized)
     MPI_Abort(default_mpi_c, EXIT_FAILURE);
@danrosen25 danrosen25 self-assigned this Sep 11, 2024
@danrosen25 danrosen25 added the feature/enhancement New feature or request label Sep 11, 2024
@anntsay

anntsay commented Oct 2, 2024

Dan proposes making this a runtime option, so that ESMF quits on error and outputs diagnostic information. This allows easier troubleshooting and debugging.

Bob: Looks reasonable. Maybe put it in 8.8 because it is not heavyweight, and the new method will be optional.
Ann confirms that ESMF_LogSetError and rc will still be available and will remain the default.

@anntsay

anntsay commented Oct 2, 2024

Bill: CESM also uses this. It makes sense to offer this as an option.
Dan: This is only an optional method. The default is still the current method. It is set as a one-time flag at run time.

@danrosen25
Member Author

Look at the LogSetError option for abort on error.
Runtime flag (set through the environment): ESMF_RUNTIME_ABORT_ON_ERROR
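
As a rough illustration of a one-time, environment-driven flag, here is a minimal C++ sketch. The helper name `envFlagEnabled` and the accepted values ("1", "ON", "TRUE", "YES") are assumptions made for this example. They do not describe how ESMF actually parses ESMF_RUNTIME_ABORT_ON_ERROR.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: treat an environment variable as an on/off switch.
// The accepted spellings are an assumption for illustration only.
bool envFlagEnabled(const char *name) {
  assert(name != nullptr);
  const char *raw = std::getenv(name);
  if (raw == nullptr) return false;  // unset: keep the default behavior
  std::string value(raw);
  for (char &c : value) {
    c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  }
  return value == "1" || value == "ON" || value == "TRUE" || value == "YES";
}
```

A caller would evaluate `envFlagEnabled("ESMF_RUNTIME_ABORT_ON_ERROR")` once at startup and cache the result, so that the default rc-based error path is untouched unless the variable is set.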

@anntsay

anntsay commented Feb 26, 2025

The design consideration of how to handle MPI aborts is what makes this story a medium.

This ticket may be beneficial to CESM: CESM backtraces are only available with certain compilers, so this feature may help.

Bill: Is there a C mechanism for producing a backtrace?
gnu backtraces
execinfo
Gerhard: can unroll the stacks.
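
For reference, the execinfo mechanism mentioned above can be sketched as follows. This assumes a platform that ships `<execinfo.h>` (glibc or macOS). The function name `captureBacktrace` is hypothetical and is not part of ESMF.

```cpp
#include <execinfo.h>

#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: capture up to maxFrames stack frames of the calling
// thread and return them as printable strings.
std::vector<std::string> captureBacktrace(int maxFrames = 64) {
  assert(maxFrames > 0);
  std::vector<void *> frames(static_cast<std::size_t>(maxFrames));
  // backtrace() fills the buffer with return addresses and returns the count.
  const int count = backtrace(frames.data(), maxFrames);
  // backtrace_symbols() returns one malloc'ed array that the caller must free.
  char **symbols = backtrace_symbols(frames.data(), count);
  std::vector<std::string> lines;
  if (symbols == nullptr) return lines;
  for (int i = 0; i < count; ++i) lines.emplace_back(symbols[i]);
  std::free(symbols);
  return lines;
}
```

How readable the symbol names are depends on the build: without debug information or exported symbols, the strings may contain only module names and addresses.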

@danrosen25
Member Author

Testing on macOS and Derecho:
raise (SIGQUIT);
SIGQUIT terminates the current task, and the mpirun application sends SIGTERM to the other processes.

Executing SIGQUIT on rank 2

dec2436.hsn.de.hpc.ucar.edu 1: rank-1 do nothing
dec2448.hsn.de.hpc.ucar.edu 5: rank-5 do nothing
dec2448.hsn.de.hpc.ucar.edu 6: rank-6 do nothing
dec2436.hsn.de.hpc.ucar.edu 0: rank_sum:28
rank-0 do nothing
dec2448.hsn.de.hpc.ucar.edu 7: rank-7 do nothing
dec2436.hsn.de.hpc.ucar.edu 3: rank-3 do nothing
dec2448.hsn.de.hpc.ucar.edu 4: rank-4 do nothing
dec2436.hsn.de.hpc.ucar.edu: rank 2 died from signal 3 and dumped core
dec2436.hsn.de.hpc.ucar.edu: rank 1 died from signal 15
RESULT=143

Adding sleep for longer than walltime

dec2436.hsn.de.hpc.ucar.edu 0: rank_sum:28
rank-0 do nothing
dec2436.hsn.de.hpc.ucar.edu 1: rank-1 do nothing
dec2448.hsn.de.hpc.ucar.edu 4: rank-4 do nothing
dec2436.hsn.de.hpc.ucar.edu 3: rank-3 do nothing
dec2448.hsn.de.hpc.ucar.edu 5: rank-5 do nothing
dec2448.hsn.de.hpc.ucar.edu 6: rank-6 do nothing
dec2448.hsn.de.hpc.ucar.edu 7: rank-7 do nothing
=>> PBS: job killed: walltime 77 exceeded limit 60
Terminated
dec2436.hsn.de.hpc.ucar.edu: rank 1 died from signal 15
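
The "died from signal" lines in the logs above can be reproduced outside MPI with a small POSIX sketch: a child process raises a signal and the parent inspects the wait status. The function names are hypothetical. The `128 + signal` mapping is the common shell convention for reporting a signal death as an exit code, which is consistent with RESULT=143 for SIGTERM (15). Whether a given launcher follows that convention is an assumption.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <csignal>

// Hypothetical helper: fork a child that raises `sig` with its default
// disposition, and return the signal number that terminated the child.
// Returns -1 if fork/waitpid failed or the child did not die from a signal.
int signalThatKilledChild(int sig) {
  assert(sig > 0);
  const pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    std::signal(sig, SIG_DFL);  // undo any inherited "ignore" setting
    std::raise(sig);
    _exit(0);  // reached only if the signal did not terminate the child
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}

// Shell-style exit code for a process terminated by `sig`.
int shellExitCode(int sig) { return 128 + sig; }
```

Whether a core file is actually written for SIGQUIT also depends on the core size limit (`ulimit -c`) and on system configuration.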

@danrosen25
Member Author

Branch is ready to be discussed
esmf/tree/feature/sigquit

An alternative is to use execinfo, which provides backtrace and backtrace_symbols. This does not produce a core dump. It is already available for writing to the ESMF PET logs through c_esmc_vmlogbacktrace (VM::logBacktrace) or ESMF_VMLogBacktrace.

@danrosen25 danrosen25 linked a pull request Mar 5, 2025 that will close this issue
@danrosen25
Member Author

I split the LogMsgAbort and ESMF_Abort settings into two options, as suggested by @billsacks. I then tested initializing ESMF with a config file; I had to move some code, but that is working now. I also added a SIGABRT option, which calls std::abort() from the standard library; std::abort() in turn raises SIGABRT. After some further reading: SIGQUIT is usually initiated externally and is not available on Windows/MinGW. The POSIX documentation says that both SIGQUIT and SIGABRT will core dump.

PR Open: #361
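
A rough sketch of how such a choice between signals could be expressed portably is shown below. The enum, the function names, and the option strings are all hypothetical and are not taken from the PR. The only facts relied on are that std::abort() raises SIGABRT and that SIGQUIT may be missing on some platforms, hence the `#ifdef` guard.

```cpp
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <string>

// Hypothetical set of abort behaviors; Default means "no extra signal".
enum class AbortAction { Default, Sigabrt, Sigquit };

// Hypothetical parser for a user-supplied option string.
AbortAction parseAbortAction(const char *text) {
  assert(text != nullptr);
  const std::string value(text);
  if (value == "SIGABRT") return AbortAction::Sigabrt;
#ifdef SIGQUIT
  if (value == "SIGQUIT") return AbortAction::Sigquit;
#endif
  return AbortAction::Default;  // unknown or unsupported: keep the default
}

// Raise the selected signal, if any, before the regular abort path runs.
void applyAbortAction(AbortAction action) {
  switch (action) {
    case AbortAction::Sigabrt:
      std::abort();  // raises SIGABRT; does not return
    case AbortAction::Sigquit:
#ifdef SIGQUIT
      std::raise(SIGQUIT);
#endif
      break;
    case AbortAction::Default:
      break;
  }
}
```

With this shape, a platform without SIGQUIT silently falls back to the default path instead of failing to compile.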

SIGABRT

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000ffff7e147aac in __GI_abort () at abort.c:79
#2  0x0000ffff7f12ae60 in ESMCI::VMK::abort() () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#3  0x0000ffff7f119618 in ESMCI::VM::abort(int*) () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#4  0x0000ffff7f13c0f8 in c_esmc_vmabort_ () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#5  0x0000ffff7f6fc784 in __esmf_vmmod_MOD_esmf_vmabort () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#6  0x0000ffff7f710da4 in f_esmf_vmabort_ () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#7  0x0000ffff7f6918c0 in __esmf_logerrmod_MOD_esmf_logwrite () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#8  0x0000ffff7f6926e8 in __esmf_logerrmod_MOD_esmf_logseterror () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#9  0x0000aaaadc088cac in ocn::advance (model=..., rc=0) at /home/dev/esmf_abort/ocn.F90:335

SIGQUIT

#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000ffff9e480e68 in ESMCI::VMK::abort() () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#2  0x0000ffff9e46f618 in ESMCI::VM::abort(int*) () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#3  0x0000ffff9e4920f8 in c_esmc_vmabort_ () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#4  0x0000ffff9ea52784 in __esmf_vmmod_MOD_esmf_vmabort () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#5  0x0000ffff9ea66da4 in f_esmf_vmabort_ () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#6  0x0000ffff9e9e78c0 in __esmf_logerrmod_MOD_esmf_logwrite () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#7  0x0000ffff9e9e86e8 in __esmf_logerrmod_MOD_esmf_logseterror () from /home/dev/install/esmf/gfortran9/O/sigquit/lib/libesmf.so
#8  0x0000aaaae59e8cac in ocn::advance (model=..., rc=0) at /home/dev/esmf_abort/ocn.F90:335

@danrosen25
Member Author

Also tested on Derecho with Intel and MPICH

#0  0x00001461e5608cbb in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00001461e5608cbb in raise () from /lib64/libc.so.6
#1  0x00001461e560a304 in abort () from /lib64/libc.so.6
#2  0x0000000000451f56 in for.issue_diagnostic ()
#3  0x0000000000416a94 in for.signal_handler ()
#4  <signal handler called>
#5  0x00001461e5608cbb in raise () from /lib64/libc.so.6
#6  0x00001461e560a355 in abort () from /lib64/libc.so.6
#7  0x00001461eb6f1da0 in ESMCI::VMK::abort() () from /glade/work/drosen/install/esmf/intel-2023.2.1-mpich-8.1.27/O/sigabrt/lib/libesmf.so
#8  0x00001461eb70eff4 in ESMCI::VM::abort(int*) () from /glade/work/drosen/install/esmf/intel-2023.2.1-mpich-8.1.27/O/sigabrt/lib/libesmf.so
#9  0x00001461eb6ec5f5 in c_esmc_vmabort_ () from /glade/work/drosen/install/esmf/intel-2023.2.1-mpich-8.1.27/O/sigabrt/lib/libesmf.so
#10 0x00001461ebf3b951 in esmf_vmmod_mp_esmf_vmabort_ () from /glade/work/drosen/install/esmf/intel-2023.2.1-mpich-8.1.27/O/sigabrt/lib/libesmf.so
#11 0x00001461ebe69f02 in esmf_logerrmod_mp_esmf_logwrite_ () from /glade/work/drosen/install/esmf/intel-2023.2.1-mpich-8.1.27/O/sigabrt/lib/libesmf.so
#12 0x00001461ebe6a2c1 in esmf_logerrmod_mp_esmf_logseterror_ ()
   from /glade/work/drosen/install/esmf/intel-2023.2.1-mpich-8.1.27/O/sigabrt/lib/libesmf.so
#13 0x000000000041f1d1 in ocn::advance (model=..., rc=0) at /glade/work/drosen/src/esmf_abort/ocn.F90:331

@danrosen25 danrosen25 added this to the v8.9.0 milestone Apr 3, 2025