Realm DMA Crash in Pennant C++ #1803

lightsighter · 2024-12-06T00:47:23Z

I observe non-deterministic crashes when running Pennant C++ in debug mode on 4 nodes with 4 GPUs per node. The crash shows up with different backtraces (I am running with CUDA_LAUNCH_BLOCKING=1 so I can see exactly which kernel or copy is failing):

#6  0x00007f9a34469859 in __GI_abort () at abort.c:79
#7  0x0000560cac7f5d67 in Realm::Cuda::launch_kernel (func_info=..., params=0x7f9a281f83f0, num_elems=22, stream=0x560cdf48f190)
    at /home/mebauer/legion/runtime//realm/cuda/cuda_module.cc:1105
#8  0x0000560cac7f65a1 in Realm::Cuda::GPU::launch_batch_affine_fill_kernel (this=0x560cdf3045a0, fill_info=0x7f9a281f83f0, dim=2, 
    elem_size=8, volume=22, stream=0x560cdf48f190) at /home/mebauer/legion/runtime//realm/cuda/cuda_module.cc:1172
#9  0x0000560cac84b5da in Realm::Cuda::GPUfillXferDes::progress_xd (this=0x7f9a1b5c8b30, channel=0x560ce28eaf30, work_until=...)
    at /home/mebauer/legion/runtime//realm/cuda/cuda_internal.cc:2083
#10 0x0000560cac8535c0 in Realm::XDQueue<Realm::Cuda::GPUfillChannel, Realm::Cuda::GPUfillXferDes>::do_work (this=0x560ce28eaf68, 
    work_until=...) at /home/mebauer/legion/runtime/realm/transfer/channel.inl:166
#11 0x0000560cac437566 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f9a281f99c0, max_time_in_ns=-1, 
    interrupt_flag=0x0) at /home/mebauer/legion/runtime//realm/bgwork.cc:600
#12 0x0000560cac434f46 in Realm::BackgroundWorkThread::main_loop (this=0x560ce1816e00)
    at /home/mebauer/legion/runtime//realm/bgwork.cc:103
#13 0x0000560cac438d5c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop>
    (obj=0x560ce1816e00) at /home/mebauer/legion/runtime/realm/threads.inl:97
#14 0x0000560cac5626ea in Realm::KernelThread::pthread_entry (data=0x560ce1816ea0)
    at /home/mebauer/legion/runtime//realm/threads.cc:854
#15 0x00007f9a3673f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f9a34566353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

and also:

#6  0x00007f850d54a859 in __GI_abort () at abort.c:79
#7  0x000055960b4ee5bd in Realm::Cuda::GPUXferDes::progress_xd (this=0x7f84700a9130, channel=0x559621f13a50, work_until=...)
    at /home/mebauer/legion/runtime//realm/cuda/cuda_internal.cc:854
#8  0x000055960b4fda4c in Realm::XDQueue<Realm::Cuda::GPUChannel, Realm::Cuda::GPUXferDes>::do_work (this=0x559621f13a88, 
    work_until=...) at /home/mebauer/legion/runtime/realm/transfer/channel.inl:166
#9  0x000055960b0e1566 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7f84fbffd9c0, max_time_in_ns=-1, 
    interrupt_flag=0x0) at /home/mebauer/legion/runtime//realm/bgwork.cc:600
#10 0x000055960b0def46 in Realm::BackgroundWorkThread::main_loop (this=0x559620e3fd50)
    at /home/mebauer/legion/runtime//realm/bgwork.cc:103
#11 0x000055960b0e2d5c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop>
    (obj=0x559620e3fd50) at /home/mebauer/legion/runtime/realm/threads.inl:97
#12 0x000055960b20c6ea in Realm::KernelThread::pthread_entry (data=0x559620e3fe70)
    at /home/mebauer/legion/runtime//realm/threads.cc:854
#13 0x00007f850f820609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#14 0x00007f850d647353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It crashes in the DMA system on nearly every run; only the location of the crash varies.

To reproduce, download the master branches of Legion and Pennant C++. Modify the Makefile to set DEBUG=1, enable (uncomment) -DPRECOMPACTED_RECT_POINTS, and disable (comment out) -DENABLE_GATHER_COPIES.
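The three Makefile changes can also be scripted. The following is only a sketch: the three lines written into the stand-in file are assumptions about how the Makefile spells these settings, not copied from the Pennant repository, so check the real file before pointing the sed command at it.

```shell
# Stand-in for the Makefile. The three lines below are hypothetical;
# the real file may spell these settings differently.
demo_makefile="$(mktemp)"
cat > "$demo_makefile" <<'EOF'
DEBUG ?= 0
#CC_FLAGS += -DPRECOMPACTED_RECT_POINTS
CC_FLAGS += -DENABLE_GATHER_COPIES
EOF

# 1. Force DEBUG=1 (accepts either "DEBUG = ..." or "DEBUG ?= ...").
# 2. Uncomment the line carrying -DPRECOMPACTED_RECT_POINTS.
# 3. Comment out the line carrying -DENABLE_GATHER_COPIES.
# A .bak copy of the original file is kept next to it.
sed -E -i.bak \
  -e 's/^DEBUG[[:space:]]*\??=.*/DEBUG=1/' \
  -e 's/^#[[:space:]]*(.*-DPRECOMPACTED_RECT_POINTS.*)$/\1/' \
  -e 's/^([^#].*-DENABLE_GATHER_COPIES.*)$/#\1/' \
  "$demo_makefile"

# Show the result so the edits can be checked by eye.
cat "$demo_makefile"
```

Running it against a scratch copy first makes it easy to confirm that each expression matched before rebuilding.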

After building, submit jobs with sbatch using the following script (note that REALM_FREEZE_ON_ERROR=1 makes the processes freeze on a crash, so you will need to kill your jobs explicitly):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --time=00:30:00

root_dir="$PWD"

export LD_LIBRARY_PATH="$PWD"

export GASNET_PHYSMEM_MAX=16G

export CUDA_LAUNCH_BLOCKING=1
export REALM_FREEZE_ON_ERROR=1

ulimit -S -c 0 # disable core dumps

export LEGION_DEFAULT_ARGS="-ll:gpu 4 -ll:util 2 -ll:bgwork 2 -ll:csize 15000 -ll:fsize 14000 -ll:zsize 1024 -ll:rsize 512 -ll:gsize 0 -gex:obcount 8192 -lg:prof 1 -lg:prof_logfile /home/mebauer/pennant-legion/prof_%.log"

srun -n 4 -N 4 --ntasks-per-node 1 --cpu_bind none "$root_dir/pennant" -f "$root_dir"/test/leblanc/leblanc.pnt -n 16

Submit with sbatch -n 4 -N 4 --exclusive <script_name>.


apryakhin · 2024-12-11T18:51:16Z

I am working on this issue now.

lightsighter assigned apryakhin Dec 6, 2024

lightsighter added bug Realm Issues pertaining to Realm labels Dec 6, 2024

lightsighter mentioned this issue Dec 6, 2024

Realm: Profiling breaks with CUPTI #1800

Open
