
Assertion failure (KD tree) or hang #1262

Open
Tracked by #1032
bandokihiro opened this issue May 19, 2022 · 20 comments
@bandokihiro

With 58a56ef, I get the following assertion failure on a 3-node, 18-rank configuration:

solver: /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/legion/region_tree.inl:6604: Legion::Internal::KDNode<DIM, T, RT>::KDNode(Legion::Rect<DIM, T>&, std::vector<std::pair<Realm::Rect<N, T>, RT> >&) [with int DIM = 1; T = long long int; RT = unsigned int; Legion::Rect<DIM, T> = Realm::Rect<1, long long int>]: Assertion `right_set.size() < subrects.size()' failed.

gdb0.txt
gdb1.txt

This doesn't reproduce for a 2-node 12-rank configuration.

With 2dc0392 (I noticed some recent KD-tree-related commits), the same 3-node, 18-rank configuration hangs with the following unusual backtraces:
gdb0.txt
gdb2.txt

@lightsighter
Contributor

Attach a debugger and print out all the rectangles in the subrects data structure and post them here.

I'm going to ignore the hang on 2dc0392 unless you can reproduce it on a more recent commit.

@bandokihiro
Author

On one of the frozen processes:

$4 = {
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 26634}, hi = {x = 191999}}, second = 1}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 1112}, hi = {x = 191079}}, second = 3}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 23431}, hi = {x = 191988}}, second = 0},
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 29}, hi = {x = 152559}}, second = 4}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 35220}, hi = {x = 151989}}, second = 6}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 0}, hi = {x = 161763}}, second = 9},
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 23031}, hi = {x = 167159}}, second = 8}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 23030}, hi = {x = 167988}}, second = 7}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 17}, hi = {x = 154187}}, second = 5},
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 629}, hi = {x = 173429}}, second = 2}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 11200}, hi = {x = 155818}}, second = 10}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 22920}, hi = {x = 164488}}, second = 11},
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 696}, hi = {x = 176767}}, second = 13}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 20336}, hi = {x = 191459}}, second = 15}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 30415}, hi = {x = 150378}}, second = 12},
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 42370}, hi = {x = 191982}}, second = 16}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 33200}, hi = {x = 191969}}, second = 17}, 
  {<std::__pair_base<Realm::Rect<1, long long>, unsigned int>> = {<No data fields>}, first = {lo = {x = 1177}, hi = {x = 191204}}, second = 14}
}
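If I read this dump right, every one of the 18 rectangles spans nearly the whole [0, ~192000) index range, so any 1-D split point would be straddled by all of them. A minimal sketch of why that trips the assertion, assuming the split duplicates straddling rectangles into both children (names are illustrative; Legion's actual clipping logic in region_tree.inl differs):

// Illustrative sketch only, not Legion's actual implementation.
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct Rect1 { int64_t lo, hi; };  // stand-in for Realm::Rect<1, long long>

static void kd_split(const std::vector<std::pair<Rect1, unsigned> > &subrects,
                     int64_t split_point,
                     std::vector<std::pair<Rect1, unsigned> > &left_set,
                     std::vector<std::pair<Rect1, unsigned> > &right_set) {
  for (const auto &sr : subrects) {
    // A rectangle that straddles the split point lands in BOTH children.
    if (sr.first.lo < split_point)  left_set.push_back(sr);
    if (sr.first.hi >= split_point) right_set.push_back(sr);
  }
  // Progress check analogous to the failing assertion: if every rectangle
  // straddles split_point, right_set is as large as subrects and the
  // recursion makes no progress.
  assert(right_set.size() < subrects.size());
}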

I have updated to 8f24818

@lightsighter
Contributor

Pull and try again.

@bandokihiro
Author

On 69eadbc, the assertion is gone. Now the 3-node case hangs. The 2-node case still runs to completion. I tried looking at backtraces, but nothing stood out to me. The only thing that seems to change between processes is the state of the GASNet poller. I saw things like:

Thread 8 (Thread 0x20003b90f890 (LWP 176473)):
#0  0x0000000011b34834 in Realm::Clock::current_time_in_nanoseconds (absolute=false) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/timers.inl:69
#1  0x000000001364b9d4 in Realm::XmitSrcDestPair::time_since_failure (this=0x20056c2f8a40) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2525
#2  0x000000001364d290 in Realm::GASNetEXPoller::do_work (this=0x341b35d0, work_until=...) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2884
#3  0x0000000013353944 in Realm::BackgroundWorkManager::Worker::do_work (this=0x20003b90e8e8, max_time_in_ns=-1, interrupt_flag=0x0) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/bgwork.cc:621
#4  0x0000000013350ba0 in Realm::BackgroundWorkThread::main_loop (this=0x34c4e200) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/bgwork.cc:125
#5  0x0000000013356628 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x34c4e200) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.inl:97
#6  0x000000001359b268 in Realm::KernelThread::pthread_entry (data=0x34c4e580) at /gpfs/alpine/scratch/bandok/csc335/Softwares/legion/runtime/realm/threads.cc:774
#7  0x0000200000208ae0 in start_thread () from /lib64/power9/libpthread.so.0
#8  0x000020000a80e7c8 in clone () from /lib64/power9/libc.so.6

but I don't know what it means exactly. I am attaching the 18 backtraces here; sorry I wasn't able to identify the problem more precisely.
backtraces.zip

@lightsighter
Contributor

Please leave the hung processes running on sapling for debugging, and run with -ll:force_kthreads. If possible, see if it hangs with -lg:inorder.

@bandokihiro
Author

These results come from Summit; I am trying to reproduce on sapling right now. It does hang with -lg:inorder, and the last run produced the following warning:

[0 - 20003c20f890]    6.020287 {4}{runtime}: [warning 1114] LEGION WARNING: Failed to find a refinement for KD tree with 1 dimensions and 18 rectangles.
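That warning is consistent with the rectangle dump above: the 18 rectangles share a common non-empty interval, so no split point can separate them. A quick standalone check, with the lo/hi values copied from the gdb output (a hypothetical helper, not part of the runtime):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
  // lo/hi pairs copied from the debugger dump above
  std::vector<std::pair<int64_t, int64_t> > rects = {
    {26634,191999}, {1112,191079},  {23431,191988}, {29,152559},
    {35220,151989}, {0,161763},     {23031,167159}, {23030,167988},
    {17,154187},    {629,173429},   {11200,155818}, {22920,164488},
    {696,176767},   {20336,191459}, {30415,150378}, {42370,191982},
    {33200,191969}, {1177,191204}};
  int64_t lo = 0, hi = INT64_MAX;
  for (const auto &r : rects) {
    lo = std::max(lo, r.first);   // largest lower bound
    hi = std::min(hi, r.second);  // smallest upper bound
  }
  // Prints a non-empty range [42370, 150378]: every rectangle overlaps
  // every other one, so no 1-D refinement exists.
  printf("common intersection: [%lld, %lld]\n", (long long)lo, (long long)hi);
  return 0;
}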

@bandokihiro
Author

bandokihiro commented May 20, 2022

The same configuration with 18 ranks on a single sapling node didn't hang. Same for 2 sapling nodes.

@bandokihiro
Author

I tried to reproduce on sapling using the 16 GPUs; it didn't reproduce. I also tried on Piz Daint with 18 ranks; it didn't reproduce either. Since the only thing I could tell from the backtraces was that something fails at the gasnetex interface, I tried gasnet1 on Summit. It didn't reproduce there either. Maybe @streichler can chime in.

@lightsighter
Contributor

On which machine did the hang occur before? What was the result of running with -lg:inorder -ll:force_kthreads? If you can get it to hang with that, then please report the backtraces.

@bandokihiro
Author

The hang occurred on Summit using 3 nodes and 18 ranks, and only with gasnetex. Attached are the 18 backtraces:
backtraces.zip
The command line was the following:

jsrun --nrs 18 --rs_per_host 6 --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --bind packed:7 --latency_priority gpu-cpu /gpfs/alpine/scratch/bandok/csc335/Softwares/DG-Legion/build_2/exec/solver -logfile logs/log_%.log -hdf5:forcerw -dm:memoize -ll:cpu 0 -ll:util 2 -ll:bgwork 3 -ll:bgworkpin 1 -ll:csize 20000 -ll:rsize 0 -ll:gpu 1 -ll:fsize 14000 -ll:onuma 0 -ll:ht_sharing 1 -ll:ocpu 1 -ll:othr 3 -ll:ib_rsize 512m -ll:ib_zsize 0 -cuda:legacysync 1 -ll:force_kthreads -lg:warn -lg:partcheck -lg:safe_ctrlrepl 1 -lg:safe_mapper -lg:inorder

@eddy16112
Contributor

The fact that the hang occurs only with gasnetex reminds me that my program also runs extremely slowly with gasnetex. Have you tried running with -gex:bindcuda 0 to see if the hang still occurs?

@bandokihiro
Author

Thanks for the suggestion. Without binding, it doesn't hang, but I need binding for best performance.

@eddy16112
Contributor

> Thanks for the suggestion. Without binding, it doesn't hang, but I need binding for best performance.

I think it is a Realm issue, and I just created an issue for my program: #1265.
It may not be a hang for you, but just a very slow run. If you can reduce your problem size, it may run with -gex:bindcuda 1. You can also set GASNET_NUM_QPS=1 to see if it works for you.
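For example (whether jsrun forwards the environment to the compute nodes can be site-dependent, so check that the variable actually reaches the ranks):

export GASNET_NUM_QPS=1
jsrun <same flags and arguments as above>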

@bandokihiro
Author

This depends on the configuration (problem size + number of ranks), so I don't think it is the same issue. Runs that do not hang are fast with CUDA binding.

@eddy16112
Contributor

I just ran my program several times, and it is not only slow, but also sometimes hangs. I just reproduced the hang with the Realm memspeed benchmark under test/realm. Could you please run the memspeed test on Summit? I just want to double-check whether you can reproduce it or not.

@lightsighter
Contributor

I agree with @eddy16112 that this is most likely a Realm/GASNet issue. Legion hangs usually will not reproduce with -lg:inorder, and when they do, it's usually obvious which task is at fault. In this case it looks like all the tasks are at the same point in the program, waiting for the same index task launch to run. Please run with -ll:defalloc 0 -level dma=2 and provide logs for each process of a hung run.

@bandokihiro
Author

Attached are the logs:
logs.zip

@bandokihiro
Author

@eddy16112 I quickly ran memspeed with 2 ranks on Summit; it worked, but I don't think that is necessarily useful info for you.

@lightsighter
Contributor

This is definitely a Realm issue: there are 251 started copies in log_3.log, but only 244 of them have finished. Are you sure you're on the most recent commit of control_replication? @streichler fixed a bunch of issues with hanging Realm copies recently.
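For anyone who wants to reproduce the count: with -level dma=2 each copy should log a line when it starts and another when it completes, so the two totals can be compared per log file. The exact message text depends on the Realm version, so the patterns below are placeholders, not the actual log strings:

grep -c "<copy-start pattern>" log_3.log    # started copies (251 here)
grep -c "<copy-finish pattern>" log_3.log   # finished copies (244 here)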

@bandokihiro
Author

This is "fixed" by adding the following flag -gex:obcount 256 (higher if it is still not enough). Leaving this open and tagging @streichler so that he can track it and possibly add a cleaner fix
