Assertion failure (KD tree) or hang #1262
Comments
Attach a debugger and print out all the rectangles in the I'm going to ignore the hang on 2dc0392 unless you can reproduce it on a more recent commit. |
On one of the frozen processes
I have updated to 8f24818 |
Pull and try again. |
On 69eadbc, the assertion is gone. Now the 3-node case hangs. The 2-node case still runs to completion. I tried looking at backtraces but nothing stood out to me. The only thing that seems to change between processes is the state of the gasnet poller. I saw things like
but I don't know exactly what it means. I am attaching the 18 backtraces here; sorry I wasn't able to identify the problem more precisely. |
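When comparing many per-rank backtraces like these, it can help to group ranks by identical call stacks so that the outliers stand out. The following is only a minimal sketch, assuming each backtrace is plain gdb `bt` text with one `#N ...` line per frame. The helper names (`frame_functions`, `group_by_stack`) and the function names in the sample are made up for illustration and are not taken from the attached files.

```python
import re
from collections import defaultdict

# Rough pattern for gdb frame lines such as
#   "#0  0x00007f12 in poll () from /lib64/libc.so.6"
#   "#1  hypothetical_wait (cond=0x3) at wait.cc:5"
# It captures the text between the optional address and the argument list.
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.+?)\s+\(")


def frame_functions(backtrace_text):
    """Return the function names of all frames, in the order gdb printed them."""
    names = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return tuple(names)


def group_by_stack(backtraces):
    """Map each distinct frame sequence to the sorted list of ranks showing it."""
    groups = defaultdict(list)
    for rank, text in backtraces.items():
        groups[frame_functions(text)].append(rank)
    return {stack: sorted(ranks) for stack, ranks in groups.items()}


if __name__ == "__main__":
    # Made-up backtraces for three ranks; real input would be the gdb text files.
    sample = {
        "rank0": "#0  0x00007f12 in poll () from /lib64/libc.so.6\n"
                 "#1  0x0000aa in hypothetical_poller (this=0x1) at poller.cc:10\n",
        "rank1": "#0  0x00007f99 in poll () from /lib64/libc.so.6\n"
                 "#1  0x0000bb in hypothetical_poller (this=0x2) at poller.cc:10\n",
        "rank2": "#0  hypothetical_wait (cond=0x3) at wait.cc:5\n",
    }
    # List the rarest stacks first, since those are the likely outliers.
    groups = group_by_stack(sample)
    for stack, ranks in sorted(groups.items(), key=lambda kv: len(kv[1])):
        print(ranks, " <- ".join(stack))
```

Addresses and argument values are deliberately ignored, so two ranks blocked in the same place compare equal even when their pointers differ. Output from `thread apply all bt` would first need to be split per thread, which this sketch does not do.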
Please leave the hanging processes on sapling so they can be debugged, and run with |
These results come from Summit, trying to reproduce on sapling right now. It does hang with
|
The same configuration with 18 ranks on a single sapling node didn't hang. Same for 2 sapling nodes. |
I tried to reproduce on sapling using the 16 GPUs; it didn't reproduce. I tried on Piz Daint with 18 ranks; it also didn't reproduce. Since the only thing I could tell from the backtraces was that there is some failure at the gasnetex interface, I tried gasnet1 on Summit. It didn't reproduce. Maybe @streichler can chime in. |
On which machine did the hang occur before? What was the result of running with |
The hang was obtained on Summit using 3 nodes and 18 ranks only with gasnetex. Attached are the 18 backtraces
|
That the hang occurs only with gasnetex reminds me that my program also runs extremely slowly with gasnetex. Have you tried running with |
Thanks for the suggestion. Without binding, it doesn't hang. But I need this for best performance. |
I think it is a realm issue, and I just created an issue for my program. #1265 |
This depends on the configuration (problem size + number of ranks), so I don't think it is the same issue. Runs that do not hang are fast with cuda binding. |
I just ran my program several times, and it is not only slower but also sometimes hangs. I just reproduced the hang with the Realm memspeed benchmark under |
I agree with @eddy16112 that this is most likely a Realm/GASNet issue. Legion hangs usually will not reproduce with |
Attached are the logs |
@eddy16112 I quickly ran memspeed with 2 ranks on Summit. It worked, but I don't think that is necessarily useful info to you. |
This is definitely a Realm issue, there are 251 started copies in |
This is "fixed" by adding the following flag |
With 58a56ef, I have the following assertion failure on a 3-node 18-rank configuration
gdb0.txt
gdb1.txt
This doesn't reproduce for a 2-node 12-rank configuration.
With 2dc0392 (I noticed some kd-tree related recent commits), the same 3-node 18-rank configuration hangs with the following unusual backtraces
gdb0.txt
gdb2.txt