RFC: set gasnetex by default when building with USE_GASNET/Legion_USE_GASNet #1508
Comments
I'm not sure if the issue has been fixed but AFAIK the |
I have run S3D on 2048 nodes (16k ranks) on Frontier using gasnetex without needing |
Yes, I have seen hangs without |
I'm generally in favor.
|
Fixed.
Not fixed. Fixing should be a gate to switching gasnetex to the default.
Requires both |
Could someone explain what this issue actually is? What causes it, and what options do we have for fixing it? Is there a timeline on a fix?
Same. |
I talked with @streichler. My understanding of this issue is that Realm makes a certain number of output buffers for receiving messages from senders in the registered memory segment for GASNet. Right now Realm assigns these buffers to senders as it receives messages, but once a buffer is assigned to a sender it will always be associated with that sender. If you run out of buffers because they've all been assigned to other senders then nothing can receive the incoming message and we hang. As @streichler suggested there are two solutions:
Sean says:
I think @eddy16112 is going to take a look at this now that he has a reproducer although it's likely going to take a while to get all the details right and get it tested well.
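To make the failure mode described above concrete, here is a minimal toy model in Python. It is not Realm's code: the class name, method name, and the use of `obcount` as a constructor argument are all hypothetical, and the only behavior it models is the one stated in this thread (a fixed pool of buffers, each permanently tied to the first sender that claims it).

```python
class StickyBufferPool:
    """Toy model only: a fixed pool where a buffer, once assigned to a
    sender, stays with that sender forever (hypothetical, not Realm code)."""

    def __init__(self, obcount):
        self.obcount = obcount  # total number of buffers in the pool
        self.owner = {}         # sender -> index of the buffer it owns

    def try_receive(self, sender):
        """Return the buffer index used for this sender, or None if no
        buffer can ever be assigned to it."""
        if sender in self.owner:
            # Known sender: reuse the buffer it already owns.
            return self.owner[sender]
        if len(self.owner) < self.obcount:
            # New sender and a free buffer remains: assign it permanently.
            idx = len(self.owner)
            self.owner[sender] = idx
            return idx
        # Every buffer belongs to some other sender. In this model that is
        # the condition the thread describes as leading to a hang.
        return None


pool = StickyBufferPool(obcount=2)
results = [pool.try_receive(s) for s in ["rank0", "rank1", "rank0", "rank2"]]
print(results)
```

In this sketch the third sender can never be served once two other senders hold the two buffers, which is why raising the buffer count avoids the problem and why the exhausted branch is a natural place to report an error rather than stall.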
GASNet right now has a build parameter that puts a static upper bound on the number of threads that can send active messages, presumably because they have some per-thread data structures for performance. If you run with |
I am trying to figure out if we can turn the obcount hang into an error message telling people to increase the obcount. I reproduced the hang with the memspeed. Apparently, the program will fall into the overflow code path and never send the packet, which is the place the hang comes from. https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc?ref_type=heads#L1841-1854 @streichler is this code path typical for the obcount hang? Is there any other case that will fall into the same code path? If not, can we throw an error message of Insufficient obcount here? |
Just for myself, the updated line numbers are https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc#L1890-1904 |
Realm's gasnetex backend has been stable for some time now. I am wondering if it makes sense to set it as the default, i.e., when USE_GASNET=1 (Makefile build) or Legion_USE_GASNet=ON (CMake build) we set REALM_NETWORKS=gasnetex / Legion_NETWORKS=gasnetex instead of gasnet1.
If you were already setting REALM_NETWORKS / Legion_NETWORKS this would have no effect.
Thoughts?
Here's a sample MR based on this approach: https://gitlab.com/StanfordLegion/legion/-/merge_requests/840
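The selection rule proposed here can be sketched as a small Python function. This is only an illustration of the logic, not the actual Makefile or CMake change in the MR: the function name and parameters are hypothetical, and returning an empty string when GASNet is not requested is an assumption made to keep the example self-contained.

```python
def resolve_networks(use_gasnet, networks=None, default_backend="gasnetex"):
    """Hypothetical model of the proposed default.

    use_gasnet      -- stands in for USE_GASNET=1 / Legion_USE_GASNet=ON
    networks        -- stands in for an explicit REALM_NETWORKS / Legion_NETWORKS
    default_backend -- backend chosen when GASNet is requested without an
                       explicit network list
    """
    if networks:
        # An explicit setting always wins, so existing builds are unaffected.
        return networks
    if use_gasnet:
        # Proposed behavior: fall back to gasnetex rather than gasnet1.
        return default_backend
    # Assumption for this sketch: no network backend selected otherwise.
    return ""


print(resolve_networks(True))             # GASNet requested, nothing explicit
print(resolve_networks(True, "gasnet1"))  # explicit choice is preserved
```

The first branch is the important one for compatibility: anyone who already sets the networks variable keeps exactly the backend they asked for.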