RFC: set gasnetex by default when building with USE_GASNET/Legion_USE_GASNet #1508

Open
elliottslaughter opened this issue Jul 12, 2023 · 9 comments


@elliottslaughter
Contributor

Realm's gasnetex backend has been stable for some time now. I am wondering if it makes sense to set it as the default, i.e., when USE_GASNET=1 (Makefile build) or Legion_USE_GASNet=ON (CMake build) we set REALM_NETWORKS=gasnetex / Legion_NETWORKS=gasnetex instead of gasnet1.

If you were already setting REALM_NETWORKS/Legion_NETWORKS this would have no effect.

Thoughts?

Here's a sample MR based on this approach: https://gitlab.com/StanfordLegion/legion/-/merge_requests/840
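
The proposed behavior can be modeled as a small decision function. This is only a sketch in Python, not the actual Makefile or CMake logic: the function name `effective_networks` and its parameters are hypothetical, and returning `None` when GASNet is not requested is an assumption made for illustration.

```python
from typing import Optional


def effective_networks(use_gasnet: bool,
                       networks: Optional[str] = None,
                       gasnet_default: str = "gasnetex") -> Optional[str]:
    """Model which network backend a build would end up with.

    use_gasnet     -- stands in for USE_GASNET=1 / Legion_USE_GASNet=ON
    networks       -- stands in for an explicit REALM_NETWORKS / Legion_NETWORKS
    gasnet_default -- backend picked when only use_gasnet is given
    """
    if networks:
        # An explicit setting always wins, so such builds are unaffected.
        return networks
    if use_gasnet:
        return gasnet_default
    # Assumption: no network backend is selected otherwise.
    return None


print(effective_networks(True))                            # gasnetex
print(effective_networks(True, gasnet_default="gasnet1"))  # gasnet1
print(effective_networks(True, networks="gasnet1"))        # gasnet1
```

The second call models the current default; only builds that rely on that default would see a change.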

@mariodirenzo

I'm not sure if the issue has been fixed, but AFAIK the gasnetex layer requires setting the runtime flag -gex:obcount to a large number that depends on the number of nodes involved in a calculation. The last time I saw an execution fail because of a missing -gex:obcount, there was no error message, and it was up to the user to work out that the flag was required. I believe that switching to the gasnetex backend by default without fixing this issue could break many applications that need to run at large scale.

@syamajala
Contributor

I have run S3D on 2048 nodes (16k ranks) on Frontier using gasnetex without needing -gex:obcount. This could be application dependent though.

@eddy16112
Contributor

Yes, I have seen hangs without -gex:obcount as well.

@manopapad
Contributor

I'm generally in favor. gasnetex has shown better performance than gasnet1 for legate, but we have run into some papercuts when using gasnetex that don't happen with gasnet1. These are a bit old, so they may have been fixed since.

  • Slow memory registration at startup: gasnetex would try to register all of -ll:csize with the NIC, and that could lead to multi-minute delays at startup.

  • -gex:obcount issue: With gasnet1 there was only a small number of endpoints per rank, each requiring the allocation of an output buffer. With gasnetex there are also 2 endpoints per GPU (one for fsize and one for ib_fsize), because GPUs can communicate directly. The default limit may therefore be too small if all endpoints need to be instantiated, which happens with all-to-all communication. Realm can't calculate a good setting at gasnet initialization time, because at that point it doesn't know how many GPU processors are going to be created. Setting -gex:obcount to (4 + 2 * gpus/node) * nodes should be sufficient for the worst-case scenario, but this pessimistic setting will almost always be overkill (unless you have true all-to-all communication). Previously, an insufficient value would result in a hang.

  • Too many simultaneous local client threads: With -gex:immediate 1 (the default), meta-tasks inject AM requests directly into gasnet. gasnet records the ID of every thread that has ever called into it in a statically-sized data structure, and this limit can be reached at larger scales, producing an error message like GASNet Extended API: Too many simultaneous local client threads (limit=256). To raise this limit, configure GASNet using --with-max-pthreads-per-node=N. We work around this by setting -gex:immediate 0.
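
For reference, the worst-case -gex:obcount formula from the second bullet can be written out as a small helper. The function name `worst_case_obcount` and the node and GPU counts in the example are illustrative assumptions, not values taken from any real run.

```python
def worst_case_obcount(nodes: int, gpus_per_node: int) -> int:
    """Pessimistic -gex:obcount value: (4 + 2 * gpus/node) * nodes."""
    if nodes < 1 or gpus_per_node < 0:
        raise ValueError("nodes must be >= 1 and gpus_per_node must be >= 0")
    return (4 + 2 * gpus_per_node) * nodes


# Illustrative machine sizes only.
print(worst_case_obcount(nodes=16, gpus_per_node=4))    # 192
print(worst_case_obcount(nodes=2048, gpus_per_node=8))  # 40960
```

The result grows linearly with the node count, which is why this setting is usually far more than a run without true all-to-all communication needs.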

@streichler
Contributor

> Slow memory registration at startup

Fixed.

> -gex:obcount issue

Not fixed. Fixing should be a gate to switching gasnetex to the default.

> Too many simultaneous local client threads

Requires both -gex:immediate 1 (the default) and also -ll:force_kthreads (not the default, but a very common opt-in) to occur. Should eventually be fixed, but probably doesn't need to be a blocker.

@elliottslaughter
Contributor Author

> -gex:obcount issue
>
> Not fixed. Fixing should be a gate to switching gasnetex to the default.

Could someone explain what this issue actually is? What causes it, and what options do we have for fixing it? Is there a timeline on a fix?

> Too many simultaneous local client threads
>
> Requires both -gex:immediate 1 (the default) and also -ll:force_kthreads (not the default, but a very common opt-in) to occur. Should eventually be fixed, but probably doesn't need to be a blocker.

Same.

@lightsighter
Contributor

> Could someone explain what this issue actually is? What causes it, and what options do we have for fixing it? Is there a timeline on a fix?

I talked with @streichler. My understanding of this issue is that Realm makes a certain number of output buffers for receiving messages from senders in the registered memory segment for GASNet. Right now Realm assigns these buffers to senders as it receives messages, but once a buffer is assigned to a sender it will always be associated with that sender. If you run out of buffers because they've all been assigned to other senders, then nothing can receive the incoming message and we hang. As @streichler suggested, there are two solutions:

  1. Allocate enough of these buffers on every node using the formula that Manolis gave above to ensure that there are enough buffers so every sender can have one and therefore you'll never run out. This is overly pessimistic and results in memory consumption on each node that grows with the scale of the machine, so while it is a sound work-around it's not a reasonable one if we're going to care about scalability.
  2. Have a way to temporally multiplex (share) buffers between endpoints so that multiple endpoints can receive messages into the same buffer in a non-interfering way.

Sean says:

> I think (2) is the better answer, but it requires messing with a bunch of finicky and aggressive multi-threaded code and neither breaking it nor slowing it down

I think @eddy16112 is going to take a look at this now that he has a reproducer, although it's likely going to take a while to get all the details right and get it tested well.
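
The failure mode described above can be illustrated with a toy model. This is a sketch of the behavior as explained in this comment, not Realm's actual code: the class `StickyBufferPool` and the function `first_stalled_peer` are hypothetical names, and the real hang is represented here by returning `None`.

```python
from typing import Hashable, Iterable, Optional


class StickyBufferPool:
    """Toy pool: a buffer stays bound to the first peer that claims it."""

    def __init__(self, obcount: int):
        self.obcount = obcount
        self.buffer_of = {}  # peer -> buffer index, never released

    def acquire(self, peer: Hashable) -> Optional[int]:
        if peer in self.buffer_of:
            return self.buffer_of[peer]
        if len(self.buffer_of) >= self.obcount:
            # Every buffer already belongs to some other peer.
            # In the model this is where the real system would hang.
            return None
        index = len(self.buffer_of)
        self.buffer_of[peer] = index
        return index


def first_stalled_peer(obcount: int, traffic: Iterable[Hashable]) -> Optional[Hashable]:
    """Return the first peer that cannot get a buffer, or None if all can."""
    pool = StickyBufferPool(obcount)
    for peer in traffic:
        if pool.acquire(peer) is None:
            return peer
    return None


print(first_stalled_peer(3, [0, 1, 2, 0, 1]))  # None
print(first_stalled_peer(3, [0, 1, 2, 3]))     # 3
```

Solution (1) corresponds to making `obcount` at least the number of distinct peers. Solution (2) would require `acquire` to reclaim and reuse a buffer, which the model deliberately does not attempt.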

> Same.

GASNet right now has a build parameter that puts a static upper bound on the number of threads that can send active messages, presumably because they have some per-thread data structures for performance. If you run with -gex:immediate 1, Realm will send messages immediately instead of sticking them in a queue for active message sender threads to push out, which means that pretty much any Realm thread can send an active message for things like event triggers, DMAs, and so on. If you also run with -ll:force_kthreads, then Realm will create and destroy many more threads for running tasks on processors. From GASNet's perspective, each of those threads is a new sender thread (GASNet doesn't track when threads are destroyed; once it has seen a thread it assumes it's alive forever). So if you run with both -gex:immediate 1 and -ll:force_kthreads, it's not surprising that you quickly exceed the static upper bound on the number of threads that GASNet sets for sending active messages. I don't think we need to fix this one other than maybe just automatically switching -gex:immediate to 0 if -ll:force_kthreads is set.
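
A toy model of this interaction, for illustration only. It is not GASNet or Realm code: `ClientThreadTable` and `threads_recorded` are hypothetical names, the worker and sender counts are made up, and treating -ll:force_kthreads as "one fresh thread per task" is a deliberate simplification.

```python
class ClientThreadTable:
    """Fixed-capacity record of every thread that has ever sent a message."""

    def __init__(self, limit: int = 256):
        self.limit = limit
        self.seen = set()

    def note_sender(self, thread_id) -> None:
        if thread_id in self.seen:
            return
        if len(self.seen) >= self.limit:
            raise RuntimeError(
                f"Too many simultaneous local client threads (limit={self.limit})")
        # Entries are never removed, even after the thread has exited.
        self.seen.add(thread_id)


def threads_recorded(num_tasks: int, immediate: bool, force_kthreads: bool,
                     workers: int = 8, am_senders: int = 2,
                     limit: int = 256) -> int:
    """Count the distinct sender threads the table sees across num_tasks tasks."""
    table = ClientThreadTable(limit)
    for task in range(num_tasks):
        if not immediate:
            # Messages are queued and pushed out by a few dedicated threads.
            thread_id = ("am_sender", task % am_senders)
        elif force_kthreads:
            # Simplification: every task runs on a brand-new thread.
            thread_id = ("kthread", task)
        else:
            # A fixed set of worker threads sends directly.
            thread_id = ("worker", task % workers)
        table.note_sender(thread_id)
    return len(table.seen)


print(threads_recorded(1000, immediate=True, force_kthreads=False))  # 8
print(threads_recorded(1000, immediate=False, force_kthreads=True))  # 2
```

Calling `threads_recorded(1000, immediate=True, force_kthreads=True)` raises `RuntimeError` once the table is full, which mirrors why switching -gex:immediate to 0 avoids the problem in this model.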

@eddy16112
Contributor

I am trying to figure out if we can turn the obcount hang into an error message telling people to increase the obcount. I reproduced the hang with memspeed. Apparently, the program falls into the overflow code path and never sends the packet, which is where the hang comes from. https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc?ref_type=heads#L1841-1854 @streichler is this code path typical for the obcount hang? Is there any other case that will fall into the same code path? If not, can we raise an "Insufficient obcount" error here?

@eddy16112
Contributor
