Fix some issues with dynamic algorithm selection in coll/tuned #8198
The mca parameters coll_tuned_*_algorithm are ignored unless coll_tuned_use_dynamic_rules is true, so mention that in the description.
Signed-off-by: Joseph Schuchart <[email protected]> (cherry picked from commit 06f605c)
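As a rough illustration of the behaviour this commit documents, the sketch below models how a forced algorithm only takes effect when dynamic rules are enabled. It is not Open MPI code: the function name `effective_forced_algorithm` is hypothetical, and treating the value 0 as "no forced algorithm" is an assumption made for the sketch.

```python
def effective_forced_algorithm(use_dynamic_rules: bool, forced_algorithm: int) -> int:
    """Return the algorithm id that would actually be honoured.

    Hypothetical model: a coll_tuned_*_algorithm value is only considered
    when coll_tuned_use_dynamic_rules is true. The value 0 is assumed here
    to mean "no forced algorithm, use the built-in decision".
    """
    if not use_dynamic_rules:
        # The forced value is silently ignored without dynamic rules.
        return 0
    return forced_algorithm


if __name__ == "__main__":
    # Forcing algorithm 3 without dynamic rules has no effect in this model.
    print(effective_forced_algorithm(False, 3))
    # With dynamic rules enabled, the forced value is honoured.
    print(effective_forced_algorithm(True, 3))
```

In practice this means both parameters have to be set together for a forced algorithm to be used.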
Signed-off-by: Joseph Schuchart <[email protected]> (cherry picked from commit 7261255)
…d fall back to linear
Bcast: scatter_allgather and scatter_allgather_ring expect N_elem >= N_procs
Allreduce: rabenseifner expects N_elem >= pow2 nearest to N_procs
In all cases, the implementations will fall back to a linear implementation, which will most likely yield the worst performance (noted for 4B bcast on 128 ranks)
Signed-off-by: Joseph Schuchart <[email protected]> (cherry picked from commit 04d198f)
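The preconditions named in this commit message can be sketched as simple checks. This is a minimal model, not the coll/tuned implementation: the function names are hypothetical, and "pow2 nearest to N_procs" is interpreted here as the largest power of two not exceeding N_procs, which is an assumption.

```python
def largest_pow2_at_most(n: int) -> int:
    """Largest power of two that is <= n, for n >= 1."""
    return 1 << (n.bit_length() - 1)


def bcast_scatter_allgather_ok(n_elem: int, n_procs: int) -> bool:
    """scatter_allgather and scatter_allgather_ring expect N_elem >= N_procs."""
    return n_elem >= n_procs


def allreduce_rabenseifner_ok(n_elem: int, n_procs: int) -> bool:
    """rabenseifner expects N_elem >= pow2 nearest to N_procs
    (assumed: the largest power of two not exceeding N_procs)."""
    return n_elem >= largest_pow2_at_most(n_procs)


if __name__ == "__main__":
    # A 4-byte bcast on 128 ranks, assuming it is a single 4-byte element:
    # the precondition fails, so the implementation would fall back to linear.
    print(bcast_scatter_allgather_ok(1, 128))
    # An allreduce of 64 elements on 100 ranks meets the assumed threshold of 64.
    print(allreduce_rabenseifner_ok(64, 100))
```

A decision function could run checks like these before selecting an algorithm, so that a case which would fall back to linear gets a different algorithm instead.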
I added a commit that removes the selection of linear algorithms in allreduce and allgather. In my measurements the latency for these ranges is higher than necessary, and I don't see how that is motivated by previous measurements (it seems unlikely to me that linear algorithms perform well at several dozen or hundreds of ranks).
@devreal ICYMI, something was off with allgatherv too (I'd tested with the 4.1.x branch #8186 (comment)). Is that something you are seeing?
@rajachan I have not yet looked at allgatherv. I can run some tests for that overnight and see. Do you remember at what scales things were weird?
I was running with ~1K ranks (32 nodes with 36 ranks per node).
Btw, your master PR is missing the allreduce/allgather commit. |
Oops, pushed to the wrong branch. Will fix in a minute.
Nice catch |
Signed-off-by: Joseph Schuchart <[email protected]> (cherry picked from commit 22e289b)
…lgather
These selections seem harmful in my measurements and don't seem to be motivated by previous measurement data.
Signed-off-by: Joseph Schuchart <[email protected]> (cherry picked from commit a15e5dc)
0f89397 to 3cae9f7
This PR addresses a potential performance issue with the algorithm selection in coll/tuned and some minor issues found while digging into it:
Backport of #8186 to v4.1.x