Fix some issues with dynamic algorithm selection in coll/tuned #8186
rajachan merged 5 commits into open-mpi:master
The MCA parameters coll_tuned_*_algorithm are ignored unless coll_tuned_use_dynamic_rules is true, so mention that in the description.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
…d fall back to linear
Bcast: scatter_allgather and scatter_allgather_ring expect N_elem >= N_procs
Allreduce: rabenseifner expects N_elem >= pow2 nearest to N_procs
In all cases, the implementations will fall back to a linear implementation, which will most likely yield the worst performance (noted for 4B bcast on 128 ranks)
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
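The preconditions named in this commit message can be sketched as small predicates that a decision function might check before selecting one of these algorithms. This is only an illustration, not the code in coll/tuned: the function names are hypothetical, and "pow2 nearest to N_procs" is assumed here to mean the largest power of two that does not exceed N_procs.

```c
#include <stddef.h>

/* Largest power of two that does not exceed n (for n >= 1).
 * Assumption: this is what "pow2 nearest to N_procs" refers to. */
static int pow2_floor(int n)
{
    int p = 1;
    while (p <= n / 2) {
        p *= 2;
    }
    return p;
}

/* Hypothetical check: nonzero if scatter_allgather or scatter_allgather_ring
 * has enough elements to run (N_elem >= N_procs). */
static int bcast_scatter_allgather_usable(size_t n_elem, int n_procs)
{
    return n_elem >= (size_t)n_procs;
}

/* Hypothetical check: nonzero if rabenseifner has enough elements to run
 * (N_elem >= power of two derived from N_procs). */
static int allreduce_rabenseifner_usable(size_t n_elem, int n_procs)
{
    return n_elem >= (size_t)pow2_floor(n_procs);
}
```

For the case noted above (a 4-byte bcast on 128 ranks), the element count is at most 4, so `bcast_scatter_allgather_usable` returns 0 and a different algorithm should be chosen rather than relying on the linear fallback.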
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
wckzhang left a comment
Looks fine. Since most of the tuning data was collected on power-of-2 process counts, we should have considered the non-pow-2 fallbacks and done slightly finer-grained tuning. This likely applies to other collectives.
Are we going to need this for 4.1.x? It does seem to fix a serious performance regression.
Yes, I will backport that to 4.1.x later today.
@devreal are these the only collectives you saw regressions with? I saw a similar issue with Allgatherv in the 4.1.x branch. Will redo my tests this afternoon to verify.
…lgather
These selections seem harmful in my measurements and don't seem to be motivated by previous measurement data.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@rajachan There is indeed a problem with allgatherv: I believe decisions were generated based on the output of the OSU benchmark, which reports the number of bytes sent by each process. However, the decision logic uses the number of bytes to be received by each process. I'm working on a quick fix based on my measurements. Unfortunately, the v4.1.x backport of this PR is already merged, so I will create new PRs for master and v4.1.x.
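The mismatch described in this comment can be illustrated with a short sketch. The helper names are hypothetical and this is not the decision code in coll/tuned. It assumes that the receive size counts every rank's contribution, including the local one, and that the benchmark sends the same number of bytes from every rank.

```c
#include <stddef.h>

/* Bytes a single rank receives in an allgatherv:
 * the sum of all ranks' contributions (assumed to include its own). */
static size_t allgatherv_recv_bytes(const int *recvcounts, int n_procs,
                                    size_t type_size)
{
    size_t total = 0;
    for (int i = 0; i < n_procs; ++i) {
        total += (size_t)recvcounts[i] * type_size;
    }
    return total;
}

/* Convert a per-rank send size, as a benchmark would report it, into the
 * per-rank receive size, assuming all ranks send the same amount. */
static size_t recv_bytes_from_send_bytes(size_t send_bytes, int n_procs)
{
    return send_bytes * (size_t)n_procs;
}
```

Under these assumptions the two quantities differ by a factor of N_procs, so thresholds derived from the sent size would be applied at the wrong message sizes unless they are converted first.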
Great, thanks!
We should address this in the collectives tuning scripts so we don't run into this again the next time we tune the defaults (although it is not clear to me right now how we would account for this). Perhaps an issue against https://github.com/open-mpi/ompi-collectives-tuning/ is in order.
This PR addresses a potential performance issue with the algorithm selection in coll/tuned and some minor issues found while digging into it:
coll_tuned_*_algorithm MCA variables should mention that they only take effect if the coll_tuned_use_dynamic_rules variable is set to true.
const.