Conversation

devreal (Contributor) commented Nov 10, 2020

This PR addresses a potential performance issue with the algorithm selection in coll/tuned and some minor issues found while digging into it:

  1. Performance: some bcast and allreduce algorithms require the number of elements to be at least as large as the number of ranks and fall back to a linear implementation if that is not the case. This hit me bitterly with a 4B bcast on 128 ranks on a single node, where latency increased 10x compared to 127 ranks (see the sketch after this list).
  2. The documentation for the coll_tuned_*_algorithm MCA variables should mention that they only take effect if the coll_tuned_use_dynamic_rules variable is set to true.
  3. Mark the global static array used for the MCA parameters as const.
  4. Fix some glitches in comments.
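
To make item 1 concrete, here is a minimal C sketch. It is not the actual Open MPI coll/tuned code; the function names, the threshold, and the algorithm enum are invented for illustration. It only mimics the described interaction: the fixed rules pick scatter_allgather for a large communicator, but the algorithm needs at least one element per rank and otherwise degrades to a linear broadcast.

```c
/* Illustrative sketch only -- not the Open MPI coll/tuned source.
 * All names and thresholds are hypothetical. */
#include <stdio.h>

enum bcast_alg { BCAST_LINEAR, BCAST_BINOMIAL, BCAST_SCATTER_ALLGATHER };

/* Hypothetical fixed-rule selection: large communicators get scatter_allgather. */
static enum bcast_alg fixed_rule(int comm_size)
{
    return (comm_size >= 128) ? BCAST_SCATTER_ALLGATHER : BCAST_BINOMIAL;
}

/* Hypothetical implementation-side guard: scatter_allgather requires
 * count >= comm_size, otherwise it falls back to the linear algorithm. */
static enum bcast_alg effective_alg(enum bcast_alg chosen, int count, int comm_size)
{
    if (chosen == BCAST_SCATTER_ALLGATHER && count < comm_size) {
        return BCAST_LINEAR;   /* the fallback behind the 10x latency hit */
    }
    return chosen;
}

int main(void)
{
    int count = 1;   /* one 4-byte element */
    printf("127 ranks -> algorithm %d\n", effective_alg(fixed_rule(127), count, 127));
    printf("128 ranks -> algorithm %d\n", effective_alg(fixed_rule(128), count, 128));
    return 0;
}
```

With one element, 127 ranks stay on a structured (here: binomial) path while 128 ranks land on the linear fallback, which matches the latency jump described above.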

Backport of #8186 to v4.1.x

The MCA parameters coll_tuned_*_algorithm are ignored unless coll_tuned_use_dynamic_rules is true, so mention that in the description.

Signed-off-by: Joseph Schuchart <[email protected]>
(cherry picked from commit 06f605c)
Signed-off-by: Joseph Schuchart <[email protected]>
(cherry picked from commit 7261255)
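
As a rough illustration of the behavior this commit documents (a sketch under assumptions, not the real coll/tuned decision code; the function and the example values are hypothetical), the forced algorithm value is only consulted on the dynamic-rules path:

```c
/* Hypothetical sketch of the selection gate described above -- not the
 * real Open MPI code. A forced algorithm (coll_tuned_bcast_algorithm)
 * only matters when dynamic rules are enabled. */
#include <stdbool.h>
#include <stdio.h>

/* Returns the forced algorithm id if dynamic rules are on and a non-zero
 * algorithm was forced; -1 stands in for "use the fixed decision rules". */
static int bcast_decision(bool use_dynamic_rules, int forced_algorithm)
{
    if (use_dynamic_rules && forced_algorithm != 0) {
        return forced_algorithm;
    }
    return -1;
}

int main(void)
{
    /* Forcing algorithm 6 has no effect without dynamic rules ... */
    printf("dynamic rules off: %d\n", bcast_decision(false, 6));  /* -1 */
    /* ... and is honored once coll_tuned_use_dynamic_rules is true. */
    printf("dynamic rules on:  %d\n", bcast_decision(true, 6));   /*  6 */
    return 0;
}
```
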
…d fall back to linear

Bcast: scatter_allgather and scatter_allgather_ring expect N_elem >= N_procs
Allreduce: rabenseifner expects N_elem >= pow2 nearest to N_procs

In all cases, these algorithms fall back to a linear implementation,
which will most likely yield the worst performance (observed for a 4B bcast on 128 ranks)

Signed-off-by: Joseph Schuchart <[email protected]>
(cherry picked from commit 04d198f)
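
The two guards named in this commit message can be spelled out in a small sketch. This is illustrative only, not the Open MPI implementation: the helper names are mine, and "pow2 nearest to N_procs" is read here as the largest power of two not exceeding N_procs.

```c
/* Illustrative only -- not the Open MPI implementation. A false result
 * means the algorithm would fall back to its linear variant. */
#include <stdbool.h>
#include <stdio.h>

/* Largest power of two that does not exceed n (assumes n >= 1). */
static int pow2_below(int n)
{
    int p = 1;
    while (2 * p <= n) {
        p *= 2;
    }
    return p;
}

/* Bcast: scatter_allgather and scatter_allgather_ring need N_elem >= N_procs. */
static bool bcast_scatter_allgather_ok(int n_elem, int n_procs)
{
    return n_elem >= n_procs;
}

/* Allreduce: rabenseifner needs N_elem >= the power of two nearest to N_procs. */
static bool allreduce_rabenseifner_ok(int n_elem, int n_procs)
{
    return n_elem >= pow2_below(n_procs);
}

int main(void)
{
    /* One element on 128 ranks fails both checks -> linear fallback. */
    printf("bcast ok: %d, rabenseifner ok: %d\n",
           bcast_scatter_allgather_ok(1, 128),
           allreduce_rabenseifner_ok(1, 128));
    return 0;
}
```
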
devreal (Contributor, Author) commented Nov 11, 2020

I added a commit that removes the selection of linear algorithms in allreduce and allgather. In my measurements, the latency in these ranges is higher than necessary, and I don't see how the linear selection is motivated by previous measurement data (it seems unlikely to me that linear algorithms perform well at several dozen or hundreds of ranks).
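
To illustrate the kind of change meant here: a schematic sketch, not the actual coll/tuned decision tables; the communicator-size threshold, message-size cutoff, and algorithm set are invented. Removing the linear entry simply means the fixed rules pick a structured algorithm in that range instead.

```c
/* Schematic before/after sketch of a fixed decision rule with the linear
 * selection removed -- not the actual coll/tuned decision table. */
#include <stdio.h>

enum allgather_alg { AG_LINEAR, AG_BRUCK, AG_RECURSIVE_DOUBLING, AG_RING };

/* Hypothetical "before": small messages pick linear regardless of comm size. */
static enum allgather_alg decide_before(int comm_size, size_t msg_size)
{
    if (msg_size < 1024) {
        return AG_LINEAR;           /* poor choice at dozens/hundreds of ranks */
    }
    return (comm_size <= 32) ? AG_RECURSIVE_DOUBLING : AG_RING;
}

/* Hypothetical "after": small messages use a structured algorithm instead. */
static enum allgather_alg decide_after(int comm_size, size_t msg_size)
{
    if (msg_size < 1024) {
        return AG_BRUCK;
    }
    return (comm_size <= 32) ? AG_RECURSIVE_DOUBLING : AG_RING;
}

int main(void)
{
    /* 4-byte message on 128 ranks: linear before, bruck after. */
    printf("before: %d, after: %d\n", decide_before(128, 4), decide_after(128, 4));
    return 0;
}
```
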

rajachan (Member) commented:

@devreal ICYMI, something was off with allgatherv too (I'd tested with the 4.1.x branch, see #8186 (comment)). Is that something you are seeing?

devreal (Contributor, Author) commented Nov 11, 2020

@rajachan I have not yet looked at allgatherv. I can run some tests for that overnight and see. Do you remember at what scales things were weird?

rajachan (Member) commented:

I was running with ~1K ranks (32 nodes with 36 ranks per node).

rajachan (Member) commented:

Btw, your master PR is missing the allreduce/allgather commit.

devreal (Contributor, Author) commented Nov 11, 2020

> Btw, your master PR is missing the allreduce/allgather commit.

Oops, pushed to the wrong branch. Will fix in a minute

wckzhang (Contributor) commented:

> Btw, your master PR is missing the allreduce/allgather commit.

Nice catch

Signed-off-by: Joseph Schuchart <[email protected]>
(cherry picked from commit 22e289b)
…lgather

These selections seem harmful in my measurements and don't seem to be
motivated by previous measurement data.

Signed-off-by: Joseph Schuchart <[email protected]>
(cherry picked from commit a15e5dc)
devreal force-pushed the fix-tuned-dynamic-v4.1.x branch from 0f89397 to 3cae9f7 on November 11, 2020 17:43
jsquyres merged commit 6f21a39 into open-mpi:v4.1.x on Nov 12, 2020