Skip to content

Conversation

@bosilca
Copy link
Member

@bosilca bosilca commented Jan 16, 2021

This PR provides an alternative solution to #8283, where all temporaries are handled as packed data resulting in less memory allocated for storing the temporary data.

Signed-off-by: George Bosilca bosilca@icl.utk.edu

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@bosilca bosilca requested a review from mkurnosov January 16, 2021 04:20
@bosilca bosilca added this to the v5.0.0 milestone Jan 16, 2021
@ggouaillardet
Copy link
Contributor

@bwbarrett gcc5 test failed because of a full disk

ar: `u' modifier ignored since `D' is the default (see `U')
ar: .libs/libmpi_usempif08_pmpi.a: No space left on device
Makefile:2077: recipe for target 'libmpi_usempif08_pmpi.la' failed

@ggouaillardet
Copy link
Contributor

@bosilca is this supposed to always work on an heterogeneous cluster (mixing big and little endian for example) ?

Copy link
Contributor

@mkurnosov mkurnosov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bosilca thank you. I did not find any bugs. I have tested for contiguous and non-contiguous data types, for all roots and MPI_IN_PLACE.

One proposal: it is possible to reduce memory consumption in non-root and non-leaf processes: scount * (size + 1) / 2 it is an upper bound. We can compute subtree size and allocate buffer of corresponding size:

int vparent = (bmtree->tree_prev - root + size) % size;
int subtree_size = vrank - vparent;
if (size - vrank < subtree_size)
    subtree_size = size - vrank;
packed_size = scount * subtree_size;

@bosilca
Copy link
Member Author

bosilca commented Jan 16, 2021

@ggouaillardet this version is almost heterogeneous. The only remaining issue is that all participants except the root, will need to create a convertor for the root architecture (from the ompi_proc_t structure corresponding to the root) and use this convertor to unpack the local part of the message. This is true for non-leaves (where we already have an unpack operation), but also for the leaves where we will need a temporary buffer and an unpack.

@bosilca
Copy link
Member Author

bosilca commented Jan 16, 2021

@mkurnosov sure, it makes sense to reduce the temporary memory for non-leaves. If we are looking at optimizations, we could also avoid packing the local part on the root, and or interleaving pack and send on the root (which will also lead to less memory required for temporaries).

@jsquyres
Copy link
Member

jsquyres commented Feb 2, 2021

bot:aws:retest

@gpaulsen
Copy link
Member

gpaulsen commented Mar 2, 2021

bot:aws:retest

@jsquyres
Copy link
Member

@bosilca Should we merge this PR? (and close #8283 and cherry pick to v4.0.x and v4.1.x, and then close #8285?)

@bosilca
Copy link
Member Author

bosilca commented Mar 15, 2021

I'm done with this. I was under the impression that @mkurnosov wanted to add some optimizations, but it's entirely up to him.

@jsquyres
Copy link
Member

I think that we should merge this PR, and if optimization comes in later, great. This fixes a real bug, so we should go ahead and merge it.

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by proxy for @mkurnosov

@mkurnosov
Copy link
Contributor

@bosilca @jsquyres I am sorry, I will prepare a small fix in the near future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants