
Big MPI---large-count and displacement support--collective chapter #80

Closed
jeffhammond opened this issue Feb 12, 2018 · 37 comments
Labels
scheduled reading (Reading is scheduled for the next meeting) · wg-collectives (Collectives Working Group) · wg-large-counts (Large Counts Working Group)


@jeffhammond
Member

jeffhammond commented Feb 12, 2018

Problem

Sending more than 2Gi elements in MPI is a pain.

The general strategy for implementing large-count operations is to use datatypes. In some cases, this is straightforward, but it appears to be a very poor solution in the case of v-collectives and reductions. In order to use the datatype solution for v-collectives, one has to map (counts[],type) to (newcounts[],newtypes[]), which then requires the w-collective, since only it takes a vector of types. For reductions, one has to unwind the datatype inside of a user-defined reduction. None of the solutions available outside of MPI work for nonblocking collectives, due to the allocation of temporary vector arguments. If it is possible with generalized requests, it is onerous.

A more subtle issue is the large-displacement problem, which exists even if all of the counts are less than INT_MAX, because of the limitations of the offset vector. If the sum of counts[i] up to any i < comm_size exceeds INT_MAX, then displs[i] will overflow. This means that one cannot use any of the v-collectives for relatively small data sets, e.g. 3 billion floats, which is only 12 GB per process. This is likely to be limiting when implementing 3D FFTs, matrix transposes, and I/O aggregation, all of which are likely to use v-collectives. Neighborhood collectives fixed the large-displacement problem, but if a user wants to use those as a drop-in replacement, they have to create a new communicator.

The displacement issue is exacerbated in the large-count case because all the displacements are interpreted in bytes rather than the extent of the datatype, so there is no way to index beyond 2GB of data, irrespective of the datatype and the counts.

Using the w-collective for large-count v-collectives has these issues:

  • Calling the w-collectives requires the allocation and assignment of O(nproc) vectors, which is tedious but certainly not a memory issue if one is in the large-count regime.
  • One cannot deallocate the argument vectors until the operation completes, which means that one cannot implement the nonblocking case, since there is no opportunity to deallocate the temporary vectors in the wait call (any solution involving generalized requests is almost certainly untenable for most users).
  • Because MPI_ALLTOALLW takes displacements of type int and interprets them in bytes irrespective of the extent of the datatype (see page 173 of MPI-3), it is hard to index more than 2 GB of data *using any datatype*. There is a workaround using datatypes with the offset encoded internally (e.g. via MPI_Type_create_struct), but it is far from user-friendly.

In the absence of proper support in the MPI standard, the most reasonable implementation of large-count v-collectives uses point-to-point, which means that users must make relatively nontrivial changes to their code to support large counts, or they have to use something like BigMPI, which already implements these functions (vcollectives_x.c). An RMA-based implementation is also possible, but users are unlikely to accept this suggestion.

One can also map the v-collectives to MPI_Neighbor_alltoallw, but in a far-from-efficient manner, and this is not particularly useful for the nonblocking case because MPI_Dist_graph_create_adjacent is blocking.

Proposal

The straightforward, user-friendly solution to this problem is to add new functions that use MPI_Count and MPI_Aint for counts and displacements, respectively.

We are not proposing to add new functions for everything, just the standard collectives (neighborhood collectives will be proposed later as a separate ticket).

Adding _x versions of the v-collectives and w-collectives, with counts of type MPI_Count and displacement vectors of type MPI_Aint[], is the most direct solution and spares users from having to allocate and fill O(nproc) vectors in the course of mapping to the most general collective available (e.g. MPI_NEIGHBOR_ALLTOALLW).

We add reductions (reduce, allreduce, reduce_scatter, reduce_scatter_block, scan, exscan) as well, with the limitation that user-defined reductions are not supported because these would require a new version of MPI_User_function, MPI_Op_create, and MPI_Op_free, which is error-prone. For user-defined reductions, it is feasible to use user-defined datatypes without an obvious loss of efficiency. Furthermore, there are other issues (mpi-forum/mpi-forum-historic#339) with user-defined reductions that should be addressed if this change is made.

Alternative solution

Another solution would be to add large-count support to derived datatypes, e.g. MPI_Type_contiguous_x, but this is not user-friendly. We should not ask users to start using derived datatypes to broadcast a contiguous array of 2.2 billion elements, for example.

Changes to the Text

These changes have been made in https://github.com/mpi-forum/mpi-standard/pull/34.

Impact on implementations

BigMPI implements large-count variants of most of the proposed functions, sometimes in more than one way. For example, large-count blocking collectives were implemented using point-to-point, neighbor_alltoallw, and one-sided. Nonblocking collectives are a problem, which is one of the big motivations for this ticket.

The implementation inside MPI libraries is straightforward, assuming they convert message sizes to bytes internally and correctly support, e.g., one billion elements of a 4-byte type.

Impact on Users

This ticket is the result of user complaints about MPI (e.g. http://gentryx.de/news_the_troubling_state_of_MPI.html, which was prominently cited in https://www.hpcwire.com/2014/04/30/time-look-beyond-mpi/).

The BigMPI project thoroughly evaluated the Forum's contention that datatypes were sufficient to address the large-count issue and found that this solution is unlikely to satisfy the majority of users, due to a number of performance and usability issues.

References

@tonyskjellum

tonyskjellum commented Sep 5, 2018

We are going to read this in Barcelona. Just this base ticket, not all its relatives that were spawned on June 14 (97, 98, 99, 100). We will bring those forward later. Tickets #98, #99, and #100 are all important and no more controversial than this ticket (#80), while #97 remains highly controversial. Also, all the WITH_INFO tickets await resolution of ticket #80 and the other Big MPI tickets before proceeding.

The latest text for Ticket #80 is here:

mpi32-report-ticket80-04sep2018.pdf [reductions]

(Note there is other work that we need to consider under s-collectives and v-collectives, but they are not part of this pull request.)

@hritzdorf

There are some small errors in the argument lists:

Page 164, Line 20: add INTENT(IN) ::
INTEGER(KIND = MPI_COUNT_KIND) sendcounts(*), recvcount -> INTEGER(KIND = MPI_COUNT_KIND), INTENT(IN) :: sendcounts(*), recvcount

Line 21: add ::
INTEGER(KIND = MPI_ADDRESS_KIND), INTENT(IN) displs(*) -> INTEGER(KIND = MPI_ADDRESS_KIND), INTENT(IN) :: displs(*)

Line 30: add a space between ) and DISPLS
INTEGER(KIND = MPI_ADDRESS_KIND)DISPLS(*) -> INTEGER(KIND = MPI_ADDRESS_KIND) DISPLS(*)

Page 207, Line 16: remove root
INTEGER(KIND = MPI_COUNT_KIND), INTENT(IN) :: count, root -> INTEGER(KIND = MPI_COUNT_KIND), INTENT(IN) :: count

Page 213, Lines 19, 20, 30: same corrections as Page 164

@tonyskjellum

tonyskjellum commented Sep 5, 2018 via email

@tonyskjellum

I am having trouble with Git so this is delaying publishing a new version; not sure why we are no longer seeing those repos.

@jeffhammond
Member Author

I apologize for Git issues. Nobody has ever tried to contribute to the large-count effort before so I was not aware that I was the only person who could write to the repo.

Everyone in the GitHub group now has write access. Anyone who wants to contribute just needs to request access to that group.

@tonyskjellum tonyskjellum changed the title Big MPI---large-count support Big MPI---large-count and displacement support Sep 19, 2018
@tonyskjellum

At the Barcelona WG meeting, @jdinan suggested that everyone in the HPC world is moving to a 64-bit ABI (ILP64); that would make default integers 64-bit. See https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models .

@tonyskjellum

Noting that the topology chapter is covered neither by this version of the ticket nor by the proposed reading material. A separate ticket will be made for that so this one can proceed. If there is an objection at the reading that this ticket does not address the topology chapter, we will point to the second ticket.

@tonyskjellum

tonyskjellum commented Sep 19, 2018

Rolf notes that MPI_Alltoallw is inconsistent in its definition: it has byte displacements, yet they are declared as int, not MPI_Aint. Therefore, the new API must account for this inconsistency and should handle it via MPI_Aint for displacements; that is in fact what is currently proposed in the pull request as written.

So, we need a ticket as an "Advice to Users."

@tonyskjellum

tonyskjellum commented Sep 19, 2018

Per Rolf, there are two kinds of displacements: index displacements within an array (declared as int), and byte displacements (declared as MPI_Aint)...

For index displacements within an array, all arithmetic (for example, count = disp2 - disp1) is done with the normal built-in plus and minus operators.

For byte displacements, they can always be used as relative displacements from the beginning of a buffer, or as absolute displacements (relative to MPI_BOTTOM). Thus, they must always be MPI_Aint. Additionally, the difference of two displacements should always be calculated with MPI_Aint_diff(), not with the arithmetic minus (-) operator; the same applies to MPI_Aint_add() for the sum of an absolute address and a relative displacement.

Therefore:

  • MPI_Count is a fine replacement for int everywhere that a count appears [no controversy]
  • Where there are index displacements, we should replace int with MPI_Count, because this reflects the difference of such indices (not MPI_Aint)
  • Where there are byte displacements, we should keep MPI_Aint where it is already specified, and repair in the _x versions any APIs that previously got this wrong. For example, we noted that MPI_Alltoallw is such a case.

It is necessary that the integer representing MPI_Count be at least as wide as the integer representing MPI_Aint. This rule is already in the standard [see p. 17 of the MPI-3.1 standard, Section 2.5.8 Counts, lines 15-19]. In MPI-3.0 we already have MPI_GET_EXTENT_X, which uses MPI_Count, so MPI_Count is not new.

What we are recommending is to change the text of this proposal as follows: We will not put MPI_Aint on all displacements. We will put MPI_Aint on displacements involving bytes; we will put MPI_Count on displacements that are of index type.

@tonyskjellum tonyskjellum changed the title Big MPI---large-count and displacement support Big MPI---large-count and displacement support--collective chapter Sep 19, 2018
@jeffhammond
Member Author

jeffhammond commented Sep 19, 2018 via email

@jeffhammond
Member Author

jeffhammond commented Sep 19, 2018 via email

@tonyskjellum

tonyskjellum commented Sep 23, 2018

The key outcome of the reading is the plan for a holistic look at the API across the entire standard; a voting strategy followed by a final vote on the entire API addition was discussed and accepted as Forum-compliant (by acclamation / without objection).

There were no specific objections to the API as presented currently in this ticket (ticket #105).

It was pointed out that we still have more tickets to write and implement besides those already open for "Big MPI." We have to look at the entire standard end-to-end.

The current goal is to read "all" Big MPI tickets in December.

@dholmes-epcc-ed-ac-uk
Member

dholmes-epcc-ed-ac-uk commented Sep 23, 2018

Note also the creation of issue #107 and, in particular, the consequential question of whether we should actually replace MPI_COUNT with size_t in all C bindings and replace MPI_AINT with ptrdiff_t in all C bindings (with similar appropriate changes towards using language-specified types for the Fortran bindings).
#107 (comment)

Assertion: using the naturally-sized types specified in the C language would achieve the goal of all the Big MPI issues for the C bindings. The short-term consequence (huge one-off churn affecting most APIs) is identical.

Question: are there similar appropriate types specified in the Fortran language?

Observation: the datatype naming rule proposed in issue #74 (if accepted) will permit the addition of MPI datatypes for size_t and ptrdiff_t (plus Fortran equivalents, if any) without further changes to the MPI Standard.

Corollary: issues #107 and #109 become moot.
Corollary: MPI_AINT_ADD and MPI_AINT_DIFF become superfluous.

@jdinan had a good reason to keep the MPI-namespaced types but I have completely forgotten it. @jdinan: please could you comment?

Do we want MPI to continue to move in the direction of a DSL for communication or return to its roots of a library for communication?

Note, IMHO, the concept of this/these proposal(s) is essential (cope with big machines); only the presentation style in the API is being debated. If we cannot find a technical reason to choose between language-specified and MPI-defined types, then we need the Architecture Review Board to reconvene and expurgate via a fiat.

@tonyskjellum

tonyskjellum commented Sep 23, 2018 via email

@dholmes-epcc-ed-ac-uk
Member

@tonyskjellum deprecate Fortran? <end_troll_mode>

That possibly constitutes a technical reason not to choose language-specific types, at least for the Fortran bindings.

@jeffhammond
Member Author

@dholmes-epcc-ed-ac-uk Please remember that if we change int to size_t or such widening, we will break every single use of count arrays, as are used in vector collectives.

As far as I can tell, this didn't happen with POSIX when those APIs switched from int to size_t because, while the change affected the ABI, POSIX doesn't have any APIs that take vectors of counts. Rather, in e.g. writev, the length lives inside the iovec struct, so any code that uses this function has to allocate the array using an expression involving sizeof(struct iovec), which promotes safely when compiled on a 64-bit system.

@jeffhammond
Member Author

@dholmes-epcc-ed-ac-uk Fortran does not have unsigned integers, so it is rather hard to support size_t properly.

@mhoemmen

@jeffhammond You have C99 so you can say int64_t right?

@jeffhammond
Member Author

@mhoemmen What does C99 have to do with Fortran not supporting unsigned types? In any case, the MPI standard does not require C99, although it supports int64_t and MPI_INT64_T.

@jeffhammond
Member Author

@mhoemmen Assuming https://stackoverflow.com/a/1089204/2189128 is reliable, ISO C recommends that size_t be castable to long, but then we'd not be able to support more than 8 exbibytes of memory per node, which will render MPI obsolete in the yottascale era 😜

@mhoemmen

What does C99 have to do with Fortran not supporting unsigned types?

What I mean is that switching from int to size_t, signed to unsigned, is more troublesome than switching from int to ptrdiff_t (signed to signed). The latter has the advantage of Fortran compatibility (assuming that Fortran has no unsigned integer types).

@jeffhammond
Member Author

@mhoemmen We are never going to replace int with a wider type in the existing MPI symbols. See #80 (comment) for details.

@mhoemmen

ah ok never mind then :)

@dholmes-epcc-ed-ac-uk
Member

Having the vector arguments be typed with MPI_COUNT or MPI_AINT does not help with ABI portability with respect to using size_t or ptrdiff_t instead. Both sets of types are of a fixed length on a particular machine but could be different between machines. If I write code that assumes the size of any of these it will break when that size changes.

For the avoidance of doubt, I say above that the consequences to the API of using size_t are identical to using MPI_COUNT because the proposal is to churn the API in exactly the same manner. Specifically, if it is decided that we will have two symbols, the existing function signature and one with "_X" appended, then the "_X" variant will have the new type(s), whichever types those end up being. Users can continue to compile against the existing symbols with their existing code and variable declarations. If and only if they wish to switch do they have to verify that they are using suitably sized variables and arrays.

If the MPI Forum decides to fork MPI (seriously discussed as an option at the Sept 2018 meeting, straw poll 16,2,0 in favour), then MPI-4.0 may change the types in the existing API function definitions without changing their symbol names, which breaks backward compatibility. This option imposes a burden on the MPI Forum and on MPI library writers to continue support for a line of MPI-3.x releases that contain existing MPI-3.1 interfaces plus minor fixes and updates cherry-picked from the MPI-4 fork.

@jeffhammond
Member Author

@dholmes-epcc-ed-ac-uk Sorry, I misread your comment and thought you were suggesting replacing int with a wider type, as opposed to replacing MPI_Count with an ISO/POSIX-standard one.

If we are going to fork the standard, I suggest that we use MPI_Count and MPI_Aint everywhere, but prescribe how these are typedef-d. That way, we can preserve a universal API definition while supporting both ABIs. This is not unlike what I've proposed in #13 for MPI_Socket.

@dholmes-epcc-ed-ac-uk
Member

@jeffhammond I like that. So, we are suggesting that part of the C binding as defined in MPI-4k (pronounced MPI-fork) should be:

typedef size_t MPI_COUNT;
typedef ptrdiff_t MPI_AINT;

That allows humans and compilers alike to see the equivalence and use whichever they are more comfortable with.

The Fortran binding can do whatever seems appropriate for that language (probably these will remain "opaque" types).

Issue #107 becomes moot. Issue #109 is not; in fact, it should be expanded to include F2C and C2F conversion functions, or a promise of automatic representation conversion during heterogeneous MPI communication.

@jeffhammond
Member Author

@dholmes-epcc-ed-ac-uk We need to stop talking about forks. Python's fork was/is a disaster for users and maintainers of dependent projects. MPI-4 needs to be one standard with two well-defined ABIs.

@jsquyres
Member

Agree. The word "fork" has connotations of splitting and becoming two entirely different things. Even though I'm not there at the meeting, I get the sense that that's not what the Forum is talking about here.

@dholmes-epcc-ed-ac-uk I appreciate the pun "MPI-4K" =~ "MPI Fork", but I think it sends the wrong message.

@hjelmn

hjelmn commented Sep 24, 2018

I am not entirely sure I agree. There is a discussion about breaking backward compatibility going forward and providing only a 64-bit-clean interface. How this is done is under discussion.

BTW, I was there for the discussion. I will vote no on any attempt at adding additional _x symbols unless we plan to fork afterward.

@jeffhammond
Member Author

I suspect The Register is already writing a salacious article about the forking of MPI that will terrify users and cause them to rewrite their apps in Spark.

@tonyskjellum

tonyskjellum commented Sep 24, 2018 via email

@dholmes-epcc-ed-ac-uk
Member

To clarify for those not present at the meeting, the discussion prior to the 16-2 in favour straw-poll covered a number of possible API changes related to how we should express the Big MPI adjustments (and others). There was a general (and strong) feeling that creating "_X" versions in MPI-4 only to be faced later with the necessity of creating "_Y" versions in future for some other API change was a really bad idea.

The straw-poll itself immediately followed a suggestion that MPI-4 should define two APIs, possibly to be expressed via two header files in C (and, I guess, two modules in Fortran), for example, "mpi3.h" and "mpi4.h".

The straw poll question was carefully worded to extract maximum support, something like "given the dislike for the _X mess, could you countenance supporting a proposal that breaks backwards compatibility, for example, in this way?" with the other option being "I will never support anything that is not backwards compatible under any circumstances".

Despite heavily biasing the question, I was not expecting the strength of support for such a radical idea.

Perhaps, "fork" is the wrong word. However, Python was mentioned as a cautionary tale during the discussion and before the straw-poll.

Others present can correct me, if I am mis-remembering or over-editorialising.

@tonyskjellum tonyskjellum added the wg-collectives Collectives Working Group label Sep 26, 2018
@tonyskjellum

Latest update (Chapter 5 and change log)
-- Add _X APIs for persistent collective
-- Fix enumerations of persistent operations throughout chapter
-- Add enumerations of _X APIs throughout chapter
-- Update change log for addition of _X APIs that are new

mpi32-report-ticket80-03oct18-2231.pdf

@wesbland
Member

wesbland commented Oct 7, 2020

@tonyskjellum / @puribangalore - Is this issue replaced by #137? Can we close this?

@puribangalore

puribangalore commented Oct 7, 2020 via email

@wesbland wesbland closed this as completed Oct 7, 2020

10 participants