
2017 06 21

Aurelien Bouteiller edited this page Jun 21, 2017 · 4 revisions

Present:

  • Intel - Wesley, Marc
  • LANL - Howard
  • ANL - Ken
  • Auburn - Nawrin, Alexander
  • UTK - Aurelien
  • LLNL - Ignacio, Murali

Topics:

  • Short summary of MPI F2F meeting activities from those who were present (not much happened related to error handling)
  • Discussions on MPI_Init and errors within

MPI_Init and errors

Ignacio asked what happens if an error strikes during MPI_Init. We discussed three ways of dealing with it, plus a variant of the second:

  1. MPI_Init always succeeds, that is, from the perspective of the MPI program. If the MPI library cannot make it appear "as if" MPI_Init had succeeded, it will still abort before returning from MPI_Init. This is what ULFM recommends, with the understanding that processes missing from MPI_Init may trigger ULFM errors on the first post-init operation.
  2. MPI_Init may return an error code. This requires a small change to the spec to mandate that error codes can be returned by MPI_Init (for now, the error happens before anybody has set an error handler on COMM_WORLD, so the default handler is invoked -> abort). Implementations would have to always initialize error codes/classes, even when MPI_Init fails. To be useful, MPI_Init would have to return an error code at all ranks, which may add some overhead to the normal operation of MPI_Init.
  3. More like 2.bis: MPI_Set_errhandler can be called before MPI_Init. This is a big change for such a small addition in functionality, and everybody finds it gross. Let's not go there.
  4. Delegate the issue to sessions. Whatever that means.

Now, back to Ignacio's base problem: MPI_Reinit is called after MPI_Init, i.e., MPI_Reinit may employ the normal error handling capabilities. As an alternative, MPI_Reinit may call reinit again every time it fails, until it succeeds (or abort when it is no longer possible to maintain the contractual obligations of the specified semantics). In effect, MPI_Reinit would "never fail" from a user perspective, although it may cycle multiple times.
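The retry alternative can be sketched in plain C, without any MPI dependency. This is only an illustration of the control flow discussed above, not how MPI_Reinit is or would be implemented. All names here (reinit_status, reinit_attempt_fn, reinit_until_success, flaky_attempt, doomed_attempt) are hypothetical and were not part of the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of a single reinit attempt (not an MPI-defined type). */
typedef enum {
    REINIT_OK,         /* state restored, the caller may continue */
    REINIT_RETRY,      /* another failure struck during reinit, try again */
    REINIT_IMPOSSIBLE  /* the specified semantics can no longer be honored */
} reinit_status;

/* Hypothetical callback that performs one reinit attempt. */
typedef reinit_status (*reinit_attempt_fn)(void *ctx);

/* What the retry loop reports back to its caller. */
typedef struct {
    bool must_abort;   /* true when retrying cannot help */
    int  attempts;     /* number of attempts made, including the last one */
} reinit_result;

/*
 * Cycle until an attempt succeeds, so that the operation "never fails"
 * from the user's perspective. The only way out besides success is
 * must_abort, at which point a real library would abort the job
 * (for example with MPI_Abort) instead of returning an error.
 */
reinit_result reinit_until_success(reinit_attempt_fn attempt, void *ctx)
{
    reinit_result res = { false, 0 };
    assert(attempt != NULL);
    for (;;) {
        res.attempts++;
        reinit_status st = attempt(ctx);
        if (st == REINIT_OK) {
            return res;
        }
        if (st == REINIT_IMPOSSIBLE) {
            res.must_abort = true;
            return res;
        }
        /* REINIT_RETRY: fall through and cycle again. */
    }
}

/* Stand-in attempt: fails *ctx times (an int counter), then succeeds. */
reinit_status flaky_attempt(void *ctx)
{
    int *failures_left = (int *)ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return REINIT_RETRY;
    }
    return REINIT_OK;
}

/* Stand-in attempt that can never succeed. */
reinit_status doomed_attempt(void *ctx)
{
    (void)ctx;
    return REINIT_IMPOSSIBLE;
}
```

With flaky_attempt and a counter of 2, the loop makes three attempts and returns without requesting an abort. With doomed_attempt it gives up after the first attempt and sets must_abort. The loop has no retry cap, which mirrors the "may cycle multiple times" behavior. A real implementation would probably want a bound or a timeout.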
