
2017 09 13

Wesley Bland edited this page Sep 13, 2017 · 1 revision

Attendees

  • Intel - Wesley, Rob
  • Argonne - Ken, Yanfei
  • UTK - Aurelien
  • Auburn - Nawrin
  • LLNL - Ignacio, Murali
  • UT Chattanooga - Tony
  • ORNL - Geoffroy

Agenda for F2F

  • WG Time
    • Briefly (?) discuss catastrophic errors again
    • Move forward on process failure (ULFM, Reinit, etc.)
  • Reading
    • Read error handlers

Con Call Notes

Error Handlers

  • Went over slides and PDF for reading
    • Want to change one sentence in advice to implementors in Section 8.2.
    • This should be a small enough change to be acceptable. Will point it out separately.

Catastrophic Errors

  • Discussed current proposal and decided that we're still happy with it.
  • Global state of MPI_GET_STATE is ok because if any thread is catastrophic, all threads are catastrophic and can't recover anyway.
    • If you're checking the state, you're probably doing it in an error handler, so you'll know which error code to look for to find out about the error.
  • Bill Gropp asked us to look at how POSIX handles errors, but that is difficult to replicate in MPI because MPI has to maintain much more state across multiple processes. POSIX is more local and stateless (or the state lives in the user's data).
  • We might end up needing more error classes so we can give the user specific information about errors.
  • Might be ready to move forward on a December reading here.

Process Failure

  • Aurelien proposed adding an MPI_COMM_REVOKE_ALL function to resolve the deadlock problem with overlapping communicators.
    • Others were skeptical because you might always have to assume that you need to revoke all communicators any time you have overlapping communication.
    • Aurelien asserted that having concurrent communication with overlapping communicators is not common and might not be as bad as we think.
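
One possible reading of the deadlock concern, as a toy model rather than real MPI: if revoking a communicator only unblocks processes waiting on that communicator, a process waiting on a different, overlapping communicator stays stuck. The rank names, communicator names, and the blocked_after_revoke helper below are hypothetical and only illustrate the idea. They do not reflect the proposed MPI_COMM_REVOKE_ALL semantics, which these notes do not spell out.

```python
# Toy model (not MPI): each process is blocked on at most one communicator,
# and revoking a communicator unblocks every process waiting on it.

def blocked_after_revoke(waiting_on, revoked):
    """Return the processes still blocked once `revoked` communicators are revoked."""
    return sorted(p for p, comm in waiting_on.items() if comm not in revoked)

# Hypothetical scenario: comm_a and comm_b overlap (both contain rank1 and rank2).
# rank1 sees a failure on comm_a and leaves to recover, so it never sends on comm_b.
# rank2 is waiting for rank1 on comm_b; rank3 is waiting on comm_a.
waiting_on = {"rank2": "comm_b", "rank3": "comm_a"}

# Revoking only the communicator where the failure was seen.
only_a = blocked_after_revoke(waiting_on, revoked={"comm_a"})

# Revoking every communicator (the "revoke all" idea).
everything = blocked_after_revoke(waiting_on, revoked={"comm_a", "comm_b"})

print(only_a)      # rank2 is still stuck on comm_b
print(everything)  # nobody is left blocked
```

In this model the skeptics' point is that, without knowing which communicators overlap, the safe choice is always to revoke everything.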