
2017 10 18

Wesley Bland edited this page Oct 19, 2017 · 1 revision

Attendees

  • Intel - Wesley, Jim
  • UTK - Aurelien
  • ORNL - Geoffroy
  • Argonne - Yanfei, Ken
  • Sandia - Keita
  • LLNL - Murali, Ignacio

RMA Fault Tolerance (Data Resilience)

Link to pull request

Continued discussion of whether this work is useful.

  • Aurelien - The description of the failure model is unclear. We need to better differentiate between MPI_ERR_PROC_FAILED and MPI_ERR_DATA_UNAVAILABLE.
  • Jim - Should MPI_ERR_DATA_UNAVAILABLE be usable outside of RMA? Does it apply to point-to-point or collectives?
  • Jim/Aurelien - Is the justification for this work that flush doesn't allow detection of process failure? They're still not convinced that this is true.
    • As long as flush can complete successfully, do we really need to tell you if a process failed on the other end?
  • Jim - On the other hand, it might be true that we can't guarantee any process failure detection in any RMA operation. Maybe we should just not allow process failure errors (as opposed to "upgrading" other types of errors to process failure).
  • Jim - One case where the proposal still makes sense as written: a process whose data is corrupted because another process failed during a put. A third process reading that bad memory could get MPI_ERR_DATA_UNAVAILABLE instead of MPI_ERR_PROC_FAILED.

Bottom Line

  • We're still unclear on the failure model expected here. We probably need to get more feedback from Jeff.
  • We are also not convinced that the existing process failure semantics are insufficient to give users all of the actionable information they need.

For next week:

  • Get feedback from Jeff when he comes back from paternity leave.

In the future:

  • Start discussing text for FA-MPI and Reinit.