2017 10 11

Wesley Bland edited this page Oct 11, 2017 · 2 revisions

Attendees

  • Intel - Wesley
  • ORNL - Geoffroy
  • Argonne - Yanfei, Ken
  • Auburn - Nawrin
  • Sandia - Keita
  • UTC - Tony

RMA Fault Tolerance (Data Resilience)

Link to pull request

Summary of previous discussion

  • Jeff - RMA differs from communicator-based FT: it is more data-focused, and detecting process failure in RMA is more expensive and less likely. We should add text that focuses on conveying that the data is unavailable.
  • Others - This is somewhat outside the scope of the initial ULFM proposal but still important. Maybe this should be an accompanying proposal.

Discussion on today's call

  • Wesley - After reading through the proposal again, I think it makes sense to bring this into ULFM proper. It completes the picture for RMA because if we can detect process failure, we do, but we can also express failure in other, cheaper ways.
  • Keita - The expected recovery model is unclear here.
    • Good point: We need to add advice stating that we expect the user to free the window, fix the data, and recreate the window. They may or may not discover a process failure during this procedure.
  • Wesley - The advice about MPI_WIN_FREE needs to be expanded to cover MPI_ERR_DATA_UNAVAILABLE around lines 418-419.

ULFM

Aurelien merged the proposal to detect when a communicator is revoked. This is now part of ULFM proper.

For next week:

  • All - Go over the MPI_ERR_DATA_UNAVAILABLE proposal text and leave comments. Specifically, look at the new text proposed in the comments.