2017 05 10

Wesley Bland edited this page May 10, 2017 · 1 revision

Attendees

  • Intel - Wesley, Marc, Rob
  • UTK - Aurelien, George
  • Sandia National Lab - Keita
  • Argonne National Lab - Yanfei, Ken
  • Lawrence Livermore National Lab - Ignacio, Murali
  • Auburn - Nawrin
  • Los Alamos National Lab - Howard
  • Oak Ridge National Lab - Geoffroy

Slides

In repo

Notes

Marc - What if some processes exit the ongoing operation before the failure? Where do they recover?

  • This would be up to the application to solve after the fact, by deciding where it needs to roll back or forward to.

Aurelien - A benefit of this is that it works across libraries for automatic recovery, but it still requires the application to have consensus across the software stack about where it needs to start recovery.

George - This does not solve the biggest problem, which is ordering. After the automatic recovery, the application still has to do the same amount of work that it would otherwise have done before calling MPI_COMM_SHRINK. Instead, it could go to the parent communicator of the overlapping communicators to do recovery.

  • Wesley - This is true. For overlapping communicators, recovery is always going to be more complex. The optimization is that if you know that you aren't using overlapping communicators, you can do more independent recovery.
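
To make the overlap point concrete, here is a minimal sketch in plain Python. It is a simulation only and makes no MPI calls. Communicators are modeled as sets of MPI_COMM_WORLD ranks, and the names `affected` and `has_overlap` are hypothetical helpers invented for this illustration, not part of any proposal discussed in the meeting.

```python
from itertools import combinations

def affected(comms, failed_rank):
    """Names of the communicators that contain the failed world rank."""
    return sorted(name for name, members in comms.items()
                  if failed_rank in members)

def has_overlap(comms):
    """True if any two communicators share at least one world rank."""
    return any(a & b for a, b in combinations(comms.values(), 2))

# Disjoint layout: a failure touches exactly one communicator.
disjoint = {"left": {0, 1}, "right": {2, 3}}

# 2x2 grid layout: every rank sits in one row and one column communicator.
grid = {"row0": {0, 1}, "row1": {2, 3}, "col0": {0, 2}, "col1": {1, 3}}

print(has_overlap(disjoint), affected(disjoint, 1))  # False ['left']
print(has_overlap(grid), affected(grid, 1))          # True ['col1', 'row0']
```

In the disjoint layout, only "left" has to recover after rank 1 fails, so "right" can continue independently. In the grid layout the same failure hits two communicators whose survivors also belong to other communicators, which is why recovery there may need to be coordinated at the parent communicator.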

George - We can't automatically replace handles because all ongoing requests would need to be dealt with.

  • Wesley - Agree. This would also make it hard to determine new rank vs. old rank.
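
The rank question can be illustrated with a small simulation, again in plain Python with no MPI calls. It assumes that a shrink keeps the survivors in their original relative order and renumbers them densely from 0. The function name `shrink_rank_map` is hypothetical and exists only for this sketch.

```python
def shrink_rank_map(size, failed):
    """Map each surviving old rank to its rank in the shrunken communicator.

    Assumption: survivors keep their relative order and are renumbered
    densely starting at 0.
    """
    survivors = [rank for rank in range(size) if rank not in failed]
    return {old: new for new, old in enumerate(survivors)}

mapping = shrink_rank_map(6, {1, 4})
print(mapping)  # {0: 0, 2: 1, 3: 2, 5: 3}
```

Here old rank 5 becomes new rank 3. If the handle were replaced silently, a request that was posted to rank 3 before the failure would be ambiguous: it could mean old rank 3 (now rank 2) or the process that is now rank 3 (old rank 5).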

We discussed concerns about supporting multiple FT models within MPI, but decided that, for the Standard itself, this shouldn't be a problem. The only concern is from an application's perspective.

  • You can only have one FT model per MPI_COMM_WORLD.
  • If we add sessions, you might be able to have more than one model per session, but it's up to the application to make sure that the combination actually makes sense.

Action Items

  • Working Group - Try to fix the deadlock problem in ULFM, either via automatic recovery or otherwise. If we do something else, we need to be able to explain why it's better than automatic recovery.
  • Working Group - Add a function to the FT chapter to pick which FT model we want to use.
  • Ignacio - Write text for the backward recovery model to move forward with standardization of a second recovery model. This is probably the best way to make sure that ULFM and Reinit can live in the Standard together.