
2016 04 26

Wesley Bland edited this page Apr 26, 2016 · 2 revisions

Attendees

  • ORNL - Christian
  • UTK - Aurelien
  • Argonne - Ken
  • Sandia - Keita
  • LLNL - Murali
  • Ohio State - Sourav

Slides

Slides for today

ULFM + Disconnect

  • Wesley - Treat MPI_COMM_DISCONNECT as a combination of MPI_COMM_FREE and MPI_FINALIZE: the operation can partially fail, but its overall semantics would still be attempted.
    • Requests would be completed (perhaps with an error)
    • Other parts of the MPI_COMM_DISCONNECT may fail too
  • Aurelien - If MPI_COMM_DISCONNECT is not resilient, messages could still arrive from a disconnected process, either because that process doesn't know the communicator has been disconnected or because the messages are late. Handling this would add software overhead in some situations.
    • This could be solved by making the operation resilient, but that would have other implications.
  • We decided to go with a "best effort" approach where the implementation cleans things up as well as it can (implementation dependent). A software fix is still required as a fallback if there's no other way to prevent late messages.
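
The software fallback for late messages could look roughly like the sketch below. It is a toy model in Python, not MPI code: `Message`, `Endpoint`, `context_id` and `deliver` are hypothetical names chosen for illustration. It assumes every message carries an identifier for the communicator it was sent on, and that the receiver drops anything whose communicator is no longer connected locally.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    context_id: int  # hypothetical: identifies the communicator the message was sent on
    source: int
    payload: str


@dataclass
class Endpoint:
    """Toy receiver that tracks which communicator contexts are still connected."""
    active_contexts: set = field(default_factory=set)
    delivered: list = field(default_factory=list)
    dropped: list = field(default_factory=list)

    def connect(self, context_id):
        self.active_contexts.add(context_id)

    def disconnect(self, context_id):
        # Best effort: local cleanup always happens, even if a remote peer
        # never learns that the communicator was disconnected.
        self.active_contexts.discard(context_id)

    def deliver(self, msg):
        # This per-message check is the software overhead mentioned above.
        if msg.context_id in self.active_contexts:
            self.delivered.append(msg)
            return True
        self.dropped.append(msg)
        return False


ep = Endpoint()
ep.connect(7)
on_time = ep.deliver(Message(context_id=7, source=1, payload="on time"))
ep.disconnect(7)
# The peer did not notice the disconnect and sends one more message.
late = ep.deliver(Message(context_id=7, source=1, payload="late"))
```

In this model the late message is recorded in `ep.dropped` instead of being matched against a communicator that no longer exists. How a real implementation would tag and filter messages is implementation dependent.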

ULFM + Sessions

  • If Sessions expands to also cover windows and files, then we need to modify ULFM to account for that.
    • This probably means we need to be able to "repair a set" in the same way we "repair a communicator".
  • There are two options:
    • Create a new, shrunken set (static sets)
    • Repair the set in place by removing failed processes (dynamic sets)
  • We chose the former because we didn't see any major benefits from repairing in place.
  • We also talked about whether it would be possible to replace processes inline with these new features
    • This would require repairing, on every process, every set that the replaced process could belong to, which has scalability issues.
    • That would require a centralized way of looking up set information.
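
The two repair options and the lookup problem discussed above can be contrasted with a small toy model. This is a sketch in plain Python, not part of any Sessions or ULFM proposal: process sets are modeled as sets of integer ranks, and `shrink`, `repair_in_place`, `sets_containing` and the `registry` dictionary are hypothetical names introduced only for illustration.

```python
def shrink(members, failed):
    """Static option: build a new set without the failed processes."""
    return frozenset(members) - frozenset(failed)


def repair_in_place(members, failed):
    """Dynamic option: remove failed processes from the existing set object."""
    members.difference_update(failed)
    return members


def sets_containing(registry, rank):
    """Find every named set a rank belongs to (assumes a central registry)."""
    return sorted(name for name, members in registry.items() if rank in members)


failed = {2}

# Static sets: the original is untouched and a new, smaller set is created.
original = frozenset({0, 1, 2, 3})
shrunken = shrink(original, failed)

# Dynamic sets: the same object changes underneath anyone who holds it.
dynamic = {0, 1, 2, 3}
same_object = repair_in_place(dynamic, failed) is dynamic

# Replacing rank 2 inline would mean finding and repairing all of these sets.
registry = {"world": {0, 1, 2, 3}, "io": {2, 3}, "solver": {0, 1}}
affected = sets_containing(registry, 2)
```

With static sets, existing holders of `original` keep a consistent (if stale) view, while `shrunken` is what new objects would be built from. The `registry` lookup stands in for the centralized set information that inline replacement would need, and each set it finds would have to be repaired on every process that uses it.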