Noncatastrophic Errors #28

Closed
wesbland opened this issue Dec 9, 2015 · 87 comments
Labels: passed final vote, wg-ft (Fault Tolerance Working Group)

Comments

@wesbland
Member

wesbland commented Dec 9, 2015

Background

Currently, MPI treats all errors as catastrophic regardless of what they are. However, many errors don't need to be catastrophic because they don't prevent the library from continuing. For example, if an app calls MPI_ALLOC_MEM and the system is out of memory, the call returns MPI_ERR_NO_MEM. This error doesn't need to be fatal: the application could free some memory and try the call again. What's missing is a way to ask the library whether an error was catastrophic.
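The free-and-retry idea in the background paragraph could look roughly like the sketch below. It does not call MPI: `demo_alloc`, `demo_release_cache`, the `DEMO_*` constants, and the toy byte budget are all hypothetical stand-ins, chosen so the example is self-contained. In a real program the allocation would presumably be MPI_Alloc_mem on a library configured to return error codes rather than abort.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for MPI_SUCCESS and the out-of-memory error class. */
enum { DEMO_SUCCESS = 0, DEMO_ERR_NO_MEM = 1 };

/* Toy model of the system: a fixed byte budget plus an application cache. */
static size_t demo_budget = 1024; /* bytes currently available */
static size_t demo_cache  = 512;  /* bytes the application could give back */

/* Stand-in for an allocation call that reports failure instead of aborting. */
static int demo_alloc(size_t size, void **out)
{
    *out = NULL;
    if (size > demo_budget) {
        return DEMO_ERR_NO_MEM;
    }
    void *p = malloc(size);
    if (p == NULL) {
        return DEMO_ERR_NO_MEM;
    }
    demo_budget -= size;
    *out = p;
    return DEMO_SUCCESS;
}

/* Application hook: release cached memory. Returns the bytes released. */
static size_t demo_release_cache(void)
{
    size_t freed = demo_cache;
    demo_budget += freed;
    demo_cache = 0;
    return freed;
}

/* Try once; on out-of-memory, free the cache and retry exactly once. */
static int alloc_with_retry(size_t size, void **out)
{
    int rc = demo_alloc(size, out);
    if (rc == DEMO_ERR_NO_MEM && demo_release_cache() > 0) {
        rc = demo_alloc(size, out);
    }
    return rc;
}
```

The retry only happens when the application actually managed to release something; otherwise the error is passed back to the caller unchanged.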

Proposal

Update for Feb 2018 Meeting

We propose to remove the MPI Standard text that says that errors put MPI into an undefined state and replace it with text that says that MPI should continue to operate and return errors via the usual error handlers.

Users may receive the same error forever in some cases, and they are free to make the determination of when to give up and terminate the application.

Original version

We propose to add a new API which queries whether an error code is catastrophic:

MPI_ERR_IS_CATASTROPHIC(int errorcode, int *catastrophic);

This call returns MPI_CATASTROPHIC if the state of the MPI library is now undefined. It returns MPI_NONCATASTROPHIC if the application can retry/continue (probably after doing something to try to fix the error based on the error class).
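For illustration only, a caller of the originally proposed query might have looked like the sketch below. The proposed function was never adopted, so everything here is a mock: `mock_err_is_catastrophic` mirrors the argument list given above, while the `MOCK_*` constants, their values, and the choice of which error counts as recoverable are assumptions made for this example.

```c
/* Hypothetical result values, named after the constants in the proposal. */
enum { MOCK_NONCATASTROPHIC = 0, MOCK_CATASTROPHIC = 1 };

/* Hypothetical error codes; the numeric values are arbitrary. */
enum { MOCK_ERR_NO_MEM = 34, MOCK_ERR_INTERNAL = 17 };

/* Mock of the proposed query. This toy library treats only the
 * out-of-memory error as recoverable; everything else is catastrophic. */
static int mock_err_is_catastrophic(int errorcode, int *catastrophic)
{
    *catastrophic = (errorcode == MOCK_ERR_NO_MEM) ? MOCK_NONCATASTROPHIC
                                                   : MOCK_CATASTROPHIC;
    return 0; /* the query itself succeeded */
}

/* Caller side: returns 1 if the application may retry or continue,
 * 0 if it should shut down. Assumes the worst if the query fails. */
static int may_continue(int errorcode)
{
    int state = MOCK_CATASTROPHIC;
    if (mock_err_is_catastrophic(errorcode, &state) != 0) {
        return 0;
    }
    return state == MOCK_NONCATASTROPHIC;
}
```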

We also have to tweak the text about all errors causing MPI to be undefined to say that only catastrophic errors cause MPI to be undefined and noncatastrophic errors do not.

Impact on Implementations

Update for Feb 2018 Meeting

This will encourage (though not require) implementations to do more than just abort after a fault. Instead, they should follow the current error handler setup and return helpful error codes/classes to the user.

Original Version

This will require implementors to store more information in the error codes to be able to tell whether an error is catastrophic or not. To support this proposal fully, they might need to internally track whether certain types of errors are catastrophic or not. The weakest possible support could just say that all errors are catastrophic.

Impact on Users

Users that want to maintain current behavior can continue with no changes to semantics or performance.

@wesbland wesbland added not ready wg-ft Fault Tolerance Working Group labels Dec 9, 2015
@wesbland wesbland added this to the 2016-02 Chicago, USA milestone Dec 9, 2015
@wesbland wesbland self-assigned this Dec 9, 2015
@abouteiller
Member

The call cannot return constants if it has a LOGICAL Fortran mapping, as we discussed earlier. Is there a PR yet?

@wesbland wesbland added scheduled reading Reading is scheduled for the next meeting and removed not ready labels Feb 2, 2016
@wesbland
Member Author

Updated PDF: issue-28.pdf
Marked Up PDF: issue-28-markup.pdf

Pull Request: mpi-forum/mpi-standard#9.

@abouteiller
Member

Found the following text about "resource errors" in 2.8

a resource error may occur when a program exceeds the amount of available system resources (number of pending messages, system buffers, etc.). The occurrence of this type of error depends on the amount of available resources in the system and the resource allocation mechanism used; this may differ from system to system. A high-quality implementation will provide generous limits on the important resources so as to alleviate the portability problem this represents.

Could we reframe the "catastrophic" state of mind to clarifying that "resource" errors (to be further qualified) do not undefine MPI?

@wesbland
Member Author

wesbland commented Mar 2, 2016

I don't think this text needs to be changed. There are times where a "resource error" may be catastrophic and times where it won't. We are careful in the rest of the text to not actually suggest any specific errors will or won't be catastrophic because the library could be configured to say that all errors are catastrophic.

@wesbland
Member Author

wesbland commented May 23, 2016

@wesbland
Member Author

wesbland commented Jun 7, 2016

PDF to be read at June 2016 meeting: catastrophic.pdf

@wesbland
Member Author

During the reading at the Bellevue, WA, June 2016 meeting, some changes were proposed:

  • Remove the error code specific MPI_ERR_IS_CATASTROPHIC function because the same functionality can be captured by the more general function.
  • Rename the function MPI_IS_CATASTROPHIC to MPI_GET_STATE, which would return an enum value indicating the state of MPI (currently there is only one mandated state, MPI_UNDEFINED).
  • At some point in the future, we might want to add a session argument to this function so different state can be returned in different sessions.

These changes will get merged and the issue will be re-read at the next meeting.

@wesbland
Member Author

Dan pointed out that there also needs to be a state like MPI_IS_OK in addition to MPI_IS_CATASTROPHIC.

@wesbland
Member Author

PDFs for Dec '16 reading:

Clean PDF
PDF with highlighting

@hritzdorf

The type of state in Fortran versions of MPI_Get_state must be INTEGER (not logical).

@wesbland
Member Author

wesbland commented Dec 2, 2016

Thanks. I'll make that change before the meeting begins and we can add it as a "no no" vote if necessary.

@wesbland wesbland removed this from the 2016-12 Dallas, USA milestone Dec 7, 2016
@wesbland
Member Author

wesbland commented Dec 7, 2016

At the Dec 2016 meeting, we agreed to make some changes to the definition of "catastrophic", which will require a new reading.

@wesbland
Member Author

Thanks @hritzdorf. I've fixed that and here's a new PDF.

@wesbland
Member Author

wesbland commented Dec 6, 2017

Notes from the reading:

The Forum felt strongly that the way to detect catastrophic errors should not be via an API call, but should come from the error class itself. The initial concern about the fact that not all errors have an error class was dismissed because you would never have checked for an error until you received an error code anyway.

Furthermore, the Forum decided that it would rather remove the notion of catastrophic errors completely and just treat all errors the same, as non-catastrophic errors. It would be up to the user to determine which errors are actually catastrophic and which ones aren't.

This has these main consequences:

  1. If the MPI library has what it considers a "catastrophic error", it might have to just abort. The set of errors that falls into this category should be very limited, however.

  2. The user will be responsible for deciding which kinds of errors it wants to handle and which ones it doesn't. This means that we'll need to provide more specific error classes whenever possible. We should look at what kinds of error classes might be useful. One example would be to look at errno for similar errors that we could borrow.

  3. The proposal should be changed to remove all of the notions of catastrophic errors and just remove the sentence about MPI being undefined after an error.

  4. Catastrophic (or any other) errors cannot be permanent. If they are, the library is probably in a situation where it just has to abort.
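One way an application could take on that responsibility is a small per-error-class policy with a cap on repeated failures, sketched below. This is only an illustration of the user-side decision, not part of the proposal: the `CLS_*` classes, the `action_t` values, and `error_policy_t` are hypothetical names, and a real program would presumably obtain the class from the error code returned by the library.

```c
/* What the application does next after seeing an error. */
typedef enum { ACTION_RETRY, ACTION_SKIP, ACTION_ABORT } action_t;

/* Hypothetical error classes; the values are arbitrary but nonzero. */
enum { CLS_NO_MEM = 1, CLS_FILE_NOT_FOUND = 2, CLS_OTHER = 3 };

/* Tracks how often the same class has been seen in a row. */
typedef struct {
    int last_class;  /* 0 means no error seen yet */
    int repeats;     /* consecutive occurrences of last_class */
    int max_repeats; /* give up once repeats exceeds this */
} error_policy_t;

/* Application-defined decision: which errors to handle, and when to
 * stop trying because the same error keeps coming back. */
static action_t decide(error_policy_t *p, int error_class)
{
    if (error_class == p->last_class) {
        p->repeats++;
    } else {
        p->last_class = error_class;
        p->repeats = 1;
    }

    if (p->repeats > p->max_repeats) {
        return ACTION_ABORT; /* same error too many times in a row */
    }

    switch (error_class) {
    case CLS_NO_MEM:
        return ACTION_RETRY; /* free something, then try again */
    case CLS_FILE_NOT_FOUND:
        return ACTION_SKIP;  /* carry on without this input */
    default:
        return ACTION_ABORT; /* unknown class: treat as fatal */
    }
}
```

The repeat counter is what lets the application, rather than the library, decide when an error that keeps recurring should be treated as fatal.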

@wesbland wesbland removed this from the 2017-12 San Jose, USA milestone Dec 11, 2017
@wesbland wesbland added not ready and removed scheduled reading Reading is scheduled for the next meeting labels Dec 11, 2017
@wesbland
Member Author

Updated PDF for February/March 2018 Reading:

issue-28-markedup.pdf

@wesbland wesbland added this to the 2018-02 Portland, USA milestone Feb 21, 2018
@wesbland wesbland added scheduled reading Reading is scheduled for the next meeting and removed not ready labels Feb 21, 2018
@wesbland
Member Author

wesbland commented Feb 28, 2018

There was a minor change brought over from #1 and #3 that will need to be read as a no-no change in the 2018-06 meeting (if this reading is successful).

https://github.com/mpi-forum/mpi-standard/pull/9/commits/d97fa311b83071b850fe7e2b357f1f2b6ed4bfea

@wesbland wesbland added had reading Completed the formal proposal reading and removed scheduled reading Reading is scheduled for the next meeting labels Mar 1, 2018
@wesbland wesbland changed the title from "Query MPI (Catastrophic) State" to "Noncatastrophic Errors" Mar 1, 2018
@wesbland
Member Author

wesbland commented May 15, 2018

@schulzm

schulzm commented Jun 14, 2018

Passed no-no vote for final changes during Austin Forum Meeting in June 2018

@schulzm

schulzm commented Jun 14, 2018

This passed the first vote in Austin; we will have the second vote in Barcelona.

@schulzm schulzm added passed first vote Passed the first formal vote and removed had reading Completed the formal proposal reading labels Jun 14, 2018
@schulzm schulzm added passed final vote Passed the final formal vote and removed passed first vote Passed the first formal vote labels Sep 21, 2018
@schulzm

schulzm commented Sep 21, 2018

Passed second vote at Barcelona meeting in Sep. 2018, ready to be merged into golden copy

@schulzm

schulzm commented Sep 24, 2018

Vote tally: 16 yes, 0 abstain, 0 no - full results at https://www.mpi-forum.org/meetings/2018/09/votes
