Noncatastrophic Errors #28
Comments
The call cannot return constants if it has a LOGICAL Fortran mapping, as we discussed earlier. Is there a PR yet?
Updated PDF: issue-28.pdf Pull Request: mpi-forum/mpi-standard#9.
Found the following text about "resource errors" in 2.8
Could we reframe the "catastrophic" state of mind to clarifying that "resource" errors (to be further qualified) do not undefine MPI?
I don't think this text needs to be changed. There are times where a "resource error" may be catastrophic and times where it won't. We are careful in the rest of the text to not actually suggest any specific errors will or won't be catastrophic because the library could be configured to say that all errors are catastrophic.
Updated PDF with markup: issue-28-markup.pdf Pull Request: https://github.com/mpi-forum/mpi-standard/pull/9
PDF to be read at June 2016 meeting: catastrophic.pdf
During the reading at the Bellevue, WA, June 2016 meeting, some changes were proposed:
These changes will get merged and the issue will be re-read at the next meeting.
Dan pointed out that there also needs to be a state like
PDFs for Dec '16 reading:
The type of state in Fortran versions of MPI_Get_state must be INTEGER (not logical).
Thanks. I'll make that change before the meeting begins and we can add it as a "no no" vote if necessary.
At the Dec 2016 meeting, we agreed to make some changes to the definition of "catastrophic", which will require a new reading.
Thanks @hritzdorf. I've fixed that and here's a new PDF.
Notes from the reading: The Forum felt strongly that the way to detect catastrophic errors should not be via an API call, but should come from the error class itself. The initial concern about the fact that not all errors have an error class was dismissed because you would never have checked for an error until you received an error code anyway. Furthermore, the Forum decided that it would rather remove the notion of catastrophic errors completely and just treat all errors the same, as non-catastrophic errors. It would be up to the user to determine which errors are actually catastrophic and which ones aren't. This has these main consequences:
Updated PDF for February/March 2018 Reading:
Here are the no-no changes that will be read and voted on at the June meeting:
Passed no-no vote for final changes during Austin Forum Meeting in June 2018
This passed the first vote in Austin; we will have the second vote in Barcelona.
Passed second vote at Barcelona meeting in Sep. 2018, ready to be merged into golden copy
Vote tally: 16 yes, 0 abstain, 0 no - full results at https://www.mpi-forum.org/meetings/2018/09/votes |
Background
Currently, MPI treats all errors as catastrophic regardless of what they are. However, many errors don't need to be catastrophic because they don't prevent the library from continuing. For example, if an app calls MPI_ALLOC_MEM and the system is out of memory, the call returns MPI_ERR_NO_MEM. This error doesn't need to be fatal: the application could free some memory and try the call again. What's missing is a way to query the library to ask whether an error was catastrophic.
Proposal
Update for Feb 2018 Meeting
We propose to remove the MPI Standard text that says that errors put MPI into an undefined state and replace it with text that says that MPI should continue to operate and return errors via the usual error handlers.
Users may receive the same error forever in some cases, and they are free to make the determination of when to give up and terminate the application.
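As a rough illustration of what "deciding when to give up" could look like on the user side, here is a minimal sketch of a bounded retry loop. It is deliberately MPI-free so that it can be compiled on its own: every name in it (alloc_with_retry, try_alloc_fn, release_fn, demo_pool, DEMO_ERR_NO_MEM) is hypothetical and not part of MPI. In a real program the try_alloc callback would presumably wrap a call such as MPI_Alloc_mem made with the MPI_ERRORS_RETURN error handler installed, and would report the returned error code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback types (not part of MPI).
 * try_alloc returns 0 on success or a nonzero error code on failure.
 * release tries to free application memory and returns nonzero if it
 * actually freed something. */
typedef int (*try_alloc_fn)(size_t bytes, void **out, void *ctx);
typedef int (*release_fn)(void *ctx);

/* Retry an allocation up to max_attempts times, releasing application
 * memory between attempts. Returns 0 on success, otherwise the last
 * error code seen. The application, not the library, decides when to stop. */
int alloc_with_retry(size_t bytes, void **out, int max_attempts,
                     try_alloc_fn try_alloc, release_fn release, void *ctx)
{
    assert(max_attempts > 0);
    int err = 0;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        err = try_alloc(bytes, out, ctx);
        if (err == 0) {
            return 0;               /* the error, if any, was recoverable */
        }
        if (!release(ctx)) {
            break;                  /* nothing left to free: give up */
        }
    }
    return err;
}

/* Stand-in for a memory-limited library, used only for this demo. */
enum { DEMO_ERR_NO_MEM = 1 };

struct demo_pool {
    size_t available;   /* bytes the "library" can still hand out */
    size_t cached;      /* bytes the application could give back */
};

int demo_try_alloc(size_t bytes, void **out, void *ctx)
{
    struct demo_pool *pool = ctx;
    if (bytes > pool->available) {
        return DEMO_ERR_NO_MEM;     /* simulated out-of-memory error */
    }
    void *p = malloc(bytes);
    if (p == NULL) {
        return DEMO_ERR_NO_MEM;
    }
    pool->available -= bytes;
    *out = p;
    return 0;
}

int demo_release(void *ctx)
{
    struct demo_pool *pool = ctx;
    if (pool->cached == 0) {
        return 0;                   /* nothing to free */
    }
    pool->available += pool->cached;
    pool->cached = 0;
    return 1;
}
```

With a pool that has 10 bytes available and 100 bytes cached, a 50-byte request fails once, the cache is released, and the second attempt succeeds. With no cached memory the loop stops after the first failure and hands the error back to the caller, which can then choose to terminate.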
Original version
We propose to add a new API which queries whether an error code is catastrophic:
This call returns MPI_CATASTROPHIC if the state of the MPI library is now undefined. It returns MPI_NONCATASTROPHIC if the application can retry/continue (probably after doing something to try to fix the error based on the error class).
We also have to tweak the text about all errors causing MPI to be undefined to say that only catastrophic errors cause MPI to be undefined and noncatastrophic errors do not.
Impact on Implementations
Update for Feb 2018 Meeting
This will encourage (though not require) implementations to do more than just abort after a fault. Instead, they should follow the current error handler setup and return helpful error codes/classes to the user.
Original Version
This will require implementors to store more information in the error codes to be able to tell whether an error is catastrophic or not. To support this proposal fully, they might need to internally track whether certain types of errors are catastrophic. The weakest possible support could just say that all errors are catastrophic.
Impact on Users
Users that want to maintain current behavior can continue with no changes to semantics or performance.