Always propagate completion errors to NCCL
When progressing completions, we could get a completion error for a request before NCCL calls test() explicitly for that request. Since NCCL tests for completions in order, this can lead to hangs when there are non-recoverable failures in the network and NCCL never receives a successful completion for the earliest request.

With this change, completion errors are always passed up the stack so NCCL can abort the job and fail gracefully where possible. This logic could be enhanced further by using provider-specific information from the completion error entry to distinguish fatal errors from recoverable user errors, but that would not be portable.

Fixes #346

Signed-off-by: Raghu Raja <[email protected]>
(cherry picked from commit 5aac4dc)
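The behavior described in the commit message can be illustrated with a minimal sketch. This is not the plugin's actual code: `struct request`, `struct completion`, `progress_completions()` and `test_request()` are hypothetical names invented for illustration, and the completion queue is modeled as a plain array. The sketch assumes that any completion error seen while progressing is returned to the caller, even when the failed request is not the one being tested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request states; not the plugin's real types. */
enum req_state { REQ_PENDING, REQ_DONE, REQ_ERROR };

struct request {
    enum req_state state;
};

/* Hypothetical completion entry: which request finished, and whether it failed. */
struct completion {
    size_t req_index;
    bool is_error;
};

/*
 * Drain the queued completions and update request states.
 * Returns 0 if every completion succeeded, -1 if any completion error was seen.
 */
static int progress_completions(struct request *reqs, size_t n_reqs,
                                const struct completion *cq, size_t n_cq)
{
    int rc = 0;

    for (size_t i = 0; i < n_cq; i++) {
        assert(cq[i].req_index < n_reqs);
        if (cq[i].is_error) {
            reqs[cq[i].req_index].state = REQ_ERROR;
            rc = -1; /* remember the failure, keep draining */
        } else {
            reqs[cq[i].req_index].state = REQ_DONE;
        }
    }
    return rc;
}

/*
 * Test one request. Any completion error found while progressing is
 * propagated to the caller, even if it belongs to a different request,
 * so the caller is not left polling an earlier request forever.
 */
static int test_request(struct request *reqs, size_t n_reqs, size_t idx,
                        const struct completion *cq, size_t n_cq, bool *done)
{
    *done = false;
    if (progress_completions(reqs, n_reqs, cq, n_cq) != 0)
        return -1;
    *done = (reqs[idx].state == REQ_DONE);
    return 0;
}
```

In this sketch, testing request 0 while the only queued completion is an error for request 1 returns -1 rather than reporting "not done yet" indefinitely, which is the hang scenario the commit message describes.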
Showing 2 changed files with 168 additions and 106 deletions.