-
-
Notifications
You must be signed in to change notification settings - Fork 12.3k
[Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so #25605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix a crash on ROCm when certain NCCL symbols are not present by silently ignoring them during library initialization. However, the current implementation appears to be flawed. It checks if the function pointer is None, whereas ctypes is expected to raise an AttributeError for missing symbols. This suggests the fix may not work as intended. I have provided a suggestion to correctly handle missing symbols using a try...except block, which should make the fix more robust.
|
I'm not sure if it is better to exclude nccl specific functions like what you did here or include them like: #25608 |
Signed-off-by: Gregory Shtrasberg <[email protected]>
That's also possible, only the condition should not be CUDA/ROCm, but ideally, NCCL version. |
| "ncclCommWindowDeregister" | ||
| ]: | ||
| # These symbols require NCCL >= 2.27.03, and having | ||
| # an exception here on ROCm platform is not allowed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a logger.warn_once here that mentions that these symbols failed to import and that users should update their nccl/rccl libraries if they want to use them?
…nabled Signed-off-by: Gregory Shtrasberg <[email protected]>
|
I confirm that it is necessary to be able to run mGPU models in ROCM, it solves the following error: |
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <[email protected]>
…ccl.so (#25605) Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: yewentao256 <[email protected]>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: Tomer Asida <[email protected]>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <[email protected]>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <[email protected]>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <[email protected]>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <[email protected]>
Fixing an issue from #24532
ncclCommWindowRegister and ncclCommWindowDeregister only exist on nccl >= 2.27.03
On ROCm having an unhandled exception (trying to import symbols in the above part of the code crashes graph capturing due to the
HIP error: operation not permitted when stream is capturingerror.I suppose the least invasive solution would be to swallow the import error on ROCm for these two symbols silently.
cc @Amir-19