[bugfix] put nccl window register/deregister behind cuda platform #25608
Amir-19 wants to merge 2 commits into vllm-project:main
Signed-off-by: Amir Samani <asamani@nvidia.com>
Code Review
This pull request aims to conditionally load NCCL window register/deregister functions only on the CUDA platform. The implementation correctly separates the CUDA-specific functions but introduces a critical caching bug where the function cache is not platform-aware. This can lead to AttributeErrors if the cache is populated on a non-CUDA platform first. Additionally, there is a minor typo in a new variable name. I've provided suggestions to fix both issues.
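The caching concern raised in the review can be illustrated with a minimal sketch. Everything here is hypothetical and is not the vLLM implementation: `load_functions`, `_cache`, and the two name lists are stand-ins, and the "loaded function" is just a placeholder string. The point is only that keying the cache on the platform flag as well as the library path prevents one platform's function table from being served to the other.

```python
from typing import Dict, List, Tuple

# Hypothetical symbol lists, used only for illustration.
BASE_FUNCTIONS: List[str] = ["ncclAllReduce", "ncclCommInitRank"]
WINDOW_FUNCTIONS: List[str] = ["ncclCommWindowRegister", "ncclCommWindowDeregister"]

# Cache keyed on (library path, is_cuda) rather than on the path alone,
# so a table built without the window functions is never reused on CUDA.
_cache: Dict[Tuple[str, bool], Dict[str, str]] = {}


def load_functions(so_file: str, is_cuda: bool) -> Dict[str, str]:
    """Return a function table for so_file, building it once per platform."""
    key = (so_file, is_cuda)
    if key not in _cache:
        names = list(BASE_FUNCTIONS)
        if is_cuda:
            # Window register/deregister are only requested on CUDA.
            names += WINDOW_FUNCTIONS
        # A real loader would resolve each symbol from the shared library;
        # a placeholder string stands in for the resolved function here.
        _cache[key] = {name: f"<{name} from {so_file}>" for name in names}
    return _cache[key]
```

With a path-only key, whichever platform populated the cache first would decide whether the window functions are present for every later caller, which is the failure mode the review describes.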
    raise e

    function_specs = list(NCCLLibrary.exported_functions)
    if current_platform.is_cuda():
These functions should exist on both platforms in NCCL >= 2.27.03.
The original issue was that on ROCm trying to import a non-existent symbol causes a crash, while on CUDA (apparently) it does not.
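One way to tolerate a missing symbol without a platform check is to treat the newer functions as optional during lookup. The sketch below is an assumption about how that could look, not what this PR or #25605 does: `resolve_symbols` and `FakeLibrary` are hypothetical names, and `FakeLibrary` stands in for a loaded shared library so the example runs without NCCL installed.

```python
from typing import Any, Dict, Iterable


def resolve_symbols(lib: Any, required: Iterable[str],
                    optional: Iterable[str]) -> Dict[str, Any]:
    """Look up symbols on lib, skipping optional ones that are absent."""
    funcs: Dict[str, Any] = {}
    for name in required:
        # A missing required symbol is a real error, so let it propagate.
        funcs[name] = getattr(lib, name)
    for name in optional:
        try:
            funcs[name] = getattr(lib, name)
        except AttributeError:
            # Older or different library builds may not export this symbol.
            continue
    return funcs


class FakeLibrary:
    """Stand-in for a loaded library that lacks the window functions."""

    def ncclAllReduce(self) -> int:
        return 0
```

A `ctypes.CDLL` object raises `AttributeError` when an attribute lookup names a symbol the library does not export, so the same `try`/`except` pattern applies to a real handle. Callers would then need to check whether the optional function is present before using it.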
Closed in favor of #25605.
Purpose
Put the NCCL window register/deregister functions behind a CUDA platform check.
Test Plan
Test Result