[Build] Update CUTLASS revision from v4.2.1 to v4.4.2 #37491
meena-at-work wants to merge 1 commit into vllm-project:main
Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Code Review
This pull request updates the CUTLASS dependency from v4.2.1 to v4.4.2 by modifying the CUTLASS_REVISION variable in CMakeLists.txt. The stated purpose is to fix a non-deterministic crash with MoE models. No issues were found in the provided code changes.
tlrmchlsmth left a comment
Thanks for the contribution @meena-at-work. Have you tried this yourself to see if it fixes #35566?
@tlrmchlsmth No, I haven't run the specific model reported in #35566, so I've removed the references to it from the PR comment.
I have not run this exact model, but I run […]. The model starts and runs normally in my usual workflow with […].
@meena-at-work The build errors seem to be related: […] Please take a look. @tlrmchlsmth We probably need to enable […]
Interesting, the build now also fails on my machine. This is the header of vLLM's start message from yesterday's working build: […] That build is at revision 577df69. It seems that either my environment (I saw NVIDIA updates) or changes on the main branch made it incompatible. I will try to bisect to the breaking revision later.
Done. This revision appears to cause the compilation issue: 8b10e4f
Reverting it allows me to build against the latest main.
The PyTorch Stable ABI requires all types to be trivially copyable. Reference types (const Tensor&) are not trivially copyable and cannot be used in STABLE_TORCH_LIBRARY registrations.
This fixes the build failure that occurs when combining PR vllm-project#37491 (CUTLASS upgrade to v4.4.2) with the libtorch stable ABI migration. It also adds the missing CUTLASS include directories to the _C_stable_libtorch target in CMakeLists.txt.
Signed-off-by: Your Name <your.email@example.com>
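The trivially-copyable point in the commit message can be checked with standard C++ type traits alone. The sketch below is only an illustration of that language rule: `TensorHandle`, `is_abi_safe`, and the two function signatures are hypothetical names invented here, not PyTorch types or the checks that STABLE_TORCH_LIBRARY actually performs.

```cpp
#include <type_traits>

// Hypothetical stand-in for an opaque tensor handle (NOT the real PyTorch type).
// It holds only a raw pointer, so the struct itself is trivially copyable.
struct TensorHandle {
    void* impl;
};

// Hypothetical helper: true when T can be copied byte-for-byte,
// which is the property the commit message says the stable ABI requires.
template <typename T>
struct is_abi_safe : std::is_trivially_copyable<T> {};

// A handle passed by value satisfies the requirement.
static_assert(is_abi_safe<TensorHandle>::value,
              "by-value handle is trivially copyable");

// A reference type never does, regardless of what it refers to.
static_assert(!is_abi_safe<const TensorHandle&>::value,
              "reference types are not trivially copyable");

// Hypothetical op signatures illustrating the kind of change described:
int num_rows_by_ref(const TensorHandle& t);  // reference parameter: fails the trait
int num_rows_by_value(TensorHandle t);       // value parameter: passes the trait
```

Both assertions are evaluated at compile time, so a translation unit containing this sketch only compiles if the claims hold. This matches the failure mode reported in the thread, which surfaced as a build error rather than a runtime crash.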
Bump the CUTLASS dependency from v4.2.1 to v4.4.2 in CMakeLists.txt.
The primary motivation is fixing non-deterministic TMA descriptor crashes on DGX Spark (GB10 / SM121) with NVFP4 MoE models. The crash occurs in tma_warp_specialized_generic_moe_gemm_kernelLauncher<Sm120, fp4> from fused_moe_120.so.
This should fix #35566.
The same root cause was fixed upstream in:
Additional notable fixes included in v4.3.0–v4.4.2: