
[Build] Update CUTLASS revision from v4.2.1 to v4.4.2#37491

Open
meena-at-work wants to merge 1 commit into vllm-project:main from meena-at-work:update-cutlass-to-4.4.2

Conversation

@meena-at-work

@meena-at-work meena-at-work commented Mar 18, 2026

Bump the CUTLASS dependency from v4.2.1 to v4.4.2 in CMakeLists.txt.

The primary motivation is fixing non-deterministic TMA descriptor crashes on DGX Spark (GB10 / SM121) with NVFP4 MoE models. The crash occurs in tma_warp_specialized_generic_moe_gemm_kernelLauncher<Sm120, fp4> from
fused_moe_120.so.

This should fix #35566.

The same root cause was fixed upstream in:

Additional notable fixes included in v4.3.0–v4.4.2:

  • SM100 attention kernel correctness fixes (softmax, shared memory >48KB, dead-hang)
  • nvfp4 grouped GEMM core dump fix
  • SM120 Blackwell pingpong kernel: memory fence fix for CLC scheduler
  • SM120 missing SMEM alignment fix for scale factors
  • SM120f (DGX Spark) example compilation enabled — required for correct kernel selection on SM121 hardware
  • Hopper FMHA causal attention performance regression fix on CUDA 13.1
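The change itself is a one-line revision bump in CMakeLists.txt. As a rough sketch, a FetchContent-style pin that such a bump would touch might look like the following (the `CUTLASS_REVISION` variable name comes from the review comment below; the surrounding lines are illustrative, not vLLM's actual file):

```cmake
# Illustrative sketch only -- vLLM's real CMakeLists.txt may structure this
# differently. The PR changes the pinned tag from v4.2.1 to v4.4.2.
set(CUTLASS_REVISION "v4.4.2" CACHE STRING "CUTLASS revision to use")

include(FetchContent)
FetchContent_Declare(
  cutlass
  GIT_REPOSITORY https://github.com/NVIDIA/cutlass.git
  GIT_TAG ${CUTLASS_REVISION}
)
FetchContent_MakeAvailable(cutlass)
```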

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the CUTLASS dependency from v4.2.1 to v4.4.2 by modifying the CUTLASS_REVISION variable in CMakeLists.txt. The stated purpose is to fix a non-deterministic crash with MoE models. No issues were found in the provided code changes.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of the fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 18, 2026
@tlrmchlsmth tlrmchlsmth self-assigned this Mar 18, 2026
Member

@tlrmchlsmth tlrmchlsmth left a comment


Thanks for the contribution @meena-at-work. Have you tried this yourself to see if it fixes #35566?

@meena-at-work
Author

@tlrmchlsmth -- I haven't run the specific model reported in #35566, no -- so I've removed references to it from the PR description.

@kilork

kilork commented Mar 19, 2026

I have not run this exact model, but I ran lukealonso/MiniMax-M2.5-NVFP4 with the following command on two DGX Sparks:

vllm serve lukealonso/MiniMax-M2.5-NVFP4 \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port 30000 \
      --gpu-memory-utilization 0.7 \
      -tp 2 \
      --distributed-executor-backend ray \
      --served-model-name minimax-m2.5 \
      --max-model-len 196608 \
      --load-format fastsafetensors \
      --kv-cache-dtype fp8  \
      --quantization modelopt_fp4 \
      --max-num-batched-tokens 8192 \
      --max-num-seqs 64 \
      --moe-backend cutlass \
      --enable-auto-tool-choice \
      --interleave-mm-strings \
      --tokenizer /minimax-m2.5-tokenizer \
      --chat-template /minimax-m2.5-tokenizer/chat_template.jinja \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2

The model starts and runs normally, and my usual workflow with opencode works just fine. I built an image with this PR included and can see the CUTLASS 4.4.2 version in the build logs.

@pavanimajety
Collaborator

@meena-at-work It seems the build errors are related:


[2026-03-19T00:42:45Z] #40 2467.7             instantiation of "at::Tensor machete::run_impl<MacheteKernel>(machete::MMArgs) [with MacheteKernel=machete::MacheteKernelTemplate<cutlass::bfloat16_t, cutlass::vllm_uint4b8_t, cutlass::bfloat16_t, float, cutlass::bfloat16_t, void, void, void, cutlass::gemm::KernelTmaWarpSpecializedCooperative, machete::sch_256x16_1x1x1_TmaMI__TmaCoop_streamK>]" at line 161 of /workspace/csrc/quantization/machete/generated/machete_mm_impl_part2.cu

Please take a look. @tlrmchlsmth We probably need to enable ready-run-all-tests for this one.

@kilork

kilork commented Mar 20, 2026

Interesting, the build now fails on my machine as well. This is the header of vLLM's start message from yesterday's working build:

(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.2rc1.dev87+g577df69b2.d20260318
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]   █▄█▀ █     █     █     █  model   lukealonso/MiniMax-M2.5-NVFP4
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:233] non-default args: {'model_tag': 'lukealonso/MiniMax-M2.5-NVFP4', 'chat_template': '/minimax-m2.5-tokenizer/chat_template.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'minimax_m2', 'host': '0.0.0.0', 'port': 30000, 'model': 'lukealonso/MiniMax-M2.5-NVFP4', 'tokenizer': '/minimax-m2.5-tokenizer', 'trust_remote_code': True, 'max_model_len': 196608, 'quantization': 'modelopt_fp4', 'served_model_name': ['minimax-m2.5'], 'load_format': 'fastsafetensors', 'reasoning_parser': 'minimax_m2', 'distributed_executor_backend': 'ray', 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.7, 'kv_cache_dtype': 'fp8', 'interleave_mm_strings': True, 'max_num_batched_tokens': 8192, 'max_num_seqs': 64, 'moe_backend': 'cutlass'}

That build is from this revision: 577df69

It seems that either my environment (I saw NVIDIA updates) or changes on the main branch made it incompatible.

I will try to bisect to the breaking revision later.

@kilork

kilork commented Mar 20, 2026

Done. It seems this revision causes the compilation issue: 8b10e4f

git bisect log:

vllm on  main (BISECTING) via △ v4.3.0 via 🐍 v3.14.3 took 19s
❯ git bisect bad
status: waiting for good commit(s), bad commit known

vllm on  main (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect good 577df69b26491aaa8f3fef2ea44d6ac256172032
Bisecting: 51 revisions left to test after this (roughly 6 steps)
[98ff0429175b98169e1ebffd5ff32d0635bd39cc] [CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore in test_rocm_aiter_topk since its not necessary (#36996)

vllm on  HEAD (98ff042) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect visualize

vllm on  HEAD (98ff042) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3 took 17s
❯ git bisect bad
Bisecting: 25 revisions left to test after this (roughly 5 steps)
[a32eaf5bb288fd925d66716a7050cc4444a7dfb1] [CI] Merge `cleanup_pr_body.yml` and `reminder_comment.yml` (#37552)

vllm on  HEAD (a32eaf5) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect good
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[2890aecce5d1fe1dcdb61be4bedbe2d46700e51c] [CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded (#37561)

vllm on  HEAD (2890aec) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect bad
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[104605cbf2046d09436a41a2367a975f73116138] Remove deprecated reasoning_content message field(part-2) (#37480)

vllm on  HEAD (104605c) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect good
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[e27b8ba3d17df1330c81adf755988e8ee0fd6ab8] [Bug] Fix fp8 trtllm MoE modular kernel supported routing methods (#37346)

vllm on  HEAD (e27b8ba) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 1 step)
[40b8363b45a9c59984907603b00b736e41d25065] [MRV2] Use fp32 for draft logits (#37526)

vllm on  HEAD (40b8363) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[8b10e4fb316c14cfdb3109ac6f87722ec2a6c3c8] [1/n] Migrate permute_cols to libtorch stable ABI (#31509)

vllm on  HEAD (8b10e4f) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect bad
8b10e4fb316c14cfdb3109ac6f87722ec2a6c3c8 is the first bad commit
commit 8b10e4fb316c14cfdb3109ac6f87722ec2a6c3c8 (HEAD)
Author: mikaylagawarecki <mikaylagawarecki@gmail.com>
Date:   Thu Mar 19 11:27:26 2026 -0400

    [1/n] Migrate permute_cols to libtorch stable ABI (#31509)

    Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

 CMakeLists.txt                             | 43 ++++++++++++++++++++++++++++++++++++++++++-
 csrc/libtorch_stable/ops.h                 |  9 +++++++++
 csrc/{ => libtorch_stable}/permute_cols.cu | 40 +++++++++++++++++++++++-----------------
 csrc/libtorch_stable/torch_bindings.cpp    | 21 +++++++++++++++++++++
 csrc/libtorch_stable/torch_utils.h         | 13 +++++++++++++
 csrc/ops.h                                 |  1 -
 csrc/torch_bindings.cpp                    |  3 ---
 setup.py                                   |  5 +++++
 vllm/platforms/cuda.py                     |  1 +
 9 files changed, 114 insertions(+), 22 deletions(-)
 create mode 100644 csrc/libtorch_stable/ops.h
 rename csrc/{ => libtorch_stable}/permute_cols.cu (68%)
 create mode 100644 csrc/libtorch_stable/torch_bindings.cpp
 create mode 100644 csrc/libtorch_stable/torch_utils.h

Reverting this revision allows me to build using latest main.

kilork added a commit to kilork/vllm that referenced this pull request Mar 21, 2026
The PyTorch Stable ABI requires all types to be trivially copyable.
Reference types (const Tensor&) are not trivially copyable and cannot
be used in STABLE_TORCH_LIBRARY registrations.

This fixes build failure when combining PR vllm-project#37491 (CUTLASS upgrade to
v4.4.2) with the libtorch stable ABI migration.

Also adds missing CUTLASS include directories to _C_stable_libtorch
target in CMakeLists.txt.

Signed-off-by: Your Name <your.email@example.com>
@kilork

kilork commented Mar 21, 2026

I created PR #37744 to address the compilation issue. It could be applied before or after #37491, but I guess it should go in before, or be included in that PR.


Labels

ci/build nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

CUDA illegal memory access in MoE layer with MiniMax-M2.5 NVFP4 on Blackwell (SM120)

4 participants