
[Build] Update CUTLASS revision from v4.2.1 to v4.4.2#37491

Open
meena-at-work wants to merge 1 commit into vllm-project:main from meena-at-work:update-cutlass-to-4.4.2

Conversation

@meena-at-work

@meena-at-work meena-at-work commented Mar 18, 2026

Bump the CUTLASS dependency from v4.2.1 to v4.4.2 in CMakeLists.txt.

The primary motivation is fixing non-deterministic TMA descriptor crashes on DGX Spark (GB10 / SM121) with NVFP4 MoE models. The crash occurs in tma_warp_specialized_generic_moe_gemm_kernelLauncher<Sm120, fp4> from
fused_moe_120.so.

This should fix #35566.

The same root cause was fixed upstream in:

Additional notable fixes included in v4.3.0–v4.4.2:

  • SM100 attention kernel correctness fixes (softmax, shared memory >48KB, dead-hang)
  • nvfp4 grouped GEMM core dump fix
  • SM120 Blackwell pingpong kernel: memory fence fix for CLC scheduler
  • SM120 missing SMEM alignment fix for scale factors
  • SM120f (DGX Spark) example compilation enabled — required for correct kernel selection on SM121 hardware
  • Hopper FMHA causal attention performance regression fix on CUDA 13.1
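The change itself is a one-line revision bump in CMakeLists.txt. As a rough sketch, a FetchContent-style pin that such a bump would touch might look like the following (the `CUTLASS_REVISION` variable name comes from the review comment below; the surrounding lines are illustrative, not vLLM's actual file):

```cmake
# Illustrative sketch only -- vLLM's real CMakeLists.txt may structure this
# differently. The PR changes the pinned tag from v4.2.1 to v4.4.2.
set(CUTLASS_REVISION "v4.4.2" CACHE STRING "CUTLASS revision to use")

include(FetchContent)
FetchContent_Declare(
  cutlass
  GIT_REPOSITORY https://github.com/NVIDIA/cutlass.git
  GIT_TAG ${CUTLASS_REVISION}
)
FetchContent_MakeAvailable(cutlass)
```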

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the CUTLASS dependency from v4.2.1 to v4.4.2 by modifying the CUTLASS_REVISION variable in CMakeLists.txt. The stated purpose is to fix a non-deterministic crash with MoE models. No issues were found in the provided code changes.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of the fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 18, 2026
@tlrmchlsmth tlrmchlsmth self-assigned this Mar 18, 2026
Member

@tlrmchlsmth tlrmchlsmth left a comment


Thanks for the contribution @meena-at-work. Have you tried this yourself to see if it fixes #35566?

@meena-at-work
Author

@tlrmchlsmth -- I haven't run the specific model reported in #35566, no -- so I've removed references to it from the PR description.

@kilork

kilork commented Mar 19, 2026

I have not run this exact model, but I ran lukealonso/MiniMax-M2.5-NVFP4 with the following command on two DGX Sparks:

vllm serve lukealonso/MiniMax-M2.5-NVFP4 \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port 30000 \
      --gpu-memory-utilization 0.7 \
      -tp 2 \
      --distributed-executor-backend ray \
      --served-model-name minimax-m2.5 \
      --max-model-len 196608 \
      --load-format fastsafetensors \
      --kv-cache-dtype fp8  \
      --quantization modelopt_fp4 \
      --max-num-batched-tokens 8192 \
      --max-num-seqs 64 \
      --moe-backend cutlass \
      --enable-auto-tool-choice \
      --interleave-mm-strings \
      --tokenizer /minimax-m2.5-tokenizer \
      --chat-template /minimax-m2.5-tokenizer/chat_template.jinja \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2

The model starts and runs normally, and my usual workflow with opencode works just fine. I built an image with this PR included and can see the CUTLASS 4.4.2 version in the build logs.

@pavanimajety
Collaborator

@meena-at-work It seems the build errors are related:


[2026-03-19T00:42:45Z] #40 2467.7             instantiation of "at::Tensor machete::run_impl<MacheteKernel>(machete::MMArgs) [with MacheteKernel=machete::MacheteKernelTemplate<cutlass::bfloat16_t, cutlass::vllm_uint4b8_t, cutlass::bfloat16_t, float, cutlass::bfloat16_t, void, void, void, cutlass::gemm::KernelTmaWarpSpecializedCooperative, machete::sch_256x16_1x1x1_TmaMI__TmaCoop_streamK>]" at line 161 of /workspace/csrc/quantization/machete/generated/machete_mm_impl_part2.cu

Please take a look. @tlrmchlsmth We probably need to enable ready-run-all-tests for this one.

@kilork

kilork commented Mar 20, 2026

Interesting, the build now fails on my machine as well. This is the header of vLLM's start message from yesterday's working build:

(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.2rc1.dev87+g577df69b2.d20260318
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]   █▄█▀ █     █     █     █  model   lukealonso/MiniMax-M2.5-NVFP4
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:297]
(APIServer pid=893) INFO 03-20 09:10:26 [utils.py:233] non-default args: {'model_tag': 'lukealonso/MiniMax-M2.5-NVFP4', 'chat_template': '/minimax-m2.5-tokenizer/chat_template.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'minimax_m2', 'host': '0.0.0.0', 'port': 30000, 'model': 'lukealonso/MiniMax-M2.5-NVFP4', 'tokenizer': '/minimax-m2.5-tokenizer', 'trust_remote_code': True, 'max_model_len': 196608, 'quantization': 'modelopt_fp4', 'served_model_name': ['minimax-m2.5'], 'load_format': 'fastsafetensors', 'reasoning_parser': 'minimax_m2', 'distributed_executor_backend': 'ray', 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.7, 'kv_cache_dtype': 'fp8', 'interleave_mm_strings': True, 'max_num_batched_tokens': 8192, 'max_num_seqs': 64, 'moe_backend': 'cutlass'}

That build is from this revision: 577df69

It seems that either my environment (I saw NVIDIA updates) or changes on the main branch made it incompatible.

I will try to bisect to the breaking revision later.

@kilork

kilork commented Mar 20, 2026

Done. It seems this revision causes the compilation issue: 8b10e4f

git bisect log:

vllm on  main (BISECTING) via △ v4.3.0 via 🐍 v3.14.3 took 19s
❯ git bisect bad
status: waiting for good commit(s), bad commit known

vllm on  main (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect good 577df69b26491aaa8f3fef2ea44d6ac256172032
Bisecting: 51 revisions left to test after this (roughly 6 steps)
[98ff0429175b98169e1ebffd5ff32d0635bd39cc] [CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore in test_rocm_aiter_topk since its not necessary (#36996)

vllm on  HEAD (98ff042) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect visualize

vllm on  HEAD (98ff042) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3 took 17s
❯ git bisect bad
Bisecting: 25 revisions left to test after this (roughly 5 steps)
[a32eaf5bb288fd925d66716a7050cc4444a7dfb1] [CI] Merge `cleanup_pr_body.yml` and `reminder_comment.yml` (#37552)

vllm on  HEAD (a32eaf5) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect good
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[2890aecce5d1fe1dcdb61be4bedbe2d46700e51c] [CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded (#37561)

vllm on  HEAD (2890aec) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect bad
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[104605cbf2046d09436a41a2367a975f73116138] Remove deprecated reasoning_content message field(part-2) (#37480)

vllm on  HEAD (104605c) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect good
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[e27b8ba3d17df1330c81adf755988e8ee0fd6ab8] [Bug] Fix fp8 trtllm MoE modular kernel supported routing methods (#37346)

vllm on  HEAD (e27b8ba) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 1 step)
[40b8363b45a9c59984907603b00b736e41d25065] [MRV2] Use fp32 for draft logits (#37526)

vllm on  HEAD (40b8363) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[8b10e4fb316c14cfdb3109ac6f87722ec2a6c3c8] [1/n] Migrate permute_cols to libtorch stable ABI (#31509)

vllm on  HEAD (8b10e4f) (BISECTING) via △ v4.3.0 via 🐍 v3.14.3
❯ git bisect bad
8b10e4fb316c14cfdb3109ac6f87722ec2a6c3c8 is the first bad commit
commit 8b10e4fb316c14cfdb3109ac6f87722ec2a6c3c8 (HEAD)
Author: mikaylagawarecki <mikaylagawarecki@gmail.com>
Date:   Thu Mar 19 11:27:26 2026 -0400

    [1/n] Migrate permute_cols to libtorch stable ABI (#31509)

    Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

 CMakeLists.txt                             | 43 ++++++++++++++++++++++++++++++++++++++++++-
 csrc/libtorch_stable/ops.h                 |  9 +++++++++
 csrc/{ => libtorch_stable}/permute_cols.cu | 40 +++++++++++++++++++++++-----------------
 csrc/libtorch_stable/torch_bindings.cpp    | 21 +++++++++++++++++++++
 csrc/libtorch_stable/torch_utils.h         | 13 +++++++++++++
 csrc/ops.h                                 |  1 -
 csrc/torch_bindings.cpp                    |  3 ---
 setup.py                                   |  5 +++++
 vllm/platforms/cuda.py                     |  1 +
 9 files changed, 114 insertions(+), 22 deletions(-)
 create mode 100644 csrc/libtorch_stable/ops.h
 rename csrc/{ => libtorch_stable}/permute_cols.cu (68%)
 create mode 100644 csrc/libtorch_stable/torch_bindings.cpp
 create mode 100644 csrc/libtorch_stable/torch_utils.h

Reverting this revision allows me to build using latest main.

kilork added a commit to kilork/vllm that referenced this pull request Mar 21, 2026
The PyTorch Stable ABI requires all types to be trivially copyable.
Reference types (const Tensor&) are not trivially copyable and cannot
be used in STABLE_TORCH_LIBRARY registrations.

This fixes build failure when combining PR vllm-project#37491 (CUTLASS upgrade to
v4.4.2) with the libtorch stable ABI migration.

Also adds missing CUTLASS include directories to _C_stable_libtorch
target in CMakeLists.txt.

Signed-off-by: Your Name <your.email@example.com>
@kilork

kilork commented Mar 21, 2026

I created PR #37744 to address the compilation issue. It could be applied before or after #37491, but I guess it should go in before, or be included in that PR.


Labels

ci/build nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

CUDA illegal memory access in MoE layer with MiniMax-M2.5 NVFP4 on Blackwell (SM120)

4 participants