feat: K=64 block-scaled MoE GEMM for SM120 (RTX PRO 6000) #2786
brandonmmusic-max wants to merge 6 commits into flashinfer-ai:main from
Conversation
SM120 (RTX PRO 6000 Blackwell) has 99KB SMEM, which is insufficient for K=128 block-scaled tiles at runtime. This patch adds K=64 tile shapes that fit within the SMEM budget.

Changes:
- sm120_blockscaled_mma_builder.inl: Fix SF SMEM layout for K=64 by adding EffBlk_SF/FoldSFIntoBasicBlock logic. When K/SFVectorSize < Blk_SF, the effective block size is clamped and folded into the basic block to keep TMA layouts flat.
- generate_kernels.py: Add K=64 CTA shapes (128x128x64, 128x256x64, 256x128x64) for SM120, update fp8xfp4 filter to allow K=64.
- moe_gemm_template_dispatch_tma_ws.h: Add K=64 entries to are_tile_shapes_supported_sm120() and SHAPE_CASE dispatch.

Tested on 4x RTX PRO 6000 (SM 12.0, 99KB SMEM) with Qwen3.5-397B-A17B-NVFP4:
- Single-user decode: 142 -> 283 tok/s (+99%)
- 8-user system throughput: 510 -> 1283 tok/s (+151%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
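To make the SMEM claim concrete, here is an illustrative back-of-the-envelope stage count. The 99KB budget comes from the PR; the per-stage cost model (A tile in fp8, B tile in fp4, scale factors and epilogue buffers ignored) is a simplifying assumption, so the absolute numbers are rough:

```python
SMEM_BUDGET = 99 * 1024  # usable SMEM on SM120/SM121, per the PR


def mainloop_stages(cta_m: int, cta_n: int, cta_k: int,
                    a_bits: int = 8, b_bits: int = 4) -> int:
    """Rough pipeline-stage count: SMEM budget divided by per-stage tile cost."""
    a_bytes = cta_m * cta_k * a_bits // 8  # fp8 activation tile
    b_bytes = cta_n * cta_k * b_bits // 8  # fp4 weight tile
    return SMEM_BUDGET // (a_bytes + b_bytes)


# Halving CTA_K halves the per-stage footprint, roughly doubling the number
# of stages the mainloop can keep in flight.
print(mainloop_stages(128, 256, 128))  # -> 3
print(mainloop_stages(128, 256, 64))   # -> 6
```

The exact stage counts CUTLASS picks differ (the PR reports 7-11 stages with K=64 vs 2 with K=128, since the real accounting includes scale factors and other buffers), but the direction of the effect is the same.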
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical performance bottleneck for Mixture-of-Experts (MoE) GEMM operations on NVIDIA's SM120 architecture, such as the RTX PRO 6000. Previously, these GPUs were unable to effectively utilize block-scaled tile shapes due to shared memory overflow issues with K=128 tiles and compilation failures with K=64 tiles. By correctly implementing and configuring K=64 tile shapes, this change resolves the underlying technical hurdles, leading to substantial improvements in decode throughput and overall efficiency for large language models on Blackwell workstation GPUs.

Highlights
Activity
📝 Walkthrough

Expands supported SM120 GEMM CTA/tile shapes (adds several K=64 configurations) and updates the FlashInfer CUTLASS kernel generator to enumerate and emit only the allowed SM120/SM121 shapes for specific mixed-precision paths; also removes one K=64 shape from the SM80 fused grouped-GEMM list.
Sequence Diagram(s): skipped — changes are confined to shape lists and gating logic without introducing new multi-component control flow.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Code Review
This pull request introduces support for K=64 tile shapes for block-scaled grouped GEMM on SM120, which significantly improves decode throughput on Blackwell workstation GPUs. The changes involve updating the supported tile shapes in the CUTLASS kernel dispatch logic and the kernel generation script. My review focuses on improving the readability and maintainability of the configuration code in the Python kernel generator. The changes are logical and well-aligned with the goal of enabling new hardware features.
```diff
 if act_type == DataType.e4m3 and weight_type == e2m1:
-    if cta_shape_mnk != [128, 128, 128]:
+    if cta_shape_mnk not in ([128, 128, 128], [128, 128, 64], [128, 256, 64], [256, 128, 64]):
         continue
```
To improve readability and performance, it's more idiomatic to check for membership against a set of tuples rather than a tuple of lists. This avoids potential performance issues with list comparisons and makes the intent clearer.
```diff
 if act_type == DataType.e4m3 and weight_type == e2m1:
-    if cta_shape_mnk not in ([128, 128, 128], [128, 128, 64], [128, 256, 64], [256, 128, 64]):
+    if tuple(cta_shape_mnk) not in {(128, 128, 128), (128, 128, 64), (128, 256, 64), (256, 128, 64)}:
         continue
```
Pre-built Docker image with this fix applied is available: Based on
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/jit/gemm/cutlass/generate_kernels.py`:
- Around line 763-770: The e2m1-only kernel generation path is producing kernels
for all entries in cta_shapes_mnk (including [128,128,256], [256,128,128],
[128,256,128]) but the SM120 dispatch SHAPE_CASE only supports a subset used by
the mixed FP8xFP4 path; replicate the same shape filter applied in the mixed
path to the e2m1-alone branch (i.e., in the code that builds kernels when e2m1
is present but fp4 is not) so only shapes that match the SM120 dispatch
SHAPE_CASE are compiled; alternatively remove e2m1-alone from supported_dtypes
or extend the SM120 dispatch switch to include the additional shapes, but the
recommended quick fix is to apply the existing filter logic used for the mixed
FP8xFP4 case to the e2m1-only generation path (referencing cta_shapes_mnk,
supported_dtypes, and the SM120 SHAPE_CASE/dispatch logic).
ℹ️ Review info: configuration used: defaults; review profile: CHILL; plan: Pro; run ID: b9e887de-a38e-4991-b509-f15a471c5a5e
📒 Files selected for processing (2)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
flashinfer/jit/gemm/cutlass/generate_kernels.py
- Use set of tuples instead of list comparison for fp8xfp4 filter (CodeRabbit)
- Apply same shape filter to e2m1-only path to avoid compiling kernels that overflow SMEM or lack dispatch SHAPE_CASEs (CodeRabbit)
- Sort cta_shapes_mnk for readability (Gemini Code Assist)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated Benchmark Results — Methodology Correction

After more thorough testing, the initial benchmark numbers were inflated by MTP speculative decoding achieving near-100% acceptance on trivial thinking tokens (

Corrected results with thinking disabled and real prompts (substantive generation):
Benchmark methodology breakdown:
The K=64 kernel patch provides a real ~20-25% improvement over the pre-patch baseline on identical hardware. The fix unblocks SM120 GPUs from falling back to slow GEMM paths by giving the autotuner CUTLASS tiles that fit within 99KB SMEM. Note: engine-level throughput (what most benchmarks report) will differ from these end-to-end API measurements, which include HTTP overhead. The relative improvement from the K=64 patch is consistent across measurement methods.
Is this guarded to SM120 only, or SM120 AND SM121? FlashInfer arch guards are arch-specific, not family-specific.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixed the ruff format issue (one tuple per line). Pre-commit should pass now. Could a maintainer run
🧹 Nitpick comments (1)
flashinfer/jit/gemm/cutlass/generate_kernels.py (1)
813-830: Hoist `_sm120_supported_shapes` outside the loop.

The constant set is redefined on every iteration of the loop. Moving it outside the loop avoids redundant object creation.

♻️ Proposed refactor

```diff
+    # SM120 supported shapes for block-scaled paths
+    _sm120_supported_shapes = {
+        (128, 128, 128),
+        (128, 128, 64),
+        (128, 256, 64),
+        (256, 128, 64),
+    }
+
     partial_args = product(
         supported_dtypes,
         quant_ops,
         epi_tags,
         epi_fusions,
         cta_shapes_mnk,
         cga_shapes,
         swap_ab,
     )
     operations = list()
     for (
         dtype,
         quant_op,
         epi_tag,
         epi_fusion,
         cta_shape_mnk,
         cga_shape,
         swap_ab,
     ) in partial_args:
         # Ignored
         mainloop_schedule = KernelScheduleType.TmaWarpSpecializedCooperative
         epi_schedule = None
         if isinstance(dtype, tuple):
             act_type, weight_type = dtype
         else:
             act_type, weight_type = dtype, dtype
-        # SM120 supported shapes for block-scaled paths
-        _sm120_supported_shapes = {
-            (128, 128, 128),
-            (128, 128, 64),
-            (128, 256, 64),
-            (256, 128, 64),
-        }
         # For mixed FP8xFP4 on SM120/SM121, only emit shapes that fit in 99KB SMEM
```

The filtering logic itself correctly addresses the prior review concern — both FP8xFP4 and FP4xFP4 (e2m1-only) paths now filter to shapes matching the C++ `are_tile_shapes_supported_sm120()` and `SHAPE_CASE` dispatch entries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/jit/gemm/cutlass/generate_kernels.py` around lines 813 - 830, Hoist the constant set _sm120_supported_shapes out of the loop so it is created once instead of on every iteration: move the definition of _sm120_supported_shapes to just before the loop that iterates over CTA shapes, and leave the existing filtering logic using act_type, weight_type, cta_shape_mnk, DataType.e4m3 and e2m1 unchanged so the checks (if tuple(cta_shape_mnk) not in _sm120_supported_shapes: continue) still work but without repeatedly allocating the set.
Move the constant set outside the iteration loop to avoid redundant object creation on each iteration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re: SM120 vs SM121 guard

Both are covered. The C++ dispatch checks `Arch::kMinComputeCapability == 120 || Arch::kMinComputeCapability == 121`, and SM121 (DGX Spark / GB10) has the same ~99KB SMEM constraint as SM120, so the K=64 tiles are needed on both.

Re: CodeRabbit nitpick

Fixed in 7145384 — hoisted the set out of the loop.
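For readers wondering how the gating composes, a hedged Python sketch of the check follows. It mirrors the quoted C++ condition and the shape set from `generate_kernels.py`; the function name and structure are illustrative, not the actual dispatch code:

```python
# Shape whitelist matching the _sm120_supported_shapes set in the PR.
SM120_SUPPORTED_SHAPES = {
    (128, 128, 128),
    (128, 128, 64),
    (128, 256, 64),
    (256, 128, 64),
}


def tile_shape_supported(min_compute_capability: int, mnk: tuple) -> bool:
    """Illustrative equivalent of the SM120/SM121 tile-shape guard."""
    # Mirrors `Arch::kMinComputeCapability == 120 || Arch::kMinComputeCapability == 121`.
    if min_compute_capability not in (120, 121):
        return False
    return mnk in SM120_SUPPORTED_SHAPES


print(tile_shape_supported(121, (128, 256, 64)))   # DGX Spark GB10, K=64 tile -> True
print(tile_shape_supported(120, (128, 256, 128)))  # shape not whitelisted -> False
```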
🧹 Nitpick comments (1)
flashinfer/jit/gemm/cutlass/generate_kernels.py (1)
762-790: Optional cleanup: keep `cta_shapes_mnk` aligned with the supported-shape source.

`cta_shapes_mnk` still lists shapes that are always filtered out by `_sm120_supported_shapes`. Defining `cta_shapes_mnk` from the supported list directly would reduce drift risk.

♻️ Suggested local simplification

```diff
-    cta_shapes_mnk = [
-        [128, 128, 64],
-        [128, 128, 128],
-        [128, 128, 256],
-        [128, 256, 64],
-        [128, 256, 128],
-        [256, 128, 64],
-        [256, 128, 128],
-    ]
+    _sm120_supported_shapes = [
+        (128, 128, 64),
+        (128, 128, 128),
+        (128, 256, 64),
+        (256, 128, 64),
+    ]
+    cta_shapes_mnk = [list(shape) for shape in _sm120_supported_shapes]
@@
-    _sm120_supported_shapes = {
-        (128, 128, 128),
-        (128, 128, 64),
-        (128, 256, 64),
-        (256, 128, 64),
-    }
+    _sm120_supported_shape_set = set(_sm120_supported_shapes)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/jit/gemm/cutlass/generate_kernels.py` around lines 762 - 790, cta_shapes_mnk currently contains shapes that are always filtered out by _sm120_supported_shapes; update the definition of cta_shapes_mnk to derive it from _sm120_supported_shapes (or filter the existing list against that set) so the two stay in sync—e.g., generate cta_shapes_mnk by iterating over _sm120_supported_shapes (preserving desired ordering or sorting) or replace the literal list with a filtered comprehension that only keeps shapes present in _sm120_supported_shapes; ensure references to cta_shapes_mnk elsewhere (naming/iteration) still work with the resulting list.
Re: `cta_shapes_mnk` alignment with `_sm120_supported_shapes`

The extra shapes in
- Add --apply-flashinfer-pr flag to build-and-copy.sh for applying FlashInfer PRs at build time (mirrors existing --apply-vllm-pr)
- Include K=64 SM120 CUTLASS patch for workstation Blackwell GPUs (ref: flashinfer-ai/flashinfer#2786)
- Skip FlashInfer rebuild when wheels already exist
- Add Qwen3.5 recipes (122B-A10B, 397B-A17B NVFP4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Using this patch in production on DGX Spark (SM121) — it gives ~2x decode throughput improvement with K=64 tiles vs K=128. Running Nemotron-3-Super-120B and Qwen3.5-122B NVFP4 models via vLLM. Confirmed it works cleanly on top of the CUTLASS 4.4.2 upgrade merged in #2798. cc @aleozlx since you're actively testing SM121 on #2780.
Two fixes that enable K=64 tile shapes for block-scaled MoE GEMM on SM120 (RTX 5090/PRO 6000) and SM121 (DGX Spark GB10):

1. TMA zero-stride basis handling (copy_traits_sm90_tma.hpp): When K=64 with SFVectorSize=32, scale factor folding creates a broadcast dimension with zero stride. The existing code passes this to basis_get() which produces invalid TMA descriptors. Fix: detect zero-stride basis via is_constant<0> and emit shape=1, stride=0.

2. Scale factor block size clamping (sm120_blockscaled_mma_builder.inl): K=64 with SFVectorSize=32 gives NumSFAlongK=2, but Blk_SF=4. The division Blk_SF/MMA_NSF overflows. Fix: clamp effective block size (EffBlk_SF) to min(NumSFAlongK, Blk_SF) and conditionally fold into kBasicBlock to keep TMA layout flat.

Together these enable K=64 CTA shapes ([128,128,64], [128,256,64], [256,128,64]) which achieve 7-11 pipeline stages vs 2 with K=128, giving ~2x single-user decode throughput on SM120/SM121.

Tested on DGX Spark (SM121) with Nemotron-3-Super-120B and Qwen3.5-122B NVFP4 models via FlashInfer CUTLASS MoE backend.

Related: FlashInfer PR flashinfer-ai/flashinfer#2786 adds the K=64 tile shapes to FlashInfer's kernel generation but depends on these CUTLASS fixes for correctness.

Signed-off-by: Rob Tand <robert.tand@icloud.com>
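The first fix can be modeled in a few lines. This is a conceptual Python sketch of what `fill_tma_gmem_shape_stride` does after the patch, not the cute/C++ implementation; representing cute's constant-zero `_0` basis element as `None` is an assumption made purely for illustration:

```python
def tma_shape_stride(basis, gmem_shape, gmem_stride):
    """Build per-mode TMA shape/stride lists from a projection 'basis'.

    basis: list where an int i selects mode i of the gmem layout, and None
    stands in for cute's constant-zero basis element (a broadcast mode where
    SFVectorSize elements share one scale factor).
    """
    tma_shape, tma_stride = [], []
    for b in basis:
        if b is None:
            # The fix: a zero-stride (broadcast) mode contributes extent 1
            # with stride 0. Pre-fix, basis_get(_0, gmem_shape) fell back to
            # the generic overload and returned the whole shape tuple, which
            # cannot convert to a single uint64_t extent.
            tma_shape.append(1)
            tma_stride.append(0)
        else:
            tma_shape.append(gmem_shape[b])
            tma_stride.append(gmem_stride[b])
    return tma_shape, tma_stride


# A broadcast mode sandwiched between two real modes:
print(tma_shape_stride([0, None, 1], (64, 2048), (1, 64)))
# -> ([64, 1, 2048], [1, 0, 64])
```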
/bot run |
There’s another gentleman below who says he’s running it on the DGX Spark and got 2x decode with the patch, so it’ll work on SM121 as well.
@brandonmmusic-max I have a bunch of other changes that depend on your changes (specifically for spark). Let me know if you need any help getting these out. Nice work btw. I know they aren't leading to quite the performance gains you'd expected but they're still necessary for nvfp4 support and surely nvfp4 gates other enhancements. |
I think I tagged you in the other one I was working on (NVIDIA/cutlass#3121).
There appear to be errors on SM120.
…oval
1. Revert accidental removal of (128,128,64) from SM80 fused grouped
GEMM shapes — unrelated to SM120 changes.
2. Update CUTLASS submodule to include two fixes required for K=64
block-scaled MoE GEMM compilation on SM120 (99KB SMEM):
a) copy_traits_sm90_tma.hpp: Handle zero-stride basis elements in
fill_tma_gmem_shape_stride. When basis element is constant-zero
(broadcast dimension for SFVectorSize), basis_get returns the
entire gmem_shape tuple instead of a scalar. Fix: detect
is_constant<0> and set shape=1, stride=0 directly.
b) sm120_blockscaled_mma_builder.inl: Clamp Blk_SF to
min(K/SFVectorSize, Blk_SF) and fold the effective block into
kBasicBlockShape when tile K is too small for the default block
size. Keeps outer K dimensions trivial so TMA constructs valid
descriptors.
For K=64 with SFVectorSize=32: K/SFVectorSize=2 < Blk_SF=4, which
previously produced a zero-size dimension in the SF SMEM layout,
triggering "TMA requires CTA_Tile and SLayout top-level size
equivalence."
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated — reverted the SM80 tile removal and added the two CUTLASS fixes via submodule update. Ready for another build test when you get a chance.
Thanks for the persistent support. May I ask which commit is relevant that we are trying to run from https://github.com/brandonmmusic-max/cutlass.git? I looked at the branch but it's not obvious to me which commit was related (https://github.com/brandonmmusic-max/cutlass/commits), therefore I cannot issue a CI run for unvetted code (this is the reason CIs don't just automatically run). Thanks for understanding. Also note that if you wanted the CUTLASS version bump, our main branch has bumped it to a new version.
Totally fair, I should have made that clearer — sorry about that. It's one commit:

I checked and the bug is still there in v4.4.2 (

A few ways we could handle this — happy to go with whatever's easiest for you:

A) I rebase the fix onto v4.4.2 so you can see a clean diff against the version main already uses

B) Drop the submodule change entirely and just patch the files inline somewhere in

C) I push the fix upstream to NVIDIA/cutlass first and we wait for it to land before merging this

The FlashInfer-side changes (generate_kernels.py + dispatch) are independent — they just need these two CUTLASS fixes to actually compile.
[SUCCESS] Pipeline #46627414: 10/20 passed |
Summary
Problem
On SM120 (RTX PRO 6000, DGX Spark — 99KB SMEM):
This leaves no usable block-scaled tile shapes on SM120.
Root Cause: Two CUTLASS Bugs
Bug 1: TMA `basis_get` with zero-stride basis (copy_traits_sm90_tma.hpp)

In `fill_tma_gmem_shape_stride`, when a TMA basis element is constant-zero (`_0`, representing a broadcast dimension where SFVectorSize elements share one scale factor), `basis_get(_0, gmem_shape)` falls back to the generic overload, which returns the entire `gmem_shape` tuple instead of a scalar. This tuple can't convert to `uint64_t`.

For K≥128, the zero-stride mode gets absorbed into adjacent TMA dimensions during basis computation, so the bug is latent. For K=64, the different SMEM layout structure exposes the standalone `_0` basis element.

Bug 2: SF SMEM layout nesting mismatch (sm120_blockscaled_mma_builder.inl)

The scale factor SMEM layout uses `kBasicBlockShape = Shape<SFVectorSize, MMA_NSF>` where `MMA_NSF = AtomK / SFVectorSize`.

For fp8×fp4 (MMA atom K=32, SFVectorSize=32): `MMA_NSF=1`, so `kBasicBlockShape = Shape<32, 1>`. For K=64, there are only 2 SFs along K (64/32=2), but `Blk_SF=4` creates an outer K shape with a zero dimension, triggering: "TMA requires CTA_Tile and SLayout top-level size equivalence."

For fp4×fp4 (MMA atom K=64): `MMA_NSF=2`, and the outer K shape is trivially `Shape<1, 1>` — this is why fp4×fp4 K=64 already compiles but fp8×fp4 doesn't.
Changes

3rdparty/cutlass (submodule → brandonmmusic-max/cutlass@flashinfer-sm120-k64-fixes)

- include/cute/atom/copy_traits_sm90_tma.hpp: in `fill_tma_gmem_shape_stride`, detect `is_constant<0>` basis elements and set shape=1, stride=0 directly instead of calling `basis_get`, which returns a tuple
- include/cutlass/gemm/collective/builders/sm120_blockscaled_mma_builder.inl:
  - `EffBlk_SF = min(K/SFVectorSize, Blk_SF)` clamps the SF block size when K is small
  - `FoldSFIntoBasicBlock` — when `EffBlk_SF > MMA_NSF`, folds SF into `kBasicBlockShape` so the outer K dimensions become trivial (all 1s), allowing TMA to construct valid descriptors

These CUTLASS fixes are generic and will be upstreamed to NVIDIA/cutlass. Once merged upstream, the submodule can be pointed back to the official repo.

flashinfer/jit/gemm/cutlass/generate_kernels.py

- Add [128,128,64], [128,256,64], [256,128,64] to SM120 cta_shapes_mnk
- Add `_sm120_supported_shapes` set for 99KB SMEM constraint filtering

csrc/.../moe_gemm_template_dispatch_tma_ws.h

- Add K=64 entries to `are_tile_shapes_supported_sm120()`
- Add `SHAPE_CASE(120, ...)` for K=64 dispatch

Benchmark Results
Hardware: 4x RTX PRO 6000 Blackwell (96GB, SM 12.0, 99KB SMEM)
Model: Qwen3.5-397B-A17B-NVFP4, TP=4, MTP=5
Environment: CUDA 13.2, CUTLASS 4.4.1, vLLM 0.17.1rc1, FlashInfer 0.6.6
Decode Throughput
Context Length Scaling (single user, 1K output)
Notes
Test Plan
Related: vllm-project/vllm#30135