[NVIDIA] fix(jit): enable GDC for CUTLASS fused MoE PDL — prevent random crashes on SM12x#2913
johnnynunez wants to merge 2 commits into flashinfer-ai:main
Conversation
📝 Walkthrough

Extended CUTLASS Grid Dependency Control (GDC) compile-time enablement to cover additional SM100-family CUDA architectures, and added corresponding NVCC defines to the JIT build pipelines for fused MoE and FP8 blockscale kernels.
Actionable comments posted: 1

In `csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/arch/grid_dependency_control.h`, around lines 36-54: the new multiline preprocessor block that enables `CUTLASS_GDC_ENABLED` (the `#if` checking `CUDA_BARRIER_ENABLED`, `CUTLASS_ENABLE_GDC_FOR_SM100`, and the various `__CUDA_ARCH__` cases, including `CUDA_ARCH_FAMILY` and `CUDA_ARCH_CONDITIONAL_OR_FAMILY`) is misformatted and fails clang-format. Run clang-format on the changed file and reformat the `#if`/`#endif` block so line wrapping and indentation follow the project's clang-format rules, then commit the formatted file, keeping those symbols semantically unchanged.
📒 Files selected for processing (3)

- csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/arch/grid_dependency_control.h
- flashinfer/jit/fused_moe.py
- flashinfer/jit/gemm/fp8_blockscale.py
steps to replicate: Run vLLM on Thor & Spark

Step-by-step guide to building and running vLLM with FlashInfer on NVIDIA Thor (SM110) and Spark (SM121) platforms.

**1. Install uv**

Clear any stale cache, install `ccache`, then install `uv`:

```shell
sudo rm -rf ~/.cache/
sudo apt install ccache
curl -LsSf https://astral.sh/uv/install.sh | sh
```

**2. Create a Virtual Environment**

```shell
sudo apt install python3-dev
uv venv .vllm --python 3.12
source .vllm/bin/activate
```

**3. Install PyTorch**

```shell
uv pip install --force-reinstall torch torchvision
```

**4. Build and Install vLLM**

```shell
git clone --recursive https://github.com/johnnynunez/vllm.git
cd vllm
export VLLM_VERSION=0.18.1
export TORCH_CUDA_ARCH_LIST=12.1a
export USE_CUDNN=1
export VERBOSE=1
export CUDA_HOME=/usr/local/cuda
export PATH="${CUDA_HOME}/bin:$PATH"
export SETUPTOOLS_SCM_PRETEND_VERSION="${VLLM_VERSION}"
export DG_JIT_USE_NVRTC=1  # DeepGEMM NVRTC support — up to 10x compilation speedup
python3 use_existing_torch.py || echo "Skipping use_existing_torch.py"
uv pip install -r requirements/build.txt -v
python3 -m setuptools_scm
# Constrain parallelism on aarch64 to avoid OOM during compilation
ARCH=$(uname -i)
if [ "${ARCH}" = "aarch64" ]; then
  export NVCC_THREADS=1
  export CUDA_NVCC_FLAGS="-Xcudafe --threads=1"
  export MAKEFLAGS='-j2'
  export CMAKE_BUILD_PARALLEL_LEVEL=$MAX_JOBS
  export NINJAFLAGS='-j2'
fi
uv build --wheel --no-build-isolation -v --out-dir ./wheels .
uv pip install ./wheels/vllm*.whl
cd /opt/vllm
uv pip install compressed-tensors
```

**5. Uninstall Pre-built FlashInfer Packages**

Remove any pre-compiled FlashInfer packages to avoid conflicts with the editable install:

```shell
uv pip uninstall flashinfer-cubin flashinfer-python
```

**6. Install FlashInfer from Source**

```shell
sudo rm -rf ~/.cache/
git clone --recursive https://github.com/johnnynunez/flashinfer.git
cd flashinfer
uv pip install --force-reinstall --no-build-isolation -e .
```

**7. Export Environment Variables**

Set the CUDA architecture target and related paths:

```shell
export TORCH_CUDA_ARCH_LIST=12.1a  # Spark: 12.1a — Thor: 11.0a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export CUDA_HOME=/usr/local/cuda
export CPATH=$CUDA_HOME/include:${CPATH}
export C_INCLUDE_PATH=$CUDA_HOME/include:${C_INCLUDE_PATH}
export CPLUS_INCLUDE_PATH=$CUDA_HOME/include:${CPLUS_INCLUDE_PATH}
# Recommended on Jetson platforms
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib:${LD_LIBRARY_PATH}
export LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib:${LIBRARY_PATH}
```

**8. Clear Memory**

Drop filesystem caches to free up memory before serving:

```shell
sudo sysctl -w vm.drop_caches=3
```

**9. Serve the Model (Speculative Decoding with MTP)**

Launch vLLM with Qwen3.5-122B using 3 speculative tokens via MTP:

```shell
vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --port 9000 \
  --max-num-seqs 2 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --trust-remote-code \
  --gpu-memory-utilization 0.80 \
  --kv-cache-dtype fp8 \
  --speculative_config '{"method":"mtp","num_speculative_tokens":3}'
```

**10. Run a Stress Test (Separate Terminal)**

In another terminal, with the virtual environment active, run:

```shell
python3 -c "
import requests, time, sys, concurrent.futures

MODEL = 'Sehyo/Qwen3.5-122B-A10B-NVFP4'
PORT = 9000

# ~100K tokens — safely under 131072 - 1024 = 130048 limit
parts = []
for i in range(3000):
    parts.append(f'Section {i}: The quick brown fox jumps over the lazy dog. Technology advances rapidly in quantum computing and distributed systems. ')
prompt = 'Write a comprehensive analysis: ' + ' '.join(parts)
print(f'Approx words: {len(prompt.split())}')
sys.stdout.flush()

def send_request(idx):
    t0 = time.time()
    try:
        r = requests.post(f'http://localhost:{PORT}/v1/completions', json={
            'model': MODEL,
            'prompt': prompt,
            'max_tokens': 1024,
            'temperature': 0.7,
        }, timeout=600)
        elapsed = time.time() - t0
        if r.status_code == 200:
            data = r.json()
            text = data['choices'][0]['text']
            usage = data.get('usage', {})
            return f'[{idx}] OK - {len(text)}ch, prompt={usage.get(\"prompt_tokens\",\"?\")}, gen={usage.get(\"completion_tokens\",\"?\")}, {elapsed:.1f}s'
        else:
            err = r.json().get('error',{}).get('message','')[:200]
            return f'[{idx}] FAIL ({r.status_code}): {err}'
    except Exception as e:
        elapsed = time.time() - t0
        return f'[{idx}] CRASH - {type(e).__name__}: {e} ({elapsed:.1f}s)'

# Phase 1: Single ~100K token request
print('=== Phase 1: Single ~100K token request ===')
sys.stdout.flush()
print(send_request(1)); sys.stdout.flush()

# Phase 2: 2 concurrent
print('=== Phase 2: 2 concurrent ===')
sys.stdout.flush()
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futs = [pool.submit(send_request, i) for i in range(2, 4)]
    for f in concurrent.futures.as_completed(futs):
        print(f.result()); sys.stdout.flush()

# Phase 3: 10 rapid
print('=== Phase 3: 10 rapid sequential ===')
sys.stdout.flush()
for i in range(4, 14):
    r = send_request(i)
    print(r); sys.stdout.flush()
    if 'CRASH' in r: break

print('Done.')
" 2>&1
```
Now it is working perfectly and B200 accuracy tests passed for NVFP4.

**Nemotron Super NVFP4 - DGX Spark**

```shell
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.7 \
  --max-model-len 262144 \
  --max-num-seqs 10 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --load-format fastsafetensors \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3 \
  --mamba_ssm_cache_dtype float32
```

**Results (Benchmark & Stress Test)**

Auto-detected HF model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (served as: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)
llama-benchy (0.3.5)
Date: 2026-03-30 01:35:34
Benchmarking model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 at http://localhost:8000/v1
Concurrency levels: [1]
Loading text from cache: /home/johnny/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 143827
Warming up...
Warmup (User only) complete. Delta: 16 tokens (Server: 38, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 1.63 ms
Running test: pp=2048, tg=32, depth=0, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=8192, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16384, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32768, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=65535, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=100000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=200000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------------------------------------------|-----------------:|-----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 | 1722.48 ± 394.11 | | 1269.76 ± 345.98 | 1268.14 ± 345.98 | 1269.84 ± 345.98 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 | 12.76 ± 0.01 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d4096 | 1948.06 ± 80.28 | | 3161.05 ± 134.07 | 3159.43 ± 134.07 | 3161.13 ± 134.05 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d4096 | 12.75 ± 0.01 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d8192 | 1964.84 ± 4.14 | | 5213.28 ± 10.99 | 5211.65 ± 10.99 | 5213.35 ± 10.97 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d8192 | 12.71 ± 0.01 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 | 1934.31 ± 5.53 | | 9530.67 ± 27.20 | 9529.04 ± 27.20 | 9530.74 ± 27.22 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d16384 | 12.64 ± 0.01 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d32768 | 1857.07 ± 14.17 | | 18750.32 ± 143.56 | 18748.69 ± 143.56 | 18750.39 ± 143.57 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d32768 | 12.64 ± 0.02 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d65535 | 1759.29 ± 5.89 | | 38416.91 ± 128.78 | 38415.28 ± 128.78 | 38416.98 ± 128.78 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d65535 | 12.64 ± 0.04 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d100000 | 1656.44 ± 4.33 | | 61608.98 ± 160.90 | 61607.35 ± 160.90 | 61609.06 ± 160.91 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d100000 | 12.69 ± 0.08 | 13.67 ± 0.47 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d200000 | 1397.08 ± 7.47 | | 144626.89 ± 771.10 | 144625.26 ± 771.10 | 144626.94 ± 771.11 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d200000 | 12.59 ± 0.12 | 14.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 01:35:34 | latency mode: api
(APIServer pid=33932) INFO 03-30 01:50:49 [loggers.py:259] Engine 000: Avg prompt throughput: 20205.7 tokens/s, Avg generation throughput: 3.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=33932) INFO 03-30 01:50:59 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
/bot run
Looks like this PR eliminates NVFP4 crashes with the flashinfer_cutlass kernel. I built from main with this PR applied on top. @johnnynunez - I'm getting slightly better numbers than you:

llama-benchy (0.3.5)
Summary

- Added the `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` compile flag to all CUTLASS fused MoE JIT modules (SM100/SM103/SM120) and `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` to SM90 modules
- Synced `grid_dependency_control.h` with upstream CUTLASS to support SM100/SM103/SM110/SM120/SM121 GDC
- Added `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` to the FP8 blockscale GEMM SM90 module

Problem
Random `cudaErrorIllegalInstruction` crashes on DGX Spark (SM121) and RTX 50-series (SM120) when running NVFP4 MoE models (e.g., Nemotron, Qwen3.5-122B) under load. The crashes are intermittent and worsen with longer context lengths and higher concurrency.

Root cause: PR #2780 fixed the missing GDC compile flags for the GEMM modules (`flashinfer/jit/gemm/core.py`), but the CUTLASS fused MoE modules in `flashinfer/jit/fused_moe.py` and the FP8 blockscale GEMM module were not fixed. This is the exact same class of bug as #2708.

Without `-DCUTLASS_ENABLE_GDC_FOR_SM100=1`, CUTLASS's `grid_dependency_control.h` compiles `wait_on_dependent_grids()` and `launch_dependent_grids()` as empty no-ops.

Meanwhile, the host-side code still sets `programmaticStreamSerializationAllowed = true` (PDL enabled) via `device_support_pdl()`, which returns `True` for all `major >= 9`, including SM12x. Kernels are therefore launched with PDL semantics while their GDC barriers have been compiled out, which surfaces as `cudaErrorIllegalInstruction`. The crash is random because it depends on exact kernel scheduling timing, which varies per request.
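This host/device mismatch can be sketched in a few lines of Python (an illustrative model only: `host_enables_pdl` mirrors the `major >= 9` check described above, and `device_has_gdc` stands in for the compile-time flag gating; neither function is real FlashInfer API):

```python
# Illustrative model of the PDL/GDC mismatch (not real FlashInfer code).

def host_enables_pdl(major: int) -> bool:
    # Host side: PDL is enabled for any compute capability major >= 9,
    # mirroring the device_support_pdl() behavior described above.
    return major >= 9

def device_has_gdc(arch: int, sm100_flag: bool) -> bool:
    # Device side: GDC instructions are only compiled in when the
    # CUTLASS_ENABLE_GDC_FOR_SM100 define was passed for an SM100+ target.
    return sm100_flag and arch >= 1000

def launch_is_safe(major: int, arch: int, sm100_flag: bool) -> bool:
    # Safe when PDL is off, or when the host's PDL launch and the
    # device's compiled-in GDC barriers agree.
    return not host_enables_pdl(major) or device_has_gdc(arch, sm100_flag)

# SM121 (DGX Spark) without the define: PDL on, GDC a no-op -> crash class.
print(launch_is_safe(major=12, arch=1210, sm100_flag=False))  # False
# With this PR's flag, host and device agree again.
print(launch_is_safe(major=12, arch=1210, sm100_flag=True))   # True
```

With `sm100_flag=False`, the model reproduces the broken pre-PR state: the host enables PDL but the device-side barriers are no-ops, which is exactly the condition that surfaces as `cudaErrorIllegalInstruction`.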
Fix

- `flashinfer/jit/fused_moe.py` — Added GDC flags to all CUTLASS fused MoE modules:
  - `gen_cutlass_fused_moe_sm120_module()`: `-DCUTLASS_ENABLE_GDC_FOR_SM100=1`
  - `gen_cutlass_fused_moe_sm103_module()`: `-DCUTLASS_ENABLE_GDC_FOR_SM100=1`
  - `gen_cutlass_fused_moe_sm100_module()`: `-DCUTLASS_ENABLE_GDC_FOR_SM100=1`
  - `gen_cutlass_fused_moe_sm90_module()`: `-DCUTLASS_ENABLE_GDC_FOR_SM90=1`
  - `gen_trtllm_gen_fused_moe_sm100_module()`: `-DCUTLASS_ENABLE_GDC_FOR_SM100=1`
- `flashinfer/jit/gemm/fp8_blockscale.py` — Added `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` to `gen_fp8_blockscale_gemm_sm90_module()`.
- `csrc/nv_internal/.../grid_dependency_control.h` — Synced with upstream CUTLASS (`3rdparty/cutlass/include/cutlass/arch/grid_dependency_control.h`) to add SM100+ GDC support. Previously only SM90 was handled, so any nv_internal TensorRT-LLM code compiled for SM12x would have GDC barriers silently compiled as no-ops.

Why `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` covers SM12x

CUTLASS uses a single flag, checked in `grid_dependency_control.h`, for the entire Blackwell family.

Why the SM90 GDC flag was NOT added to SM100+ modules
PR #2716 attempted to add both
`-DCUTLASS_ENABLE_GDC_FOR_SM90=1` and `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` to all modules. It broke AOT builds because `sm120_gemm_tma_warpspecialized_cooperative_asymmetric_dma.hpp` checks `CUTLASS_ENABLE_GDC_FOR_SM90` and calls `scheduler.is_last_tile()` — a method not present on the SM120 scheduler. PR #2780 corrected this by using only the SM100 flag for SM100+ modules. This PR follows the same approach.

Related
Test plan
- `rm -rf ~/.cache/flashinfer/` to force a clean JIT rebuild
- Stress test on SM121: no more `cudaErrorIllegalInstruction`; the `CUDA_LAUNCH_BLOCKING=1` workaround is no longer needed
- AOT build with `FLASHINFER_CUDA_ARCH_LIST="12.1a"` completes without errors
- `pytest tests/moe/`
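The flag-selection rule from the Fix section, together with the #2716 constraint (never feed the SM90 flag to SM100+ modules), can be captured in a short sketch (the helper name here is hypothetical, not the actual `flashinfer.jit` API):

```python
# Hypothetical sketch: choose the GDC define for a JIT module by target SM,
# following the mapping this PR applies (not the real flashinfer.jit code).

def gdc_flags_for_module(target_sm: int) -> list:
    if target_sm >= 100:
        # The entire Blackwell family (SM100/SM103/SM110/SM120/SM121) uses
        # the SM100 flag; the SM90 flag must NOT be added here (see #2716).
        return ["-DCUTLASS_ENABLE_GDC_FOR_SM100=1"]
    if target_sm == 90:
        return ["-DCUTLASS_ENABLE_GDC_FOR_SM90=1"]
    return []  # pre-Hopper targets: no GDC define

# Guard against the #2716 regression: no SM90 GDC flag on SM100+ modules.
for sm in (100, 103, 110, 120, 121):
    assert "-DCUTLASS_ENABLE_GDC_FOR_SM90=1" not in gdc_flags_for_module(sm)

print(gdc_flags_for_module(120))  # ['-DCUTLASS_ENABLE_GDC_FOR_SM100=1']
print(gdc_flags_for_module(90))   # ['-DCUTLASS_ENABLE_GDC_FOR_SM90=1']
```

The loop at the end is the kind of regression guard that would have caught #2716, where both flags were passed to every module and the SM120 build broke on an SM90-only code path.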