[NVIDIA] Bugfix NVFP4 DGX Spark and RTX50 #38423
johnnynunez wants to merge 18 commits into vllm-project:main from
Conversation
Code Review
This pull request updates the CUTLASS revision to v4.4.2 and upgrades FlashInfer to version 0.6.7 across the Dockerfiles and requirement files. It also introduces runtime checks to verify that NVFP4 quantization kernels are compiled for the current GPU's SM version (SM100 or SM120) before use, preventing invalid backend selection or runtime failures. I have no feedback to provide.
Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Could a maintainer please add the
Hi @johnnynunez, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
CUTLASS v4.4.2 added ArchTag to DispatchPolicy in sm90_gemm_tma_warpspecialized_cooperative.hpp to distinguish SM90 from SM120 kernel paths. Machete's custom MacheteCollectiveMma defines its own DispatchPolicy but was missing this field, causing all 18 Machete template instantiations to fail with "has no member ArchTag". Also reformats nvfp4_scaled_mm_entry.cu to satisfy pre-commit linter. Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Getting consistent Illegal Instruction crashes with this PR, building FlashInfer from main with FLASHINFER_CUDA_ARCH_LIST=12.1a.
Check that you applied the PR correctly and that you are on the right CUTLASS version.
Related bug in some models:
Test Report: PR #38423 on DGX Spark SM121 with Qwen3.5-122B-A10B-NVFP4
Hardware: Single DGX Spark GB10 (SM121, 128GB UMA)

Results

What this PR fixed: previously (before this PR), 128K context crashed with
What still crashes: MTP + 128K context. Short MTP requests (64 tokens) succeed, but longer decode (1024 tokens) hits
32K context + MTP works perfectly at all token lengths.
Thank you for all the tests. Yes, some users have hit that; it has been reported to the CUTLASS team. Thank you.
vllm/model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py
Fix fp8 trtllm gen routing bias dtype Signed-off-by: Johnny <johnnynuca14@gmail.com>
could you try? For me it is now working... you were right, it was a race condition...

Run vLLM on Thor & Spark

Step-by-step guide to building and running vLLM with FlashInfer on NVIDIA Thor (SM110) and Spark (SM121) platforms.

1. Install uv

Clear any stale cache, install ccache, then install uv:

```shell
sudo rm -rf ~/.cache/
sudo apt install ccache
curl -LsSf https://astral.sh/uv/install.sh | sh
```

2. Create a Virtual Environment

```shell
sudo apt install python3-dev
uv venv .vllm --python 3.12
source .vllm/bin/activate
```

3. Install PyTorch

```shell
uv pip install --force-reinstall torch torchvision
```

4. Build and Install vLLM

```shell
git clone --recursive https://github.com/johnnynunez/vllm.git
cd vllm
export VLLM_VERSION=0.18.1
export TORCH_CUDA_ARCH_LIST=12.1a
export USE_CUDNN=1
export VERBOSE=1
export CUDA_HOME=/usr/local/cuda
export PATH="${CUDA_HOME}/bin:$PATH"
export SETUPTOOLS_SCM_PRETEND_VERSION="${VLLM_VERSION}"
export DG_JIT_USE_NVRTC=1 # DeepGEMM NVRTC support (up to 10x compilation speedup)
python3 use_existing_torch.py || echo "Skipping use_existing_torch.py"
uv pip install -r requirements/build.txt -v
python3 -m setuptools_scm
# Constrain parallelism on aarch64 to avoid OOM during compilation
ARCH=$(uname -i)
if [ "${ARCH}" = "aarch64" ]; then
  export NVCC_THREADS=1
  export CUDA_NVCC_FLAGS="-Xcudafe --threads=1"
  export MAKEFLAGS='-j2'
  export MAX_JOBS=2 # set explicitly; CMAKE_BUILD_PARALLEL_LEVEL below relies on it
  export CMAKE_BUILD_PARALLEL_LEVEL=$MAX_JOBS
  export NINJAFLAGS='-j2'
fi
uv build --wheel --no-build-isolation -v --out-dir ./wheels .
uv pip install ./wheels/vllm*.whl
cd /opt/vllm
uv pip install compressed-tensors
```

5. Uninstall Pre-built FlashInfer Packages

Remove any pre-compiled FlashInfer packages to avoid conflicts with the editable install:

```shell
uv pip uninstall flashinfer-cubin flashinfer-python
```

6. Install FlashInfer from Source

```shell
sudo rm -rf ~/.cache/
git clone --recursive https://github.com/johnnynunez/flashinfer.git
cd flashinfer
uv pip install --force-reinstall --no-build-isolation -e .
```

7. Export Environment Variables

Set the CUDA architecture target and related paths (Spark: 12.1a, Thor: 11.0a):

```shell
export TORCH_CUDA_ARCH_LIST=12.1a # Spark: 12.1a, Thor: 11.0a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export CUDA_HOME=/usr/local/cuda
export CPATH=$CUDA_HOME/include:${CPATH}
export C_INCLUDE_PATH=$CUDA_HOME/include:${C_INCLUDE_PATH}
export CPLUS_INCLUDE_PATH=$CUDA_HOME/include:${CPLUS_INCLUDE_PATH}
# Recommended on Jetson platforms
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib:${LD_LIBRARY_PATH}
export LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib:${LIBRARY_PATH}
```

8. Clear Memory

Drop filesystem caches to free up memory before serving:

```shell
sudo sysctl -w vm.drop_caches=3
```

9. Serve the Model (Speculative Decoding with MTP)

Launch vLLM with Qwen3.5-122B using 3 speculative tokens via MTP:

```shell
vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --port 9000 \
  --max-num-seqs 2 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --trust-remote-code \
  --gpu-memory-utilization 0.80 \
  --kv-cache-dtype fp8 \
  --speculative_config '{"method":"mtp","num_speculative_tokens":3}'
```

10. Run a Stress Test (Separate Terminal)

In another terminal, run the following inline script:

python3 -c "
import requests, time, sys, concurrent.futures
MODEL = 'Sehyo/Qwen3.5-122B-A10B-NVFP4'
PORT = 9000
# ~100K tokens — safely under 131072 - 1024 = 130048 limit
parts = []
for i in range(3000):
parts.append(f'Section {i}: The quick brown fox jumps over the lazy dog. Technology advances rapidly in quantum computing and distributed systems. ')
prompt = 'Write a comprehensive analysis: ' + ' '.join(parts)
print(f'Approx words: {len(prompt.split())}')
sys.stdout.flush()
def send_request(idx):
t0 = time.time()
try:
r = requests.post(f'http://localhost:{PORT}/v1/completions', json={
'model': MODEL,
'prompt': prompt,
'max_tokens': 1024,
'temperature': 0.7,
}, timeout=600)
elapsed = time.time() - t0
if r.status_code == 200:
data = r.json()
text = data['choices'][0]['text']
usage = data.get('usage', {})
return f'[{idx}] OK - {len(text)}ch, prompt={usage.get(\"prompt_tokens\",\"?\")}, gen={usage.get(\"completion_tokens\",\"?\")}, {elapsed:.1f}s'
else:
err = r.json().get('error',{}).get('message','')[:200]
return f'[{idx}] FAIL ({r.status_code}): {err}'
except Exception as e:
elapsed = time.time() - t0
return f'[{idx}] CRASH - {type(e).__name__}: {e} ({elapsed:.1f}s)'
# Phase 1: Single ~100K token request
print('=== Phase 1: Single ~100K token request ===')
sys.stdout.flush()
print(send_request(1)); sys.stdout.flush()
# Phase 2: 2 concurrent
print('=== Phase 2: 2 concurrent ===')
sys.stdout.flush()
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
futs = [pool.submit(send_request, i) for i in range(2, 4)]
for f in concurrent.futures.as_completed(futs):
print(f.result()); sys.stdout.flush()
# Phase 3: 10 rapid
print('=== Phase 3: 10 rapid sequential ===')
sys.stdout.flush()
for i in range(4, 14):
r = send_request(i)
print(r); sys.stdout.flush()
if 'CRASH' in r: break
print('Done.')
" 2>&1 |
ready to merge! @mgoin Now it is working perfectly, and B200 accuracy tests passed for NVFP4.

Nemotron Super NVFP4 - DGX Spark

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--gpu-memory-utilization 0.7 \
--max-model-len 262144 \
--max-num-seqs 10 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8000 \
--enable-auto-tool-choice \
--load-format fastsafetensors \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3 \
--mamba_ssm_cache_dtype float32

Results (Benchmark & Stress Test)

Auto-detected HF model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (served as: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)
llama-benchy (0.3.5)
Date: 2026-03-30 01:35:34
Benchmarking model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 at http://localhost:8000/v1
Concurrency levels: [1]
Loading text from cache: /home/johnny/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 143827
Warming up...
Warmup (User only) complete. Delta: 16 tokens (Server: 38, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 1.63 ms
Running test: pp=2048, tg=32, depth=0, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=8192, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16384, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32768, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=65535, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=100000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=200000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------------------------------------------|-----------------:|-----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 | 1722.48 ± 394.11 | | 1269.76 ± 345.98 | 1268.14 ± 345.98 | 1269.84 ± 345.98 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 | 12.76 ± 0.01 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d4096 | 1948.06 ± 80.28 | | 3161.05 ± 134.07 | 3159.43 ± 134.07 | 3161.13 ± 134.05 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d4096 | 12.75 ± 0.01 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d8192 | 1964.84 ± 4.14 | | 5213.28 ± 10.99 | 5211.65 ± 10.99 | 5213.35 ± 10.97 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d8192 | 12.71 ± 0.01 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 | 1934.31 ± 5.53 | | 9530.67 ± 27.20 | 9529.04 ± 27.20 | 9530.74 ± 27.22 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d16384 | 12.64 ± 0.01 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d32768 | 1857.07 ± 14.17 | | 18750.32 ± 143.56 | 18748.69 ± 143.56 | 18750.39 ± 143.57 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d32768 | 12.64 ± 0.02 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d65535 | 1759.29 ± 5.89 | | 38416.91 ± 128.78 | 38415.28 ± 128.78 | 38416.98 ± 128.78 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d65535 | 12.64 ± 0.04 | 13.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d100000 | 1656.44 ± 4.33 | | 61608.98 ± 160.90 | 61607.35 ± 160.90 | 61609.06 ± 160.91 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d100000 | 12.69 ± 0.08 | 13.67 ± 0.47 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d200000 | 1397.08 ± 7.47 | | 144626.89 ± 771.10 | 144625.26 ± 771.10 | 144626.94 ± 771.11 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d200000 | 12.59 ± 0.12 | 14.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 01:35:34 | latency mode: api
(APIServer pid=33932) INFO 03-30 01:50:49 [loggers.py:259] Engine 000: Avg prompt throughput: 20205.7 tokens/s, Avg generation throughput: 3.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=33932) INFO 03-30 01:50:59 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
# Currently FI requires bfloat16 routing bias.
# https://github.com/flashinfer-ai/flashinfer/issues/2909
if e_score_correction_bias is not None:
    e_score_correction_bias = e_score_correction_bias.to(torch.bfloat16)
@pavanimajety do you know if this is right? I thought we fixed this issue for trtllm MoE across the board
Signed-off-by: Johnny <johnnynuca14@gmail.com>
Summary
Fix `cudaErrorIllegalInstruction` when running NVFP4 models (e.g. `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4`) on SM12x GPUs (RTX 50 series SM120, DGX Spark SM121).

Root causes

1. CUTLASS v4.2.2 lacks SM12x NVFP4 tile constraints: the bundled CUTLASS was missing SM120f family-level compilation support for NVFP4/MX Grouped GEMM and SM121-specific tile configurations (DGX Spark). This caused `IllegalInstruction` during decode when small-M tile variants were selected. Related upstream: NVIDIA/cutlass#3038.
2. FlashInfer 0.6.6 bundles CUTLASS 4.2.1: the FlashInfer CUTLASS MoE backend failed on SM12x with `Failed to initialize cutlass TMA WS grouped gemm` due to the same missing tile constraints. Fixed upstream in flashinfer-ai/flashinfer#2798.
3. `cutlass_scaled_mm_supports_fp4()` reported false availability: it only checked the CUDA runtime version (>= 12080), not whether the SM-specific kernel was actually compiled. On a build with only `ENABLE_NVFP4_SM100`, it incorrectly reported CUTLASS as available for SM12x, then failed at dispatch.
4. Quantization kernels had no SM runtime guard: the `scaled_fp4_quant`, `silu_and_mul_nvfp4_quant`, and expert quant entry points dispatched to `_sm1xxa` kernels if any SM1xx was compiled, with no runtime check. If only SM100 SASS existed, CUDA would JIT-compile SM100 PTX for SM120 (a different major architecture), producing illegal instructions asynchronously that surfaced later at `synchronize()` as an opaque CUDA error.
5. FlashInfer CUTLASS backend bypassed quant kernel checks: `select_nvfp4_linear_backend()` selected FlashInfer CUTLASS solely on `has_device_capability(100)`, without verifying that the vLLM quantization kernels (used by all non-Marlin backends) were compiled for the current SM.

Changes
- `CMakeLists.txt`: CUTLASS revision updated to v4.4.2
- `docker/Dockerfile`, `docker/Dockerfile.nightly_torch`, `docker/versions.json`: `FLASHINFER_VERSION` 0.6.6 → 0.6.7
- `nvfp4_scaled_mm_entry.cu`: `cutlass_scaled_mm_supports_fp4()` now checks the compile-time `ENABLE_NVFP4_SM100`/`ENABLE_NVFP4_SM120` guards per SM range instead of a blanket `>= 100` check
- `nvfp4_quant_entry.cu`: adds an `nvfp4_quant_sm_supported()` runtime guard to all four quant entry points (`scaled_fp4_quant`, `scaled_fp4_experts_quant`, `silu_and_mul_nvfp4_quant`, `silu_and_mul_scaled_fp4_experts_quant`)
- `nvfp4_utils.py`: `select_nvfp4_linear_backend()` gates FlashInfer CUTLASS on `cutlass_fp4_supported()` and adds a validation assert for all FlashInfer backends

What is NOT changed
Marlin remains a valid fallback on SM12x. Marlin FP4 uses weight-only dequantization to BF16; it does not use native FP4 tensor core instructions and works correctly on all Blackwell architectures, including DGX Spark. Benchmarks confirm Marlin is stable on SM121 (~558 tok/s, on par with vLLM CUTLASS at ~562 tok/s). The Marlin path (`apply_fp4_marlin_linear`) bypasses the vLLM quant kernels entirely, so the SM guards in `nvfp4_quant_entry.cu` do not affect it.

Behavior on SM12x after this PR
- Built with `ENABLE_NVFP4_SM120` and CUTLASS v4.4.2: native NVFP4 kernels run, no `IllegalInstruction`.
- Built without `ENABLE_NVFP4_SM120`: the runtime guard rejects the kernels instead of the previous `IllegalInstruction` (SM100 PTX JIT-compiled to SM120).
- FlashInfer 0.6.6 (bundled CUTLASS 4.2.1): `Failed to initialize cutlass TMA WS grouped gemm`; resolved by the 0.6.7 upgrade.

Follow-up: FlashInfer 0.6.8
flashinfer-ai/flashinfer#2738 (merged March 28, 2026) adds native NVFP4 and MXFP4 group GEMM support for SM120/SM121 (RTX 50 / DGX Spark) directly in FlashInfer. This will land in FlashInfer 0.6.8. Once released, `FLASHINFER_VERSION` should be bumped in `docker/Dockerfile`, `docker/Dockerfile.nightly_torch`, and `docker/versions.json` to unlock FlashInfer's own SM12x NVFP4/MXFP4 kernels (including GDC unguarding and PDL group GEMM fixes). TODO comments have been added to both Dockerfiles tracking this.

Test plan
- Build with `CUDA_ARCHS="12.0a;12.1a"` on DGX Spark (SM121); verify an NVFP4 model serves with the vLLM CUTLASS backend (`VLLM_NVFP4_GEMM_BACKEND=cutlass --moe-backend=cutlass`)
- Build with `CUDA_ARCHS="12.0a;12.1a"`; verify the Marlin fallback still works (`VLLM_NVFP4_GEMM_BACKEND=marlin --moe-backend=marlin`)
- Build with `CUDA_ARCHS="10.0a"` only; verify the Marlin fallback on SM12x (no `IllegalInstruction`)
- Run `tests/models/quantization/test_nvfp4.py` on SM120+
- Build `Dockerfile` and `Dockerfile.nightly_torch`
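The guard-then-select behavior described in the summary can be sketched in plain Python. This is an illustrative model of the dispatch rules only, not vLLM's actual API: `nvfp4_sm_supported` and `select_nvfp4_backend` are hypothetical names, and `compiled_sms` stands in for the compile-time `ENABLE_NVFP4_SM100`/`ENABLE_NVFP4_SM120` flags.

```python
# Illustrative sketch (not vLLM's real API) of the guard-then-select logic:
# an NVFP4 kernel is usable only if its SM family was actually compiled,
# and the CUTLASS backend is chosen only when that guard passes; otherwise
# fall back to Marlin (weight-only dequant to BF16, safe on all Blackwell).

def nvfp4_sm_supported(sm: int, compiled_sms: set[int]) -> bool:
    """Return True if NVFP4 kernels for this GPU's SM family were compiled."""
    family = (sm // 10) * 10  # e.g. SM121 (DGX Spark) uses the SM120 family
    return family in compiled_sms

def select_nvfp4_backend(sm: int, compiled_sms: set[int]) -> str:
    """Prefer native CUTLASS NVFP4 only when its kernels exist for this SM."""
    if nvfp4_sm_supported(sm, compiled_sms):
        return "cutlass"
    return "marlin"  # Marlin bypasses the NVFP4 quant kernels entirely

# Build with only SM100 compiled: SM121 must NOT get the CUTLASS path
# (previously this JIT'd SM100 PTX and produced IllegalInstruction).
print(select_nvfp4_backend(121, {100}))       # marlin
print(select_nvfp4_backend(121, {100, 120}))  # cutlass
print(select_nvfp4_backend(100, {100}))       # cutlass
```

The key point the sketch captures is that a device capability check alone (SM >= 100) is not sufficient; the decision must also consult which SM families were compiled into the build.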