
Commit 03e7064

byshiue authored and Ransiki committed
[Fix][Chore][Qwen3] fix bug of using fp4 on sm120 (NVIDIA#6065)
Signed-off-by: bhsueh <[email protected]>
Signed-off-by: Ransiki Zhang <[email protected]>
1 parent: a640a66

File tree

4 files changed: +5, −5 lines


cpp/tensorrt_llm/thop/attentionOp.cpp

Lines changed: 2 additions & 1 deletion

@@ -671,7 +671,8 @@ bool attention_supports_nvfp4_output(int64_t const num_heads, int64_t const num_
     bool const use_paged_context_fmha, bool is_mla_enable)
 {
     // Only Blackwell supports NVFP4 output.
-    if (tensorrt_llm::common::getSMVersion() < 100)
+    // SM 120 does not support NVFP4 output.
+    if (tensorrt_llm::common::getSMVersion() < 100 || tensorrt_llm::common::getSMVersion() == 120)
     {
         return false;
     }
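As background on the hunk above: getSMVersion() reports the device compute capability as major × 10 + minor, so 100 is SM100 (data-center Blackwell) and 120 is SM120. Below is a minimal Python sketch of the predicate the new condition implements; supports_nvfp4_output is a hypothetical helper for illustration, not a function in the repository.

def supports_nvfp4_output(sm_version: int) -> bool:
    """Mirror of the C++ guard above (hypothetical helper).

    NVFP4 attention output is only produced on SM100-class Blackwell:
    anything pre-Blackwell (< 100) or exactly SM120 is rejected.
    """
    return sm_version >= 100 and sm_version != 120

# Spot checks against the compute capabilities the diff is concerned with:
assert not supports_nvfp4_output(90)   # Hopper (SM90): too old
assert supports_nvfp4_output(100)      # SM100 Blackwell: supported
assert not supports_nvfp4_output(120)  # SM120: now explicitly excluded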

examples/models/core/qwen/README.md

Lines changed: 1 addition & 1 deletion

@@ -70,7 +70,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
 | Qwen2.5-72B(-Instruct)| Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
 | QwQ-32B | Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
 | Qwen3-32B | Y | Y | Y | - | - | - | - | Y | - | Y | Hopper+ |
-| Qwen3-235B-A3B | Y | Y | Y | - | - | - | - | Y | - | Y | Hopper+ |
+| Qwen3-235B-A22B | Y | Y | Y | - | - | - | - | Y | - | Y | Hopper+ |
 
 Please note that Y* sign means that the model does not support all the AWQ + TP combination.
7676

tests/integration/defs/accuracy/test_llm_api_pytorch.py

Lines changed: 1 addition & 1 deletion

@@ -1844,7 +1844,7 @@ def test_nvfp4(self, tp_size, pp_size, ep_size, attention_dp, cuda_graph,
             cuda_graph_config=CudaGraphConfig() if cuda_graph else None,
             moe_config=MoeConfig(backend=moe_backend))
 
-        kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.6)
+        kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.4)
         with LLM(
                 f"{llm_models_root()}/Qwen3/saved_models_Qwen3-235B-A22B_nvfp4_hf",
                 tensor_parallel_size=tp_size,
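For reference, a minimal sketch of how this knob is passed through the LLM API, assuming the standard tensorrt_llm LLM and KvCacheConfig entry points; the model path below is a placeholder, not the checkpoint the test loads.

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# free_gpu_memory_fraction caps the share of remaining GPU memory that the
# KV cache may claim; the test now reserves 40% instead of 60%, leaving more
# headroom for activations, CUDA graphs, and the NVFP4 weights.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.4)

llm = LLM(model="/path/to/Qwen3-235B-A22B_nvfp4_hf",  # placeholder path
          kv_cache_config=kv_cache_config)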

tests/integration/test_lists/waives.txt

Lines changed: 1 addition & 2 deletions

@@ -399,8 +399,7 @@ examples/test_llama.py::test_llm_llama_v3_1_2nodes_8gpus[llama-3.1-8b-disable_fp
 test_e2e.py::test_openai_multinodes_chat_tp16pp1 SKIP (https://nvbugs/5112075)
 examples/test_qwen.py::test_llm_hf_qwen_quantization_1gpu[qwen2_vl_7b_instruct-fp8-bfloat16] SKIP (https://nvbugs/5322488)
 accuracy/test_cli_flow.py::TestSantacoder::test_auto_dtype SKIP (https://nvbugs/5234043)
-full:B200/accuracy/test_llm_api_pytorch.py::TestQwen3_235B_A22B::test_nvfp4[latency_moe_cutlass] SKIP (https://nvbugs/5355219)
-full:B200/accuracy/test_llm_api_pytorch.py::TestQwen3_235B_A22B::test_nvfp4[latency_moe_trtllm] SKIP (https://nvbugs/5355219)
+full:B200/accuracy/test_llm_api_pytorch.py::TestQwen3_235B_A22B::test_nvfp4[latency_moe_trtllm] SKIP (https://nvbugs/5401163)
 examples/test_llama.py::test_llm_llama_lookahead_xqa_fp8_1gpu[llama-3.1-8b] SKIP (https://nvbugs/5355054)
 examples/test_llama.py::test_llm_llama_lookahead_xqa_fp8_1gpu[llama-3.2-1b] SKIP (https://nvbugs/5355054)
 examples/test_multimodal.py::test_llm_multimodal_general[VILA1.5-3b-pp:1-tp:1-float16-bs:8-cpp_e2e:True-nb:1] SKIP (https://nvbugs/5360086)
