
Commit a2757b0

faradawn and nv-guomingz authored and committed

[None][doc] add Qwen3-next deployment guide and test cases into L0.

Signed-off-by: Faradawn Yang <[email protected]>
Signed-off-by: Robin Kobus <[email protected]>
Signed-off-by: nv-guomingz <[email protected]>
1 parent 2f94cc3 commit a2757b0

File tree

7 files changed (+53, -19 lines)

docs/source/deployment-guide/quick-start-recipe-for-qwen3-next-on-trtllm.md

Lines changed: 10 additions & 16 deletions
@@ -1,12 +1,12 @@
-# Quick Start Recipe for Qwen3 Next on TensorRT LLM
+# Quick Start Recipe for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware

 ## Introduction

-This is a functional quick-start guide for running the Qwen3-Next model on TensorRT LLM. It focuses on a working setup with recommended defaults. Additional performance optimizations and support (such as Blackwell) will be rolled out in future updates.
+This is a functional quick-start guide for running the Qwen3-Next model on TensorRT LLM. It focuses on a working setup with recommended defaults. Additional performance optimizations and support will be rolled out in future updates.

 ## Prerequisites

-* GPU: NVIDIA Hopper Architecture
+* GPU: NVIDIA Blackwell or Hopper Architecture
 * OS: Linux
 * Drivers: CUDA Driver 575 or Later
 * Docker with NVIDIA Container Toolkit installed
@@ -29,9 +29,9 @@ make -C docker release_build IMAGE_TAG=qwen3-next-local
 make -C docker release_run IMAGE_NAME=tensorrt_llm IMAGE_TAG=qwen3-next-local LOCAL_USER=1
 ```

-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config

-We create a YAML configuration file `/tmp/config.yml` for the TensorRT LLM Server. Note that we should set kv_cache_reuse to false.
+We create a YAML configuration file `/tmp/config.yml` for the TensorRT LLM Server. Note that we should set kv_cache_reuse to false.

 ```shell
 EXTRA_LLM_API_FILE=/tmp/config.yml
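The heredoc body between this hunk and the next is unchanged, so the diff elides it. A minimal sketch of what `/tmp/config.yml` could contain (the key name is an assumption; it mirrors the `KvCacheConfig(enable_block_reuse=False)` used by the accuracy test added below):

```shell
# Sketch only: the actual heredoc body is elided from this diff.
cat <<EOF > ${EXTRA_LLM_API_FILE}
kv_cache_config:
  enable_block_reuse: false
EOF
```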
@@ -52,16 +52,15 @@ EOF
 ```

-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server

-Below is an example command to launch the TRT-LLM server with the Qwen3-Next model from within the container. Note that we currently only support pytorch backend.
+Below is an example command to launch the TensorRT LLM server with the Qwen3-Next model from within the container.

 ```shell
 trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking \
   --host 0.0.0.0 \
   --port 8000 \
-  --backend pytorch \
-  --max_batch_size 720 \
+  --max_batch_size 16 \
   --max_num_tokens 4096 \
   --tp_size 4 \
   --pp_size 1 \
@@ -89,9 +88,6 @@ These options are used directly on the command line when you start the `trtllm-serve` command.

 * **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
 * **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.
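For example, a server that hits OOM could be relaunched with the reduced fraction (a sketch; the other launch flags shown earlier are omitted, and passing the config file via `--extra_llm_api_options` is an assumption based on common TensorRT LLM usage):

```shell
# Sketch: relaunch with a smaller KV-cache fraction after an OOM.
trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking \
  --kv_cache_free_gpu_memory_fraction 0.7 \
  --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```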

-#### `--backend pytorch`
-
-* **Description:** Tells TensorRT LLM to use the **pytorch** backend.

 #### `--max_batch_size`

@@ -155,7 +151,7 @@ curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"

 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.

-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

 ```shell
 curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
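The JSON body is cut off at this hunk boundary. A minimal complete request might look like the following; the prompt matches the example response in the next hunk, and `max_tokens` is inferred from its `completion_tokens` count of 1024:

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-Next-80B-A3B-Thinking",
  "messages": [{"role": "user", "content": "Where is New York?"}],
  "max_tokens": 1024
}'
```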
@@ -177,8 +173,6 @@ Here is an example response:
{"id":"chatcmpl-64ac201c77bf46a7a3a4eca7759b1fd8","object":"chat.completion","created":1759022940,"model":"Qwen/Qwen3-Next-80B-A3B-Thinking","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, the user is asking \"Where is New York?\" Hmm, this seems straightforward but I need to be careful. New York could mean different things—maybe they're confused about the city versus the state. \n\nFirst thought: Are they a tourist planning a trip? Or maybe a student doing homework? Could even be someone国外 who's only heard \"New York\" in movies and isn't sure if it's a city or state. \n\nI should clarify both possibilities immediately. People often mix them up. Like, if someone says \"I'm going to New York\" they're probably talking about NYC, but technically New York State is bigger. \n\nLet me break it down: \n- New York City (NYC) is the famous one—Manhattan, skyscrapers, Times Square. \n- Then New York State (NY) is the whole state, which includes NYC but also upstate areas like Albany (the capital), Buffalo, and even the Adirondacks. \n\nWait, should I mention that NYC is in New York State? Yeah, that's crucial. Otherwise they might think it's two separate things. Also, where is the state located? Northeast US, borders other states like Pennsylvania, New Jersey... maybe name a few neighbors for context. \n\nOh! And the city vs state confusion is super common. Like, people say \"New York\" for the city but forget it's part of a larger state. Should I give examples? \n- \"If someone says 'hot dogs in New York' they mean NYC\" \n- But \"the New York State Police\" covers the whole state\" \n\n...Maybe add a fun fact? Like how NYC is just one city within the state? Or that the state capital isn't NYC but Albany? Yeah, that'll help clear confusion. \n\nAlso—should I ask if they meant the city or the state? Nah, better to cover both upfront. Keep it simple but precise. No jargon. They might not know terms like \"metropolitan area\" or \"boroughs,\" so explain \"boroughs\" as neighborhoods (Manhattan, Brooklyn etc.). \n\nCoordinates: 40.7°N, 74°W? Maybe overkill unless they're into geography. But for the state, roughly 42°N to 45°N latitude? Hmm, probably unnecessary. Stick to landmarks: borders Lake Erie, Atlantic coast... \n\nWait—user might be asking literally for GPS coordinates? Unlikely, but just in case I can add: \"Geographically, New York City is at approximately 40.7°N latitude and 74.0°W longitude...\" but only if the answer feels too dry otherwise. \n\nPriority: Clear, simple, correct. No assuming prior knowledge. Say \"New York can refer to...\" then split into two cases. End with \"Most people mean the city when they say 'New York' in casual talk.\" That’s the key takeaway. \n\n...Double-check: Is Albany really the capital? Yes. And NYC is the largest city but not capital. Got it. Also, New York State is the 27th largest by area? Maybe not relevant. Stick to basics. \n\nFinal structure: \n1. \"New York can mean two things: the city or the state.\" \n2. For city: \n - Where it is (Northeast US) \n - Inside New York State \n - Famous parts (Manhattan etc.) \n3. For state: \n - Larger area, includes city \n - Borders (PA, NJ, CT, MA, VT, Canada) \n - Capital is Albany (not NYC!) \n4. Clear summary: \"In everyday talk, 'New York' usually means NYC.\" \n\nYeah. That covers it without overwhelming them. No markdown, just plain English. Keep it warm—add a \"hope that helps!\" at the end.\n</think>\n\nThat's a common point of confusion! 
**\"New York\" can refer to two different things**, and it's important to distinguish between them:\n\n### 1. **New York City (NYC)** \n - **What it is**: The largest and most famous city in the United States, known for landmarks like Times Square, Central Park, the Statue of Liberty, and Wall Street. \n - **Where it is**: \n - Located in the **northeastern United States**. \n - Situated at the mouth of the **Hudson River**, where it meets the **Atlantic Ocean**. \n - Part of **New York State** (see below). \n - **Geographic details**: \n - Coordinates: Approximately **40.7° N latitude, 74.0° W longitude**. \n - Composed of **5 boroughs**: Manhattan (the \"city\" most people picture), Brooklyn, Queens, The Bronx, and Staten Island. \n - Panoramic view of NYC (including Brooklyn and New Jersey skyline):","reasoning_content":null,"reasoning":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.0}],"usage":{"prompt_tokens":15,"total_tokens":1039,"completion_tokens":1024},"prompt_token_ids":null}
 ```

-
-
 ### Troubleshooting Tips

 * If you encounter CUDA out-of-memory errors, try reducing `max_batch_size` or `max_seq_len`.
@@ -234,7 +228,7 @@ If you want to save the results to a file add the following options.
   --result-filename "concurrency_${concurrency}.json"
 ```

-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
+For more benchmarking options see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)

 Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
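The `bench.sh` script itself lies outside this hunk. A minimal sketch of such a concurrency sweep, assuming the script accepts vLLM-style serving-benchmark flags (only `--result-filename` is confirmed by the guide; the script path, `--model`, `--max-concurrency`, and `--save-result` are assumptions):

```shell
#!/bin/bash
# Sketch of a bench.sh loop; flag names other than --result-filename are assumed.
for concurrency in 1 8 64 256; do
  python tensorrt_llm/serve/scripts/benchmark_serving.py \
    --model Qwen/Qwen3-Next-80B-A3B-Thinking \
    --max-concurrency ${concurrency} \
    --save-result \
    --result-filename "concurrency_${concurrency}.json"
done
```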

docs/source/models/supported-models.md

Lines changed: 2 additions & 1 deletion
@@ -23,7 +23,7 @@ The following is a table of supported models for the PyTorch backend:
 | `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B` |
 | `Qwen3ForCausalLM` | Qwen3 | `Qwen/Qwen3-8B` |
 | `Qwen3MoeForCausalLM` | Qwen3MoE | `Qwen/Qwen3-30B-A3B` |
-
+| `Qwen3NextForCausalLM` | Qwen3Next | `Qwen/Qwen3-Next-80B-A3B-Thinking` |


 ## Model-Feature Support Matrix(Key Models)

@@ -34,6 +34,7 @@ Note: Support for other models may vary. Features marked "N/A" are not applicable.
 | ------------------------------ | ----------------- | ---------- | -------------------------- | --------------------- | --------------- | --- | ------------------------- | ------------------------- | ------------- | ---------------- | -------------- | ------------------------ | --------------------- | --------------- |
 | DeepseekV3ForCausalLM | Yes | Yes | Yes | Yes | Yes [^1] | Yes | No | No | Yes | Yes | Yes [^2] | N/A | Yes | Yes |
 | Qwen3MoeForCausalLM | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | N/A | Yes | Yes |
+| Qwen3NextForCausalLM | No | Yes | No | No | No | No | No | No | No | No | No | No | No | No |
 | Llama4ForConditionalGeneration | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Untested | N/A | Yes | Yes |
 | GPT-OSS | Yes | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | No | N/A | Yes | Yes |

tests/integration/defs/accuracy/references/gsm8k.yaml

Lines changed: 2 additions & 0 deletions
@@ -123,6 +123,8 @@ Qwen3/Qwen3-235B-A22B:
     quant_algo: NVFP4
     kv_cache_quant_algo: FP8
     accuracy: 85.78
+Qwen3/Qwen3-Next-80B-A3B-Thinking:
+  - accuracy: 81.577
 moonshotai/Kimi-K2-Instruct:
   - quant_algo: FP8_BLOCK_SCALES
     accuracy: 94.84

tests/integration/defs/accuracy/references/mmlu.yaml

Lines changed: 2 additions & 0 deletions
@@ -229,6 +229,8 @@ Qwen3/Qwen3-235B-A22B:
     quant_algo: NVFP4
     kv_cache_quant_algo: FP8
     accuracy: 86
+Qwen3/Qwen3-Next-80B-A3B-Thinking:
+  - accuracy: 86
 moonshotai/Kimi-K2-Instruct:
   - quant_algo: FP8_BLOCK_SCALES
     accuracy: 87.65

tests/integration/defs/accuracy/test_llm_api_pytorch.py

Lines changed: 31 additions & 0 deletions
@@ -3525,6 +3525,37 @@ def test_auto_dtype_tp4(self):
         task.evaluate(llm)


+@pytest.mark.skip_less_device_memory(80000)
+class TestQwen3NextThinking(LlmapiAccuracyTestHarness):
+    MODEL_NAME = "Qwen3/Qwen3-Next-80B-A3B-Thinking"
+    MODEL_PATH = f"{llm_models_root()}/{MODEL_NAME}"
+
+    @skip_pre_hopper
+    @pytest.mark.skip_less_device(4)
+    @pytest.mark.parametrize("tp_size,pp_size,ep_size", [(4, 1, 4)],
+                             ids=["tp4ep4"])
+    def test_auto_dtype(self, tp_size, pp_size, ep_size):
+        if get_device_count() != tp_size * pp_size:
+            pytest.skip("Device count mismatch with world size")
+
+        kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.6,
+                                        enable_block_reuse=False)
+        cuda_graph_config = CudaGraphConfig(enable_padding=True,
+                                            max_batch_size=720)
+
+        with LLM(self.MODEL_PATH,
+                 max_num_tokens=4096,
+                 tensor_parallel_size=tp_size,
+                 pipeline_parallel_size=pp_size,
+                 moe_expert_parallel_size=ep_size,
+                 kv_cache_config=kv_cache_config,
+                 cuda_graph_config=cuda_graph_config) as llm:
+            task = MMLU(self.MODEL_NAME)
+            task.evaluate(llm)
+            task = GSM8K(self.MODEL_NAME)
+            task.evaluate(llm)
+
+
 class TestNano_V2_VLM(LlmapiAccuracyTestHarness):
     MODEL_NAME = "nvidia/Nano-v2-VLM"
     MODEL_PATH = f"{llm_models_root()}/Nano-v2-VLM"
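As a usage note, the new test resolves to the pytest node ID referenced by the L0 test lists below (running from `tests/integration/defs` is an assumption):

```shell
# On a 4-GPU Hopper or Blackwell node, from tests/integration/defs (assumed):
pytest "accuracy/test_llm_api_pytorch.py::TestQwen3NextThinking::test_auto_dtype[tp4ep4]"
```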

tests/integration/test_lists/test-db/l0_dgx_b200.yml

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ l0_dgx_b200:
   - accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[dep4_latency_moe_trtllm-torch_compile=False]
   - accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[dep4_latency_moe_cutlass-torch_compile=False]
   - accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[dep4_latency_moe_cutlass-torch_compile=True]
+  - accuracy/test_llm_api_pytorch.py::TestQwen3NextThinking::test_auto_dtype[tp4ep4]
   - accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_fp4[tp4-cuda_graph=True]
   - accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_fp4[tp8ep8-cuda_graph=True]
   - accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_fp8_chunked_prefill[tp4ep4-cuda_graph=True]

tests/integration/test_lists/test-db/l0_dgx_h100.yml

Lines changed: 5 additions & 2 deletions
@@ -19,6 +19,7 @@ l0_dgx_h100:
   - unittest/_torch/multi_gpu -m "not post_merge" TIMEOUT (90)
   - unittest/_torch/auto_deploy/unit/multigpu
   - unittest/_torch/modeling/test_modeling_pixtral.py::test_tensor_parallelism
+  # ------------- Disaggregated serving tests ---------------
   - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=False-overlap_scheduler=False]
   - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=True-overlap_scheduler=True]
   - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_guided_decoding[xgrammar]

@@ -59,6 +60,7 @@ l0_dgx_h100:
   tests:
   # ------------- PyTorch tests ---------------
   - unittest/llmapi/test_llm_multi_gpu_pytorch.py -m "gpu4"
+  # ------------- Model specific tests ---------------
   - accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16_4gpus[tp4-attn_backend=TRTLLM-torch_compile=False]
   - accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16_4gpus[tp2pp2-attn_backend=TRTLLM-torch_compile=False]
   - accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16_4gpus[tp2pp2-attn_backend=TRTLLM-torch_compile=True]

@@ -69,6 +71,9 @@ l0_dgx_h100:
   - accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_4gpus[tp2pp2-fp8kv=True-attn_backend=TRTLLM-torch_compile=False]
   - accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_4gpus[tp2pp2-fp8kv=True-attn_backend=TRTLLM-torch_compile=True]
   - accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding_4gpus[xgrammar]
+  - test_e2e.py::test_ptp_quickstart_advanced_bs1
+  - test_e2e.py::test_ptp_quickstart_advanced_deepseek_v3_lite_4gpus_adp_balance[DeepSeek-V3-Lite-FP8-DeepSeek-V3-Lite/fp8]
+  # ------------- Disaggregated serving tests ---------------
   - disaggregated/test_disaggregated.py::test_disaggregated_multi_gpu_with_mpirun[TinyLlama-1.1B-Chat-v1.0]
   - disaggregated/test_disaggregated.py::test_disaggregated_multi_gpu_with_mpirun_trt_backend[TinyLlama-1.1B-Chat-v1.0]
   - disaggregated/test_disaggregated.py::test_disaggregated_ctxpp2_genpp2[TinyLlama-1.1B-Chat-v1.0]

@@ -86,8 +91,6 @@ l0_dgx_h100:
   - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_ctx_pp_gen_tp_asymmetric[MMLU-gen_tp=2-ctx_pp=2]
   - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_multi_instance[GSM8K]
   - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_multi_instance[MMLU]
-  - test_e2e.py::test_ptp_quickstart_advanced_bs1
-  - test_e2e.py::test_ptp_quickstart_advanced_deepseek_v3_lite_4gpus_adp_balance[DeepSeek-V3-Lite-FP8-DeepSeek-V3-Lite/fp8]
 - condition:
     ranges:
       system_gpu_count:
