In addition, there are two shared files in the parent folder [`examples`](../../../):
* [`summarize.py`](../../../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset.
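
As a rough illustration (not part of the original steps), `summarize.py` is typically invoked against an already-built engine along the following lines; the model and engine paths are placeholders, and the exact flags may vary between TensorRT-LLM releases:

```bash
# Sketch only: evaluate a built engine on cnn_dailymail summaries.
# Paths are placeholders; flag names follow the usual TensorRT-LLM example conventions.
python ../../../summarize.py --test_trt_llm \
                             --hf_model_dir ./Qwen2.5-7B-Instruct \
                             --data_type fp16 \
                             --engine_dir ./qwen_engine_1gpu
```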

## Support Matrix
| Model Name | FP16/BF16 | FP8 | NVFP4 | WO | AWQ | GPTQ | SQ | TP | PP | EP | Arch |
| :-------------: | :---: | :---: | :-----: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-----: |
| Qwen-1_8B(-Chat) | Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen-7B(-Chat) | Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen-14B(-Chat) | Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen-72B(-Chat) | Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen1.5-0.5B(-Chat)| Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen1.5-1.8B(-Chat)| Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen1.5-4B(-Chat) | Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen1.5-7B(-Chat) | Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen1.5-14B(-Chat) | Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen1.5-32B(-Chat) | Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen1.5-72B(-Chat) | Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen1.5-110B(-Chat)| Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen1.5-MoE-A2.7B(-Chat)| Y | - | - | Y | - | - | - | Y | Y | - | Ampere+ |
| Qwen2-0.5B(-Instruct)| Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen2-1.5B(-Instruct)| Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen2-7B(-Instruct)| Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen2-57B-A14B(-Instruct)| Y | - | - | Y | - | - | - | Y | Y | - | Ampere+ |
| Qwen2-72B(-Instruct)| Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen2.5-0.5B(-Instruct)| Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen2.5-1.5B(-Instruct)| Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen2.5-3B(-Instruct)| Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| Qwen2.5-7B(-Instruct)| Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen2.5-32B(-Instruct)| Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen2.5-72B(-Instruct)| Y | Y | - | Y | Y* | Y | Y | Y | Y | - | Ampere+ |
| QwQ-32B | Y | Y | - | Y | Y | Y | Y | Y | Y | - | Ampere+ |
| Qwen3-32B | Y | Y | Y | - | - | - | - | Y | - | Y | Hopper+ |
| Qwen3-235B-A22B | Y | Y | Y | - | - | - | - | Y | - | Y | Hopper+ |

Please note that the Y* sign means the model does not support all AWQ + TP combinations. In the table, WO refers to weight-only quantization, SQ to SmoothQuant, and TP/PP/EP to tensor, pipeline, and expert parallelism, respectively.


Currently, Qwen1 models do not support dynamic NTK and logn attention. Therefore, accuracy on long-sequence inputs is not guaranteed for the Qwen-7B and Qwen-14B models.

For Qwen3, we list only the largest dense and MoE models; models of other sizes follow similar patterns.
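
As a loose illustration of how the matrix columns map onto the conversion step, the sketch below enables INT8 weight-only quantization (the WO column) together with 2-way tensor parallelism (the TP column). The model path is a placeholder and the flag names follow the usual `convert_checkpoint.py` conventions of this example folder, so treat them as assumptions rather than exact commands:

```bash
# Sketch only: INT8 weight-only (WO) checkpoint conversion with TP=2.
# Paths are placeholders; flag names may vary between releases.
python convert_checkpoint.py --model_dir ./Qwen2.5-7B-Instruct \
                             --output_dir ./tllm_ckpt_wo_int8_tp2 \
                             --dtype float16 \
                             --use_weight_only \
                             --weight_only_precision int8 \
                             --tp_size 2
```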

## Usage

The TensorRT-LLM Qwen example code is located in [examples/models/core/qwen](./). It takes HF weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
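
At a high level, the flow is: convert the HF checkpoint into a TensorRT-LLM checkpoint, build an engine from it, then run inference with the shared scripts. Below is a minimal single-GPU sketch with placeholder paths, based on the usual example layout; see the detailed steps in this README for the exact commands:

```bash
# Sketch of the typical single-GPU BF16 flow; paths are placeholders.
# 1. HF checkpoint -> TensorRT-LLM checkpoint
python convert_checkpoint.py --model_dir ./Qwen2.5-7B-Instruct \
                             --output_dir ./tllm_ckpt_1gpu_bf16 \
                             --dtype bfloat16

# 2. TensorRT-LLM checkpoint -> TensorRT engine (one engine per GPU rank)
trtllm-build --checkpoint_dir ./tllm_ckpt_1gpu_bf16 \
             --output_dir ./qwen_engine_1gpu \
             --gemm_plugin bfloat16

# 3. Run inference with the shared runner script
python ../../../run.py --engine_dir ./qwen_engine_1gpu \
                       --tokenizer_dir ./Qwen2.5-7B-Instruct \
                       --max_output_len 128 \
                       --input_text "Give me a short introduction to Qwen."
```
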
concurrency=128
path_data=./aa_prompt_isl_1k_osl_2k_qwen3_10000samples.txt

# Set up the extra configuration for llm-api
echo -e "disable_overlap_scheduler: false\cuda_graph_config: {}\nprint_iter_log: true\ncuda_graph_batch_sizes: [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128]\nenable_attention_dp: true " > ${path_config}
echo -e "disable_overlap_scheduler: false
print_iter_log: true
cuda_graph_config:
  batch_sizes: [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128]
enable_attention_dp: true " > ${path_config}
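# Rough meaning of the options written above (illustrative comment; the
# semantics are assumptions about the LLM API configuration, not from the original text):
#   disable_overlap_scheduler: false -> keep the overlap scheduler enabled
#   print_iter_log: true             -> print per-iteration statistics
#   cuda_graph_config.batch_sizes    -> batch sizes for which CUDA graphs are captured
#   enable_attention_dp: true        -> run attention with data parallelism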

# Run trtllm-bench with pytorch backend
mpirun --allow-run-as-root --oversubscribe -n 1 \