5 changes: 3 additions & 2 deletions docs/source/features/quantization.md
@@ -96,12 +96,13 @@ The language component decides which quantization methods are supported by a given
| Model | NVFP4 | MXFP4 | FP8(per tensor)| FP8(block scaling) | FP8(rowwise) | FP8 KV Cache |W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
| :------------- | :---: | :---: | :---: | :---: | :---: | :---: | :-------: | :-------: | :--------: | :--------: |
| Blackwell(sm120) | Y | Y | Y | . | . | Y | . | . | . | . |
-| Blackwell(sm100) | Y | Y | Y | Y | . | Y | . | . | . | . |
+| Blackwell(sm100/103) | Y | Y | Y | Y | . | Y | . | . | . | . |
| Hopper | . | . | Y | Y | Y | Y | Y | Y | Y | Y |
| Ada Lovelace | . | . | Y | . | . | Y | Y | Y | Y | Y |
| Ampere | . | . | . | . | . | Y | . | Y | . | Y |

```{note}
-FP8 block wise scaling GEMM kernels for sm100 are using MXFP8 recipe (E4M3 act/weight and UE8M0 act/weight scale), which is slightly different from SM90 FP8 recipe (E4M3 act/weight and FP32 act/weight scale).
+FP8 block wise scaling GEMM kernels for sm100/103 are using MXFP8 recipe (E4M3 act/weight and UE8M0 act/weight scale), which is slightly different from SM90 FP8 recipe (E4M3 act/weight and FP32 act/weight scale).
```
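
For intuition, the two recipes differ only in how the per-block scale is stored: an FP32 number in the SM90 recipe versus a UE8M0 value (an 8-bit power-of-two exponent) in the MXFP8 recipe. A minimal NumPy sketch of that difference, assuming a block size of 128 and round-up-to-power-of-two scaling — both illustrative choices, not the kernels' exact behavior:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def block_scales(x, block=128, power_of_two=False):
    """Compute per-block scales for E4M3 quantization.

    power_of_two=False -> FP32 scales, as in the SM90 FP8 recipe.
    power_of_two=True  -> scales rounded to a power of two so each fits in a
                          UE8M0 byte, as in the MXFP8 recipe on sm100/103.
    """
    amax = np.abs(x.reshape(-1, block)).max(axis=1)
    scale = amax / E4M3_MAX
    if power_of_two:
        # Round up so the scaled values still fit within E4M3's range.
        scale = 2.0 ** np.ceil(np.log2(scale))
    return scale

x = np.random.randn(1024).astype(np.float32)
print("FP32 scales :", block_scales(x)[:4])
print("UE8M0 scales:", block_scales(x, power_of_two=True)[:4])
```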


3 changes: 2 additions & 1 deletion docs/source/legacy/reference/support-matrix.md
@@ -132,6 +132,7 @@ In addition, older architectures can have limitations for newer software releases.
- TensorRT-LLM requires Linux x86_64 or Linux aarch64.
* - GPU Model Architectures
-
+- [NVIDIA GB300 NVL72](https://www.nvidia.com/en-us/data-center/gb300-nvl72/)
- [NVIDIA GB200 NVL72](https://www.nvidia.com/en-us/data-center/gb200-nvl72/)
- [NVIDIA Blackwell Architecture](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)
- [NVIDIA Grace Hopper Superchip](https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/)
@@ -157,7 +158,7 @@ The following table shows the supported software for TensorRT-LLM.
- [10.13](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html)
* - Precision
-
-- Blackwell (SM100/SM120) - FP32, FP16, BF16, FP8, FP4, INT8, INT4
+- Blackwell (SM100/SM103/SM120) - FP32, FP16, BF16, FP8, FP4, INT8, INT4
- Hopper (SM90) - FP32, FP16, BF16, FP8, INT8, INT4
- Ada Lovelace (SM89) - FP32, FP16, BF16, FP8, INT8, INT4
- Ampere (SM80, SM86) - FP32, FP16, BF16, INT8, INT4[^smgte89]
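Which row of the precision list applies to a given GPU can be checked at runtime from its compute capability. A small PyTorch sketch — the SM-to-precision mapping in the comments is read off the table above, not queried from the driver:

```python
import torch

# get_device_capability() returns (major, minor), e.g. (9, 0) for SM90,
# (10, 0) for SM100, (10, 3) for SM103, (12, 0) for SM120.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: SM{major}{minor}")
    if (major, minor) >= (8, 9):   # Ada Lovelace and newer
        print("FP8 is listed as supported for this architecture")
    if major >= 10:                # Blackwell (SM100/SM103/SM120)
        print("FP4 is listed as supported for this architecture")
else:
    print("No CUDA device detected")
```
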
4 changes: 2 additions & 2 deletions docs/source/models/supported-models.md
@@ -37,8 +37,8 @@ Note: Support for other models may vary. Features marked "N/A" are not applicable
| Llama4ForConditionalGeneration | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Untested | N/A | Yes | Yes |
| GPT-OSS | Yes | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | No | N/A | Yes | Yes |

-[^1]: Chunked Prefill for MLA can only be enabled on SM100.
-[^2]: KV cache reuse for MLA can only be enabled on SM90/SM100 and in BF16/FP8 KV cache dtype.
+[^1]: Chunked Prefill for MLA can only be enabled on SM100/SM103.
+[^2]: KV cache reuse for MLA can only be enabled on SM90/SM100/SM103 and in BF16/FP8 KV cache dtype.
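
For footnote [^2], KV cache reuse and the KV cache dtype are configured through the LLM API. A sketch only: the `KvCacheConfig` fields and the DeepSeek-V3 checkpoint name below are assumptions to verify against the installed TensorRT-LLM version.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Enable KV cache block reuse with an FP8 KV cache — the combination
# footnote [^2] above restricts to SM90/SM100/SM103.
# Field names are assumptions; check them against your installed version.
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,  # required for KV cache reuse with MLA
    dtype="fp8",              # BF16 or FP8 KV cache dtype per footnote [^2]
)

# DeepSeek-V3 is used here only as an example of an MLA model.
llm = LLM(model="deepseek-ai/DeepSeek-V3", kv_cache_config=kv_cache_config)
```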


# Multimodal Feature Support Matrix (PyTorch Backend)
4 changes: 2 additions & 2 deletions docs/source/overview.md
@@ -49,8 +49,8 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
### 🔧 **Latest GPU Architecture Support**

TensorRT LLM supports the full spectrum of NVIDIA GPU architectures:
-- **NVIDIA Blackwell**: B200, GB200, RTX Pro 6000 SE with FP4 optimization
-- **NVIDIA Hopper**: H100, H200,GH200 with FP8 acceleration
+- **NVIDIA Blackwell**: B200, B300, GB200, GB300, RTX Pro 6000 SE with FP4 optimization
+- **NVIDIA Hopper**: H100, H200, GH200 with FP8 acceleration
- **NVIDIA Ada Lovelace**: L40/L40S, RTX 40 series with FP8 acceleration
- **NVIDIA Ampere**: A100, RTX 30 series for production workloads
