From 58195e63e8a0b1a7dfdf233e70f2850a945ae233 Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Tue, 15 Jul 2025 15:44:36 +0800
Subject: [PATCH 1/9] add release notes for 0.21 release

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 66 ++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index d5c239b82e4..ea3737f6131 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -4,6 +4,72 @@
 
 All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).
 
+## TensorRT-LLM Release 0.21.0
+
+### Key Features and Enhancements
+- **Model Support**
+  - Added Gemma3 VLM support
+- **Features**
+  - Added large-scale EP support
+  - Integrated NIXL into the communication layer of the disaggregated service
+  - Added fabric Memory support for KV Cache Transfer
+  - Added MCP in ScaffoldingLLM
+  - Added support for w4a8_mxfp4_fp8 quantization
+  - Added support for fp8 rowwise quantization
+  - Added generation logits support in TRTLLM Sampler
+  - Added log probs support in TRTLLM Sampler
+  - Optimized TRTLLM Sampler perf single beam single step
+  - Enabled Disaggregated serving for Qwen-3
+  - Added EAGLE3 support for Qwen-3
+  - Fused finalize and allreduce for Qwen-MoE model
+  - Refactored Fused MoE module
+  - Added chunked attention kernels
+  - Integrated Hopper chunked attention kernels
+  - Introduced sliding-window attention kernels for the generation phase on Blackwell
+  - Updated DeepSeek FP8 TRT-LLM Gen cubins
+  - Added FP8 block-scale GEMM support on SM89
+  - Enabled overlap scheduler between draft forwards
+  - Added Piecewise cuda graph support for MLA
+  - Added model-agnostic one-engine eagle3
+  - Enabled Finalize + Allreduce + add + rmsnorm fusion
+  - Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
+- Benchmark:
+  - Added all_reduce.py benchmark script for testing
+  - Added beam width to low latency
+  - Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
+  - Enabled trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA
+  - Supported post_proc for bench
+  - Added no_kv_cache_reuse option and streaming support for trtllm serve bench
+
+### Infrastructure Changes
+- The dependent TensorRT version is updated to 10.11.0
+- The dependent public PyTorch version is updated to 2.8.0
+- The dependent NVIDIA ModelOpt version is updated to 0.31.0
+- Upgrade gcc toolset version from 11 to 13
+
+### API Changes
+- Set _AutoDeployLlmArgs as primary config object
+- Removed decoder request from decoder interface
+- Enhanced the torch_compile_config in llm args
+- Removed the redundant use_kv_cache field from PytorchConfig
+- Moved allreduce_strategy from committed api to reference
+
+### Fixed Issues
+- Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
+- Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
+- Fixed cuda graph padding for spec decoding (#4853)
+- Fixed llama 4 long context issue (#4809)
+- Fixed max_num_sequences calculation with overlap scheduling (#4532)
+- Fixed chunked prefill + overlap scheduling (#5761)
+- Fixed trtllm-bench hang issue due to LLM API IPC (#4798)
+- Fixed index out of bounds error in spec decoding (#5954)
+- Fixed MTP illegal memory access in cuda graph warmup (#5947)
+- Fixed no free slots error with spec decode + disagg (#5975)
+
+### Known Issues
+- accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken
+- accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype[mtp_nextn=2-overlap_scheduler=True] is broken
+
 ## TensorRT-LLM Release 0.20.0
 
 ### Key Features and Enhancements

From 0d26915aedcca7088acc42d76f08c06d367c982f Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 12:01:22 +0800
Subject: [PATCH 2/9] address comments

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index ea3737f6131..6969fc933e5 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -23,8 +23,7 @@ All published functionality in the Release Notes has been fully tested and verif
   - Added EAGLE3 support for Qwen-3
   - Fused finalize and allreduce for Qwen-MoE model
   - Refactored Fused MoE module
-  - Added chunked attention kernels
-  - Integrated Hopper chunked attention kernels
+  - Added support for chunked attention on Blackwell
   - Introduced sliding-window attention kernels for the generation phase on Blackwell
   - Updated DeepSeek FP8 TRT-LLM Gen cubins
   - Added FP8 block-scale GEMM support on SM89
@@ -33,9 +32,10 @@ All published functionality in the Release Notes has been fully tested and verif
   - Added model-agnostic one-engine eagle3
   - Enabled Finalize + Allreduce + add + rmsnorm fusion
   - Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
+  - Added support for Eagle3 + disaggregated serving in two model speculative decoding flow
 - Benchmark:
   - Added all_reduce.py benchmark script for testing
-  - Added beam width to low latency
+  - Added beam width to trtllm-bench latency command
   - Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
   - Enabled trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA
   - Supported post_proc for bench
@@ -65,6 +65,7 @@ All published functionality in the Release Notes has been fully tested and verif
 - Fixed index out of bounds error in spec decoding (#5954)
 - Fixed MTP illegal memory access in cuda graph warmup (#5947)
 - Fixed no free slots error with spec decode + disagg (#5975)
+- Fixed one-off attention window size for Gemma3 1B (#5564)
 
 ### Known Issues
 - accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken

From 535016f4abd6a81e2277edfdd734371b79759bda Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 12:10:30 +0800
Subject: [PATCH 3/9] update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index 6969fc933e5..e3b97a96915 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -23,7 +23,7 @@ All published functionality in the Release Notes has been fully tested and verif
   - Added EAGLE3 support for Qwen-3
   - Fused finalize and allreduce for Qwen-MoE model
   - Refactored Fused MoE module
-  - Added support for chunked attention on Blackwell
+  - Added support for chunked MHA on Blackwell and Hopper
   - Introduced sliding-window attention kernels for the generation phase on Blackwell
   - Updated DeepSeek FP8 TRT-LLM Gen cubins
   - Added FP8 block-scale GEMM support on SM89

From c472ca82e786ce057e237d008a4655001be3ec6c Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 12:11:43 +0800
Subject: [PATCH 4/9] update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index e3b97a96915..fc498a511a6 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -23,7 +23,7 @@ All published functionality in the Release Notes has been fully tested and verif
   - Added EAGLE3 support for Qwen-3
   - Fused finalize and allreduce for Qwen-MoE model
   - Refactored Fused MoE module
-  - Added support for chunked MHA on Blackwell and Hopper
+  - Added support for chunked attention on Blackwell and Hopper
   - Introduced sliding-window attention kernels for the generation phase on Blackwell
   - Updated DeepSeek FP8 TRT-LLM Gen cubins
   - Added FP8 block-scale GEMM support on SM89

From b796ad9e7d45a538d89e1fdec7aa27c9f3ec7fd5 Mon Sep 17 00:00:00 2001
From: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
Date: Tue, 15 Jul 2025 21:34:28 -0700
Subject: [PATCH 5/9] Update release-notes.md

Signed-off-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
---
 docs/source/release-notes.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index fc498a511a6..0c62529fe55 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -70,6 +70,7 @@ All published functionality in the Release Notes has been fully tested and verif
 ### Known Issues
 - accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken
 - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype[mtp_nextn=2-overlap_scheduler=True] is broken
+- Bidirectional attention mask support for image tokens in Gemma3 VLMs is missing. This feature is available on main though: #5976
 
 ## TensorRT-LLM Release 0.20.0
 

From 4209bb69d3402347a77fdb19364ed9dde0f2f2eb Mon Sep 17 00:00:00 2001
From: QI JUN <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 15:12:51 +0800
Subject: [PATCH 6/9] Update docs/source/release-notes.md

Co-authored-by: Yanchao Lu
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index 0c62529fe55..85076d0eaae 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -42,10 +42,12 @@ All published functionality in the Release Notes has been fully tested and verif
   - Added no_kv_cache_reuse option and streaming support for trtllm serve bench
 
 ### Infrastructure Changes
-- The dependent TensorRT version is updated to 10.11.0
-- The dependent public PyTorch version is updated to 2.8.0
-- The dependent NVIDIA ModelOpt version is updated to 0.31.0
-- Upgrade gcc toolset version from 11 to 13
+- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.05-py3`.
+- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.05-py3`.
+- The dependent public PyTorch version is updated to 2.7.1.
+- The dependent TensorRT version is updated to 10.11.
+- The dependent NVIDIA ModelOpt version is updated to 0.31.
+- The dependent NCCL version is updated to 2.27.5.
 
 ### API Changes
 - Set _AutoDeployLlmArgs as primary config object

From 52beb0a7c2c5ae9d9da1494e7fa4570720243ef4 Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 15:33:26 +0800
Subject: [PATCH 7/9] update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index 85076d0eaae..bbba4a26784 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -70,9 +70,8 @@ All published functionality in the Release Notes has been fully tested and verif
 - Fixed one-off attention window size for Gemma3 1B (#5564)
 
 ### Known Issues
-- accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken
-- accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype[mtp_nextn=2-overlap_scheduler=True] is broken
-- Bidirectional attention mask support for image tokens in Gemma3 VLMs is missing. This feature is available on main though: #5976
+- accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken.
+- Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.
 
 ## TensorRT-LLM Release 0.20.0
 

From 0c5411ed18627521292b4a2b710a9b8aa5ae17cc Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 16:29:22 +0800
Subject: [PATCH 8/9] update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index bbba4a26784..aca10159a52 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -25,7 +25,7 @@ All published functionality in the Release Notes has been fully tested and verif
   - Refactored Fused MoE module
   - Added support for chunked attention on Blackwell and Hopper
   - Introduced sliding-window attention kernels for the generation phase on Blackwell
-  - Updated DeepSeek FP8 TRT-LLM Gen cubins
+  - Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance in large batch size scenarios
   - Added FP8 block-scale GEMM support on SM89
   - Enabled overlap scheduler between draft forwards
   - Added Piecewise cuda graph support for MLA

From 7d9f8c990925eaa99d863c264de00c0c3cef259d Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 16:34:43 +0800
Subject: [PATCH 9/9] update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index aca10159a52..dee84ecfde5 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -33,6 +33,7 @@ All published functionality in the Release Notes has been fully tested and verif
   - Enabled Finalize + Allreduce + add + rmsnorm fusion
   - Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
   - Added support for Eagle3 + disaggregated serving in two model speculative decoding flow
+  - Validated Llama 3.1 models on H200 NVL
 - Benchmark:
   - Added all_reduce.py benchmark script for testing
   - Added beam width to trtllm-bench latency command