From 58195e63e8a0b1a7dfdf233e70f2850a945ae233 Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Tue, 15 Jul 2025 15:44:36 +0800
Subject: [PATCH 1/9] add release notes for 0.21 release

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 66 ++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index d5c239b82e4..ea3737f6131 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -4,6 +4,72 @@
 
 All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).
 
+## TensorRT-LLM Release 0.21.0
+
+### Key Features and Enhancements
+- **Model Support**
+  - Added Gemma3 VLM support
+- **Features**
+  - Added large-scale EP support
+  - Integrated NIXL into the communication layer of the disaggregated service
+  - Added fabric Memory support for KV Cache Transfer
+  - Added MCP in ScaffoldingLLM
+  - Added support for w4a8_mxfp4_fp8 quantization
+  - Added support for fp8 rowwise quantization
+  - Added generation logits support in TRTLLM Sampler
+  - Added log probs support in TRTLLM Sampler
+  - Optimized TRTLLM Sampler perf single beam single step
+  - Enabled Disaggregated serving for Qwen-3
+  - Added EAGLE3 support for Qwen-3
+  - Fused finalize and allreduce for Qwen-MoE model
+  - Refactored Fused MoE module
+  - Added chunked attention kernels
+  - Integrated Hopper chunked attention kernels
+  - Introduced sliding-window attention kernels for the generation phase on Blackwell
+  - Updated DeepSeek FP8 TRT-LLM Gen cubins
+  - Added FP8 block-scale GEMM support on SM89
+  - Enabled overlap scheduler between draft forwards
+  - Added Piecewise cuda graph support for MLA
+  - Added model-agnostic one-engine eagle3
+  - Enabled Finalize + Allreduce + add + rmsnorm fusion
+  - Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
+- Benchmark:
+  - Added all_reduce.py benchmark script for testing
+  - Added beam width to low latency
+  - Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
+  - Enabled trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA
+  - Supported post_proc for bench
+  - Added no_kv_cache_reuse option and streaming support for trtllm serve bench
+
+### Infrastructure Changes
+- The dependent TensorRT version is updated to 10.11.0
+- The dependent public PyTorch version is updated to 2.8.0
+- The dependent NVIDIA ModelOpt version is updated to 0.31.0
+- Upgrade gcc toolset version from 11 to 13
+
+### API Changes
+- Set _AutoDeployLlmArgs as primary config object
+- Removed decoder request from decoder interface
+- Enhanced the torch_compile_config in llm args
+- Removed the redundant use_kv_cache field from PytorchConfig
+- Moved allreduce_strategy from committed api to reference
+
+### Fixed Issues
+- Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
+- Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
+- Fixed cuda graph padding for spec decoding (#4853)
+- Fixed llama 4 long context issue (#4809)
+- Fixed max_num_sequences calculation with overlap scheduling (#4532)
+- Fixed chunked prefill + overlap scheduling (#5761)
+- Fixed trtllm-bench hang issue due to LLM API IPC (#4798)
+- Fixed index out of bounds error in spec decoding (#5954)
+- Fixed MTP illegal memory access in cuda graph warmup (#5947)
+- Fixed no free slots error with spec decode + disagg (#5975)
+
+### Known Issues
+- accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken
+- accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype[mtp_nextn=2-overlap_scheduler=True] is broken
+
 ## TensorRT-LLM Release 0.20.0
 
 ### Key Features and Enhancements

From 0d26915aedcca7088acc42d76f08c06d367c982f Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 12:01:22 +0800
Subject: [PATCH 2/9] address comments

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index ea3737f6131..6969fc933e5 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -23,8 +23,7 @@ All published functionality in the Release Notes has been fully tested and verif
   - Added EAGLE3 support for Qwen-3
   - Fused finalize and allreduce for Qwen-MoE model
   - Refactored Fused MoE module
-  - Added chunked attention kernels
-  - Integrated Hopper chunked attention kernels
+  - Added support for chunked attention on Blackwell
   - Introduced sliding-window attention kernels for the generation phase on Blackwell
   - Updated DeepSeek FP8 TRT-LLM Gen cubins
   - Added FP8 block-scale GEMM support on SM89
@@ -33,9 +32,10 @@ All published functionality in the Release Notes has been fully tested and verif
   - Added model-agnostic one-engine eagle3
   - Enabled Finalize + Allreduce + add + rmsnorm fusion
   - Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
+  - Added support for Eagle3 + disaggregated serving in two model speculative decoding flow
 - Benchmark:
   - Added all_reduce.py benchmark script for testing
-  - Added beam width to low latency
+  - Added beam width to trtllm-bench latency command
   - Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
   - Enabled trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA
   - Supported post_proc for bench
@@ -65,6 +65,7 @@ All published functionality in the Release Notes has been fully tested and verif
 - Fixed index out of bounds error in spec decoding (#5954)
 - Fixed MTP illegal memory access in cuda graph warmup (#5947)
 - Fixed no free slots error with spec decode + disagg (#5975)
+- Fixed one-off attention window size for Gemma3 1B (#5564)
 
 ### Known Issues
 - accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken

From 535016f4abd6a81e2277edfdd734371b79759bda Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 12:10:30 +0800
Subject: [PATCH 3/9] update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index 6969fc933e5..e3b97a96915 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -23,7 +23,7 @@ All published functionality in the Release Notes has been fully tested and verif
   - Added EAGLE3 support for Qwen-3
   - Fused finalize and allreduce for Qwen-MoE model
   - Refactored Fused MoE module
-  - Added support for chunked attention on Blackwell
+  - Added support for chunked MHA on Blackwell and Hopper
   - Introduced sliding-window attention kernels for the generation phase on Blackwell
   - Updated DeepSeek FP8 TRT-LLM Gen cubins
   - Added FP8 block-scale GEMM support on SM89

From c472ca82e786ce057e237d008a4655001be3ec6c Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 12:11:43 +0800
Subject: [PATCH 4/9] update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index e3b97a96915..fc498a511a6 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -23,7 +23,7 @@ All published functionality in the Release Notes has been fully tested and verif
   - Added EAGLE3 support for Qwen-3
   - Fused finalize and allreduce for Qwen-MoE model
   - Refactored Fused MoE module
-  - Added support for chunked MHA on Blackwell and Hopper
+  - Added support for chunked attention on Blackwell and Hopper
   - Introduced sliding-window attention kernels for the generation phase on Blackwell
   - Updated DeepSeek FP8 TRT-LLM Gen cubins
   - Added FP8 block-scale GEMM support on SM89

From b796ad9e7d45a538d89e1fdec7aa27c9f3ec7fd5 Mon Sep 17 00:00:00 2001
From: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
Date: Tue, 15 Jul 2025 21:34:28 -0700
Subject: [PATCH 5/9] Update release-notes.md

Signed-off-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
---
 docs/source/release-notes.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index fc498a511a6..0c62529fe55 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -70,6 +70,7 @@ All published functionality in the Release Notes has been fully tested and verif
 ### Known Issues
 - accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken
 - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype[mtp_nextn=2-overlap_scheduler=True] is broken
+- Bidirectional attention mask support for image tokens in Gemma3 VLMs is missing. This feature is available on main though: #5976
 
 ## TensorRT-LLM Release 0.20.0
 

From 4209bb69d3402347a77fdb19364ed9dde0f2f2eb Mon Sep 17 00:00:00 2001
From: QI JUN <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 15:12:51 +0800
Subject: [PATCH 6/9] Update docs/source/release-notes.md

Co-authored-by: Yanchao Lu
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index 0c62529fe55..85076d0eaae 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -42,10 +42,12 @@ All published functionality in the Release Notes has been fully tested and verif
   - Added no_kv_cache_reuse option and streaming support for trtllm serve bench
 
 ### Infrastructure Changes
-- The dependent TensorRT version is updated to 10.11.0
-- The dependent public PyTorch version is updated to 2.8.0
-- The dependent NVIDIA ModelOpt version is updated to 0.31.0
-- Upgrade gcc toolset version from 11 to 13
+- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.05-py3`.
+- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.05-py3`.
+- The dependent public PyTorch version is updated to 2.7.1.
+- The dependent TensorRT version is updated to 10.11.
+- The dependent NVIDIA ModelOpt version is updated to 0.31.
+- The dependent NCCL version is updated to 2.27.5.
 
 ### API Changes
 - Set _AutoDeployLlmArgs as primary config object

From 52beb0a7c2c5ae9d9da1494e7fa4570720243ef4 Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 15:33:26 +0800
Subject: [PATCH 7/9] update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index 85076d0eaae..bbba4a26784 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -70,9 +70,8 @@ All published functionality in the Release Notes has been fully tested and verif
 - Fixed one-off attention window size for Gemma3 1B (#5564)
 
 ### Known Issues
-- accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken
-- accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype[mtp_nextn=2-overlap_scheduler=True] is broken
-- Bidirectional attention mask support for image tokens in Gemma3 VLMs is missing. This feature is available on main though: #5976
+- accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken.
+- Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.
 
 ## TensorRT-LLM Release 0.20.0
 

From 0c5411ed18627521292b4a2b710a9b8aa5ae17cc Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 16:29:22 +0800
Subject: [PATCH 8/9] update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index bbba4a26784..aca10159a52 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -25,7 +25,7 @@ All published functionality in the Release Notes has been fully tested and verif
   - Refactored Fused MoE module
   - Added support for chunked attention on Blackwell and Hopper
   - Introduced sliding-window attention kernels for the generation phase on Blackwell
-  - Updated DeepSeek FP8 TRT-LLM Gen cubins
+  - Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance in large batch size scenarios
   - Added FP8 block-scale GEMM support on SM89
   - Enabled overlap scheduler between draft forwards
   - Added Piecewise cuda graph support for MLA

From 7d9f8c990925eaa99d863c264de00c0c3cef259d Mon Sep 17 00:00:00 2001
From: junq <22017000+QiJune@users.noreply.github.com>
Date: Wed, 16 Jul 2025 16:34:43 +0800
Subject: [PATCH 9/9] update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 docs/source/release-notes.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
index aca10159a52..dee84ecfde5 100644
--- a/docs/source/release-notes.md
+++ b/docs/source/release-notes.md
@@ -33,6 +33,7 @@ All published functionality in the Release Notes has been fully tested and verif
   - Enabled Finalize + Allreduce + add + rmsnorm fusion
   - Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
   - Added support for Eagle3 + disaggregated serving in two model speculative decoding flow
+  - Validated Llama 3.1 models on H200 NVL
 - Benchmark:
   - Added all_reduce.py benchmark script for testing
   - Added beam width to trtllm-bench latency command