
All published functionality in the Release Notes has been fully tested and verified, with known limitations documented. To share feedback about this release, visit our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).

## TensorRT-LLM Release 1.1

### Key Features and Enhancements

- **Model Support**
  - Add GPT-OSS model support.
  - Add Hunyuan-Dense model support. Thanks to the contribution from @sorenwu.
  - Add Hunyuan-MoE model support. Thanks to the contribution from @qianbiaoxiang.
  - Add Nemotron Nano VL V2 model support.
  - Add Seed-OSS model support. Thanks to the contribution from @Nekofish-L.

- **Features**
  - **KV Cache & Context:**
    - **Connector API:** Introduced a new KV Cache Connector API for state transfer in disaggregated serving.
    - **Reuse & Offloading:** Enabled KV cache reuse for MLA (Multi-Head Latent Attention) and added examples for host offloading (a configuration sketch follows this list).
    - **Salting:** Implemented KV cache salting for secure cache reuse.
  - **Speculative Decoding:**
    - **Guided Decoding Integration:** Enabled guided decoding to work in conjunction with speculative decoding (including 2-model and draft model chunked prefill).
    - **Eagle:** Added multi-layer Eagle support and optimizations.
  - **Disaggregated Serving:**
    - Added support for Guided Decoding in disaggregated mode.
    - Optimized KV cache transfer for uneven pipeline parallelism.
  - **Performance:**
    - **DeepEP:** Optimized low-precision (FP4) combined kernels and all-to-all communication.
    - **AutoTuner:** Refactored the tuning config and generalized tactic selection for better kernel performance.
    - **CuteDSL:** Integrated CuteDSL NVFP4 grouped GEMM for Blackwell.
  - **Hardware:**
    - **B300/GB300:** Added support for B300/GB300.
- **Benchmark**
  - **New Benchmarks:**
    - **Disaggregated Serving:** Added dedicated performance tests for disaggregated serving scenarios (`test_perf.py`).
    - **Multimodal:** Enabled `benchmark_serving` support for multimodal models.
    - **NIM:** Added specific performance test cases for NIM (NVIDIA Inference Microservices) integration.
  - **Tooling Improvements:**
    - **trtllm-bench:** Added support for sampler options, accurate device iteration timing, and improved data loading for benchmark datasets.
    - **Metrics:** Enhanced reporting to include KV cache size metrics in benchmark results.
    - **Scaffolding:** Added benchmark support for scaffolding examples.
- **Documentation**
  - **Deployment Guides:** Added comprehensive deployment guides for GPT-OSS, DeepSeek-R1, and VDR 1.0.
  - **Feature Documentation:** Created new documentation for KV Cache Connector, LoRA feature usage, and AutoDeploy.
  - **Tech Blogs:** Published blogs on "[Combining Guided Decoding and Speculative Decoding](./blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md)" and "[ADP Balance Strategy](./blogs/tech_blog/blog10_ADP_Balance_Strategy.md)".
  - **Quick Start:** Refined Quick Start guides with new links to ModelOpt checkpoints and updated installation steps (Linux/Windows).
  - **API Reference:** Enhanced LLM API documentation by explicitly labeling stable vs. unstable APIs.
  - **Performance:** Updated online benchmarking documentation and performance overview pages.
  - **Examples:** Refined Slurm examples and added K2 tool calling examples.
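
As an illustration of the KV cache reuse and host offloading items above, the following is a minimal sketch against the LLM API, not an excerpt from the release: the field names `enable_block_reuse` and `host_cache_size` and the model path are assumptions based on the existing LLM API and may differ in your installed version.

```python
# Hedged sketch: enable KV cache block reuse and host (CPU) offloading via the
# LLM API. Field names are assumed from the existing LLM API and may vary
# between versions; the model path is a placeholder.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,      # reuse cached KV blocks across requests
    host_cache_size=4 * 1024**3,  # allow ~4 GiB of KV blocks to spill to host memory
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_config=kv_cache_config,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```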

### Infrastructure Changes

- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.10-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.10-py3`.
- The dependent public PyTorch version is updated to 2.9.0.
- The dependent NVIDIA ModelOpt version is updated to 0.37.
- The dependent xgrammar version is updated to 0.1.25.
- The dependent transformers version is updated to 4.56.0.
- The dependent NIXL version is updated to 0.5.0.

### API Changes

- **Breaking Change**: The C++ TRTLLM sampler is now enabled by default, replacing the legacy implementation. A new `sampler_type` argument has been introduced to `SamplingConfig` to explicitly control sampler selection.
- **KV Cache Connector API:** Introduced a new KV Cache Connector API to facilitate state transfer between Disaggregated Serving workers (Context and Generation phases).
- **LLM API Enhancements:**
  - Added support for `prompt_logprobs` in the PyTorch backend (a usage sketch follows this list).
  - Standardized `topk` logprob returns across TRT and PyTorch backends.
  - Added stable labels to arguments in the `LLM` class to better indicate API stability.
- **Response API:** Added basic functionality for the Responses API to better handle streaming and non-streaming responses.
- **Multimodal Inputs:** Updated the `MultimodalParams` API to support `SharedTensor`, improving memory management for visual language models.
- **Wait and Cancel API:** Added tests and support for handling non-existent and completed request cancellations in the executor.
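
To make the logprob-related items above concrete, here is a rough sketch against the LLM API. It is an illustration under stated assumptions rather than verified signatures: the `logprobs` and `prompt_logprobs` parameter names and the result attributes are inferred from the notes above and may differ in the shipped API.

```python
# Hedged sketch: request top-k and prompt logprobs through SamplingParams.
# Parameter names and result attributes are assumptions; consult the LLM API
# reference for the exact spelling in your installed version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

params = SamplingParams(
    max_tokens=16,
    logprobs=5,         # top-k logprobs for each generated token
    prompt_logprobs=1,  # logprobs for prompt tokens (PyTorch backend)
)

out = llm.generate(["The capital of France is"], params)
print(out[0].outputs[0].logprobs)  # per-token logprobs (assumed attribute)
print(out[0].prompt_logprobs)      # prompt-token logprobs (assumed attribute)
```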

### Fixed Issues

- **DeepSeek-V3/R1:**
  - Fixed potential hangs in DeepSeek-V3 pipelines by adjusting MNNVL configurations.
  - Resolved illegal memory access errors in FP8 Scout and DeepSeek models.
  - Fixed weight loading issues for DeepSeek-R1 W4A8 checkpoints (TP16 scenarios).
- **Llama 4:** Fixed FP4 generation issues and corrected all-reduce operations in the last decoder layer.
- **Mistral/Pixtral:** Fixed a batching bug in Mistral 3.1 where processing multiple requests with images in the same batch caused failures.
- **Qwen:** Fixed Qwen2.5-VL failures related to CUDA graph padding and transformers version compatibility.
- **Gemma:** Fixed out-of-bounds vector access for models with multiple layer types and resolved accuracy issues in Gemma 2.
- **Speculative Decoding:**
  - Fixed race conditions in one-model speculative decoding.
  - Resolved CUDA graph warmup issues that caused failures when using speculative decoding.
  - Fixed KV cache recompute logic in `draft_target` speculative decoding.
- **MoE (Mixture of Experts):**
  - Fixed OOM issues in fused MoE kernels by optimizing workspace pre-allocation.
  - Corrected Cutlass MoE integration to fix accuracy issues on Blackwell hardware.
  - Fixed W4A8 MoE kernel issues on Hopper architecture.
- **General:**
  - Fixed a potential hang caused by Python multiprocessing when prefetching weights.
  - Resolved an issue where `torch.onnx.export` would fail with newer PyTorch versions by correctly falling back to non-dynamo modes.
  - Fixed numerical stability issues for XQA kernels when using speculative decoding.
  - Fixed a memory leak in the `cacheTransceiver` that could lead to hangs in disaggregated serving.

### Known Issues

- **GB300 Multi-Node:** Support for GB300 in multi-node configurations is in beta and was not fully validated in this release; it has been validated in 1.2.0rc4 and later.


## TensorRT-LLM Release 1.0

TensorRT LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and the default experience, and the LLM API is now stable. For more details on new developments in 1.0, please see below.
- Add support for sm121
- Add LoRA support for Gemma3
- Support PyTorch LoRA adapter eviction
- Add LoRA support for PyTorch backend in trtllm-serve
- Add support for scheduling attention DP requests
- Remove padding of FusedMoE in attention DP
- Support torch.compile for attention DP
- Add status tags to LLM API reference
- Support JSON Schema in OpenAI-Compatible API
- Support chunked prefill on spec decode 2 model
- Add KV cache reuse support for multimodal models
- Support nanobind bindings
- Add support for two-model engine KV cache reuse
- Add Eagle-3 support for Qwen3 dense models
- Detokenize option in /v1/completions request
- Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner
- Remove support for llmapi + TRT backend in Triton
- Add request_perf_metrics to triton LLMAPI backend
- Add support for Triton request cancellation

- Benchmark:
- Add the ability to write a request timeline for trtllm-bench
- Add no_kv_cache_reuse option and streaming support for trtllm-serve bench
- Add latency support for trtllm-bench
- Add Acceptance Rate calculation to benchmark_serving
- Add wide-ep benchmarking scripts
- Update trtllm-bench to support new Pytorch default
- Add support for TRTLLM CustomDataset
- Change default LoRA cache sizes and make the peft_cache_config cache size fields take effect when not explicitly set in lora_config
- Remove deprecated LoRA LLM args that are already specified in lora_config
- Add request_perf_metrics to LLMAPI
- Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
- Remove TrtGptModelOptionalParams
- Remove ptuning knobs from TorchLlmArgs


- Fix error in post-merge-tests (#5949)
- Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
- Fix attention DP not working with embedding TP (#5642)
- Fix broken cyclic reference detect (#5417)
- Fix permission for local user issues in NGC docker container. (#5373)
- Fix mtp vanilla draft inputs (#5568)
- Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
- Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
- Fix the issue MoE autotune fallback failed to query default heuristic (#5520)
- Fix the unexpected keyword argument 'streaming' (#5436)

### Known Issues
- When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release; in the meantime, disabling KV cache reuse works around the issue (see the sketch after this list).
- Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release.
- For the Llama 3.x and Llama 4 models, there is an issue with pipeline parallelism when using FP8 and NVFP4 weights. As a workaround, you can set the environment variable `export TRTLLM_LLAMA_EAGER_FUSION_DISABLED=1`.
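
For the first item above, the workaround of disabling KV cache reuse can be expressed roughly as follows when using the Python LLM API; the `enable_block_reuse` field name is an assumption, and the equivalent knob may live in the serving configuration depending on how the servers are launched.

```python
# Hedged sketch of the workaround above: disable KV cache block reuse.
# The `enable_block_reuse` field name is assumed from the LLM API.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_config=KvCacheConfig(enable_block_reuse=False),
)
```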

## TensorRT-LLM Release 0.21.0