From e8a240e3dd2aaac62c717e6398726d4d5b05fb49 Mon Sep 17 00:00:00 2001 From: junq <22017000+QiJune@users.noreply.github.com> Date: Fri, 5 Dec 2025 18:15:53 +0800 Subject: [PATCH 1/8] update release notes for release/1.1 Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --- docs/source/release-notes.md | 97 ++++++++++++++++++++++++++++++++++++ 1 file changed, 97 insertions(+) diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md index 6fb91777eea..e817a8bb6a1 100644 --- a/docs/source/release-notes.md +++ b/docs/source/release-notes.md @@ -4,6 +4,103 @@ All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/). +## TensorRT-LLM Release 1.1 + +### Key Features and Enhancements + +- **Model Support** + - Add GPT-OSS model support + - Add Hunyuan-Dense model support + - Add Hunyuan-MoE model support + - Add Nemotron Nano VL V2 model support + - Add Seed-OSS model support +- **Features** + - **KV Cache & Context:** + - **Connector API:** Introduced a new KV Cache Connector API for state transfer in disaggregated serving. + - **Reuse & Offloading:** Enabled KV cache reuse for MLA (Multi-Head Latent Attention) and added examples for host offloading. + - **Salting:** Implemented KV cache salting for secure cache reuse. + - **Speculative Decoding:** + - **Guided Decoding Integration:** Enabled guided decoding to work in conjunction with speculative decoding (including 2-model and draft model chunked prefill). + - **Eagle:** Added multi-layer Eagle support and optimizations. + - **Disaggregated Serving:** + - Added support for Guided Decoding in disaggregated mode. + - Optimized KV cache transfer for uneven pipeline parallelism. + - **Performance:** + - **DeepEP:** Optimized low-precision (FP4) combined kernels and all-to-all communication. + - **AutoTuner:** Refactored tuning config and generalized tactic selection for better kernel performance. + - **CuteDSL:** Integrated CuteDSL NVFP4 grouped GEMM for Blackwell. +- **Benchmark** + - **New Benchmarks:** + - **Disaggregated Serving:** Added dedicated performance tests for disaggregated serving scenarios (`test_perf.py`). + - **Multimodal:** Enabled `benchmark_serving` support for multimodal models. + - **NIM:** Added specific performance test cases for NIM (NVIDIA Inference Microservices) integration. + - **Tooling Improvements:** + - **trtllm-bench:** Added support for sampler options, accurate device iteration timing, and improved data loading for benchmark datasets. + - **Metrics:** Enhanced reporting to include KV cache size metrics in benchmark results. + - **Scaffolding:** Added benchmark support for scaffolding examples. +- **Documentation** + - **Deployment Guides:** Added comprehensive deployment guides for GPT-OSS, DeepSeek-R1, and VDR 1.0. + - **Feature Documentation:** Created new documentation for KV Cache Connector, LoRA feature usage, and AutoDeploy. + - **Tech Blogs:** Published blogs on "Combining Guided Decoding and Speculative Decoding" and "ADP Balance Strategy". + - **Hardware:** Updated documentation to reflect B300/GB300 support. + - **Quick Start:** Refined Quick Start guides with new links to ModelOpt checkpoints and updated installation steps (Linux/Windows). + - **API Reference:** Enhanced LLM API documentation by explicitly labeling stable vs. unstable APIs. 
+ - **Performance:** Updated online benchmarking documentation and performance overview pages. + - **Examples:** Refined Slurm examples and added K2 tool calling examples. + +### Infrastructure Changes + +- The base Docker image for TensorRT-LLM is updated to \`nvcr.io/nvidia/pytorch:25.10-py3\`. +- The base Docker image for TensorRT-LLM Backend is updated to \`nvcr.io/nvidia/tritonserver:25.10-py3\`. +- The dependent public PyTorch version is updated to 2.9.0. +- The dependent NVIDIA ModelOpt version is updated to 0.37. +- The dependent xgrammar version is updated to 0.1.25. +- The dependent transformers version is updated to 4.56.0. + +### API Changes + +- **[Breaking Change**: The C++ TRTLLM sampler is now enabled by default, replacing the legacy implementation. A new `sampler_type` argument has been introduced to `SamplingConfig` to explicitly control sampler selection. +- **KV Cache Connector API:** Introduced a new KV Cache Connector API to facilitate state transfer between Disaggregated Serving workers (Context and Generation phases). +- **LLM API Enhancements:** + - Added support for `prompt_logprobs` in the PyTorch backend. + - Standardized `topk` logprob returns across TRT and PyTorch backends. + - Added stable labels to arguments in the `LLM` class to better indicate API stability. +- **Response API:** Added basic functionality for the Responses API to better handle streaming and non-streaming responses. +- **Multimodal Inputs:** Updated the `MultimodalParams` API to support `SharedTensor`, improving memory management for visual language models. +- **Wait and Cancel API:** Added tests and support for handling non-existent and completed request cancellations in the executor. + +### Fixed Issues + +- **DeepSeek-V3/R1:** + - Fixed potential hangs in DeepSeek-V3 pipelines by adjusting MNNVL configurations. + - Resolved illegal memory access errors in FP8 Scout and DeepSeek models. + - Fixed weight loading issues for DeepSeek-R1 W4A8 checkpoints (TP16 scenarios). +- **Llama 4:** Fixed FP4 generation issues and corrected all-reduce operations in the last decoder layer. +- **Mistral/Pixtral:** Fixed a batching bug in Mistral 3.1 where processing multiple requests with images in the same batch caused failures. +- **Qwen:** Fixed Qwen2.5-VL failures related to CUDA graph padding and transformers version compatibility. +- **Gemma:** Fixed out-of-bounds vector access for models with multiple layer types and resolved accuracy issues in Gemma 2\. +- **Speculative Decoding:** + - Fixed race conditions in one-model speculative decoding. + - Resolved CUDA graph warmup issues that caused failures when using speculative decoding. + - Fixed KV cache recompute logic in `draft_target` speculative decoding. +- **MoE (Mixture of Experts):** + - Fixed OOM issues in fused MoE kernels by optimizing workspace pre-allocation. + - Corrected Cutlass MoE integration to fix accuracy issues on Blackwell hardware. + - Fixed W4A8 MoE kernel issues on Hopper architecture. +- **General:** + - Fixed a potential hang caused by Python multiprocessing when prefetching weights. + - Resolved an issue where `torch.onnx.export` would fail with newer PyTorch versions by correctly falling back to non-dynamo modes. + - Fixed numerical stability issues for XQA kernels when using speculative decoding. + - Fixed a memory leak in the `cacheTransceiver` that could lead to hangs in disaggregated serving. 
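The sampler and logprob changes listed under "API Changes" above surface through the LLM API. A minimal sketch follows, assuming `logprobs` and `prompt_logprobs` are the `SamplingParams` argument names described in those notes; verify the exact names and the placement of `sampler_type` against the 1.1 API reference.

```python
# Illustrative sketch only: exercises the 1.1 sampler/logprob notes above.
# Argument names (`logprobs`, `prompt_logprobs`) are assumed from the release
# notes; the model path is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder checkpoint

params = SamplingParams(
    max_tokens=32,
    logprobs=2,          # standardized top-k logprob returns
    prompt_logprobs=1,   # prompt logprob support added in the PyTorch backend
)
# Sampler selection is now explicit; per the breaking-change note above, a
# `sampler_type` argument controls which sampler is used (see the API docs
# for its exact location and accepted values).

for output in llm.generate(["The capital of France is"], sampling_params=params):
    print(output.outputs[0].text)
```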
+ +### Know Issues + +- **Llama Pipeline Parallelism:** There are known stability issues when running Llama models with Pipeline Parallelism (PP) enabled in specific configurations. +- **GB300 Multi-Node:** Support for GB300 in multi-node configurations is currently in beta and not fully validated in this release. +- **Context Chunking:** In certain disaggregated serving configurations with specific chunk sizes, context chunking may exhibit performance degradation or instability. +- **Triton Backend:** There are known limitations when building Triton Docker images with specific combinations of dependencies; users should adhere strictly to the support matrix versions. + + ## TensorRT-LLM Release 1.0 TensorRT LLM 1.0 brings 2 major changes: the PyTorch-based architecture is now stable and the default experience, and the LLM API is now stable. For more details on new developments in 1.0, please see below. From 4dbbc921d5bdf7b242bb497ba421dea0c73eb8be Mon Sep 17 00:00:00 2001 From: junq <22017000+QiJune@users.noreply.github.com> Date: Fri, 5 Dec 2025 18:20:34 +0800 Subject: [PATCH 2/8] update release notes for release/1.1 Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --- docs/source/release-notes.md | 169 +++++++++++++++++------------------ 1 file changed, 84 insertions(+), 85 deletions(-) diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md index e817a8bb6a1..3d1207a0a92 100644 --- a/docs/source/release-notes.md +++ b/docs/source/release-notes.md @@ -8,97 +8,96 @@ All published functionality in the Release Notes has been fully tested and verif ### Key Features and Enhancements -- **Model Support** - - Add GPT-OSS model support - - Add Hunyuan-Dense model support - - Add Hunyuan-MoE model support - - Add Nemotron Nano VL V2 model support - - Add Seed-OSS model support -- **Features** - - **KV Cache & Context:** - - **Connector API:** Introduced a new KV Cache Connector API for state transfer in disaggregated serving. - - **Reuse & Offloading:** Enabled KV cache reuse for MLA (Multi-Head Latent Attention) and added examples for host offloading. - - **Salting:** Implemented KV cache salting for secure cache reuse. - - **Speculative Decoding:** - - **Guided Decoding Integration:** Enabled guided decoding to work in conjunction with speculative decoding (including 2-model and draft model chunked prefill). - - **Eagle:** Added multi-layer Eagle support and optimizations. - - **Disaggregated Serving:** - - Added support for Guided Decoding in disaggregated mode. - - Optimized KV cache transfer for uneven pipeline parallelism. - - **Performance:** - - **DeepEP:** Optimized low-precision (FP4) combined kernels and all-to-all communication. - - **AutoTuner:** Refactored tuning config and generalized tactic selection for better kernel performance. - - **CuteDSL:** Integrated CuteDSL NVFP4 grouped GEMM for Blackwell. -- **Benchmark** - - **New Benchmarks:** - - **Disaggregated Serving:** Added dedicated performance tests for disaggregated serving scenarios (`test_perf.py`). - - **Multimodal:** Enabled `benchmark_serving` support for multimodal models. - - **NIM:** Added specific performance test cases for NIM (NVIDIA Inference Microservices) integration. - - **Tooling Improvements:** - - **trtllm-bench:** Added support for sampler options, accurate device iteration timing, and improved data loading for benchmark datasets. - - **Metrics:** Enhanced reporting to include KV cache size metrics in benchmark results. 
- - **Scaffolding:** Added benchmark support for scaffolding examples. -- **Documentation** - - **Deployment Guides:** Added comprehensive deployment guides for GPT-OSS, DeepSeek-R1, and VDR 1.0. - - **Feature Documentation:** Created new documentation for KV Cache Connector, LoRA feature usage, and AutoDeploy. - - **Tech Blogs:** Published blogs on "Combining Guided Decoding and Speculative Decoding" and "ADP Balance Strategy". - - **Hardware:** Updated documentation to reflect B300/GB300 support. - - **Quick Start:** Refined Quick Start guides with new links to ModelOpt checkpoints and updated installation steps (Linux/Windows). - - **API Reference:** Enhanced LLM API documentation by explicitly labeling stable vs. unstable APIs. - - **Performance:** Updated online benchmarking documentation and performance overview pages. +- **Model Support** + - Add GPT-OSS model support + - Add Hunyuan-Dense model support + - Add Hunyuan-MoE model support + - Add Nemotron Nano VL V2 model support + - Add Seed-OSS model support +- **Features** + - **KV Cache & Context:** + - **Connector API:** Introduced a new KV Cache Connector API for state transfer in disaggregated serving. + - **Reuse & Offloading:** Enabled KV cache reuse for MLA (Multi-Head Latent Attention) and added examples for host offloading. + - **Salting:** Implemented KV cache salting for secure cache reuse. + - **Speculative Decoding:** + - **Guided Decoding Integration:** Enabled guided decoding to work in conjunction with speculative decoding (including 2-model and draft model chunked prefill). + - **Eagle:** Added multi-layer Eagle support and optimizations. + - **Disaggregated Serving:** + - Added support for Guided Decoding in disaggregated mode. + - Optimized KV cache transfer for uneven pipeline parallelism. + - **Performance:** + - **DeepEP:** Optimized low-precision (FP4) combined kernels and all-to-all communication. + - **AutoTuner:** Refactored tuning config and generalized tactic selection for better kernel performance. + - **CuteDSL:** Integrated CuteDSL NVFP4 grouped GEMM for Blackwell. +- **Benchmark** + - **New Benchmarks:** + - **Disaggregated Serving:** Added dedicated performance tests for disaggregated serving scenarios (`test_perf.py`). + - **Multimodal:** Enabled `benchmark_serving` support for multimodal models. + - **NIM:** Added specific performance test cases for NIM (NVIDIA Inference Microservices) integration. + - **Tooling Improvements:** + - **trtllm-bench:** Added support for sampler options, accurate device iteration timing, and improved data loading for benchmark datasets. + - **Metrics:** Enhanced reporting to include KV cache size metrics in benchmark results. + - **Scaffolding:** Added benchmark support for scaffolding examples. +- **Documentation** + - **Deployment Guides:** Added comprehensive deployment guides for GPT-OSS, DeepSeek-R1, and VDR 1.0. + - **Feature Documentation:** Created new documentation for KV Cache Connector, LoRA feature usage, and AutoDeploy. + - **Tech Blogs:** Published blogs on "Combining Guided Decoding and Speculative Decoding" and "ADP Balance Strategy". + - **Hardware:** Updated documentation to reflect B300/GB300 support. + - **Quick Start:** Refined Quick Start guides with new links to ModelOpt checkpoints and updated installation steps (Linux/Windows). + - **API Reference:** Enhanced LLM API documentation by explicitly labeling stable vs. unstable APIs. + - **Performance:** Updated online benchmarking documentation and performance overview pages. 
- **Examples:** Refined Slurm examples and added K2 tool calling examples. ### Infrastructure Changes -- The base Docker image for TensorRT-LLM is updated to \`nvcr.io/nvidia/pytorch:25.10-py3\`. -- The base Docker image for TensorRT-LLM Backend is updated to \`nvcr.io/nvidia/tritonserver:25.10-py3\`. -- The dependent public PyTorch version is updated to 2.9.0. -- The dependent NVIDIA ModelOpt version is updated to 0.37. -- The dependent xgrammar version is updated to 0.1.25. +- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.10-py3`. +- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.10-py3`. +- The dependent public PyTorch version is updated to 2.9.0. +- The dependent NVIDIA ModelOpt version is updated to 0.37. +- The dependent xgrammar version is updated to 0.1.25. - The dependent transformers version is updated to 4.56.0. ### API Changes -- **[Breaking Change**: The C++ TRTLLM sampler is now enabled by default, replacing the legacy implementation. A new `sampler_type` argument has been introduced to `SamplingConfig` to explicitly control sampler selection. -- **KV Cache Connector API:** Introduced a new KV Cache Connector API to facilitate state transfer between Disaggregated Serving workers (Context and Generation phases). -- **LLM API Enhancements:** - - Added support for `prompt_logprobs` in the PyTorch backend. - - Standardized `topk` logprob returns across TRT and PyTorch backends. +- **Breaking Change**: The C++ TRTLLM sampler is now enabled by default, replacing the legacy implementation. A new `sampler_type` argument has been introduced to `SamplingConfig` to explicitly control sampler selection. +- **KV Cache Connector API:** Introduced a new KV Cache Connector API to facilitate state transfer between Disaggregated Serving workers (Context and Generation phases). +- **LLM API Enhancements:** + - Added support for `prompt_logprobs` in the PyTorch backend. + - Standardized `topk` logprob returns across TRT and PyTorch backends. - Added stable labels to arguments in the `LLM` class to better indicate API stability. -- **Response API:** Added basic functionality for the Responses API to better handle streaming and non-streaming responses. -- **Multimodal Inputs:** Updated the `MultimodalParams` API to support `SharedTensor`, improving memory management for visual language models. +- **Response API:** Added basic functionality for the Responses API to better handle streaming and non-streaming responses. +- **Multimodal Inputs:** Updated the `MultimodalParams` API to support `SharedTensor`, improving memory management for visual language models. - **Wait and Cancel API:** Added tests and support for handling non-existent and completed request cancellations in the executor. ### Fixed Issues -- **DeepSeek-V3/R1:** - - Fixed potential hangs in DeepSeek-V3 pipelines by adjusting MNNVL configurations. - - Resolved illegal memory access errors in FP8 Scout and DeepSeek models. - - Fixed weight loading issues for DeepSeek-R1 W4A8 checkpoints (TP16 scenarios). -- **Llama 4:** Fixed FP4 generation issues and corrected all-reduce operations in the last decoder layer. -- **Mistral/Pixtral:** Fixed a batching bug in Mistral 3.1 where processing multiple requests with images in the same batch caused failures. -- **Qwen:** Fixed Qwen2.5-VL failures related to CUDA graph padding and transformers version compatibility. 
-- **Gemma:** Fixed out-of-bounds vector access for models with multiple layer types and resolved accuracy issues in Gemma 2\. -- **Speculative Decoding:** - - Fixed race conditions in one-model speculative decoding. - - Resolved CUDA graph warmup issues that caused failures when using speculative decoding. - - Fixed KV cache recompute logic in `draft_target` speculative decoding. -- **MoE (Mixture of Experts):** - - Fixed OOM issues in fused MoE kernels by optimizing workspace pre-allocation. - - Corrected Cutlass MoE integration to fix accuracy issues on Blackwell hardware. - - Fixed W4A8 MoE kernel issues on Hopper architecture. -- **General:** - - Fixed a potential hang caused by Python multiprocessing when prefetching weights. - - Resolved an issue where `torch.onnx.export` would fail with newer PyTorch versions by correctly falling back to non-dynamo modes. - - Fixed numerical stability issues for XQA kernels when using speculative decoding. +- **DeepSeek-V3/R1:** + - Fixed potential hangs in DeepSeek-V3 pipelines by adjusting MNNVL configurations. + - Resolved illegal memory access errors in FP8 Scout and DeepSeek models. + - Fixed weight loading issues for DeepSeek-R1 W4A8 checkpoints (TP16 scenarios). +- **Llama 4:** Fixed FP4 generation issues and corrected all-reduce operations in the last decoder layer. +- **Mistral/Pixtral:** Fixed a batching bug in Mistral 3.1 where processing multiple requests with images in the same batch caused failures. +- **Qwen:** Fixed Qwen2.5-VL failures related to CUDA graph padding and transformers version compatibility. +- **Gemma:** Fixed out-of-bounds vector access for models with multiple layer types and resolved accuracy issues in Gemma 2. +- **Speculative Decoding:** + - Fixed race conditions in one-model speculative decoding. + - Resolved CUDA graph warmup issues that caused failures when using speculative decoding. + - Fixed KV cache recompute logic in `draft_target` speculative decoding. +- **MoE (Mixture of Experts):** + - Fixed OOM issues in fused MoE kernels by optimizing workspace pre-allocation. + - Corrected Cutlass MoE integration to fix accuracy issues on Blackwell hardware. + - Fixed W4A8 MoE kernel issues on Hopper architecture. +- **General:** + - Fixed a potential hang caused by Python multiprocessing when prefetching weights. + - Resolved an issue where `torch.onnx.export` would fail with newer PyTorch versions by correctly falling back to non-dynamo modes. + - Fixed numerical stability issues for XQA kernels when using speculative decoding. - Fixed a memory leak in the `cacheTransceiver` that could lead to hangs in disaggregated serving. ### Know Issues -- **Llama Pipeline Parallelism:** There are known stability issues when running Llama models with Pipeline Parallelism (PP) enabled in specific configurations. -- **GB300 Multi-Node:** Support for GB300 in multi-node configurations is currently in beta and not fully validated in this release. -- **Context Chunking:** In certain disaggregated serving configurations with specific chunk sizes, context chunking may exhibit performance degradation or instability. -- **Triton Backend:** There are known limitations when building Triton Docker images with specific combinations of dependencies; users should adhere strictly to the support matrix versions. +- **Llama Pipeline Parallelism:** There are known stability issues when running Llama models with Pipeline Parallelism (PP) enabled in specific configurations. 
+- **GB300 Multi-Node:** Support for GB300 in multi-node configurations is currently in beta and not fully validated in this release. +- **Context Chunking:** In certain disaggregated serving configurations with specific chunk sizes, context chunking may exhibit performance degradation or instability. ## TensorRT-LLM Release 1.0 @@ -117,7 +116,7 @@ TensorRT LLM 1.0 brings 2 major changes: the PyTorch-based architecture is now s - Add support for sm121 - Add LoRA support for Gemma3 - Support PyTorch LoRA adapter eviction - - Add LoRA support for PyTorch backend in trtllm-serve + - Add LoRA support for PyTorch backend in trtllm-serve - Add support of scheduling attention dp request - Remove padding of FusedMoE in attention DP - Support torch compile for attention dp @@ -156,7 +155,7 @@ TensorRT LLM 1.0 brings 2 major changes: the PyTorch-based architecture is now s - Add status tags to LLM API reference - Support JSON Schema in OpenAI-Compatible API - Support chunked prefill on spec decode 2 model - - Add KV cache reuse support for multimodal models + - Add KV cache reuse support for multimodal models - Support nanobind bindings - Add support for two-model engine KV cache reuse - Add Eagle-3 support for qwen3 dense model @@ -176,7 +175,7 @@ TensorRT LLM 1.0 brings 2 major changes: the PyTorch-based architecture is now s - Detokenize option in /v1/completions request - Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner - Remove support for llmapi + TRT backend in Triton - - Add request_perf_metrics to triton LLMAPI backend + - Add request_perf_metrics to triton LLMAPI backend - Add support for Triton request cancellation - Benchmark: @@ -185,7 +184,7 @@ TensorRT LLM 1.0 brings 2 major changes: the PyTorch-based architecture is now s - Add the ability to write a request timeline for trtllm-bench - Add no_kv_cache_reuse option and streaming support for trtllm-serve bench - Add latency support for trtllm-bench - - Add Acceptance Rate calculation to benchmark_serving + - Add Acceptance Rate calculation to benchmark_serving - Add wide-ep benchmarking scripts - Update trtllm-bench to support new Pytorch default - Add support for TRTLLM CustomDataset @@ -215,8 +214,8 @@ TensorRT LLM 1.0 brings 2 major changes: the PyTorch-based architecture is now s - Change default LoRA cache sizes and change peft_cache_config cache size fields to take effect when not explicitly set in lora_config - Remove deprecated LoRA LLM args, that are already specified in lora_config - Add request_perf_metrics to LLMAPI -- Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead -- Remove TrtGptModelOptionalParams +- Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead +- Remove TrtGptModelOptionalParams - Remove ptuning knobs from TorchLlmArgs @@ -264,17 +263,17 @@ TensorRT LLM 1.0 brings 2 major changes: the PyTorch-based architecture is now s - Fix error in post-merge-tests (#5949) - Fix missing arg to alltoall_prepare_maybe_dispatch (#5669) - Fix attention DP doesn't work with embedding TP (#5642) -- Fix broken cyclic reference detect (#5417) +- Fix broken cyclic reference detect (#5417) - Fix permission for local user issues in NGC docker container. (#5373) -- Fix mtp vanilla draft inputs (#5568) -- Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519) +- Fix mtp vanilla draft inputs (#5568) +- Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519) - Fix block scale fp8 support for deepseek v3 on Blackwell. 
(#5514) -- Fix the issue MoE autotune fallback failed to query default heuristic (#5520) +- Fix the issue MoE autotune fallback failed to query default heuristic (#5520) - Fix the unexpected keyword argument 'streaming' (#5436) ### Known Issues - When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release. In the meantime, disabling KV cache reuse will fix this issue. -- Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release. +- Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release. - For the Llama 3.x and Llama 4 models, there is an issue with pipeline parallelism when using FP8 and NVFP4 weights. As a workaround, you can set the environment variable `export TRTLLM_LLAMA_EAGER_FUSION_DISABLED=1`. ## TensorRT-LLM Release 0.21.0 From 3665a325caba81062caea6df00dc9bb99291d94e Mon Sep 17 00:00:00 2001 From: junq <22017000+QiJune@users.noreply.github.com> Date: Fri, 5 Dec 2025 18:21:06 +0800 Subject: [PATCH 3/8] update Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --- docs/source/release-notes.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md index 3d1207a0a92..e4c000c491b 100644 --- a/docs/source/release-notes.md +++ b/docs/source/release-notes.md @@ -95,9 +95,7 @@ All published functionality in the Release Notes has been fully tested and verif ### Know Issues -- **Llama Pipeline Parallelism:** There are known stability issues when running Llama models with Pipeline Parallelism (PP) enabled in specific configurations. - **GB300 Multi-Node:** Support for GB300 in multi-node configurations is currently in beta and not fully validated in this release. -- **Context Chunking:** In certain disaggregated serving configurations with specific chunk sizes, context chunking may exhibit performance degradation or instability. ## TensorRT-LLM Release 1.0 From 725e7007250375666bebda71f062c24142315252 Mon Sep 17 00:00:00 2001 From: junq <22017000+QiJune@users.noreply.github.com> Date: Mon, 8 Dec 2025 09:12:27 +0800 Subject: [PATCH 4/8] fix Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --- docs/source/release-notes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md index e4c000c491b..55eb0d6b163 100644 --- a/docs/source/release-notes.md +++ b/docs/source/release-notes.md @@ -93,7 +93,7 @@ All published functionality in the Release Notes has been fully tested and verif - Fixed numerical stability issues for XQA kernels when using speculative decoding. - Fixed a memory leak in the `cacheTransceiver` that could lead to hangs in disaggregated serving. -### Know Issues +### Known Issues - **GB300 Multi-Node:** Support for GB300 in multi-node configurations is currently in beta and not fully validated in this release. 
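The workarounds called out in the 1.0 "Known Issues" above can be applied from the LLM API. A rough sketch, assuming `enable_block_reuse` is the `KvCacheConfig` field that controls KV cache reuse and that `pipeline_parallel_size` is accepted by the `LLM` constructor; the checkpoint path is a placeholder.

```python
# Rough sketch of the 1.0 known-issue workarounds described above.
# `enable_block_reuse` and `pipeline_parallel_size` are assumed names; verify
# them against the installed llmapi version.
import os

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Workaround for the Llama 3.x/4 pipeline-parallel issue with FP8/NVFP4 weights.
os.environ["TRTLLM_LLAMA_EAGER_FUSION_DISABLED"] = "1"

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    pipeline_parallel_size=2,
    # Workaround for the disaggregated serving + pipeline parallelism +
    # KV cache reuse hang: disable KV cache reuse.
    kv_cache_config=KvCacheConfig(enable_block_reuse=False),
)
```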
From 431dedb59fe87cc047ffe4ea51ca8bdaad224f15 Mon Sep 17 00:00:00 2001 From: junq <22017000+QiJune@users.noreply.github.com> Date: Mon, 8 Dec 2025 15:17:26 +0800 Subject: [PATCH 5/8] update NIXL Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --- docs/source/release-notes.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md index 55eb0d6b163..abc374f17b3 100644 --- a/docs/source/release-notes.md +++ b/docs/source/release-notes.md @@ -56,6 +56,7 @@ All published functionality in the Release Notes has been fully tested and verif - The dependent NVIDIA ModelOpt version is updated to 0.37. - The dependent xgrammar version is updated to 0.1.25. - The dependent transformers version is updated to 4.56.0. +- The dependent NIXL version is updated to 0.5.0. ### API Changes From c325b658a76cade7745b35e2ce89551f7fb74cd5 Mon Sep 17 00:00:00 2001 From: QI JUN <22017000+QiJune@users.noreply.github.com> Date: Tue, 9 Dec 2025 08:59:51 +0800 Subject: [PATCH 6/8] Update docs/source/release-notes.md Co-authored-by: Laikh Tewari Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> --- docs/source/release-notes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md index abc374f17b3..24bfe6a937f 100644 --- a/docs/source/release-notes.md +++ b/docs/source/release-notes.md @@ -96,7 +96,7 @@ All published functionality in the Release Notes has been fully tested and verif ### Known Issues -- **GB300 Multi-Node:** Support for GB300 in multi-node configurations is currently in beta and not fully validated in this release. +- **GB300 Multi-Node:** Support for GB300 in multi-node configurations is currently in beta and not fully validated in this release. GB300 multi-node configurations have been validated in 1.2.0rc4+ ## TensorRT-LLM Release 1.0 From 4c910ebe567745472558341191923dfd5b1c9235 Mon Sep 17 00:00:00 2001 From: junq <22017000+QiJune@users.noreply.github.com> Date: Wed, 10 Dec 2025 08:50:51 +0800 Subject: [PATCH 7/8] update Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --- docs/source/release-notes.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md index 24bfe6a937f..8673b972fd1 100644 --- a/docs/source/release-notes.md +++ b/docs/source/release-notes.md @@ -29,6 +29,8 @@ All published functionality in the Release Notes has been fully tested and verif - **DeepEP:** Optimized low-precision (FP4) combined kernels and all-to-all communication. - **AutoTuner:** Refactored tuning config and generalized tactic selection for better kernel performance. - **CuteDSL:** Integrated CuteDSL NVFP4 grouped GEMM for Blackwell. + - **Hardware:** + - **B300/GB300:** Added support for B300/GB300. - **Benchmark** - **New Benchmarks:** - **Disaggregated Serving:** Added dedicated performance tests for disaggregated serving scenarios (`test_perf.py`). @@ -41,8 +43,7 @@ All published functionality in the Release Notes has been fully tested and verif - **Documentation** - **Deployment Guides:** Added comprehensive deployment guides for GPT-OSS, DeepSeek-R1, and VDR 1.0. - **Feature Documentation:** Created new documentation for KV Cache Connector, LoRA feature usage, and AutoDeploy. - - **Tech Blogs:** Published blogs on "Combining Guided Decoding and Speculative Decoding" and "ADP Balance Strategy". - - **Hardware:** Updated documentation to reflect B300/GB300 support. 
+ - **Tech Blogs:** Published blogs on "[Combining Guided Decoding and Speculative Decoding](./blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md)" and "[ADP Balance Strategy](./blogs/tech_blog/blog10_ADP_Balance_Strategy.md)". - **Quick Start:** Refined Quick Start guides with new links to ModelOpt checkpoints and updated installation steps (Linux/Windows). - **API Reference:** Enhanced LLM API documentation by explicitly labeling stable vs. unstable APIs. - **Performance:** Updated online benchmarking documentation and performance overview pages. @@ -96,7 +97,7 @@ All published functionality in the Release Notes has been fully tested and verif ### Known Issues -- **GB300 Multi-Node:** Support for GB300 in multi-node configurations is currently in beta and not fully validated in this release. GB300 multi-node configurations have been validated in 1.2.0rc4+ +- **GB300 Multi-Node:** Support for GB300 in multi-node configurations is currently in beta and not fully validated in this release. GB300 multi-node configurations have been validated in 1.2.0rc4+. ## TensorRT-LLM Release 1.0 From c81d5cc5f403bca3d7883c60bffa615dea7df01f Mon Sep 17 00:00:00 2001 From: junq <22017000+QiJune@users.noreply.github.com> Date: Wed, 10 Dec 2025 10:08:02 +0800 Subject: [PATCH 8/8] acknowledge contribution from community Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --- docs/source/release-notes.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md index 8673b972fd1..49ea1a1fbe0 100644 --- a/docs/source/release-notes.md +++ b/docs/source/release-notes.md @@ -9,11 +9,12 @@ All published functionality in the Release Notes has been fully tested and verif ### Key Features and Enhancements - **Model Support** - - Add GPT-OSS model support - - Add Hunyuan-Dense model support - - Add Hunyuan-MoE model support - - Add Nemotron Nano VL V2 model support - - Add Seed-OSS model support + - Add GPT-OSS model support. + - Add Hunyuan-Dense model support. Thanks to the contribution from @sorenwu. + - Add Hunyuan-MoE model support. Thanks to the contribution from @qianbiaoxiang. + - Add Nemotron Nano VL V2 model support. + - Add Seed-OSS model support. Thanks to the contribution from @Nekofish-L. + - **Features** - **KV Cache & Context:** - **Connector API:** Introduced a new KV Cache Connector API for state transfer in disaggregated serving.