
All published functionality in the Release Notes has been fully tested and verified, with known limitations documented. To share feedback about this release, visit our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).

## TensorRT-LLM Release 1.0

TensorRT-LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and the default experience, and the LLM API is now stable. For more details on the new developments in 1.0, see below.
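
As a quick orientation to the now-stable LLM API, here is a minimal sketch of the default PyTorch-backend flow. The model ID and sampling values are placeholders rather than recommendations, and the import paths assume the 1.0 `tensorrt_llm` package layout.

```python
from tensorrt_llm import LLM, SamplingParams

# The PyTorch backend is the default in 1.0, so no backend selection is required.
# The model ID below is a placeholder; substitute any supported checkpoint.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```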

### Key Features and Enhancements
- **Model Support**
- Add Mistral3.1 VLM model support
- Add TensorRT-Engine Qwen3 (dense) model support
- Add phi-4-multimodal model support
- Add EXAONE 4.0 model support
- Add Qwen3 MoE support to TensorRT backend

- **Features**
- Add support for sm121
- Add LoRA support for Gemma3
- Support PyTorch LoRA adapter eviction
- Add LoRA support for PyTorch backend in trtllm-serve
- Add support for scheduling attention DP requests
- Remove padding of FusedMoE in attention DP
- Support torch.compile for attention DP
- Add KV events support for sliding window attention
- Add TRTLLM MoE nvfp4 cubins for mid-to-high concurrency; enable attention DP for TRTLLM MoE
- Add Piecewise CUDA Graph support for MLA
- Support multiCtasKvMode for high-throughput MLA kernels
- Enable KV cache reuse during request generation
- Add ADP schedule balance optimization
- Add chunked prefill support for MLA (Blackwell)
- Enable Multi-block mode for Hopper spec dec XQA kernel
- Add vLLM KV Pool support for XQA kernel
- Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5
- Add support for fused gate_up_proj scales for FP8 blockwise
- Support FP8 row-wise dense GEMM in torch flow
- Enable fp8 SwiGLU to minimize host overhead
- Add Deepseek R1 FP8 Support on Blackwell
- Add support for MXFP8xMXFP4 in PyTorch
- Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
- Open-source the MoE MXFP8-MXFP4 implementation
- Add support for Modelopt fp8_pb_wo quantization scheme
- Support DeepEP FP4 post-quant all-to-all dispatch
- Fuse w4a8 moe pre-quant scale on Hopper
- Support Weight-Only-Quantization in PyTorch Workflow
- Add support for per expert activation scaling factors
- Add ReDrafter support for Qwen
- Enable CUDA Graph for Nemotron-H
- Add support for YARN in NemotronNAS models
- Switch to internal version of MMProjector in Gemma3
- Disable add special tokens for Llama3.3 70B
- Auto-enable ngram with concurrency <= 32
- Support turning on/off spec decoding dynamically
- Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
- Add support for external multimodal embeddings
- Add support for disaggregation with pipeline parallelism in the PyTorch backend
- Add status tags to LLM API reference
- Support JSON Schema in the OpenAI-Compatible API (see the example after this list)
- Support chunked prefill for two-model speculative decoding
- Add KV cache reuse support for multimodal models
- Support nanobind bindings
- Add support for two-model engine KV cache reuse
- Add Eagle-3 support for Qwen3 dense models
- Migrate Eagle-3 and draft/target speculation to Drafter
- Enable guided decoding with overlap scheduler
- Support n-gram speculative decoding with disaggregated serving
- Add beam search support to the PyTorch Workflow
- Add LLGuidance Support for PyTorch Backend
- Add NGrams V2 support
- Add MTP support for Online EPLB
- Support disaggregated serving in TRTLLM Sampler
- Add core infrastructure to enable loading of custom checkpoint formats
- Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs
- Use huge page mapping for host accessible memory on GB200
- Add user-provided speculative decoding support
- Add streaming scaffolding_llm.generate_async support
- Add a detokenize option to the /v1/completions request
- Integrate TRT-LLM Gen FP4 block scale MoE with the PyTorch workflow kernel autotuner
- Remove support for llmapi + TRT backend in Triton
- Add request_perf_metrics to the Triton LLMAPI backend
- Add support for Triton request cancellation

- Benchmark:
- Add support for benchmarking individual GEMMs in the MoE benchmark (#6080)
- Add speculative metrics for trtllm-bench
- Add the ability to write a request timeline for trtllm-bench
- Add no_kv_cache_reuse option and streaming support for trtllm-serve bench
- Add latency support for trtllm-bench
- Add Acceptance Rate calculation to benchmark_serving
- Add wide-ep benchmarking scripts
- Update trtllm-bench to support the new PyTorch default
- Add support for TRTLLM CustomDataset
- Make benchmark_serving part of the library

- Documentation:
- Refactored the doc structure to focus on the PyTorch workflow.
- Improved the LLMAPI and API reference documentation. Stable APIs are now protected and will remain consistent in subsequent versions following v1.0.
- Removed legacy documentation related to the TensorRT workflow.
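
To illustrate one of the serving features above, the sketch below exercises the JSON Schema support through the OpenAI-compatible endpoint of `trtllm-serve` using the standard `openai` Python client. The server URL, model name, and the exact `response_format` payload follow OpenAI's structured-output convention and should be treated as assumptions to verify against the trtllm-serve documentation for your build.

```python
from openai import OpenAI

# Assumes `trtllm-serve <model>` is already running locally on the default port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; use the served model name
    messages=[{"role": "user", "content": "Give me a JSON object describing Paris."}],
    # OpenAI-style structured output; field names assumed to match the OpenAI spec.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
print(response.choices[0].message.content)
```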

### Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.06-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.06-py3`.
- The dependent NVIDIA ModelOpt version is updated to 0.33.
- The dependent xgrammar version is updated to 0.1.21.
- The dependent transformers version is updated to 4.53.1.

### API Changes
- **BREAKING CHANGE** Promote PyTorch to be the default LLM backend
- **BREAKING CHANGE** Change default backend to PyTorch in trtllm-serve
- **BREAKING CHANGE** Unify KvCacheConfig in the LLM class for the PyTorch backend (see the sketch after this list)
- **BREAKING CHANGE** Rename the cuda_graph_config padding_enabled field
- **BREAKING CHANGE** Rename mixed_sampler to enable_mixed_sampler
- **BREAKING CHANGE** Rename LLM.autotuner_enabled to enable_autotuner
- Add back allreduce_strategy parameter into TorchLlmArgs
- Add an LlmArgs option to force the use of dynamic quantization
- Change default LoRA cache sizes and change peft_cache_config cache size fields to take effect when not explicitly set in lora_config
- Remove deprecated LoRA LLM args that are already specified in lora_config
- Add request_perf_metrics to LLMAPI
- Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
- Remove TrtGptModelOptionalParams
- Remove ptuning knobs from TorchLlmArgs
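
For scripts that set these knobs explicitly, a minimal migration sketch is shown below. The field names used here (`enable_padding` on `cuda_graph_config`, the unified `kv_cache_config`, and `enable_mixed_sampler`) reflect the renames listed above but are assumptions to confirm against the 1.0 LLM API reference.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    # KvCacheConfig is now the single KV cache knob on the LLM class.
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.85),
    # cuda_graph_config's padding flag was renamed (assumed: enable_padding).
    cuda_graph_config=CudaGraphConfig(enable_padding=True),
    # mixed_sampler was renamed to enable_mixed_sampler.
    enable_mixed_sampler=True,
)
```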


### Fixed Issues
- Fix illegal memory access in MLA (#6437)
- Fix nemotronNAS loading for TP>1 (#6447)
- Fix wide EP when using DeepEP with online EPLB (#6429)
- Fix bugs caused by None attention_bias during Qwen3 model engine conversion (#6344)
- Fix PD + MTP + overlap scheduler accuracy issue (#6136)
- Fix bug of Qwen3 when using fp4 on sm120 (#6065)
- Fix TMA error with GEMM+AR on TP=2 (#6075)
- Fix scaffolding aime test in test_e2e (#6140)
- Fix KV Cache overrides in trtllm-bench (#6103)
- Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
- Fix eagle3 two model disaggregated serving test (#6014)
- Fix chunked prefill + overlap scheduling (#5761)
- Fix mgmn postprocess error (#5835)
- Fallback to cubins for fp8 fmha kernels on Ada (#5779)
- Fix disagg + speculative decoding (#5558)
- Fix test_generate_with_seed CI failure. (#5772)
- Fix prompt adapter TP2 case (#5782)
- Fix disaggregate serving with attention DP (#4993)
- Fix a quote error introduced in #5534 (#5816)
- Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
- Fix lost requests for disaggregated serving (#5815)
- Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
- Fix GEMM+AR fusion on Blackwell (#5563)
- Fix llama4 multimodal support (#5809)
- Fix Llama4 Scout FP4 crash issue (#5925)
- Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
- Fix moe regression for sm120 (#5823)
- Fix Qwen2.5VL FP8 support (#5029)
- Fix the illegal memory access issue in moe gemm on SM120 (#5636)
- Fix the case where tileN is not divisible by 16 and add support for SM89 DeepGEMM BMM (#5531)
- Fix incremental detokenization (#5825)
- Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
- Fix mistral unit tests due to transformers upgrade (#5904)
- Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
- Fix Gemma3 unit tests due to transformers upgrade (#5921)
- Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
- Remove SpecConfig and fix thread leak issues (#5931)
- Fast redux detection in trtllm gen routing kernel (#5941)
- Fix cancel request logic (#5800)
- Fix errors in wide-ep scripts (#5992)
- Fix error in post-merge-tests (#5949)
- Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
- Fix attention DP not working with embedding TP (#5642)
- Fix broken cyclic reference detection (#5417)
- Fix permission issues for local users in the NGC docker container (#5373)
- Fix mtp vanilla draft inputs (#5568)
- Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
- Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
- Fix the issue MoE autotune fallback failed to query default heuristic (#5520)
- Fix the unexpected keyword argument 'streaming' (#5436)

### Known Issues
- When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release; in the meantime, disabling KV cache reuse works around the issue (see the sketch after this list).
- Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release.
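
Until the fix lands, the reuse-related hang can be avoided by turning off KV cache block reuse when constructing the LLM. A minimal sketch follows, assuming the `enable_block_reuse` field of `KvCacheConfig` is present under that name in your installed version.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Disable KV cache block reuse to work around the disaggregated-serving +
# pipeline-parallelism hang described above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    kv_cache_config=KvCacheConfig(enable_block_reuse=False),
)
```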

## TensorRT-LLM Release 0.21.0

### Key Features and Enhancements