All published functionality in the Release Notes has been fully tested and verified, with known limitations documented. To share feedback about this release, visit our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).

## TensorRT-LLM Release 1.0

TensorRT LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and is the default experience, and the LLM API is now stable. More details on the new developments in 1.0 are given below.
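As a quick orientation to the now-stable LLM API on the default PyTorch backend, here is a minimal sketch; the model id, prompt, and sampling values are illustrative placeholders rather than recommended settings.

```python
from tensorrt_llm import LLM, SamplingParams

# Load a checkpoint with the default (PyTorch) backend.
# The model id is a placeholder; any supported Hugging Face id or local path works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(max_tokens=32, temperature=0.8)

# generate() returns one result per prompt; each result holds the generated text.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```
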
### Key Features and Enhancements
- **Model Support**
  - Add Mistral 3.1 VLM model support
  - Add TensorRT-Engine Qwen3 (dense) model support
  - Add Phi-4-multimodal model support
  - Add EXAONE 4.0 model support
  - Add Qwen3 MoE support to the TensorRT backend

- **Features**
  - Add support for sm121
  - Add LoRA support for Gemma3
  - Support PyTorch LoRA adapter eviction
  - Add LoRA support for the PyTorch backend in trtllm-serve
  - Add support for scheduling attention DP requests
  - Remove padding of FusedMoE in attention DP
  - Support torch.compile for attention DP
  - Add KV events support for sliding window attention
  - Add TRTLLM MoE nvfp4 cubins for mid-to-high concurrency; enable attention DP for TRTLLM MoE
  - Add Piecewise CUDA Graph support for MLA
  - Support multiCtasKvMode for high-throughput MLA kernels
  - Enable KV cache reuse during request generation
  - Add ADP schedule balance optimization
  - Add chunked prefill support for MLA (Blackwell)
  - Enable multi-block mode for the Hopper spec-dec XQA kernel
  - Add vLLM KV Pool support for the XQA kernel
  - Allow sending more than 2 GiB through MPI by using mpi4py.util.pkl5
  - Add support for fused gate_up_proj scales for FP8 blockwise quantization
  - Support FP8 row-wise dense GEMM in the torch flow
  - Enable FP8 SwiGLU to minimize host overhead
  - Add DeepSeek R1 FP8 support on Blackwell
  - Add support for MXFP8xMXFP4 in PyTorch
  - Support nvfp4 models and FP8 KV cache for MLA chunked prefill (Blackwell)
  - Open-source the MoE MXFP8-MXFP4 implementation
  - Add support for the ModelOpt fp8_pb_wo quantization scheme
  - Support DeepEP FP4 post-quant all-to-all dispatch
  - Fuse the W4A8 MoE pre-quant scale on Hopper
  - Support weight-only quantization in the PyTorch workflow
  - Add support for per-expert activation scaling factors
  - Add ReDrafter support for Qwen
  - Enable CUDA Graph for Nemotron-H
  - Add support for YaRN in NemotronNAS models
  - Switch to the internal version of MMProjector in Gemma3
  - Disable add special tokens for Llama3.3 70B
  - Auto-enable ngram speculative decoding with concurrency <= 32
  - Support turning speculative decoding on/off dynamically
  - Support structural tags in the C++ runtime and upgrade xgrammar to 0.1.21
  - Add support for external multimodal embeddings
  - Add support for disaggregation with pipeline parallelism on the PyTorch backend
  - Add status tags to the LLM API reference
  - Support JSON Schema in the OpenAI-compatible API (see the sketch after this section)
  - Support chunked prefill for two-model speculative decoding
  - Add KV cache reuse support for multimodal models
  - Support nanobind bindings
  - Add support for two-model engine KV cache reuse
  - Add Eagle-3 support for Qwen3 dense models
  - Migrate Eagle-3 and draft/target speculation to Drafter
  - Enable guided decoding with the overlap scheduler
  - Support n-gram speculative decoding with disaggregated serving
  - Add beam search support to the PyTorch workflow
  - Add LLGuidance support for the PyTorch backend
  - Add NGrams V2 support
  - Add MTP support for online EPLB
  - Support disaggregated serving in the TRTLLM Sampler
  - Add core infrastructure to enable loading of custom checkpoint formats
  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs
  - Use huge-page mapping for host-accessible memory on GB200
  - Add user-provided speculative decoding support
  - Add streaming scaffolding_llm.generate_async support
  - Add a detokenize option to /v1/completions requests
  - Integrate TRT-LLM Gen FP4 block-scale MoE with the PyTorch workflow kernel autotuner
  - Remove support for the LLM API + TRT backend in Triton
  - Add request_perf_metrics to the Triton LLMAPI backend
  - Add support for Triton request cancellation

- **Benchmark**
  - Add support for benchmarking individual GEMMs in the MoE benchmark (#6080)
  - Add speculative-decoding metrics to trtllm-bench
  - Add the ability to write a request timeline in trtllm-bench
  - Add a no_kv_cache_reuse option and streaming support to trtllm-serve bench
  - Add latency support to trtllm-bench
  - Add acceptance-rate calculation to benchmark_serving
  - Add wide-EP benchmarking scripts
  - Update trtllm-bench to support the new PyTorch default
  - Add support for TRTLLM CustomDataset
  - Make benchmark_serving part of the library

- **Documentation**
  - Refactored the doc structure to focus on the PyTorch workflow.
  - Improved the LLM API and API reference documentation. Stable APIs are now protected and will remain consistent in subsequent versions following v1.0.
  - Removed legacy documentation related to the TensorRT workflow.

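As an illustration of the JSON Schema support in the OpenAI-compatible API mentioned above, here is a minimal sketch against a locally running `trtllm-serve` instance. The host/port, model name, and schema are placeholder assumptions, and the request body follows the OpenAI `response_format` convention for structured output.

```python
import json
import requests

# Placeholder endpoint for a locally running `trtllm-serve` instance.
url = "http://localhost:8000/v1/chat/completions"

# Hypothetical schema, used purely for illustration.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["name", "year"],
}

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    "messages": [{"role": "user", "content": "Name one GPU and its release year as JSON."}],
    # OpenAI-style structured output: constrain generation to the schema.
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "gpu_info", "schema": schema},
    },
}

response = requests.post(url, json=payload, timeout=60)
print(json.loads(response.json()["choices"][0]["message"]["content"]))
```
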
### Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.06-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.06-py3`.
- The dependent NVIDIA ModelOpt version is updated to 0.33.
- The dependent xgrammar version is updated to 0.1.21.
- The dependent transformers version is updated to 4.53.1.

### API Changes
- **BREAKING CHANGE** Promote PyTorch to be the default LLM backend
- **BREAKING CHANGE** Change the default backend to PyTorch in trtllm-serve
- **BREAKING CHANGE** Unify KvCacheConfig in the LLM class for the PyTorch backend
- **BREAKING CHANGE** Rename the cuda_graph_config padding_enabled field (see the sketch after this list)
- **BREAKING CHANGE** Rename mixed_sampler to enable_mixed_sampler
- **BREAKING CHANGE** Rename LLM.autotuner_enabled to enable_autotuner
- Add back the allreduce_strategy parameter to TorchLlmArgs
- Add an LlmArgs option to force dynamic quantization
- Change the default LoRA cache sizes and make peft_cache_config cache size fields take effect when not explicitly set in lora_config
- Remove deprecated LoRA LLM args that are already specified in lora_config
- Add request_perf_metrics to the LLM API
- Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
- Remove TrtGptModelOptionalParams
- Remove ptuning knobs from TorchLlmArgs

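To make the renames above concrete, here is a hedged sketch of how the new-style options might be passed to the LLM API on the PyTorch backend. The `CudaGraphConfig` import and its `enable_padding` field (as the replacement for `padding_enabled`) are assumptions for illustration; the other keyword names come from the bullets above. Check the LLM API reference for the exact, current names.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Assumed location and field name for the renamed cuda_graph_config option.
from tensorrt_llm.llmapi import CudaGraphConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",                 # placeholder model id
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),
    cuda_graph_config=CudaGraphConfig(enable_padding=True),   # assumed field name
    enable_mixed_sampler=True,    # renamed from mixed_sampler
    enable_autotuner=True,        # renamed from LLM.autotuner_enabled
)
```
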
### Fixed Issues
- Fix illegal memory access in MLA (#6437)
- Fix NemotronNAS loading for TP>1 (#6447)
- Fix wide EP when using DeepEP with online EPLB (#6429)
- Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
- Fix PD + MTP + overlap scheduler accuracy issue (#6136)
- Fix bug of Qwen3 when using fp4 on sm120 (#6065)
- Fix TMA error with GEMM+AR on TP=2 (#6075)
- Fix scaffolding aime test in test_e2e (#6140)
- Fix KV cache overrides in trtllm-bench (#6103)
- Fix MoE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
- Fix eagle3 two-model disaggregated serving test (#6014)
- Fix chunked prefill + overlap scheduling (#5761)
- Fix mgmn postprocess error (#5835)
- Fallback to cubins for fp8 fmha kernels on Ada (#5779)
- Fix disagg + speculative decoding (#5558)
- Fix test_generate_with_seed CI failure (#5772)
- Fix prompt adapter TP2 case (#5782)
- Fix disaggregated serving with attention DP (#4993)
- Fix a quote error introduced in #5534 (#5816)
- Fix the accuracy issue when reduce_fusion is enabled for the GEMMA model (#5801)
- Fix lost requests for disaggregated serving (#5815)
- Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
- Fix GEMM+AR fusion on Blackwell (#5563)
- Fix llama4 multimodal support (#5809)
- Fix Llama4 Scout FP4 crash issue (#5925)
- Fix max batch size and max tokens in KV cache estimations for Nemotron-H (#5371)
- Fix MoE regression for sm120 (#5823)
- Fix Qwen2.5VL FP8 support (#5029)
- Fix the illegal memory access issue in MoE GEMM on SM120 (#5636)
- Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
- Fix incremental detokenization (#5825)
- Fix MoE workspace info by storing the Torch tensor itself instead of data_ptr (#5900)
- Fix mistral unit tests due to transformers upgrade (#5904)
- Fix the Llama3.1 405B hanging issue (#5698) (#5925)
- Fix Gemma3 unit tests due to transformers upgrade (#5921)
- Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
- Remove SpecConfig and fix thread leak issues (#5931)
- Fast redux detection in trtllm gen routing kernel (#5941)
- Fix cancel request logic (#5800)
- Fix errors in wide-EP scripts (#5992)
- Fix error in post-merge-tests (#5949)
- Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
- Fix attention DP not working with embedding TP (#5642)
- Fix broken cyclic reference detection (#5417)
- Fix permission for local user issues in the NGC docker container (#5373)
- Fix MTP vanilla draft inputs (#5568)
- Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
- Fix block-scale FP8 support for DeepSeek V3 on Blackwell (#5514)
- Fix the issue where MoE autotune fallback failed to query the default heuristic (#5520)
- Fix the unexpected keyword argument 'streaming' (#5436)

### Known Issues
- When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release; in the meantime, disabling KV cache reuse avoids the issue.
- Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release.

## TensorRT-LLM Release 0.21.0

### Key Features and Enhancements