All published functionality in the Release Notes has been fully tested and verified, and known limitations are documented. To share feedback about this release, visit our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).

## TensorRT-LLM Release 1.0

TensorRT-LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and the default experience, and the LLM API is now stable. The new developments in 1.0 are detailed below.
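
With the LLM API now stable, a minimal end-to-end example looks roughly like the sketch below. It follows the quickstart-style entry points `LLM` and `SamplingParams` from the `tensorrt_llm` package; the model name is a placeholder, and exact argument names should be checked against the v1.0 LLM API reference.

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    # PyTorch is the default backend in 1.0, so no backend selection is needed.
    # The checkpoint below is a placeholder; any supported model works.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    sampling = SamplingParams(max_tokens=64, temperature=0.8)

    for output in llm.generate(["Hello, my name is"], sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```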
### Key Features and Enhancements
- **Model Support**
  - Add Mistral 3.1 VLM model support
  - Add TensorRT-Engine Qwen3 (dense) model support
  - Add Phi-4-multimodal model support
  - Add EXAONE 4.0 model support
  - Add Qwen3 MoE support to the TensorRT backend

- **Features**
  - Add support for sm121
  - Add LoRA support for Gemma3
  - Support PyTorch LoRA adapter eviction
  - Add LoRA support for the PyTorch backend in trtllm-serve
  - Add support for scheduling attention DP requests
  - Remove padding of FusedMoE in attention DP
  - Support torch.compile for attention DP
  - Add KV events support for sliding-window attention
  - Add TRTLLM MoE NVFP4 cubins for mid-to-high concurrency; add attention DP for TRTLLM MoE
  - Add Piecewise CUDA Graph support for MLA
  - Support multiCtasKvMode for high-throughput MLA kernels
  - Enable KV cache reuse during request generation
  - Add ADP schedule balancing optimization
  - Add chunked prefill support for MLA (Blackwell)
  - Enable multi-block mode for the Hopper speculative-decoding XQA kernel
  - Add vLLM KV pool support for the XQA kernel
  - Allow sending more than 2 GiB through MPI by using mpi4py.util.pkl5 (see the sketch after the lists in this section)
  - Add support for fused gate_up_proj scales for FP8 blockwise quantization
  - Support FP8 row-wise dense GEMM in the PyTorch flow
  - Enable FP8 SwiGLU to minimize host overhead
  - Add DeepSeek R1 FP8 support on Blackwell
  - Add support for MXFP8xMXFP4 in PyTorch
  - Support NVFP4 models and FP8 KV cache for MLA chunked prefill (Blackwell)
  - Open-source the MoE MXFP8-MXFP4 implementation
  - Add support for the ModelOpt fp8_pb_wo quantization scheme
  - Support DeepEP FP4 post-quantization all-to-all dispatch
  - Fuse W4A8 MoE pre-quant scales on Hopper
  - Support weight-only quantization in the PyTorch workflow
  - Add support for per-expert activation scaling factors
  - Add ReDrafter support for Qwen
  - Enable CUDA Graph support for Nemotron-H
  - Add support for YaRN in NemotronNAS models
  - Switch to the internal version of MMProjector in Gemma3
  - Disable adding special tokens for Llama 3.3 70B
  - Auto-enable N-gram speculative decoding when concurrency <= 32
  - Support turning speculative decoding on and off dynamically
  - Support structural tags in the C++ runtime and upgrade xgrammar to 0.1.21
  - Add support for external multimodal embeddings
  - Add support for disaggregation with pipeline parallelism in the PyTorch backend
  - Add status tags to the LLM API reference
  - Support JSON Schema in the OpenAI-compatible API (see the sketch after the lists in this section)
  - Support chunked prefill with two-model speculative decoding
  - Add KV cache reuse support for multimodal models
  - Support nanobind bindings
  - Add support for two-model engine KV cache reuse
  - Add Eagle-3 support for Qwen3 dense models
  - Migrate Eagle-3 and draft/target speculation to Drafter
  - Enable guided decoding with the overlap scheduler
  - Support N-gram speculative decoding with disaggregated serving
  - Add beam search support to the PyTorch workflow
  - Add LLGuidance support for the PyTorch backend
  - Add NGrams V2 support
  - Add MTP support for online EPLB
  - Support disaggregated serving in the TRTLLM sampler
  - Add core infrastructure to enable loading of custom checkpoint formats
  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs
  - Use huge-page mapping for host-accessible memory on GB200
  - Add user-provided speculative decoding support
  - Add streaming scaffolding_llm.generate_async support
  - Add a detokenize option to the /v1/completions request
  - Integrate TRT-LLM Gen FP4 block-scale MoE with the PyTorch workflow kernel autotuner
  - Remove support for llmapi + TRT backend in Triton
  - Add request_perf_metrics to the Triton LLMAPI backend
  - Add support for Triton request cancellation

- **Benchmark**
  - Add support for benchmarking individual GEMMs in the MoE benchmark (#6080)
  - Add speculative metrics for trtllm-bench
  - Add the ability to write a request timeline for trtllm-bench
  - Add a no_kv_cache_reuse option and streaming support for trtllm-serve bench
  - Add latency support for trtllm-bench
  - Add acceptance rate calculation to benchmark_serving
  - Add wide-EP benchmarking scripts
  - Update trtllm-bench to support the new PyTorch default
  - Add support for TRTLLM CustomDataset
  - Make benchmark_serving part of the library

- **Documentation**
  - Refactored the documentation structure to focus on the PyTorch workflow.
  - Improved the LLM API and API reference documentation. Stable APIs are now protected and will remain consistent in subsequent versions following v1.0.
  - Removed legacy documentation related to the TensorRT workflow.
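
For the MPI item above (messages larger than 2 GiB), the underlying mechanism is mpi4py's `pkl5` module, which uses pickle protocol 5 out-of-band buffers to sidestep the 32-bit count limit of plain MPI messages. A minimal standalone sketch, independent of TensorRT-LLM internals:

```python
# Run with: mpirun -n 2 python pkl5_demo.py
# Sketch of mpi4py.util.pkl5, the mechanism behind the ">2 GiB over MPI" item.
import numpy as np
from mpi4py import MPI
from mpi4py.util import pkl5

comm = pkl5.Intracomm(MPI.COMM_WORLD)  # drop-in wrapper with large-message pickling
rank = comm.Get_rank()

if rank == 0:
    payload = np.zeros(1 << 20, dtype=np.uint8)  # small here; pkl5 also handles > 2 GiB
    comm.send(payload, dest=1, tag=0)
elif rank == 1:
    payload = comm.recv(source=0, tag=0)
    print("received", payload.nbytes, "bytes")
```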
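
For the JSON Schema item, a request against a trtllm-serve endpoint can look roughly like the sketch below, using the standard OpenAI `response_format` field. The endpoint URL, model name, and schema are placeholders, and the exact set of `response_format` options accepted by trtllm-serve should be verified against its documentation.

```python
# Hedged sketch: constrain trtllm-serve output to a JSON schema via the
# OpenAI-compatible API. Endpoint, model name, and schema are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Give me a city and its population as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
print(response.choices[0].message.content)
```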

### Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.06-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.06-py3`.
- The dependent NVIDIA ModelOpt version is updated to 0.33.
- The dependent xgrammar version is updated to 0.1.21.
- The dependent transformers version is updated to 4.53.1.

### API Changes
- **BREAKING CHANGE** Promote PyTorch to be the default LLM backend
- **BREAKING CHANGE** Change the default backend to PyTorch in trtllm-serve
- **BREAKING CHANGE** Unify KvCacheConfig in the LLM class for the PyTorch backend (see the sketch after this list)
- **BREAKING CHANGE** Rename the cuda_graph_config padding_enabled field
- **BREAKING CHANGE** Rename mixed_sampler to enable_mixed_sampler
- **BREAKING CHANGE** Rename LLM.autotuner_enabled to enable_autotuner
- Add back the allreduce_strategy parameter to TorchLlmArgs
- Add an LlmArgs option to force dynamic quantization
- Change the default LoRA cache sizes and make the peft_cache_config cache size fields take effect when not explicitly set in lora_config
- Remove deprecated LoRA LLM args that are already specified in lora_config
- Add request_perf_metrics to the LLM API
- Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
- Remove TrtGptModelOptionalParams
- Remove ptuning knobs from TorchLlmArgs
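
A minimal sketch of what the unified and renamed knobs look like when constructing an LLM. It assumes the `tensorrt_llm.llmapi.KvCacheConfig` class and its `free_gpu_memory_fraction` field; the model name and the 0.85 value are placeholders, and whether `enable_mixed_sampler` is passed directly to `LLM` or through the PyTorch config should be confirmed against the LLM API reference.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# KvCacheConfig is now unified on the LLM class for the PyTorch backend, and the
# former `mixed_sampler` flag is spelled `enable_mixed_sampler`.
# The checkpoint and the 0.85 fraction are illustrative values only.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.85),
    enable_mixed_sampler=True,
)
```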

### Fixed Issues
- Fix illegal memory access in MLA (#6437)
- Fix NemotronNAS loading for TP>1 (#6447)
- Fix wide EP when using DeepEP with online EPLB (#6429)
- Fix bugs caused by None attention_bias during Qwen3 engine conversion (#6344)
- Fix PD + MTP + overlap scheduler accuracy issue (#6136)
- Fix a Qwen3 bug when using FP4 on sm120 (#6065)
- Fix TMA error with GEMM+AR on TP=2 (#6075)
- Fix the scaffolding AIME test in test_e2e (#6140)
- Fix KV cache overrides in trtllm-bench (#6103)
- Fix MoE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
- Fix Eagle-3 two-model disaggregated serving test (#6014)
- Fix chunked prefill + overlap scheduling (#5761)
- Fix mgmn postprocess error (#5835)
- Fall back to cubins for FP8 FMHA kernels on Ada (#5779)
- Fix disaggregated serving + speculative decoding (#5558)
- Fix test_generate_with_seed CI failure (#5772)
- Fix prompt adapter TP2 case (#5782)
- Fix disaggregated serving with attention DP (#4993)
- Fix a quote error introduced in #5534 (#5816)
- Fix the accuracy issue when reduce_fusion is enabled for the Gemma model (#5801)
- Fix lost requests for disaggregated serving (#5815)
- Update unit tests: skip all_close assert for dropout in attention, increase tolerance for the rope op test (#5855)
- Fix GEMM+AR fusion on Blackwell (#5563)
- Fix Llama 4 multimodal support (#5809)
- Fix Llama 4 Scout FP4 crash issue (#5925)
- Fix max batch size and max tokens in KV cache estimations for Nemotron-H (#5371)
- Fix MoE regression for sm120 (#5823)
- Fix Qwen2.5-VL FP8 support (#5029)
- Fix the illegal memory access issue in MoE GEMM on sm120 (#5636)
- Fix the case where tileN is not divisible by 16, and support DeepGEMM BMM on sm89 (#5531)
- Fix incremental detokenization (#5825)
- Fix MoE workspace info by storing the Torch tensor itself instead of its data_ptr (#5900)
- Fix Mistral unit tests due to the transformers upgrade (#5904)
- Fix the Llama 3.1 405B hanging issue (#5698) (#5925)
- Fix Gemma3 unit tests due to the transformers upgrade (#5921)
- Fix alltoall for Llama 4 (apply_router_weight_on_input=True) (#5902)
- Remove SpecConfig and fix thread leak issues (#5931)
- Add fast redux detection in the TRT-LLM Gen routing kernel (#5941)
- Fix request cancellation logic (#5800)
- Fix errors in wide-EP scripts (#5992)
- Fix an error in post-merge tests (#5949)
- Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
- Fix attention DP not working with embedding TP (#5642)
- Fix broken cyclic reference detection (#5417)
- Fix permission issues for local users in the NGC Docker container (#5373)
- Fix MTP vanilla draft inputs (#5568)
- Fix mPtrExpertCounts allocation in the MoE TRT-LLM backend (NVFP4) (#5519)
- Fix block-scale FP8 support for DeepSeek V3 on Blackwell (#5514)
- Fix the issue where the MoE autotuner fallback failed to query the default heuristic (#5520)
- Fix the unexpected keyword argument 'streaming' (#5436)

### Known Issues
- When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release. In the meantime, disabling KV cache reuse works around the issue (see the sketch below).
- Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release.
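
As a workaround sketch for the first issue, KV cache reuse can be turned off through KvCacheConfig. This assumes the `enable_block_reuse` field of `tensorrt_llm.llmapi.KvCacheConfig`; confirm the field name against the LLM API reference for your version, and note the model name is a placeholder.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Workaround sketch: disable KV cache block reuse to avoid the hang seen with
# disaggregated serving + pipeline parallelism. Field name assumed; model is a placeholder.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    kv_cache_config=KvCacheConfig(enable_block_reuse=False),
)
```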

## TensorRT-LLM Release 0.21.0

### Key Features and Enhancements