Commit c157c84

nv-guomingz, pcastonguay, and schetlur-nv authored and committed

[TRTLLM-7958][doc] add 1.0 release notes (NVIDIA#7605)

Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: pcastonguay <[email protected]>
Signed-off-by: Sharan Chetlur <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Co-authored-by: pcastonguay <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>
1 parent a3f624f commit c157c84

File tree: 1 file changed, docs/source/release-notes.md (175 additions, 0 deletions)

All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).

## TensorRT-LLM Release 1.0

TensorRT LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and the default experience, and the LLM API is now stable. For more details on the new developments in 1.0, see below.
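As a quick orientation, here is a minimal sketch of the now-stable LLM API running on the default PyTorch backend. The checkpoint name and sampling settings are illustrative assumptions, not part of the release notes.

```python
from tensorrt_llm import LLM, SamplingParams

# The PyTorch backend is the default in 1.0, so no backend selection is needed.
# The model ID below is an illustrative assumption; any supported checkpoint works.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() accepts a list of prompts and returns one result per prompt.
for output in llm.generate(["Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```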
### Key Features and Enhancements
- **Model Support**
  - Add Mistral3.1 VLM model support
  - Add TensorRT-Engine Qwen3 (dense) model support
  - Add phi-4-multimodal model support
  - Add EXAONE 4.0 model support
  - Add Qwen3 MoE support to the TensorRT backend
- **Features**
  - Add support for sm121
  - Add LoRA support for Gemma3
  - Support PyTorch LoRA adapter eviction
  - Add LoRA support for the PyTorch backend in trtllm-serve
  - Add support for scheduling attention DP requests
  - Remove padding of FusedMoE in attention DP
  - Support torch.compile for attention DP
  - Add KV events support for sliding window attention
  - Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; add attention_dp support for TRTLLM MoE
  - Add Piecewise CUDA Graph support for MLA
  - Support multiCtasKvMode for high-throughput MLA kernels
  - Enable KV cache reuse during request generation
  - Add ADP schedule balance optimization
  - Add chunked prefill support for MLA (Blackwell)
  - Enable multi-block mode for the Hopper spec dec XQA kernel
  - Add vLLM KV Pool support for the XQA kernel
  - Allow sending more than 2 GiB through MPI by using mpi4py.util.pkl5
  - Add support for fused gate_up_proj scales for FP8 blockwise
  - Support FP8 row-wise dense GEMM in the torch flow
  - Enable FP8 SwiGLU to minimize host overhead
  - Add DeepSeek R1 FP8 support on Blackwell
  - Add support for MXFP8xMXFP4 in PyTorch
  - Support nvfp4 models and FP8 KV cache for MLA chunked prefill (Blackwell)
  - Open-source the MoE MXFP8-MXFP4 implementation
  - Add support for the ModelOpt fp8_pb_wo quantization scheme
  - Support DeepEP fp4 post-quant all2all dispatch
  - Fuse the w4a8 MoE pre-quant scale on Hopper
  - Support weight-only quantization in the PyTorch workflow
  - Add support for per-expert activation scaling factors
  - Add ReDrafter support for Qwen
  - Enable CUDA Graph for Nemotron-H
  - Add support for YARN in NemotronNAS models
  - Switch to the internal version of MMProjector in Gemma3
  - Disable adding special tokens for Llama3.3 70B
  - Auto-enable ngram with concurrency <= 32
  - Support turning spec decoding on and off dynamically
  - Support structural tag in the C++ runtime and upgrade xgrammar to 0.1.21
  - Add support for external multimodal embeddings
  - Add support for disaggregation with pipeline parallelism in the PyTorch backend
  - Add status tags to the LLM API reference
  - Support JSON Schema in the OpenAI-compatible API (a sketch follows at the end of this section)
  - Support chunked prefill on the two-model spec decode path
  - Add KV cache reuse support for multimodal models
  - Support nanobind bindings
  - Add support for two-model engine KV cache reuse
  - Add Eagle-3 support for the Qwen3 dense model
  - Migrate Eagle-3 and draft/target speculation to Drafter
  - Enable guided decoding with the overlap scheduler
  - Support n-gram speculative decoding with disaggregated serving
  - Add beam search support to the PyTorch workflow
  - Add LLGuidance support for the PyTorch backend
  - Add NGrams V2 support
  - Add MTP support for online EPLB
  - Support disaggregated serving in the TRTLLM sampler
  - Add core infrastructure to enable loading of custom checkpoint formats
  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs
  - Use huge-page mapping for host-accessible memory on GB200
  - Add user-provided speculative decoding support
  - Add streaming scaffolding_llm.generate_async support
  - Add a detokenize option to the /v1/completions request
  - Integrate TRT-LLM Gen FP4 block scale MoE with the PyTorch workflow kernel autotuner
  - Remove support for llmapi + TRT backend in Triton
  - Add request_perf_metrics to the Triton LLMAPI backend
  - Add support for Triton request cancellation
- Benchmark:
  - Add support for benchmarking individual GEMMs in the MoE benchmark (#6080)
  - Add speculative metrics for trtllm-bench
  - Add the ability to write a request timeline for trtllm-bench
  - Add a no_kv_cache_reuse option and streaming support for trtllm-serve bench
  - Add latency support for trtllm-bench
  - Add acceptance rate calculation to benchmark_serving
  - Add wide-EP benchmarking scripts
  - Update trtllm-bench to support the new PyTorch default
  - Add support for TRTLLM CustomDataset
  - Make benchmark_serving part of the library
- Documentation:
  - Refactored the doc structure to focus on the PyTorch workflow.
  - Improved the LLMAPI and API reference documentation. Stable APIs are now protected and will remain consistent in subsequent versions following v1.0.
  - Removed legacy documentation related to the TensorRT workflow.
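To illustrate the JSON Schema support mentioned in the feature list above, here is a hedged sketch of calling a trtllm-serve OpenAI-compatible endpoint with a schema constraint via the openai Python client. The server URL, model name, and schema are illustrative assumptions; the response_format shape follows the OpenAI chat-completions convention, and server-side support details may differ.

```python
from openai import OpenAI

# Assumes a local trtllm-serve instance exposing the OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

# response_format follows the OpenAI json_schema convention; the model name is
# an illustrative assumption.
response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Give me a city and its population as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
print(response.choices[0].message.content)
```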
### Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.06-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.06-py3`.
- The dependent NVIDIA ModelOpt version is updated to 0.33.
- The dependent xgrammar version is updated to 0.1.21.
- The dependent transformers version is updated to 4.53.1.
### API Changes
- **BREAKING CHANGE** Promote PyTorch to be the default LLM backend
- **BREAKING CHANGE** Change the default backend to PyTorch in trtllm-serve
- **BREAKING CHANGE** Unify KvCacheConfig in the LLM class for the PyTorch backend (a sketch follows at the end of this list)
- **BREAKING CHANGE** Rename the cuda_graph_config padding_enabled field
- **BREAKING CHANGE** Rename mixed_sampler to enable_mixed_sampler
- **BREAKING CHANGE** Rename LLM.autotuner_enabled to enable_autotuner
- Add back the allreduce_strategy parameter to TorchLlmArgs
- Add an LlmArgs option to force using dynamic quantization
- Change default LoRA cache sizes and make the peft_cache_config cache size fields take effect when not explicitly set in lora_config
- Remove deprecated LoRA LLM args that are already specified in lora_config
- Add request_perf_metrics to the LLM API
- Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
- Remove TrtGptModelOptionalParams
- Remove ptuning knobs from TorchLlmArgs
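As a hedged illustration of the unified configuration and renamed knobs noted above: a minimal sketch constructing an LLM with KvCacheConfig and the renamed sampler/autotuner options. Field names beyond those listed in these notes (for example free_gpu_memory_fraction and enable_block_reuse) are assumptions drawn from the LLM API and may differ in detail.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# KvCacheConfig is now passed directly to the LLM class (PyTorch backend).
# free_gpu_memory_fraction / enable_block_reuse are assumed field names for illustration.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.85, enable_block_reuse=True)

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative checkpoint
    kv_cache_config=kv_cache_config,
    enable_mixed_sampler=True,   # renamed from mixed_sampler
    enable_autotuner=True,       # renamed from LLM.autotuner_enabled
)
```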
### Fixed Issues
- Fix illegal memory access in MLA (#6437)
- Fix NemotronNAS loading for TP > 1 (#6447)
- Fix wide EP when using DeepEP with online EPLB (#6429)
- Fix bugs caused by None attention_bias during Qwen3 model engine conversion (#6344)
- Fix PD + MTP + overlap scheduler accuracy issue (#6136)
- Fix a bug in Qwen3 when using fp4 on sm120 (#6065)
- Fix TMA error with GEMM+AR on TP=2 (#6075)
- Fix scaffolding aime test in test_e2e (#6140)
- Fix KV cache overrides in trtllm-bench (#6103)
- Fix the MoE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
- Fix the eagle3 two-model disaggregated serving test (#6014)
- Fix chunked prefill + overlap scheduling (#5761)
- Fix mgmn postprocess error (#5835)
- Fall back to cubins for fp8 fmha kernels on Ada (#5779)
- Fix disagg + speculative decoding (#5558)
- Fix test_generate_with_seed CI failure (#5772)
- Fix the prompt adapter TP2 case (#5782)
- Fix disaggregated serving with attention DP (#4993)
- Fix a quote error introduced in #5534 (#5816)
- Fix the accuracy issue when reduce_fusion is enabled for the GEMMA model (#5801)
- Fix lost requests for disaggregated serving (#5815)
- Update unit tests: skip the all_close assert for dropout in attention, and increase tolerance for the rope op test (#5855)
- Fix GEMM+AR fusion on Blackwell (#5563)
- Fix llama4 multimodal support (#5809)
- Fix the Llama4 Scout FP4 crash issue (#5925)
- Fix max batch size and max tokens in KV cache estimations for Nemotron-H (#5371)
- Fix the MoE regression for sm120 (#5823)
- Fix Qwen2.5VL FP8 support (#5029)
- Fix the illegal memory access issue in MoE GEMM on SM120 (#5636)
- Fix the case where tileN is not divisible by 16, and support sm89 deepgemm bmm (#5531)
- Fix incremental detokenization (#5825)
- Fix MoE workspace info by storing the Torch tensor itself instead of its data_ptr (#5900)
- Fix mistral unit tests due to the transformers upgrade (#5904)
- Fix the Llama3.1 405B hanging issue (#5698) (#5925)
- Fix Gemma3 unit tests due to the transformers upgrade (#5921)
- Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
- Remove SpecConfig and fix thread leak issues (#5931)
- Fast redux detection in the trtllm gen routing kernel (#5941)
- Fix cancel request logic (#5800)
- Fix errors in wide-ep scripts (#5992)
- Fix an error in post-merge-tests (#5949)
- Fix a missing arg to alltoall_prepare_maybe_dispatch (#5669)
- Fix attention DP not working with embedding TP (#5642)
- Fix broken cyclic reference detection (#5417)
- Fix permission issues for local users in the NGC docker container (#5373)
- Fix mtp vanilla draft inputs (#5568)
- Fix mPtrExpertCounts allocation in the MoE TRT-LLM backend (nvfp4) (#5519)
- Fix block scale fp8 support for DeepSeek V3 on Blackwell (#5514)
- Fix the issue where the MoE autotune fallback failed to query the default heuristic (#5520)
- Fix the unexpected keyword argument 'streaming' (#5436)
### Known Issues
- When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release; in the meantime, disabling KV cache reuse avoids the issue.
- Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release.
## TensorRT-LLM Release 0.21.0

### Key Features and Enhancements
