Commit 6f7e39a

[TRTLLM-7958][doc] add 1.0 release notes
Signed-off-by: nv-guomingz <[email protected]>
1 parent 88d1bde

docs/source/release-notes.md

Lines changed: 167 additions & 0 deletions

All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).

## TensorRT-LLM Release 1.0

### Key Features and Enhancements

- **Model Support**
  - Add Mistral3.1 VLM model support
  - Add TensorRT-Engine Qwen3 (dense) model support
  - Add phi-4-multimodal model support
  - Add EXAONE 4.0 model support
  - Add Qwen3 MoE support to the TensorRT backend
- **Features**
  - Add support for sm121
  - Add LoRA support for Gemma3
  - Support PyTorch LoRA adapter eviction
  - Add LoRA support for the PyTorch backend in trtllm-serve
  - Add support for scheduling attention DP requests
  - Remove padding of FusedMoE in attention DP
  - Support torch compile for attention DP
  - Add KV events support for sliding window attention
  - Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
  - Add piecewise CUDA graph support for MLA
  - Support multiCtasKvMode for high-throughput MLA kernels
  - Enable KV cache to be reused during request generation
  - Add ADP schedule balance optimization
  - Add chunked prefill support for MLA (Blackwell)
  - Enable multi-block mode for the Hopper spec dec XQA kernel
  - Add vLLM KV Pool support for the XQA kernel
  - Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5
  - Add support for fused gate_up_proj scales for FP8 blockwise
  - Support FP8 row-wise dense GEMM in the torch flow
  - Enable FP8 SwiGLU to minimize host overhead
  - Add DeepSeek R1 FP8 support on Blackwell
  - Add support for MXFP8xMXFP4 in PyTorch
  - Support nvfp4 models and FP8 KV cache for MLA chunked prefill (Blackwell)
  - Open-source the MoE MXFP8-MXFP4 implementation
  - Add support for the ModelOpt fp8_pb_wo quantization scheme
  - Support DeepEP fp4 post-quant all2all dispatch
  - Fuse the w4a8 MoE pre-quant scale on Hopper
  - Support weight-only quantization in the PyTorch workflow
  - Add support for per-expert activation scaling factors
  - Add ReDrafter support for Qwen
  - Enable CUDA graphs for Nemotron-H
  - Add support for YARN in NemotronNAS models
  - Switch to the internal version of MMProjector in Gemma3
  - Disable adding special tokens for Llama3.3 70B
  - Auto-enable ngram with concurrency <= 32
  - Support turning spec decoding on/off dynamically
  - Support structural tags in the C++ runtime and upgrade xgrammar to 0.1.21
  - Add support for external multimodal embeddings
  - Add support for disaggregation with pipeline parallelism in the PyTorch backend
  - Add status tags to the LLM API reference
  - Support JSON Schema in the OpenAI-compatible API (see the sketch at the end of this section)
  - Support chunked prefill with two-model speculative decoding
  - Add KV cache reuse support for multimodal models
  - Support nanobind bindings
  - Add support for two-model engine KV cache reuse
  - Add Eagle-3 support for the Qwen3 dense model
  - Migrate Eagle-3 and draft/target speculation to Drafter
  - Enable guided decoding with the overlap scheduler
  - Support n-gram speculative decoding with disaggregated serving
  - Add beam search support to the PyTorch workflow
  - Add LLGuidance support for the PyTorch backend
  - Add NGrams V2 support
  - Add MTP support for Online EPLB
  - Support disaggregated serving in the TRTLLM Sampler
  - Add core infrastructure to enable loading of custom checkpoint formats
  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs
  - Use huge page mapping for host-accessible memory on GB200
  - Add user-provided speculative decoding support
  - Add streaming scaffolding_llm.generate_async support
  - Add a detokenize option to the /v1/completions request
  - Integrate TRT-LLM Gen FP4 block scale MoE with the PyTorch workflow kernel autotuner
  - Remove support for llmapi + TRT backend in Triton
  - Add request_perf_metrics to the Triton LLMAPI backend
  - Add support for Triton request cancellation

- **Benchmark**
  - Add support for benchmarking individual GEMMs in the MoE benchmark (#6080)
  - Add speculative metrics for trtllm-bench
  - Add the ability to write a request timeline for trtllm-bench
  - Add a no_kv_cache_reuse option and streaming support for trtllm-serve bench
  - Add latency support for trtllm-bench
  - Add acceptance rate calculation to benchmark_serving
  - Add wide-EP benchmarking scripts
  - Update trtllm-bench to support the new PyTorch default
  - Add support for TRTLLM CustomDataset
  - Make benchmark_serving part of the library
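
As a quick illustration of the JSON Schema support in the OpenAI-compatible API listed above, the snippet below sends a schema-constrained chat request to a running `trtllm-serve` endpoint through the standard `openai` client. The endpoint URL, the model name, and the exact `response_format` payload the server accepts are illustrative assumptions based on the release note, not a verified recipe.

```python
# Hypothetical sketch: constrain a chat completion to a JSON schema via an
# OpenAI-compatible trtllm-serve endpoint (URL and model id are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

city_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
    messages=[{"role": "user", "content": "Describe Paris as JSON."}],
    # Assumed to be routed to the server's guided-decoding path per the 1.0 notes.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": city_schema},
    },
)
print(response.choices[0].message.content)
```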
### Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.06-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.06-py3`.
- The dependent public PyTorch version is updated to 2.8.0.
- The dependent NVIDIA ModelOpt version is updated to 0.33.
- The dependent xgrammar version is updated to 0.1.21.
- The dependent transformers version is updated to 4.51.3.

### API Changes
- **BREAKING CHANGE** Set the PyTorch LLM as the default
- **BREAKING CHANGE** Change the default backend to PyTorch in trtllm-serve
- **BREAKING CHANGE** Unify KvCacheConfig in the LLM class for the PyTorch backend (see the sketch after this list)
- **BREAKING CHANGE** Rename the cuda_graph_config padding_enabled field
- **BREAKING CHANGE** Rename mixed_sampler to enable_mixed_sampler
- **BREAKING CHANGE** Rename LLM.autotuner_enabled to enable_autotuner
- Add back the allreduce_strategy parameter to TorchLlmArgs
- Add an LlmArgs option to force using dynamic quantization
- Add request_perf_metrics to the LLM API
- Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
- Remove TrtGptModelOptionalParams
- Remove ptuning knobs from TorchLlmArgs
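
To make the breaking changes above concrete, here is a minimal sketch of what a 1.0-style LLM API call might look like with the PyTorch backend as the default, the unified KvCacheConfig, and the renamed enable_mixed_sampler flag. The checkpoint name and the exact keyword arguments accepted by the LLM constructor are illustrative assumptions, not an authoritative migration guide.

```python
# Minimal sketch of the 1.0-style LLM API (PyTorch backend assumed as the default).
# The checkpoint id and keyword names below are illustrative assumptions.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder checkpoint
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),  # unified KV cache config
    enable_mixed_sampler=True,  # renamed from mixed_sampler in 1.0
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```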
### Fixed Issues
- Fix illegal memory access in MLA (#6437)
- Fix NemotronNAS loading for TP > 1 (#6447)
- Fix wide EP when using DeepEP with online EPLB (#6429)
- Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
- Fix PD + MTP + overlap scheduler accuracy issue (#6136)
- Fix a bug in Qwen3 when using fp4 on sm120 (#6065)
- Fix TMA error with GEMM+AR on TP=2 (#6075)
- Fix scaffolding aime test in test_e2e (#6140)
- Fix KV cache overrides in trtllm-bench (#6103)
- Fix the MoE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
- Fix eagle3 two-model disaggregated serving test (#6014)
- Fix chunked prefill + overlap scheduling (#5761)
- Fix mgmn postprocess error (#5835)
- Fall back to cubins for fp8 fmha kernels on Ada (#5779)
- Fix disagg + speculative decoding (#5558)
- Fix test_generate_with_seed CI failure (#5772)
- Fix prompt adapter TP2 case (#5782)
- Fix disaggregated serving with attention DP (#4993)
- Fix a quote error introduced in #5534 (#5816)
- Fix the accuracy issue when reduce_fusion is enabled for the GEMMA model (#5801)
- Fix lost requests for disaggregated serving (#5815)
- Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
- Fix GEMM+AR fusion on Blackwell (#5563)
- Fix llama4 multimodal support (#5809)
- Fix Llama4 Scout FP4 crash issue (#5925)
- Fix max batch size and max tokens in KV cache estimations for Nemotron-H (#5371)
- Fix MoE regression for sm120 (#5823)
- Fix Qwen2.5VL FP8 support (#5029)
- Fix the illegal memory access issue in MoE GEMM on SM120 (#5636)
- Fix the case where tileN is not divisible by 16 and support sm89 DeepGEMM bmm (#5531)
- Fix incremental detokenization (#5825)
- Fix MoE workspace info by storing the Torch tensor itself instead of data_ptr (#5900)
- Fix mistral unit tests due to transformers upgrade (#5904)
- Fix the Llama3.1 405B hanging issue (#5698) (#5925)
- Fix Gemma3 unit tests due to transformers upgrade (#5921)
- Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
- Remove SpecConfig and fix thread leak issues (#5931)
- Add fast redux detection in the trtllm-gen routing kernel (#5941)
- Fix cancel request logic (#5800)
- Fix errors in wide-EP scripts (#5992)
- Fix error in post-merge-tests (#5949)
- Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
- Fix attention DP not working with embedding TP (#5642)
- Fix broken cyclic reference detection (#5417)
- Fix permission for local user issues in the NGC docker container (#5373)
- Fix MTP vanilla draft inputs (#5568)
- Fix mPtrExpertCounts allocation in the MoE TRT-LLM backend (nvfp4) (#5519)
- Fix block scale fp8 support for DeepSeek V3 on Blackwell (#5514)
- Fix the issue where the MoE autotune fallback failed to query the default heuristic (#5520)
- Fix the unexpected keyword argument 'streaming' (#5436)
### Known Issues
- On bare-metal Ubuntu 22.04 or 24.04, install the `cuda-python==12.9.1` package after installing the TensorRT-LLM wheel. This resolves an incompatibility with the default cuda-python 13, which otherwise fails with `ImportError: cannot import name 'cuda' from 'cuda'` (see the quick check below).
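
As a minimal way to confirm the workaround above took effect, the import below is the one that breaks under cuda-python 13. This is a hedged sketch, assuming it runs in the same Python environment where the TensorRT-LLM wheel and the pinned `cuda-python==12.9.1` package were installed.

```python
# Quick check that the cuda-python pin is effective in this environment
# (assumes `pip install cuda-python==12.9.1` was run after installing TensorRT-LLM).
try:
    from cuda import cuda  # this import fails under cuda-python 13
    print("cuda-python low-level bindings import correctly")
except ImportError as err:
    print(f"cuda-python incompatibility still present: {err}")
```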
## TensorRT-LLM Release 0.21.0

### Key Features and Enhancements
