# TensorRT-LLM 0.12.0 Release #2167
Hi,
We are very pleased to announce the 0.12.0 version of TensorRT-LLM. This update includes:
## Key Features and Enhancements

- The `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- Supported GPT-J, Phi, Phi-3, Qwen, GPT, GLM, Baichuan, Falcon and Gemma models for the `LLM` class (see the sketch after this list).
- Supported ReDrafter speculative decoding, see `docs/source/speculative_decoding.md`.
- Supported the `gelu_pytorch_tanh` activation function, thanks to the contribution from @ttim in #1897.
- Added the `chunk_length` parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909.
- Added the `concurrency` argument for `gptManagerBenchmark`.
- The executor API supports requests with different beam widths, see `docs/source/executor.md#sending-requests-with-different-beam-widths`.
- Added the `--fast_build` flag to the `trtllm-build` command (experimental).
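For the `LLM` class item above, here is a minimal sketch in the style of the quick-start guide; the model name, prompts, and sampling values are illustrative placeholders rather than part of these release notes.

```python
# A minimal sketch of the high-level LLM API (quick-start style).
# The model name and sampling values are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Build (or load) an engine for a Hugging Face model, then generate.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```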
## API Changes

- `max_output_len` is removed from the `trtllm-build` command; if you want to limit the sequence length at the engine build stage, specify `max_seq_len` instead (see the sketch after this list).
- The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- The `multi_block_mode` argument is moved from the build stage (`trtllm-build` and the builder API) to the runtime.
- `context_fmha_fp32_acc` is moved to the runtime for decoder models.
- `tp_size`, `pp_size` and `cp_size` are removed from the `trtllm-build` command.
- The GptManager API is deprecated in favor of the executor API, and it will be removed in a future release of TensorRT-LLM.
- Added a version API to the C++ library; a `cpp/include/tensorrt_llm/executor/version.h` file is going to be generated.
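Since `max_output_len` is no longer accepted at build time, the overall sequence length is capped with `max_seq_len` instead. A minimal sketch, assuming the `BuildConfig` class exported by the `tensorrt_llm` package; the sizes are placeholders.

```python
# Sketch: capping sequence length at engine build time now that
# max_output_len is removed; the 4096/8 values are placeholders.
from tensorrt_llm import BuildConfig

build_config = BuildConfig(
    max_seq_len=4096,    # maximum input + output tokens per sequence
    max_batch_size=8,
)
```

The CLI equivalent is to pass `--max_seq_len` to `trtllm-build` instead of the removed `--max_output_len`.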
## Model Updates

- Supported the EXAONE model, see `examples/exaone/README.md`.
- Supported GLM-4 models, see `examples/chatglm/README.md`.
- Added LLaVA-NeXT multimodal support, see `examples/multimodal/README.md`.
## Fixed Issues

- Fixed a typo in `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from @saeyoonoh in #1987.
- Removed a duplicated flag from the command at `docs/source/reference/troubleshooting.md`, thanks to the contribution from @hattizai in #1937.
- Propagated `QuantConfig.exclude_modules` to weight-only quantization, thanks to the contribution from @fjosw in #2056 (see the sketch after this list).
- Fixed an engine build failure ("TypeError: set_shape(): incompatible function arguments") when `max_seq_len` is not an integer. (#2018)
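To illustrate the `exclude_modules` fix, the sketch below shows a weight-only quantization config whose exclusion list is now honored; the import paths, algorithm choice, and module name are assumptions for illustration.

```python
# Sketch: a weight-only quantization config whose exclude_modules list
# is now propagated (per #2056). W8A16 and "lm_head" are illustrative.
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

quant_config = QuantConfig(
    quant_algo=QuantAlgo.W8A16,     # INT8 weight-only quantization
    exclude_modules=["lm_head"],    # leave these modules unquantized
)
```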
## Infrastructure Changes

- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.

## Known Issues
- On Windows, installation of TensorRT-LLM may succeed, but importing the library in Python can fail with `OSError: exception: access violation reading 0x0000000000000000`. See Installing on Windows for workarounds.

Currently, there are two key branches in the project: the development branch `main` and the release branch `rel`.
We are updating the `main` branch regularly with new features, bug fixes, and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequency will depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team