# TensorRT-LLM v0.11 Update #1969

## TensorRT-LLM Release 0.11.0

### Key Features and Enhancements

- Supported very long context for LLaMA, see the "Long context evaluation" section in `examples/llama/README.md`.
- Added LoRA support for Qwen, see `examples/qwen/README.md`.
- Added LoRA support for Phi-3, see `examples/phi/README.md`.
- Added LoRA support for StarCoder2, see `examples/gpt/README.md`.
- Added support for `distil-whisper/distil-large-v3`, thanks to the contribution from @IbrahimAmin1 in #1337.
- Added `numQueuedRequests` to the iteration stats log of the executor API.
- Added `iterLatencyMilliSec` to the iteration stats log of the executor API (both new fields appear in the log-reading sketch below this list).
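
For context, the two new iteration-stats fields can be picked up by whatever consumes the executor's stats log. Below is a minimal, hypothetical sketch in Python; the log path and the exact JSON layout are assumptions, only the field names `numQueuedRequests` and `iterLatencyMilliSec` come from this release.

```python
import json

# Hypothetical path: a file with one JSON iteration-stats record per line
# (the real location and format depend on how you emit executor stats).
LOG_PATH = "iteration_stats.jsonl"

with open(LOG_PATH) as f:
    for line in f:
        stats = json.loads(line)
        # The two fields added in v0.11; assumed here to be top-level keys.
        print(
            f"queued={stats.get('numQueuedRequests')} "
            f"iterLatencyMs={stats.get('iterLatencyMilliSec')}"
        )
```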

### API Changes

- `trtllm-build` command
  - Migrated Whisper to the unified workflow (`trtllm-build` command), see `examples/whisper/README.md`.
  - `max_batch_size` in the `trtllm-build` command is switched to 256 by default.
  - `max_num_tokens` in the `trtllm-build` command is switched to 8192 by default.
  - Deprecated `max_output_len` and added `max_seq_len`.
  - Removed the `--weight_only_precision` argument from the `trtllm-build` command.
  - Removed the `attention_qk_half_accumulation` argument from the `trtllm-build` command.
  - Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command.
  - Removed the `strongly_typed` argument from the `trtllm-build` command.
  - The default value of `max_seq_len` now reads from the HuggingFace model config.
- C++ runtime
  - Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
  - Refactored the `GptManager` API:
    - Moved `maxBeamWidth` into `TrtGptModelOptionalParams`.
    - Moved `schedulerConfig` into `TrtGptModelOptionalParams`.
  - Added more options to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse` and `enable_chunked_context` (see the sketch right below this item).
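
As a rough illustration of the `ModelRunnerCpp` options above, here is a minimal sketch using the Python wrapper; the engine path and token IDs are placeholders, and the exact set of keyword arguments accepted by `from_dir` should be confirmed against `tensorrt_llm/runtime` in your installed version.

```python
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

ENGINE_DIR = "/path/to/engine_dir"  # placeholder: an engine built with trtllm-build

# kv_cache_free_gpu_memory_fraction is the renamed form of the old
# free_gpu_memory_fraction; the other three options are newly exposed in v0.11.
runner = ModelRunnerCpp.from_dir(
    engine_dir=ENGINE_DIR,
    kv_cache_free_gpu_memory_fraction=0.85,
    max_tokens_in_paged_kv_cache=8192,
    kv_cache_enable_block_reuse=True,
    enable_chunked_context=True,
)

# Placeholder token IDs; in practice these come from the model's tokenizer.
batch_input_ids = [torch.tensor([1, 15043, 3186], dtype=torch.int32)]
outputs = runner.generate(batch_input_ids, max_new_tokens=32, end_id=2, pad_id=2)
print(outputs.shape)  # (batch size, beam width, sequence length)
```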

- Python high-level API
  - Removed the `ModelConfig` class; all of its options are moved to the `LLM` class.
  - Refactored the `LLM` class, please refer to `examples/high-level-api/README.md` (a usage sketch also follows this list).
    - The `model` argument accepts either a HuggingFace model name, or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
    - Supported a build cache that reuses built TensorRT-LLM engines, enabled by setting the environment variable `TLLM_HLAPI_BUILD_CACHE=1` or by passing `enable_build_cache=True` to the `LLM` class.
    - Exposed low-level options, including `BuildConfig`, `SchedulerConfig` and so on, in the kwargs; ideally you should be able to configure details about the build and runtime phases.
  - Refactored the `LLM.generate()` and `LLM.generate_async()` APIs.
    - Removed `SamplingConfig`.
    - Added `SamplingParams` with more extensive parameters, see `tensorrt_llm/hlapi/utils.py`.
      - The new `SamplingParams` contains and manages fields from the Python bindings of `SamplingConfig`, `OutputConfig`, and so on.
    - Refactored the `LLM.generate()` output as `RequestOutput`, see `tensorrt_llm/hlapi/llm.py`.
  - Updated the `apps` examples, notably by rewriting both `chat.py` and `fastapi_server.py` using the `LLM` APIs; please refer to `examples/apps/README.md` for details.
    - Updated `chat.py` to support multi-turn conversation, allowing users to chat with a model in the terminal.
    - Fixed `fastapi_server.py` and eliminated the need for `mpirun` in multi-GPU scenarios.
- Speculative decoding configurations unification
  - Introduced `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
  - Introduced the `SpeculativeDecodingModule.h` base class for speculative decoding techniques.
  - Removed `decodingMode.h`.
- `gptManagerBenchmark`
  - The `api` option in the `gptManagerBenchmark` command is `executor` by default now.
  - Added a runtime `max_batch_size`.
  - Added a runtime `max_num_tokens`.
- Added a `bias` argument to the `LayerNorm` module, and supported non-bias layer normalization.
- Removed the `GptSession` Python bindings.
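
Taken together, the high-level API changes above amount to usage roughly like the following sketch. It assumes the v0.11 module layout under `tensorrt_llm.hlapi`; the model name, the `max_new_tokens` field name, and the `RequestOutput` attributes used here are assumptions to verify against `examples/high-level-api/README.md` and `tensorrt_llm/hlapi/utils.py`.

```python
import os

from tensorrt_llm.hlapi import LLM, SamplingParams

# Optional build cache: reuse previously built engines across runs
# (either the env var or the constructor flag should be enough).
os.environ["TLLM_HLAPI_BUILD_CACHE"] = "1"

# `model` may be a HuggingFace model name, a local HuggingFace model,
# a TensorRT-LLM checkpoint, or a prebuilt TensorRT-LLM engine.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", enable_build_cache=True)

# SamplingParams replaces the removed SamplingConfig; the full field list
# lives in tensorrt_llm/hlapi/utils.py.
params = SamplingParams(max_new_tokens=32, temperature=0.8, top_p=0.95)

# generate() now returns RequestOutput objects (see tensorrt_llm/hlapi/llm.py).
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)
```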

### Model Updates

- Supported Jais, see `examples/jais/README.md`.
- Supported DiT, see `examples/dit/README.md`.
- Supported Video NeVA, see the "Video NeVA" section in `examples/multimodal/README.md`.
- Supported Grok-1, see `examples/grok/README.md`.
- Supported Phi-3-medium models, see `examples/phi/README.md`.

### Fixed Issues

- Fixed the wrong `top_k` type (float instead of int32) in `executor.py`, thanks to the contribution from @vonjackustc in #1329.
- Fixed the `qkv_bias` shape issue for Qwen1.5-32B GPTQ checkpoints (#1589), thanks to the contribution from @Tlntin in #1637.
- Fixed the error of the Ada traits for `fpA_intB`, thanks to the contribution from @JamesTheZ in #1583.
- Updated `examples/qwenvl/requirements.txt`, thanks to the contribution from @ngoanpv in #1248.
- Fixed rsLoRA scaling in `lora_manager`, thanks to the contribution from @TheCodeWrangler in #1669.
- Fixed a `convert_hf_mpt_legacy` call failure when the function is called outside the global scope, thanks to the contribution from @bloodeagle40234 in #1534.
- Fixed broken outputs with `use_fp8_context_fmha` (#1539).
- Fixed `quantize.py` failing to export important data (e.g. rotary scaling) to `config.json`, thanks to the contribution from @janpetrov in #1676.
- Fixed `shared_embedding_table` not being set when loading Gemma (#1799), thanks to the contribution from @mfuntowicz.
- Fixed the stop and bad words list contiguous offsets in `ModelRunner` (#1815), thanks to the contribution from @Marks101.
- Added the missing `FAST_BUILD` comment at `#endif`, thanks to the support from @lkm2835 in #1851.
- Fixed #1562 (`gptManagerBenchmark` appearing to enter a dead loop with 0% GPU usage when following `benchmarks/cpp/README.md`) and #1552 (`LoRA task 0 not found in cache` assertion in `peftCacheManager.cpp` when processing new requests).

### Infrastructure Changes

- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.05-py3`.
- The base Docker image for the TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.05-py3`.

### Known Issues

- On Windows, importing TensorRT-LLM in Python may fail with `OSError: exception: access violation reading 0x0000000000000000`. This issue is under investigation.