TensorRT-LLM Release 0.10.0

Announcements

Key Features and Enhancements
- Enabled the usage of the `executor` API.
- Added a weight-stripping feature with a new `trtllm-refit` command. For more information, refer to `examples/sample_weight_stripping/README.md`.
- Added a weight-streaming feature. For more information, refer to `docs/source/advanced/weight-streaming.md`.
- Enhanced the multiple-profiles feature; the `--multiple_profiles` argument of the `trtllm-build` command now builds more optimization profiles for better performance.
- Optimized the `applyBiasRopeUpdateKVCache` kernel by avoiding re-computation.
- Reduced the overheads between `enqueue` calls of TensorRT engines.
- Added debug options (`--visualize_network` and `--dry_run`) to the `trtllm-build` command to visualize the TensorRT network before engine build.
- Improved `ModelRunnerCpp` so that it runs with the `executor` API for in-flight-batching-compatible (IFB-compatible) models.
- Enhanced the custom `AllReduce` with a heuristic: it falls back to the native NCCL kernel when the hardware requirements are not satisfied, to get the best performance.
- Moved the request-rate generation arguments and logic from the dataset preparation script to `gptManagerBenchmark`.
- Enabled streaming and added `Time To First Token (TTFT)` latency and `Inter-Token Latency (ITL)` metrics to `gptManagerBenchmark`.
- Added the `--max_attention_window` option to `gptManagerBenchmark`.
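The TTFT and ITL metrics above have standard definitions: TTFT is the delay from request submission to the first output token, and ITL is the average gap between consecutive output tokens. The following is a minimal pure-Python sketch of those definitions from per-token timestamps; the function name and timestamp format are illustrative, not part of `gptManagerBenchmark`.

```python
def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Compute Time To First Token (TTFT) and mean Inter-Token Latency (ITL).

    request_start: wall-clock time (seconds) at which the request was submitted.
    token_times:   wall-clock times at which each streamed output token arrived.
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    # TTFT: delay until the first token arrives.
    ttft = token_times[0] - request_start
    # ITL: mean gap between consecutive tokens after the first.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Example: request submitted at t=0.0, tokens streamed at 0.5s, 0.6s, 0.7s, 0.9s.
ttft, itl = ttft_and_itl(0.0, [0.5, 0.6, 0.7, 0.9])
print(ttft)  # TTFT is 0.5 s
```

Measuring ITL only over the post-first-token gaps keeps the two metrics independent: a slow prefill inflates TTFT without distorting ITL.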
API Changes
- Set the default `tokens_per_block` argument of the `trtllm-build` command to 64 for better performance.
- Renamed `GptModelConfig` to `ModelConfig`.
- Unified the `SchedulerPolicy` of the same name in `batch_scheduler` and `executor`, and renamed it to `CapacitySchedulerPolicy`.
- Expanded the existing scheduling configuration from `SchedulerPolicy` to `SchedulerConfig` to enhance extensibility. The latter also introduces a chunk-based configuration called `ContextChunkingPolicy`.
- Removed the input prompt from the generation output of the `generate()` and `generate_async()` APIs. For example, given the prompt `A B`, the generation result used to be `<s>A B C D E`, where only `C D E` is the actual output; the result is now `C D E`.
- Switched the default `add_special_token` in the TensorRT-LLM backend to `True`.
- Deprecated `GptSession` and `TrtGptModelV1`.
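The prompt-removal change to `generate()`/`generate_async()` can be pictured with a small pure-Python sketch. The helper below is illustrative only (real generation results are token IDs, not strings), using the `A B` example from the notes above.

```python
def strip_prompt(prompt_len: int, old_style_output: list[str]) -> list[str]:
    """Illustrate the new output convention: return only the newly
    generated tokens, not the BOS token and echoed prompt."""
    # Old behavior returned e.g. ["<s>", "A", "B", "C", "D", "E"]:
    # one BOS token plus the prompt, followed by the actual completion.
    # New behavior returns only the completion.
    return old_style_output[1 + prompt_len:]  # skip "<s>" and the prompt

old_result = ["<s>", "A", "B", "C", "D", "E"]   # prompt "A B", completion "C D E"
new_result = strip_prompt(prompt_len=2, old_style_output=old_result)
print(new_result)  # ['C', 'D', 'E']
```

Callers that previously sliced the prompt off the output themselves should drop that slicing after upgrading, or the first `prompt_len + 1` completion tokens will be lost.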
.Model Updates
Fixed Issues
- Fixed a segmentation fault when using pipeline parallelism with `gather_all_token_logits`. (#1284)
- Fixed an accuracy issue (large result differences, e.g. with Flan-T5 XXL) in the `gpt_attention_plugin` for enc-dec models. (#1343)

Infrastructure changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.03-py3`.
- The base Docker image for the TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.03-py3`.