TensorRT-LLM v0.10 update #1734

Merged 1 commit on Jun 5, 2024
Conversation


@kaiyux (Member) commented on Jun 5, 2024

TensorRT-LLM Release 0.10.0

Announcements

  • TensorRT-LLM supports TensorRT 10.0.1 and NVIDIA NGC 24.03 containers.

Key Features and Enhancements

  • The Python high-level API
    • Added embedding parallel, embedding sharing, and fused MLP support.
    • Enabled the usage of the executor API.
  • Added a weight-stripping feature with a new trtllm-refit command. For more information, refer to examples/sample_weight_stripping/README.md.
  • Added a weight-streaming feature. For more information, refer to docs/source/advanced/weight-streaming.md.
  • Enhanced the multiple profiles feature; the --multiple_profiles argument of the trtllm-build command now builds more optimization profiles for better performance.
  • Added FP8 quantization support for Mixtral.
  • Added support for pipeline parallelism for GPT.
  • Optimized applyBiasRopeUpdateKVCache kernel by avoiding re-computation.
  • Reduced overheads between enqueue calls of TensorRT engines.
  • Added support for paged KV cache for enc-dec models. The support is limited to beam width 1.
  • Added W4A(fp)8 CUTLASS kernels for the NVIDIA Ada Lovelace architecture.
  • Added debug options (--visualize_network and --dry_run) to the trtllm-build command to inspect the TensorRT network before building the engine.
  • Integrated the new NVIDIA Hopper XQA kernels for the LLaMA 2 70B model.
  • Improved the performance of pipeline parallelism when in-flight batching is enabled.
  • Supported quantization for Nemotron models.
  • Added LoRA support for Mixtral and Qwen.
  • Added in-flight batching support for ChatGLM models.
  • Added support to ModelRunnerCpp so that it runs with the executor API for IFB-compatible models.
  • Enhanced the custom AllReduce by adding a heuristic: it falls back to the native NCCL kernel when hardware requirements are not satisfied, to get the best performance.
  • Optimized the performance of the checkpoint conversion process for LLaMA.
  • Benchmark
    • [BREAKING CHANGE] Moved the request rate generation arguments and logic from the dataset preparation script to gptManagerBenchmark.
    • Enabled streaming and added support for Time To First Token (TTFT) latency and Inter-Token Latency (ITL) metrics in gptManagerBenchmark.
    • Added the --max_attention_window option to gptManagerBenchmark.
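
The streaming metrics added to gptManagerBenchmark can be stated precisely: TTFT is the delay from request submission to the first streamed token, and ITL is the mean gap between consecutive tokens after the first. The following is a minimal Python sketch of that arithmetic, illustrative only and not the gptManagerBenchmark implementation:

```python
# Illustrative only: how TTFT and ITL can be computed from per-token
# timestamps collected during a streaming run. This mirrors the metric
# definitions, not the gptManagerBenchmark code.

def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) in the same time unit as the inputs.

    TTFT: delay from request submission to the first streamed token.
    ITL:  mean gap between consecutive tokens after the first.
    """
    if not token_times:
        raise ValueError("no tokens were streamed")
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, 0.0
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, sum(gaps) / len(gaps)

# Example: request submitted at t=0.0s, tokens arrive at 0.12s, 0.15s, 0.18s, 0.21s
ttft, itl = ttft_and_itl(0.0, [0.12, 0.15, 0.18, 0.21])
```

Measured this way, TTFT captures queueing plus prefill cost, while ITL reflects steady-state decoding speed.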

API Changes

  • [BREAKING CHANGE] Set the default tokens_per_block argument of the trtllm-build command to 64 for better performance.
  • [BREAKING CHANGE] Migrated enc-dec models to the unified workflow.
  • [BREAKING CHANGE] Renamed GptModelConfig to ModelConfig.
  • [BREAKING CHANGE] Added speculative decoding mode to the builder API.
  • [BREAKING CHANGE] Refactored scheduling configurations
    • Unified the SchedulerPolicy with the same name in batch_scheduler and executor, and renamed it to CapacitySchedulerPolicy.
    • Expanded the existing scheduling configuration from SchedulerPolicy to SchedulerConfig to enhance extensibility. The latter also introduces a chunk-based configuration called ContextChunkingPolicy.
  • [BREAKING CHANGE] The input prompt was removed from the generation output in the generate() and generate_async() APIs. For example, given the prompt A B, the generation result used to be <s>A B C D E, where only C D E is the actual output; now the result is just C D E.
  • [BREAKING CHANGE] Switched default add_special_token in the TensorRT-LLM backend to True.
  • Deprecated GptSession and TrtGptModelV1.
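
One practical consequence of the generate()/generate_async() change above: client code that previously stripped the echoed prompt from the result must skip that step from v0.10 on. A hedged sketch of the adaptation, where extract_completion is a hypothetical client-side helper and not part of the TensorRT-LLM API:

```python
# Hypothetical client-side adaptation for the generate() output change.
# Before v0.10 the result echoed the prompt ("<s>A B C D E"); from v0.10
# it contains only the newly generated tokens ("C D E").

def extract_completion(prompt: str, result: str, echoes_prompt: bool) -> str:
    """Strip the echoed prompt only when the runtime still returns it."""
    if not echoes_prompt:
        return result  # v0.10 behavior: result is already the completion
    # Pre-v0.10 behavior: drop the special token and the leading prompt.
    stripped = result.removeprefix("<s>").lstrip()
    return stripped.removeprefix(prompt).lstrip()

old_style = extract_completion("A B", "<s>A B C D E", echoes_prompt=True)
new_style = extract_completion("A B", "C D E", echoes_prompt=False)
```

Both calls yield the same completion, so a single flag (or a version check) is enough to keep one code path working across the upgrade.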

Model Updates

  • Support DBRX
  • Support Qwen2
  • Support CogVLM
  • Support ByT5
  • Support LLaMA 3
  • Support Arctic (w/ FP8)
  • Support Fuyu
  • Support Persimmon
  • Support Deplot
  • Support Phi-3-Mini with long Rope
  • Support Neva
  • Support Kosmos-2
  • Support RecurrentGemma

Fixed Issues

Infrastructure changes

  • Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.03-py3.
  • Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.03-py3.
  • The dependent TensorRT version is updated to 10.0.1.
  • The dependent CUDA version is updated to 12.4.0.
  • The dependent PyTorch version is updated to 2.2.2.

@kaiyux kaiyux merged commit 9bd15f1 into rel Jun 5, 2024
@kaiyux kaiyux deleted the preview/rel branch June 5, 2024 12:43