TensorRT-LLM v0.11 Update #1969

Merged: 1 commit merged into rel on Jul 17, 2024
Conversation

@kaiyux (Member) commented Jul 17, 2024

TensorRT-LLM Release 0.11.0

Key Features and Enhancements

  • Supported very long context for LLaMA (see “Long context evaluation” section in examples/llama/README.md).
  • Low latency optimization
    • Added a reduce-norm feature that aims to fuse the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel; enabling it is recommended when the batch size is small and the generation phase dominates the runtime.
    • Added FP8 support to the GEMM plugin, which benefits the cases when batch size is smaller than 4.
    • Added a fused GEMM-SwiGLU plugin for FP8 on SM90.
  • LoRA enhancements
    • Supported running FP8 LLaMA with FP16 LoRA checkpoints.
    • Added support for quantized base model and FP16/BF16 LoRA.
      • SmoothQuant OOTB (INT8 A/W) + FP16/BF16/FP32 LoRA
      • INT8/INT4 Weight-Only (INT8/INT4 W) + FP16/BF16/FP32 LoRA
      • Weight-Only Group-wise + FP16/BF16/FP32 LoRA
    • Added LoRA support to Qwen2, see “Run models with LoRA” section in examples/qwen/README.md.
    • Added support for Phi-3-mini/small FP8 base + FP16/BF16 LoRA, see “Run Phi-3 with LoRA” section in examples/phi/README.md.
    • Added support for starcoder-v2 FP8 base + FP16/BF16 LoRA, see “Run StarCoder2 with LoRA” section in examples/gpt/README.md.
  • Encoder-decoder models C++ runtime enhancements
  • Supported INT8 quantization with embedding layer excluded.
  • Updated default model for Whisper to distil-whisper/distil-large-v3, thanks to the contribution from @IbrahimAmin1 in [feat]: Add Option to convert and run distil-whisper large-v3 #1337.
  • Supported automatic HuggingFace model download for the Python high-level API.
  • Supported explicit draft tokens for in-flight batching.
  • Supported local custom calibration datasets, thanks to the contribution from @DreamGenX in Support custom calibration datasets #1762.
  • Added batched logits post processor.
  • Added Hopper qgmma kernel to XQA JIT codepath.
  • Supported tensor parallelism and expert parallelism enabled together for MoE.
  • Supported pipeline parallelism cases where the number of layers is not divisible by the PP size.
  • Added numQueuedRequests to the iteration stats log of the executor API.
  • Added iterLatencyMilliSec to the iteration stats log of the executor API.
  • Added a HuggingFace model zoo from the community, thanks to the contribution from @matichon-vultureprime in Add Huggingface model zoo from community #1674.

API Changes

  • [BREAKING CHANGE] trtllm-build command
    • Migrated Whisper to unified workflow (trtllm-build command), see documents: examples/whisper/README.md.
    • The default max_batch_size of the trtllm-build command is now 256.
    • The default max_num_tokens of the trtllm-build command is now 8192.
    • Deprecated max_output_len and added max_seq_len.
    • Removed unnecessary --weight_only_precision argument from trtllm-build command.
    • Removed attention_qk_half_accumulation argument from trtllm-build command.
    • Removed use_context_fmha_for_generation argument from trtllm-build command.
    • Removed strongly_typed argument from trtllm-build command.
    • The default value of max_seq_len now reads from the HuggingFace model config.
  • C++ runtime
    • [BREAKING CHANGE] Renamed free_gpu_memory_fraction in ModelRunnerCpp to kv_cache_free_gpu_memory_fraction.
    • [BREAKING CHANGE] Refactored GptManager API
      • Moved maxBeamWidth into TrtGptModelOptionalParams.
      • Moved schedulerConfig into TrtGptModelOptionalParams.
    • Added more options to ModelRunnerCpp, including max_tokens_in_paged_kv_cache, kv_cache_enable_block_reuse and enable_chunked_context; see the sketch after this list.
  • [BREAKING CHANGE] Python high-level API
    • Removed the ModelConfig class; all of its options moved to the LLM class.
    • Refactored the LLM class; please refer to examples/high-level-api/README.md and the usage sketch after this list.
      • Moved the most commonly used options into the explicit argument list and hid the expert options in the kwargs.
      • Exposed model to accept either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
      • Supported downloading models from the HuggingFace model hub; currently only Llama variants are supported.
      • Supported a build cache that reuses built TensorRT-LLM engines, enabled by setting the environment variable TLLM_HLAPI_BUILD_CACHE=1 or passing enable_build_cache=True to the LLM class.
      • Exposed low-level options including BuildConfig, SchedulerConfig and so on in the kwargs, so that details of the build and runtime phases can be configured.
    • Refactored LLM.generate() and LLM.generate_async() API.
      • Removed SamplingConfig.
      • Added SamplingParams with more extensive parameters, see tensorrt_llm/hlapi/utils.py.
        • The new SamplingParams contains and manages fields from Python bindings of SamplingConfig, OutputConfig, and so on.
      • Refactored LLM.generate() output as RequestOutput, see tensorrt_llm/hlapi/llm.py.
    • Updated the apps examples, notably by rewriting both chat.py and fastapi_server.py using the LLM APIs; please refer to examples/apps/README.md for details.
      • Updated the chat.py to support multi-turn conversation, allowing users to chat with a model in the terminal.
      • Fixed fastapi_server.py and eliminated the need for mpirun in multi-GPU scenarios.
  • [BREAKING CHANGE] Speculative decoding configurations unification
    • Introduced SpeculativeDecodingMode.h to choose between different speculative decoding techniques.
    • Introduced SpeculativeDecodingModule.h, a base class for speculative decoding techniques.
    • Removed decodingMode.h.
  • gptManagerBenchmark
    • [BREAKING CHANGE] The api option of the gptManagerBenchmark command now defaults to executor.
    • Added a runtime max_batch_size.
    • Added a runtime max_num_tokens.
  • [BREAKING CHANGE] Added a bias argument to the LayerNorm module and supported non-bias layer normalization; see the sketch after this list.
  • [BREAKING CHANGE] Removed GptSession Python bindings.
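
The renamed and newly added ModelRunnerCpp options mentioned above can be combined as in the following minimal sketch. It assumes ModelRunnerCpp.from_dir from tensorrt_llm.runtime accepts these options as keyword arguments; the engine path and concrete values are placeholders.

```python
from tensorrt_llm.runtime import ModelRunnerCpp

# Hypothetical engine directory produced by `trtllm-build`.
engine_dir = "/workspace/llama-7b-engine"

runner = ModelRunnerCpp.from_dir(
    engine_dir=engine_dir,
    # Renamed from free_gpu_memory_fraction in this release.
    kv_cache_free_gpu_memory_fraction=0.85,
    # Newly exposed runtime options listed above.
    max_tokens_in_paged_kv_cache=8192,
    kv_cache_enable_block_reuse=True,
    enable_chunked_context=True,
)
# runner.generate() then consumes tokenized prompts as usual (omitted here).
```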
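
A minimal usage sketch of the refactored Python high-level API described above. It assumes LLM and SamplingParams are importable from tensorrt_llm.hlapi (their definitions live in tensorrt_llm/hlapi/llm.py and tensorrt_llm/hlapi/utils.py); the model name, sampling values, and RequestOutput attribute access are illustrative assumptions, not the definitive interface.

```python
import os

# Optional: reuse previously built engines via the build cache
# (equivalent to passing enable_build_cache=True to LLM).
os.environ["TLLM_HLAPI_BUILD_CACHE"] = "1"

from tensorrt_llm.hlapi import LLM, SamplingParams

# `model` accepts a HuggingFace model name (hub download currently supports
# Llama variants only), a local HuggingFace model, a TensorRT-LLM checkpoint,
# or a prebuilt TensorRT-LLM engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# SamplingParams replaces the removed SamplingConfig.
params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() returns RequestOutput objects; their exact fields are defined in
# tensorrt_llm/hlapi/llm.py, so the attribute access below is an assumption.
for output in llm.generate(["What is TensorRT-LLM?"], params):
    print(output.outputs[0].text)
```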
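
As a small illustration of the new bias argument on the LayerNorm module, here is a hedged sketch; the import path and hidden size are assumptions.

```python
from tensorrt_llm.layers import LayerNorm

# bias=False selects the newly supported non-bias layer normalization;
# bias=True keeps the previous behavior. 4096 is a placeholder hidden size.
norm = LayerNorm(normalized_shape=4096, bias=False)
```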

Model Updates

  • Supported Jais, see examples/jais/README.md.
  • Supported DiT, see examples/dit/README.md.
  • Supported VILA 1.5.
  • Supported Video NeVA, see the “Video NeVA” section in examples/multimodal/README.md.
  • Supported Grok-1, see examples/grok/README.md.
  • Supported Qwen1.5-110B with FP8 PTQ.
  • Supported Phi-3 small model with block sparse attention.
  • Supported InternLM2 7B/20B, thanks to the contribution from @RunningLeon in Support internlm2 #1392.
  • Supported Phi-3-medium models, see examples/phi/README.md.
  • Supported Qwen1.5 MoE A2.7B.
  • Supported Phi-3-vision multimodal.

Fixed Issues

Infrastructure Changes

  • Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.05-py3.
  • Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.05-py3.
  • The dependent TensorRT version is updated to 10.1.0.
  • The dependent CUDA version is updated to 12.4.1.
  • The dependent PyTorch version is updated to 2.3.1.
  • The dependent ModelOpt version is updated to v0.13.0.

Known Issues

  • In a conda environment on Windows, installation of TensorRT-LLM may succeed. However, when importing the library in Python, you may receive an error message of OSError: exception: access violation reading 0x0000000000000000. This issue is under investigation.

kaiyux merged commit 05316d3 into rel on Jul 17, 2024
kaiyux deleted the preview/rel branch on July 17, 2024 12:45
vansangpfiev added a commit to janhq/cortex.tensorrt-llm that referenced this pull request Aug 6, 2024
* TensorRT-LLM v0.10 update

* TensorRT-LLM Release 0.10.0

---------

Co-authored-by: Loki <[email protected]>
Co-authored-by: meghagarwal <[email protected]>

* TensorRT-LLM v0.11 Update (NVIDIA#1969)

* fix: add formatter

* fix: use executor API

* fix: sync

* fix: remove requests thread

* fix: support unload endpoint for server example, handle release resources properly

* refactor: InferenceState

* fix: new line character for Mistral and Openhermes

* fix: add benchmark script

* Add Dockerfile for runner windows (#69)

* Add Dockerfile for runner windows

* Add Dockerfile for linux

* Change CI agent

* fix: build linux (#70)

Co-authored-by: vansangpfiev <[email protected]>

---------

Co-authored-by: Hien To <[email protected]>
Co-authored-by: vansangpfiev <[email protected]>
Co-authored-by: vansangpfiev <[email protected]>

* fix: default batch_size

* chore: only linux build

---------

Co-authored-by: Kaiyu Xie <[email protected]>
Co-authored-by: Loki <[email protected]>
Co-authored-by: meghagarwal <[email protected]>
Co-authored-by: sangjanai <[email protected]>
Co-authored-by: hiento09 <[email protected]>
Co-authored-by: Hien To <[email protected]>