Update on the development branch #1765

kaiyux · 2024-06-11T09:11:15Z

kaiyux
Jun 11, 2024
Maintainer

Hi,

The TensorRT-LLM team is pleased to announce that we are pushing an update to the development branch (and the Triton backend) this June 11, 2024.

This update includes:

Model Support
- Support Phi-3-medium models, see examples/phi/README.md
Features
- Added support for quantized base model and FP16/BF16 LoRA.
API
- [BREAKING CHANGE] max_batch_size in trtllm-build command is 256 by default now.
- [BREAKING CHANGE] max_num_tokens in trtllm-build command is 8192 by default now.
- [BREAKING CHANGE] api in gptManagerBenchmark command is executor by default now.
- [BREAKING CHANGE] Added a bias argument to the LayerNorm module, and supports non-bias layer normalization.
- [BREAKING CHANGE] Refactored LLM.generate() API.
  - Removed SamplingConfig
  - Added SamplingParams with some sampling parameters, see tensorrt_llm/hlapi/utils.py
  - Use SamplingParams instead of SamplingConfigin LLM.generate() API, see examples/high-level-api/README.md
- [BREAKING CHANGE]: Refactored GptManager API
  - Move maxBeamWidth into TrtGptModelOptionalParams
  - Move schedulerConfig into TrtGptModelOptionalParams
Bug fixes
- Fixed convert_hf_mpt_legacy call failure when the function is called in other than global scope, thanks to the contribution from @bloodeagle40234 in Define hf_config explisitly for convert_hf_mpt_legacy #1534.
- Fixed use_fp8_context_fmha broken outputs (use_fp8_context_fmha broken outputs #1539).
- Fixed pre-norm weight conversion for NMT models, thanks to the contribution from @Pzzzzz5142 in Fix pre-norm weight conversion for nmt #1723.
- Fixed random seed initialization issue, thanks to the contribution from @pathorn in Reference input randomSeeds by idx rather than batchSlot #1742.
- Fixed stop words and bad words in python bindings. (failed to use "stop_words_list" for tensorrt-llm==0.9.0 #1642)
Performance
- Low latency optimization
  - Added a reduce-norm feature which aims to fuse the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel, which is recommended to be enabled when the batch size is small and the generation phase time is dominant.
  - Added FP8 support to the GEMM plugin, which benefits the cases when batch size is smaller than 4.
Documentation
- Added --ipc=host notes to installation guide to prevent bus error, see docs/source/installation/build-from-source-linux.md and docs/source/installation/linux.md (Bus error running t5 conversion script using the latest main #1538)

Thanks,
The TensorRT-LLM Engineering Team

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update on the development branch #1765

{{title}}

Replies: 0 comments

Select a reply

Update on the development branch #1765

kaiyux Jun 11, 2024 Maintainer

Replies: 0 comments

kaiyux
Jun 11, 2024
Maintainer