
Update TensorRT-LLM #1763

Merged: 1 commit, Jun 11, 2024

Conversation

kaiyux
Member

@kaiyux kaiyux commented Jun 11, 2024

  • Model Support
    • Support Phi-3-medium models, see examples/phi/README.md
  • Features
    • Added support for quantized base model and FP16/BF16 LoRA.
  • API
    • [BREAKING CHANGE] max_batch_size in trtllm-build command is 256 by default now.
    • [BREAKING CHANGE] max_num_tokens in trtllm-build command is 8192 by default now.
    • [BREAKING CHANGE] api in gptManagerBenchmark command is executor by default now.
    • [BREAKING CHANGE] Added a bias argument to the LayerNorm module and added support for layer normalization without bias.
    • [BREAKING CHANGE] Refactored LLM.generate() API.
      • Removed SamplingConfig
      • Added SamplingParams with some sampling parameters, see tensorrt_llm/hlapi/utils.py
      • Use SamplingParams instead of SamplingConfig in the LLM.generate() API, see examples/high-level-api/README.md
    • [BREAKING CHANGE] Refactored GptManager API
      • Moved maxBeamWidth into TrtGptModelOptionalParams
      • Moved schedulerConfig into TrtGptModelOptionalParams
  • Bug fixes
  • Performance
    • Low latency optimization
      • Added a reduce-norm feature that fuses the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel; enabling it is recommended when the batch size is small and the generation phase dominates runtime.
      • Added FP8 support to the GEMM plugin, which benefits the cases when batch size is smaller than 4.
  • Documentation
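To illustrate the refactored LLM.generate() API described above, here is a minimal migration sketch. It assumes the hlapi import path and the parameter names shown (max_new_tokens, temperature, top_p) as well as a placeholder model path; the exact names should be verified against examples/high-level-api/README.md and tensorrt_llm/hlapi/utils.py.

```python
# Hypothetical sketch of the refactored LLM.generate() API, where
# SamplingParams replaces the removed SamplingConfig. Import path and
# parameter names are assumptions; check examples/high-level-api/README.md.
from tensorrt_llm.hlapi import LLM, SamplingParams

llm = LLM(model="path/to/model")  # placeholder model path

# Before this release: llm.generate(prompts, sampling_config=SamplingConfig(...))
# After this release: pass a SamplingParams object instead.
params = SamplingParams(max_new_tokens=64, temperature=0.8, top_p=0.95)
for output in llm.generate(["Hello, my name is"], params):
    print(output)
```

This sketch requires a GPU and a built engine or checkpoint to run, so it is intended only as a shape-of-the-API reference, not a drop-in script.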

@pfk-beta

Hi, thanks for your hard work, by the way. I have spotted a large removal in examples/run.py: db4edea#diff-299cb0140ad8f9d286c86ecc32b793b048531e27570675b94e54b57b66b3d7d5. Is it intended?

@pfk-beta

Sorry for the false alarm; those arguments were moved to utils. I didn't spot it.
