Update TensorRT-LLM #1274

kaiyux · 2024-03-12T07:54:18Z

Model Support
- Support VILA (see “LLaVA and VILA” section in examples/multimodal/README.md)
Features
- Support loading Gemma from Hugging Face
- Support context chunking together with KV cache reuse
- Support auto parallelism planner for high-level API and unified builder workflow
- Enable multi-LoRA for BART LoRA
API
- [BREAKING CHANGE] Remove model parameter from gptManagerBenchmark and gptSessionBenchmark
Bug fixes
- Fix ChatGLM2-6B building failure on INT8 (chatglm2-6b int8+kv8 build failed on 0.8.0 branch #1239)
- Fix wrong relative path in Baichuan documentation (Incorrect documentation in examples /baichuan/ #1242)
Performance
- Remove router tensor parallelism to improve performance for MoE models, thanks to the contribution from @megha95 (moe router tp removed #1091)
Infra
- TensorRT dependency is updated to 9.3.
- Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.01-py3
- Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.01-py3

yiakwy-xpu-ml-framework-team · 2024-07-23T09:06:32Z

tensorrt_llm/auto_parallel/auto_parallel.py

+    return new_network
+
+
+def find_solution(


Hi, almost a year ago we also wrote separate Python and C++ implementations to find pipeline partitions on the ONNX IR and on a C++ IR (where we can get more accurate memory estimates).

Since TensorRT-LLM now uses its own memory pool to manage memory allocation, I guess the estimation will be better now.

The method we used (I also wrote a solver with dynamic programming) is multi-level graph based.

The problem with running a solver on the ONNX IR graph is that you cannot always guarantee the same topological order in the frontend and the backend. I guess you will have the same problem.

Obviously one could start with a 2-device partition, where the answer is simple (that was the initial reason why I used dynamic programming), and then extend the solution to a mesh of 4, 8, or more devices.

However, iterating over the nodes is time-consuming, so we had to do it in the C++ backend with heuristic search.
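To make the dynamic-programming idea concrete, here is a minimal sketch, not the solver described above and not TensorRT-LLM's auto-parallel code. It assumes the graph has already been flattened into a chain of nodes in a fixed topological order, that each node has one scalar cost, and that the goal is to minimize the cost of the most loaded stage. The name `partition_chain` and the cost model are hypothetical.

```python
from itertools import accumulate


def partition_chain(costs, num_stages):
    """Split `costs` into `num_stages` contiguous, non-empty stages.

    Returns (bottleneck, starts): the smallest achievable maximum stage
    cost and the start index of every stage. Assumes nodes are already
    in a fixed topological order (a simplification of a real graph).
    """
    n = len(costs)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and len(costs)")

    # prefix[i] is the total cost of the first i nodes.
    prefix = [0, *accumulate(costs)]
    inf = float("inf")

    # best[k][i]: minimal bottleneck when the first i nodes use k stages.
    # cut[k][i]: start index of the last stage in that optimal split.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0

    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                # Last stage covers nodes j..i-1; the rest uses k-1 stages.
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the recorded cuts backwards to recover the stage boundaries.
    starts = []
    i = n
    for k in range(num_stages, 0, -1):
        i = cut[k][i]
        starts.append(i)
    starts.reverse()
    return best[num_stages][n], starts
```

For example, `partition_chain([4, 2, 3, 7, 1, 5], 2)` returns `(13, [0, 3])`: the first stage holds nodes 0 to 2 and the second holds nodes 3 to 5. The triple loop costs O(num_stages · n²), which hints at why iterating over every node becomes expensive on large graphs.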

I wonder whether you ran into these problems, and how you handled them? And what compile time do you see for, say, a graph of 1000 nodes?

@Shixiaowei02 @kaiyux

Update TensorRT-LLM

ca2e9bd

Shixiaowei02 force-pushed the kaiyu/update branch 2 times, most recently from a3cbbf6 to df77e19 Compare March 12, 2024 09:04

update

e30cf5b

Shixiaowei02 force-pushed the kaiyu/update branch from df77e19 to e30cf5b Compare March 12, 2024 09:05

Shixiaowei02 approved these changes Mar 12, 2024

kaiyux merged commit 4bb65f2 into main Mar 12, 2024

kaiyux deleted the kaiyu/update branch March 12, 2024 10:15

hademircii mentioned this pull request Mar 12, 2024

Import Error: ModuleNotFoundError: No module named 'tensorrt_llm.lora_manager' #1289

Closed

4 tasks

This was referenced Mar 18, 2024

OOM when using quantize.py to quantize llama-like model #1285

Closed

Assertion failed: Failed to deserialize cuda engine #1324

Closed

yiakwy-xpu-ml-framework-team reviewed Jul 23, 2024
