This document describes how to run multiple instances of the LLaMa model on a single GPU or on multiple GPUs of the same machine. The guide focuses on the following scenarios:
- Running multiple instances of LLaMa model on a single GPU.
- Running multiple instances of LLaMa model on multiple GPUs:
a. Using Orchestrator mode.
b. Using Leader mode.
- Set up the model repository as described in LLaMa Guide.
- Increase the number of instances for the instance_group parameter for the tensorrt_llm model.
- Start the Triton server:
# Replace the <gpu> with the gpu you want to use for this model.
CUDA_VISIBLE_DEVICES=<gpu> tritonserver --model-repository `pwd`/llama_ifb &
This creates multiple instances of the tensorrt_llm model, all running on the same GPU.
Note
Running multiple instances of a single model is generally not recommended. If you choose to do this, you need to ensure the GPU has enough resources for multiple copies of a single model. The performance implications of running multiple models on the same GPU are unpredictable.
Note
For production deployments, make sure to adjust the max_tokens_in_paged_kv_cache parameter; otherwise you may run out of GPU memory, because by default TensorRT-LLM may use 90% of GPU memory for the KV cache of each model instance. Additionally, if you rely on kv_cache_free_gpu_mem_fraction, the memory allocated to each instance will depend on the order in which instances are loaded.
- Run the test client to measure performance:
python3 tools/inflight_batcher_llm/end_to_end_test.py --dataset ci/L0_backend_trtllm/simple_data.json --max-input-len 500
If you plan to use the BLS version instead of the ensemble model, you might also need to adjust the number of model instances for the tensorrt_llm_bls model. The default value only allows a single request for the whole pipeline, which might increase the latency and reduce the throughput.
- Kill the server:
pgrep tritonserver | xargs kill
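To make the KV-cache note above concrete, here is a rough sizing sketch for splitting one GPU's KV-cache budget between instances. It assumes the published LLaMA-2-7B shape (32 layers, 32 KV heads, head dimension 128) and an FP16 KV cache. The helper names and the 16 GiB budget are hypothetical, and the result is an estimate, not a value reported by TensorRT-LLM.

```python
# Rough KV-cache sizing sketch. All names here are illustrative and are
# not part of TensorRT-LLM or the Triton backend.


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_tokens_per_instance(kv_budget_bytes, num_instances, bytes_per_token):
    """Split a single GPU's KV-cache budget evenly across instances, in tokens."""
    return kv_budget_bytes // (num_instances * bytes_per_token)


# Assumed LLaMA-2-7B shape: 32 layers, 32 KV heads, head_dim 128, FP16 cache.
LLAMA2_7B_BYTES_PER_TOKEN = kv_bytes_per_token(32, 32, 128)

if __name__ == "__main__":
    budget = 16 * 1024**3  # hypothetical: 16 GiB of the GPU set aside for KV cache
    print(LLAMA2_7B_BYTES_PER_TOKEN)  # 524288 bytes, i.e. 0.5 MiB per token
    print(max_tokens_per_instance(budget, 2, LLAMA2_7B_BYTES_PER_TOKEN))  # 16384
```

The token count is a candidate value for max_tokens_in_paged_kv_cache on each instance. Model weights and activations also need GPU memory, so the budget should leave headroom for them.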
Unlike other Triton backend models, the TensorRT-LLM backend does not support using the instance_group setting to determine the placement of model instances on different GPUs. In this section, we demonstrate how you can use Leader Mode and Orchestrator Mode to run multiple instances of a LLaMa model on different GPUs.
For this section, let's assume that we have four GPUs and the CUDA device ids are 0, 1, 2, and 3. We will be launching two instances of the LLaMa2-7b model with tensor parallelism equal to 2. The first instance will run on GPUs 0 and 1 and the second instance will run on GPUs 2 and 3.
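The GPU layout described above can be sketched as a small partitioning helper. This is only an illustration of the assumed scheme (contiguous blocks of device ids, one block per instance); the function names are hypothetical and nothing in the backend requires this particular layout.

```python
from __future__ import annotations


def gpu_ids_for_instance(instance: int, tp_size: int) -> list[int]:
    """Contiguous block of CUDA device ids for a zero-based instance index."""
    start = instance * tp_size
    return list(range(start, start + tp_size))


def plan_instances(num_gpus: int, tp_size: int) -> list[list[int]]:
    """Partition device ids 0..num_gpus-1 into groups of tp_size GPUs."""
    if tp_size <= 0 or num_gpus <= 0 or num_gpus % tp_size != 0:
        raise ValueError("num_gpus must be a positive multiple of tp_size")
    return [gpu_ids_for_instance(i, tp_size) for i in range(num_gpus // tp_size)]


if __name__ == "__main__":
    # Four GPUs with tensor parallelism 2 gives two instances.
    for i, ids in enumerate(plan_instances(num_gpus=4, tp_size=2)):
        print(f"instance {i}: CUDA devices {','.join(map(str, ids))}")
```

With four GPUs and tensor parallelism 2, the first instance gets devices 0,1 and the second gets 2,3, which are the values used for CUDA_VISIBLE_DEVICES and gpu_device_ids later in this guide.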
- Create the engines:
# Update if the model is not available in huggingface cache
export HF_LLAMA_MODEL=`python3 -c "from pathlib import Path; from huggingface_hub import hf_hub_download; print(Path(hf_hub_download('meta-llama/Llama-2-7b-hf', filename='config.json')).parent)"`
export UNIFIED_CKPT_PATH=/tmp/ckpt/llama/7b-2tp-2gpu/
export ENGINE_PATH=/tmp/engines/llama/7b-2tp-2gpu/
# Create the checkpoint
python tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
--output_dir ${UNIFIED_CKPT_PATH} \
--dtype float16 \
--tp_size 2
# Build the engines
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
--remove_input_padding enable \
--gpt_attention_plugin float16 \
--context_fmha enable \
--gemm_plugin float16 \
--output_dir ${ENGINE_PATH} \
--paged_kv_cache enable \
--max_batch_size 64
- Set up the model repository:
# Setup the model repository for the first instance.
cp all_models/inflight_batcher_llm/ llama_ifb -r
python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
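The substitution argument passed to fill_template.py above is a long name:value,name:value string that is easy to mistype. A minimal sketch for building it from a dictionary is shown below. The helper name is hypothetical, and rejecting values that contain a comma is a choice made for this sketch, not documented behaviour of the script.

```python
def format_substitutions(values):
    """Join key/value pairs into the name:value,name:value form used above."""
    parts = []
    for key, value in values.items():
        text = str(value)
        # A comma would be read as the start of the next pair, so refuse it here.
        if "," in text:
            raise ValueError(f"value for {key!r} contains a comma: {text!r}")
        parts.append(f"{key}:{text}")
    return ",".join(parts)


if __name__ == "__main__":
    # Values copied from the tensorrt_llm_bls command above.
    bls = {
        "triton_max_batch_size": 64,
        "decoupled_mode": False,
        "bls_instance_count": 1,
        "accumulate_tokens": False,
    }
    print(format_substitutions(bls))
```

The printed string can be pasted as the last argument of the corresponding fill_template.py command.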
For leader mode, we will launch two separate mpirun commands, one per model instance. Each command starts one Triton server process for each of its two GPUs (4 Triton server processes in total). We also need a reverse proxy in front of them to load balance the requests between the two instances.
3a. Launch the servers:
CUDA_VISIBLE_DEVICES=0,1 python3 scripts/launch_triton_server.py --world_size 2 --model_repo=llama_ifb/ --http_port 8000 --grpc_port 8001 --metrics_port 8004
CUDA_VISIBLE_DEVICES=2,3 python3 scripts/launch_triton_server.py --world_size 2 --model_repo=llama_ifb/ --http_port 8002 --grpc_port 8003 --metrics_port 8005
4a. Install NGINX:
apt update
apt install nginx -y
5a. Set up the NGINX configuration and store it in /etc/nginx/sites-available/tritonserver:
upstream tritonserver {
server localhost:8000;
server localhost:8002;
}
server {
listen 8080;
location / {
proxy_pass http://tritonserver;
}
}
6a. Create a symlink and restart NGINX to enable the configuration:
ln -s /etc/nginx/sites-available/tritonserver /etc/nginx/sites-enabled/tritonserver
service nginx restart
7a. Run the test client to measure performance:
pip3 install tritonclient[all]
# Test the load on all the servers
python3 tools/inflight_batcher_llm/end_to_end_test.py --dataset ci/L0_backend_trtllm/simple_data.json --max-input-len 500 -u localhost:8080
# Test the load on one of the servers
python3 tools/inflight_batcher_llm/end_to_end_test.py --dataset ci/L0_backend_trtllm/simple_data.json --max-input-len 500 -u localhost:8000
8a. Kill the server:
pgrep mpirun | xargs kill
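The two launch commands in step 3a follow a regular pattern, which the sketch below generalizes to any number of instances. It only builds the environment and argument lists and does not start anything. The function name and the port numbering scheme are assumptions chosen to reproduce the ports used above (HTTP and gRPC ports interleaved, metrics ports after them); the flags are the ones shown in step 3a.

```python
def leader_launch_plan(num_instances, tp_size, model_repo="llama_ifb/", base_port=8000):
    """Return one (env, argv) pair per tensor-parallel server group."""
    plan = []
    for i in range(num_instances):
        gpus = range(i * tp_size, (i + 1) * tp_size)
        env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}
        argv = [
            "python3", "scripts/launch_triton_server.py",
            "--world_size", str(tp_size),
            f"--model_repo={model_repo}",
            "--http_port", str(base_port + 2 * i),
            "--grpc_port", str(base_port + 2 * i + 1),
            # Metrics ports are numbered after all HTTP/gRPC ports.
            "--metrics_port", str(base_port + 2 * num_instances + i),
        ]
        plan.append((env, argv))
    return plan


if __name__ == "__main__":
    for env, argv in leader_launch_plan(num_instances=2, tp_size=2):
        prefix = " ".join(f"{k}={v}" for k, v in env.items())
        print(prefix, " ".join(argv))
```

For two instances with tensor parallelism 2 this prints the same two commands as step 3a. The HTTP ports it assigns (8000 and 8002) are the ones that belong in the NGINX upstream block.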
In orchestrator mode, we will create a copy of the TensorRT-LLM model and use the gpu_device_ids field to specify which GPUs should be used by each model instance. Then, we need to modify the client to distribute the requests between the different models.
3b. Create a copy of the tensorrt_llm model:
cp llama_ifb/tensorrt_llm llama_ifb/tensorrt_llm_2 -r
4b. Modify the gpu_device_ids field in each config file to specify which GPUs should be used by each model:
sed -i 's/\${gpu_device_ids}/0,1/g' llama_ifb/tensorrt_llm/config.pbtxt
sed -i 's/\${gpu_device_ids}/2,3/g' llama_ifb/tensorrt_llm_2/config.pbtxt
sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_2"/g' llama_ifb/tensorrt_llm_2/config.pbtxt
Note
If you want to use the ensemble or BLS models, you have to create a copy of the ensemble and BLS models as well and modify the "tensorrt_llm" model name to "tensorrt_llm_2" in the config file.
5b. Launch the server:
python3 scripts/launch_triton_server.py --multi-model --model_repo=llama_ifb/
6b. Run the test client to measure performance:
pip3 install tritonclient[all]
# We will only benchmark the core tensorrt_llm models.
python3 tools/inflight_batcher_llm/benchmark_core_model.py --max-input-len 500 \
dataset --dataset ci/L0_backend_trtllm/simple_data.json \
--tokenizer-dir $HF_LLAMA_MODEL \
--tensorrt-llm-model-name tensorrt_llm \
--tensorrt-llm-model-name tensorrt_llm_2
7b. Kill the server:
pgrep mpirun | xargs kill
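Orchestrator mode leaves request distribution to the client. One simple way to do that is round-robin over the model names, sketched below. Round-robin is an assumption made for this example (the guide only requires that requests be spread between the models), and the helper names are hypothetical. The sketch does not send any requests; it only decides which model each request should target.

```python
import itertools


def round_robin_models(model_names):
    """Return an endless iterator that cycles through the model names."""
    names = list(model_names)
    if not names:
        raise ValueError("at least one model name is required")
    return itertools.cycle(names)


def assign_requests(requests, model_names):
    """Pair each request with the model name that should serve it."""
    chooser = round_robin_models(model_names)
    return [(next(chooser), request) for request in requests]


if __name__ == "__main__":
    models = ["tensorrt_llm", "tensorrt_llm_2"]  # names created in steps 3b and 4b
    for model, prompt in assign_requests(["prompt 0", "prompt 1", "prompt 2"], models):
        print(f"{prompt} -> {model}")
```

In a real client, the chosen name would be used as the model name of each inference call, so that consecutive requests alternate between the instance on GPUs 0,1 and the one on GPUs 2,3.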
The table below summarizes the differences between the orchestrator mode and leader mode:
| | Orchestrator Mode | Leader Mode |
|---|---|---|
| Multi-node Support | ❌ | ✅ |
| Requires Reverse Proxy | ❌ | ✅ |
| Requires Client Changes | ✅ | ❌ |
| Requires MPI_Comm_Spawn Support | ✅ | ❌ |