Drop in performance for Llama-2-13b-chat-hf in fp8 when increasing batch size #1380

Closed
bprus opened this issue Mar 29, 2024 · 6 comments
Labels: bug (Something isn't working), stale

@bprus (Contributor) commented Mar 29, 2024

System Info

  • CPU architecture: x86_64
  • GPU: NVIDIA H100 80GB
  • TensorRT-LLM: v0.8.0 (docker build via make -C docker release_build CUDA_ARCHS="90-real") and 0.9.0.dev2024032600
  • Triton Inference Server: r24.02
  • OS: Ubuntu 22.04

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I've followed the official documentation to create Llama models and run them with Triton. I'm testing fp8 and int8 quantization.
https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0/examples/llama

For the fp8 model, I used the following commands:

python ../quantization/quantize.py --model_dir meta-llama/Llama-2-13b-chat-hf \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --output_dir /models/quant/Llama-2-13b-chat-hf_fp8 \
                                   --calib_size 512 \
                                   --tp_size 1

trtllm-build --checkpoint_dir /models/quant/Llama-2-13b-chat-hf_fp8 \
             --output_dir /models/engines/Llama-2-13b-chat-hf_1gpu_fp8 \
             --gemm_plugin float16 \
             --workers 1 \
             --use_custom_all_reduce disable \
             --remove_input_padding enable \
             --use_paged_context_fmha enable \
             --strongly_typed \
             --max_batch_size 256

For the int8 model:

python3 convert_checkpoint.py --model_dir meta-llama/Llama-2-13b-chat-hf \
                              --output_dir /models/rt/Llama-2-13b-chat-hf_1gpu_fp16_wq8 \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int8 \
                              --tp_size 1 \
                              --workers 1

trtllm-build --checkpoint_dir /models/rt/Llama-2-13b-chat-hf_1gpu_fp16_wq8 \
             --output_dir /models/engines/Llama-2-13b-chat-hf_1gpu_fp16_wq8_pc \
             --gemm_plugin float16 \
             --workers 1 \
             --use_custom_all_reduce disable \
             --remove_input_padding enable \
             --use_paged_context_fmha enable \
             --max_batch_size 256

I run the models with the Triton Docker container:

docker run -d --rm --net host --shm-size=40g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all --name trt-llm \
	-v /root/dev/tensorrtllm_backend:/tensorrtllm_backend \
	-v /root/dev/models:/models \
	-v /root/models:/models-hub \
	trt-24-dev \
	mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver  --log-info True --log-verbose 3 --model-repository=/models/triton/llama-fp8 --grpc-port=8001 --http-port=8000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ :

I'm testing performance for different setups and ran into the following issue:
When max_batch_size is set to a high value (like 256) and only one request runs at a time, the performance of the fp8 model drops significantly compared to the int8 model, and compared to an fp8 model built with max_batch_size=1.

I mainly use Locust for my tests, but to rule out a problem with my own code, I also ran the benchmark_core_model.py script. I had to make some changes to it to simulate my approach, forcing it to issue only one request at a time. My changes to the test_performance function are marked with # CHANGED comments:

    responses = []  # CHANGED: collect responses one at a time
    for i, ids in enumerate(input_start_ids):
        output0_len = np.ones_like([[1]]).astype(np.int32) * output_lens[i]
        end_id = np.full_like([[1]], 2).astype(np.int32)
        inputs = [
            utils.prepare_tensor("input_ids", ids, FLAGS.protocol),
            utils.prepare_tensor("input_lengths", input_lens[i],
                                 FLAGS.protocol),
            utils.prepare_tensor("request_output_len", output0_len,
                                 FLAGS.protocol),
            utils.prepare_tensor("end_id", end_id, FLAGS.protocol),
        ]

        # time.sleep(delays[i])  # CHANGED: no artificial delay between requests

        if FLAGS.protocol == "http":
            async_requests.append(
                client.async_infer(model_name, inputs, request_id=str(i)))
        elif FLAGS.protocol == "grpc":
            async_requests.append(
                client.async_infer(model_name,
                                   inputs,
                                   callback=partial(callback, user_data,
                                                    datetime.now(), i),
                                   request_id=str(i)))
        # CHANGED: block until this request completes before sending the next one
        responses.append(utils.get_grpc_results(user_data, 1)[0])

    try:
        # CHANGED: batched result collection commented out, since each response
        # is already collected inside the loop above
        # if FLAGS.protocol == "http":
        #     utils.get_http_results(async_requests)
        # elif FLAGS.protocol == "grpc":
        #     responses = utils.get_grpc_results(user_data, len(input_start_ids))
        # else:
        #     raise RuntimeError("Invalid protocol")

Next, I run tests with the following command (with my own dataset):

python3 benchmark_core_model.py -i grpc --max-input-len 1024 \
    --num-requests 100 --request-rate -1 --time-delay-dist constant \
    dataset --dataset /data/questions_triton.json \
    --tokenizer-dir meta-llama/Llama-2-13b-chat-hf \
    --op-tokens-per-word 1.0

and with a synthetic dataset:

python3 benchmark_core_model.py -i grpc --max-input-len 1024 \
    --num-requests 50 --request-rate -1 \
    token-norm-dist --input-mean 128 --input-stdev 5 \
    --output-mean 500 --output-stdev 20
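
Independently of benchmark_core_model.py, single-request latency can also be sanity-checked by sending one request directly to Triton. A minimal sketch, assuming the standard ensemble model exposed by the tensorrtllm_backend examples and an arbitrary prompt:

# Single request against Triton's generate endpoint (illustrative prompt)
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
     -d '{"text_input": "What is machine learning?", "max_tokens": 256, "bad_words": "", "stop_words": ""}'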

Here are the results for fp8 and my dataset:

+----------------------------+-----------+
|            Stat            |   Value   |
+----------------------------+-----------+
|        Requests/Sec        |   0.25    |
|       OP tokens/sec        |   67.04   |
|     Avg. latency (ms)      |  4025.84  |
|      P99 latency (ms)      | 14156.30  |
|      P90 latency (ms)      |  8077.87  |
| Avg. IP tokens per request |   16.76   |
| Avg. OP tokens per request |  269.90   |
|   Avg. InFlight requests   |   0.00    |
|     Total latency (ms)     | 201303.43 |
|       Total requests       |   50.00   |
+----------------------------+-----------+

for int8:

+----------------------------+-----------+
|            Stat            |   Value   |
+----------------------------+-----------+
|        Requests/Sec        |   0.38    |
|       OP tokens/sec        |   96.53   |
|     Avg. latency (ms)      |  2640.49  |
|      P99 latency (ms)      |  9806.77  |
|      P90 latency (ms)      |  5151.49  |
| Avg. IP tokens per request |   16.76   |
| Avg. OP tokens per request |  254.92   |
|   Avg. InFlight requests   |   0.00    |
|     Total latency (ms)     | 132036.59 |
|       Total requests       |   50.00   |
+----------------------------+-----------+

And for synthetic fp8:

+----------------------------+-----------+
|            Stat            |   Value   |
+----------------------------+-----------+
|        Requests/Sec        |   0.14    |
|       OP tokens/sec        |   53.00   |
|     Avg. latency (ms)      |  7088.72  |
|      P99 latency (ms)      |  7553.55  |
|      P90 latency (ms)      |  7429.67  |
| Avg. IP tokens per request |  127.72   |
| Avg. OP tokens per request |  375.72   |
|   Avg. InFlight requests   |   0.00    |
|     Total latency (ms)     | 354447.98 |
|       Total requests       |   50.00   |
+----------------------------+-----------+

for int8:

+----------------------------+-----------+
|            Stat            |   Value   |
+----------------------------+-----------+
|        Requests/Sec        |   0.20    |
|       OP tokens/sec        |   75.67   |
|     Avg. latency (ms)      |  4945.65  |
|      P99 latency (ms)      |  5381.79  |
|      P90 latency (ms)      |  5182.60  |
| Avg. IP tokens per request |  128.62   |
| Avg. OP tokens per request |  374.24   |
|   Avg. InFlight requests   |   0.00    |
|     Total latency (ms)     | 247294.83 |
|       Total requests       |   50.00   |
+----------------------------+-----------+

As you can see, there is a huge difference in OP tokens/sec between fp8 and int8. This made me suspicious, because I expected the performance to be similar, so I started looking for the cause. After some time, I found that the problem disappears when I build the engines with max_batch_size=1: the results for fp8 and int8 are then nearly the same, 103.079041 OP tokens/sec for int8 and 100.961634 for fp8.
I investigated further and found that the models perform similarly up to around max_batch_size=64; increasing it further causes fp8 performance to drop gradually.
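
For reference, such a comparison can be scripted by rebuilding the fp8 engine with only max_batch_size varied. A minimal sketch based on the build command above (the per-size output directory names are only illustrative):

# Rebuild the fp8 engine for several max_batch_size values
for MBS in 1 64 128 256; do
    trtllm-build --checkpoint_dir /models/quant/Llama-2-13b-chat-hf_fp8 \
                 --output_dir /models/engines/Llama-2-13b-chat-hf_1gpu_fp8_mbs${MBS} \
                 --gemm_plugin float16 \
                 --workers 1 \
                 --use_custom_all_reduce disable \
                 --remove_input_padding enable \
                 --use_paged_context_fmha enable \
                 --strongly_typed \
                 --max_batch_size ${MBS}
done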

I wonder whether this issue also impacts performance when there is more than one simultaneous request, but I can't verify it. For example, I tested with 20 simultaneous requests (in Locust), and the performance is similar for both models: 65.904632 tokens/s for fp8 and 63.529027 for int8. But I don't know whether they should be the same or whether fp8 should be faster at this point.

I tested it on v0.8.0 and 0.9.0.dev2024032600.

I can provide more details or results if needed.

Looking forward to solving this issue together.

Expected behavior

Performance for fp8 and int8 models is comparable.

Actual behavior

Performance of the fp8 model drops significantly when max_batch_size is increased while only one request runs at a time.

Additional notes

bprus added the bug (Something isn't working) label on Mar 29, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions bot added the stale label on May 19, 2024

@kaiyux (Member) commented May 23, 2024

Hi @bprus, can you try again on the latest main branch? We've integrated several optimizations, including multiple profiles, which should minimize the impact of max_batch_size on kernel selection.

Also, may I ask why you set --gemm_plugin float16 for the fp8 case? The GEMM plugin is usually not recommended for fp8. Could you also try to reproduce the perf issue using gptManagerBenchmark and let us know? This would rule out any impact from the Triton backend.
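
For reference, a minimal sketch of such an fp8 build, derived from the command in the issue description with the GEMM plugin dropped; the --multiple_profiles flag and the output directory name are assumptions based on recent trtllm-build versions:

trtllm-build --checkpoint_dir /models/quant/Llama-2-13b-chat-hf_fp8 \
             --output_dir /models/engines/Llama-2-13b-chat-hf_1gpu_fp8_mp \
             --workers 1 \
             --use_custom_all_reduce disable \
             --remove_input_padding enable \
             --use_paged_context_fmha enable \
             --strongly_typed \
             --multiple_profiles enable \
             --max_batch_size 256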

Thanks for your support and help.

@bprus (Contributor, Author) commented May 23, 2024

Hi @kaiyux, thanks for the reply. I'll check it in two weeks, as I'm out of the office right now. I'll get back with the results as soon as I can.

github-actions bot removed the stale label on May 24, 2024

@bprus (Contributor, Author) commented Jun 5, 2024

@kaiyux
I tested the new version, 0.11.0.dev2024060400: when using multiple profiles, the issue disappears. Thanks for the help!

I used --gemm_plugin float16 because it was in the official example:

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
             --output_dir ./engine_outputs \
             --gemm_plugin float16 \
             --strongly_typed \
             --workers 2

(it has since been changed to --gemm_plugin auto.)
Thank you for letting me know that this is not good practice.
However, when I turned off the GEMM plugin, I got the following warnings:

[06/05/2024-10:13:15] [TRT-LLM] [I] Set dtype to float16.
[06/05/2024-10:13:15] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
...

I can't find any way to set dtype. Can you suggest something?

Also, when testing the current version, I stumbled on another issue: #1738

@nv-guomingz (Collaborator) commented:

Hi @bprus, do you still have any further issues or questions? If not, we'll close this soon.

@bprus (Contributor, Author) commented Nov 14, 2024

We can close, thanks.

byshiue closed this as completed on Nov 14, 2024