Drop in performance for Llama-2-13b-chat-hf in fp8 when increasing batch size #1380
Comments
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Hi @bprus, can you try again on the latest main branch? We've integrated several optimizations, including multiple profiles, which should minimize the impacts of […]. Besides, may I ask why you set […]? Thanks for your support and help.
Hi @kaiyux, thanks for the reply. I'll check it in 2 weeks, because I'm out of office right now. I'll get back with the results as soon as I can.
@kaiyux I used […] (now it's changed to […]). I can't find any way to set dtype. Can you suggest something? Also, when testing the current version, I stumbled on another issue: #1738
Hi @bprus, do you still have further issues or questions? If not, we'll close it soon.
We can close, thanks.
System Info
- CPU architecture: x86_64
- TensorRT-LLM: v0.8.0 (docker build via `make -C docker release_build CUDA_ARCHS="90-real"`) and 0.9.0.dev2024032600
- Triton container: r24.02
Who can help?
@kaiyux
Information

Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
I've followed the official documentation to create Llama models and run them with Triton. I'm testing fp8 and int8 quantization.
https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0/examples/llama
For the fp8 model, I used the following commands:
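Roughly, following the v0.8.0 Llama example (paths, calibration size, and sequence lengths below are placeholders rather than my exact values):

```bash
# FP8 quantization of the HF checkpoint (placeholder paths / calib size)
python examples/quantization/quantize.py \
    --model_dir ./Llama-2-13b-chat-hf \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./ckpt_llama2_13b_fp8 \
    --calib_size 512

# Engine build with the large max_batch_size under test
trtllm-build \
    --checkpoint_dir ./ckpt_llama2_13b_fp8 \
    --output_dir ./engine_llama2_13b_fp8 \
    --gemm_plugin float16 \
    --max_batch_size 256 \
    --max_input_len 2048 \
    --max_output_len 512
```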
For the int8 model:
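In the same spirit, a sketch of an int8 weight-only build (again placeholder paths; weight-only is only one of several int8 options):

```bash
# INT8 weight-only checkpoint conversion (placeholder paths)
python examples/llama/convert_checkpoint.py \
    --model_dir ./Llama-2-13b-chat-hf \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --output_dir ./ckpt_llama2_13b_int8

# Engine build with the same limits as the fp8 engine
trtllm-build \
    --checkpoint_dir ./ckpt_llama2_13b_int8 \
    --output_dir ./engine_llama2_13b_int8 \
    --gemm_plugin float16 \
    --max_batch_size 256 \
    --max_input_len 2048 \
    --max_output_len 512
```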
I run the models with the Triton Docker container:
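Along these lines (the r24.02 image from the system info above; mounts and the model repository path are placeholders):

```bash
# Start the r24.02 Triton container with the tensorrtllm_backend repo mounted
docker run --rm -it --gpus all --net host \
    -v $(pwd)/tensorrtllm_backend:/tensorrtllm_backend \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 bash

# Inside the container: launch Triton on the prepared model repository
python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo /tensorrtllm_backend/triton_model_repo
```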
I'm testing performance for different setups, and I ran into the following issue:
When setting `max_batch_size` to high values (like 256) and running only 1 request at the same time, the performance of an fp8 model drops significantly compared to the int8 model and compared to a model built with `max_batch_size=1`.
I'm using mainly Locust for my tests, but to check that the problem is not in my code, I also run the `benchmark_core_model.py` script. I had to make some code changes to it to simulate my approach: I'm forcing it to do only one request at a time. My changes to the `test_performance` function:
Next, I run tests with the following command (with my own dataset):
and with the synthetic dataset:
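The invocations are of this general shape (flag names and values below are from memory of the `inflight_batcher_llm` tools and are placeholders, not my exact commands; `--help` has the authoritative interface):

```bash
# With my own dataset (JSON of prompts); tokenizer path is a placeholder
python3 tools/inflight_batcher_llm/benchmark_core_model.py \
    -i grpc --max-input-len 2048 \
    dataset \
    --dataset ./my_dataset.json \
    --tokenizer-dir ./Llama-2-13b-chat-hf

# With a synthetic token-length distribution
python3 tools/inflight_batcher_llm/benchmark_core_model.py \
    -i grpc --max-input-len 2048 \
    token-norm-dist \
    --num-requests 100 \
    --input-mean 128 --input-stdev 0 \
    --output-mean 128 --output-stdev 0
```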
Here are the results for fp8 and my dataset:
for int8:
And for synthetic fp8:
for int8:
As you can see, there is a huge difference in OP tokens/sec between fp8 and int8. It made me suspicious because I expected the performance to be similar, so I started to look for the cause. After some time, I found out that the problem disappears when I build models with `max_batch_size=1`. Then the results for both fp8 and int8 are nearly the same: 103.079041 OP tokens/sec for int8 and 100.961634 for fp8.
I investigated further and found that the models perform similarly up to around `max_batch_size=64`. Increasing it further causes fp8 performance to drop gradually.

I wonder if this issue also impacts the performance if there is more than 1 simultaneous request, but I can't check it. For example, I tested with 20 simultaneous requests (in Locust), and the performance is similar for both models: 65.904632 tokens/s for fp8 and 63.529027 for int8. But I don't know if they should be the same or if fp8 should be faster at this point.
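For reference, the Locust test is essentially of this shape (a trimmed sketch, not my exact file; the endpoint and payload assume Triton's HTTP generate extension on the `ensemble` model, default port 8000):

```python
# locustfile.py: hit the Triton generate endpoint with N concurrent users.
from locust import HttpUser, constant, task


class TritonUser(HttpUser):
    wait_time = constant(0)  # no think time between requests

    @task
    def generate(self):
        payload = {
            "text_input": "Hello, how are you?",  # placeholder prompt
            "max_tokens": 128,
            "bad_words": "",
            "stop_words": "",
        }
        self.client.post("/v2/models/ensemble/generate", json=payload)
```

Run with, e.g., `locust -f locustfile.py --host http://localhost:8000 --users 20 --spawn-rate 20 --headless` for the 20-simultaneous-request case mentioned above.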
I tested it on `v0.8.0` and `0.9.0.dev2024032600`. I can provide more details or results if needed.
Looking forward to solving this issue together.
Expected behavior
Performance for fp8 and int8 models is comparable.
actual behavior
Performance of fp8 model drops significantly when increasing `max_batch_size` and running 1 request at a time.

additional notes