Inflight batching for fp8 Llama and Mixtral is broken #1738
Comments
Hi @bprus, we'll try to reproduce your issue locally first. |
Much appreciated. |
I have also reproduced the problems above. I found that if the |
@wanzhenchn @bprus could you try with run.py? This is what I got with llama-2-13b-hf (not chat, but that should not make any difference to the issue here) using the same commands you shared. I am going to try with triton-server and will update here if I find anything different.
|
I could reproduce it with triton-backend + IFB. Let me see what is going wrong with IFB. |
@wanzhenchn @bprus please try the fix shown below (it will be pushed to the main branch in next week's update): modify this line to:
|
I confirm that the fix works for the Llama model. |
@PerkzZheng The issue with generation not stopping is solved. But now, the responses generated under multiple simultaneous requests are much shorter than those generated with only a single request. For a set of 15 questions, the average number of generated tokens is 450 for single requests and 319 for multiple requests. I looked into the generated answers and found that, this time around, the single-request setup tends not to stop.
Prompt:
Single-request answer:
Multiple-requests answer:
Any ideas? Note: I quantized Mixtral on CPU, and I'm not sure if this can impact results in such a way. I've described it here: #1777 |
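For anyone repeating this comparison, a minimal sketch of the summary statistics involved. It assumes the per-response token counts have already been collected (for example with the model's tokenizer); the function name `summarize` and the returned keys are made up for illustration and are not part of TensorRT-LLM or Triton.

```python
from statistics import mean
from typing import Dict, List


def summarize(token_counts: List[int], max_tokens: int) -> Dict[str, float]:
    """Summarize generated lengths for one setup (single or multiple requests).

    token_counts: number of generated tokens per response (assumed precomputed).
    max_tokens:   the generation limit used in the requests.
    """
    return {
        # Average generated length across all responses.
        "mean_tokens": mean(token_counts),
        # Responses that ran into the limit, i.e. generation likely did not stop on its own.
        "hit_limit": sum(1 for n in token_counts if n >= max_tokens),
    }
```

Running it once per setup and comparing `hit_limit` alongside the mean helps tell "answers are shorter" apart from "fewer answers run until the limit".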
@bprus could you try with run.py directly (not IFB + triton backend)? See #1738 (comment). Please also share the full commands you used to build the engine. |
@PerkzZheng sorry to keep you waiting. First, here are the commands I use:
I used
While the triton returns:
So the issue is somewhere in Triton or IFB I guess. Is there anything else I can help you with? |
@bprus have you enabled chunked_context or kv_cache_reuse? What if we disable them all (and probably paged_context_fmha as well)? I am still confused why the same commands work for Llama (or is Llama using different commands?). |
@PerkzZheng
So the outputs are different, but none of them seems broken. Is it expected that the generated answers differ so much between Triton + IFB and run.py? As for the other options, I left them at their defaults. |
@bprus no, that is not expected. Have you confirmed that both are using the same sampling configuration (greedy search, I assume)? I got consistent outputs when using IFB + triton backend vs run.py (even though I used llama-2-13b-hf).
Can you share the full config.pbtxt so we know which configuration you are using? |
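A quick way to check the sampling-configuration question is to diff the parameters each path actually receives. The sketch below is only an illustration: the key names in `SAMPLING_KEYS` are assumptions, and the two dictionaries have to be filled in by hand from whatever the Triton request and the run.py invocation really use.

```python
from typing import Any, Dict, Tuple

# Assumed parameter names, for illustration only; replace with the ones
# the two setups actually expose.
SAMPLING_KEYS = ("temperature", "top_k", "top_p", "beam_width", "random_seed")


def sampling_diff(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (value_in_a, value_in_b)} for every sampling key that differs.

    A key missing from one side shows up as None, which exposes cases
    where one path silently falls back to a default.
    """
    return {k: (a.get(k), b.get(k)) for k in SAMPLING_KEYS if a.get(k) != b.get(k)}
```

An empty result means both paths were given the same values for the listed keys; it says nothing about defaults applied further down the stack.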
@PerkzZheng After changing that, I get consistent outputs between Triton and Thanks for all the help, I'm closing the issue now. |
Hi Team! It seems #1793 fixed this issue. Could you confirm that the issue exists in 0.10.0 and that we have to avoid FP8 quantization with IFB in that version? |
Yes, the same issue exists in 0.10.0. You can pick up the latest main branch. |
System Info
x86_64
0.11.0.dev2024060400 (docker build via make -C docker release_build CUDA_ARCHS="90-real")
r24.04 (docker build via DOCKER_BUILDKIT=1 docker build -t trt-llm -f dockerfile/Dockerfile.trt_llm_backend . in tensorrtllm_backend)
Who can help?
@Tracin @byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
I've followed the official documentation to create Llama models and run them with Triton. I'm testing fp8 and int8 quantization. The issue is also present for the Mixtral model, but I'm giving examples only for Llama for simplicity.
For fp8 model, I used the following commands:
For int8 model:
I serve models with Triton docker.
I'm testing performance for different setups using Locust, and I ran into the following issue.
When making one request at a time to the model, everything works as expected for both setups.
But when I try to make simultaneous requests, the generated output for fp8 is broken. It nearly always tries to generate tokens until max_tokens is reached. The issue doesn't exist in int8 setup.
Here is an example (max_tokens is set to 1000):
fp8 single request:
fp8 multiple requests:
int8 single request:
int8 multiple requests:
My guess is that something with inflight batching is broken for fp8. When the server tries to batch incoming requests it breaks the output in some way.
It looks a little bit similar to: #1539
I can run more tests and provide more results if you need.
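The single-versus-simultaneous comparison described above can be sketched without Locust as follows. This is a minimal illustration, not the setup I actually used: `send` is a hypothetical callable that takes a prompt and returns the generated text (for example, a thin wrapper around an HTTP request to the Triton server), and all function names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_sequential(send: Callable[[str], str], prompts: List[str]) -> List[str]:
    """Send prompts one at a time, so at most one request is in flight."""
    return [send(p) for p in prompts]


def run_concurrent(send: Callable[[str], str], prompts: List[str], workers: int = 8) -> List[str]:
    """Send prompts from several threads at once, so the server can batch them.

    pool.map returns results in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))


def mismatches(single: List[str], multi: List[str]) -> List[int]:
    """Indices of prompts whose output differs between the two runs."""
    return [i for i, (a, b) in enumerate(zip(single, multi)) if a != b]
```

With deterministic (greedy) decoding, `mismatches(run_sequential(send, prompts), run_concurrent(send, prompts))` would be expected to come back empty or nearly so; in the fp8 setup it does not.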
Expected behavior
Responses generated for fp8 model when using inflight batching are the same as without it.
actual behavior
fp8 model when receiving multiple requests returns broken output.
additional notes