System Info
I'm using the ghcr.io/huggingface/text-generation-inference:3.0.1 container image.

Issue Description
Hi everyone!
I'm using the Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8 LLM model for benchmarking with multiple concurrent requests. However, when I send 10 concurrent requests, the responses start showing random characters, like the example below. It works fine with 3 or 5 concurrent requests, which give the best results.
Reproduction
Send 10 concurrent requests to the inference server (a minimal client sketch follows the docker-compose.yml below).
docker-compose.yml
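A minimal sketch of a compose file for this setup; the service name, host port, volume path, and GPU reservation here are illustrative assumptions rather than the original attachment:

```yaml
# Hypothetical sketch, not the original attachment: one way to serve
# Qwen2.5-14B-Instruct-GPTQ-Int8 with TGI 3.0.1 via docker compose.
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:3.0.1
    command: --model-id Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8 --quantize gptq
    ports:
      - "8080:80"      # TGI listens on port 80 inside the container
    volumes:
      - ./data:/data   # cache downloaded weights between runs
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```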
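And a minimal client sketch for the load step, assuming TGI's /generate endpoint on the host port mapped above; the prompt and sampling parameters are arbitrary:

```python
# Sketch of the reproduction step: fire 10 concurrent requests at
# TGI's /generate endpoint and print each completion. The URL matches
# the hypothetical port mapping in the compose sketch above.
import concurrent.futures
import requests

URL = "http://localhost:8080/generate"   # assumed host:port
PROMPT = "Explain the difference between TCP and UDP."
CONCURRENCY = 10   # at 3-5 concurrent requests the output stays clean

def generate(_: int) -> str:
    resp = requests.post(
        URL,
        json={
            "inputs": PROMPT,
            "parameters": {"max_new_tokens": 128, "temperature": 0.7},
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    for i, text in enumerate(pool.map(generate, range(CONCURRENCY))):
        print(f"--- response {i} ---\n{text}\n")
```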
Expected behavior
When comparing with the meta-llama/Meta-Llama-3.1-8B-Instruct model using similar parameters, I get normal responses even with many concurrent requests. I expect the Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8 model to also handle high concurrency without producing random characters.

I have also confirmed that the Qwen model performs well under high concurrency in vLLM.
Could anyone suggest settings or experiments that would improve behavior under high concurrency? Any help would be greatly appreciated. Thank you!