benchmark stuck #20
Hi,
I tried benchmark_serving.py to check the throughput of lightllm, but the benchmark process seems to get stuck after the server prints "freed all gpu mem"; after that, the HTTP POST logs stop printing, except for one last line.
Any idea?
Comments
@leiwen83 "freed all gpu mem size 6000" means "all requests have finished". Do you have other requests left to run?
@leiwen83 I guess your arg "batch_max_tokens" is at its default of 1/6 * 6000 = 1000, which means requests with req_input_len > 1000 cannot be handled by the router. You can set batch_max_tokens larger. Also, 6000 is too small a max_total_token_num for a throughput test; please read https://github.com/ModelTC/lightllm/blob/main/docs/ApiServerArgs.md to choose a better value for max_total_token_num.
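A minimal sketch of the rule described above (illustrative only, not lightllm's actual scheduler code; the 1/6 default and the length check are taken from this comment):

```python
# Illustrative only: batch_max_tokens defaults to 1/6 of max_total_token_num,
# and a request whose input length exceeds batch_max_tokens is never scheduled.
def default_batch_max_tokens(max_total_token_num: int) -> int:
    return max_total_token_num // 6

def router_can_handle(req_input_len: int, batch_max_tokens: int) -> bool:
    return req_input_len <= batch_max_tokens

if __name__ == "__main__":
    bmt = default_batch_max_tokens(6000)        # -> 1000
    print(bmt, router_can_handle(1200, bmt))    # 1000 False: such a request gets stuck
```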
@hiworldwzj you're right, after changing batch_max_tokens the benchmark itself works.
But on an A100@40G I get 866.94 s, which is far from the reported 188.85 s. Is 866.94 s a reasonable value for an A100@40G? Running the same dataset with vLLM, I get 449.97 s. The test command is:
@leiwen83 I tried your settings on an A800 80G, but using only 40G of GPU memory. The result is:
total tokens: 986423
Total time: 227.69 s
Throughput: 8.78 requests/s
Average latency: 97.93 s
Average latency per token: 0.36 s
Average latency per output token: 2.41 s
I guess there could be two reasons. One possibility is that a slow tokenizer was loaded, and another is that there may be issues related to the Triton version. I haven't tested the performance of the kernels I wrote on Triton 2.1.0, so I'm not sure whether the operators' performance degrades on different GPUs. Sometimes I feel very disappointed with the instability of Triton.
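For reference, a quick standard-library-only way to confirm which Triton build is actually installed in the serving environment, so results from different machines can be compared:

```python
# Print the installed Triton version; kernel performance can differ across builds.
import importlib.metadata

print("triton", importlib.metadata.version("triton"))
```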
@leiwen83 I will try to borrow an A100@40G to reproduce this bug and fix it. Thanks for reporting it. Or could you help identify the cause and fix this performance issue? Thanks.
I think that after you finish tuning, a Docker image could be published together with the Hugging Face URL used for testing, so that people could better align with your results. For now, it is very hard to say which component causes such a big performance downgrade.
@leiwen83 Yes, you are right. I will try it.
@leiwen83 You can try to use triton==2.0.0.dev20221202. Here is my result on a 40G A100:
Hi @shihaobai, after using triton==2.0.0.dev20221202, I get 801.74 s for the benchmark. It seems to me that after 796, the prompts get processed serially rather than in batches? Would you mind packing your environment into a Docker image, together with the lightllm setup and the test case, so that I could reproduce your 249 s latency result on my machine? Thx
It very much sounds like you have loaded a slow tokenizer. The "freed all gpu mem size" print means "all requests have finished, but detokenization has not finished yet".
How do I change to a fast tokenizer?
Do you have tokenizer.json in your model folder? If that file is present, the fast tokenizer will be loaded.
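A quick way to check both points (the path is a placeholder for your model directory; transformers' AutoTokenizer returns a fast tokenizer only when it can build one, e.g. when tokenizer.json is present or a slow-to-fast converter exists for the model type):

```python
import os
from transformers import AutoTokenizer

model_dir = "/path/to/your/model"  # placeholder: the folder lightllm loads

# Is the serialized fast tokenizer already there?
print("tokenizer.json present:", os.path.exists(os.path.join(model_dir, "tokenizer.json")))

# What does transformers actually load from this folder?
tok = AutoTokenizer.from_pretrained(model_dir)
print(type(tok).__name__, "is_fast =", tok.is_fast)
```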
There is no such file... Thx
You can use the pre-trained model at https://huggingface.co/huggyllama/llama-7b/tree/main, which includes a fast tokenizer.
After switching to this fast tokenizer, the result gets very close to yours:
Total time: 268.24 s
Do you know how to generate a "tokenizer.json" for those LLaMA models that don't have one? Thx
@shihaobai Can you help with this?
You can export a tokenizer.json by referring to https://huggingface.co/docs/transformers/fast_tokenizers.
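For example, a minimal sketch along the lines of that guide (paths are placeholders; this assumes the tokenizers and sentencepiece packages are installed): loading the slow sentencepiece tokenizer with use_fast=True makes transformers convert it, and save_pretrained then writes tokenizer.json next to the other files.

```python
from transformers import AutoTokenizer

model_dir = "/path/to/llama/model"  # placeholder: folder with tokenizer.model but no tokenizer.json

# The slow-to-fast conversion happens here (may take a moment).
tok = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
assert tok.is_fast, "conversion to a fast tokenizer failed"

# Writes tokenizer.json (plus the tokenizer config files) into the same folder.
tok.save_pretrained(model_dir)
```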
Got it. Thx.