Performance comparison between lightllm and vllm #116
Comments
@Cydia2018 Could it be that you are not using a Fast Tokenizer? Did the server print any warning about that when it started?
@hiworldwzj You are right, I did not use a Fast Tokenizer, but neither vllm nor lightllm used one, so I don't think that is the main cause of the performance gap.
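For context, whether a fast tokenizer is actually being used can be checked directly with transformers; this is just a minimal sketch, reusing the model path from this thread for illustration:

from transformers import AutoTokenizer

# Try to load the Rust-based fast tokenizer; transformers falls back to the slow one if it is unavailable.
tok = AutoTokenizer.from_pretrained("/code/llama-65b-hf", use_fast=True)
print(type(tok).__name__, "is_fast =", tok.is_fast)  # is_fast=True means the fast tokenizer is active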
@Cydia2018 With this model configuration you need to recompute a reasonable value for max_total_token_num.
@hiworldwzj Please tell me how to work out a reasonable range for max_total_token_num. If convenient, please also share the exact settings you used when benchmarking the 65b model (same dataset). Thanks.
@hiworldwzj On A100-sxm-80G with tp=8, even with max_total_token_num raised as far as 193696, the throughput is still only 2.98 requests/s.
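For what it's worth, a back-of-the-envelope way to estimate max_total_token_num is to divide the GPU memory left after loading the weights by the KV-cache size per token. The sketch below assumes fp16 weights and KV cache, llama-65b's published shape (80 layers, hidden size 8192), and a 10% headroom for activations; the exact accounting lightllm uses may differ:

# Rough estimate of max_total_token_num for llama-65b on 8 x A100-80G (fp16); all constants are assumptions.
num_gpus, gpu_mem_gb = 8, 80
params_b = 65                 # billions of parameters
num_layers, hidden = 80, 8192
dtype_bytes = 2               # fp16

weights_gb = params_b * dtype_bytes                          # ~130 GB of weights in total
kv_bytes_per_token = 2 * num_layers * hidden * dtype_bytes   # K and V for every layer, ~2.5 MB/token
headroom_gb = 0.10 * num_gpus * gpu_mem_gb                   # leave ~10% for activations/workspace

free_bytes = (num_gpus * gpu_mem_gb - weights_gb - headroom_gb) * 1024**3
print(int(free_bytes // kv_bytes_per_token))                 # on the order of 1.8e5 tokens

That order of magnitude is consistent with the 121060 and 193696 values tried in this thread.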
@Cydia2018 OK, I will try your configuration and measure it; for the moment I can't immediately tell where the problem is.
Could anyone help me? After I run python setup.py install in the terminal there is no response at all. How can I fix this?
Hello, as things stand, for multi-GPU inference of llama2 70b, does lightllm achieve lower latency than vllm? Is there a related benchmark? Many thanks.
Hi everyone, when running the benchmark, is there a good way to install vllm so that its dependencies do not interfere with lightllm?
@zzb610 How about creating two separate virtual environments with conda?
Below are my test results on A100-sxm-80G:
vllm
python -m vllm.entrypoints.api_server --model /code/llama-65b-hf --swap-space 16 --disable-log-requests --tensor-parallel-size 8
python benchmarks/benchmark_serving.py --tokenizer /code/llama-65b-hf --dataset /code/ShareGPT_V3_unfiltered_cleaned_split.json
Total time: 312.02 s
Throughput: 3.20 requests/s
Average latency: 125.45 s
Average latency per token: 0.40 s
Average latency per output token: 2.10 s
lightllm
python -m lightllm.server.api_server --model_dir /code/llama-65b-hf --tp 8 --max_total_token_num 121060 --tokenizer_mode auto
python benchmark_serving.py --tokenizer /code/llama-65b-hf --dataset /code/ShareGPT_V3_unfiltered_cleaned_split.json
total tokens: 494250
Total time: 333.10 s
Throughput: 3.00 requests/s
Average latency: 113.86 s
Average latency per token: 0.33 s
Average latency per output token: 1.54 s
The lightllm results look far off from the reported performance. Could you tell me where my settings are wrong? Thanks.