lightllm commit id: 718e6d6dfffc75e7bbfd7ea80ba4afb77aa27726
Model link: https://huggingface.co/Linly-AI/Chinese-LLaMA-2-7B-hf
Server launch command: python -m lightllm.server.api_server --model_dir Linly-AI/Chinese-LLaMA-2-7B-hf --host 0.0.0.0 --port 8100 --tp 1 --max_total_token_num 120000 --tokenizer_mode auto --trust_remote_code
Testing shows very high first-token latency, around 3s. The issue can be reproduced with the model and launch command above. Could you take a look at what might be causing it?
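For reference, a minimal sketch of how first-token latency can be measured against this server. It assumes the /generate endpoint and request format shown in the lightllm README; setting max_new_tokens to 1 makes the total request time approximate the time to first token.

```python
# Sketch: approximate time-to-first-token by requesting a single token.
# Assumes the /generate endpoint from the lightllm README; adjust if the API differs.
import time
import requests

url = "http://0.0.0.0:8100/generate"
payload = {
    "inputs": "你好，请介绍一下你自己。",  # sample prompt; substitute the real test input
    "parameters": {"max_new_tokens": 1},   # 1 generated token ~= first-token latency
}

t0 = time.time()
resp = requests.post(url, json=payload)
print(f"first-token latency: {time.time() - t0:.2f}s, response: {resp.text}")
```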
@yeliang2258 The first launch triggers operator (kernel) compilation, so it is bound to be slow; the only workaround is to warm up in advance.
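A possible warmup sketch, under the same assumed /generate API as above: fire a few short requests after the server starts so kernel compilation completes before any latency is measured.

```python
# Hypothetical warmup loop: a few short requests to trigger kernel compilation
# before benchmarking. Endpoint and payload format assumed from the lightllm README.
import requests

for _ in range(4):
    requests.post(
        "http://0.0.0.0:8100/generate",
        json={"inputs": "warmup", "parameters": {"max_new_tokens": 8}},
    )
```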
Thanks for the reply! I do warm up before every test; the numbers above were measured after warmup.
@yeliang2258 Is your test concurrency very high, or are the prompts particularly long? If many requests get batched together, the first inference step can indeed be slow.
Thanks. I tested with 32 concurrent requests, about 2000 prompts in total, of mixed lengths. The measured average first-token latency is 3s. Testing BLOOM-7B with the same script gives a first-token latency of about 0.2s, so I suspect something is off with LLaMA2-7B.
@yeliang2258 That is indeed odd. Perhaps the llama2-7b tokenizer is doing something unusual; in terms of compute, llama2-7b and bloom-7b should be roughly comparable. You could also run the scripts in the test directory with different batch sizes and input/output lengths to see where the raw inference performance stands, and then narrow down whether inference itself is slow or some other module's processing is the bottleneck.
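A quick sketch to test the tokenizer hypothesis in isolation: time encoding alone for both models and compare. This uses Hugging Face transformers; the BLOOM model id (bigscience/bloom-7b1) is an assumption standing in for whichever BLOOM-7B checkpoint was benchmarked.

```python
# Sketch: compare tokenizer encode speed for the two models to see whether
# tokenization explains the first-token latency gap.
# "bigscience/bloom-7b1" is a placeholder for the actual BLOOM-7B model used.
import time
from transformers import AutoTokenizer

prompt = "测试" * 256  # a moderately long input; adjust to match the benchmark data

for name in ["Linly-AI/Chinese-LLaMA-2-7B-hf", "bigscience/bloom-7b1"]:
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    t0 = time.time()
    for _ in range(100):
        tok(prompt)
    print(name, f"{(time.time() - t0) / 100 * 1000:.1f} ms per encode")
```

If the LLaMA-2 tokenizer turns out to be orders of magnitude slower, the bottleneck is in request preprocessing rather than inference; otherwise the test-directory benchmark scripts should help localize it.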