
[BUG] Abnormal first-token latency in the LLaMA2-7B service #122

Open
yeliang2258 opened this issue Sep 6, 2023 · 5 comments
Labels
bug Something isn't working

Comments

yeliang2258 commented Sep 6, 2023

lightllm commit id: 718e6d6dfffc75e7bbfd7ea80ba4afb77aa27726
Model link: https://huggingface.co/Linly-AI/Chinese-LLaMA-2-7B-hf
Server launch command: python -m lightllm.server.api_server --model_dir Linly-AI/Chinese-LLaMA-2-7B-hf --host 0.0.0.0 --port 8100 --tp 1 --max_total_token_num 120000 --tokenizer_mode auto --trust_remote_code
Testing shows the first-token latency is very high, around 3s. The issue can be reproduced with the model and launch command above. Could you take a look at what is causing it?
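For reference, a minimal sketch of how the first-token latency is measured. The streaming endpoint path and payload shape (/generate_stream, "inputs", "parameters") are assumptions and may differ across lightllm versions:

```python
# Measure time-to-first-token (TTFT) against the running server.
# Endpoint path and payload shape are assumed; adjust for your version.
import time
import requests

url = "http://0.0.0.0:8100/generate_stream"  # assumed streaming endpoint
payload = {
    "inputs": "Please introduce the city of Beijing.",
    "parameters": {"max_new_tokens": 64},
}

start = time.time()
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None):
        if chunk:
            # The first streamed chunk carries the first generated token.
            print(f"TTFT: {time.time() - start:.3f}s")
            break
```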

@yeliang2258 yeliang2258 added the bug Something isn't working label Sep 6, 2023
hiworldwzj (Collaborator) commented
@yeliang2258 The first run triggers kernel compilation, so it is bound to be slow; the only option is to warm up in advance.
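A minimal warmup sketch, assuming the server also exposes a non-streaming /generate endpoint with the same payload shape (again an assumption; adjust to your version):

```python
# Fire one short request after the server starts so that kernel
# compilation happens before any timed measurements.
import requests

requests.post(
    "http://0.0.0.0:8100/generate",  # assumed non-streaming endpoint
    json={"inputs": "hello", "parameters": {"max_new_tokens": 8}},
    timeout=120,  # the first request may be slow while kernels compile
)
```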

yeliang2258 (Author) commented
Thanks for the reply! But I always warm up before testing; the numbers above were already measured after warmup.

hiworldwzj (Collaborator) commented
@yeliang2258 Is your test concurrency very high, or are the prompts especially long? If many requests are batched together at once, the first inference step can indeed be slow.

yeliang2258 (Author) commented
Thanks. I tested with 32 concurrent requests, roughly 2000 prompts of mixed lengths, and the average first-token latency came out to 3s. Running the same script against BLOOM-7B gives a first-token latency of about 0.2s, so I suspect something is wrong with LLaMA2-7B.

hiworldwzj (Collaborator) commented
@yeliang2258 That is indeed a strange result; perhaps the llama2-7b tokenizer has some quirk. In terms of compute, llama2-7b and bloom-7b should be roughly comparable. You could also run the scripts in the test directory with different batch sizes and input/output lengths to see where raw inference performance stands, and then narrow down whether the slowness is in inference itself or in some other processing module.
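A sketch for checking the tokenizer hypothesis: time how long encoding the same prompts takes with each model's tokenizer. The LLaMA2 path matches the one in this issue; the BLOOM checkpoint id (bigscience/bloom-7b1) and the prompt contents are assumptions for illustration:

```python
# Compare raw tokenizer encode time between the two models.
import time
from transformers import AutoTokenizer

prompts = ["你好，请介绍一下北京。" * 20] * 100  # illustrative test inputs

for name in ["Linly-AI/Chinese-LLaMA-2-7B-hf", "bigscience/bloom-7b1"]:
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    start = time.time()
    for p in prompts:
        tok.encode(p)
    print(f"{name}: {time.time() - start:.3f}s for {len(prompts)} encodes")
```

If the LLaMA2 tokenizer is dramatically slower here, the bottleneck is in request preprocessing rather than the model forward pass.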
