Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
Inference intermittently fails to finish, and errors are raised.
I also don't quite understand how cache_max_entry_count should be set. vLLM simply takes a percentage of total VRAM, which is simple, direct, and effective.
The meaning of this value here is unclear to me. Even with cache_max_entry_count=0.20 and quant_policy=4 (the model occupies roughly 24 GB * 2/3 of VRAM, so I used 1 - 2/3 as the value),
I still get the following message:
2025-02-12 21:25:08,496 - lmdeploy - WARNING - model_agent.py:70 - device<0> No enough memory. update max_prefill_token_num=2048
Moreover, inference probabilistically either completes normally, never finishes, or raises an error.
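For context, here is a minimal sketch of how I am passing these options, assuming the PyTorch engine backend (the model_agent.py warning above comes from that engine); the model path is a placeholder for the actual 32B 4-bit checkpoint in use:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Placeholder model path; substitute the actual 32B 4-bit checkpoint.
engine_config = PytorchEngineConfig(
    cache_max_entry_count=0.20,  # fraction of VRAM reserved for the KV cache
    quant_policy=4,              # 4-bit quantized KV cache
)
pipe = pipeline('Qwen/Qwen2.5-32B-Instruct-AWQ', backend_config=engine_config)
print(pipe(['Hello']))
```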
In theory, for a 32B 4-bit model on a 24 GB card with a float16 KV cache, the maximum token count should be about 12,288. In practice it falls short of 12,288 because the system itself occupies some VRAM; this can also be verified in vLLM by setting the maximum token count and gpu_memory_utilization.
But quant_policy=4 is already in use here. Can the maximum token count simply be multiplied by 4, or not?
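To make the arithmetic behind these estimates explicit, here is a back-of-envelope sketch. The per-token KV-cache cost depends on the model's layer count, KV-head count, and head dimension; the numbers below are my assumptions for a Qwen2.5-32B-like GQA shape (64 layers, 8 KV heads, head_dim 128) with ~16 GiB of weights, and they ignore activation buffers and quantization scale/zero-point overhead, so real limits will be lower:

```python
def max_kv_tokens(vram_gib, weights_gib, num_layers, num_kv_heads, head_dim,
                  bytes_per_elem):
    """Upper bound on how many KV-cache tokens fit in the leftover VRAM."""
    per_token = num_layers * 2 * num_kv_heads * head_dim * bytes_per_elem
    return (vram_gib - weights_gib) * 1024**3 // per_token

# Assumed shape: 64 layers, 8 KV heads (GQA), head_dim 128; 16 GiB of weights.
fp16_tokens = max_kv_tokens(24, 16, 64, 8, 128, bytes_per_elem=2)
int4_tokens = max_kv_tokens(24, 16, 64, 8, 128, bytes_per_elem=0.5)
print(f"fp16 KV cache: ~{fp16_tokens:,.0f} tokens")
print(f"int4 KV cache: ~{int4_tokens:,.0f} tokens")  # ~4x fp16, before overhead
```

To first order, a 4-bit KV cache does quadruple the token budget, minus whatever the per-block quantization metadata adds.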
No matter how I set cache_max_entry_count, I always get
2025-02-12 21:25:08,496 - lmdeploy - WARNING - model_agent.py:70 - device<0> No enough memory. update max_prefill_token_num=2048
or it errors out directly. So, for a 32B 4-bit model on 24 GB of VRAM, what should the maximum session_len be? Should it be computed as if the KV cache were float16?
Will this maximum token count be raised in the future? And why does a bit more than 3 GB of VRAM remain unused?
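To narrow down where the remaining ~3 GB goes, one way is to query the device directly after the engine starts (a sketch; note torch.cuda.mem_get_info reports driver-level free memory, so it also reflects what other processes hold):

```python
import torch

# Free vs. total bytes on device 0, as seen by the CUDA driver.
free_b, total_b = torch.cuda.mem_get_info(0)
print(f"free: {free_b / 1024**3:.2f} GiB / total: {total_b / 1024**3:.2f} GiB")
```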
The error traceback is included below.
Reproduction
Environment
Error traceback
@lvhan028