Expose cache_block_seq_len to API #1218
Conversation
@lvhan028 @lzhangzz @AllentDan Could you help review this PR?
LGTM
@ispobock I am going to postpone this PR's review, since we are prioritizing #1211, which is very important for the upcoming release. That PR proposes a fallback strategy for cases where the turbomind engine doesn't support a model but the pytorch engine does. Back to this PR: since both engines define the size of the k/v cache block, I think a common attribute name would be better.
@lvhan028 Sure. For the Pytorch engine, there is a …
@grimoire If the user provides both …
@lvhan028 It seems that …
I agree with the default value of 64. |
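To make the shared flag concrete, here is a minimal sketch of how one CLI value could be mapped onto both engine configs. The Turbomind field name follows this PR; `PytorchEngineConfig` and its `block_size` field are assumptions about lmdeploy's internals, not confirmed by this thread.

```python
# Illustrative sketch only. TurbomindEngineConfig.cache_block_seq_len follows
# this PR; PytorchEngineConfig.block_size is an assumed field name.
from lmdeploy import PytorchEngineConfig, TurbomindEngineConfig

def build_engine_config(backend: str, cache_block_seq_len: int = 64):
    """Map the shared --cache-block-seq-len value onto the chosen engine."""
    if backend == 'turbomind':
        return TurbomindEngineConfig(cache_block_seq_len=cache_block_seq_len)
    return PytorchEngineConfig(block_size=cache_block_seq_len)
```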
Force-pushed from e0f26be to 99d7bcf
@lvhan028 I applied …
lmdeploy/cli/utils.py (Outdated)
```python
'--cache-block-seq-len',
type=int,
default=64,
help='The length of the token sequence in a k/v block')
```
Better to add more help info, e.g. that it should be a multiple of 32/64, and the valid range?
@RunningLeon @lvhan028 For the Turbomind engine, it should be a multiple of 32 or 64, depending on the GPU compute capability version. The Pytorch engine may have different requirements. Do we need to specify the requirement here?
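As a sketch of the constraint described in that comment, a validation helper could look like the following. The exact capability-to-multiplier mapping (>= 8.0 requiring a multiple of 32, otherwise 64) is an assumption drawn from the discussion, not the merged code.

```python
# Hedged sketch: validate cache_block_seq_len against the GPU's compute
# capability. The mapping below is an assumption, not the final rule.
import torch

def check_cache_block_seq_len(cache_block_seq_len: int) -> None:
    major, _ = torch.cuda.get_device_capability()
    multiplier = 32 if major >= 8 else 64
    if cache_block_seq_len % multiplier != 0:
        raise ValueError(
            f'cache_block_seq_len must be a multiple of {multiplier} '
            f'on GPUs with compute capability {major}.x')
```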
More help messages have been added. @lvhan028 could you help review this PR?
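The thread does not show the final wording; a plausible form of the expanded help, with hypothetical phrasing:

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical expanded help text; the merged wording may differ.
parser.add_argument(
    '--cache-block-seq-len',
    type=int,
    default=64,
    help='The length of the token sequence in a k/v block. For the '
         'Turbomind engine, it should be a multiple of 32 or 64, depending '
         'on the GPU compute capability. The Pytorch engine may have '
         'different requirements.')
```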
LGTM
Force-pushed from c87598a to b942632
Motivation
As mentioned in #1195, `cache_block_seq_len` is an important parameter for performance. We should expose it to users.

Modification
Added the `--cache-block-seq-len` argument to the serve and chat APIs and to some benchmark scripts. The default value is 128, which is consistent with the previous internal setting.
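For reference, a minimal usage sketch through the Python API, assuming `TurbomindEngineConfig` exposes the field under the same name as the CLI flag (the model path is just an example):

```python
# Minimal sketch; the field name and model path are illustrative.
from lmdeploy import TurbomindEngineConfig, pipeline

config = TurbomindEngineConfig(cache_block_seq_len=64)  # tokens per k/v block
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=config)
print(pipe(['Hello!']))
```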