
Expose cache_block_seq_len to API #1218

Merged 11 commits into InternLM:main on Mar 19, 2024

Conversation

ispobock
Contributor

Motivation

As mentioned in #1195, cache_block_seq_len is an important parameter for performance. We should expose it to users.

Modification

Added the --cache-block-seq-len argument to the serve and chat APIs, as well as to some benchmark scripts. The default value is 128, consistent with the previous internal setting.
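
For context, cache_block_seq_len also corresponds to an attribute of the engine config, so the same setting can be made programmatically. A minimal sketch, assuming LMDeploy's pipeline API; the model path is only a placeholder:

# Minimal sketch (assumption): set the k/v block length via the engine config.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(cache_block_seq_len=128)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe('Hello'))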

@ispobock
Contributor Author

@lvhan028 @lzhangzz @AllentDan Could you help review this PR?

@zhyncs
Collaborator

zhyncs commented Feb 29, 2024

LGTM

@lvhan028
Collaborator

@ispobock I am going to postpone this PR's review, since we are prioritizing #1211, which is very important for the upcoming release. #1211 proposes a fallback strategy for cases where the turbomind engine does not yet support a model but the pytorch engine does.

Back to this PR: since both engines define the size of the k/v cache block, I think a common attribute name is better.
However, each engine has different requirements on the value of the k/v cache block,
so we need to handle some cases carefully, such as the turbomind engine falling back to the pytorch engine.

@ispobock
Contributor Author

ispobock commented Mar 1, 2024

@lvhan028 Sure. For the PyTorch engine, there is a block_size argument with a default value of 64; currently it is not exposed to users. Before making the change, here are some points to confirm:

  1. Shall we also expose block_size for the PyTorch engine?
  2. Shall we use a common argument for both engines? If yes, which name should we take: block_size or cache_block_seq_len?
  3. If the fallback strategy is triggered (a sketch follows this list):
    • If the user set the argument value, maybe we can just use it
    • If not, maybe we need to change the default value from 128 to 64 for the PyTorch engine
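
A rough sketch of the default-value handling in point 3; the function and constant names are illustrative assumptions, not actual LMDeploy code:

# Illustrative only: pick the block length when the fallback is triggered.
TURBOMIND_DEFAULT = 128  # previous TurboMind default
PYTORCH_DEFAULT = 64     # PyTorch engine default

def resolve_block_len(user_value=None, use_pytorch_fallback=False):
    # If the user set the value explicitly, keep it for either engine.
    if user_value is not None:
        return user_value
    # Otherwise fall back to the engine-specific default.
    return PYTORCH_DEFAULT if use_pytorch_fallback else TURBOMIND_DEFAULT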

cc: @grimoire @RunningLeon @lzhangzz

@grimoire
Collaborator

grimoire commented Mar 1, 2024

block_size might not end up being the given value in the PyTorch engine if LoRA adapters have been loaded; it will be adjusted to match the dimensions of the adapters.

@ispobock
Contributor Author

ispobock commented Mar 1, 2024

block_size might not end up being the given value in the PyTorch engine if LoRA adapters have been loaded; it will be adjusted to match the dimensions of the adapters.

@grimoire If the user provides both an adapter and block_size, maybe we should ignore the block_size and log a warning to the user.
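
A hedged sketch of that behavior; the helper function and its arguments are hypothetical and only mirror the names used in this discussion:

import logging

logger = logging.getLogger(__name__)

# Hypothetical helper: ignore a user-supplied block_size when adapters are
# loaded, since the engine will adjust it to match the adapter dimensions.
def reconcile_block_size(block_size, adapters):
    if adapters and block_size is not None:
        logger.warning('block_size=%s is ignored because LoRA adapters are '
                       'loaded; the engine will infer a compatible value.',
                       block_size)
        return None
    return block_size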

@grimoire
Collaborator

grimoire commented Mar 4, 2024

logger.warning(f'infered block size: {block_size}')
This is the warning that is emitted when the block size is changed by the engine.

@ispobock
Contributor Author

ispobock commented Mar 6, 2024

@lvhan028 It seems that block_size is a more common name for users; maybe we can use it for both engines.
If we use a common argument for both engines, it's not convenient to handle different default values.
In #1195 (comment), we found that block_size = 64 performs better than 128 for the TurboMind engine. Shall we change the TurboMind default value to 64 to keep it consistent with the PyTorch engine?

@lvhan028
Collaborator

lvhan028 commented Mar 6, 2024

I agree with the default value of 64.
But I prefer cache_block_seq_len, since it goes with cache_max_entry_count and literally means the length of the token sequence held by a block.

@ispobock ispobock force-pushed the expose_cache_block_seq_len branch from e0f26be to 99d7bcf on March 6, 2024 09:35
@ispobock
Contributor Author

ispobock commented Mar 6, 2024

@lvhan028 I applied the cache_block_seq_len argument to both engines and changed the default value to 64. The engine config mapping for the cache_block_seq_len attribute in the auto-backend is also handled. Please help review.
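
A rough sketch of what that auto-backend mapping could look like; the config classes below are stand-ins defined for the example, not LMDeploy's real classes:

from dataclasses import dataclass

# Stand-in configs for illustration (names and fields are assumptions).
@dataclass
class TurbomindEngineConfig:
    cache_block_seq_len: int = 64

@dataclass
class PytorchEngineConfig:
    cache_block_seq_len: int = 64

def build_engine_config(backend, cache_block_seq_len=64):
    # Map the single user-facing argument onto whichever engine is selected,
    # including the case where turbomind falls back to pytorch.
    if backend == 'turbomind':
        return TurbomindEngineConfig(cache_block_seq_len=cache_block_seq_len)
    return PytorchEngineConfig(cache_block_seq_len=cache_block_seq_len)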

@lvhan028 lvhan028 requested review from lvhan028 and RunningLeon March 6, 2024 13:48
'--cache-block-seq-len',
type=int,
default=64,
help='The length of the token sequence in a k/v block')
Collaborator

Better to add more help info, e.g. that it should be a multiple of 32/64, and the valid range?

Contributor Author

@RunningLeon @lvhan028 For the TurboMind engine, it should be a multiple of 32 or 64, depending on the GPU compute capability version. The PyTorch engine may have different requirements. Do we need to specify the requirement here?

Contributor Author

More help messages have been added. @lvhan028 could you help review this PR?
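
For reference, an expanded help string along the lines discussed above might look like this (a sketch only; the final wording in the PR may differ):

import argparse

parser = argparse.ArgumentParser()
# Sketch of a more descriptive help message for the exposed argument.
parser.add_argument(
    '--cache-block-seq-len',
    type=int,
    default=64,
    help='The length of the token sequence in a k/v block. For the TurboMind '
         'engine, it should be a multiple of 32 or 64, depending on the GPU '
         'compute capability; for the PyTorch engine, the value may be '
         'adjusted by the engine, e.g. when LoRA adapters are loaded.')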

@RunningLeon RunningLeon (Collaborator) left a comment

LGTM

@ispobock ispobock force-pushed the expose_cache_block_seq_len branch from c87598a to b942632 on March 12, 2024 09:20
@lvhan028 lvhan028 merged commit 45cc5c5 into InternLM:main Mar 19, 2024
3 of 5 checks passed