
Expose cache_block_seq_len to API #1218

Merged 11 commits into InternLM:main on Mar 19, 2024

Conversation

ispobock
Contributor

Motivation

As mentioned in #1195, cache_block_seq_len is an important parameter for performance. We should expose it to users.

Modification

Added the --cache-block-seq-len argument to the serve and chat APIs, as well as to some benchmark scripts. The default value is 128, consistent with the previous internal setting.
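
For context, cache_block_seq_len also corresponds to an attribute of the engine config, so the same setting can be made programmatically. A minimal sketch, assuming LMDeploy's pipeline API; the model path is only a placeholder:

# Minimal sketch (assumption): set the k/v block length via the engine config.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(cache_block_seq_len=128)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe('Hello'))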

@ispobock
Contributor Author

@lvhan028 @lzhangzz @AllentDan Could you help review this PR?

@zhyncs
Collaborator

zhyncs commented Feb 29, 2024

LGTM

@lvhan028
Collaborator

@ispobock I am going to postpone this PR's review, since we are prioritizing #1211, which is very important for the upcoming release. #1211 proposes a fallback strategy for cases where the turbomind engine does not yet support a model but the pytorch engine does.

Back to this PR: since both engines define the size of the k/v cache block, I think a common attribute name is better.
However, each engine has different requirements on the value of the k/v cache block,
so we need to handle some cases carefully, such as the turbomind engine falling back to the pytorch engine.

@ispobock
Contributor Author

ispobock commented Mar 1, 2024

@lvhan028 Sure. For the PyTorch engine, there is a block_size argument with a default value of 64; currently it is not exposed to users. Before making the change, here are some points to confirm:

  1. Shall we also expose block_size for the PyTorch engine?
  2. Shall we use a common argument for both engines? If yes, which name should we take: block_size or cache_block_seq_len?
  3. If the fallback strategy is triggered (a sketch follows this list):
    • If the user set the argument value, maybe we can just use it
    • If not, maybe we need to change the default value from 128 to 64 for the PyTorch engine
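
A rough sketch of the default-value handling in point 3; the function and constant names are illustrative assumptions, not actual LMDeploy code:

# Illustrative only: pick the block length when the fallback is triggered.
TURBOMIND_DEFAULT = 128  # previous TurboMind default
PYTORCH_DEFAULT = 64     # PyTorch engine default

def resolve_block_len(user_value=None, use_pytorch_fallback=False):
    # If the user set the value explicitly, keep it for either engine.
    if user_value is not None:
        return user_value
    # Otherwise fall back to the engine-specific default.
    return PYTORCH_DEFAULT if use_pytorch_fallback else TURBOMIND_DEFAULT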

cc: @grimoire @RunningLeon @lzhangzz

@grimoire
Collaborator

grimoire commented Mar 1, 2024

block_size might not end up being the given value in the PyTorch engine if LoRA adapters have been loaded; it will be adjusted to match the dimensions of the adapters.

@ispobock
Contributor Author

ispobock commented Mar 1, 2024

block_size might not end up being the given value in the PyTorch engine if LoRA adapters have been loaded; it will be adjusted to match the dimensions of the adapters.

@grimoire If the user provides both an adapter and block_size, maybe we should ignore the block_size and log a warning to the user.
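
A hedged sketch of that behavior; the helper function and its arguments are hypothetical and only mirror the names used in this discussion:

import logging

logger = logging.getLogger(__name__)

# Hypothetical helper: ignore a user-supplied block_size when adapters are
# loaded, since the engine will adjust it to match the adapter dimensions.
def reconcile_block_size(block_size, adapters):
    if adapters and block_size is not None:
        logger.warning('block_size=%s is ignored because LoRA adapters are '
                       'loaded; the engine will infer a compatible value.',
                       block_size)
        return None
    return block_size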

@grimoire
Collaborator

grimoire commented Mar 4, 2024

logger.warning(f'infered block size: {block_size}')
This is the warning that is emitted when the block size is changed by the engine.

@ispobock
Contributor Author

ispobock commented Mar 6, 2024

@lvhan028 It seems that block_size is a more common name for users; maybe we can use it for both engines.
If we use a common argument for both engines, it's not convenient to handle different default values.
In #1195 (comment), we found that block_size = 64 performs better than 128 for the TurboMind engine. Shall we change the TurboMind default value to 64 to keep it consistent with the PyTorch engine?

@lvhan028
Collaborator

lvhan028 commented Mar 6, 2024

I agree with the default value of 64.
But I prefer cache_block_seq_len, since it goes with cache_max_entry_count and literally means the length of the token sequence held by a block.

@ispobock ispobock force-pushed the expose_cache_block_seq_len branch from e0f26be to 99d7bcf on March 6, 2024 09:35
@ispobock
Contributor Author

ispobock commented Mar 6, 2024

@lvhan028 I applied the cache_block_seq_len argument to both engines and changed the default value to 64. The engine config mapping for the cache_block_seq_len attribute in the auto-backend is also handled. Please help review.
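
A rough sketch of what that auto-backend mapping could look like; the config classes below are stand-ins defined for the example, not LMDeploy's real classes:

from dataclasses import dataclass

# Stand-in configs for illustration (names and fields are assumptions).
@dataclass
class TurbomindEngineConfig:
    cache_block_seq_len: int = 64

@dataclass
class PytorchEngineConfig:
    cache_block_seq_len: int = 64

def build_engine_config(backend, cache_block_seq_len=64):
    # Map the single user-facing argument onto whichever engine is selected,
    # including the case where turbomind falls back to pytorch.
    if backend == 'turbomind':
        return TurbomindEngineConfig(cache_block_seq_len=cache_block_seq_len)
    return PytorchEngineConfig(cache_block_seq_len=cache_block_seq_len)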

@lvhan028 lvhan028 requested review from lvhan028 and RunningLeon March 6, 2024 13:48
'--cache-block-seq-len',
type=int,
default=64,
help='The length of the token sequence in a k/v block')
Collaborator

Better to add more help info, e.g. that it should be a multiple of 32/64, and the valid range?

Contributor Author

@RunningLeon @lvhan028 For the TurboMind engine, it should be a multiple of 32 or 64, depending on the GPU compute capability version. The PyTorch engine may have different requirements. Do we need to specify the requirement here?

Contributor Author

More help messages have been added. @lvhan028 could you help review this PR?
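
For reference, an expanded help string along the lines discussed above might look like this (a sketch only; the final wording in the PR may differ):

import argparse

parser = argparse.ArgumentParser()
# Sketch of a more descriptive help message for the exposed argument.
parser.add_argument(
    '--cache-block-seq-len',
    type=int,
    default=64,
    help='The length of the token sequence in a k/v block. For the TurboMind '
         'engine, it should be a multiple of 32 or 64, depending on the GPU '
         'compute capability; for the PyTorch engine, the value may be '
         'adjusted by the engine, e.g. when LoRA adapters are loaded.')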

@RunningLeon RunningLeon (Collaborator) left a comment

LGTM

@ispobock ispobock force-pushed the expose_cache_block_seq_len branch from c87598a to b942632 on March 12, 2024 09:20
@lvhan028 lvhan028 merged commit 45cc5c5 into InternLM:main Mar 19, 2024
3 of 5 checks passed