Fixes for the decode bucketing in non-contiguous pa scenario #1122
Changes from all commits
5ac1801
8d1c242
85be44c
806769d
c15ec47
b1f9660
c3cc743
2c3b44b
```diff
@@ -63,17 +63,18 @@ def get_decode_cfgs(self, max_num_seqs, block_size, max_num_batched_tokens, max_
     if contiguous_pa:
         max_decode_blocks = max_blocks
     decode_block_bucket_cfg = read_bucket_settings('decode', 'block', min=1, step=block_size, max=max_decode_blocks)
-    if decode_block_bucket_cfg[2] > max_blocks:
+    if contiguous_pa and decode_block_bucket_cfg[2] > max_blocks:
```
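The effect of the changed condition, clamping the configured bucket max to `max_blocks` only when contiguous PA is enabled, can be illustrated with a minimal sketch (`clamp_bucket_max` is a hypothetical helper name, not the actual vLLM function):

```python
def clamp_bucket_max(cfg, max_blocks, contiguous_pa):
    """Clamp a [min, step, max] decode-block bucket config.

    With the fix, the max is clamped to max_blocks only in the
    contiguous-PA case; non-contiguous PA keeps the configured max,
    so its larger buckets still get warmed up.
    """
    cfg = list(cfg)
    if contiguous_pa and cfg[2] > max_blocks:
        cfg[2] = max_blocks
    return cfg

print(clamp_bucket_max([128, 128, 4096], 2048, True))   # contiguous PA: max clamped to 2048
print(clamp_bucket_max([128, 128, 4096], 2048, False))  # non-contiguous PA: max kept at 4096
```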
|
Collaborator
What is the decode bucket count now?
Collaborator
Author
This is a bug fix for the missing buckets which caused "not warmed-up" warnings in non-contiguous PA cases. The actual number of decode buckets for linear bucketing is sensitive to the …
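For context, linear bucketing of this kind could be sketched as follows; this is a hypothetical helper assuming buckets spaced by `step` from `min` up to and including `max`, and the real bucketing algorithm may differ:

```python
def linear_buckets(bucket_min, step, bucket_max):
    # Hypothetical sketch: buckets from bucket_min up to bucket_max,
    # spaced by step; bucket_max itself is always included so the
    # largest shape gets warmed up.
    buckets = list(range(bucket_min, bucket_max + 1, step))
    if not buckets or buckets[-1] != bucket_max:
        buckets.append(bucket_max)
    return buckets

print(linear_buckets(128, 128, 512))  # [128, 256, 384, 512]
```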
```diff
     logger().info(
         f'VLLM_DECODE_BLOCK_BUCKET_MAX={decode_block_bucket_cfg[2]} is higher than max_blocks={max_blocks}. Your configuration VLLM_DECODE_BLOCK_BUCKET_MAX={decode_block_bucket_cfg[2]} will be overwritten to VLLM_DECODE_BLOCK_BUCKET_MAX={max_blocks}'
     )
     decode_block_bucket_cfg[2] = max_blocks
-    if decode_block_bucket_cfg[0] > max_blocks:
-        decode_block_bucket_min = max(1, max_blocks - decode_block_bucket_cfg[1])
-        logger().info(
-            f'VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_cfg[0]} is higher than max_blocks={max_blocks}. Your configuration VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_cfg[0]} will be overwritten to VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_min}'
-        )
-        decode_block_bucket_cfg[0] = decode_block_bucket_min
+    if decode_block_bucket_cfg[0] > decode_block_bucket_cfg[2]:
+        decode_block_bucket_min = max(1, decode_block_bucket_cfg[2] - decode_block_bucket_cfg[1])
+        logger().info(
+            f"VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_cfg[0]} is higher than max_blocks={decode_block_bucket_cfg[2]}. Your configuration VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_cfg[0]} will be overwritten to VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_min}"
+        )
+        decode_block_bucket_cfg[0] = decode_block_bucket_min

     msg = ("Decode bucket config (min, step, max_warmup) "
            f"bs:{decode_bs_bucket_cfg}, "
```
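The min-clamping step in the diff above can be sketched in isolation (`clamp_bucket_min` is a hypothetical helper name mirroring the diff, not the actual function):

```python
def clamp_bucket_min(cfg):
    """Given a [min, step, max] bucket config, pull the min down to one
    step below the max when it exceeds the max, but never below 1."""
    cfg = list(cfg)
    if cfg[0] > cfg[2]:
        cfg[0] = max(1, cfg[2] - cfg[1])
    return cfg

print(clamp_bucket_min([4096, 128, 2048]))  # min becomes 2048 - 128 = 1920
print(clamp_bucket_min([64, 128, 2048]))    # min already valid, unchanged
```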
What is the maximum `expected_max` value here? This value is calculated; in fact, the calculation is copied from the algorithm. This test should check specific values and verify that the filter in fact reduces the number of buckets.
The `expected_max` is calculated based on the corner decoding case with `context_lens = [max_model_len - 1] + [1] * (max_num_seqs - 1)`. The context lens will be padded to the max one, and the total decode blocks can be calculated with `math.ceil((max_model_len - 1) / block_size) * max_num_seqs`, which can be simplified to `math.ceil(max_model_len / block_size) * max_num_seqs` since `max_model_len >> block_size` in common cases.

For example, with `max_model_len=16384`, `max_num_seqs=128`, and `block_size=128`, the corner decoding case with `context_lens = [16383, 1, 1, ..., 1]` needs `ceil(16383 / 128) * 128 = ceil(16384 / 128) * 128 = 16384` blocks after padding.

A unit test for the newly added filter has also been added. Thanks for the reminder.
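The calculation described in that comment can be sketched directly (`expected_max_blocks` is a hypothetical name for illustration, not the test's actual helper):

```python
import math

def expected_max_blocks(max_model_len, max_num_seqs, block_size):
    # Corner decode case: context_lens = [max_model_len - 1] + [1] * (max_num_seqs - 1).
    # All context lengths are padded to the longest one, so every sequence
    # needs ceil((max_model_len - 1) / block_size) blocks.
    return math.ceil((max_model_len - 1) / block_size) * max_num_seqs

print(expected_max_blocks(16384, 128, 128))  # 16384, matching the worked example
```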