Skip to content

Fix decode bucket filter issues from #1122#1447

Merged
kamil-kaczor merged 3 commits into
vllm-project:mainfrom
yangulei:decode_filter
May 18, 2026
Merged

Fix decode bucket filter issues from #1122#1447
kamil-kaczor merged 3 commits into
vllm-project:mainfrom
yangulei:decode_filter

Conversation

@yangulei
Copy link
Copy Markdown
Collaborator

@yangulei yangulei commented May 14, 2026

Fixes

Two bugs introduced by #1122 (commit f24f3f9):

1. IndexError when using file-based bucketing (GAUDISW-248587)

When VLLM_BUCKETING_FROM_FILE is used (e.g. GraniteMoeHybrid model), ctx_range is passed as an empty list to generate_buckets(). The num_ctx_tokens_less_or_equal_batched_max_model_len filter accessed ctx_range[0] unconditionally, causing IndexError: list index out of range.

Fix: Safe access with fallback to 0 when ctx_range is empty.

2. Contiguous PA decode buckets incorrectly filtered (GAUDISW-248598)

The ctx filter was applied to contiguous PA decode buckets, incorrectly dropping valid buckets. For example, with max_model_len=2048, block_size=256, max_num_seqs=256, bucket (256, 1, 2112) was filtered because 2112 > ceil(2048/256)*256 = 2048, but 2112 is a valid user-configured VLLM_DECODE_BLOCK_BUCKET_MAX.

Fix: Remove the ctx filter from contiguous PA decode buckets. For contiguous PA, the block range is already bounded by max_blocks in the bucketing strategies.

Tests

  • Added test_file_buckets_with_empty_ctx_range_no_crash — reproduces the server.log IndexError
  • Added test_contiguous_pa_decode_buckets_not_filtered_by_ctx — reproduces the std_out.txt issue
  • Narrowed test_decode_buckets_satisfy_ctx_filter to non-contiguous PA only
  • Updated docstrings

Signed-off-by: Youlei Yang youlei.yang@intel.com

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes two regressions introduced in #1122 affecting decode bucket filtering: (1) avoid IndexError when file-based bucketing passes an empty ctx_range, and (2) stop incorrectly filtering valid contiguous-PA decode buckets based on the ctx filter (since contiguous-PA block ranges are already bounded by max_blocks).

Changes:

  • Guard ctx_range[0] access in num_ctx_tokens_less_or_equal_batched_max_model_len by deriving a safe ctx_min when ctx_range is empty.
  • Remove the ctx filter from contiguous-PA decode bucket filtering.
  • Add/adjust unit tests to cover the empty-ctx_range file-bucketing path and contiguous-PA decode bucket retention.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
vllm_gaudi/extension/bucketing/common.py Makes ctx filtering safe for empty ctx_range and disables ctx filtering for contiguous-PA decode buckets.
tests/unit_tests/test_bucketing.py Adds regression tests for the two reported issues and scopes ctx-filter assertions to non-contiguous PA.

@github-actions
Copy link
Copy Markdown

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

@github-actions
Copy link
Copy Markdown

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

@yangulei
Copy link
Copy Markdown
Collaborator Author

@iboiko-habana @kamil-kaczor
Please help to review this PR. It has been validated by the reporters for GAUDISW-248598 and GAUDISW-248587.
Thanks!

@adobrzyn
Copy link
Copy Markdown
Collaborator

Finding 1 🔴 Critical · vllm_gaudi/extension/bucketing/common.py:L451

With ctx_min = ctx_range[0] if ctx_range else 0, the file-bucketing path (ctx_range == []) sets ctx_min = 0, so the condition ctx > ctx_min is true for every bucket with ctx > 0 and the filter is now applied to user-provided file buckets — silently rejecting any whose ctx > ceil(max_model_len/block_size) * bs.

Worked example using parameters from the new test_file_buckets_with_empty_ctx_range_no_crash (max_model_len=131072, block_size=528, non-contiguous PA, file_buckets=[(1,1,256), (1,1,512), (2,1,256), (2,1,512), (32,1,2424)]):

  • ceil(131072/528) = 249
  • (1,1,256): 1*249=249, 256 > 249dropped
  • (1,1,512): 512 > 249dropped
  • (2,1,256): 2*249=498, 256 ≤ 498 → kept
  • (2,1,512): 512 > 498dropped
  • (32,1,2424): 32*249=7968, 2424 ≤ 7968 → kept

3 of 5 explicitly-configured file buckets are silently filtered out. The new test only asserts len(buckets) > 0, which masks this. The num_ctx_tokens_less_or_equal_batched_max_model_len filter from #1122 was designed to limit auto-generated warmup explosion — it should not be silently rejecting buckets the user explicitly configured via VLLM_BUCKETING_FROM_FILE.

Suggestion: Skip filters entirely for file-provided buckets (file = explicit user intent), e.g. in generate_buckets:

filters = [] if file_buckets else get_filters(is_prompt, use_merged_prefill, use_contiguous_pa)

This preserves #1122's filter for auto-generated buckets (its actual target) and avoids any behavior change on the file path beyond "don't crash". Also strengthen the regression test to assert all five file buckets are preserved:

assert len(buckets) == len(file_buckets)
for fb in file_buckets:
    assert fb in buckets

[- Reviewed by Awesome ChlOpus]

yangulei added 3 commits May 18, 2026 02:42
…eting)

When VLLM_BUCKETING_FROM_FILE is used, ctx_range is passed as an empty
list to generate_buckets(). The num_ctx_tokens_less_or_equal_batched_max_model_len
filter accessed ctx_range[0] unconditionally, causing IndexError.

Fix: use safe access with fallback to 0 when ctx_range is empty.

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
For contiguous PA, the block range is already bounded by max_blocks in
the bucketing strategies, so the num_ctx_tokens_less_or_equal_batched_max_model_len
filter is unnecessary and incorrectly drops valid buckets.

Example: with max_model_len=2048, block_size=256, max_num_seqs=256,
bucket (256, 1, 2112) was filtered because 2112 > ceil(2048/256)*256=2048,
but 2112 is a valid user-configured VLLM_DECODE_BLOCK_BUCKET_MAX.

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
- Set filters to empty list when file_buckets is provided in
  generate_buckets(), since file-based bucketing should pass all
  user-specified buckets through without filtering.
- Rename test_file_buckets_with_empty_ctx_range_no_crash to
  test_file_buckets_bypass_filters to reflect the actual behavior.
- Add bucket (512,1,256) that would be rejected by
  batch_size_smaller_than_blocks filter to prove filters are skipped.
- Update test_decode_buckets_satisfy_ctx_filter to only run for
  non-contiguous PA since contiguous PA decode buckets are not filtered.
- Update docstring for test_exponential_decode_block_limit_uncapped.

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
@yangulei
Copy link
Copy Markdown
Collaborator Author

Finding 1 🔴 Critical · vllm_gaudi/extension/bucketing/common.py:L451

With ctx_min = ctx_range[0] if ctx_range else 0, the file-bucketing path (ctx_range == []) sets ctx_min = 0, so the condition ctx > ctx_min is true for every bucket with ctx > 0 and the filter is now applied to user-provided file buckets — silently rejecting any whose ctx > ceil(max_model_len/block_size) * bs.

Worked example using parameters from the new test_file_buckets_with_empty_ctx_range_no_crash (max_model_len=131072, block_size=528, non-contiguous PA, file_buckets=[(1,1,256), (1,1,512), (2,1,256), (2,1,512), (32,1,2424)]):

  • ceil(131072/528) = 249
  • (1,1,256): 1*249=249, 256 > 249dropped
  • (1,1,512): 512 > 249dropped
  • (2,1,256): 2*249=498, 256 ≤ 498 → kept
  • (2,1,512): 512 > 498dropped
  • (32,1,2424): 32*249=7968, 2424 ≤ 7968 → kept

3 of 5 explicitly-configured file buckets are silently filtered out. The new test only asserts len(buckets) > 0, which masks this. The num_ctx_tokens_less_or_equal_batched_max_model_len filter from #1122 was designed to limit auto-generated warmup explosion — it should not be silently rejecting buckets the user explicitly configured via VLLM_BUCKETING_FROM_FILE.

Suggestion: Skip filters entirely for file-provided buckets (file = explicit user intent), e.g. in generate_buckets:

filters = [] if file_buckets else get_filters(is_prompt, use_merged_prefill, use_contiguous_pa)

This preserves #1122's filter for auto-generated buckets (its actual target) and avoids any behavior change on the file path beyond "don't crash". Also strengthen the regression test to assert all five file buckets are preserved:

assert len(buckets) == len(file_buckets)
for fb in file_buckets:
    assert fb in buckets

[- Reviewed by Awesome ChlOpus]

Done by Skip filters for file-based bucketing and refactor tests.

@github-actions
Copy link
Copy Markdown

✅ CI Passed

All checks passed successfully against the following vllm commit:
54f548e9e58087f0155e4e164e416ad7efdfde6d

Copy link
Copy Markdown
Collaborator

@kamil-kaczor kamil-kaczor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kamil-kaczor kamil-kaczor merged commit e9b8f08 into vllm-project:main May 18, 2026
2 checks passed
@yangulei yangulei deleted the decode_filter branch May 18, 2026 08:00
yangulei added a commit to yangulei/vllm-gaudi that referenced this pull request May 18, 2026
…project#1447)"

This reverts commit e9b8f08.

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
iboiko-habana pushed a commit that referenced this pull request May 18, 2026
## Fixes

Two bugs introduced by #1122 (commit f24f3f9):

### 1. IndexError when using file-based bucketing (GAUDISW-248587)
When `VLLM_BUCKETING_FROM_FILE` is used (e.g. GraniteMoeHybrid model),
`ctx_range` is passed as an empty list to `generate_buckets()`. The
`num_ctx_tokens_less_or_equal_batched_max_model_len` filter accessed
`ctx_range[0]` unconditionally, causing `IndexError: list index out of
range`.

**Fix**: Safe access with fallback to 0 when `ctx_range` is empty.

### 2. Contiguous PA decode buckets incorrectly filtered
(GAUDISW-248598)
The ctx filter was applied to contiguous PA decode buckets, incorrectly
dropping valid buckets. For example, with `max_model_len=2048`,
`block_size=256`, `max_num_seqs=256`, bucket `(256, 1, 2112)` was
filtered because `2112 > ceil(2048/256)*256 = 2048`, but 2112 is a valid
user-configured `VLLM_DECODE_BLOCK_BUCKET_MAX`.

**Fix**: Remove the ctx filter from contiguous PA decode buckets. For
contiguous PA, the block range is already bounded by `max_blocks` in the
bucketing strategies.

## Tests
- Added `test_file_buckets_with_empty_ctx_range_no_crash` — reproduces
the server.log IndexError
- Added `test_contiguous_pa_decode_buckets_not_filtered_by_ctx` —
reproduces the std_out.txt issue
- Narrowed `test_decode_buckets_satisfy_ctx_filter` to non-contiguous PA
only
- Updated docstrings

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants