
Combined changes from PRs #1122, #1155, #1346 for 256K context support on Gaudi2 #1348

Draft: Copilot wants to merge 10 commits into main from copilot/review-issue-1347

Conversation

Contributor

Copilot AI commented Apr 14, 2026

Summary

Add a comprehensive bucketing test suite for the four PRs addressing 256K model-length support on Gaudi2 (issue #1347). This PR contains only test files; there are no production code changes.

Test Files

tests/unit_tests/test_bucketing_issue_1347.py (~1217 lines, ~50 tests)

Tests covering bucketing contracts across all four PRs, grouped into these categories:

  • Exponential decode config formulas (contiguous vs non-contiguous PA)
  • Block limit cap removal verification
  • generate_buckets() filter behavior (bs ≤ ctx, ctx ≤ batched max)
  • Padding-aware strategy bucket generation and padding bounds
  • Linear strategy decode block overflow protection
  • Full HPUBucketingManager.initialize() integration for 256K scenarios
  • Fallback bucket correctness
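The bucket-generation and filter contracts in the list above can be sketched in a few lines. This is a minimal illustration with hypothetical helper names; the real formulas live under vllm_gaudi/extension/bucketing/.

```python
def exponential_buckets(bmin: int, bstep: int, bmax: int) -> list[int]:
    """Sketch of exponential bucket growth: multiply by bstep starting
    from bmin, always including bmax itself so the largest shape is
    warmed up."""
    buckets = set()
    value = bmin
    while value < bmax:
        buckets.add(value)
        value *= bstep
    buckets.add(bmax)
    return sorted(buckets)


def filter_buckets(pairs: list[tuple[int, int]],
                   batched_max_tokens: int) -> list[tuple[int, int]]:
    """Sketch of the generate_buckets() filter: drop (bs, ctx) pairs that
    can never occur at runtime (bs > ctx, or ctx beyond the batched
    max-model-len budget)."""
    return [(bs, ctx) for bs, ctx in pairs
            if bs <= ctx and ctx <= batched_max_tokens]
```

With bmin=1, bstep=2, bmax=16 this yields [1, 2, 4, 8, 16]; the filter then prunes combinations downstream code could never request, which is what keeps 256K configurations from exploding the bucket count.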

tests/unit_tests/test_bucketing_warmup_time.py (~830 lines, ~25 tests)

Tests enforcing warmup-time budgets via bucket-count bounds:

  • Regression scenarios (small batch + 131K context, 256K non-contiguous PA)
  • Sub-linear scaling verification
  • Strategy comparison (exponential vs linear vs padding-aware)
  • Edge cases and torch._dynamo.config.cache_size_limit bounds
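Warmup time is enforced indirectly here: each bucket costs one compilation during warmup, so bounding the bucket count bounds warmup time and keeps torch._dynamo's recompile cache from overflowing. A minimal sketch of such a budget check (the bound values are illustrative, not the PR's actual thresholds):

```python
def assert_warmup_budget(buckets: list, max_buckets: int,
                         cache_size_limit: int) -> None:
    """Fail if the bucket set would blow the warmup budget or overflow
    the dynamo cache (roughly one cache entry per compiled shape)."""
    count = len(buckets)
    assert count <= max_buckets, (
        f"{count} buckets exceed the warmup budget of {max_buckets}")
    assert count <= cache_size_limit, (
        f"{count} buckets would overflow the dynamo cache "
        f"(limit {cache_size_limit})")


# 256K decode scenario: bucket count must stay sub-linear in context
# length, so even a large context yields only a handful of buckets.
decode_buckets = [(1, 2 ** i) for i in range(1, 12)]  # 11 buckets
assert_warmup_budget(decode_buckets, max_buckets=64, cache_size_limit=128)
```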

Dependencies

Some tests validate behavior from PRs #1122, #1155, and #1346 and will only pass after those PRs are merged. Tests for PR #762 (already merged) should pass against main.

Relates to: #1347

Copilot AI and others added 4 commits April 14, 2026 09:45
Signed-off-by: copilot <copilot@github.com>

Tests cover the four PRs addressing long-context bucketing:
- PR #762:  Padding-aware bucketing strategy (warmup ranges, configs, generation)
- PR #1122: Exponential decode block formula, limit cap, filter, linear fix
- PR #1155: FusedSDPA slicing contract (pad_max bounds, strategy selection)
- PR #1346: HPU graph capture skip (cudagraph size, warmup clamp scenarios)
- Cross-PR integration: end-to-end 256K scenario, fallback, regressions

49 test functions organized in 6 test classes.

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Signed-off-by: copilot <copilot@github.com>

Tests verify that bucket counts (proxy for warmup time) stay within
acceptable bounds across model configurations and bucketing strategies:

- Total/prompt/decode bucket count bounds across 5 model scenarios
- Regression tests for known bucket explosion (GAUDISW-247226, 256K decode)
- Sub-linear scaling verification: bucket count vs model length
- Strategy comparison: exponential vs linear vs padding-aware efficiency
- Dynamo cache_size_limit and accumulated_cache_size_limit bounds
- Individual range dimension sanity checks (bs, block, query, ctx)
- Warmup range edge cases: deduplication, padding parameters
- End-to-end manager tests: 256K, 8K fast warmup, speculative decoding

25 test functions organized in 8 test classes.

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Rename FP8 blockwise compressed-tensors scales to match HPU ops. Fixes a
regression in
https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512
introduced by #1220 and #1053.

---------

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>

Signed-off-by: GitHub Copilot <copilot@github.com>

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

Contributor

Copilot AI left a comment


Pull request overview

This PR aggregates three previously separate fixes to unblock 256K max_model_len on Gaudi2 (TP=1) by tightening decode bucketing, reducing FusedSDPA memory via Q/K/V slicing, and avoiding expensive HPU graph capture for very long prefills.

Changes:

  • Refines decode bucketing (formula + filtering) and adds warmup-time regression tests to keep bucket counts (and warmup time) bounded for long contexts.
  • Introduces chunked/sliced FusedSDPA (BF16 + FP8) to avoid materializing full attention masks at long context lengths, plus feature flags and unit tests.
  • Updates HPU graph usage heuristics and warmup dummy-input generation to prevent OOMs during long-context warmup.
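The slicing idea behind the second bullet can be illustrated in plain PyTorch. This sketches only the memory argument, not the Gaudi FusedSDPA kernel API: by iterating over query slices, the attention-score tensor is at most (slice_len, ctx) instead of (ctx, ctx), which is what makes very long contexts fit.

```python
import torch
import torch.nn.functional as F


def sliced_causal_attention(q: torch.Tensor, k: torch.Tensor,
                            v: torch.Tensor, slice_len: int = 1024):
    """Compute causal attention one query slice at a time so the score
    matrix never materializes at full (seq, seq) size."""
    seq = q.shape[-2]
    head_dim = q.shape[-1]
    outs = []
    for start in range(0, seq, slice_len):
        end = min(start + slice_len, seq)
        q_slice = q[..., start:end, :]
        # Each query row attends only to keys at positions <= its own,
        # so keys/values beyond `end` are never touched.
        scores = q_slice @ k[..., :end, :].transpose(-1, -2)
        scores = scores / head_dim ** 0.5
        causal = torch.triu(
            torch.ones(end - start, end, dtype=torch.bool),
            diagonal=start + 1)
        scores = scores.masked_fill(causal, float("-inf"))
        outs.append(F.softmax(scores, dim=-1) @ v[..., :end, :])
    return torch.cat(outs, dim=-2)
```

The production code additionally gates this path behind feature flags and handles FP8 scales; the sketch shows only why the sliced formulation avoids the full attention mask.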

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Show a summary per file:

  • vllm_gaudi/v1/worker/hpu_model_runner.py: Defaults max_cudagraph_capture_size, refines _use_graphs() to account for context tokens, and clamps warmup dummy decode blocks to KV-cache capacity.
  • vllm_gaudi/ops/hpu_compressed_tensors.py: Syncs FP8 sliced-module scale references after post-load weight processing.
  • vllm_gaudi/extension/utils.py: Adds sliced FusedSDPA implementations and dispatch logic (BF16 + FP8) behind config/env gating.
  • vllm_gaudi/extension/ops.py: Removes a causal+mask workaround to allow the sliced path to trigger (the wrapper still disables the unsupported default path).
  • vllm_gaudi/extension/features.py: Registers env flags and adds the enable_fsdpa_slicing feature predicate.
  • vllm_gaudi/extension/bucketing/linear.py: Fixes decode block max clamping behavior and min/max consistency handling.
  • vllm_gaudi/extension/bucketing/exponential.py: Updates the decode max-block formula and removes the prior limit cap in favor of downstream filtering.
  • vllm_gaudi/extension/bucketing/common.py: Adds decode bucket filtering based on batched max-model-len and optional debug logging of omitted buckets.
  • tests/unit_tests/worker/test_hpu_model_runner.py: Adds tests for cudagraph capture defaulting and _use_graphs() boundary behavior.
  • tests/unit_tests/test_fsdpa_slicing.py: New unit + (HPU-skipped) accuracy/graph-break coverage for sliced FusedSDPA and feature/env wiring.
  • tests/unit_tests/test_bucketing_warmup_time.py: New warmup-time budget tests using bucket-count bounds across scenarios.
  • tests/unit_tests/test_bucketing_issue_1347.py: New comprehensive bucketing contract tests spanning the combined changes.
  • tests/unit_tests/test_bucketing.py: Updates existing bucketing tests to match the new decode formula/limits.
  • tests/full_tests/model_cards/Qwen3-30B-A3B-FP8-Static-longbench.yaml: Adds a LongBench model card for the new e2e discoverable tests.
  • tests/full_tests/ci_e2e_discoverable_tests.sh: Adds LongBench discoverable test functions (baseline + slicing + fp8 KV variants).
  • docs/configuration/env_variables.md: Documents the new FusedSDPA slicing tuning env vars.
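The hpu_model_runner.py changes lend themselves to a short sketch. The function names and exact arithmetic here are assumptions; only the descriptions above come from the PR.

```python
def use_graphs(num_tokens: int, num_context_tokens: int,
               max_capture_size: int) -> bool:
    """Hypothetical sketch of the heuristic: fall back to eager mode when
    a step's total token footprint (new tokens plus cached context)
    exceeds the graph-capture budget, since capturing graphs for very
    long prefills costs more than it saves."""
    return num_tokens + num_context_tokens <= max_capture_size


def clamp_warmup_blocks(requested_blocks: int, kv_cache_blocks: int) -> int:
    """Warmup dummy decode inputs must never request more KV-cache blocks
    than actually exist, or warmup itself can OOM."""
    return min(requested_blocks, kv_cache_blocks)
```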

Comment thread on vllm_gaudi/extension/utils.py (outdated), lines +178 to +179:

    assert bucketing_manager is not None and bucketing_manager.initialized, \
        'Bucketing manager should be instantiated and initialized to enable FusedSDPA slicing.'

Comment thread on vllm_gaudi/extension/features.py (outdated):

          env_var_type=boolean),
    Value('use_hpu_aligned_scale', False, env_var='HPU_ALIGNED_SCALE', env_var_type=boolean),
    Value('enable_fsdpa_slicing',
          All(Eq('bucketing_strategy', 'pad'), Disabled('merged_prefill'), Kernel(fsdpa)),

A related snippet from the same review (the decode-block min/max clamp):

    if decode_block_bucket_cfg[0] > decode_block_bucket_cfg[2]:
        decode_block_bucket_min = max(1, decode_block_bucket_cfg[2] - decode_block_bucket_cfg[1])
        logger().info(
            f"VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_cfg[0]} is higher than "
            f"max_blocks={decode_block_bucket_cfg[2]}. Your configuration "
            f"VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_cfg[0]} will be overwritten to "
            f"VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_min}")
Copilot AI added a commit that referenced this pull request Apr 15, 2026
Agent-Logs-Url: https://github.com/vllm-project/vllm-gaudi/sessions/1d54e613-892a-46b1-b2af-4d9b9d321288

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Copilot AI and others added 2 commits April 15, 2026 20:02
Signed-off-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Remove all production code changes from PRs #1122, #1155, #1346 and keep
only the two test files created for issue #1347:
- tests/unit_tests/test_bucketing_issue_1347.py
- tests/unit_tests/test_bucketing_warmup_time.py

Signed-off-by: GitHub Copilot <copilot@github.com>

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@kamil-kaczor kamil-kaczor requested a review from jbyczkow as a code owner May 4, 2026 08:03
@michalkuligowski michalkuligowski marked this pull request as draft May 4, 2026 13:21

5 participants