Combined changes from PRs #1122, #1155, #1346 for 256K context support on Gaudi2 #1348
Draft
Copilot wants to merge 10 commits into
Conversation
Tests cover the four PRs addressing long-context bucketing:
- PR #762: Padding-aware bucketing strategy (warmup ranges, configs, generation)
- PR #1122: Exponential decode block formula, limit cap, filter, linear fix
- PR #1155: FusedSDPA slicing contract (pad_max bounds, strategy selection)
- PR #1346: HPU graph capture skip (cudagraph size, warmup clamp scenarios)
- Cross-PR integration: end-to-end 256K scenario, fallback, regressions

49 test functions organized in 6 test classes.

Signed-off-by: copilot <copilot@github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Tests verify that bucket counts (a proxy for warmup time) stay within acceptable bounds across model configurations and bucketing strategies:
- Total/prompt/decode bucket count bounds across 5 model scenarios
- Regression tests for known bucket explosion (GAUDISW-247226, 256K decode)
- Sub-linear scaling verification: bucket count vs. model length
- Strategy comparison: exponential vs. linear vs. padding-aware efficiency
- Dynamo cache_size_limit and accumulated_cache_size_limit bounds
- Individual range dimension sanity checks (bs, block, query, ctx)
- Warmup range edge cases: deduplication, padding parameters
- End-to-end manager tests: 256K, 8K fast warmup, speculative decoding

25 test functions organized in 8 test classes.

Signed-off-by: copilot <copilot@github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Rename FP8 blockwise compressed tensors scales to match HPU ops. Fixes a regression in https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512 caused by #1220 and #1053.

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Copilot created this pull request from a session on behalf of michalkuligowski on April 14, 2026 at 13:05.
🚧 CI Blocked: the main CI workflow was not started for the following reason:
Pull request overview
This PR aggregates three previously separate fixes to unblock 256K max_model_len on Gaudi2 (TP=1) by tightening decode bucketing, reducing FusedSDPA memory via Q/K/V slicing, and avoiding expensive HPU graph capture for very long prefills.
Changes:
- Refines decode bucketing (formula + filtering) and adds warmup-time regression tests to keep bucket counts (and warmup time) bounded for long contexts.
- Introduces chunked/sliced FusedSDPA (BF16 + FP8) to avoid materializing full attention masks at long context lengths, plus feature flags and unit tests; a rough sketch of the slicing idea follows this list.
- Updates HPU graph usage heuristics and warmup dummy-input generation to prevent OOMs during long-context warmup.
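To illustrate the second bullet, here is a minimal sketch of the query-slicing idea in plain PyTorch. It is not the PR's FusedSDPA implementation (that lives in vllm_gaudi/extension/utils.py and targets Habana kernels); the function name and chunk_size default are illustrative.

```python
import torch
import torch.nn.functional as F


def sliced_causal_sdpa(q, k, v, chunk_size=1024):
    """Compute causal attention one query chunk at a time.

    Slicing the query axis materializes only a (chunk, kv_len) mask per
    slice instead of the full (seq_len, seq_len) mask, which is what
    exhausts memory at 256K context lengths.
    """
    seq_len = q.shape[-2]
    out_chunks = []
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        q_chunk = q[..., start:end, :]
        # Causal: query position i attends only to keys 0..i, so keys
        # beyond this chunk's last position can be dropped entirely.
        k_slice = k[..., :end, :]
        v_slice = v[..., :end, :]
        q_idx = torch.arange(start, end, device=q.device).unsqueeze(-1)
        k_idx = torch.arange(0, end, device=q.device).unsqueeze(0)
        mask = q_idx >= k_idx  # (chunk, kv_len) boolean mask
        out_chunks.append(
            F.scaled_dot_product_attention(q_chunk, k_slice, v_slice,
                                           attn_mask=mask))
    return torch.cat(out_chunks, dim=-2)
```

Peak mask memory drops from O(seq_len²) to O(chunk_size · seq_len), at the cost of a Python-level loop over chunks.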
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Defaults max_cudagraph_capture_size, refines _use_graphs() to account for context tokens, and clamps warmup dummy decode blocks to KV-cache capacity. |
| vllm_gaudi/ops/hpu_compressed_tensors.py | Syncs FP8 sliced-module scale references after post-load weight processing. |
| vllm_gaudi/extension/utils.py | Adds sliced FusedSDPA implementations and dispatch logic (BF16 + FP8) behind config/env gating. |
| vllm_gaudi/extension/ops.py | Removes a causal+mask workaround to allow sliced path to trigger (wrapper still disables unsupported default path). |
| vllm_gaudi/extension/features.py | Registers env flags and adds enable_fsdpa_slicing feature predicate. |
| vllm_gaudi/extension/bucketing/linear.py | Fixes decode block max clamping behavior and min/max consistency handling. |
| vllm_gaudi/extension/bucketing/exponential.py | Updates decode max-block formula and removes the prior limit cap in favor of downstream filtering. |
| vllm_gaudi/extension/bucketing/common.py | Adds decode bucket filtering based on batched max-model-len and optional debug logging of omitted buckets (a simplified sketch follows the table). |
| tests/unit_tests/worker/test_hpu_model_runner.py | Adds tests for cudagraph capture defaulting and _use_graphs() boundary behavior. |
| tests/unit_tests/test_fsdpa_slicing.py | New unit + (HPU-skipped) accuracy/graph-break coverage for sliced FusedSDPA and feature/env wiring. |
| tests/unit_tests/test_bucketing_warmup_time.py | New warmup-time budget tests using bucket-count bounds across scenarios. |
| tests/unit_tests/test_bucketing_issue_1347.py | New comprehensive bucketing contract tests spanning the combined changes. |
| tests/unit_tests/test_bucketing.py | Updates existing bucketing tests to match the new decode formula/limits. |
| tests/full_tests/model_cards/Qwen3-30B-A3B-FP8-Static-longbench.yaml | Adds a LongBench model card for the new e2e discoverable tests. |
| tests/full_tests/ci_e2e_discoverable_tests.sh | Adds LongBench discoverable test functions (baseline + slicing + fp8 KV variants). |
| docs/configuration/env_variables.md | Documents the new FusedSDPA slicing tuning env vars. |
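To make the bucketing rows concrete, here is a simplified, hypothetical sketch of filtering decode buckets by the batched max-model-len; the function name, the (bs, query, ctx) bucket layout, and the example numbers are assumptions rather than the module's actual API:

```python
def filter_decode_buckets(buckets, max_model_len, block_size):
    """Drop decode buckets that can never occur at runtime.

    Each bucket is a (batch_size, query_len, num_ctx_blocks) tuple.
    With bs sequences each capped at max_model_len tokens, the batch
    can never hold more than bs * max_model_len tokens, so buckets
    whose ctx blocks exceed that cap only waste warmup time.
    """
    max_blocks_per_seq = (max_model_len + block_size - 1) // block_size
    return [(bs, query, ctx) for bs, query, ctx in buckets
            if ctx <= bs * max_blocks_per_seq]


# With a 256K model length and 128-token blocks, one sequence needs at
# most 2048 blocks, so (1, 1, 8192) is filtered while (1, 1, 2048) stays.
print(filter_decode_buckets([(1, 1, 2048), (1, 1, 8192)],
                            max_model_len=256 * 1024, block_size=128))
```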
Comment on lines +178 to +179:

```python
assert bucketing_manager is not None and bucketing_manager.initialized, \
    'Bucketing manager should be instantiated and initialized to enable FusedSDPA slicing.'
```
From vllm_gaudi/extension/features.py:

```python
      env_var_type=boolean),
Value('use_hpu_aligned_scale', False, env_var='HPU_ALIGNED_SCALE', env_var_type=boolean),
Value('enable_fsdpa_slicing',
      All(Eq('bucketing_strategy', 'pad'), Disabled('merged_prefill'), Kernel(fsdpa)),
```
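As the snippet suggests, enable_fsdpa_slicing appears to be a composite predicate: the sliced path is only enabled when the padding-aware ('pad') bucketing strategy is selected, merged prefill is disabled, and the fsdpa kernel is available.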
From vllm_gaudi/extension/bucketing/linear.py:

```python
if decode_block_bucket_cfg[0] > decode_block_bucket_cfg[2]:
    decode_block_bucket_min = max(1, decode_block_bucket_cfg[2] - decode_block_bucket_cfg[1])
    logger().info(
        f"VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_cfg[0]} is higher than max_blocks={decode_block_bucket_cfg[2]}. "
        f"Your configuration VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_cfg[0]} will be overwritten to "
        f"VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_min}")
```
Copilot AI added a commit that referenced this pull request on Apr 15, 2026.
Agent-Logs-Url: https://github.com/vllm-project/vllm-gaudi/sessions/1d54e613-892a-46b1-b2af-4d9b9d321288
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Signed-off-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Remove all production code changes from PRs #1122, #1155, #1346 and keep only the two test files created for issue #1347:
- tests/unit_tests/test_bucketing_issue_1347.py
- tests/unit_tests/test_bucketing_warmup_time.py

Signed-off-by: GitHub Copilot <copilot@github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
🚧 CI Blocked: the main CI workflow was not started for the following reason:
Summary
Add a comprehensive bucketing test suite for the four PRs addressing 256K model length support on Gaudi2 (issue #1347). This PR contains only test files and makes no production code changes.
Test Files
`tests/unit_tests/test_bucketing_issue_1347.py` (~1217 lines, ~50 tests)

Tests covering bucketing contracts across all four PRs, including the `generate_buckets()` filter, the linear strategy fix, and the `_generate_seq_lengths` clamp.

Test categories include:
- `generate_buckets()` filter behavior (bs ≤ ctx, ctx ≤ batched max)
- `HPUBucketingManager.initialize()` integration for 256K scenarios

`tests/unit_tests/test_bucketing_warmup_time.py` (~830 lines, ~25 tests)

Tests enforcing warmup-time budgets via bucket-count bounds, including `torch._dynamo.config.cache_size_limit` bounds.

Dependencies

Some tests validate behavior from PRs #1122, #1155, and #1346 and will only pass after those PRs are merged. Tests for PR #762 (already merged) should pass against `main`.

Relates to: #1347
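For flavor, a warmup-budget test of the kind described above might look like the following sketch. The bucket generator, the budget, and the parameter values are hypothetical stand-ins, not the suite's actual names or thresholds:

```python
import itertools

import pytest


def toy_decode_buckets(max_model_len, block_size, max_num_seqs):
    """Hypothetical stand-in for the real bucket generator: exponential
    batch sizes crossed with exponential ctx-block counts, filtered by
    the batched max-model-len cap described in this PR."""
    max_blocks = (max_model_len + block_size - 1) // block_size
    bs_values = [2**i for i in range(8) if 2**i <= max_num_seqs]
    ctx_values = [2**i for i in range(1, 20) if 2**i <= max_blocks]
    return [(bs, ctx) for bs, ctx in itertools.product(bs_values, ctx_values)
            if ctx <= bs * max_blocks]


MAX_TOTAL_BUCKETS = 1000  # hypothetical warmup-time budget


@pytest.mark.parametrize("max_model_len", [8192, 32768, 262144])
def test_bucket_count_stays_bounded(max_model_len):
    buckets = toy_decode_buckets(max_model_len, block_size=128, max_num_seqs=128)
    # Each bucket costs one warmup compilation, so the bucket count is a
    # direct proxy for warmup time and dynamo cache pressure.
    assert len(buckets) <= MAX_TOTAL_BUCKETS
```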