Combined changes from PRs #1122, #1155, #1346 for 256K context support on Gaudi2 #1348
Draft
Copilot wants to merge 10 commits into
Conversation
Tests cover the four PRs addressing long-context bucketing:
- PR #762: Padding-aware bucketing strategy (warmup ranges, configs, generation)
- PR #1122: Exponential decode block formula, limit cap, filter, linear fix
- PR #1155: FusedSDPA slicing contract (pad_max bounds, strategy selection)
- PR #1346: HPU graph capture skip (cudagraph size, warmup clamp scenarios)
- Cross-PR integration: end-to-end 256K scenario, fallback, regressions

49 test functions organized in 6 test classes.

Signed-off-by: copilot <copilot@github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Tests verify that bucket counts (a proxy for warmup time) stay within acceptable bounds across model configurations and bucketing strategies:
- Total/prompt/decode bucket count bounds across 5 model scenarios
- Regression tests for known bucket explosion (GAUDISW-247226, 256K decode)
- Sub-linear scaling verification: bucket count vs. model length
- Strategy comparison: exponential vs. linear vs. padding-aware efficiency
- Dynamo cache_size_limit and accumulated_cache_size_limit bounds
- Individual range dimension sanity checks (bs, block, query, ctx)
- Warmup range edge cases: deduplication, padding parameters
- End-to-end manager tests: 256K, 8K fast warmup, speculative decoding

25 test functions organized in 8 test classes.

Signed-off-by: copilot <copilot@github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Rename FP8 blockwise compressed tensors scales to match HPU ops. Fixes a regression in https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512 caused by #1220 and #1053.

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Copilot created this pull request from a session on behalf of michalkuligowski on April 14, 2026 at 13:05.
🚧 CI Blocked: the main CI workflow was not started for the following reason:
Pull request overview
This PR aggregates three previously separate fixes to unblock 256K max_model_len on Gaudi2 (TP=1) by tightening decode bucketing, reducing FusedSDPA memory via Q/K/V slicing, and avoiding expensive HPU graph capture for very long prefills.
Changes:
- Refines decode bucketing (formula + filtering) and adds warmup-time regression tests to keep bucket counts (and warmup time) bounded for long contexts.
- Introduces chunked/sliced FusedSDPA (BF16 + FP8) to avoid materializing full attention masks at long context lengths, plus feature flags and unit tests; a rough sketch of the slicing idea follows this list.
- Updates HPU graph usage heuristics and warmup dummy-input generation to prevent OOMs during long-context warmup.
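To illustrate the second bullet, here is a minimal sketch of the query-slicing idea in plain PyTorch. It is not the PR's FusedSDPA implementation (that lives in vllm_gaudi/extension/utils.py and targets Habana kernels); the function name and chunk_size default are illustrative.

```python
import torch
import torch.nn.functional as F


def sliced_causal_sdpa(q, k, v, chunk_size=1024):
    """Compute causal attention one query chunk at a time.

    Slicing the query axis materializes only a (chunk, kv_len) mask per
    slice instead of the full (seq_len, seq_len) mask, which is what
    exhausts memory at 256K context lengths.
    """
    seq_len = q.shape[-2]
    out_chunks = []
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        q_chunk = q[..., start:end, :]
        # Causal: query position i attends only to keys 0..i, so keys
        # beyond this chunk's last position can be dropped entirely.
        k_slice = k[..., :end, :]
        v_slice = v[..., :end, :]
        q_idx = torch.arange(start, end, device=q.device).unsqueeze(-1)
        k_idx = torch.arange(0, end, device=q.device).unsqueeze(0)
        mask = q_idx >= k_idx  # (chunk, kv_len) boolean mask
        out_chunks.append(
            F.scaled_dot_product_attention(q_chunk, k_slice, v_slice,
                                           attn_mask=mask))
    return torch.cat(out_chunks, dim=-2)
```

Peak mask memory drops from O(seq_len²) to O(chunk_size · seq_len), at the cost of a Python-level loop over chunks.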
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Defaults max_cudagraph_capture_size, refines _use_graphs() to account for context tokens, and clamps warmup dummy decode blocks to KV-cache capacity. |
| vllm_gaudi/ops/hpu_compressed_tensors.py | Syncs FP8 sliced-module scale references after post-load weight processing. |
| vllm_gaudi/extension/utils.py | Adds sliced FusedSDPA implementations and dispatch logic (BF16 + FP8) behind config/env gating. |
| vllm_gaudi/extension/ops.py | Removes a causal+mask workaround to allow sliced path to trigger (wrapper still disables unsupported default path). |
| vllm_gaudi/extension/features.py | Registers env flags and adds enable_fsdpa_slicing feature predicate. |
| vllm_gaudi/extension/bucketing/linear.py | Fixes decode block max clamping behavior and min/max consistency handling. |
| vllm_gaudi/extension/bucketing/exponential.py | Updates decode max-block formula and removes the prior limit cap in favor of downstream filtering. |
| vllm_gaudi/extension/bucketing/common.py | Adds decode bucket filtering based on batched max-model-len and optional debug logging of omitted buckets (a simplified sketch follows the table). |
| tests/unit_tests/worker/test_hpu_model_runner.py | Adds tests for cudagraph capture defaulting and _use_graphs() boundary behavior. |
| tests/unit_tests/test_fsdpa_slicing.py | New unit + (HPU-skipped) accuracy/graph-break coverage for sliced FusedSDPA and feature/env wiring. |
| tests/unit_tests/test_bucketing_warmup_time.py | New warmup-time budget tests using bucket-count bounds across scenarios. |
| tests/unit_tests/test_bucketing_issue_1347.py | New comprehensive bucketing contract tests spanning the combined changes. |
| tests/unit_tests/test_bucketing.py | Updates existing bucketing tests to match the new decode formula/limits. |
| tests/full_tests/model_cards/Qwen3-30B-A3B-FP8-Static-longbench.yaml | Adds a LongBench model card for the new e2e discoverable tests. |
| tests/full_tests/ci_e2e_discoverable_tests.sh | Adds LongBench discoverable test functions (baseline + slicing + fp8 KV variants). |
| docs/configuration/env_variables.md | Documents the new FusedSDPA slicing tuning env vars. |
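To make the bucketing rows concrete, here is a simplified, hypothetical sketch of filtering decode buckets by the batched max-model-len; the function name, the (bs, query, ctx) bucket layout, and the example numbers are assumptions rather than the module's actual API:

```python
def filter_decode_buckets(buckets, max_model_len, block_size):
    """Drop decode buckets that can never occur at runtime.

    Each bucket is a (batch_size, query_len, num_ctx_blocks) tuple.
    With bs sequences each capped at max_model_len tokens, the batch
    can never hold more than bs * max_model_len tokens, so buckets
    whose ctx blocks exceed that cap only waste warmup time.
    """
    max_blocks_per_seq = (max_model_len + block_size - 1) // block_size
    return [(bs, query, ctx) for bs, query, ctx in buckets
            if ctx <= bs * max_blocks_per_seq]


# With a 256K model length and 128-token blocks, one sequence needs at
# most 2048 blocks, so (1, 1, 8192) is filtered while (1, 1, 2048) stays.
print(filter_decode_buckets([(1, 1, 2048), (1, 1, 8192)],
                            max_model_len=256 * 1024, block_size=128))
```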
Comment on lines +178 to +179:

```python
assert bucketing_manager is not None and bucketing_manager.initialized, \
    'Bucketing manager should be instantiated and initialized to enable FusedSDPA slicing.'
```
From vllm_gaudi/extension/features.py:

```python
      env_var_type=boolean),
Value('use_hpu_aligned_scale', False, env_var='HPU_ALIGNED_SCALE', env_var_type=boolean),
Value('enable_fsdpa_slicing',
      All(Eq('bucketing_strategy', 'pad'), Disabled('merged_prefill'), Kernel(fsdpa)),
```
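As the snippet suggests, enable_fsdpa_slicing appears to be a composite predicate: the sliced path is only enabled when the padding-aware ('pad') bucketing strategy is selected, merged prefill is disabled, and the fsdpa kernel is available.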
From vllm_gaudi/extension/bucketing/linear.py:

```python
if decode_block_bucket_cfg[0] > decode_block_bucket_cfg[2]:
    decode_block_bucket_min = max(1, decode_block_bucket_cfg[2] - decode_block_bucket_cfg[1])
    logger().info(
        f"VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_cfg[0]} is higher than max_blocks={decode_block_bucket_cfg[2]}. "
        f"Your configuration VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_cfg[0]} will be overwritten to "
        f"VLLM_DECODE_BLOCK_BUCKET_MIN={decode_block_bucket_min}")
```
Copilot AI added a commit that referenced this pull request on Apr 15, 2026.
Agent-Logs-Url: https://github.com/vllm-project/vllm-gaudi/sessions/1d54e613-892a-46b1-b2af-4d9b9d321288
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Signed-off-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Remove all production code changes from PRs #1122, #1155, #1346 and keep only the two test files created for issue #1347:
- tests/unit_tests/test_bucketing_issue_1347.py
- tests/unit_tests/test_bucketing_warmup_time.py

Signed-off-by: GitHub Copilot <copilot@github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
🚧 CI Blocked: the main CI workflow was not started for the following reason:
Summary
Add a comprehensive bucketing test suite for the four PRs addressing 256K model length support on Gaudi2 (issue #1347). This PR contains only test files and makes no production code changes.
Test Files
`tests/unit_tests/test_bucketing_issue_1347.py` (~1217 lines, ~50 tests)

Tests covering bucketing contracts across all four PRs, including the `generate_buckets()` filter, the linear strategy fix, and the `_generate_seq_lengths` clamp.

Test categories include:
- `generate_buckets()` filter behavior (bs ≤ ctx, ctx ≤ batched max)
- `HPUBucketingManager.initialize()` integration for 256K scenarios

`tests/unit_tests/test_bucketing_warmup_time.py` (~830 lines, ~25 tests)

Tests enforcing warmup-time budgets via bucket-count bounds, including `torch._dynamo.config.cache_size_limit` bounds.

Dependencies

Some tests validate behavior from PRs #1122, #1155, and #1346 and will only pass after those PRs are merged. Tests for PR #762 (already merged) should pass against `main`.

Relates to: #1347
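For flavor, a warmup-budget test of the kind described above might look like the following sketch. The bucket generator, the budget, and the parameter values are hypothetical stand-ins, not the suite's actual names or thresholds:

```python
import itertools

import pytest


def toy_decode_buckets(max_model_len, block_size, max_num_seqs):
    """Hypothetical stand-in for the real bucket generator: exponential
    batch sizes crossed with exponential ctx-block counts, filtered by
    the batched max-model-len cap described in this PR."""
    max_blocks = (max_model_len + block_size - 1) // block_size
    bs_values = [2**i for i in range(8) if 2**i <= max_num_seqs]
    ctx_values = [2**i for i in range(1, 20) if 2**i <= max_blocks]
    return [(bs, ctx) for bs, ctx in itertools.product(bs_values, ctx_values)
            if ctx <= bs * max_blocks]


MAX_TOTAL_BUCKETS = 1000  # hypothetical warmup-time budget


@pytest.mark.parametrize("max_model_len", [8192, 32768, 262144])
def test_bucket_count_stays_bounded(max_model_len):
    buckets = toy_decode_buckets(max_model_len, block_size=128, max_num_seqs=128)
    # Each bucket costs one warmup compilation, so the bucket count is a
    # direct proxy for warmup time and dynamo cache pressure.
    assert len(buckets) <= MAX_TOTAL_BUCKETS
```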