skip HPU graphs for long (query + context) prefills #1346

Merged

kamil-kaczor merged 4 commits into vllm-project:main from yangulei:skip_long_prefill_graph on May 11, 2026
Conversation

@yangulei (Collaborator)

Motivation

The current implementation uses the following logic to skip HPU graphs for long prefills:

```python
if self.max_cudagraph_capture_size is not None and batch_size * seq_len > self.max_cudagraph_capture_size:
    use_graphs = False
```

However:

  • self.max_cudagraph_capture_size is not set by default, and
  • the number of batched tokens is computed as batch_size * seq_len, which misses the context length; with chunked prefill and APC enabled by default, the context can be comparable to, or even larger than, the query length.

This leads to unnecessary HPU graph capture for compute-bound long prefills and a much larger memory footprint, which may cause OOM crashes.

Changes (see the sketch after this description):

  • Set self.max_cudagraph_capture_size to self.max_num_batched_tokens when it is not set.
  • Include the context tokens when calculating the number of batched tokens.

This is a re-implementation of the reverted PR #780.
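
A minimal sketch of the resulting check, assuming the accessors quoted in the review threads below (attn_metadata.seq_len(), attn_metadata.num_blocks(), attn_metadata.block_size); this illustrates the heuristic, not the verbatim merged code:

```python
def _use_graphs(self, attn_metadata, batch_size):
    # Eager mode never uses HPU graphs.
    if self.model_config.enforce_eager:
        return False
    # Default the capture threshold to max_num_batched_tokens when unset.
    max_capture = (self.max_cudagraph_capture_size
                   if self.max_cudagraph_capture_size is not None
                   else self.max_num_batched_tokens)
    if attn_metadata is not None and attn_metadata.is_prompt:
        # Batched tokens = query tokens + cached-context tokens.
        total_tokens = (batch_size * attn_metadata.seq_len() +
                        attn_metadata.num_blocks() * attn_metadata.block_size)
        # Skip graph capture for compute-bound long prefills.
        if total_tokens > max_capture:
            return False
    return True
```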

Copilot AI review requested due to automatic review settings April 14, 2026 01:06
@yangulei yangulei marked this pull request as ready for review April 14, 2026 01:06
Copilot AI (Contributor) left a comment


Pull request overview

Adjusts the HPU graph-bypass heuristic in HPUModelRunner to avoid capturing graphs for compute-/memory-heavy long prefills, reducing unnecessary memory footprint and OOM risk during long-context runs.

Changes:

  • Default max_cudagraph_capture_size to max_num_batched_tokens when unset.
  • Update the “skip graphs” threshold logic to account for context in addition to query length.

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
f976e3b98ba45677a2213673a442c6cbff141e8e

Copilot AI added a commit that referenced this pull request Apr 14, 2026
Signed-off-by: copilot <copilot@github.com>

Tests cover the four PRs addressing long-context bucketing:
- PR #762:  Padding-aware bucketing strategy (warmup ranges, configs, generation)
- PR #1122: Exponential decode block formula, limit cap, filter, linear fix
- PR #1155: FusedSDPA slicing contract (pad_max bounds, strategy selection)
- PR #1346: HPU graph capture skip (cudagraph size, warmup clamp scenarios)
- Cross-PR integration: end-to-end 256K scenario, fallback, regressions

49 test functions organized in 6 test classes.

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Copilot AI added a commit that referenced this pull request Apr 14, 2026

Signed-off-by: GitHub Copilot <copilot@github.com>

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
@yangulei yangulei force-pushed the skip_long_prefill_graph branch from ab8831d to ea94bc0 Compare April 15, 2026 06:45
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
0e39202ca911319c7747a2f9d5a0c162fdff4fd9

Copilot AI added a commit that referenced this pull request Apr 16, 2026
Remove all production code changes from PRs #1122, #1155, #1346 and keep
only the two test files created for issue #1347:
- tests/unit_tests/test_bucketing_issue_1347.py
- tests/unit_tests/test_bucketing_warmup_time.py

Signed-off-by: GitHub Copilot <copilot@github.com>

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
@yangulei yangulei force-pushed the skip_long_prefill_graph branch from ea94bc0 to c297bae Compare April 24, 2026 01:18
@yangulei yangulei requested a review from jbyczkow as a code owner April 24, 2026 01:18
@yangulei yangulei force-pushed the skip_long_prefill_graph branch from c297bae to 7689975 Compare April 27, 2026 00:20
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
d886c26d4d4fef7d079696beb4ece1cfb4b008a8


```diff
-def _use_graphs(self):
-    return not self.model_config.enforce_eager
+def _use_graphs(self, attn_metadata=None, batch_size=0):
```
Collaborator

Why is batch_size=0 the default? Someone could forget to pass it and get weird values.

Collaborator (Author)

Fixed by removing the default values, thanks.

```python
if attn_metadata is not None and attn_metadata.is_prompt:
    seq_len = attn_metadata.seq_len()
    num_blocks = attn_metadata.num_blocks()
    total_tokens = (batch_size * seq_len + num_blocks * attn_metadata.block_size)
```
Collaborator

I don't understand this. Isn't total tokens num_blocks * block_size (with padding included)? Same for batch_size * seq_len

Collaborator (Author)

No: num_blocks * block_size is the total context tokens, and batch_size * seq_len is the total query tokens. Take chunked prefill of a 10240-token prompt with max_num_batched_tokens=8192 and block_size=128 as an example (see the worked example below):

  • The prefill [bs, seq_len, num_blocks] for the first chunk will be [1, 8192, 0].
  • For the second chunk it will be [1, 2048, 8192 / 128 = 64], where 2048 is the query length (q_len in FusedSDPA) and 8192 is the context length (kv_len - q_len in FusedSDPA).
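
To make the arithmetic concrete, a small worked example in plain Python (numbers taken from the comment above; total_tokens is a hypothetical helper mirroring the expression in the diff):

```python
block_size = 128
max_num_batched_tokens = 8192  # also the default capture threshold after this PR

def total_tokens(batch_size, seq_len, num_blocks):
    # query tokens + cached-context tokens
    return batch_size * seq_len + num_blocks * block_size

# First chunk: 8192 query tokens, no cached context yet.
assert total_tokens(1, 8192, 0) == 8192    # within the threshold -> graphs allowed

# Second chunk: 2048 query tokens on top of 64 cached blocks (8192 tokens).
assert total_tokens(1, 2048, 64) == 10240  # exceeds 8192 -> HPU graphs skipped

# The old batch_size * seq_len heuristic would see only 2048 tokens for the
# second chunk and capture a graph despite the 10240-token attention cost.
```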

Collaborator

I see, ty

@yangulei yangulei force-pushed the skip_long_prefill_graph branch from 5e19e06 to 99261f8 Compare May 7, 2026 01:55
@github-actions

github-actions Bot commented May 7, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
5b39b268f506150dbab38f6f6c04b7c843e37c07


@kamil-kaczor kamil-kaczor left a comment


lgtm


yangulei added 4 commits May 11, 2026 07:42
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
@yangulei yangulei force-pushed the skip_long_prefill_graph branch from 99261f8 to 7d8fc7d Compare May 11, 2026 07:47
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
8eb401134e750781a202c0b6dc4059616cdb4954

@kamil-kaczor kamil-kaczor merged commit d93044f into vllm-project:main May 11, 2026
2 checks passed