skip HPU graphs for long (query + context) prefills #1346
Conversation
Pull request overview
Adjusts the HPU graph-bypass heuristic in HPUModelRunner to avoid capturing graphs for compute-/memory-heavy long prefills, reducing unnecessary memory footprint and OOM risk during long-context runs.
Changes:
- Default `max_cudagraph_capture_size` to `max_num_batched_tokens` when unset.
- Update the "skip graphs" threshold logic to account for context in addition to query length (a rough sketch follows below).
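A minimal sketch of the adjusted heuristic, reconstructed from the diff further down; the exact control flow in `HPUModelRunner._use_graphs` may differ, and the comparison against the capture size is an assumption here:

```python
def _use_graphs(self, attn_metadata, batch_size):
    # Eager mode never captures HPU graphs.
    if self.model_config.enforce_eager:
        return False
    if attn_metadata is not None and attn_metadata.is_prompt:
        # Query tokens of this chunk plus the context tokens already in the KV cache.
        seq_len = attn_metadata.seq_len()
        num_blocks = attn_metadata.num_blocks()
        total_tokens = batch_size * seq_len + num_blocks * attn_metadata.block_size
        # max_cudagraph_capture_size now defaults to max_num_batched_tokens when unset.
        if total_tokens > self.max_cudagraph_capture_size:
            # Long (query + context) prefills are compute-bound: skip graph capture.
            return False
    return True
```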
Signed-off-by: copilot <copilot@github.com>

Tests cover the four PRs addressing long-context bucketing:
- PR #762: Padding-aware bucketing strategy (warmup ranges, configs, generation)
- PR #1122: Exponential decode block formula, limit cap, filter, linear fix
- PR #1155: FusedSDPA slicing contract (pad_max bounds, strategy selection)
- PR #1346: HPU graph capture skip (cudagraph size, warmup clamp scenarios)
- Cross-PR integration: end-to-end 256K scenario, fallback, regressions

49 test functions organized in 6 test classes.

Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
Remove all production code changes from PRs #1122, #1155, #1346 and keep only the two test files created for issue #1347:
- tests/unit_tests/test_bucketing_issue_1347.py
- tests/unit_tests/test_bucketing_warmup_time.py

Signed-off-by: GitHub Copilot <copilot@github.com>
Co-authored-by: michalkuligowski <23379006+michalkuligowski@users.noreply.github.com>
```diff
-    def _use_graphs(self):
-        return not self.model_config.enforce_eager
+    def _use_graphs(self, attn_metadata=None, batch_size=0):
```
Why `batch_size=0` as a default? Someone could forget to pass it and get weird values.
Fixed by removing the default values, thanks.
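For context on the concern: if a caller silently relies on a default of `batch_size=0`, the query-token term drops out of the total and long prefills can be undercounted. A hypothetical illustration, not actual runner code:

```python
batch_size, seq_len = 0, 8192       # batch_size accidentally left at its default of 0
num_blocks, block_size = 64, 128    # 8192 cached context tokens

total_tokens = batch_size * seq_len + num_blocks * block_size
print(total_tokens)  # 8192 -- the 8192 query tokens were silently dropped from the count
```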
```python
if attn_metadata is not None and attn_metadata.is_prompt:
    seq_len = attn_metadata.seq_len()
    num_blocks = attn_metadata.num_blocks()
    total_tokens = (batch_size * seq_len + num_blocks * attn_metadata.block_size)
```
I don't understand this. Isn't total tokens `num_blocks * block_size` (with padding included)? Same for `batch_size * seq_len`.
No, `num_blocks * block_size` is the total context tokens, and `batch_size * seq_len` is the total query tokens. Take the chunked prefill of a prompt with 10240 tokens, `max_num_batched_tokens=8192` and `block_size=128` as an example:
- The prefill `[bs, seq_len, num_blocks]` for the first chunk will be `[1, 8192, 0]`.
- For the second chunk it will be `[1, 2048, 8192/128=64]`, where `2048` is the query length (`q_len` in FusedSDPA) and `8192` is the context length (`kv_len - q_len` in FusedSDPA).
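To make the arithmetic concrete, a small sketch using the numbers from this example (purely illustrative; the real values come from `attn_metadata`):

```python
block_size = 128

# Chunk 1: [bs, seq_len, num_blocks] = [1, 8192, 0] -- no cached context yet.
chunk1_total = 1 * 8192 + 0 * block_size
# Chunk 2: [bs, seq_len, num_blocks] = [1, 2048, 64] -- 64 blocks (8192 tokens) of cached context.
chunk2_total = 1 * 2048 + 64 * block_size

print(chunk1_total, chunk2_total)  # 8192 10240
```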
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Motivation
The current implementation uses the following logic to skip HPU graphs for long prefills:
vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py, lines 3105 to 3106 at bcff6c8.
However:
- `self.max_cudagraph_capture_size` is not set by default;
- `batch_size * seq_len` misses the context length, which is comparable to or even larger than the query length, since chunked prefill and APC are enabled by default.

These lead to unnecessary HPU graph capture for compute-bound long prefills and a much larger memory footprint, which may cause an OOM crash.
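Using the chunked-prefill example from the review discussion above, the gap between the old query-only metric and a query-plus-context metric looks like this (illustrative numbers only):

```python
# Second chunk of a 10240-token prompt, block_size=128, 8192 tokens already cached.
batch_size, seq_len = 1, 2048
num_blocks, block_size = 64, 128

query_only = batch_size * seq_len                             # 2048  -> looks "short"
query_plus_context = query_only + num_blocks * block_size     # 10240 -> actually long
```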
Changes:
- Default `self.max_cudagraph_capture_size` to `self.max_num_batched_tokens` if it is not set (a minimal sketch follows below).
- Count the context tokens in addition to the query tokens when deciding whether a prefill is long enough to skip HPU graph capture.
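A minimal sketch of the defaulting, assuming the attribute is `None` when unset; the exact init-time location in `HPUModelRunner` is not shown in this excerpt:

```python
# Hypothetical placement; the real change lives in HPUModelRunner initialization.
if self.max_cudagraph_capture_size is None:
    self.max_cudagraph_capture_size = self.max_num_batched_tokens
```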