skip HPU graphs for long prefills #780
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Pull request overview
This PR modifies HPU graph capture behavior to skip graph generation for long prefills by setting a default threshold of 8192 tokens and refining the calculation to include context tokens.
- Introduces a default threshold of 8192 tokens for HPU graph capture when not explicitly configured
- Updates batched token calculation to include context tokens (num_blocks * block_size) in addition to input sequence length
- Adds clarifying comment about skipping HPU graphs for long prefills
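A minimal sketch of the behavior summarized above, with assumed names (these are not the actual vllm-gaudi attributes or helpers); the 8192 default and the context-inclusive token count come from the overview bullets:

```python
# Sketch only -- names and structure are illustrative, not the actual
# vllm-gaudi implementation.
DEFAULT_CAPTURE_THRESHOLD = 8192  # default per the overview when nothing is configured

def should_capture_hpu_graph(batch_size: int, seq_len: int, num_blocks: int,
                             block_size: int,
                             capture_threshold: int | None = None) -> bool:
    """Return False (i.e. skip HPU graph capture) for long prefills."""
    threshold = DEFAULT_CAPTURE_THRESHOLD if capture_threshold is None else capture_threshold
    # Batched tokens = current query tokens plus cached context tokens.
    batched_tokens = batch_size * (seq_len + num_blocks * block_size)
    return batched_tokens <= threshold
```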
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
```diff
@@ -783,7 +783,8 @@ def __init__(
     self.use_hpu_graph = not self.model_config.enforce_eager
     self.max_batch_size = self.scheduler_config.max_num_seqs
     self.max_num_seqs = self.scheduler_config.max_num_seqs
```
The variable is renamed from `max_cudagraph_capture_size` to `max_graph_capture_tokens`, but the source still references `max_cudagraph_capture_size`. This creates inconsistent terminology between HPU-specific naming and the CUDA-derived config field name. Consider either keeping the original name for clarity or adding a comment explaining the name change from size to tokens.
Suggested change:
```diff
 self.max_num_seqs = self.scheduler_config.max_num_seqs
+# NOTE: `max_graph_capture_tokens` is derived from the CUDA-derived
+# config field `max_cudagraph_capture_size`. The HPU path uses
+# "tokens" terminology, but the underlying config name is kept for
+# compatibility.
```
```python
if htorch.utils.internal.is_lazy():
    use_graphs = self._use_graphs()
    if self.max_cudagraph_capture_size is not None and batch_size * seq_len > self.max_cudagraph_capture_size:
        # skip HPU graphs for long prefills
```
The formula `batch_size * (seq_len + num_blocks * self.block_size)` calculates total tokens including context, but the logic assumes `num_blocks * self.block_size` represents context tokens. This assumption should be documented in a comment to clarify what `num_blocks` represents and why this calculation correctly accounts for context tokens.
Suggested change:
```diff
 # skip HPU graphs for long prefills
+# NOTE: num_blocks is the number of KV/cache blocks per sequence and
+# self.block_size is tokens per block, so num_blocks * self.block_size
+# represents the number of cached context tokens per sequence. Adding
+# seq_len gives total tokens per sequence (context + current), which
+# is then scaled by batch_size to compare against max_graph_capture_tokens.
```
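To make the suggested comment concrete, here is a small worked example with made-up numbers (the values are hypothetical, chosen only to show how the cached context can dominate the token count):

```python
# Hypothetical values for illustration only.
batch_size = 2
seq_len = 512        # tokens in the current prefill chunk
num_blocks = 64      # KV cache blocks already allocated per sequence
block_size = 128     # tokens per block

context_tokens = num_blocks * block_size              # 8192 cached tokens per sequence
old_count = batch_size * seq_len                      # 1024  -- context ignored
new_count = batch_size * (seq_len + context_tokens)   # 17408 -- context included

# Against a capture threshold of 8192, the old formula would still capture an
# HPU graph (1024 <= 8192), while the corrected formula skips it (17408 > 8192).
print(old_count, new_count)
```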
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
✅ CI Passed. All checks passed successfully against the following vllm commit:
- Set the batched tokens threshold to skip HPU graph to `max_num_batched_tokens` if `max_cudagraph_capture_size` is not set.
- Include the context tokens while calculating the batched tokens.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Reverts #780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Re-apply #780 to avoid an OOM error caused by too many unnecessary HPU graphs captured for long prefills.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
## Motivation

The current implementation uses the following logic to skip HPU graphs for long prefills:
https://github.com/vllm-project/vllm-gaudi/blob/bcff6c8a4e41dae81bbfd762961430f7607637f9/vllm_gaudi/v1/worker/hpu_model_runner.py#L3105-L3106

However:
- `self.max_cudagraph_capture_size` is not set by default,
- the batched token count is calculated as `batch_size * seq_len`, which misses the context length; with chunked prefill and APC enabled by default, the context length is comparable to or even larger than the query length.

This leads to unnecessary HPU graphs for the compute-bound long prefills and a much larger memory footprint, which may cause an OOM crash.

## Changes

- Set `self.max_cudagraph_capture_size` to `self.max_num_batched_tokens` if it is not set.
- Include the context tokens while calculating the batched tokens.

> This is a re-implementation of the reverted PR #780.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
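A hedged sketch of the two changes listed above, using assumed class, method, and parameter names (the actual code lives in `hpu_model_runner.py` around the lines linked in the Motivation section); this is illustrative, not the exact implementation:

```python
class HpuGraphPolicySketch:
    """Illustrative stand-in for the relevant parts of the HPU model runner."""

    def __init__(self, max_cudagraph_capture_size, max_num_batched_tokens, block_size):
        # Change 1: fall back to max_num_batched_tokens when
        # max_cudagraph_capture_size is not explicitly configured.
        if max_cudagraph_capture_size is None:
            max_cudagraph_capture_size = max_num_batched_tokens
        self.max_cudagraph_capture_size = max_cudagraph_capture_size
        self.block_size = block_size

    def skip_graph_for_long_prefill(self, batch_size, seq_len, num_blocks):
        # Change 2: include the cached context tokens (num_blocks * block_size)
        # in the batched token count before comparing against the threshold.
        batched_tokens = batch_size * (seq_len + num_blocks * self.block_size)
        return batched_tokens > self.max_cudagraph_capture_size
```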