
skip HPU graphs for long prefills #780

Merged

michalkuligowski merged 2 commits into vllm-project:main from yangulei:skip_long_prefill_graph on Jan 8, 2026

Conversation

@yangulei (Collaborator) commented Jan 7, 2026

  • Set the batched tokens threshold for skipping HPU graphs to max_num_batched_tokens if max_cudagraph_capture_size is not set.
  • Include the context tokens when calculating the batched tokens (see the sketch below).
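
A minimal sketch of the combined logic, assuming the attribute names quoted in the review comments below; `should_skip_hpu_graph` is a hypothetical helper written for illustration, not the actual runner code:

```python
from typing import Optional

# Hypothetical illustration of this PR's logic; not the actual
# hpu_model_runner.py code. Names mirror the snippets quoted below.
def should_skip_hpu_graph(
    batch_size: int,
    seq_len: int,
    num_blocks: int,
    block_size: int,
    max_cudagraph_capture_size: Optional[int],
    max_num_batched_tokens: int,
) -> bool:
    """Return True when HPU graph capture should be skipped for this batch."""
    # Change 1: fall back to max_num_batched_tokens when the capture
    # threshold is not explicitly configured.
    threshold = (max_cudagraph_capture_size
                 if max_cudagraph_capture_size is not None
                 else max_num_batched_tokens)
    # Change 2: count the cached context tokens (num_blocks * block_size)
    # in addition to the current input tokens (seq_len).
    batched_tokens = batch_size * (seq_len + num_blocks * block_size)
    return batched_tokens > threshold
```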

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR modifies HPU graph capture behavior to skip graph generation for long prefills by setting a default threshold of 8192 tokens and refining the calculation to include context tokens.

  • Introduces a default threshold of 8192 tokens for HPU graph capture when not explicitly configured
  • Updates batched token calculation to include context tokens (num_blocks * block_size) in addition to input sequence length
  • Adds clarifying comment about skipping HPU graphs for long prefills


@@ -783,7 +783,8 @@ def __init__(
        self.use_hpu_graph = not self.model_config.enforce_eager
        self.max_batch_size = self.scheduler_config.max_num_seqs
        self.max_num_seqs = self.scheduler_config.max_num_seqs
Copilot AI commented Jan 7, 2026

The variable is renamed from max_cudagraph_capture_size to max_graph_capture_tokens, but the source still references max_cudagraph_capture_size. This creates inconsistent terminology between HPU-specific naming and the CUDA-derived config field name. Consider either keeping the original name for clarity or adding a comment explaining the name change from size to tokens.

Suggested change
self.max_num_seqs = self.scheduler_config.max_num_seqs
self.max_num_seqs = self.scheduler_config.max_num_seqs
# NOTE: `max_graph_capture_tokens` is derived from the CUDA-derived
# config field `max_cudagraph_capture_size`. The HPU path uses
# "tokens" terminology, but the underlying config name is kept for
# compatibility.

if htorch.utils.internal.is_lazy():
    use_graphs = self._use_graphs()
    if self.max_cudagraph_capture_size is not None and batch_size * seq_len > self.max_cudagraph_capture_size:
        # skip HPU graphs for long prefills
Copilot AI commented Jan 7, 2026

The formula batch_size * (seq_len + num_blocks * self.block_size) calculates total tokens including context, but the logic assumes num_blocks * self.block_size represents context tokens. This assumption should be documented in a comment to clarify what num_blocks represents and why this calculation correctly accounts for context tokens.

Suggested change
# skip HPU graphs for long prefills
# skip HPU graphs for long prefills
# NOTE: num_blocks is the number of KV/cache blocks per sequence and
# self.block_size is tokens per block, so num_blocks * self.block_size
# represents the number of cached context tokens per sequence. Adding
# seq_len gives total tokens per sequence (context + current), which
# is then scaled by batch_size to compare against max_graph_capture_tokens.

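To make the arithmetic in the suggested comment concrete, a small worked example with made-up values (none of the numbers below come from the PR):

```python
# Hypothetical numbers for illustration only.
batch_size = 2      # sequences in the batch
seq_len = 1024      # current input tokens per sequence
num_blocks = 96     # KV-cache blocks already allocated per sequence
block_size = 128    # tokens per KV-cache block

context_tokens = num_blocks * block_size                  # 12288 cached tokens
batched_tokens = batch_size * (seq_len + context_tokens)  # 2 * 13312 = 26624

# Against the 8192-token default threshold mentioned in the overview,
# 26624 > 8192, so HPU graph capture would be skipped for this prefill.
print(batched_tokens > 8192)  # True
```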
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
@github-actions bot commented Jan 7, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
b3a2bdf1ac90748d58bf8c05f8d0095ede5c7eca

@michalkuligowski michalkuligowski merged commit b208bbd into vllm-project:main Jan 8, 2026
50 checks passed
@yangulei yangulei deleted the skip_long_prefill_graph branch January 8, 2026 08:27
yangulei added a commit to yangulei/vllm-gaudi that referenced this pull request Jan 9, 2026
- Set the batched tokens threshold for skipping HPU graphs to
`max_num_batched_tokens` if `max_cudagraph_capture_size` is not set.
- Include the context tokens when calculating the batched tokens.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
jinyouzhi pushed a commit to jinyouzhi/vllm-gaudi that referenced this pull request Jan 14, 2026
- Set the batched tokens threshold for skipping HPU graphs to
`max_num_batched_tokens` if `max_cudagraph_capture_size` is not set.
- Include the context tokens when calculating the batched tokens.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Jin, Youzhi <youzhi.jin@intel.com>
adobrzyn added a commit that referenced this pull request Jan 21, 2026
adobrzyn added a commit that referenced this pull request Jan 26, 2026
Reverts #780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
hlahkar pushed a commit to hlahkar/vllm-gaudi that referenced this pull request Jan 27, 2026
Reverts vllm-project#780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
adobrzyn added a commit that referenced this pull request Jan 27, 2026
Reverts #780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
mgawarkiewicz-intel pushed a commit that referenced this pull request Jan 28, 2026
Reverts #780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
testdig pushed a commit to testdig/vllm-gaudi-fork that referenced this pull request Jan 29, 2026
Reverts vllm-project#780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Wang, Zheng W <zheng.w.wang@intel.com>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Jan 29, 2026
…roject#888)

Reverts vllm-project#780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
- Set the batched tokens threshold for skipping HPU graphs to
`max_num_batched_tokens` if `max_cudagraph_capture_size` is not set.
- Include the context tokens when calculating the batched tokens.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
Reverts vllm-project#780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
rajanintel24 pushed a commit to rajanintel24/vllm-gaudi that referenced this pull request Feb 11, 2026
- Set the batched tokens threshold for skipping HPU graphs to
`max_num_batched_tokens` if `max_cudagraph_capture_size` is not set.
- Include the context tokens when calculating the batched tokens.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
czhu15 pushed a commit that referenced this pull request Feb 27, 2026
Re-apply #780 to avoid
OOM errors caused by too many unnecessary HPU graphs being captured for
long prefills.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
czhu15 pushed a commit that referenced this pull request Feb 27, 2026
Re-apply #780 to avoid
OOM errors caused by too many unnecessary HPU graphs being captured for
long prefills.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
tvoas pushed a commit to tvoas/vllm-gaudi that referenced this pull request Mar 11, 2026
Re-apply vllm-project#780 to avoid
OOM errors caused by too many unnecessary HPU graphs being captured for
long prefills.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
- Set the batched tokens threshold for skipping HPU graphs to
`max_num_batched_tokens` if `max_cudagraph_capture_size` is not set.
- Include the context tokens when calculating the batched tokens.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
adobrzyn added a commit that referenced this pull request Mar 31, 2026
Reverts #780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
yangulei added a commit to yangulei/vllm-gaudi that referenced this pull request Apr 14, 2026
- Set the batched tokens threshold for skipping HPU graphs to
`max_num_batched_tokens` if `max_cudagraph_capture_size` is not set.
- Include the context tokens when calculating the batched tokens.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
kamil-kaczor pushed a commit that referenced this pull request May 11, 2026
## Motivation
The current implementation uses the following logic to skip HPU graphs
for long prefills:


https://github.com/vllm-project/vllm-gaudi/blob/bcff6c8a4e41dae81bbfd762961430f7607637f9/vllm_gaudi/v1/worker/hpu_model_runner.py#L3105-L3106

However:
- `self.max_cudagraph_capture_size` is not set by default,
- the batched token count is calculated as `batch_size * seq_len`, which
misses the context length; since chunked prefill and APC are enabled by
default, the context can be comparable to or even larger than the query
length.

Together these lead to unnecessary HPU graphs being captured for
compute-bound long prefills, which greatly increases the memory
footprint and may cause OOM crashes.

## Changes
- Set `self.max_cudagraph_capture_size` to
`self.max_num_batched_tokens` if it is not set,
- Include the context tokens when calculating the batched tokens
(sketched below).

> This is a re-implementation of the reverted PR
#780.
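
A hedged before/after sketch of the condition, reusing the names from the snippet linked above; the exact re-implemented code is an assumption based on this description:

```python
# Hypothetical chunked-prefill step with APC: long cached context,
# short current query. All values are illustrative only.
batch_size, seq_len = 1, 512        # current chunk of the prefill
num_blocks, block_size = 256, 128   # 32768 cached context tokens
max_cudagraph_capture_size = None   # unset, as it is by default
max_num_batched_tokens = 8192       # example fallback threshold

# Old logic: threshold unset -> condition never fires, graph captured.
old_skips = (max_cudagraph_capture_size is not None
             and batch_size * seq_len > max_cudagraph_capture_size)

# New logic: default the threshold and include the cached context.
threshold = (max_cudagraph_capture_size
             if max_cudagraph_capture_size is not None
             else max_num_batched_tokens)
new_skips = batch_size * (seq_len + num_blocks * block_size) > threshold

print(old_skips, new_skips)  # False True
```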

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>