skip HPU graphs for long prefills #780
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Pull request overview
This PR modifies HPU graph capture behavior to skip graph generation for long prefills by setting a default threshold of 8192 tokens and refining the calculation to include context tokens.
- Introduces a default threshold of 8192 tokens for HPU graph capture when not explicitly configured
- Updates batched token calculation to include context tokens (num_blocks * block_size) in addition to input sequence length
- Adds clarifying comment about skipping HPU graphs for long prefills
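A minimal sketch of the behavior summarized above, with assumed names (these are not the actual vllm-gaudi attributes or helpers); the 8192 default and the context-inclusive token count come from the overview bullets:

```python
# Sketch only -- names and structure are illustrative, not the actual
# vllm-gaudi implementation.
DEFAULT_CAPTURE_THRESHOLD = 8192  # default per the overview when nothing is configured

def should_capture_hpu_graph(batch_size: int, seq_len: int, num_blocks: int,
                             block_size: int,
                             capture_threshold: int | None = None) -> bool:
    """Return False (i.e. skip HPU graph capture) for long prefills."""
    threshold = DEFAULT_CAPTURE_THRESHOLD if capture_threshold is None else capture_threshold
    # Batched tokens = current query tokens plus cached context tokens.
    batched_tokens = batch_size * (seq_len + num_blocks * block_size)
    return batched_tokens <= threshold
```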
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
```diff
@@ -783,7 +783,8 @@ def __init__(
     self.use_hpu_graph = not self.model_config.enforce_eager
     self.max_batch_size = self.scheduler_config.max_num_seqs
     self.max_num_seqs = self.scheduler_config.max_num_seqs
```
The variable is renamed from `max_cudagraph_capture_size` to `max_graph_capture_tokens`, but the source still references `max_cudagraph_capture_size`. This creates inconsistent terminology between HPU-specific naming and the CUDA-derived config field name. Consider either keeping the original name for clarity or adding a comment explaining the name change from size to tokens.
Suggested change:
```diff
 self.max_num_seqs = self.scheduler_config.max_num_seqs
+# NOTE: `max_graph_capture_tokens` is derived from the CUDA-derived
+# config field `max_cudagraph_capture_size`. The HPU path uses
+# "tokens" terminology, but the underlying config name is kept for
+# compatibility.
```
```python
if htorch.utils.internal.is_lazy():
    use_graphs = self._use_graphs()
    if self.max_cudagraph_capture_size is not None and batch_size * seq_len > self.max_cudagraph_capture_size:
        # skip HPU graphs for long prefills
```
The formula `batch_size * (seq_len + num_blocks * self.block_size)` calculates total tokens including context, but the logic assumes `num_blocks * self.block_size` represents context tokens. This assumption should be documented in a comment to clarify what `num_blocks` represents and why this calculation correctly accounts for context tokens.
Suggested change:
```diff
 # skip HPU graphs for long prefills
+# NOTE: num_blocks is the number of KV/cache blocks per sequence and
+# self.block_size is tokens per block, so num_blocks * self.block_size
+# represents the number of cached context tokens per sequence. Adding
+# seq_len gives total tokens per sequence (context + current), which
+# is then scaled by batch_size to compare against max_graph_capture_tokens.
```
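To make the suggested comment concrete, here is a small worked example with made-up numbers (the values are hypothetical, chosen only to show how the cached context can dominate the token count):

```python
# Hypothetical values for illustration only.
batch_size = 2
seq_len = 512        # tokens in the current prefill chunk
num_blocks = 64      # KV cache blocks already allocated per sequence
block_size = 128     # tokens per block

context_tokens = num_blocks * block_size              # 8192 cached tokens per sequence
old_count = batch_size * seq_len                      # 1024  -- context ignored
new_count = batch_size * (seq_len + context_tokens)   # 17408 -- context included

# Against a capture threshold of 8192, the old formula would still capture an
# HPU graph (1024 <= 8192), while the corrected formula skips it (17408 > 8192).
print(old_count, new_count)
```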
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
✅ CI Passed. All checks passed successfully against the following vllm commit:
- Set the batched tokens threshold to skip HPU graph to `max_num_batched_tokens` if `max_cudagraph_capture_size` is not set.
- Include the context tokens while calculating the batched tokens.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Reverts #780

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Re-apply #780 to avoid an OOM error caused by too many unnecessary HPU graphs captured for long prefills.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
## Motivation

The current implementation uses the following logic to skip HPU graphs for long prefills:
https://github.com/vllm-project/vllm-gaudi/blob/bcff6c8a4e41dae81bbfd762961430f7607637f9/vllm_gaudi/v1/worker/hpu_model_runner.py#L3105-L3106

However:
- `self.max_cudagraph_capture_size` is not set by default,
- the batched token count is calculated as `batch_size * seq_len`, which misses the context length; with chunked prefill and APC enabled by default, the context length is comparable to or even larger than the query length.

This leads to unnecessary HPU graphs for the compute-bound long prefills and a much larger memory footprint, which may cause an OOM crash.

## Changes

- Set `self.max_cudagraph_capture_size` to `self.max_num_batched_tokens` if it is not set.
- Include the context tokens while calculating the batched tokens.

> This is a re-implementation of the reverted PR #780.

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
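A hedged sketch of the two changes listed above, using assumed class, method, and parameter names (the actual code lives in `hpu_model_runner.py` around the lines linked in the Motivation section); this is illustrative, not the exact implementation:

```python
class HpuGraphPolicySketch:
    """Illustrative stand-in for the relevant parts of the HPU model runner."""

    def __init__(self, max_cudagraph_capture_size, max_num_batched_tokens, block_size):
        # Change 1: fall back to max_num_batched_tokens when
        # max_cudagraph_capture_size is not explicitly configured.
        if max_cudagraph_capture_size is None:
            max_cudagraph_capture_size = max_num_batched_tokens
        self.max_cudagraph_capture_size = max_cudagraph_capture_size
        self.block_size = block_size

    def skip_graph_for_long_prefill(self, batch_size, seq_len, num_blocks):
        # Change 2: include the cached context tokens (num_blocks * block_size)
        # in the batched token count before comparing against the threshold.
        batched_tokens = batch_size * (seq_len + num_blocks * self.block_size)
        return batched_tokens > self.max_cudagraph_capture_size
```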