Ported the reranking models: Bert-based, Roberta-based and Qwen3-based. #1001
michalkuligowski merged 18 commits into vllm-project:main from
Conversation
… Qwen3-based. Signed-off-by: gyou2021 <ganmei.you@intel.com>
Pull request overview
This pull request adds support for three types of reranking models (BERT-based, RoBERTa-based, and Qwen3-based) to the vLLM HPU implementation. The changes primarily focus on enabling pooling models and encoder-only attention mechanisms.
Changes:
- Added monkey-patched forward methods for BERT and RoBERTa sequence classification models
- Enhanced HPU model runner to support pooling models with encoder-only attention
- Added infrastructure for handling token type IDs in reranking models
- Updated worker initialization to handle models with trust_remote_code
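The overview mentions monkey-patched forward methods for the BERT and RoBERTa sequence classification models. As a hedged illustration of the general pattern (the class and function names below are hypothetical, not the actual vllm_gaudi code), patching replaces a method on the class at registration time so every instance picks up the backend-specific path:

```python
# Illustrative sketch of the monkey-patching pattern the PR description
# refers to. `MyModel` and `patched_forward` are made-up names, not the
# actual vllm_gaudi classes.

class MyModel:
    def forward(self, input_ids):
        return f"cpu-path({input_ids})"

def patched_forward(self, input_ids):
    # A backend-specific implementation replaces the original method.
    return f"hpu-path({input_ids})"

# Patch the class once at import/registration time; all instances,
# existing and future, dispatch to the new method.
original_forward = MyModel.forward
MyModel.forward = patched_forward

model = MyModel()
print(model.forward([1, 2, 3]))  # hpu-path([1, 2, 3])
```

Keeping a reference to the original method (as above) is a common safeguard so the patch can be reverted or the original path reused.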
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| vllm_gaudi/models/bert.py | New file that patches BertForSequenceClassification forward method to support HPU |
| vllm_gaudi/models/roberta.py | New file that patches RobertaForSequenceClassification forward method with position handling |
| vllm_gaudi/__init__.py | Registers the new BERT and RoBERTa model patches |
| vllm_gaudi/v1/worker/hpu_worker.py | Added trust_remote_code initialization and KV cache check for pooling models |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Extensive changes to support pooling models, encoder-only attention, token type IDs, and related infrastructure |
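The runner changes include infrastructure for token type IDs, which BERT-style cross-encoder rerankers use to separate the query segment from the document segment. A minimal sketch of that encoding convention (the token IDs below are made up for illustration; a real reranker uses its trained tokenizer):

```python
# Sketch of BERT-style token_type_ids for a (query, document) pair.
# The special-token IDs are BERT's conventional values; the query/doc
# token IDs in the example are arbitrary.
CLS, SEP = 101, 102

def build_pair_inputs(query_ids, doc_ids):
    # Segment A (query) gets type 0, segment B (document) gets type 1,
    # following the BERT convention cross-encoder rerankers rely on.
    input_ids = [CLS] + query_ids + [SEP] + doc_ids + [SEP]
    token_type_ids = [0] * (len(query_ids) + 2) + [1] * (len(doc_ids) + 1)
    return input_ids, token_type_ids

ids, types = build_pair_inputs([7, 8], [9, 10, 11])
print(ids)    # [101, 7, 8, 102, 9, 10, 11, 102]
print(types)  # [0, 0, 0, 0, 1, 1, 1, 1]
```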
Comments suppressed due to low confidence (2)
vllm_gaudi/v1/worker/hpu_model_runner.py:842
- There are two assignments to `self.is_pooling_model` using different methods. Line 799 sets it to `model_config.pooler_config is not None`, while line 842 sets it to `model_config.runner_type == 'pooling'`. These two conditions may not always be equivalent, which could lead to inconsistent behavior. Consolidate these into a single assignment with the correct logic, or document why both are needed.
```python
self.is_pooling_model = model_config.pooler_config is not None
self.sliding_window = model_config.get_sliding_window()
self.interleaved_sliding_window = (is_interleaved(vllm_config.model_config.hf_text_config)
                                   and self.sliding_window)
self.block_size = cache_config.block_size
self.max_model_len = model_config.max_model_len
self.max_num_blocks_per_req = cdiv(self.max_model_len, self.block_size)
# Override settings when profiling a single prefill/decode
# We can do such barbaric changes because we close vllm after the profiling
prompt_profile_cfg, decode_profile_cfg = self._read_profiling_cfg()
if prompt_profile_cfg or decode_profile_cfg:
    self.scheduler_config.max_num_seqs = self.max_model_len
    if prompt_profile_cfg:
        self.scheduler_config.max_num_batched_tokens = prompt_profile_cfg[0] * prompt_profile_cfg[1]
self.max_num_tokens = scheduler_config.max_num_batched_tokens
# Attention layers that are only in the KVCacheConfig of the runner
# (e.g., KV sharing, encoder-only attention), but not in the
# KVCacheConfig of the scheduler.
self.runner_only_attn_layers: set[str] = set()
# Cached outputs.
## universal buffer for input_ids and positions ##
## necessary being used by spec decode by following GPU impl ##
self._draft_token_ids: Optional[Union[list[list[int]], torch.Tensor]] = None
self.input_ids_cpu = torch.zeros(self.max_num_tokens,
                                 dtype=torch.int32,
                                 device="cpu",
                                 pin_memory=self.pin_memory)
self.positions_cpu = torch.zeros(self.max_num_tokens,
                                 dtype=torch.int64,
                                 device="cpu",
                                 pin_memory=self.pin_memory)
self.positions_np = self.positions_cpu.numpy()
self.prefill_use_fusedsdpa = get_config().prompt_attn_impl == 'fsdpa_impl'
###############################################################
# Model-related.
self.num_attn_layers = self.model_config.get_num_layers_by_block_type(self.parallel_config, "attention")
self.num_query_heads = self.model_config.get_num_attention_heads(self.parallel_config)
self.num_kv_heads = self.model_config.get_num_kv_heads(self.parallel_config)
self.head_size = self.model_config.get_head_size()
self.hidden_size = self.model_config.get_hidden_size()
self.is_pooling_model = (model_config.runner_type == 'pooling')
```
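To make the reviewer's concern concrete, here is a small sketch (using dummy stand-in config objects, not vLLM's actual config classes) showing how the two predicates can disagree, in which case whichever assignment runs last silently wins:

```python
from types import SimpleNamespace

# Dummy stand-ins for the model config; not vLLM's real classes. The field
# combinations are chosen only to show the predicates are not equivalent.
cfg_a = SimpleNamespace(pooler_config=None, runner_type='pooling')
cfg_b = SimpleNamespace(pooler_config=object(), runner_type='generate')

def by_pooler(cfg):
    return cfg.pooler_config is not None

def by_runner(cfg):
    return cfg.runner_type == 'pooling'

print(by_pooler(cfg_a), by_runner(cfg_a))  # False True
print(by_pooler(cfg_b), by_runner(cfg_b))  # True False
```

Whenever two such conditions are meant to express the same property, deriving one from the other (or asserting their equivalence once at init) avoids the silent-override hazard.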
vllm_gaudi/v1/worker/hpu_model_runner.py:5918
- The assertion `assert layer_names == set(kv_caches.keys()), "Some layers are not correctly initialized"` appears twice: once at line 5911 and again at line 5918. The second assertion at line 5918 seems to be checking the same condition even after KV cache sharing is set up. If the intention is to verify that shared layers don't change the keys, this should be documented. Otherwise, the duplicate assertion can be removed.
```python
assert layer_names == set(kv_caches.keys()), "Some layers are not correctly initialized"
# Set up cross-layer KV cache sharing
if self.shared_kv_cache_layers:
    logger.info("[KV sharing] Setting up tensor sharing for %s layers", len(self.shared_kv_cache_layers))
    for layer_name, target_layer_name in self.shared_kv_cache_layers.items():
        kv_caches[layer_name] = kv_caches[target_layer_name]
assert layer_names == set(kv_caches.keys()), "Some layers are not correctly initialized"
```
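A small sketch of the key-set invariant the second assertion re-checks, assuming the shared layer name already has an entry in `kv_caches` (if sharing could introduce a brand-new layer name, the second assertion would catch that, which would justify keeping it):

```python
# Sketch: aliasing one existing dict entry to another never changes the
# key set, only the values. Layer names here are illustrative.
kv_caches = {"layer.0": [0], "layer.1": [1]}
shared = {"layer.1": "layer.0"}  # layer.1 reuses layer.0's cache

layer_names = set(kv_caches)
for layer_name, target in shared.items():
    kv_caches[layer_name] = kv_caches[target]

assert layer_names == set(kv_caches)                 # no key added or removed
assert kv_caches["layer.1"] is kv_caches["layer.0"]  # caches are shared
```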
🚧 CI Blocked: The main CI workflow was not started for the following reason:
```python
self.kv_cache_config.kv_cache_groups.append(KVCacheGroupSpec(layer_names=layer_names, kv_cache_spec=spec))
self.is_encoder_only_attn = True

def may_reinitialize_input_batch(self, kv_cache_config: KVCacheConfig, kernel_block_sizes: list[int]) -> None:
```
This method looks like a duplication of the code at line 5718; please reuse it in both places.
Since the `if` conditions are different, the method cannot be reused. Thank you for your comments.
The `kernel_block_sizes` condition can be used there. Is the block split across kernels in the models enabled in this PR?
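The exchange above is about two near-duplicate methods that differ only in their `if` conditions. One common refactor for that situation, sketched here with hypothetical names (this is not the actual vllm_gaudi code), is to extract the shared body and inject the differing condition as a predicate:

```python
from typing import Callable

# Hypothetical sketch: the shared body lives in one function, and each
# call site supplies its own condition as a predicate.
def maybe_reinitialize(batch, should_reinit: Callable[[dict], bool]) -> str:
    if should_reinit(batch):
        return "reinitialized"
    return "kept"

# Each caller keeps its own `if` logic while sharing one implementation.
print(maybe_reinitialize({"dirty": True}, lambda b: b["dirty"]))   # reinitialized
print(maybe_reinitialize({"dirty": False}, lambda b: b["dirty"]))  # kept
```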
…up mode. Signed-off-by: gyou2021 <ganmei.you@intel.com>
✅ CI Passed: All checks passed successfully against the following vllm commit:
…d. (vllm-project#1001) Three kinds of reranking models were added: Bert-based, Roberta-based, and Qwen3-based. --------- Signed-off-by: gyou2021 <ganmei.you@intel.com> Co-authored-by: Iryna Boiko <iryna.boiko@intel.com> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>