
Ported the reranking models: Bert-based, Roberta-based and Qwen3-based. #1001

Merged
michalkuligowski merged 18 commits into vllm-project:main from gyou2021:reranking
Feb 26, 2026

Conversation

@gyou2021
Contributor

Three kinds of reranking models were added: Bert-based, Roberta-based, and Qwen3-based.

Signed-off-by: gyou2021 <ganmei.you@intel.com>
Copilot AI left a comment

Pull request overview

This pull request adds support for three types of reranking models (BERT-based, RoBERTa-based, and Qwen3-based) to the vLLM HPU implementation. The changes primarily focus on enabling pooling models and encoder-only attention mechanisms.

Changes:

  • Added monkey-patched forward methods for BERT and RoBERTa sequence classification models
  • Enhanced HPU model runner to support pooling models with encoder-only attention
  • Added infrastructure for handling token type IDs in reranking models
  • Updated worker initialization to handle models with trust_remote_code
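The monkey-patching approach in the first bullet can be sketched generically. The class and function names below are illustrative stand-ins, not the actual vLLM/Gaudi interfaces:

```python
# Illustrative sketch of the monkey-patching pattern described above.
# The classes and helpers here are stand-ins, not the real
# vLLM/Gaudi interfaces.

class BertForSequenceClassification:  # stand-in for the upstream class
    def forward(self, input_ids):
        return f"default-forward({input_ids})"

def hpu_forward(self, input_ids):
    # Replacement forward that would route through HPU-friendly ops.
    return f"hpu-forward({input_ids})"

def register_hpu_patches():
    # Rebind the method at registration time; conceptually this is what
    # the plugin's model-patch registration does on import.
    BertForSequenceClassification.forward = hpu_forward

register_hpu_patches()
model = BertForSequenceClassification()
print(model.forward([101, 2023, 102]))  # routed through the patched forward
```

Rebinding the class attribute (rather than subclassing) means every existing and future instance picks up the HPU-friendly forward without touching upstream code.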

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.

Summary per file:
  vllm_gaudi/models/bert.py: new file that patches the BertForSequenceClassification forward method to support HPU
  vllm_gaudi/models/roberta.py: new file that patches the RobertaForSequenceClassification forward method with position handling
  vllm_gaudi/__init__.py: registers the new BERT and RoBERTa model patches
  vllm_gaudi/v1/worker/hpu_worker.py: adds trust_remote_code initialization and a KV cache check for pooling models
  vllm_gaudi/v1/worker/hpu_model_runner.py: extensive changes to support pooling models, encoder-only attention, token type IDs, and related infrastructure
Comments suppressed due to low confidence (2)

vllm_gaudi/v1/worker/hpu_model_runner.py:842

  • There are two assignments to self.is_pooling_model using different methods. Line 799 sets it to model_config.pooler_config is not None while line 842 sets it to model_config.runner_type == 'pooling'. These two conditions may not always be equivalent, which could lead to inconsistent behavior. Consolidate these into a single assignment with the correct logic, or document why both are needed.
        self.is_pooling_model = model_config.pooler_config is not None

        self.sliding_window = model_config.get_sliding_window()
        self.interleaved_sliding_window = (is_interleaved(vllm_config.model_config.hf_text_config)
                                           and self.sliding_window)
        self.block_size = cache_config.block_size
        self.max_model_len = model_config.max_model_len
        self.max_num_blocks_per_req = cdiv(self.max_model_len, self.block_size)
        # Override settings when profiling a single prefill/decode
        # We can do such barbaric changes because we close vllm after the profiling
        prompt_profile_cfg, decode_profile_cfg = self._read_profiling_cfg()
        if prompt_profile_cfg or decode_profile_cfg:
            self.scheduler_config.max_num_seqs = self.max_model_len
            if prompt_profile_cfg:
                self.scheduler_config.max_num_batched_tokens = prompt_profile_cfg[0] * prompt_profile_cfg[1]
        self.max_num_tokens = scheduler_config.max_num_batched_tokens

        # Attention layers that are only in the KVCacheConfig of the runner
        # (e.g., KV sharing, encoder-only attention), but not in the
        # KVCacheConfig of the scheduler.
        self.runner_only_attn_layers: set[str] = set()
        # Cached outputs.
        ## universal buffer for input_ids and positions ##
        ## necessary being used by spec decode by following GPU impl ##
        self._draft_token_ids: Optional[Union[list[list[int]], torch.Tensor]] = None
        self.input_ids_cpu = torch.zeros(self.max_num_tokens,
                                         dtype=torch.int32,
                                         device="cpu",
                                         pin_memory=self.pin_memory)
        self.positions_cpu = torch.zeros(self.max_num_tokens,
                                         dtype=torch.int64,
                                         device="cpu",
                                         pin_memory=self.pin_memory)
        self.positions_np = self.positions_cpu.numpy()
        self.prefill_use_fusedsdpa = get_config().prompt_attn_impl == 'fsdpa_impl'
        ###############################################################

        # Model-related.
        self.num_attn_layers = self.model_config.get_num_layers_by_block_type(self.parallel_config, "attention")
        self.num_query_heads = self.model_config.get_num_attention_heads(self.parallel_config)
        self.num_kv_heads = self.model_config.get_num_kv_heads(self.parallel_config)
        self.head_size = self.model_config.get_head_size()
        self.hidden_size = self.model_config.get_hidden_size()
        self.is_pooling_model = (model_config.runner_type == 'pooling')
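To make the review point concrete, here is a toy sketch showing that the two checks can diverge. The config class below is a stand-in, not vLLM's actual ModelConfig:

```python
# Toy illustration of the review point above: the two predicates used
# to set self.is_pooling_model need not agree for every config.
# ToyModelConfig is a stand-in, not vLLM's ModelConfig.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToyModelConfig:
    pooler_config: Optional[dict]
    runner_type: str

def predicates(cfg):
    by_pooler = cfg.pooler_config is not None   # the line-799 style check
    by_runner = cfg.runner_type == "pooling"    # the line-842 style check
    return by_pooler, by_runner

agree = predicates(ToyModelConfig({"pooling_type": "CLS"}, "pooling"))
disagree = predicates(ToyModelConfig(None, "pooling"))
print(agree, disagree)  # (True, True) (False, True)
```

Whether such a divergent config can actually arise depends on vLLM's config-construction logic, which is exactly why the review asks for a single consolidated assignment or a comment explaining why both are needed.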

vllm_gaudi/v1/worker/hpu_model_runner.py:5918

  • The assertion assert layer_names == set(kv_caches.keys()), "Some layers are not correctly initialized" appears twice: once at line 5911 and again at line 5918. The second assertion at line 5918 seems to be checking the same condition even after KV cache sharing is set up. If the intention is to verify that shared layers don't change the keys, this should be documented. Otherwise, the duplicate assertion can be removed.
        assert layer_names == set(kv_caches.keys()), "Some layers are not correctly initialized"
        # Set up cross-layer KV cache sharing
        if self.shared_kv_cache_layers:
            logger.info("[KV sharing] Setting up tensor sharing for %s layers", len(self.shared_kv_cache_layers))
            for layer_name, target_layer_name in self.shared_kv_cache_layers.items():
                kv_caches[layer_name] = kv_caches[target_layer_name]

        assert layer_names == set(kv_caches.keys()), "Some layers are not correctly initialized"
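If the second assertion is intentional, one option the review allows is to keep it but document its purpose. A minimal sketch of that pattern, using toy dict entries in place of the runner's real KV cache tensors:

```python
# Sketch of the KV-sharing wiring flagged above, with the second
# assertion kept but documented. The dict of ints is a stand-in for
# the real per-layer KV cache tensors.
def wire_shared_kv_caches(kv_caches, layer_names, shared_kv_cache_layers):
    assert layer_names == set(kv_caches), "Some layers are not correctly initialized"
    # Alias each sharing layer to its target layer's cache.
    for layer_name, target_layer_name in shared_kv_cache_layers.items():
        kv_caches[layer_name] = kv_caches[target_layer_name]
    # Re-assert deliberately: aliasing must only rebind existing keys,
    # never add or drop entries, so the key set must be unchanged.
    assert layer_names == set(kv_caches), "KV sharing changed the cache key set"
    return kv_caches

caches = wire_shared_kv_caches(
    {"layer.0": 0, "layer.1": 1},
    {"layer.0", "layer.1"},
    {"layer.1": "layer.0"},
)
print(caches)  # {'layer.0': 0, 'layer.1': 0}
```

With the comment in place, the second assertion reads as a guard on the sharing loop rather than an accidental copy of the first.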


Signed-off-by: gyou2021 <ganmei.you@intel.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

1 similar comment

self.kv_cache_config.kv_cache_groups.append(KVCacheGroupSpec(layer_names=layer_names, kv_cache_spec=spec))
self.is_encoder_only_attn = True

def may_reinitialize_input_batch(self, kv_cache_config: KVCacheConfig, kernel_block_sizes: list[int]) -> None:
Collaborator

This method looks like a duplication of the code at line 5718; please reuse it in both places.

Contributor Author

@gyou2021 Feb 24, 2026

Since the if conditions are different, the method cannot be reused. Thank you for your comments.

Collaborator

The kernel_block_sizes condition can be used there. Is block splitting across kernels enabled for the models in this PR?

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
dc6de33c3d5e9026cef7b27791dfe0f98e64bbde


@michalkuligowski michalkuligowski merged commit 6728857 into vllm-project:main Feb 26, 2026
67 checks passed
12010486 pushed a commit to 12010486/vllm-gaudi that referenced this pull request Mar 5, 2026
…d. (vllm-project#1001)

Three kinds of reranking models were added: Bert-based, Roberta-based,
and Qwen3-based.

---------

Signed-off-by: gyou2021 <ganmei.you@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
SKRohit pushed a commit to SKRohit/vllm-gaudi that referenced this pull request Mar 12, 2026
…d. (vllm-project#1001)

adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
…d. (#1001)


4 participants