Ported the reranking models: Bert-based, Roberta-based and Qwen3-based. #1001
michalkuligowski merged 18 commits into vllm-project:main from
Conversation
… Qwen3-based. Signed-off-by: gyou2021 <ganmei.you@intel.com>
Pull request overview
This pull request adds support for three types of reranking models (BERT-based, RoBERTa-based, and Qwen3-based) to the vLLM HPU implementation. The changes primarily focus on enabling pooling models and encoder-only attention mechanisms.
Changes:
- Added monkey-patched forward methods for BERT and RoBERTa sequence classification models
- Enhanced HPU model runner to support pooling models with encoder-only attention
- Added infrastructure for handling token type IDs in reranking models
- Updated worker initialization to handle models with trust_remote_code
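The overview mentions monkey-patched forward methods for the BERT and RoBERTa sequence classification models. As a hedged illustration of the general pattern (the class and function names below are hypothetical, not the actual vllm_gaudi code), patching replaces a method on the class at registration time so every instance picks up the backend-specific path:

```python
# Illustrative sketch of the monkey-patching pattern the PR description
# refers to. `MyModel` and `patched_forward` are made-up names, not the
# actual vllm_gaudi classes.

class MyModel:
    def forward(self, input_ids):
        return f"cpu-path({input_ids})"

def patched_forward(self, input_ids):
    # A backend-specific implementation replaces the original method.
    return f"hpu-path({input_ids})"

# Patch the class once at import/registration time; all instances,
# existing and future, dispatch to the new method.
original_forward = MyModel.forward
MyModel.forward = patched_forward

model = MyModel()
print(model.forward([1, 2, 3]))  # hpu-path([1, 2, 3])
```

Keeping a reference to the original method (as above) is a common safeguard so the patch can be reverted or the original path reused.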
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| vllm_gaudi/models/bert.py | New file that patches BertForSequenceClassification forward method to support HPU |
| vllm_gaudi/models/roberta.py | New file that patches RobertaForSequenceClassification forward method with position handling |
| vllm_gaudi/__init__.py | Registers the new BERT and RoBERTa model patches |
| vllm_gaudi/v1/worker/hpu_worker.py | Added trust_remote_code initialization and KV cache check for pooling models |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Extensive changes to support pooling models, encoder-only attention, token type IDs, and related infrastructure |
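The runner changes include infrastructure for token type IDs, which BERT-style cross-encoder rerankers use to separate the query segment from the document segment. A minimal sketch of that encoding convention (the token IDs below are made up for illustration; a real reranker uses its trained tokenizer):

```python
# Sketch of BERT-style token_type_ids for a (query, document) pair.
# The special-token IDs are BERT's conventional values; the query/doc
# token IDs in the example are arbitrary.
CLS, SEP = 101, 102

def build_pair_inputs(query_ids, doc_ids):
    # Segment A (query) gets type 0, segment B (document) gets type 1,
    # following the BERT convention cross-encoder rerankers rely on.
    input_ids = [CLS] + query_ids + [SEP] + doc_ids + [SEP]
    token_type_ids = [0] * (len(query_ids) + 2) + [1] * (len(doc_ids) + 1)
    return input_ids, token_type_ids

ids, types = build_pair_inputs([7, 8], [9, 10, 11])
print(ids)    # [101, 7, 8, 102, 9, 10, 11, 102]
print(types)  # [0, 0, 0, 0, 1, 1, 1, 1]
```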
Comments suppressed due to low confidence (2)
vllm_gaudi/v1/worker/hpu_model_runner.py:842
- There are two assignments to `self.is_pooling_model` using different methods. Line 799 sets it to `model_config.pooler_config is not None`, while line 842 sets it to `model_config.runner_type == 'pooling'`. These two conditions may not always be equivalent, which could lead to inconsistent behavior. Consolidate these into a single assignment with the correct logic, or document why both are needed.
```python
self.is_pooling_model = model_config.pooler_config is not None
self.sliding_window = model_config.get_sliding_window()
self.interleaved_sliding_window = (is_interleaved(vllm_config.model_config.hf_text_config)
                                   and self.sliding_window)
self.block_size = cache_config.block_size
self.max_model_len = model_config.max_model_len
self.max_num_blocks_per_req = cdiv(self.max_model_len, self.block_size)
# Override settings when profiling a single prefill/decode
# We can do such barbaric changes because we close vllm after the profiling
prompt_profile_cfg, decode_profile_cfg = self._read_profiling_cfg()
if prompt_profile_cfg or decode_profile_cfg:
    self.scheduler_config.max_num_seqs = self.max_model_len
    if prompt_profile_cfg:
        self.scheduler_config.max_num_batched_tokens = prompt_profile_cfg[0] * prompt_profile_cfg[1]
self.max_num_tokens = scheduler_config.max_num_batched_tokens
# Attention layers that are only in the KVCacheConfig of the runner
# (e.g., KV sharing, encoder-only attention), but not in the
# KVCacheConfig of the scheduler.
self.runner_only_attn_layers: set[str] = set()
# Cached outputs.
## universal buffer for input_ids and positions ##
## necessary being used by spec decode by following GPU impl ##
self._draft_token_ids: Optional[Union[list[list[int]], torch.Tensor]] = None
self.input_ids_cpu = torch.zeros(self.max_num_tokens,
                                 dtype=torch.int32,
                                 device="cpu",
                                 pin_memory=self.pin_memory)
self.positions_cpu = torch.zeros(self.max_num_tokens,
                                 dtype=torch.int64,
                                 device="cpu",
                                 pin_memory=self.pin_memory)
self.positions_np = self.positions_cpu.numpy()
self.prefill_use_fusedsdpa = get_config().prompt_attn_impl == 'fsdpa_impl'
###############################################################
# Model-related.
self.num_attn_layers = self.model_config.get_num_layers_by_block_type(self.parallel_config, "attention")
self.num_query_heads = self.model_config.get_num_attention_heads(self.parallel_config)
self.num_kv_heads = self.model_config.get_num_kv_heads(self.parallel_config)
self.head_size = self.model_config.get_head_size()
self.hidden_size = self.model_config.get_hidden_size()
self.is_pooling_model = (model_config.runner_type == 'pooling')
```
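To make the reviewer's concern concrete, here is a small sketch (using dummy stand-in config objects, not vLLM's actual config classes) showing how the two predicates can disagree, in which case whichever assignment runs last silently wins:

```python
from types import SimpleNamespace

# Dummy stand-ins for the model config; not vLLM's real classes. The field
# combinations are chosen only to show the predicates are not equivalent.
cfg_a = SimpleNamespace(pooler_config=None, runner_type='pooling')
cfg_b = SimpleNamespace(pooler_config=object(), runner_type='generate')

def by_pooler(cfg):
    return cfg.pooler_config is not None

def by_runner(cfg):
    return cfg.runner_type == 'pooling'

print(by_pooler(cfg_a), by_runner(cfg_a))  # False True
print(by_pooler(cfg_b), by_runner(cfg_b))  # True False
```

Whenever two such conditions are meant to express the same property, deriving one from the other (or asserting their equivalence once at init) avoids the silent-override hazard.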
vllm_gaudi/v1/worker/hpu_model_runner.py:5918
- The assertion `assert layer_names == set(kv_caches.keys()), "Some layers are not correctly initialized"` appears twice: once at line 5911 and again at line 5918. The second assertion at line 5918 seems to be checking the same condition even after KV cache sharing is set up. If the intention is to verify that shared layers don't change the keys, this should be documented. Otherwise, the duplicate assertion can be removed.
```python
assert layer_names == set(kv_caches.keys()), "Some layers are not correctly initialized"
# Set up cross-layer KV cache sharing
if self.shared_kv_cache_layers:
    logger.info("[KV sharing] Setting up tensor sharing for %s layers", len(self.shared_kv_cache_layers))
    for layer_name, target_layer_name in self.shared_kv_cache_layers.items():
        kv_caches[layer_name] = kv_caches[target_layer_name]
assert layer_names == set(kv_caches.keys()), "Some layers are not correctly initialized"
```
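A small sketch of the key-set invariant the second assertion re-checks, assuming the shared layer name already has an entry in `kv_caches` (if sharing could introduce a brand-new layer name, the second assertion would catch that, which would justify keeping it):

```python
# Sketch: aliasing one existing dict entry to another never changes the
# key set, only the values. Layer names here are illustrative.
kv_caches = {"layer.0": [0], "layer.1": [1]}
shared = {"layer.1": "layer.0"}  # layer.1 reuses layer.0's cache

layer_names = set(kv_caches)
for layer_name, target in shared.items():
    kv_caches[layer_name] = kv_caches[target]

assert layer_names == set(kv_caches)                 # no key added or removed
assert kv_caches["layer.1"] is kv_caches["layer.0"]  # caches are shared
```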
🚧 CI Blocked: The main CI workflow was not started for the following reason:
```python
self.kv_cache_config.kv_cache_groups.append(KVCacheGroupSpec(layer_names=layer_names, kv_cache_spec=spec))
self.is_encoder_only_attn = True

def may_reinitialize_input_batch(self, kv_cache_config: KVCacheConfig, kernel_block_sizes: list[int]) -> None:
```
This method looks like a duplication of the code at line 5718; please reuse it in both places.
Since the `if` conditions are different, the method cannot be reused. Thank you for your comments.
The `kernel_block_sizes` condition can be used there. Is the block split across kernels in the models enabled in this PR?
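The exchange above is about two near-duplicate methods that differ only in their `if` conditions. One common refactor for that situation, sketched here with hypothetical names (this is not the actual vllm_gaudi code), is to extract the shared body and inject the differing condition as a predicate:

```python
from typing import Callable

# Hypothetical sketch: the shared body lives in one function, and each
# call site supplies its own condition as a predicate.
def maybe_reinitialize(batch, should_reinit: Callable[[dict], bool]) -> str:
    if should_reinit(batch):
        return "reinitialized"
    return "kept"

# Each caller keeps its own `if` logic while sharing one implementation.
print(maybe_reinitialize({"dirty": True}, lambda b: b["dirty"]))   # reinitialized
print(maybe_reinitialize({"dirty": False}, lambda b: b["dirty"]))  # kept
```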
…up mode. Signed-off-by: gyou2021 <ganmei.you@intel.com>
✅ CI Passed: All checks passed successfully against the following vllm commit:
…d. (vllm-project#1001) Three kinds of reranking models were added: Bert-based, Roberta-based, and Qwen3-based. --------- Signed-off-by: gyou2021 <ganmei.you@intel.com> Co-authored-by: Iryna Boiko <iryna.boiko@intel.com> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>