[refactor] refactor excute_model and _dymmy_run method #6043
wangxiyuan merged 35 commits into vllm-project:main
Conversation
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
…to modelrunner-refactor # Conflicts: # vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the execute_model and _dummy_run methods in model_runner_v1.py, moving significant logic into new helper methods such as _preprocess, _build_attention_metadata, and _determine_batch_execution_and_padding. This refactoring aims to improve modularity and align with potentially updated upstream vLLM structures. Additionally, minor changes are introduced in pcp_utils.py to track PCP tokens. The changes are extensive and touch core execution paths, requiring thorough validation.
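To make the new control flow easier to follow, here is a minimal, self-contained sketch of the call order the refactor aims for. Only the helper names `_preprocess`, `_determine_batch_execution_and_padding`, `_build_attention_metadata`, and `_model_forward` come from this PR; the `SketchRunner` class, the `BatchPlan` dataclass, and the round-up-to-8 padding rule are invented here purely for illustration and are not the NPUModelRunner implementation.

```python
from dataclasses import dataclass


@dataclass
class BatchPlan:
    num_tokens: int
    num_tokens_padded: int


class SketchRunner:
    def _preprocess(self, scheduler_output):
        # Gather total scheduled tokens and request count from the scheduler output.
        return sum(scheduler_output.values()), len(scheduler_output)

    def _determine_batch_execution_and_padding(self, num_tokens):
        # Pick a padded batch size (here: round up to a multiple of 8).
        return BatchPlan(num_tokens, (num_tokens + 7) // 8 * 8)

    def _build_attention_metadata(self, plan, num_reqs):
        # Build whatever metadata the attention backends need for this step.
        return {"num_reqs": num_reqs, "num_tokens_padded": plan.num_tokens_padded}

    def _model_forward(self, plan, attn_metadata):
        # Stand-in for the actual model call.
        return list(range(plan.num_tokens_padded))

    def execute_model(self, scheduler_output):
        num_tokens, num_reqs = self._preprocess(scheduler_output)
        plan = self._determine_batch_execution_and_padding(num_tokens)
        attn_metadata = self._build_attention_metadata(plan, num_reqs)
        hidden = self._model_forward(plan, attn_metadata)
        # Post-processing (logits gathering, sampling) would follow here.
        return hidden[:num_tokens]


print(SketchRunner().execute_model({"req-0": 5, "req-1": 3}))
```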
```python
def _build_attention_metadata(
    self,
    num_tokens: int,
    num_reqs: int,
    max_query_len: int,
    num_tokens_padded: int | None = None,
    num_reqs_padded: int | None = None,
    ubatch_slices: UBatchSlices | None = None,
    logits_indices: torch.Tensor | None = None,
    use_spec_decode: bool = False,
    for_cudagraph_capture: bool = False,
    num_scheduled_tokens: dict[str, int] | None = None,
    cascade_attn_prefix_lens: list[list[int]] | None = None,
) -> tuple[PerLayerAttnMetadata, CommonAttentionMetadata | None]:
    """
    :return: tuple[attn_metadata, spec_decode_common_attn_metadata]
    """
    # Attention metadata is not needed for attention-free models
    if len(self.kv_cache_config.kv_cache_groups) == 0:
        return {}, None
    num_tokens_padded = num_tokens_padded or num_tokens
    num_reqs_padded = num_reqs_padded or num_reqs
    attn_metadata: PerLayerAttnMetadata = {}
    if ubatch_slices is not None:
        attn_metadata = [dict() for _ in range(len(ubatch_slices))]
    if for_cudagraph_capture:
        # For some attention backends (e.g. FA) with sliding window models we need
        # to make sure the backend sees a max_seq_len that is larger than the
        # sliding window size when capturing, so that the correct kernel is selected.
        max_seq_len = self.max_model_len
    else:
        max_seq_len = self.seq_lens.np[:num_reqs].max().item()
    if use_spec_decode and self.need_accepted_tokens:
        self.num_accepted_tokens.np[:num_reqs] = (
            self.input_batch.num_accepted_tokens_cpu[:num_reqs])
        self.num_accepted_tokens.np[num_reqs:].fill(1)
        self.num_accepted_tokens.copy_to_gpu()

    kv_cache_groups = self.kv_cache_config.kv_cache_groups

    def _get_pcp_metadata(num_tokens):
        if not self.use_cp:
            return None
        return self.pcp_manager.generate_pcp_metadata(
            num_tokens, self.query_lens, self.input_batch, num_scheduled_tokens)

    def _get_block_table_and_slot_mapping(kv_cache_gid: int):
        assert num_reqs_padded is not None and num_tokens_padded is not None
        kv_cache_spec = kv_cache_groups[kv_cache_gid].kv_cache_spec
        maybe_pcp_full_tokens = (
            num_tokens if self.pcp_size == 1 else
            num_tokens * self.pcp_size -
            sum(self.pcp_manager.num_pcp_pads_cpu[:num_reqs]))
        if isinstance(kv_cache_spec, EncoderOnlyAttentionSpec):
            blk_table_tensor = torch.zeros(
                (num_reqs_padded, 1),
                dtype=torch.int32,
                device=self.device,
            )
            slot_mapping = torch.zeros(
                (num_tokens_padded,),
                dtype=torch.int64,
                device=self.device,
            )
        else:
            blk_table = self.input_batch.block_table[kv_cache_gid]
            slot_mapping = blk_table.slot_mapping.gpu[:maybe_pcp_full_tokens]
            maybe_num_reqs_padded = (num_reqs_padded * self.decode_token_per_req
                                     if self.use_cp else num_reqs_padded)
            blk_table_tensor = blk_table.get_device_tensor()[:maybe_num_reqs_padded]

            # Fill unused entries with -1. Needed for reshape_and_cache in full cuda
            # graph mode; -1 in `blk_table_tensor` matches the mamba PAD_SLOT_ID.
            if self.pcp_size == 1:
                slot_mapping[num_tokens:num_tokens_padded].fill_(-1)
                blk_table_tensor[num_reqs:num_reqs_padded].fill_(-1)
            if self.pcp_size > 1:
                slot_mapping = self.pcp_manager.get_padded_slot_mapping(
                    num_tokens,
                    slot_mapping,
                )
        return blk_table_tensor, slot_mapping

    long_seq_metdadata = _get_pcp_metadata(num_tokens)
    block_table_gid_0, slot_mapping_gid_0 = _get_block_table_and_slot_mapping(0)

    if for_cudagraph_capture:
        self.attn_state = AscendAttentionState.DecodeOnly
        if self.speculative_config and \
                self.speculative_config.method == "mtp":
            # `AscendAttentionState.SpecDecoding` is only designed for mla
            if self.vllm_config.model_config.use_mla:
                self.attn_state = AscendAttentionState.SpecDecoding
            else:
                self.attn_state = AscendAttentionState.ChunkedPrefill
    cm_base = AscendCommonAttentionMetadata(
        query_start_loc=self.query_start_loc.gpu[:num_reqs_padded + 1],
        query_start_loc_cpu=self.query_start_loc.cpu[:num_reqs_padded + 1],
        seq_lens=self.seq_lens.gpu[:num_reqs_padded],
        # TODO
        seq_lens_cpu=self.seq_lens.cpu[:num_reqs_padded],
        # TODO
        num_computed_tokens_cpu=self.input_batch.num_computed_tokens_cpu_tensor[
            :num_reqs_padded
        ],
        num_reqs=num_reqs_padded,
        num_actual_tokens=num_tokens,
        max_query_len=max_query_len,
        max_seq_len=max_seq_len,
        block_table_tensor=block_table_gid_0,
        slot_mapping=slot_mapping_gid_0,
        causal=True,
        num_input_tokens=num_tokens_padded,
        actual_seq_lengths_q=self.actual_seq_lengths_q,
        positions=self.positions.gpu,
        attn_state=self.attn_state,
        decode_token_per_req=self.decode_token_per_req,
        prefill_context_parallel_metadata=long_seq_metdadata,
    )

    if logits_indices is not None and self.cache_config.kv_sharing_fast_prefill:
        cm_base.num_logits_indices = logits_indices.size(0)
        cm_base.logits_indices_padded = self._prepare_kv_sharing_fast_prefill(
            logits_indices
        )

    def _build_attn_group_metadata(
        kv_cache_gid: int,
        attn_gid: int,
        common_attn_metadata: CommonAttentionMetadata,
        ubid: int | None = None,
    ) -> None:
        attn_group = self.attn_groups[kv_cache_gid][attn_gid]
        builder = attn_group.get_metadata_builder(ubid or 0)
        cascade_attn_prefix_len = (
            cascade_attn_prefix_lens[kv_cache_gid][attn_gid]
            if cascade_attn_prefix_lens
            else 0
        )

        extra_attn_metadata_args = {}
        if use_spec_decode and isinstance(builder, GDNAttentionMetadataBuilder):
            assert ubid is None, "UBatching not supported with GDN yet"
            patch_torch_npu_argsort()
            extra_attn_metadata_args = dict(
                num_accepted_tokens=self.num_accepted_tokens.gpu[:num_reqs_padded],
                num_decode_draft_tokens_cpu=self.num_decode_draft_tokens.cpu[
                    :num_reqs_padded
                ],
            )

        if for_cudagraph_capture:
            attn_metadata_i = builder.build_for_cudagraph_capture(
                common_attn_metadata
            )
        else:
            attn_metadata_i = builder.build(
                common_prefix_len=cascade_attn_prefix_len,
                common_attn_metadata=common_attn_metadata,
                **extra_attn_metadata_args,
            )

        if ubid is None:
            assert isinstance(attn_metadata, dict)
            attn_metadata_dict = attn_metadata
        else:
            assert isinstance(attn_metadata, list)
            attn_metadata_dict = attn_metadata[ubid]

        for layer_name in attn_group.layer_names:
            attn_metadata_dict[layer_name] = attn_metadata_i

    # Prepare the attention metadata for each KV cache group and make layers
    # in the same group share the same metadata.
    spec_decode_common_attn_metadata = None
    for kv_cache_gid, kv_cache_group in enumerate(
            self.kv_cache_config.kv_cache_groups):
        cm = copy(cm_base)  # shallow copy
        # Basically only the encoder seq_lens, block_table and slot_mapping change
        # for each kv_cache_group.
        cm.encoder_seq_lens, cm.encoder_seq_lens_cpu = self._get_encoder_seq_lens(
            num_scheduled_tokens or {},
            kv_cache_group.kv_cache_spec,
            num_reqs_padded,
        )
        if kv_cache_gid > 0:
            cm.block_table_tensor, cm.slot_mapping = (
                _get_block_table_and_slot_mapping(kv_cache_gid)
            )
        if self.speculative_config and spec_decode_common_attn_metadata is None:
            if isinstance(self.drafter, EagleProposer):
                if self.drafter.attn_layer_names[0] in kv_cache_group.layer_names:
                    spec_decode_common_attn_metadata = cm
            else:
                spec_decode_common_attn_metadata = cm

        for attn_gid in range(len(self.attn_groups[kv_cache_gid])):
            if ubatch_slices is not None:
                for ubid, _cm in enumerate(split_attn_metadata(ubatch_slices, cm)):
                    _build_attn_group_metadata(kv_cache_gid, attn_gid, _cm, ubid)
            else:
                _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)

    if self.is_mm_prefix_lm:
        req_doc_ranges = {}
        for req_id in self.input_batch.req_ids:
            image_doc_ranges = []
            req_state = self.requests[req_id]
            for mm_feature in req_state.mm_features:
                pos_info = mm_feature.mm_position
                img_doc_range = pos_info.extract_embeds_range()
                image_doc_ranges.extend(img_doc_range)
            req_idx = self.input_batch.req_id_to_index[req_id]
            req_doc_ranges[req_idx] = image_doc_ranges

        if isinstance(attn_metadata, list):
            for ub_metadata in attn_metadata:
                for _metadata in ub_metadata.values():
                    _metadata.mm_prefix_range = req_doc_ranges  # type: ignore[attr-defined]
        else:
            for _metadata in attn_metadata.values():
                _metadata.mm_prefix_range = req_doc_ranges  # type: ignore[attr-defined]

    if spec_decode_common_attn_metadata is not None and (
        num_reqs != num_reqs_padded or num_tokens != num_tokens_padded
    ):
        # Currently the drafter still only uses piecewise cudagraphs (and modifies
        # the attention metadata directly), and therefore does not want to use
        # padded attention metadata.
        spec_decode_common_attn_metadata = (
            spec_decode_common_attn_metadata.unpadded(num_tokens, num_reqs)
        )
    return attn_metadata, spec_decode_common_attn_metadata
```
The _build_attention_metadata method is a large and complex function responsible for creating all attention-related metadata. It handles various configurations like padding, speculative decoding, PCP, and different KV cache groups. Given its complexity and central role, any logical error in constructing this metadata could lead to incorrect attention calculations, which is a critical correctness issue. Thorough testing of all branches and configurations is essential.
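One convention in the diff above that is easy to miss is the padding fill: slot mapping entries past the real token count are set to -1 so that cache-write kernels skip them when the batch is padded for full-graph execution. A tiny standalone illustration with toy values (not the runner's actual buffers):

```python
import torch

num_tokens, num_tokens_padded = 5, 8          # 5 real tokens, padded batch of 8
slot_mapping = torch.arange(num_tokens_padded, dtype=torch.int64)

# Mirror the convention in _get_block_table_and_slot_mapping: padded slots get -1
# so reshape_and_cache-style kernels know not to write them.
slot_mapping[num_tokens:num_tokens_padded].fill_(-1)
print(slot_mapping)  # tensor([ 0,  1,  2,  3,  4, -1, -1, -1])
```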
```python
(attn_metadata, spec_decode_common_attn_metadata) = (
    self._build_attention_metadata(
        num_tokens=num_tokens_unpadded,
        num_tokens_padded=num_tokens_padded if pad_attn else None,
        num_reqs=num_reqs,
        num_reqs_padded=num_reqs_padded if pad_attn else None,
        max_query_len=max_num_scheduled_tokens,
        ubatch_slices=ubatch_slices_attn,
        logits_indices=logits_indices,
        use_spec_decode=use_spec_decode,
        num_scheduled_tokens=scheduler_output.num_scheduled_tokens,
        cascade_attn_prefix_lens=cascade_attn_prefix_lens,
    )
)
```
The _build_attention_metadata method is now responsible for constructing the attention metadata for all layers. This is a critical component, as incorrect attention metadata can lead to severe correctness issues in the attention mechanism. It's essential to ensure that all parameters (padding, number of requests, query length, speculative decoding flags, cascade attention, etc.) are correctly translated into the attn_metadata and spec_decode_common_attn_metadata objects.
```python
def _post_process_cudagraph_mode(tensor: torch.Tensor) -> int:
    """
    Synchronize cudagraph_mode across DP ranks by taking the minimum.
    If any rank has NONE (0), all ranks use NONE.
    This ensures all ranks send consistent values (all padded or all unpadded).
    """
    return int(tensor[1, :].min().item())
```
The _post_process_cudagraph_mode function synchronizes the CUDA graph mode across DP ranks by taking the minimum. This is a critical step to ensure all ranks operate in a consistent mode (e.g., all use NONE if any rank uses NONE). Incorrect synchronization here could lead to deadlocks or inconsistent graph execution across distributed ranks.
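A small standalone example of the reduction, using an assumed two-row layout (row 0: padded token counts per DP rank, row 1: each rank's proposed mode) just to show why taking the minimum gives a consistent result across ranks:

```python
import torch

NONE, PIECEWISE, FULL = 0, 1, 2  # illustrative mode encoding; the lowest value wins

# Hypothetical gathered tensor: row 0 holds num_tokens_padded per DP rank,
# row 1 holds each rank's proposed cudagraph mode.
gathered = torch.tensor([
    [128, 256, 128, 64],
    [FULL, PIECEWISE, FULL, NONE],
])

synced_mode = int(gathered[1, :].min().item())
assert synced_mode == NONE  # one rank opted out, so every rank runs without graphs
```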
```diff
 def _prepare_inputs(
     self,
     scheduler_output: "SchedulerOutput",
-    intermediate_tensors: Optional[IntermediateTensors] = None,
-) -> tuple[dict[str, Any], torch.Tensor, np.ndarray, int, torch.Tensor,
-           int, torch.Tensor, SpecDecodeMetadata, Optional[torch.Tensor],
-           Optional[torch.Tensor], Optional[torch.Tensor], int, dict[str, Any]]:
+    num_scheduled_tokens: np.ndarray,
+) -> tuple[
+    torch.Tensor,
+    SpecDecodeMetadata | None]:
+    """
+    :return: tuple[
+        logits_indices, spec_decode_metadata,
+    ]
+    """
```
The _prepare_inputs method's signature and return type have been significantly altered. It now returns only logits_indices and spec_decode_metadata, implying that all other input preparation logic has been moved elsewhere. This is a major refactoring, and it's critical to ensure that all necessary data for model execution is correctly prepared and passed through the new helper methods (_preprocess, _build_attention_metadata, etc.). Any missed data or incorrect transformations could lead to runtime errors or incorrect model behavior.
```python
def _determine_batch_execution_and_padding(
    self,
    num_tokens: int,
    num_reqs: int,
    num_scheduled_tokens_np: np.ndarray,
    max_num_scheduled_tokens: int,
    use_cascade_attn: bool,
    allow_microbatching: bool = False,
    force_eager: bool = False,
    # For cudagraph capture. TODO(lucas): Refactor how we capture cudagraphs (will
    # be improved in model runner v2)
    force_uniform_decode: bool | None = None,
    force_has_lora: bool | None = None,
    num_encoder_reqs: int = 0,
) -> tuple[CUDAGraphMode, BatchDescriptor, bool,
           torch.Tensor | None, CUDAGraphStat | None]:

    num_tokens_padded = self._pad_for_sequence_parallelism(num_tokens)
    uniform_decode = (
        ((max_num_scheduled_tokens == self.uniform_decode_query_len) and
         (num_tokens == max_num_scheduled_tokens * num_reqs))
        if force_uniform_decode is None else force_uniform_decode)
    # Encoder-decoder models only support CG for decoder_step > 0 (no enc_output
    # is present). Also, chunked prefill is disabled, so batches are uniform.
    has_encoder_output = (self.model_config.is_encoder_decoder
                          and num_encoder_reqs > 0)
    has_lora = (len(self.input_batch.lora_id_to_lora_request) > 0
                if force_has_lora is None else force_has_lora)

    # ruff: noqa: E731
    dispatch_cudagraph = (
        lambda num_tokens, disable_full: self.cudagraph_dispatcher.
        dispatch(
            num_tokens=num_tokens,
            has_lora=has_lora,
            uniform_decode=uniform_decode,
            disable_full=disable_full,
        ) if not force_eager else
        (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded)))
    cudagraph_mode, batch_descriptor = dispatch_cudagraph(
        num_tokens_padded, use_cascade_attn or has_encoder_output)
    num_tokens_padded = batch_descriptor.num_tokens
    if enable_sp(self.vllm_config):
        assert (batch_descriptor.num_tokens %
                self.vllm_config.parallel_config.tensor_parallel_size == 0
                ), ("Sequence parallelism requires num_tokens to be "
                    "a multiple of tensor parallel size")
    # Extra coordination when running data-parallel since we need to coordinate
    # across ranks
    should_ubatch, num_tokens_across_dp = False, None
    if self.vllm_config.parallel_config.data_parallel_size > 1:
        _, num_tokens_across_dp, synced_cudagraph_mode = self._sync_batch_across_dp(
            num_tokens_padded=num_tokens_padded,
            cudagraph_mode=cudagraph_mode.value,
        )

        # Extract DP padding if there is any
        if num_tokens_across_dp is not None:
            dp_rank = self.parallel_config.data_parallel_rank
            num_tokens_padded = int(num_tokens_across_dp[dp_rank].item())
            # Re-dispatch with DP padding
            cudagraph_mode, batch_descriptor = dispatch_cudagraph(
                num_tokens_padded,
                disable_full=synced_cudagraph_mode <= CUDAGraphMode.PIECEWISE.value,
            )
            # Assert to make sure the agreed-upon token count is correct, otherwise
            # num_tokens_across_dp would no longer be valid
            assert batch_descriptor.num_tokens == num_tokens_padded
    cudagraph_stats = None
    if self.vllm_config.observability_config.cudagraph_metrics:
        cudagraph_stats = CUDAGraphStat(
            num_unpadded_tokens=num_tokens,
            num_padded_tokens=batch_descriptor.num_tokens,
            num_paddings=batch_descriptor.num_tokens - num_tokens,
            runtime_mode=str(cudagraph_mode),
        )

    return (
        cudagraph_mode,
        batch_descriptor,
        should_ubatch,
        num_tokens_across_dp,
        cudagraph_stats,
    )
```
The _determine_batch_execution_and_padding method is a complex new addition that orchestrates CUDA graph mode, batch descriptors, microbatching, and DP synchronization. Its correctness is paramount for efficient and correct execution. The TODO(lucas): Refactor how we capture cudagraphs (will be improved in model runner v2) indicates ongoing work, suggesting potential for future improvements or current limitations. Any logical errors in this method could lead to severe performance issues or incorrect results.
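For intuition, here is an illustrative sketch of the dispatch-and-pad idea the method builds on: round the token count up to the nearest captured graph size, or fall back to eager when forced or when no captured size fits. The capture sizes and return values are assumptions for the example, not the dispatcher this PR uses.

```python
from bisect import bisect_left

CAPTURE_SIZES = [8, 16, 32, 64, 128, 256, 512]  # hypothetical captured graph sizes


def dispatch(num_tokens: int, force_eager: bool = False) -> tuple[str, int]:
    if force_eager or num_tokens > CAPTURE_SIZES[-1]:
        # No captured graph applies: run eagerly with the unpadded token count.
        return "NONE", num_tokens
    # Pad up to the smallest captured size that can hold this batch.
    padded = CAPTURE_SIZES[bisect_left(CAPTURE_SIZES, num_tokens)]
    return "FULL", padded


print(dispatch(100))         # ('FULL', 128)
print(dispatch(100, True))   # ('NONE', 100)
print(dispatch(1000))        # ('NONE', 1000)
```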
```python
if create_mixed_batch:
    raise NotImplementedError(
        "create_mixed_batch is used for warmup deepgemm, vllm-ascend does not need it")
```
```python
self.discard_request_indices.np[:self.num_discarded_requests] = (
    discard_request_indices)
self.discard_request_indices.copy_to_gpu(self.num_discarded_requests)
```
The logic for discard_request_indices and num_discarded_requests is crucial for correctly handling partial requests and preventing sampling from invalid tokens. Given the extensive refactoring of input preparation, it's critical to ensure that discard_requests_mask is accurately computed under all scenarios (prefill, decode, speculative decoding, PCP enabled/disabled) to avoid incorrect token sampling or processing.
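A hedged illustration of the idea (toy NumPy arrays, not the model runner's buffers): requests whose prompt is still being prefilled this step are marked, and their indices are collected so any token sampled for them can be discarded.

```python
import numpy as np

prompt_left = np.array([0, 0, 3, 0])  # prompt tokens still unprocessed per request

# A request whose prompt is not finished yet must not keep a sampled token.
discard_mask = prompt_left > 0
discard_request_indices = np.nonzero(discard_mask)[0]
num_discarded_requests = int(discard_mask.sum())

print(discard_request_indices, num_discarded_requests)  # [2] 1
```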
```python
def _get_hidden_states_tensor(hidden_states):
    # Sometimes, after the model is compiled through the AOT backend,
    # the model output may become a list containing only one Tensor object.
    if isinstance(hidden_states, list) and \
            len(hidden_states) == 1 and \
            isinstance(hidden_states[0], torch.Tensor):
        hidden_states = hidden_states[0]
    return hidden_states
```
The _get_hidden_states_tensor utility function handles cases where the model output might be a list containing a single tensor. This is a good defensive check, especially if the model's output format can vary based on compilation or backend. It ensures consistent handling of hidden states before further processing.
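A quick standalone check of that normalization (the helper is copied here so the snippet runs on its own; the tensors are toy data):

```python
import torch


def _get_hidden_states_tensor(hidden_states):
    # Unwrap a single-element list produced by some compiled backends.
    if isinstance(hidden_states, list) and \
            len(hidden_states) == 1 and \
            isinstance(hidden_states[0], torch.Tensor):
        hidden_states = hidden_states[0]
    return hidden_states


h = torch.randn(4, 8)
assert _get_hidden_states_tensor(h) is h    # plain tensor passes through
assert _get_hidden_states_tensor([h]) is h  # single-element list is unwrapped
```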
```python
self.pcp_tokens = np.zeros(self.max_num_reqs, dtype=np.int32)
self.total_num_sampled_tokens_pcp = 0
```
```python
self.pcp_tokens[:num_reqs] = pcp_tokens[:num_reqs]
self.total_num_sampled_tokens_pcp = pcp_tokens[:num_reqs].sum()
```
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
```python
positions=positions,
intermediate_tensors=intermediate_tensors,
inputs_embeds=inputs_embeds)
```

```python
maybe_padded_num_tokens: int,
```
Why do we still need maybe_padded_num_tokens?
No need. I forgot to change the name. Actually, num_tokens_padded is used.
@kunpengW-code Hey, could we make this PR ready tomorrow? I'd like to take it into v0.14.0rc1. We really need it in vLLM-Omni. If you need any help please let me know :)
The e2e test has some problems; we can solve them together.
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@zhenwenqi2024 @kunpengW-code I opened a PR against your personal repository to fix the lint and restore some missing pieces such as ec_connector: kunpengW-code#1. PTAL.
[Bugfix] Fix lint & add ec_connector_output for EPD
…to modelrunner-refactor
This pull request has conflicts, please resolve those before we can evaluate the pull request.
…to modelrunner-refactor # Conflicts: # vllm_ascend/worker/model_runner_v1.py
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?
#6043 deleted the forward_before phase of the dynamic eplb. Currently, end-to-end precision is only monitored in the UT, and no log is printed at the key place, so eplb silently not taking effect was not intercepted.
1. The forward_before function is added back.
2. Delete unnecessary logs and add key logs.
3. Warm-up of algorithm 3 is added.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
#### The conversation is normal.
Okay, the user is asking, "What is deep learning?" I need to explain this in a clear and concise way. Let me start by recalling what I know about deep learning. It's a subset of machine learning, right? So first, I should mention that it's part of machine learning, which itself is a branch of AI. Then, the key aspect of deep learning is the use of neural networks with multiple layers. These are called deep neural networks.

Wait, I should define neural networks first. Maybe start with the basics. A neural network is inspired by the human brain, with layers of nodes (neurons) that process data. But deep learning specifically refers to networks with many layers—hence "deep." So the term "deep" comes from the number of layers.

I should explain how deep learning works. It involves training these networks on large datasets, allowing them to automatically learn features from the data. Unlike traditional machine learning, where you might have to manually extract features, deep learning models can do this automatically. That's a key point. For example, in image recognition, a deep learning model can learn to detect edges, shapes, and then more complex patterns without human intervention.

Applications are important too. The user might want to know where deep learning is used. Common examples include image and speech recognition, natural language processing, autonomous vehicles, and recommendation systems. Maybe mention specific technologies like self-driving cars using computer vision or virtual assistants like Siri or Alexa

- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@1339784

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
What this PR does / why we need it?
The structure of the `excute_model` and `_dymmy_run` methods in NPUModelRunner differs greatly from that in GPUModelRunner. Achieve alignment with GPUModelRunner:
- Split the `_prepare_inputs` method into `_prepare_inputs`, `_determine_batch_execution_and_padding`, `_build_attention_metadata`, and `_preprocess`.
- Modify `_generate_process_reqs_hidden_states` to `_model_forward`.
- Align the implementation of the `postprocess` phase.

Related-RFC: #5449
Co-authored-by: @zhenwenqi2024
Does this PR introduce any user-facing change?
no
How was this patch tested?