
[refactor] refactor execute_model and _dummy_run method #6043

Merged
wangxiyuan merged 35 commits into vllm-project:main from kunpengW-code:modelrunner-refactor
Jan 27, 2026

Conversation

@kunpengW-code
Contributor

@kunpengW-code kunpengW-code commented Jan 20, 2026

What this PR does / why we need it?

The structure of the execute_model and _dummy_run methods in NPUModelRunner differs greatly from that in GPUModelRunner. To align with GPUModelRunner:
  • Split the _prepare_inputs method into _prepare_inputs, _determine_batch_execution_and_padding, _build_attention_metadata, and _preprocess.
  • Rename _generate_process_reqs_hidden_states to _model_forward.
  • Align the implementation of the postprocess phase.

Related-RFC: #5449

Co-authored-by: @zhenwenqi2024

Does this PR introduce any user-facing change?

no

How was this patch tested?

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
…to modelrunner-refactor

# Conflicts:
#	vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the execute_model and _dummy_run methods in model_runner_v1.py, moving significant logic into new helper methods such as _preprocess, _build_attention_metadata, and _determine_batch_execution_and_padding. This refactoring aims to improve modularity and align with potentially updated upstream vLLM structures. Additionally, minor changes are introduced in pcp_utils.py to track PCP tokens. The changes are extensive and touch core execution paths, requiring thorough validation.
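For orientation, here is a rough sketch of how the new helpers are meant to compose inside execute_model, based only on the method names listed in this description. Argument lists are simplified, and the final postprocess call is an illustrative stand-in for the aligned postprocess phase, not a real method name; the actual flow lives in vllm_ascend/worker/model_runner_v1.py.

import numpy as np

def execute_model_sketch(self, scheduler_output):
    """Structural sketch only; not the real NPUModelRunner.execute_model."""
    per_req = scheduler_output.num_scheduled_tokens            # dict[str, int]
    num_scheduled_np = np.fromiter(per_req.values(), dtype=np.int32)
    num_reqs = len(per_req)
    num_tokens = int(num_scheduled_np.sum())
    max_query_len = int(num_scheduled_np.max())

    # 1. Decide how the batch executes: cudagraph mode, padded sizes,
    #    microbatching and DP coordination.
    (cudagraph_mode, batch_descriptor, should_ubatch,
     num_tokens_across_dp, cg_stats) = self._determine_batch_execution_and_padding(
        num_tokens, num_reqs, num_scheduled_np, max_query_len,
        use_cascade_attn=False)

    # 2. Build per-layer attention metadata for every KV cache group.
    attn_metadata, spec_decode_cm = self._build_attention_metadata(
        num_tokens, num_reqs, max_query_len,
        num_tokens_padded=batch_descriptor.num_tokens,
        num_scheduled_tokens=per_req)

    # 3. Prepare the remaining inputs (now just logits indices and
    #    spec-decode metadata).
    logits_indices, spec_decode_metadata = self._prepare_inputs(
        scheduler_output, num_scheduled_np)

    # 4. Preprocess inputs, run the forward pass (renamed _model_forward),
    #    then the aligned postprocess/sampling phase (name illustrative).
    model_kwargs = self._preprocess(scheduler_output, attn_metadata)
    hidden_states = self._model_forward(**model_kwargs)
    return self._postprocess_sketch(hidden_states, logits_indices,
                                    spec_decode_metadata)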

Comment on lines +1906 to +2136
def _build_attention_metadata(
self,
num_tokens: int,
num_reqs: int,
max_query_len: int,
num_tokens_padded: int | None = None,
num_reqs_padded: int | None = None,
ubatch_slices: UBatchSlices | None = None,
logits_indices: torch.Tensor | None = None,
use_spec_decode: bool = False,
for_cudagraph_capture: bool = False,
num_scheduled_tokens: dict[str, int] | None = None,
cascade_attn_prefix_lens: list[list[int]] | None = None,
) -> tuple[PerLayerAttnMetadata, CommonAttentionMetadata | None]:
"""
:return: tuple[attn_metadata, spec_decode_common_attn_metadata]
"""
# Attention metadata is not needed for attention free models
if len(self.kv_cache_config.kv_cache_groups) == 0:
return {}, None
num_tokens_padded = num_tokens_padded or num_tokens
num_reqs_padded = num_reqs_padded or num_reqs
attn_metadata: PerLayerAttnMetadata = {}
if ubatch_slices is not None:
attn_metadata = [dict() for _ in range(len(ubatch_slices))]
if for_cudagraph_capture:
# For some attention backends (e.g. FA) with sliding window models we need
# to make sure the backend see a max_seq_len that is larger to the sliding
# window size when capturing to make sure the correct kernel is selected.
max_seq_len = self.max_model_len
else:
max_seq_len = self.seq_lens.np[:num_reqs].max().item()
if use_spec_decode and self.need_accepted_tokens:
self.num_accepted_tokens.np[:num_reqs] = (
self.input_batch.num_accepted_tokens_cpu[:num_reqs])
self.num_accepted_tokens.np[num_reqs:].fill(1)
self.num_accepted_tokens.copy_to_gpu()

kv_cache_groups = self.kv_cache_config.kv_cache_groups

def _get_pcp_metadata(num_tokens):
if not self.use_cp:
return None
return self.pcp_manager.generate_pcp_metadata(num_tokens, self.query_lens, self.input_batch, num_scheduled_tokens)

def _get_block_table_and_slot_mapping(kv_cache_gid: int):
assert num_reqs_padded is not None and num_tokens_padded is not None
kv_cache_spec = kv_cache_groups[kv_cache_gid].kv_cache_spec
maybe_pcp_full_tokens = (
num_tokens if self.pcp_size == 1 else
num_tokens * self.pcp_size -
sum(self.pcp_manager.num_pcp_pads_cpu[:num_reqs]))
if isinstance(kv_cache_spec, EncoderOnlyAttentionSpec):
blk_table_tensor = torch.zeros(
(num_reqs_padded, 1),
dtype=torch.int32,
device=self.device,
)
slot_mapping = torch.zeros(
(num_tokens_padded,),
dtype=torch.int64,
device=self.device,
)
else:
blk_table = self.input_batch.block_table[kv_cache_gid]
slot_mapping = blk_table.slot_mapping.gpu[:maybe_pcp_full_tokens]
maybe_num_reqs_padded = num_reqs_padded * self.decode_token_per_req if self.use_cp else num_reqs_padded
blk_table_tensor = blk_table.get_device_tensor()[:maybe_num_reqs_padded]

# Fill unused with -1. Needed for reshape_and_cache in full cuda
# graph mode. `blk_table_tensor` -1 to match mamba PAD_SLOT_ID
if self.pcp_size == 1:
slot_mapping[num_tokens:num_tokens_padded].fill_(-1)
blk_table_tensor[num_reqs:num_reqs_padded].fill_(-1)
if self.pcp_size > 1:
slot_mapping = self.pcp_manager.get_padded_slot_mapping(
num_tokens,
slot_mapping,
)
return blk_table_tensor, slot_mapping

long_seq_metdadata = _get_pcp_metadata(num_tokens)
block_table_gid_0, slot_mapping_gid_0 = _get_block_table_and_slot_mapping(0)

if for_cudagraph_capture:
self.attn_state = AscendAttentionState.DecodeOnly
if self.speculative_config and \
self.speculative_config.method == "mtp":
# `AscendAttentionState.SpecDecoding` is only designed for mla
if self.vllm_config.model_config.use_mla:
self.attn_state = AscendAttentionState.SpecDecoding
else:
self.attn_state = AscendAttentionState.ChunkedPrefill
cm_base = AscendCommonAttentionMetadata(
query_start_loc=self.query_start_loc.gpu[: num_reqs_padded + 1],
query_start_loc_cpu=self.query_start_loc.cpu[: num_reqs_padded + 1],
seq_lens=self.seq_lens.gpu[:num_reqs_padded],
# TODO
seq_lens_cpu=self.seq_lens.cpu[:num_reqs_padded],
# TODO
num_computed_tokens_cpu=self.input_batch.num_computed_tokens_cpu_tensor[
:num_reqs_padded
],
num_reqs=num_reqs_padded,
num_actual_tokens=num_tokens,
max_query_len=max_query_len,
max_seq_len=max_seq_len,
block_table_tensor=block_table_gid_0,
slot_mapping=slot_mapping_gid_0,
causal=True,
num_input_tokens=num_tokens_padded,
actual_seq_lengths_q=self.actual_seq_lengths_q,
positions=self.positions.gpu,
attn_state=self.attn_state,
decode_token_per_req=self.decode_token_per_req,
prefill_context_parallel_metadata=long_seq_metdadata,
)

if logits_indices is not None and self.cache_config.kv_sharing_fast_prefill:
cm_base.num_logits_indices = logits_indices.size(0)
cm_base.logits_indices_padded = self._prepare_kv_sharing_fast_prefill(
logits_indices
)

def _build_attn_group_metadata(
kv_cache_gid: int,
attn_gid: int,
common_attn_metadata: CommonAttentionMetadata,
ubid: int | None = None,
) -> None:
attn_group = self.attn_groups[kv_cache_gid][attn_gid]
builder = attn_group.get_metadata_builder(ubid or 0)
cascade_attn_prefix_len = (
cascade_attn_prefix_lens[kv_cache_gid][attn_gid]
if cascade_attn_prefix_lens
else 0
)

extra_attn_metadata_args = {}
if use_spec_decode and isinstance(builder, GDNAttentionMetadataBuilder):
assert ubid is None, "UBatching not supported with GDN yet"
patch_torch_npu_argsort()
extra_attn_metadata_args = dict(
num_accepted_tokens=self.num_accepted_tokens.gpu[:num_reqs_padded],
num_decode_draft_tokens_cpu=self.num_decode_draft_tokens.cpu[
:num_reqs_padded
],
)

if for_cudagraph_capture:
attn_metadata_i = builder.build_for_cudagraph_capture(
common_attn_metadata
)
else:
attn_metadata_i = builder.build(
common_prefix_len=cascade_attn_prefix_len,
common_attn_metadata=common_attn_metadata,
**extra_attn_metadata_args,
)

if ubid is None:
assert isinstance(attn_metadata, dict)
attn_metadata_dict = attn_metadata
else:
assert isinstance(attn_metadata, list)
attn_metadata_dict = attn_metadata[ubid]

for layer_name in attn_group.layer_names:
attn_metadata_dict[layer_name] = attn_metadata_i

# Prepare the attention metadata for each KV cache group and make layers
# in the same group share the same metadata.
spec_decode_common_attn_metadata = None
for kv_cache_gid, kv_cache_group in enumerate(
self.kv_cache_config.kv_cache_groups):
cm = copy(cm_base) # shallow copy
# Basically only the encoder seq_lens, block_table and slot_mapping change
# for each kv_cache_group.
cm.encoder_seq_lens, cm.encoder_seq_lens_cpu = self._get_encoder_seq_lens(
num_scheduled_tokens or {},
kv_cache_group.kv_cache_spec,
num_reqs_padded,
)
if kv_cache_gid > 0:
cm.block_table_tensor, cm.slot_mapping = (
_get_block_table_and_slot_mapping(kv_cache_gid)
)
if self.speculative_config and spec_decode_common_attn_metadata is None:
if isinstance(self.drafter, EagleProposer):
if self.drafter.attn_layer_names[0] in kv_cache_group.layer_names:
spec_decode_common_attn_metadata = cm
else:
spec_decode_common_attn_metadata = cm

for attn_gid in range(len(self.attn_groups[kv_cache_gid])):
if ubatch_slices is not None:
for ubid, _cm in enumerate(split_attn_metadata(ubatch_slices, cm)):
_build_attn_group_metadata(kv_cache_gid, attn_gid, _cm, ubid)

else:
_build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
if self.is_mm_prefix_lm:
req_doc_ranges = {}
for req_id in self.input_batch.req_ids:
image_doc_ranges = []
req_state = self.requests[req_id]
for mm_feature in req_state.mm_features:
pos_info = mm_feature.mm_position
img_doc_range = pos_info.extract_embeds_range()
image_doc_ranges.extend(img_doc_range)
req_idx = self.input_batch.req_id_to_index[req_id]
req_doc_ranges[req_idx] = image_doc_ranges

if isinstance(attn_metadata, list):
for ub_metadata in attn_metadata:
for _metadata in ub_metadata.values():
_metadata.mm_prefix_range = req_doc_ranges # type: ignore[attr-defined]
else:
for _metadata in attn_metadata.values():
_metadata.mm_prefix_range = req_doc_ranges # type: ignore[attr-defined]

if spec_decode_common_attn_metadata is not None and (
num_reqs != num_reqs_padded or num_tokens != num_tokens_padded
):
# Currently the drafter still only uses piecewise cudagraphs (and modifies
# the attention metadata in directly), and therefore does not want to use
# padded attention metadata.
spec_decode_common_attn_metadata = (
spec_decode_common_attn_metadata.unpadded(num_tokens, num_reqs)
)
return attn_metadata, spec_decode_common_attn_metadata
Contributor


critical

The _build_attention_metadata method is a large and complex function responsible for creating all attention-related metadata. It handles various configurations like padding, speculative decoding, PCP, and different KV cache groups. Given its complexity and central role, any logical error in constructing this metadata could lead to incorrect attention calculations, which is a critical correctness issue. Thorough testing of all branches and configurations is essential.
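One way to get the branch coverage this comment asks for is a parametrized sweep over the main configuration axes. The sketch below is only a suggestion; make_npu_model_runner is an assumed fixture that builds a small NPUModelRunner with a dummy KV cache config and does not exist in the repo.

import itertools
import pytest

@pytest.mark.parametrize(
    "padded,spec,capture",
    list(itertools.product([False, True], repeat=3)))
def test_build_attention_metadata_branches(make_npu_model_runner,
                                           padded, spec, capture):
    runner = make_npu_model_runner(spec_decode=spec)
    attn_metadata, spec_cm = runner._build_attention_metadata(
        num_tokens=8,
        num_reqs=2,
        max_query_len=4,
        num_tokens_padded=16 if padded else None,
        num_reqs_padded=4 if padded else None,
        use_spec_decode=spec,
        for_cudagraph_capture=capture,
    )
    # Every attention layer registered in the KV cache groups should get an
    # entry, and spec-decode metadata should be unpadded when padding applies.
    for kv_group in runner.attn_groups:
        for attn_group in kv_group:
            for layer_name in attn_group.layer_names:
                assert layer_name in attn_metadata
    if spec and padded and spec_cm is not None:
        assert spec_cm.num_reqs == 2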

Comment on lines +1258 to +1271
(attn_metadata, spec_decode_common_attn_metadata) = (
self._build_attention_metadata(
num_tokens=num_tokens_unpadded,
num_tokens_padded=num_tokens_padded if pad_attn else None,
num_reqs=num_reqs,
num_reqs_padded=num_reqs_padded if pad_attn else None,
max_query_len=max_num_scheduled_tokens,
ubatch_slices=ubatch_slices_attn,
logits_indices=logits_indices,
use_spec_decode=use_spec_decode,
num_scheduled_tokens=scheduler_output.num_scheduled_tokens,
cascade_attn_prefix_lens=cascade_attn_prefix_lens,
)
)
Contributor


critical

The _build_attention_metadata method is now responsible for constructing the attention metadata for all layers. This is a critical component, as incorrect attention metadata can lead to severe correctness issues in the attention mechanism. It's essential to ensure that all parameters (padding, number of requests, query length, speculative decoding flags, cascade attention, etc.) are correctly translated into the attn_metadata and spec_decode_common_attn_metadata objects.

Comment on lines +3074 to +3080
def _post_process_cudagraph_mode(tensor: torch.Tensor) -> int:
"""
Synchronize cudagraph_mode across DP ranks by taking the minimum.
If any rank has NONE (0), all ranks use NONE.
This ensures all ranks send consistent values (all padded or all unpadded).
"""
return int(tensor[1, :].min().item())
Contributor


critical

The _post_process_cudagraph_mode function synchronizes the CUDA graph mode across DP ranks by taking the minimum. This is a critical step to ensure all ranks operate in a consistent mode (e.g., all use NONE if any rank uses NONE). Incorrect synchronization here could lead to deadlocks or inconsistent graph execution across distributed ranks.
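The min-reduction semantics are easy to check in isolation. A small standalone sketch, where row 1 of the gathered tensor holds each DP rank's cudagraph mode as in the snippet above; the all-gather that would fill this tensor is assumed and omitted.

import torch

# Each column is one DP rank; row 1 carries that rank's CUDAGraphMode value
# (0 == NONE, 1 == PIECEWISE, 2 == FULL).
gathered = torch.tensor([
    [128, 96, 160],   # row 0: e.g. padded token counts per rank
    [2,   0,   1],    # row 1: cudagraph mode per rank
])

def _post_process_cudagraph_mode(tensor: torch.Tensor) -> int:
    """Take the minimum mode across ranks so all ranks agree."""
    return int(tensor[1, :].min().item())

# One rank reports NONE (0), so every rank falls back to NONE.
assert _post_process_cudagraph_mode(gathered) == 0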

Comment on lines 502 to +513
def _prepare_inputs(
    self,
    scheduler_output: "SchedulerOutput",
-   intermediate_tensors: Optional[IntermediateTensors] = None,
-) -> tuple[dict[str, Any], torch.Tensor, np.ndarray, int, torch.Tensor,
-           int, torch.Tensor, SpecDecodeMetadata, Optional[torch.Tensor],
-           Optional[torch.Tensor], Optional[torch.Tensor], int, dict[str, Any]]:
+   num_scheduled_tokens: np.ndarray,
+) -> tuple[
+       torch.Tensor,
+       SpecDecodeMetadata | None]:
+   """
+   :return: tuple[
+       logits_indices, spec_decode_metadata,
+   ]
+   """
Contributor


critical

The _prepare_inputs method's signature and return type have been significantly altered. It now returns only logits_indices and spec_decode_metadata, implying that all other input preparation logic has been moved elsewhere. This is a major refactoring, and it's critical to ensure that all necessary data for model execution is correctly prepared and passed through the new helper methods (_preprocess, _build_attention_metadata, etc.). Any missed data or incorrect transformations could lead to runtime errors or incorrect model behavior.

Comment on lines +1823 to +1904
def _determine_batch_execution_and_padding(
self,
num_tokens: int,
num_reqs: int,
num_scheduled_tokens_np: np.ndarray,
max_num_scheduled_tokens: int,
use_cascade_attn: bool,
allow_microbatching: bool = False,
force_eager: bool = False,
# For cudagraph capture TODO(lucas): Refactor how we capture cudagraphs (will
# be improved in model runner v2)
force_uniform_decode: bool | None = None,
force_has_lora: bool | None = None,
num_encoder_reqs: int = 0,
) -> tuple[CUDAGraphMode, BatchDescriptor, bool,
torch.Tensor | None, CUDAGraphStat | None]:

num_tokens_padded = self._pad_for_sequence_parallelism(num_tokens)
uniform_decode = (
((max_num_scheduled_tokens == self.uniform_decode_query_len) and
(num_tokens == max_num_scheduled_tokens * num_reqs))
if force_uniform_decode is None else force_uniform_decode)
# Encoder-decoder models only support CG for decoder_step > 0 (no enc_output
# is present). Also, chunked-prefill is disabled, so batch are uniform.
has_encoder_output = (self.model_config.is_encoder_decoder
and num_encoder_reqs > 0)
has_lora = (len(self.input_batch.lora_id_to_lora_request) > 0
if force_has_lora is None else force_has_lora)

# ruff: noqa: E731
dispatch_cudagraph = (
lambda num_tokens, disable_full: self.cudagraph_dispatcher.
dispatch(
num_tokens=num_tokens,
has_lora=has_lora,
uniform_decode=uniform_decode,
disable_full=disable_full,
) if not force_eager else
(CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded)))
cudagraph_mode, batch_descriptor = dispatch_cudagraph(
num_tokens_padded, use_cascade_attn or has_encoder_output)
num_tokens_padded = batch_descriptor.num_tokens
if enable_sp(self.vllm_config):
assert (batch_descriptor.num_tokens %
self.vllm_config.parallel_config.tensor_parallel_size == 0
), ("Sequence parallelism requires num_tokens to be "
"a multiple of tensor parallel size")
# Extra coordination when running data-parallel since we need to coordinate
# across ranks
should_ubatch, num_tokens_across_dp = False, None
if self.vllm_config.parallel_config.data_parallel_size > 1:
_, num_tokens_across_dp, synced_cudagraph_mode = self._sync_batch_across_dp(num_tokens_padded=num_tokens_padded,
cudagraph_mode=cudagraph_mode.value,
)

# Extract DP padding if there is any
if num_tokens_across_dp is not None:
dp_rank = self.parallel_config.data_parallel_rank
num_tokens_padded = int(num_tokens_across_dp[dp_rank].item())
# Re-dispatch with DP padding
cudagraph_mode, batch_descriptor = dispatch_cudagraph(
num_tokens_padded,
disable_full=synced_cudagraph_mode <= CUDAGraphMode.PIECEWISE.value,)
# Assert to make sure the agreed upon token count is correct otherwise
# num_tokens_across_dp will no-longer be valid
assert batch_descriptor.num_tokens == num_tokens_padded
cudagraph_stats = None
if self.vllm_config.observability_config.cudagraph_metrics:
cudagraph_stats = CUDAGraphStat(
num_unpadded_tokens=num_tokens,
num_padded_tokens=batch_descriptor.num_tokens,
num_paddings=batch_descriptor.num_tokens - num_tokens,
runtime_mode=str(cudagraph_mode),
)

return (
cudagraph_mode,
batch_descriptor,
should_ubatch,
num_tokens_across_dp,
cudagraph_stats,
)
Contributor


critical

The _determine_batch_execution_and_padding method is a complex new addition that orchestrates CUDA graph mode, batch descriptors, microbatching, and DP synchronization. Its correctness is paramount for efficient and correct execution. The TODO(lucas): Refactor how we capture cudagraphs (will be improved in model runner v2) indicates ongoing work, suggesting potential for future improvements or current limitations. Any logical errors in this method could lead to severe performance issues or incorrect results.
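The uniform-decode check that drives the dispatch decision can be illustrated standalone. The numbers below are made up; uniform_decode_query_len would be 1 for plain decode, or 1 + num_spec_tokens with speculative decoding.

def is_uniform_decode(num_tokens: int, num_reqs: int,
                      max_num_scheduled_tokens: int,
                      uniform_decode_query_len: int) -> bool:
    # Mirrors the condition in _determine_batch_execution_and_padding:
    # every request scheduled exactly uniform_decode_query_len tokens.
    return (max_num_scheduled_tokens == uniform_decode_query_len
            and num_tokens == max_num_scheduled_tokens * num_reqs)

# Pure decode batch: 8 requests, 1 token each.
assert is_uniform_decode(8, 8, 1, uniform_decode_query_len=1)
# Mixed prefill/decode batch: one request scheduled 5 tokens.
assert not is_uniform_decode(12, 8, 5, uniform_decode_query_len=1)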

Comment on lines +2233 to +2234
if create_mixed_batch:
raise NotImplementedError("create_mixed_batch is used for warmup deepgemm, vllm-ascend does not need it")
Contributor


high

Similar to the previous comment, create_mixed_batch is explicitly not supported. This is a functional limitation that should be clearly documented for users if it's a standard parameter in upstream vLLM.

Comment on lines 718 to 720
self.discard_request_indices.np[:self.num_discarded_requests] = (
discard_request_indices)
self.discard_request_indices.copy_to_gpu(self.num_discarded_requests)
Contributor


high

The logic for discard_request_indices and num_discarded_requests is crucial for correctly handling partial requests and preventing sampling from invalid tokens. Given the extensive refactoring of input preparation, it's critical to ensure that discard_requests_mask is accurately computed under all scenarios (prefill, decode, speculative decoding, PCP enabled/disabled) to avoid incorrect token sampling or processing.
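A toy illustration of why this bookkeeping matters, with made-up per-request numbers: a request whose prompt is still being chunk-prefilled must not have a token sampled for it this step.

import numpy as np

# Hypothetical batch of 4 requests (values made up for illustration).
num_computed_tokens = np.array([10, 64, 3, 0])      # tokens already in KV cache
num_scheduled_tokens = np.array([1, 1, 4, 16])      # tokens scheduled this step
prompt_lens = np.array([8, 32, 20, 100])            # full prompt lengths

# A request only produces a sampled token once its whole prompt has been
# processed; otherwise this step's output is an intermediate prefill chunk.
still_prefilling = (num_computed_tokens + num_scheduled_tokens) < prompt_lens
discard_request_indices = np.nonzero(still_prefilling)[0]

# Requests 2 and 3 are mid-prefill, so their logits must be discarded.
assert discard_request_indices.tolist() == [2, 3]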

Comment thread vllm_ascend/worker/model_runner_v1.py Outdated
Comment on lines +3083 to +3090
def _get_hidden_states_tensor(hidden_states):
# Sometimes, after the model is compiled through the AOT backend,
# the model output may become a list containing only one Tensor object.
if isinstance(hidden_states, list) and \
len(hidden_states) == 1 and \
isinstance(hidden_states[0], torch.Tensor):
hidden_states = hidden_states[0]
return hidden_states
Contributor


high

The _get_hidden_states_tensor utility function handles cases where the model output might be a list containing a single tensor. This is a good defensive check, especially if the model's output format can vary based on compilation or backend. It ensures consistent handling of hidden states before further processing.
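A minimal usage sketch of the normalization this helper performs; the single-tensor-list shape is the AOT-backend case described above.

import torch

def _get_hidden_states_tensor(hidden_states):
    # Unwrap a single-tensor list, as some AOT-compiled models return.
    if (isinstance(hidden_states, list) and len(hidden_states) == 1
            and isinstance(hidden_states[0], torch.Tensor)):
        hidden_states = hidden_states[0]
    return hidden_states

plain = torch.zeros(4, 8)
wrapped = [torch.zeros(4, 8)]
assert _get_hidden_states_tensor(plain) is plain
assert isinstance(_get_hidden_states_tensor(wrapped), torch.Tensor)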

Comment on lines +78 to +79
self.pcp_tokens = np.zeros(self.max_num_reqs, dtype=np.int32)
self.total_num_sampled_tokens_pcp = 0
Contributor


high

The new instance variables self.pcp_tokens and self.total_num_sampled_tokens_pcp are introduced to track PCP-related token information. It's crucial that these are correctly initialized and updated throughout the PCP workflow to maintain accurate state for distributed operations.

Comment on lines +300 to +301
self.pcp_tokens[:num_reqs] = pcp_tokens[:num_reqs]
self.total_num_sampled_tokens_pcp = pcp_tokens[:num_reqs].sum()
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The pcp_tokens and total_num_sampled_tokens_pcp are updated here. This is a critical step to ensure that the PCP manager accurately reflects the number of tokens processed by each rank. Any discrepancy could lead to incorrect token counts or indexing in subsequent PCP operations.
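In isolation the bookkeeping is just a per-request copy plus a running sum. A small sketch with made-up numbers; max_num_reqs and the per-request token counts are illustrative only.

import numpy as np

max_num_reqs = 8
pcp_tokens_state = np.zeros(max_num_reqs, dtype=np.int32)

# Hypothetical per-request sampled-token counts reported for this PCP step.
num_reqs = 3
pcp_tokens = np.array([2, 1, 3, 0, 0, 0, 0, 0], dtype=np.int32)

# Mirrors the update in the snippet above.
pcp_tokens_state[:num_reqs] = pcp_tokens[:num_reqs]
total_num_sampled_tokens_pcp = int(pcp_tokens[:num_reqs].sum())

assert total_num_sampled_tokens_pcp == 6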

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Jan 20, 2026
Comment thread vllm_ascend/worker/model_runner_v1.py Outdated
positions=positions,
intermediate_tensors=intermediate_tensors,
inputs_embeds=inputs_embeds)
maybe_padded_num_tokens: int,
Collaborator


Why do we still need maybe_padded_num_tokens?

Contributor Author


No need. I forgot to change the name. Actually, num_tokens_padded is used.

@gcanlin
Collaborator

gcanlin commented Jan 20, 2026

@kunpengW-code Hey, could we make this PR ready tomorrow? I'd like to take it into v0.14.0rc1. We really need it in vLLM-Omni. If you need any help please let me know :)

@zhenwenqi2024
Collaborator

@kunpengW-code Hey, could we make this PR ready tomorrow? I'd like to take it into v0.14.0rc1. We really need it in vLLM-Omni. If you need any help please let me know :)

The e2e tests have some problems; we can solve them together.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Collaborator

gcanlin commented Jan 20, 2026

@zhenwenqi2024 @kunpengW-code I opened a PR against your personal repository to fix the lint issues and restore a few things that were lost, like ec_connector: kunpengW-code#1. PTAL.

[Bugfix] Fix lint & add ec_connector_output for EPD
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@kunpengW-code kunpengW-code requested a review from Yikun as a code owner January 21, 2026 12:17
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
…to modelrunner-refactor

# Conflicts:
#	vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
@wangxiyuan wangxiyuan merged commit c498cea into vllm-project:main Jan 27, 2026
47 of 53 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 28, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (86 commits)
  [refactor] refactor excute_model and _dymmy_run method  (vllm-project#6043)
  [Refactor] profiler config optimze (vllm-project#6141)
  [Graph][Fusion] Add MatmulAllReduceAddRMSNorm graph fusion for npugraph_ex. (vllm-project#6006)
  [UT]: refactoring 310p ops ut (vllm-project#6296)
  [Refact.]: refactoring 310p-kv cache allocator, align with main branch (vllm-project#6270)
  [Misc] Removes unnecessary graph size re-initialization (vllm-project#6280)
  [Main2Main] Upgrade vllm commit to 0123 (vllm-project#6169)
  [BugFix] Fix wheel package build workflow (vllm-project#6276)
  [CI][BugFix] Qwen3-Next nightly test fix. (vllm-project#6247)
  [Doc] quick fix for vllm-ascend version (vllm-project#6278)
  [Community] Nominate whx-sjtu as maintainer (vllm-project#6268)
  [Lint] Fix mypy issue to make CI happy (vllm-project#6272)
  BugFix:  Fix moe_load accumulation error in ACL graph mode (vllm-project#6182)
  [Patch] Remove the patch of ECExampleConnector (vllm-project#5976)
  [Bugfix] Fix PP+PCP and PP+flashcomm1 bugs (vllm-project#5416)
  [Feat] proxy delay to remove instances (vllm-project#5934)
  [CI] Add workfolw_dispatch for nightly image build (vllm-project#6269)
  [bugfix][npugraph_ex]fix static kernel uninstall issue (vllm-project#6128)
  [Doc] 310P Documents update (vllm-project#6246)
  [Feature] Mooncake connector get remote ptp size (vllm-project#5822)
  ...
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…6043)

### What this PR does / why we need it?
The structure of the `excute_model` and `_dymmy_run` methods in
NPUModelRunner differs greatly from that in GPUModelRunner.
Achieve alignment with GPUModelRunner:
Split the `_prepare_inputs` method into `_prepare_inputs`,
`_determine_batch_execution_and_padding`, `_build_attention_metadata`,
and `_preprocess`.
Modify `_generate_process_reqs_hidden_states` to `_model_forward`.
Align the implementation of the `postprocess` phase

**Related-RFC**: vllm-project#5449

**Co-authored-by**: @zhenwenqi2024 
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@d682094

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
@kunpengW-code kunpengW-code deleted the modelrunner-refactor branch February 6, 2026 07:49
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
wangxiyuan pushed a commit that referenced this pull request Feb 24, 2026
### What this PR does / why we need it?
#6043 deleted the forward_before phase of the dynamic eplb. Currently,
the end-to-end precision is monitored in the UT, and the log is not
printed in the key place. As a result, the eplb does not take effect and
is not intercepted.
1. The forward_before function is added back.
2. Delete unnecessary logs and add key logs.
3. Warm-up of algorithm 3 is added.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


![Snipaste_2026-02-10_15-57-31](https://github.com/user-attachments/assets/03813e5f-3d19-42d8-8118-76223afe8298)

#### The conversation is normal.
Okay, the user is asking, \"What is deep learning?\" I need to explain
this in a clear and concise way. Let me start by recalling what I know
about deep learning. It's a subset of machine learning, right? So first,
I should mention that it's part of machine learning, which itself is a
branch of AI. Then, the key aspect of deep learning is the use of neural
networks with multiple layers. These are called deep neural
networks.\n\nWait, I should define neural networks first. Maybe start
with the basics. A neural network is inspired by the human brain, with
layers of nodes (neurons) that process data. But deep learning
specifically refers to networks with many layers—hence \"deep.\" So the
term \"deep\" comes from the number of layers. \n\nI should explain how
deep learning works. It involves training these networks on large
datasets, allowing them to automatically learn features from the data.
Unlike traditional machine learning, where you might have to manually
extract features, deep learning models can do this automatically. That's
a key point. For example, in image recognition, a deep learning model
can learn to detect edges, shapes, and then more complex patterns
without human intervention.\n\nApplications are important too. The user
might want to know where deep learning is used. Common examples include
image and speech recognition, natural language processing, autonomous
vehicles, and recommendation systems. Maybe mention specific
technologies like self-driving cars using computer vision or virtual
assistants like Siri or Alexa

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@1339784

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
HF-001 pushed a commit to HF-001/vllm-ascend that referenced this pull request Feb 25, 2026
banxiaduhuo pushed a commit to banxiaduhuo/vllm-ascend that referenced this pull request Feb 26, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
jiangyunfan1 pushed a commit to jiangyunfan1/vllm-ascend that referenced this pull request Apr 9, 2026

Labels

ready (read for review), ready-for-test (start test by label for PR)
