[refactor] refactor excute_model and _dymmy_run method #6043
wangxiyuan merged 35 commits into vllm-project:main
Conversation
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
…to modelrunner-refactor # Conflicts: # vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the execute_model and _dummy_run methods in model_runner_v1.py, moving significant logic into new helper methods such as _preprocess, _build_attention_metadata, and _determine_batch_execution_and_padding. This refactoring aims to improve modularity and align with potentially updated upstream vLLM structures. Additionally, minor changes are introduced in pcp_utils.py to track PCP tokens. The changes are extensive and touch core execution paths, requiring thorough validation.
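To make the new control flow easier to follow, here is a minimal, self-contained sketch of the call order the refactor aims for. Only the helper names `_preprocess`, `_determine_batch_execution_and_padding`, `_build_attention_metadata`, and `_model_forward` come from this PR; the `SketchRunner` class, the `BatchPlan` dataclass, and the round-up-to-8 padding rule are invented here purely for illustration and are not the NPUModelRunner implementation.

```python
from dataclasses import dataclass


@dataclass
class BatchPlan:
    num_tokens: int
    num_tokens_padded: int


class SketchRunner:
    def _preprocess(self, scheduler_output):
        # Gather total scheduled tokens and request count from the scheduler output.
        return sum(scheduler_output.values()), len(scheduler_output)

    def _determine_batch_execution_and_padding(self, num_tokens):
        # Pick a padded batch size (here: round up to a multiple of 8).
        return BatchPlan(num_tokens, (num_tokens + 7) // 8 * 8)

    def _build_attention_metadata(self, plan, num_reqs):
        # Build whatever metadata the attention backends need for this step.
        return {"num_reqs": num_reqs, "num_tokens_padded": plan.num_tokens_padded}

    def _model_forward(self, plan, attn_metadata):
        # Stand-in for the actual model call.
        return list(range(plan.num_tokens_padded))

    def execute_model(self, scheduler_output):
        num_tokens, num_reqs = self._preprocess(scheduler_output)
        plan = self._determine_batch_execution_and_padding(num_tokens)
        attn_metadata = self._build_attention_metadata(plan, num_reqs)
        hidden = self._model_forward(plan, attn_metadata)
        # Post-processing (logits gathering, sampling) would follow here.
        return hidden[:num_tokens]


print(SketchRunner().execute_model({"req-0": 5, "req-1": 3}))
```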
```python
def _build_attention_metadata(
    self,
    num_tokens: int,
    num_reqs: int,
    max_query_len: int,
    num_tokens_padded: int | None = None,
    num_reqs_padded: int | None = None,
    ubatch_slices: UBatchSlices | None = None,
    logits_indices: torch.Tensor | None = None,
    use_spec_decode: bool = False,
    for_cudagraph_capture: bool = False,
    num_scheduled_tokens: dict[str, int] | None = None,
    cascade_attn_prefix_lens: list[list[int]] | None = None,
) -> tuple[PerLayerAttnMetadata, CommonAttentionMetadata | None]:
    """
    :return: tuple[attn_metadata, spec_decode_common_attn_metadata]
    """
    # Attention metadata is not needed for attention-free models
    if len(self.kv_cache_config.kv_cache_groups) == 0:
        return {}, None
    num_tokens_padded = num_tokens_padded or num_tokens
    num_reqs_padded = num_reqs_padded or num_reqs
    attn_metadata: PerLayerAttnMetadata = {}
    if ubatch_slices is not None:
        attn_metadata = [dict() for _ in range(len(ubatch_slices))]
    if for_cudagraph_capture:
        # For some attention backends (e.g. FA) with sliding window models we need
        # to make sure the backend sees a max_seq_len that is larger than the
        # sliding window size when capturing, so that the correct kernel is selected.
        max_seq_len = self.max_model_len
    else:
        max_seq_len = self.seq_lens.np[:num_reqs].max().item()
    if use_spec_decode and self.need_accepted_tokens:
        self.num_accepted_tokens.np[:num_reqs] = (
            self.input_batch.num_accepted_tokens_cpu[:num_reqs])
        self.num_accepted_tokens.np[num_reqs:].fill(1)
        self.num_accepted_tokens.copy_to_gpu()

    kv_cache_groups = self.kv_cache_config.kv_cache_groups

    def _get_pcp_metadata(num_tokens):
        if not self.use_cp:
            return None
        return self.pcp_manager.generate_pcp_metadata(
            num_tokens, self.query_lens, self.input_batch, num_scheduled_tokens)

    def _get_block_table_and_slot_mapping(kv_cache_gid: int):
        assert num_reqs_padded is not None and num_tokens_padded is not None
        kv_cache_spec = kv_cache_groups[kv_cache_gid].kv_cache_spec
        maybe_pcp_full_tokens = (
            num_tokens if self.pcp_size == 1 else
            num_tokens * self.pcp_size -
            sum(self.pcp_manager.num_pcp_pads_cpu[:num_reqs]))
        if isinstance(kv_cache_spec, EncoderOnlyAttentionSpec):
            blk_table_tensor = torch.zeros(
                (num_reqs_padded, 1),
                dtype=torch.int32,
                device=self.device,
            )
            slot_mapping = torch.zeros(
                (num_tokens_padded,),
                dtype=torch.int64,
                device=self.device,
            )
        else:
            blk_table = self.input_batch.block_table[kv_cache_gid]
            slot_mapping = blk_table.slot_mapping.gpu[:maybe_pcp_full_tokens]
            maybe_num_reqs_padded = (num_reqs_padded * self.decode_token_per_req
                                     if self.use_cp else num_reqs_padded)
            blk_table_tensor = blk_table.get_device_tensor()[:maybe_num_reqs_padded]

            # Fill unused entries with -1. Needed for reshape_and_cache in full cuda
            # graph mode; -1 in `blk_table_tensor` matches the mamba PAD_SLOT_ID.
            if self.pcp_size == 1:
                slot_mapping[num_tokens:num_tokens_padded].fill_(-1)
                blk_table_tensor[num_reqs:num_reqs_padded].fill_(-1)
            if self.pcp_size > 1:
                slot_mapping = self.pcp_manager.get_padded_slot_mapping(
                    num_tokens,
                    slot_mapping,
                )
        return blk_table_tensor, slot_mapping

    long_seq_metdadata = _get_pcp_metadata(num_tokens)
    block_table_gid_0, slot_mapping_gid_0 = _get_block_table_and_slot_mapping(0)

    if for_cudagraph_capture:
        self.attn_state = AscendAttentionState.DecodeOnly
        if self.speculative_config and \
                self.speculative_config.method == "mtp":
            # `AscendAttentionState.SpecDecoding` is only designed for mla
            if self.vllm_config.model_config.use_mla:
                self.attn_state = AscendAttentionState.SpecDecoding
            else:
                self.attn_state = AscendAttentionState.ChunkedPrefill
    cm_base = AscendCommonAttentionMetadata(
        query_start_loc=self.query_start_loc.gpu[:num_reqs_padded + 1],
        query_start_loc_cpu=self.query_start_loc.cpu[:num_reqs_padded + 1],
        seq_lens=self.seq_lens.gpu[:num_reqs_padded],
        # TODO
        seq_lens_cpu=self.seq_lens.cpu[:num_reqs_padded],
        # TODO
        num_computed_tokens_cpu=self.input_batch.num_computed_tokens_cpu_tensor[
            :num_reqs_padded
        ],
        num_reqs=num_reqs_padded,
        num_actual_tokens=num_tokens,
        max_query_len=max_query_len,
        max_seq_len=max_seq_len,
        block_table_tensor=block_table_gid_0,
        slot_mapping=slot_mapping_gid_0,
        causal=True,
        num_input_tokens=num_tokens_padded,
        actual_seq_lengths_q=self.actual_seq_lengths_q,
        positions=self.positions.gpu,
        attn_state=self.attn_state,
        decode_token_per_req=self.decode_token_per_req,
        prefill_context_parallel_metadata=long_seq_metdadata,
    )

    if logits_indices is not None and self.cache_config.kv_sharing_fast_prefill:
        cm_base.num_logits_indices = logits_indices.size(0)
        cm_base.logits_indices_padded = self._prepare_kv_sharing_fast_prefill(
            logits_indices
        )

    def _build_attn_group_metadata(
        kv_cache_gid: int,
        attn_gid: int,
        common_attn_metadata: CommonAttentionMetadata,
        ubid: int | None = None,
    ) -> None:
        attn_group = self.attn_groups[kv_cache_gid][attn_gid]
        builder = attn_group.get_metadata_builder(ubid or 0)
        cascade_attn_prefix_len = (
            cascade_attn_prefix_lens[kv_cache_gid][attn_gid]
            if cascade_attn_prefix_lens
            else 0
        )

        extra_attn_metadata_args = {}
        if use_spec_decode and isinstance(builder, GDNAttentionMetadataBuilder):
            assert ubid is None, "UBatching not supported with GDN yet"
            patch_torch_npu_argsort()
            extra_attn_metadata_args = dict(
                num_accepted_tokens=self.num_accepted_tokens.gpu[:num_reqs_padded],
                num_decode_draft_tokens_cpu=self.num_decode_draft_tokens.cpu[
                    :num_reqs_padded
                ],
            )

        if for_cudagraph_capture:
            attn_metadata_i = builder.build_for_cudagraph_capture(
                common_attn_metadata
            )
        else:
            attn_metadata_i = builder.build(
                common_prefix_len=cascade_attn_prefix_len,
                common_attn_metadata=common_attn_metadata,
                **extra_attn_metadata_args,
            )

        if ubid is None:
            assert isinstance(attn_metadata, dict)
            attn_metadata_dict = attn_metadata
        else:
            assert isinstance(attn_metadata, list)
            attn_metadata_dict = attn_metadata[ubid]

        for layer_name in attn_group.layer_names:
            attn_metadata_dict[layer_name] = attn_metadata_i

    # Prepare the attention metadata for each KV cache group and make layers
    # in the same group share the same metadata.
    spec_decode_common_attn_metadata = None
    for kv_cache_gid, kv_cache_group in enumerate(
            self.kv_cache_config.kv_cache_groups):
        cm = copy(cm_base)  # shallow copy
        # Basically only the encoder seq_lens, block_table and slot_mapping change
        # for each kv_cache_group.
        cm.encoder_seq_lens, cm.encoder_seq_lens_cpu = self._get_encoder_seq_lens(
            num_scheduled_tokens or {},
            kv_cache_group.kv_cache_spec,
            num_reqs_padded,
        )
        if kv_cache_gid > 0:
            cm.block_table_tensor, cm.slot_mapping = (
                _get_block_table_and_slot_mapping(kv_cache_gid)
            )
        if self.speculative_config and spec_decode_common_attn_metadata is None:
            if isinstance(self.drafter, EagleProposer):
                if self.drafter.attn_layer_names[0] in kv_cache_group.layer_names:
                    spec_decode_common_attn_metadata = cm
            else:
                spec_decode_common_attn_metadata = cm

        for attn_gid in range(len(self.attn_groups[kv_cache_gid])):
            if ubatch_slices is not None:
                for ubid, _cm in enumerate(split_attn_metadata(ubatch_slices, cm)):
                    _build_attn_group_metadata(kv_cache_gid, attn_gid, _cm, ubid)
            else:
                _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)

    if self.is_mm_prefix_lm:
        req_doc_ranges = {}
        for req_id in self.input_batch.req_ids:
            image_doc_ranges = []
            req_state = self.requests[req_id]
            for mm_feature in req_state.mm_features:
                pos_info = mm_feature.mm_position
                img_doc_range = pos_info.extract_embeds_range()
                image_doc_ranges.extend(img_doc_range)
            req_idx = self.input_batch.req_id_to_index[req_id]
            req_doc_ranges[req_idx] = image_doc_ranges

        if isinstance(attn_metadata, list):
            for ub_metadata in attn_metadata:
                for _metadata in ub_metadata.values():
                    _metadata.mm_prefix_range = req_doc_ranges  # type: ignore[attr-defined]
        else:
            for _metadata in attn_metadata.values():
                _metadata.mm_prefix_range = req_doc_ranges  # type: ignore[attr-defined]

    if spec_decode_common_attn_metadata is not None and (
        num_reqs != num_reqs_padded or num_tokens != num_tokens_padded
    ):
        # Currently the drafter still only uses piecewise cudagraphs (and modifies
        # the attention metadata directly), and therefore does not want to use
        # padded attention metadata.
        spec_decode_common_attn_metadata = (
            spec_decode_common_attn_metadata.unpadded(num_tokens, num_reqs)
        )
    return attn_metadata, spec_decode_common_attn_metadata
```
The _build_attention_metadata method is a large and complex function responsible for creating all attention-related metadata. It handles various configurations like padding, speculative decoding, PCP, and different KV cache groups. Given its complexity and central role, any logical error in constructing this metadata could lead to incorrect attention calculations, which is a critical correctness issue. Thorough testing of all branches and configurations is essential.
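One convention in the diff above that is easy to miss is the padding fill: slot mapping entries past the real token count are set to -1 so that cache-write kernels skip them when the batch is padded for full-graph execution. A tiny standalone illustration with toy values (not the runner's actual buffers):

```python
import torch

num_tokens, num_tokens_padded = 5, 8          # 5 real tokens, padded batch of 8
slot_mapping = torch.arange(num_tokens_padded, dtype=torch.int64)

# Mirror the convention in _get_block_table_and_slot_mapping: padded slots get -1
# so reshape_and_cache-style kernels know not to write them.
slot_mapping[num_tokens:num_tokens_padded].fill_(-1)
print(slot_mapping)  # tensor([ 0,  1,  2,  3,  4, -1, -1, -1])
```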
```python
(attn_metadata, spec_decode_common_attn_metadata) = (
    self._build_attention_metadata(
        num_tokens=num_tokens_unpadded,
        num_tokens_padded=num_tokens_padded if pad_attn else None,
        num_reqs=num_reqs,
        num_reqs_padded=num_reqs_padded if pad_attn else None,
        max_query_len=max_num_scheduled_tokens,
        ubatch_slices=ubatch_slices_attn,
        logits_indices=logits_indices,
        use_spec_decode=use_spec_decode,
        num_scheduled_tokens=scheduler_output.num_scheduled_tokens,
        cascade_attn_prefix_lens=cascade_attn_prefix_lens,
    )
)
```
The _build_attention_metadata method is now responsible for constructing the attention metadata for all layers. This is a critical component, as incorrect attention metadata can lead to severe correctness issues in the attention mechanism. It's essential to ensure that all parameters (padding, number of requests, query length, speculative decoding flags, cascade attention, etc.) are correctly translated into the attn_metadata and spec_decode_common_attn_metadata objects.
```python
def _post_process_cudagraph_mode(tensor: torch.Tensor) -> int:
    """
    Synchronize cudagraph_mode across DP ranks by taking the minimum.
    If any rank has NONE (0), all ranks use NONE.
    This ensures all ranks send consistent values (all padded or all unpadded).
    """
    return int(tensor[1, :].min().item())
```
The _post_process_cudagraph_mode function synchronizes the CUDA graph mode across DP ranks by taking the minimum. This is a critical step to ensure all ranks operate in a consistent mode (e.g., all use NONE if any rank uses NONE). Incorrect synchronization here could lead to deadlocks or inconsistent graph execution across distributed ranks.
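A small standalone example of the reduction, using an assumed two-row layout (row 0: padded token counts per DP rank, row 1: each rank's proposed mode) just to show why taking the minimum gives a consistent result across ranks:

```python
import torch

NONE, PIECEWISE, FULL = 0, 1, 2  # illustrative mode encoding; the lowest value wins

# Hypothetical gathered tensor: row 0 holds num_tokens_padded per DP rank,
# row 1 holds each rank's proposed cudagraph mode.
gathered = torch.tensor([
    [128, 256, 128, 64],
    [FULL, PIECEWISE, FULL, NONE],
])

synced_mode = int(gathered[1, :].min().item())
assert synced_mode == NONE  # one rank opted out, so every rank runs without graphs
```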
```diff
 def _prepare_inputs(
     self,
     scheduler_output: "SchedulerOutput",
-    intermediate_tensors: Optional[IntermediateTensors] = None,
-) -> tuple[dict[str, Any], torch.Tensor, np.ndarray, int, torch.Tensor,
-           int, torch.Tensor, SpecDecodeMetadata, Optional[torch.Tensor],
-           Optional[torch.Tensor], Optional[torch.Tensor], int, dict[str, Any]]:
+    num_scheduled_tokens: np.ndarray,
+) -> tuple[
+    torch.Tensor,
+    SpecDecodeMetadata | None]:
+    """
+    :return: tuple[
+        logits_indices, spec_decode_metadata,
+    ]
+    """
```
The _prepare_inputs method's signature and return type have been significantly altered. It now returns only logits_indices and spec_decode_metadata, implying that all other input preparation logic has been moved elsewhere. This is a major refactoring, and it's critical to ensure that all necessary data for model execution is correctly prepared and passed through the new helper methods (_preprocess, _build_attention_metadata, etc.). Any missed data or incorrect transformations could lead to runtime errors or incorrect model behavior.
```python
def _determine_batch_execution_and_padding(
    self,
    num_tokens: int,
    num_reqs: int,
    num_scheduled_tokens_np: np.ndarray,
    max_num_scheduled_tokens: int,
    use_cascade_attn: bool,
    allow_microbatching: bool = False,
    force_eager: bool = False,
    # For cudagraph capture. TODO(lucas): Refactor how we capture cudagraphs (will
    # be improved in model runner v2)
    force_uniform_decode: bool | None = None,
    force_has_lora: bool | None = None,
    num_encoder_reqs: int = 0,
) -> tuple[CUDAGraphMode, BatchDescriptor, bool,
           torch.Tensor | None, CUDAGraphStat | None]:

    num_tokens_padded = self._pad_for_sequence_parallelism(num_tokens)
    uniform_decode = (
        ((max_num_scheduled_tokens == self.uniform_decode_query_len) and
         (num_tokens == max_num_scheduled_tokens * num_reqs))
        if force_uniform_decode is None else force_uniform_decode)
    # Encoder-decoder models only support CG for decoder_step > 0 (no enc_output
    # is present). Also, chunked prefill is disabled, so batches are uniform.
    has_encoder_output = (self.model_config.is_encoder_decoder
                          and num_encoder_reqs > 0)
    has_lora = (len(self.input_batch.lora_id_to_lora_request) > 0
                if force_has_lora is None else force_has_lora)

    # ruff: noqa: E731
    dispatch_cudagraph = (
        lambda num_tokens, disable_full: self.cudagraph_dispatcher.
        dispatch(
            num_tokens=num_tokens,
            has_lora=has_lora,
            uniform_decode=uniform_decode,
            disable_full=disable_full,
        ) if not force_eager else
        (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded)))
    cudagraph_mode, batch_descriptor = dispatch_cudagraph(
        num_tokens_padded, use_cascade_attn or has_encoder_output)
    num_tokens_padded = batch_descriptor.num_tokens
    if enable_sp(self.vllm_config):
        assert (batch_descriptor.num_tokens %
                self.vllm_config.parallel_config.tensor_parallel_size == 0
                ), ("Sequence parallelism requires num_tokens to be "
                    "a multiple of tensor parallel size")
    # Extra coordination when running data-parallel since we need to coordinate
    # across ranks
    should_ubatch, num_tokens_across_dp = False, None
    if self.vllm_config.parallel_config.data_parallel_size > 1:
        _, num_tokens_across_dp, synced_cudagraph_mode = self._sync_batch_across_dp(
            num_tokens_padded=num_tokens_padded,
            cudagraph_mode=cudagraph_mode.value,
        )

        # Extract DP padding if there is any
        if num_tokens_across_dp is not None:
            dp_rank = self.parallel_config.data_parallel_rank
            num_tokens_padded = int(num_tokens_across_dp[dp_rank].item())
            # Re-dispatch with DP padding
            cudagraph_mode, batch_descriptor = dispatch_cudagraph(
                num_tokens_padded,
                disable_full=synced_cudagraph_mode <= CUDAGraphMode.PIECEWISE.value,
            )
            # Assert to make sure the agreed-upon token count is correct, otherwise
            # num_tokens_across_dp would no longer be valid
            assert batch_descriptor.num_tokens == num_tokens_padded
    cudagraph_stats = None
    if self.vllm_config.observability_config.cudagraph_metrics:
        cudagraph_stats = CUDAGraphStat(
            num_unpadded_tokens=num_tokens,
            num_padded_tokens=batch_descriptor.num_tokens,
            num_paddings=batch_descriptor.num_tokens - num_tokens,
            runtime_mode=str(cudagraph_mode),
        )

    return (
        cudagraph_mode,
        batch_descriptor,
        should_ubatch,
        num_tokens_across_dp,
        cudagraph_stats,
    )
```
The _determine_batch_execution_and_padding method is a complex new addition that orchestrates CUDA graph mode, batch descriptors, microbatching, and DP synchronization. Its correctness is paramount for efficient and correct execution. The TODO(lucas): Refactor how we capture cudagraphs (will be improved in model runner v2) indicates ongoing work, suggesting potential for future improvements or current limitations. Any logical errors in this method could lead to severe performance issues or incorrect results.
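For intuition, here is an illustrative sketch of the dispatch-and-pad idea the method builds on: round the token count up to the nearest captured graph size, or fall back to eager when forced or when no captured size fits. The capture sizes and return values are assumptions for the example, not the dispatcher this PR uses.

```python
from bisect import bisect_left

CAPTURE_SIZES = [8, 16, 32, 64, 128, 256, 512]  # hypothetical captured graph sizes


def dispatch(num_tokens: int, force_eager: bool = False) -> tuple[str, int]:
    if force_eager or num_tokens > CAPTURE_SIZES[-1]:
        # No captured graph applies: run eagerly with the unpadded token count.
        return "NONE", num_tokens
    # Pad up to the smallest captured size that can hold this batch.
    padded = CAPTURE_SIZES[bisect_left(CAPTURE_SIZES, num_tokens)]
    return "FULL", padded


print(dispatch(100))         # ('FULL', 128)
print(dispatch(100, True))   # ('NONE', 100)
print(dispatch(1000))        # ('NONE', 1000)
```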
```python
if create_mixed_batch:
    raise NotImplementedError(
        "create_mixed_batch is used for warmup deepgemm, vllm-ascend does not need it")
```
```python
self.discard_request_indices.np[:self.num_discarded_requests] = (
    discard_request_indices)
self.discard_request_indices.copy_to_gpu(self.num_discarded_requests)
```
The logic for discard_request_indices and num_discarded_requests is crucial for correctly handling partial requests and preventing sampling from invalid tokens. Given the extensive refactoring of input preparation, it's critical to ensure that discard_requests_mask is accurately computed under all scenarios (prefill, decode, speculative decoding, PCP enabled/disabled) to avoid incorrect token sampling or processing.
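A hedged illustration of the idea (toy NumPy arrays, not the model runner's buffers): requests whose prompt is still being prefilled this step are marked, and their indices are collected so any token sampled for them can be discarded.

```python
import numpy as np

prompt_left = np.array([0, 0, 3, 0])  # prompt tokens still unprocessed per request

# A request whose prompt is not finished yet must not keep a sampled token.
discard_mask = prompt_left > 0
discard_request_indices = np.nonzero(discard_mask)[0]
num_discarded_requests = int(discard_mask.sum())

print(discard_request_indices, num_discarded_requests)  # [2] 1
```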
```python
def _get_hidden_states_tensor(hidden_states):
    # Sometimes, after the model is compiled through the AOT backend,
    # the model output may become a list containing only one Tensor object.
    if isinstance(hidden_states, list) and \
            len(hidden_states) == 1 and \
            isinstance(hidden_states[0], torch.Tensor):
        hidden_states = hidden_states[0]
    return hidden_states
```
The _get_hidden_states_tensor utility function handles cases where the model output might be a list containing a single tensor. This is a good defensive check, especially if the model's output format can vary based on compilation or backend. It ensures consistent handling of hidden states before further processing.
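A quick standalone check of that normalization (the helper is copied here so the snippet runs on its own; the tensors are toy data):

```python
import torch


def _get_hidden_states_tensor(hidden_states):
    # Unwrap a single-element list produced by some compiled backends.
    if isinstance(hidden_states, list) and \
            len(hidden_states) == 1 and \
            isinstance(hidden_states[0], torch.Tensor):
        hidden_states = hidden_states[0]
    return hidden_states


h = torch.randn(4, 8)
assert _get_hidden_states_tensor(h) is h    # plain tensor passes through
assert _get_hidden_states_tensor([h]) is h  # single-element list is unwrapped
```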
```python
self.pcp_tokens = np.zeros(self.max_num_reqs, dtype=np.int32)
self.total_num_sampled_tokens_pcp = 0
```
```python
self.pcp_tokens[:num_reqs] = pcp_tokens[:num_reqs]
self.total_num_sampled_tokens_pcp = pcp_tokens[:num_reqs].sum()
```
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
```python
positions=positions,
intermediate_tensors=intermediate_tensors,
inputs_embeds=inputs_embeds)
```

```python
maybe_padded_num_tokens: int,
```
Why do we still need maybe_padded_num_tokens?
No need. I forgot to change the name. Actually, num_tokens_padded is used.
@kunpengW-code Hey, could we make this PR ready tomorrow? I'd like to take it into v0.14.0rc1. We really need it in vLLM-Omni. If you need any help please let me know :)
The e2e test has some problems; we can solve them together.
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@zhenwenqi2024 @kunpengW-code I opened a PR against your personal repository to fix the lint and restore some missing pieces such as ec_connector: kunpengW-code#1. PTAL.
[Bugfix] Fix lint & add ec_connector_output for EPD
…to modelrunner-refactor
This pull request has conflicts, please resolve those before we can evaluate the pull request.
…to modelrunner-refactor # Conflicts: # vllm_ascend/worker/model_runner_v1.py
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?
#6043 deleted the forward_before phase of the dynamic eplb. Currently, end-to-end precision is only monitored in the UT, and no log is printed at the key place, so eplb silently not taking effect was not intercepted.
1. The forward_before function is added back.
2. Delete unnecessary logs and add key logs.
3. Warm-up of algorithm 3 is added.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
#### The conversation is normal.
Okay, the user is asking, "What is deep learning?" I need to explain this in a clear and concise way. Let me start by recalling what I know about deep learning. It's a subset of machine learning, right? So first, I should mention that it's part of machine learning, which itself is a branch of AI. Then, the key aspect of deep learning is the use of neural networks with multiple layers. These are called deep neural networks.

Wait, I should define neural networks first. Maybe start with the basics. A neural network is inspired by the human brain, with layers of nodes (neurons) that process data. But deep learning specifically refers to networks with many layers—hence "deep." So the term "deep" comes from the number of layers.

I should explain how deep learning works. It involves training these networks on large datasets, allowing them to automatically learn features from the data. Unlike traditional machine learning, where you might have to manually extract features, deep learning models can do this automatically. That's a key point. For example, in image recognition, a deep learning model can learn to detect edges, shapes, and then more complex patterns without human intervention.

Applications are important too. The user might want to know where deep learning is used. Common examples include image and speech recognition, natural language processing, autonomous vehicles, and recommendation systems. Maybe mention specific technologies like self-driving cars using computer vision or virtual assistants like Siri or Alexa

- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@1339784

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
What this PR does / why we need it?
The structure of the `excute_model` and `_dymmy_run` methods in NPUModelRunner differs greatly from that in GPUModelRunner. Achieve alignment with GPUModelRunner:
- Split the `_prepare_inputs` method into `_prepare_inputs`, `_determine_batch_execution_and_padding`, `_build_attention_metadata`, and `_preprocess`.
- Modify `_generate_process_reqs_hidden_states` to `_model_forward`.
- Align the implementation of the `postprocess` phase.

Related-RFC: #5449
Co-authored-by: @zhenwenqi2024
Does this PR introduce any user-facing change?
no
How was this patch tested?