diff --git a/.claude/skills/vllm-omni-npu-upgrade/SKILL.md b/.claude/skills/vllm-omni-npu-upgrade/SKILL.md new file mode 100644 index 00000000000..1ef7ab39301 --- /dev/null +++ b/.claude/skills/vllm-omni-npu-upgrade/SKILL.md @@ -0,0 +1,300 @@ +--- +name: vllm-omni-npu-model-runner-upgrade +description: "Upgrade vllm-omni NPU model runners (OmniNPUModelRunner, NPUARModelRunner, NPUGenerationModelRunner) to align with the latest vllm-ascend NPUModelRunner while preserving omni-specific logic." +--- + +# vLLM-Omni NPU Model Runner Upgrade Skill + +## Overview + +This skill guides the process of upgrading vllm-omni's NPU model runners to align with the latest vllm-ascend codebase while preserving omni-specific enhancements. The NPU runners are designed to run omni multimodal models (like Qwen3-Omni, Bagel, MiMoAudio) on Ascend NPUs. + +## File Structure + +### NPU Model Runner Files +``` +vllm-omni/vllm_omni/platforms/npu/worker/ +├── __init__.py +├── npu_model_runner.py # OmniNPUModelRunner (base class) +├── npu_ar_model_runner.py # NPUARModelRunner (autoregressive) +├── npu_ar_worker.py # AR worker +├── npu_generation_model_runner.py # NPUGenerationModelRunner (diffusion/non-AR) +└── npu_generation_worker.py # Generation worker +``` + +### GPU Reference Files (for omni-specific logic sync) +``` +vllm-omni/vllm_omni/worker/ +├── __init__.py +├── gpu_model_runner.py # OmniGPUModelRunner +├── gpu_ar_model_runner.py # GPUARModelRunner +├── gpu_ar_worker.py +├── gpu_generation_model_runner.py +├── gpu_generation_worker.py +├── mixins.py +├── base.py +└── gpu_memory_utils.py +``` + +### vllm-ascend Reference Files +``` +vllm-ascend/vllm_ascend/worker/ +├── model_runner_v1.py # NPUModelRunner (base class to copy from) +├── npu_input_batch.py +├── block_table.py +├── pcp_utils.py +└── worker.py +``` + +## Inheritance Hierarchy + +``` + GPUModelRunner (vllm) + | + +----------------+----------------+ + | | + OmniGPUModelRunner NPUModelRunner (vllm-ascend) + (vllm_omni/worker) 
(vllm_ascend/worker) + | | + +----------- OmniNPUModelRunner --+ + (multiple inheritance) + | + +---------------+---------------+ + | | + NPUARModelRunner NPUGenerationModelRunner + (autoregressive) (non-autoregressive/diffusion) +``` + +## Omni-Specific Comment Markers + +Omni-specific logic is marked with comment blocks: +```python +# -------------------------------------- Omni-new ------------------------------------------------- +# ... omni-specific code ... +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +Or simpler variations: +```python +# -------------------------------------- Omni-new ------------------------------------------------- +# ------------------------------------------------------------------------------------------------ +``` + +**Important**: +- Always preserve and add these markers when modifying code. +- **The reference documents (`references/omni-specific-blocks.md`) may not be up-to-date.** Always grep for `Omni-new` in the GPU implementations to find the authoritative list of omni-specific blocks. +- When you discover new omni-specific code that is not documented in the references, please update the reference files. 
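The marker grep recommended above can also be sketched in Python, which is handy when comparing marker positions between a GPU runner and its NPU counterpart. This is a minimal sketch under stated assumptions: the helper name `find_omni_markers` and the regular expression are hypothetical and not part of vllm-omni, and the sample text only imitates the marker styles shown in this skill.

```python
import re

# Hypothetical pattern: matches marker comments such as
# "# ------ Omni-new ------", with or without spaces around the label.
OMNI_MARKER = re.compile(r"^\s*#\s*-+\s*Omni-new\s*-+\s*$")


def find_omni_markers(source: str) -> list[int]:
    """Return 1-based line numbers of every Omni-new marker comment."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if OMNI_MARKER.match(line)
    ]


# Illustrative input only; it mimics both marker styles used in the runners.
sample = """\
def execute_model(self, scheduler_output):
    # -------------------------------------- Omni-new -------------------------------------------------
    self._update_states(scheduler_output)
    # ------------------------------------------------------------------------------------------------
    # ---------------------------------------Omni-new----------------------------------------------
    hidden_states, multimodal_outputs = self.extract_multimodal_outputs(hidden_states)
    # ---------------------------------------Omni-new----------------------------------------------
"""

print(find_omni_markers(sample))
```

A closing line made only of dashes is deliberately not counted, so a block written in the "simpler variation" style contributes one marker rather than two. Keep that in mind if you compare marker counts across files.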
+ +## Key Methods Requiring Attention + +### OmniNPUModelRunner (npu_model_runner.py) + +| Method | Description | Omni-Specific Logic | +|--------|-------------|---------------------| +| `load_model` | Load model and initialize talker_mtp | Uses `ACLGraphWrapper` instead of `CUDAGraphWrapper`, initializes talker buffers | +| `_dummy_run` | Warmup/profiling run | talker_mtp dummy forward, `extract_multimodal_outputs` | +| `_model_forward` | Forward pass wrapper | Injects `model_kwargs_extra`, wraps with `OmniOutput`, NPU-specific graph updates | +| `_talker_mtp_forward` | Talker MTP forward for Qwen3-Omni | Uses `set_ascend_forward_context` | + +### NPUARModelRunner (npu_ar_model_runner.py) + +| Method | Description | Omni-Specific Logic | +|--------|-------------|---------------------| +| `__init__` | Initialize with KV transfer manager | `OmniKVTransferManager` setup | +| `execute_model` | Main inference entry | KV transfer handling, `_update_states` override, `extract_multimodal_outputs` | +| `sample_tokens` | Token sampling | Hidden states extraction, multimodal outputs processing, `OmniModelRunnerOutput` | +| `_resolve_global_request_id` | Request ID resolution | For disaggregated inference | + +### NPUGenerationModelRunner (npu_generation_model_runner.py) + +| Method | Description | Omni-Specific Logic | +|--------|-------------|---------------------| +| `_update_request_states` | Update request states for async chunk | async_chunk handling | +| `execute_model` | Generation forward | async_chunk, `seq_token_counts`, `_run_generation_model` | +| `sample_tokens` | Output processing | multimodal output packaging to `OmniModelRunnerOutput` | +| `_dummy_run` | Dummy run override | model_kwargs initialization, multimodal extraction | +| `_run_generation_model` | Run generation model | Calls `_model_forward` with sampler | + +## Upgrade Workflow + +### Step 1: Preparation + +1. 
**Identify target versions** (use the `gh` CLI to check): +   - We're using the vllm-omni main branch +   - Check the latest release of vllm-omni +   - Target vllm-ascend version (use the latest local vllm-ascend code directly) + +2. **Check GPU-side changes** (since last release): +   ```bash +   cd /root/vllm-workspace/vllm-omni +   git log --oneline --since="" -- vllm_omni/worker/ +   ``` + +3. **Read latest vllm-ascend code**: +   - We don't track vllm-ascend changes; use the latest code directly from `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py` +   - Copy the relevant methods and re-insert omni-specific blocks + +### Step 2: Analyze Omni-Specific Logic + +For each NPU model runner file: + +1. **Extract existing omni-specific blocks**: +   ```bash +   grep -n "Omni-new" vllm_omni/platforms/npu/worker/npu_model_runner.py +   ``` + +2. **Document each omni block**: +   - Which method it belongs to +   - What functionality it provides +   - Dependencies on other omni code + +### Step 3: Update Base Class (OmniNPUModelRunner) + +**Note**: Always check the GPU implementation `gpu_model_runner.py` for any new omni logic not yet documented in references. + +1. **Read the latest vllm-ascend `NPUModelRunner.load_model`** +2. **Copy the method, keeping the structure** +3. **Re-insert omni-specific logic** (check GPU `gpu_model_runner.py` for the authoritative list): +   - Replace `CUDAGraphWrapper` with `ACLGraphWrapper` +   - Keep talker_mtp initialization +   - Preserve buffer allocations for talker +   - Check for any new omni blocks added since last sync + +4. **Update `_dummy_run`**: +   - Copy from vllm-ascend +   - Compare with GPU `_dummy_run` for omni-specific blocks +   - Re-insert all `Omni-new` marked code from GPU version + +5. **Update `_model_forward`**: +   - Keep the omni wrapper logic +   - Update NPU-specific parts (graph params, SP all-gather) +   - Check GPU version for any new omni logic + +### Step 4: Update AR Model Runner + +1.
**Compare with GPU `gpu_ar_model_runner.py`** for any new omni features +2. **Copy `execute_model` from vllm-ascend** +3. **Re-insert omni blocks** (reference `references/omni-specific-blocks.md`, but note it may be incomplete): + - **IMPORTANT**: Always check the GPU implementation `gpu_ar_model_runner.py` for all `Omni-new` marked code blocks + - The reference doc may not include newly added omni logic - treat it as a starting point, not exhaustive + - When discovering new omni code blocks, please update `references/omni-specific-blocks.md` + - Common omni blocks include but are not limited to: KV transfer, multimodal outputs, sampling_metadata handling, etc. + +4. **Update `sample_tokens`** (also compare with GPU implementation): + - Compare with `gpu_ar_model_runner.py`'s `sample_tokens` method + - Identify all `Omni-new` marked code blocks + - Ensure NPU version includes all omni-specific logic + +### Step 5: Update Generation Model Runner + +**Note**: Generation model runner may have unique omni logic for diffusion/non-AR models. + +1. **Compare with GPU `gpu_generation_model_runner.py`** - grep for all `Omni-new` blocks +2. **Update `execute_model`**: + - Check GPU version for all omni-specific blocks + - Keep async_chunk handling + - Keep `seq_token_counts` injection + - Update forward/context setup from vllm-ascend + - Look for any new omni logic not documented in references + +3. 
**Update `_dummy_run`**: +   - Copy from vllm-ascend base +   - Compare with GPU `_dummy_run` if it exists +   - Re-insert all omni-specific logic + +### Step 6: Update Imports + +Check and update imports at the top of each file: + +```python +# Common vllm-ascend imports +from vllm_ascend.ascend_forward_context import get_forward_context, set_ascend_forward_context +from vllm_ascend.attention.attention_v1 import AscendAttentionState +from vllm_ascend.attention.utils import using_paged_attention +from vllm_ascend.compilation.acl_graph import ACLGraphWrapper, update_full_graph_params +from vllm_ascend.ops.rotary_embedding import update_cos_sin +from vllm_ascend.utils import enable_sp, lmhead_tp_enable +from vllm_ascend.worker.model_runner_v1 import SEQ_LEN_WITH_MAX_PA_WORKSPACE, NPUModelRunner + +# Omni-specific imports +from vllm_omni.model_executor.models.output_templates import OmniOutput +from vllm_omni.worker.gpu_model_runner import OmniGPUModelRunner +from vllm_omni.outputs import OmniModelRunnerOutput +from vllm_omni.distributed.omni_connectors.kv_transfer_manager import OmniKVTransferManager +``` + +### Step 7: Sync GPU-Side Omni Changes + +1. **Check recent GPU worker changes**: +   ```bash +   git diff .. -- vllm_omni/worker/gpu_model_runner.py +   git diff .. -- vllm_omni/worker/gpu_ar_model_runner.py +   ``` + +2. **Identify new omni features** that need to be ported to NPU + +3. **Apply corresponding changes** to NPU runners + +### Step 8: Validation + +1. **Run syntax checking** (`py_compile` only verifies that each file byte-compiles; it does not check types): +   ```bash +   cd /root/vllm-workspace/vllm-omni +   python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py +   python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py +   python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py +   ``` + +2. **Run import test**: +   ```bash +   python -c "from vllm_omni.platforms.npu.worker import *" +   ``` + +3.
**Run model serving test** (if hardware available): + ```bash + vllm serve --trust-remote-code + ``` + +## Common Pitfalls + +### 1. Forward Context Differences +- GPU uses `set_forward_context` +- NPU uses `set_ascend_forward_context` +- Parameters may differ slightly + +### 2. Graph Wrapper Differences +- GPU: `CUDAGraphWrapper` +- NPU: `ACLGraphWrapper` +- Constructor parameters may differ + +### 3. Buffer Creation +- GPU: `_make_buffer` returns different structure +- NPU: May need numpy=True/False parameter + +### 4. Attention Metadata +- GPU: Uses vllm attention metadata builders +- NPU: Uses `AscendCommonAttentionMetadata` + +### 5. Sampling +- GPU: Uses vllm sampler +- NPU: Uses `AscendSampler` + +## Checklist Before Commit + +- [ ] All omni-specific comment markers preserved +- [ ] New omni logic from GPU side synced +- [ ] Imports updated to latest vllm-ascend +- [ ] No `CUDAGraphWrapper` references in NPU code +- [ ] `set_ascend_forward_context` used instead of `set_forward_context` +- [ ] `ACLGraphWrapper` used for talker_mtp wrapping +- [ ] Type hints match vllm-ascend signatures +- [ ] No duplicate code blocks +- [ ] Python syntax valid (py_compile passes) + +## Reference Files for Comparison + +When upgrading, keep these files open for reference: + +1. **vllm-ascend NPUModelRunner**: `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py` +2. **vllm GPUModelRunner**: `/root/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py` +3. 
**vllm-omni OmniGPUModelRunner**: `/root/vllm-workspace/vllm-omni/vllm_omni/worker/gpu_model_runner.py` diff --git a/.claude/skills/vllm-omni-npu-upgrade/references/gpu-to-npu-translation.md b/.claude/skills/vllm-omni-npu-upgrade/references/gpu-to-npu-translation.md new file mode 100644 index 00000000000..89067d37b2d --- /dev/null +++ b/.claude/skills/vllm-omni-npu-upgrade/references/gpu-to-npu-translation.md @@ -0,0 +1,335 @@ +# GPU to NPU Translation Patterns + +This document provides a quick reference for translating GPU code patterns to NPU equivalents when porting omni-specific logic. + +## Import Translations + +### Forward Context +```python +# GPU +from vllm.forward_context import set_forward_context + +# NPU +from vllm_ascend.ascend_forward_context import set_ascend_forward_context +``` + +### Graph Wrapper +```python +# GPU +from vllm.compilation.cuda_graph import CUDAGraphWrapper + +# NPU +from vllm_ascend.compilation.acl_graph import ACLGraphWrapper +``` + +### Attention State +```python +# GPU (no equivalent - uses FlashAttention states directly) + +# NPU +from vllm_ascend.attention.attention_v1 import AscendAttentionState +``` + +### Utilities +```python +# GPU +# (directly use torch.cuda functions) + +# NPU +from vllm_ascend.utils import enable_sp, lmhead_tp_enable +from vllm_ascend.ops.rotary_embedding import update_cos_sin +``` + +## Context Manager Translations + +### Forward Context Setup +```python +# GPU +with set_forward_context( + attn_metadata, + self.vllm_config, + num_tokens=num_tokens_padded, + num_tokens_across_dp=num_tokens_across_dp, + cudagraph_runtime_mode=cudagraph_mode, + batch_descriptor=batch_desc, +): + # forward pass + +# NPU +with set_ascend_forward_context( + attn_metadata, + self.vllm_config, + num_tokens=num_tokens_padded, + num_tokens_across_dp=num_tokens_across_dp, + aclgraph_runtime_mode=cudagraph_mode, # Note: 'aclgraph' not 'cudagraph' + batch_descriptor=batch_desc, + 
num_actual_tokens=scheduler_output.total_num_scheduled_tokens, + model_instance=self.model, +): + # forward pass +``` + +### Graph Capture Context +```python +# GPU +from vllm.compilation.cuda_graph import graph_capture as cuda_graph_capture +with cuda_graph_capture(self.device): + # capture + +# NPU +from vllm_ascend.worker.model_runner_v1 import graph_capture +with graph_capture(self.device): + # capture +``` + +## Graph Wrapper Usage + +### Creating Graph Wrapper +```python +# GPU +if cudagraph_mode.has_full_cudagraphs() and has_separate_talker: + self.talker_mtp = CUDAGraphWrapper( + talker_mtp, + self.vllm_config, + runtime_mode=CUDAGraphMode.FULL + ) + +# NPU +if cudagraph_mode.has_full_cudagraphs() and has_separate_talker: + self.talker_mtp = ACLGraphWrapper( + talker_mtp, + self.vllm_config, + runtime_mode=CUDAGraphMode.FULL + ) +``` + +### Checking Graph Wrapper Type +```python +# GPU +if not isinstance(self.talker_mtp, CUDAGraphWrapper): + _cudagraph_mode = CUDAGraphMode.NONE + +# NPU +if not isinstance(self.talker_mtp, ACLGraphWrapper): + _cudagraph_mode = CUDAGraphMode.NONE +``` + +## Device Operations + +### Synchronization +```python +# GPU +torch.cuda.synchronize() + +# NPU +torch.npu.synchronize() +``` + +### Stream Operations +```python +# GPU +stream = torch.cuda.Stream(device=device) +torch.cuda.current_stream() + +# NPU +stream = torch.npu.Stream(device=device) +torch.npu.current_stream() +``` + +## Attention Metadata + +### State Setting (NPU-specific) +```python +# GPU - handled internally by attention backends + +# NPU - explicit state setting required +self.attn_state = AscendAttentionState.DecodeOnly +if self.speculative_config and self.speculative_config.method == "mtp": + if self.vllm_config.model_config.use_mla: + self.attn_state = AscendAttentionState.SpecDecoding + else: + self.attn_state = AscendAttentionState.ChunkedPrefill +``` + +### Building Attention Metadata +```python +# GPU - uses vllm attention builders + +# NPU - may need 
additional parameters +(attn_metadata, spec_decode_common_attn_metadata) = self._build_attention_metadata( + num_tokens=num_tokens_unpadded, + num_tokens_padded=num_tokens_padded, + num_reqs=num_reqs, + num_reqs_padded=num_reqs_padded, + max_query_len=max_num_scheduled_tokens, + ubatch_slices=ubatch_slices_attn, + logits_indices=logits_indices, + use_spec_decode=use_spec_decode, + num_scheduled_tokens=scheduler_output.num_scheduled_tokens, + num_scheduled_tokens_np=num_scheduled_tokens_np, + cascade_attn_prefix_lens=cascade_attn_prefix_lens, +) +``` + +## Rotary Embedding + +### Update Cos/Sin Cache +```python +# GPU - typically handled inside attention + +# NPU - explicit update required before forward +from vllm_ascend.ops.rotary_embedding import update_cos_sin +update_cos_sin(positions) +``` + +## Sequence Parallelism + +### Enable SP Check +```python +# GPU - use vllm distributed utilities + +# NPU - use vllm-ascend wrapper +from vllm_ascend.utils import enable_sp + +if enable_sp(): + # sequence parallelism enabled +``` + +## Sampler + +### Sampler Type +```python +# GPU - uses vllm sampler +self.sampler = Sampler() + +# NPU - uses AscendSampler +from vllm_ascend.sample.sampler import AscendSampler +self.sampler = AscendSampler() +``` + +## Input Batch + +### Batch Class +```python +# GPU +from vllm.v1.worker.gpu_input_batch import InputBatch + +# NPU +from vllm_ascend.worker.npu_input_batch import NPUInputBatch +``` + +## Graph Parameter Updates + +### Full Graph Params Update (NPU-specific) +```python +# GPU - not needed + +# NPU - required for FULL graph mode +from vllm_ascend.compilation.acl_graph import update_full_graph_params + +forward_context = get_forward_context() +if ( + forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL + and not forward_context.capturing + and not self.use_sparse +): + update_full_graph_params( + self.attn_backend, + self.update_stream, + forward_context, + num_tokens_padded, + self.vllm_config, + 
self.speculative_config, + positions.shape[0], + ) +``` + +## Paged Attention Check + +```python +# GPU - not typically needed + +# NPU +from vllm_ascend.attention.utils import using_paged_attention + +if is_graph_capturing and using_paged_attention(num_tokens, self.vllm_config): + seq_lens = SEQ_LEN_WITH_MAX_PA_WORKSPACE +``` + +## Common Method Signature Differences + +### _dummy_run Parameters +```python +# GPU (v0.17.0) +def _dummy_run( + self, + num_tokens: int, + cudagraph_runtime_mode: CUDAGraphMode | None = None, + force_attention: bool = False, + uniform_decode: bool = False, + allow_microbatching: bool = True, + skip_eplb: bool = False, + is_profile: bool = False, + create_mixed_batch: bool = False, + remove_lora: bool = True, + is_graph_capturing: bool = False, + num_active_loras: int = 0, +) -> tuple[torch.Tensor, torch.Tensor]: + +# NPU (v0.17.0) - adds with_prefill, activate_lora +def _dummy_run( + self, + num_tokens: int, + with_prefill: bool = False, + cudagraph_runtime_mode: CUDAGraphMode | None = None, + force_attention: bool = False, + uniform_decode: bool = False, + is_profile: bool = False, + create_mixed_batch: bool = False, + allow_microbatching: bool = True, + skip_eplb: bool = False, + remove_lora: bool = True, + activate_lora: bool = False, + is_graph_capturing: bool = False, + num_active_loras: int = 0, +) -> tuple[torch.Tensor, torch.Tensor]: +``` + +### _model_forward Parameters +```python +# GPU - no num_tokens_padded +def _model_forward( + self, + input_ids: torch.Tensor | None = None, + positions: torch.Tensor | None = None, + intermediate_tensors: IntermediateTensors | None = None, + inputs_embeds: torch.Tensor | None = None, + **model_kwargs: dict[str, Any], +): + +# NPU - has num_tokens_padded as first parameter +def _model_forward( + self, + num_tokens_padded: int, + input_ids: torch.Tensor | None = None, + positions: torch.Tensor | None = None, + intermediate_tensors: IntermediateTensors | None = None, + inputs_embeds: 
torch.Tensor | None = None, + **model_kwargs: dict[str, Any], +): +``` + +## Quick Reference Table + +| Feature | GPU | NPU | +|---------|-----|-----| +| Graph wrapper | `CUDAGraphWrapper` | `ACLGraphWrapper` | +| Forward context | `set_forward_context` | `set_ascend_forward_context` | +| Runtime mode param | `cudagraph_runtime_mode` | `aclgraph_runtime_mode` | +| Device sync | `torch.cuda.synchronize()` | `torch.npu.synchronize()` | +| Stream | `torch.cuda.Stream` | `torch.npu.Stream` | +| Current stream | `torch.cuda.current_stream()` | `torch.npu.current_stream()` | +| Input batch | `InputBatch` | `NPUInputBatch` | +| Sampler | `Sampler` | `AscendSampler` | +| Attention state | N/A | `AscendAttentionState` | +| RoPE update | N/A | `update_cos_sin()` | diff --git a/.claude/skills/vllm-omni-npu-upgrade/references/omni-specific-blocks.md b/.claude/skills/vllm-omni-npu-upgrade/references/omni-specific-blocks.md new file mode 100644 index 00000000000..8c5d32ab4c1 --- /dev/null +++ b/.claude/skills/vllm-omni-npu-upgrade/references/omni-specific-blocks.md @@ -0,0 +1,374 @@ +# Omni-Specific Code Blocks Reference + +This document catalogs omni-specific code blocks in the NPU model runners, making it easier to identify what needs to be preserved during upgrades. + +> **IMPORTANT**: This document may not be complete or up-to-date! 
+> +> - Always grep for `Omni-new` in the GPU implementations (`vllm_omni/worker/`) to find the authoritative list +> - New omni features may be added that are not yet documented here +> - When you discover new omni-specific blocks during an upgrade, please update this document +> - Last verified: Check git history for this file + +## OmniNPUModelRunner (npu_model_runner.py) + +### load_model - Talker MTP Initialization + +```python +def load_model(self, *args, **kwargs) -> None: + NPUModelRunner.load_model(self, *args, **kwargs) + # Initialize enable_sp cache to avoid get_current_vllm_config() error + # in _pad_for_sequence_parallelism during execute_model. + # This is a workaround for vllm-ascend not passing vllm_config to enable_sp(). + enable_sp(self.vllm_config) + # TODO move this model specific logic to a separate class + # TTS model IS the talker (no .talker sub-attr); use getattr to support both Omni and TTS. + talker_mtp = getattr(self.model, "talker_mtp", None) + if talker_mtp is not None: + self.talker_mtp = talker_mtp # type: ignore[assignment] + cudagraph_mode = self.compilation_config.cudagraph_mode + assert cudagraph_mode is not None + # Only wrap talker_mtp in CUDAGraphWrapper for Omni models that + # have a separate .talker sub-module. TTS models' code predictor + # has internal AR loops / torch.multinomial — not graph-safe. + has_separate_talker = getattr(self.model, "talker", None) is not None + if cudagraph_mode.has_full_cudagraphs() and has_separate_talker: + # NOTE: Use ACLGraphWrapper on NPU, not CUDAGraphWrapper + self.talker_mtp = ACLGraphWrapper(talker_mtp, self.vllm_config, runtime_mode=CUDAGraphMode.FULL) + # TTS exposes mtp_hidden_size; Omni uses hf_text_config.hidden_size. 
+ hidden_size = int( + getattr(self.model, "mtp_hidden_size", 0) or getattr(self.model_config.hf_text_config, "hidden_size") + ) + max_batch_size = max(self.max_num_reqs, self.compilation_config.max_cudagraph_capture_size) + self.talker_mtp_input_ids = self._make_buffer(max_batch_size, dtype=torch.int32) + self.talker_mtp_inputs_embeds = self._make_buffer( + max_batch_size, hidden_size, dtype=self.dtype, numpy=False + ) + self.last_talker_hidden = self._make_buffer(max_batch_size, hidden_size, dtype=self.dtype, numpy=False) + self.text_step = self._make_buffer(max_batch_size, hidden_size, dtype=self.dtype, numpy=False) +``` + +### _dummy_run - Talker MTP Dummy Forward + +Location: Inside `set_ascend_forward_context` block, before main model forward + +```python +# ---------------------------------------Omni-new---------------------------------------------- +if getattr(self.model, "talker", None) is not None and hasattr(self.model, "talker_mtp"): + num_tokens_padded_talker_mtp = num_tokens_padded + if num_tokens_padded_talker_mtp == self.max_num_tokens: + num_tokens_padded_talker_mtp = self.talker_mtp_input_ids.gpu.shape[0] + outputs = self.talker_mtp( + self.talker_mtp_input_ids.gpu[:num_tokens_padded_talker_mtp], + self.talker_mtp_inputs_embeds.gpu[:num_tokens_padded_talker_mtp], + self.last_talker_hidden.gpu[:num_tokens_padded_talker_mtp], + self.text_step.gpu[:num_tokens_padded_talker_mtp], + ) + self.compilation_config.cache_dir = None +# ---------------------------------------Omni-new---------------------------------------------- +``` + +### _dummy_run - Extract Multimodal Outputs + +Location: After model forward, before dummy_compute_logits + +```python +# ---------------------------------------Omni-new---------------------------------------------- +hidden_states, multimodal_outputs = self.extract_multimodal_outputs(hidden_states) +# ---------------------------------------Omni-new---------------------------------------------- +``` + +### _model_forward - Omni 
Output Wrapping + +```python +def _model_forward( + self, + num_tokens_padded: int, + input_ids: torch.Tensor | None = None, + positions: torch.Tensor | None = None, + intermediate_tensors: IntermediateTensors | None = None, + inputs_embeds: torch.Tensor | None = None, + **model_kwargs: dict[str, Any], +): + """Override to combine NPUModelRunner's signature with OmniGPUModelRunner's logic.""" + # Omni-specific: build and inject extra model kwargs + model_kwargs_extra = self._build_model_kwargs_extra() + + # Call the model forward (same as NPUModelRunner) + assert self.model is not None + model_output = self.model( + input_ids=input_ids, + positions=positions, + intermediate_tensors=intermediate_tensors, + inputs_embeds=inputs_embeds, + **model_kwargs, + **model_kwargs_extra, + ) + + # Omni-specific: wrap output if needed + if not isinstance(model_output, OmniOutput) and hasattr(self.model, "make_omni_output"): + model_output = self.model.make_omni_output(model_output, **model_kwargs_extra) + + # Omni-specific: cache model output for later sample_tokens + self._omni_last_model_output = model_output + + # NPU-specific: update full graph params (keep from vllm-ascend) + forward_context = get_forward_context() + # ... NPU graph update logic ... 
+ + # NPU-specific: all-gather for sequence parallelism (keep from vllm-ascend) + if get_forward_context().sp_enabled and not isinstance(model_output, IntermediateTensors): + model_output = self._all_gather_hidden_states_and_aux(model_output) + + return model_output +``` + +--- + +## NPUARModelRunner (npu_ar_model_runner.py) + +### __init__ - KV Transfer Manager + +```python +def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + self.input_ids = self._make_buffer(self.max_num_tokens, dtype=torch.int32) + # each model stage has their own hidden size + self.hidden_size = self.model_config.hf_text_config.hidden_size + self.inputs_embeds = self._make_buffer(self.max_num_tokens, self.hidden_size, dtype=self.dtype, numpy=False) + # Initialize KV cache manager (preserve vllm_config fallback behavior) + self.kv_transfer_manager = OmniKVTransferManager.from_vllm_config(self.vllm_config, self.model_config) +``` + +### execute_model - KV Transfer Before Update States + +Location: At the very beginning of execute_model + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +# [Omni] Handle KV transfer BEFORE updating states (which removes finished requests) +self.kv_extracted_req_ids = self.kv_transfer_manager.handle_finished_requests_kv_transfer( + finished_reqs=getattr(scheduler_output, "finished_requests_needing_kv_transfer", {}), + kv_caches=self.kv_caches, + block_size=self.cache_config.block_size, + cache_dtype=str(self.cache_config.cache_dtype), + request_id_resolver=self._resolve_global_request_id, +) +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### execute_model - Custom _update_states Call + +Location: Inside synchronize_input_prep context + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +self._update_states(scheduler_output) +# 
------------------------------------------------------------------------------------------------ +``` + +### execute_model - Extract Multimodal Outputs + +Location: In post process section, after hidden_states assignment + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +hidden_states, multimodal_outputs = self.extract_multimodal_outputs(hidden_states) + +if multimodal_outputs is not None: + keys_or_type = ( + list(multimodal_outputs.keys()) + if isinstance(multimodal_outputs, dict) + else type(multimodal_outputs) + ) + logger.debug(f"[AR] execute_model: multimodal_outputs keys = {keys_or_type}") +else: + logger.debug("[AR] execute_model: multimodal_outputs is None") +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### execute_model - Compute Logits with sampling_metadata + +Location: In both broadcast_pp_output True and False branches + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +# Try with sampling_metadata first; fall back to without for models that don't support it +try: + logits = self.model.compute_logits( + sample_hidden_states, sampling_metadata=self.input_batch.sampling_metadata + ) +except TypeError: + logits = self.model.compute_logits(sample_hidden_states) +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### sample_tokens - KV Extracted Req IDs + +Location: At the beginning of sample_tokens + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +kv_extracted_req_ids = getattr(self, "kv_extracted_req_ids", None) +self.kv_extracted_req_ids = None +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### sample_tokens - Process Additional Information and Build Output + +Location: After 
bookkeeping sync, replacing the original output construction + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +hidden_states_cpu = hidden_states.detach().to("cpu").contiguous() +num_scheduled_tokens_np = getattr(self, "_omni_num_scheduled_tokens_np", None) +if num_scheduled_tokens_np is None: + req_ids = self.input_batch.req_ids + num_scheduled_tokens_np = np.array( + [scheduler_output.num_scheduled_tokens[rid] for rid in req_ids], + dtype=np.int32, + ) + +self._process_additional_information_updates( + hidden_states, multimodal_outputs, num_scheduled_tokens_np, scheduler_output +) + +pooler_output: list[dict[str, object]] = [] +for rid in req_ids_output_copy: + idx = req_id_to_index_output_copy[rid] + start = int(self.query_start_loc.cpu[idx]) + sched = int(num_scheduled_tokens_np[idx]) + end = start + sched + hidden_slice = hidden_states_cpu[start:end] + payload: dict[str, object] = {"hidden": hidden_slice} + if isinstance(multimodal_outputs, dict) and multimodal_outputs: + # ... multimodal output slicing logic ... 
+ pooler_output.append(payload) + +model_runner_output = OmniModelRunnerOutput( + req_ids=req_ids_output_copy, + req_id_to_index=req_id_to_index_output_copy, + sampled_token_ids=valid_sampled_token_ids, + logprobs=logprobs_lists, + prompt_logprobs_dict=prompt_logprobs_dict, + pooler_output=(pooler_output if self.vllm_config.model_config.engine_output_type != "text" else None), + kv_connector_output=kv_connector_output, +) +model_runner_output.kv_extracted_req_ids = kv_extracted_req_ids +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +--- + +## NPUGenerationModelRunner (npu_generation_model_runner.py) + +### execute_model - Async Chunk Update + +Location: Inside prepare input section, before synchronize_input_prep + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +if self.model_config.async_chunk and num_scheduled_tokens: + self._update_request_states(scheduler_output) +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### execute_model - Seq Token Counts + +Location: After _preprocess call + +```python +# [Omni] Pass token counts per request for code2wav output slicing +model_kwargs["seq_token_counts"] = tokens +``` + +### execute_model - Run Generation Model + +Location: Inside forward context + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +outputs = self._run_generation_model( + num_tokens_padded=num_tokens_padded, + input_ids=input_ids, + positions=positions, + intermediate_tensors=intermediate_tensors, + inputs_embeds=inputs_embeds, + model_kwargs=model_kwargs, + logits_indices=logits_indices, +) +_, multimodal_outputs = self.extract_multimodal_outputs(outputs) +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### sample_tokens - Multimodal Output Processing 
+ +The entire sample_tokens method body is omni-specific for generation models: + +```python +# -------------------------------------- Omni-new ------------------------------------------------- +pooler_output: list[object] = [] +if isinstance(multimodal_outputs, torch.Tensor): + # ... tensor handling ... +elif isinstance(multimodal_outputs, list): + # ... list handling ... +elif isinstance(multimodal_outputs, dict): + # ... dict handling per request ... +else: + raise RuntimeError("Unsupported diffusion output type") +# [Omni] Copy req_id mappings to avoid async scheduling mutation. +req_ids_output_copy = self.input_batch.req_ids.copy() +req_id_to_index_output_copy = self.input_batch.req_id_to_index.copy() +output = OmniModelRunnerOutput( + req_ids=req_ids_output_copy, + req_id_to_index=req_id_to_index_output_copy, + sampled_token_ids=[], + logprobs=None, + prompt_logprobs_dict={}, + pooler_output=pooler_output, + kv_connector_output=kv_connector_output, + num_nans_in_logits={}, + ec_connector_output=ec_connector_output if self.supports_mm_inputs else None, +) +# -------------------------------------- Omni-new ------------------------------------------------- +``` + +### _dummy_run - Model Kwargs Init and Multimodal Extract + +Location: Before model forward and after + +```python +model_kwargs = self._init_model_kwargs() # Before forward + +# ... forward ... 
+ +# -------------------------------------- Omni-new ------------------------------------------------- +hidden_states, _ = self.extract_multimodal_outputs(hidden_states) +# ------------------------------------------------------------------------------------------------- +``` + +--- + +## ExecuteModelState Extension + +The `ExecuteModelState` NamedTuple is extended for omni: + +```python +class ExecuteModelState(NamedTuple): + """Ephemeral cached state transferred between execute_model() and + sample_tokens(), after execute_model() returns None.""" + + scheduler_output: SchedulerOutput + logits: torch.Tensor + spec_decode_metadata: SpecDecodeMetadata | None + spec_decode_common_attn_metadata: AscendCommonAttentionMetadata | None + hidden_states: torch.Tensor + sample_hidden_states: torch.Tensor + aux_hidden_states: list[torch.Tensor] | None + attn_metadata: PerLayerAttnMetadata + positions: torch.Tensor + ec_connector_output: ECConnectorOutput | None + cudagraph_stats: CUDAGraphStat | None + multimodal_outputs: Any # <-- Omni extension +``` + +This extended state must be imported from `npu_ar_model_runner` in `npu_generation_model_runner`. diff --git a/.claude/skills/vllm-omni-npu-upgrade/references/workflow-checklist.md b/.claude/skills/vllm-omni-npu-upgrade/references/workflow-checklist.md new file mode 100644 index 00000000000..4f184df0ecb --- /dev/null +++ b/.claude/skills/vllm-omni-npu-upgrade/references/workflow-checklist.md @@ -0,0 +1,222 @@ +# NPU Model Runner Upgrade Workflow Checklist + +> **Note**: Reference documents (`omni-specific-blocks.md`) may not be complete. Always grep for `Omni-new` in GPU implementations to find all omni-specific code blocks. Update the reference docs when discovering new blocks. + +## Pre-Upgrade Preparation + +### 1. 
Version Information +- [ ] Identify current vllm-omni version: `_________` +- [ ] Identify target vllm-ascend version: `_________` +- [ ] Identify target vllm version: `_________` +- [ ] Last release date for GPU worker changes: `_________` + +### 2. Gather Git History +```bash +# GPU-side omni changes since last release +cd /root/vllm-workspace/vllm-omni +git log --oneline --since="YYYY-MM-DD" -- vllm_omni/worker/ + +# vllm-ascend NPUModelRunner changes +cd /root/vllm-workspace/vllm-ascend +git log --oneline .. -- vllm_ascend/worker/model_runner_v1.py +``` + +### 3. Backup Current Files +- [ ] Create backup of current NPU runners: + ```bash + cp -r vllm_omni/platforms/npu/worker vllm_omni/platforms/npu/worker.backup + ``` + +--- + +## OmniNPUModelRunner (npu_model_runner.py) + +### Read and Understand +- [ ] Read current `npu_model_runner.py` +- [ ] Read latest `vllm_ascend/worker/model_runner_v1.py` +- [ ] Read latest `vllm_omni/worker/gpu_model_runner.py` + +### Method: load_model +- [ ] Document existing omni-specific logic +- [ ] Copy latest NPUModelRunner.load_model structure +- [ ] Re-insert: `enable_sp(self.vllm_config)` call +- [ ] Re-insert: talker_mtp detection and setup +- [ ] Replace: `CUDAGraphWrapper` → `ACLGraphWrapper` +- [ ] Re-insert: Buffer allocations (talker_mtp_input_ids, etc.) 
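
The `CUDAGraphWrapper` → `ACLGraphWrapper` replacement above, together with the other GPU-to-NPU renames this checklist reviews after the upgrade (`set_forward_context` → `set_ascend_forward_context`, `cudagraph_runtime_mode` → `aclgraph_runtime_mode`), can be spot-checked mechanically. The following is a minimal sketch, not part of vllm-omni or vllm-ascend: the helper names `find_gpu_leftovers` and `scan_worker_dir` are hypothetical, and the rename table only contains the three pairs named in this checklist.

```python
import re
from pathlib import Path

# GPU-side names and the NPU replacements this checklist asks for.
# Only the three pairs named in the checklist are listed (assumption:
# extend the table if an upgrade introduces further renames).
GPU_TO_NPU = {
    "CUDAGraphWrapper": "ACLGraphWrapper",
    "set_forward_context": "set_ascend_forward_context",
    "cudagraph_runtime_mode": "aclgraph_runtime_mode",
}


def find_gpu_leftovers(source):
    """Return (line_number, gpu_name, npu_name) for every GPU-side name found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for gpu_name, npu_name in GPU_TO_NPU.items():
            # Whole-word match, so longer identifiers that merely contain
            # the name are not reported.
            if re.search(rf"\b{re.escape(gpu_name)}\b", line):
                hits.append((lineno, gpu_name, npu_name))
    return hits


def scan_worker_dir(worker_dir):
    """Scan every .py file directly under worker_dir; map file path to its hits."""
    report = {}
    for path in sorted(Path(worker_dir).glob("*.py")):
        hits = find_gpu_leftovers(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_worker_dir("vllm_omni/platforms/npu/worker")` from the vllm-omni repository root should return an empty dict once all renames are applied. The scan also reports names that appear only in comments, so a hit is a prompt for manual review rather than proof of a bug.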
+ +### Method: _dummy_run +- [ ] Document existing omni-specific logic locations +- [ ] Copy latest NPUModelRunner._dummy_run +- [ ] Re-insert: talker_mtp dummy forward block (inside context) +- [ ] Re-insert: `extract_multimodal_outputs` call +- [ ] Verify: Comment markers are present + +### Method: _model_forward +- [ ] Copy latest NPUModelRunner._model_forward structure +- [ ] Re-insert: `_build_model_kwargs_extra()` call +- [ ] Re-insert: OmniOutput wrapping logic +- [ ] Re-insert: `_omni_last_model_output` caching +- [ ] Keep: NPU graph params update +- [ ] Keep: SP all-gather logic + +### Method: _talker_mtp_forward +- [ ] Verify: Uses `set_ascend_forward_context` +- [ ] Verify: Uses `ACLGraphWrapper` check +- [ ] Sync any changes from GPU `_talker_mtp_forward` + +### Imports +- [ ] Update vllm-ascend imports to latest paths +- [ ] Verify all omni imports are present +- [ ] Remove any deprecated imports + +--- + +## NPUARModelRunner (npu_ar_model_runner.py) + +### Read and Understand +- [ ] Read current `npu_ar_model_runner.py` +- [ ] Read latest `vllm_ascend/worker/model_runner_v1.py` execute_model +- [ ] Read latest `vllm_omni/worker/gpu_ar_model_runner.py` + +### Method: __init__ +- [ ] Sync any new initialization from GPU side +- [ ] Keep: `OmniKVTransferManager` setup +- [ ] Keep: Custom buffer allocations + +### Method: execute_model +- [ ] Document all omni blocks with line numbers +- [ ] Copy latest NPUModelRunner.execute_model structure +- [ ] Re-insert: KV transfer handling (beginning) +- [ ] Re-insert: Custom `_update_states` call +- [ ] Re-insert: `extract_multimodal_outputs` +- [ ] Re-insert: `compute_logits` with sampling_metadata try/except +- [ ] Update: ExecuteModelState to include multimodal_outputs + +### Method: sample_tokens +- [ ] Document all omni blocks +- [ ] Copy latest NPUModelRunner.sample_tokens structure +- [ ] Re-insert: `kv_extracted_req_ids` handling +- [ ] Re-insert: Hidden states CPU copy +- [ ] Re-insert: 
`_process_additional_information_updates` +- [ ] Re-insert: `OmniModelRunnerOutput` construction + +### ExecuteModelState +- [ ] Verify: `multimodal_outputs` field is present +- [ ] Verify: Imported/used correctly in execute_model + +### Imports +- [ ] Update all vllm-ascend imports +- [ ] Keep omni-specific imports + +--- + +## NPUGenerationModelRunner (npu_generation_model_runner.py) + +### Read and Understand +- [ ] Read current `npu_generation_model_runner.py` +- [ ] Read latest GPU `gpu_generation_model_runner.py` + +### Method: _update_request_states +- [ ] Verify: async_chunk handling is correct +- [ ] Sync any changes from GPU side + +### Method: execute_model +- [ ] Document all omni blocks +- [ ] Copy latest NPUModelRunner.execute_model base structure +- [ ] Re-insert: async_chunk update logic +- [ ] Re-insert: `seq_token_counts` injection +- [ ] Re-insert: `_run_generation_model` call +- [ ] Re-insert: `extract_multimodal_outputs` +- [ ] Use: ExecuteModelState from npu_ar_model_runner + +### Method: sample_tokens +- [ ] Keep: Entire omni multimodal output processing +- [ ] Update: Any new output fields needed +- [ ] Keep: `OmniModelRunnerOutput` construction + +### Method: _run_generation_model +- [ ] Sync any changes from GPU side +- [ ] Keep: `_model_forward` call with sampler + +### Method: _dummy_run +- [ ] Copy latest NPUModelRunner._dummy_run +- [ ] Re-insert: `model_kwargs = self._init_model_kwargs()` +- [ ] Re-insert: `extract_multimodal_outputs` at end + +### Imports +- [ ] Import ExecuteModelState from npu_ar_model_runner +- [ ] Update vllm-ascend imports + +--- + +## Post-Upgrade Validation + +### Syntax Validation +- [ ] `python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py` +- [ ] `python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py` +- [ ] `python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py` + +### Import Validation +- [ ] `python -c "from 
vllm_omni.platforms.npu.worker.npu_model_runner import OmniNPUModelRunner"` +- [ ] `python -c "from vllm_omni.platforms.npu.worker.npu_ar_model_runner import NPUARModelRunner"` +- [ ] `python -c "from vllm_omni.platforms.npu.worker.npu_generation_model_runner import NPUGenerationModelRunner"` + +### Comment Markers +- [ ] Grep for "Omni-new" in all three files +- [ ] Verify all omni blocks have closing markers + +### Code Review +- [ ] No `CUDAGraphWrapper` references +- [ ] All `set_forward_context` replaced with `set_ascend_forward_context` +- [ ] Parameter names correct (`aclgraph_runtime_mode` not `cudagraph_runtime_mode`) +- [ ] No duplicate code blocks +- [ ] No missing imports + +--- + +## Git Commit + +### Commit Message Template +``` +[NPU] Upgrade model runners to align with vllm-ascend vX.Y.Z + +- Update OmniNPUModelRunner with latest NPUModelRunner base +- Update NPUARModelRunner execute_model and sample_tokens +- Update NPUGenerationModelRunner for async_chunk changes +- Sync GPU-side omni changes from vX.Y.Z release +- Preserve all omni-specific logic (marked with Omni-new comments) + +Changes from vllm-ascend: +- + +Changes synced from GPU: +- +``` + +### Files to Stage +- [ ] `vllm_omni/platforms/npu/worker/npu_model_runner.py` +- [ ] `vllm_omni/platforms/npu/worker/npu_ar_model_runner.py` +- [ ] `vllm_omni/platforms/npu/worker/npu_generation_model_runner.py` +- [ ] Any other modified files + +--- + +## Troubleshooting + +### Import Errors +- Check if vllm-ascend module paths have changed +- Verify PYTHONPATH includes both vllm-ascend and vllm-omni + +### Type Errors +- Check method signatures match between GPU and NPU +- Verify NamedTuple fields match expected structure + +### Runtime Errors +- Enable debug logging: `export VLLM_LOGGING_LEVEL=DEBUG` +- Check graph capture issues: try `--enforce-eager` +- Check attention issues: verify AscendAttentionState usage + +### Performance Regression +- Compare with previous version on same model +- Check if 
graph capture is working: look for ACLGraph logs +- Verify SP/EP configurations are correct
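
### Marker Pairing Check (sketch)

The "Comment Markers" step under Post-Upgrade Validation asks that every omni block has a closing marker. A minimal sketch of that check follows, assuming each block is opened and closed by a comment line containing `Omni-new`, as in most snippets in the reference document. The helper names `find_unpaired_marker` and `check_files` are hypothetical. Blocks that close with a plain dashed line (as one `_dummy_run` snippet does) and single-line `# [Omni]` comments do not follow that convention, so treat a report as a prompt for manual review.

```python
from pathlib import Path
from typing import Optional

MARKER = "Omni-new"


def find_unpaired_marker(source: str) -> Optional[int]:
    """Return the line number of an opening marker left unclosed, else None."""
    open_line = None
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#") and MARKER in stripped:
            # Markers alternate: the first opens a block, the next closes it.
            open_line = lineno if open_line is None else None
    return open_line


def check_files(paths):
    """Map each file path to the line of its unclosed marker, if it has one."""
    problems = {}
    for path in paths:
        line = find_unpaired_marker(Path(path).read_text(encoding="utf-8"))
        if line is not None:
            problems[str(path)] = line
    return problems
```

Passing the three runner files listed under "Files to Stage" to `check_files` should return an empty dict when every `Omni-new` block is closed with a matching marker.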