diff --git a/.claude/skills/vllm-omni-npu-upgrade/SKILL.md b/.claude/skills/vllm-omni-npu-upgrade/SKILL.md
new file mode 100644
index 00000000000..1ef7ab39301
--- /dev/null
+++ b/.claude/skills/vllm-omni-npu-upgrade/SKILL.md
@@ -0,0 +1,300 @@
+---
+name: vllm-omni-npu-model-runner-upgrade
+description: "Upgrade vllm-omni NPU model runners (OmniNPUModelRunner, NPUARModelRunner, NPUGenerationModelRunner) to align with the latest vllm-ascend NPUModelRunner while preserving omni-specific logic."
+---
+
+# vLLM-Omni NPU Model Runner Upgrade Skill
+
+## Overview
+
+This skill guides upgrading vllm-omni's NPU model runners to align with the latest vllm-ascend codebase while preserving omni-specific enhancements. The NPU runners run omni multimodal models (such as Qwen3-Omni, Bagel, and MiMoAudio) on Ascend NPUs.
+
+## File Structure
+
+### NPU Model Runner Files
+```
+vllm-omni/vllm_omni/platforms/npu/worker/
+├── __init__.py
+├── npu_model_runner.py           # OmniNPUModelRunner (base class)
+├── npu_ar_model_runner.py        # NPUARModelRunner (autoregressive)
+├── npu_ar_worker.py              # AR worker
+├── npu_generation_model_runner.py # NPUGenerationModelRunner (diffusion/non-AR)
+└── npu_generation_worker.py      # Generation worker
+```
+
+### GPU Reference Files (for omni-specific logic sync)
+```
+vllm-omni/vllm_omni/worker/
+├── __init__.py
+├── gpu_model_runner.py           # OmniGPUModelRunner
+├── gpu_ar_model_runner.py        # GPUARModelRunner
+├── gpu_ar_worker.py
+├── gpu_generation_model_runner.py
+├── gpu_generation_worker.py
+├── mixins.py
+├── base.py
+└── gpu_memory_utils.py
+```
+
+### vllm-ascend Reference Files
+```
+vllm-ascend/vllm_ascend/worker/
+├── model_runner_v1.py            # NPUModelRunner (base class to copy from)
+├── npu_input_batch.py
+├── block_table.py
+├── pcp_utils.py
+└── worker.py
+```
+
+## Inheritance Hierarchy
+
+```
+                    GPUModelRunner (vllm)
+                         |
+        +----------------+----------------+
+        |                                 |
+  OmniGPUModelRunner              NPUModelRunner (vllm-ascend)
+  (vllm_omni/worker)              (vllm_ascend/worker)
+        |                                 |
+        +----------- OmniNPUModelRunner --+
+                     (multiple inheritance)
+                            |
+            +---------------+---------------+
+            |                               |
+    NPUARModelRunner            NPUGenerationModelRunner
+    (autoregressive)            (non-autoregressive/diffusion)
+```
+
+## Omni-Specific Comment Markers
+
+Omni-specific logic is marked with comment blocks:
+```python
+# -------------------------------------- Omni-new -------------------------------------------------
+# ... omni-specific code ...
+# -------------------------------------- Omni-new -------------------------------------------------
+```
+
+Or a simpler variation, where the closing marker is a plain rule:
+```python
+#  -------------------------------------- Omni-new -------------------------------------------------
+#  ------------------------------------------------------------------------------------------------
+```
+
+**Important**:
+- Always preserve and add these markers when modifying code.
+- **The reference documents (`references/omni-specific-blocks.md`) may not be up-to-date.** Always grep for `Omni-new` in the GPU implementations to find the authoritative list of omni-specific blocks.
+- When you discover new omni-specific code that is not documented in the references, please update the reference files.
+
+## Key Methods Requiring Attention
+
+### OmniNPUModelRunner (npu_model_runner.py)
+
+| Method | Description | Omni-Specific Logic |
+|--------|-------------|---------------------|
+| `load_model` | Load model and initialize talker_mtp | Uses `ACLGraphWrapper` instead of `CUDAGraphWrapper`, initializes talker buffers |
+| `_dummy_run` | Warmup/profiling run | talker_mtp dummy forward, `extract_multimodal_outputs` |
+| `_model_forward` | Forward pass wrapper | Injects `model_kwargs_extra`, wraps with `OmniOutput`, NPU-specific graph updates |
+| `_talker_mtp_forward` | Talker MTP forward for Qwen3-Omni | Uses `set_ascend_forward_context` |
+
+### NPUARModelRunner (npu_ar_model_runner.py)
+
+| Method | Description | Omni-Specific Logic |
+|--------|-------------|---------------------|
+| `__init__` | Initialize with KV transfer manager | `OmniKVTransferManager` setup |
+| `execute_model` | Main inference entry | KV transfer handling, `_update_states` override, `extract_multimodal_outputs` |
+| `sample_tokens` | Token sampling | Hidden states extraction, multimodal outputs processing, `OmniModelRunnerOutput` |
+| `_resolve_global_request_id` | Request ID resolution | For disaggregated inference |
+
+### NPUGenerationModelRunner (npu_generation_model_runner.py)
+
+| Method | Description | Omni-Specific Logic |
+|--------|-------------|---------------------|
+| `_update_request_states` | Update request states for async chunk | async_chunk handling |
+| `execute_model` | Generation forward | async_chunk, `seq_token_counts`, `_run_generation_model` |
+| `sample_tokens` | Output processing | multimodal output packaging to `OmniModelRunnerOutput` |
+| `_dummy_run` | Dummy run override | model_kwargs initialization, multimodal extraction |
+| `_run_generation_model` | Run generation model | Calls `_model_forward` with sampler |
+
+## Upgrade Workflow
+
+### Step 1: Preparation
+
+1. **Identify target versions** (use the `gh` CLI to check):
+   - vllm-omni: the `main` branch
+   - Check the latest vllm-omni release
+   - vllm-ascend: use the latest local checkout directly
+
+2. **Check GPU-side changes** (since last release):
+   ```bash
+   cd /root/vllm-workspace/vllm-omni
+   git log --oneline --since="<last-release-date>" -- vllm_omni/worker/
+   ```
+
+3. **Read latest vllm-ascend code**:
+   - vllm-ascend changes are not tracked incrementally; use the latest code from `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py` directly
+   - Copy the relevant methods and re-insert omni-specific blocks
+
+### Step 2: Analyze Omni-Specific Logic
+
+For each NPU model runner file:
+
+1. **Extract existing omni-specific blocks**:
+   ```bash
+   grep -n "Omni-new" vllm_omni/platforms/npu/worker/npu_model_runner.py
+   ```
+
+2. **Document each omni block**:
+   - Which method it belongs to
+   - What functionality it provides
+   - Dependencies on other omni code
+
+### Step 3: Update Base Class (OmniNPUModelRunner)
+
+**Note**: Always check the GPU implementation `gpu_model_runner.py` for any new omni logic not yet documented in references.
+
+1. **Read the latest vllm-ascend `NPUModelRunner.load_model`**
+2. **Copy the method, keeping the structure**
+3. **Re-insert omni-specific logic** (check GPU `gpu_model_runner.py` for authoritative list):
+   - Replace `CUDAGraphWrapper` with `ACLGraphWrapper`
+   - Keep talker_mtp initialization
+   - Preserve buffer allocations for talker
+   - Check for any new omni blocks added since last sync
+
+4. **Update `_dummy_run`**:
+   - Copy from vllm-ascend
+   - Compare with GPU `_dummy_run` for omni-specific blocks
+   - Re-insert all `Omni-new` marked code from GPU version
+
+5. **Update `_model_forward`**:
+   - Keep the omni wrapper logic
+   - Update NPU-specific parts (graph params, SP all-gather)
+   - Check GPU version for any new omni logic
+
+### Step 4: Update AR Model Runner
+
+1. **Compare with GPU `gpu_ar_model_runner.py`** for any new omni features
+2. **Copy `execute_model` from vllm-ascend**
+3. **Re-insert omni blocks** (reference `references/omni-specific-blocks.md`, but note it may be incomplete):
+   - **IMPORTANT**: Always check the GPU implementation `gpu_ar_model_runner.py` for all `Omni-new` marked code blocks
+   - The reference doc may not include newly added omni logic - treat it as a starting point, not exhaustive
+   - When discovering new omni code blocks, please update `references/omni-specific-blocks.md`
+   - Common omni blocks include KV transfer, multimodal outputs, and `sampling_metadata` handling; this list is not exhaustive
+
+4. **Update `sample_tokens`** (also compare with GPU implementation):
+   - Compare with `gpu_ar_model_runner.py`'s `sample_tokens` method
+   - Identify all `Omni-new` marked code blocks
+   - Ensure NPU version includes all omni-specific logic
+
+### Step 5: Update Generation Model Runner
+
+**Note**: The generation model runner may have unique omni logic for diffusion/non-AR models.
+
+1. **Compare with GPU `gpu_generation_model_runner.py`** - grep for all `Omni-new` blocks
+2. **Update `execute_model`**:
+   - Check GPU version for all omni-specific blocks
+   - Keep async_chunk handling
+   - Keep `seq_token_counts` injection
+   - Update forward/context setup from vllm-ascend
+   - Look for any new omni logic not documented in references
+
+3. **Update `_dummy_run`**:
+   - Copy from vllm-ascend base
+   - Compare with the GPU `_dummy_run` if one exists
+   - Re-insert all omni-specific logic
+
+### Step 6: Update Imports
+
+Check and update imports at the top of each file:
+
+```python
+# Common vllm-ascend imports
+from vllm_ascend.ascend_forward_context import get_forward_context, set_ascend_forward_context
+from vllm_ascend.attention.attention_v1 import AscendAttentionState
+from vllm_ascend.attention.utils import using_paged_attention
+from vllm_ascend.compilation.acl_graph import ACLGraphWrapper, update_full_graph_params
+from vllm_ascend.ops.rotary_embedding import update_cos_sin
+from vllm_ascend.utils import enable_sp, lmhead_tp_enable
+from vllm_ascend.worker.model_runner_v1 import SEQ_LEN_WITH_MAX_PA_WORKSPACE, NPUModelRunner
+
+# Omni-specific imports
+from vllm_omni.model_executor.models.output_templates import OmniOutput
+from vllm_omni.worker.gpu_model_runner import OmniGPUModelRunner
+from vllm_omni.outputs import OmniModelRunnerOutput
+from vllm_omni.distributed.omni_connectors.kv_transfer_manager import OmniKVTransferManager
+```
+
+### Step 7: Sync GPU-Side Omni Changes
+
+1. **Check recent GPU worker changes**:
+   ```bash
+   git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_model_runner.py
+   git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_ar_model_runner.py
+   ```
+
+2. **Identify new omni features** that need to be ported to NPU
+
+3. **Apply corresponding changes** to NPU runners
+
+### Step 8: Validation
+
+1. **Run a syntax check** (`py_compile` only verifies that the files compile; it does not check types):
+   ```bash
+   cd /root/vllm-workspace/vllm-omni
+   python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py
+   python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py
+   python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py
+   ```
+
+2. **Run import test**:
+   ```bash
+   python -c "from vllm_omni.platforms.npu.worker import *"
+   ```
+
+3. **Run a model serving test** (if hardware is available):
+   ```bash
+   vllm serve <model-path> --trust-remote-code
+   ```
+
+## Common Pitfalls
+
+### 1. Forward Context Differences
+- GPU uses `set_forward_context`
+- NPU uses `set_ascend_forward_context`
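+- Parameters differ: the NPU version takes `aclgraph_runtime_mode` instead of `cudagraph_runtime_mode`, plus `num_actual_tokens` and `model_instance` (see `references/gpu-to-npu-translation.md`)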
+
+### 2. Graph Wrapper Differences
+- GPU: `CUDAGraphWrapper`
+- NPU: `ACLGraphWrapper`
+- Constructor parameters may differ
+
+### 3. Buffer Creation
+- GPU: `_make_buffer` may return a different buffer structure
+- NPU: may need an explicit `numpy=True` or `numpy=False` argument (the talker buffers pass `numpy=False`)
+
+### 4. Attention Metadata
+- GPU: Uses vllm attention metadata builders
+- NPU: Uses `AscendCommonAttentionMetadata`
+
+### 5. Sampling
+- GPU: Uses vllm sampler
+- NPU: Uses `AscendSampler`
+
+## Checklist Before Commit
+
+- [ ] All omni-specific comment markers preserved
+- [ ] New omni logic from GPU side synced
+- [ ] Imports updated to latest vllm-ascend
+- [ ] No `CUDAGraphWrapper` references in NPU code
+- [ ] `set_ascend_forward_context` used instead of `set_forward_context`
+- [ ] `ACLGraphWrapper` used for talker_mtp wrapping
+- [ ] Type hints match vllm-ascend signatures
+- [ ] No duplicate code blocks
+- [ ] Python syntax valid (py_compile passes)
+
+## Reference Files for Comparison
+
+When upgrading, keep these files open for reference:
+
+1. **vllm-ascend NPUModelRunner**: `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py`
+2. **vllm GPUModelRunner**: `/root/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py`
+3. **vllm-omni OmniGPUModelRunner**: `/root/vllm-workspace/vllm-omni/vllm_omni/worker/gpu_model_runner.py`
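
The "Checklist Before Commit" in SKILL.md above can be partly automated. The following is a minimal sketch and not part of the skill itself: `check_npu_source` and `FORBIDDEN_ON_NPU` are hypothetical names introduced here, and only three checklist items are covered (syntax validity, no `CUDAGraphWrapper` or `set_forward_context` identifiers in code, and a count of `Omni-new` marker lines). It uses only the Python standard library and compares identifier tokens, so mentions inside comments are not flagged.

```python
import io
import tokenize

# Hypothetical helper data: identifiers the checklist says must not appear in NPU code.
FORBIDDEN_ON_NPU = ("CUDAGraphWrapper", "set_forward_context")


def check_npu_source(source: str) -> dict:
    """Return checklist findings for the text of one NPU runner file."""
    # In-memory equivalent of `python -m py_compile`: compile without executing.
    try:
        compile(source, "<npu-runner>", "exec")
        syntax_ok = True
    except SyntaxError:
        syntax_ok = False

    # Collect identifier tokens only; comments and strings are separate token
    # types, so a comment that mentions CUDAGraphWrapper is not counted.
    names = set()
    if syntax_ok:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME:
                names.add(tok.string)

    forbidden = sorted(name for name in FORBIDDEN_ON_NPU if name in names)

    # Count comment lines carrying the Omni-new marker, whatever the dash width.
    omni_markers = sum(
        1
        for line in source.splitlines()
        if line.lstrip().startswith("#") and "Omni-new" in line
    )
    return {
        "syntax_ok": syntax_ok,
        "forbidden": forbidden,
        "omni_markers": omni_markers,
    }
```

To apply it to a runner, read the file text first, for example `check_npu_source(Path("vllm_omni/platforms/npu/worker/npu_model_runner.py").read_text())`. Comparing `omni_markers` before and after an upgrade gives a rough signal that no marked block was dropped; it does not replace grepping the GPU implementations.
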
diff --git a/.claude/skills/vllm-omni-npu-upgrade/references/gpu-to-npu-translation.md b/.claude/skills/vllm-omni-npu-upgrade/references/gpu-to-npu-translation.md
new file mode 100644
index 00000000000..89067d37b2d
--- /dev/null
+++ b/.claude/skills/vllm-omni-npu-upgrade/references/gpu-to-npu-translation.md
@@ -0,0 +1,335 @@
+# GPU to NPU Translation Patterns
+
+This document provides a quick reference for translating GPU code patterns to NPU equivalents when porting omni-specific logic.
+
+## Import Translations
+
+### Forward Context
+```python
+# GPU
+from vllm.forward_context import set_forward_context
+
+# NPU
+from vllm_ascend.ascend_forward_context import set_ascend_forward_context
+```
+
+### Graph Wrapper
+```python
+# GPU
+from vllm.compilation.cuda_graph import CUDAGraphWrapper
+
+# NPU
+from vllm_ascend.compilation.acl_graph import ACLGraphWrapper
+```
+
+### Attention State
+```python
+# GPU (no equivalent - uses FlashAttention states directly)
+
+# NPU
+from vllm_ascend.attention.attention_v1 import AscendAttentionState
+```
+
+### Utilities
+```python
+# GPU
+# (directly use torch.cuda functions)
+
+# NPU
+from vllm_ascend.utils import enable_sp, lmhead_tp_enable
+from vllm_ascend.ops.rotary_embedding import update_cos_sin
+```
+
+## Context Manager Translations
+
+### Forward Context Setup
+```python
+# GPU
+with set_forward_context(
+    attn_metadata,
+    self.vllm_config,
+    num_tokens=num_tokens_padded,
+    num_tokens_across_dp=num_tokens_across_dp,
+    cudagraph_runtime_mode=cudagraph_mode,
+    batch_descriptor=batch_desc,
+):
+    # forward pass
+
+# NPU
+with set_ascend_forward_context(
+    attn_metadata,
+    self.vllm_config,
+    num_tokens=num_tokens_padded,
+    num_tokens_across_dp=num_tokens_across_dp,
+    aclgraph_runtime_mode=cudagraph_mode,  # Note: 'aclgraph' not 'cudagraph'
+    batch_descriptor=batch_desc,
+    num_actual_tokens=scheduler_output.total_num_scheduled_tokens,
+    model_instance=self.model,
+):
+    # forward pass
+```
+
+### Graph Capture Context
+```python
+# GPU
+from vllm.distributed.parallel_state import graph_capture as cuda_graph_capture
+with cuda_graph_capture(self.device):
+    # capture
+
+# NPU
+from vllm_ascend.worker.model_runner_v1 import graph_capture
+with graph_capture(self.device):
+    # capture
+```
+
+## Graph Wrapper Usage
+
+### Creating Graph Wrapper
+```python
+# GPU
+if cudagraph_mode.has_full_cudagraphs() and has_separate_talker:
+    self.talker_mtp = CUDAGraphWrapper(
+        talker_mtp,
+        self.vllm_config,
+        runtime_mode=CUDAGraphMode.FULL
+    )
+
+# NPU
+if cudagraph_mode.has_full_cudagraphs() and has_separate_talker:
+    self.talker_mtp = ACLGraphWrapper(
+        talker_mtp,
+        self.vllm_config,
+        runtime_mode=CUDAGraphMode.FULL
+    )
+```
+
+### Checking Graph Wrapper Type
+```python
+# GPU
+if not isinstance(self.talker_mtp, CUDAGraphWrapper):
+    _cudagraph_mode = CUDAGraphMode.NONE
+
+# NPU
+if not isinstance(self.talker_mtp, ACLGraphWrapper):
+    _cudagraph_mode = CUDAGraphMode.NONE
+```
+
+## Device Operations
+
+### Synchronization
+```python
+# GPU
+torch.cuda.synchronize()
+
+# NPU
+torch.npu.synchronize()
+```
+
+### Stream Operations
+```python
+# GPU
+stream = torch.cuda.Stream(device=device)
+torch.cuda.current_stream()
+
+# NPU
+stream = torch.npu.Stream(device=device)
+torch.npu.current_stream()
+```
+
+## Attention Metadata
+
+### State Setting (NPU-specific)
+```python
+# GPU - handled internally by attention backends
+
+# NPU - explicit state setting required
+self.attn_state = AscendAttentionState.DecodeOnly
+if self.speculative_config and self.speculative_config.method == "mtp":
+    if self.vllm_config.model_config.use_mla:
+        self.attn_state = AscendAttentionState.SpecDecoding
+    else:
+        self.attn_state = AscendAttentionState.ChunkedPrefill
+```
+
+### Building Attention Metadata
+```python
+# GPU - uses vllm attention builders
+
+# NPU - may need additional parameters
+(attn_metadata, spec_decode_common_attn_metadata) = self._build_attention_metadata(
+    num_tokens=num_tokens_unpadded,
+    num_tokens_padded=num_tokens_padded,
+    num_reqs=num_reqs,
+    num_reqs_padded=num_reqs_padded,
+    max_query_len=max_num_scheduled_tokens,
+    ubatch_slices=ubatch_slices_attn,
+    logits_indices=logits_indices,
+    use_spec_decode=use_spec_decode,
+    num_scheduled_tokens=scheduler_output.num_scheduled_tokens,
+    num_scheduled_tokens_np=num_scheduled_tokens_np,
+    cascade_attn_prefix_lens=cascade_attn_prefix_lens,
+)
+```
+
+## Rotary Embedding
+
+### Update Cos/Sin Cache
+```python
+# GPU - typically handled inside attention
+
+# NPU - explicit update required before forward
+from vllm_ascend.ops.rotary_embedding import update_cos_sin
+update_cos_sin(positions)
+```
+
+## Sequence Parallelism
+
+### Enable SP Check
+```python
+# GPU - use vllm distributed utilities
+
+# NPU - use vllm-ascend wrapper
+from vllm_ascend.utils import enable_sp
+
+if enable_sp():
+    # sequence parallelism enabled
+```
+
+## Sampler
+
+### Sampler Type
+```python
+# GPU - uses vllm sampler
+self.sampler = Sampler()
+
+# NPU - uses AscendSampler
+from vllm_ascend.sample.sampler import AscendSampler
+self.sampler = AscendSampler()
+```
+
+## Input Batch
+
+### Batch Class
+```python
+# GPU
+from vllm.v1.worker.gpu_input_batch import InputBatch
+
+# NPU
+from vllm_ascend.worker.npu_input_batch import NPUInputBatch
+```
+
+## Graph Parameter Updates
+
+### Full Graph Params Update (NPU-specific)
+```python
+# GPU - not needed
+
+# NPU - required for FULL graph mode
+from vllm_ascend.compilation.acl_graph import update_full_graph_params
+
+forward_context = get_forward_context()
+if (
+    forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL
+    and not forward_context.capturing
+    and not self.use_sparse
+):
+    update_full_graph_params(
+        self.attn_backend,
+        self.update_stream,
+        forward_context,
+        num_tokens_padded,
+        self.vllm_config,
+        self.speculative_config,
+        positions.shape[0],
+    )
+```
+
+## Paged Attention Check
+
+```python
+# GPU - not typically needed
+
+# NPU
+from vllm_ascend.attention.utils import using_paged_attention
+
+if is_graph_capturing and using_paged_attention(num_tokens, self.vllm_config):
+    seq_lens = SEQ_LEN_WITH_MAX_PA_WORKSPACE
+```
+
+## Common Method Signature Differences
+
+### _dummy_run Parameters
+```python
+# GPU (v0.17.0)
+def _dummy_run(
+    self,
+    num_tokens: int,
+    cudagraph_runtime_mode: CUDAGraphMode | None = None,
+    force_attention: bool = False,
+    uniform_decode: bool = False,
+    allow_microbatching: bool = True,
+    skip_eplb: bool = False,
+    is_profile: bool = False,
+    create_mixed_batch: bool = False,
+    remove_lora: bool = True,
+    is_graph_capturing: bool = False,
+    num_active_loras: int = 0,
+) -> tuple[torch.Tensor, torch.Tensor]:
+
+# NPU (v0.17.0) - adds with_prefill, activate_lora
+def _dummy_run(
+    self,
+    num_tokens: int,
+    with_prefill: bool = False,
+    cudagraph_runtime_mode: CUDAGraphMode | None = None,
+    force_attention: bool = False,
+    uniform_decode: bool = False,
+    is_profile: bool = False,
+    create_mixed_batch: bool = False,
+    allow_microbatching: bool = True,
+    skip_eplb: bool = False,
+    remove_lora: bool = True,
+    activate_lora: bool = False,
+    is_graph_capturing: bool = False,
+    num_active_loras: int = 0,
+) -> tuple[torch.Tensor, torch.Tensor]:
+```
+
+### _model_forward Parameters
+```python
+# GPU - no num_tokens_padded
+def _model_forward(
+    self,
+    input_ids: torch.Tensor | None = None,
+    positions: torch.Tensor | None = None,
+    intermediate_tensors: IntermediateTensors | None = None,
+    inputs_embeds: torch.Tensor | None = None,
+    **model_kwargs: dict[str, Any],
+):
+
+# NPU - has num_tokens_padded as first parameter
+def _model_forward(
+    self,
+    num_tokens_padded: int,
+    input_ids: torch.Tensor | None = None,
+    positions: torch.Tensor | None = None,
+    intermediate_tensors: IntermediateTensors | None = None,
+    inputs_embeds: torch.Tensor | None = None,
+    **model_kwargs: dict[str, Any],
+):
+```
+
+## Quick Reference Table
+
+| Feature | GPU | NPU |
+|---------|-----|-----|
+| Graph wrapper | `CUDAGraphWrapper` | `ACLGraphWrapper` |
+| Forward context | `set_forward_context` | `set_ascend_forward_context` |
+| Runtime mode param | `cudagraph_runtime_mode` | `aclgraph_runtime_mode` |
+| Device sync | `torch.cuda.synchronize()` | `torch.npu.synchronize()` |
+| Stream | `torch.cuda.Stream` | `torch.npu.Stream` |
+| Current stream | `torch.cuda.current_stream()` | `torch.npu.current_stream()` |
+| Input batch | `InputBatch` | `NPUInputBatch` |
+| Sampler | `Sampler` | `AscendSampler` |
+| Attention state | N/A | `AscendAttentionState` |
+| RoPE update | N/A | `update_cos_sin()` |
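
The Quick Reference Table above can double as a lookup when porting a GPU snippet. The following is a minimal sketch under stated assumptions: `GPU_TO_NPU` and `suggest_npu_names` are hypothetical names introduced here, and the mapping deliberately omits `cudagraph_runtime_mode`, because that name still appears legitimately in the NPU `_dummy_run` signature. Comments are stripped naively at the first `#`, so a `#` inside a string literal would truncate that line.

```python
import re

# Subset of the Quick Reference Table with unambiguous GPU -> NPU replacements.
GPU_TO_NPU = {
    "CUDAGraphWrapper": "ACLGraphWrapper",
    "set_forward_context": "set_ascend_forward_context",
    "torch.cuda.synchronize": "torch.npu.synchronize",
    "torch.cuda.Stream": "torch.npu.Stream",
    "torch.cuda.current_stream": "torch.npu.current_stream",
    "InputBatch": "NPUInputBatch",
    "Sampler": "AscendSampler",
}

# Match whole identifiers only, so NPUInputBatch and AscendSampler are not
# mistaken for their GPU counterparts.
_PATTERNS = {
    gpu: re.compile(r"(?<!\w)" + re.escape(gpu) + r"(?!\w)") for gpu in GPU_TO_NPU
}


def suggest_npu_names(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, GPU name, suggested NPU name) for each hit."""
    suggestions = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive comment stripping
        for gpu, pattern in _PATTERNS.items():
            if pattern.search(code):
                suggestions.append((lineno, gpu, GPU_TO_NPU[gpu]))
    return suggestions
```

The output is a list of suggestions rather than an automatic rewrite, since constructor and context-manager parameters can differ between the GPU and NPU variants and still need a manual check.
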
diff --git a/.claude/skills/vllm-omni-npu-upgrade/references/omni-specific-blocks.md b/.claude/skills/vllm-omni-npu-upgrade/references/omni-specific-blocks.md
new file mode 100644
index 00000000000..8c5d32ab4c1
--- /dev/null
+++ b/.claude/skills/vllm-omni-npu-upgrade/references/omni-specific-blocks.md
@@ -0,0 +1,374 @@
+# Omni-Specific Code Blocks Reference
+
+This document catalogs omni-specific code blocks in the NPU model runners, making it easier to identify what needs to be preserved during upgrades.
+
+> **IMPORTANT**: This document may not be complete or up-to-date!
+>
+> - Always grep for `Omni-new` in the GPU implementations (`vllm_omni/worker/`) to find the authoritative list
+> - New omni features may be added that are not yet documented here
+> - When you discover new omni-specific blocks during an upgrade, please update this document
+> - Last verified: Check git history for this file
+
+## OmniNPUModelRunner (npu_model_runner.py)
+
+### load_model - Talker MTP Initialization
+
+```python
+def load_model(self, *args, **kwargs) -> None:
+    NPUModelRunner.load_model(self, *args, **kwargs)
+    # Initialize enable_sp cache to avoid get_current_vllm_config() error
+    # in _pad_for_sequence_parallelism during execute_model.
+    # This is a workaround for vllm-ascend not passing vllm_config to enable_sp().
+    enable_sp(self.vllm_config)
+    # TODO move this model specific logic to a separate class
+    # TTS model IS the talker (no .talker sub-attr); use getattr to support both Omni and TTS.
+    talker_mtp = getattr(self.model, "talker_mtp", None)
+    if talker_mtp is not None:
+        self.talker_mtp = talker_mtp  # type: ignore[assignment]
+        cudagraph_mode = self.compilation_config.cudagraph_mode
+        assert cudagraph_mode is not None
+        # Only wrap talker_mtp in a graph wrapper (ACLGraphWrapper on NPU) for Omni models that
+        # have a separate .talker sub-module.  TTS models' code predictor
+        # has internal AR loops / torch.multinomial — not graph-safe.
+        has_separate_talker = getattr(self.model, "talker", None) is not None
+        if cudagraph_mode.has_full_cudagraphs() and has_separate_talker:
+            # NOTE: Use ACLGraphWrapper on NPU, not CUDAGraphWrapper
+            self.talker_mtp = ACLGraphWrapper(talker_mtp, self.vllm_config, runtime_mode=CUDAGraphMode.FULL)
+        # TTS exposes mtp_hidden_size; Omni uses hf_text_config.hidden_size.
+        hidden_size = int(
+            getattr(self.model, "mtp_hidden_size", 0) or getattr(self.model_config.hf_text_config, "hidden_size")
+        )
+        max_batch_size = max(self.max_num_reqs, self.compilation_config.max_cudagraph_capture_size)
+        self.talker_mtp_input_ids = self._make_buffer(max_batch_size, dtype=torch.int32)
+        self.talker_mtp_inputs_embeds = self._make_buffer(
+            max_batch_size, hidden_size, dtype=self.dtype, numpy=False
+        )
+        self.last_talker_hidden = self._make_buffer(max_batch_size, hidden_size, dtype=self.dtype, numpy=False)
+        self.text_step = self._make_buffer(max_batch_size, hidden_size, dtype=self.dtype, numpy=False)
+```
+
+### _dummy_run - Talker MTP Dummy Forward
+
+Location: Inside `set_ascend_forward_context` block, before main model forward
+
+```python
+# ---------------------------------------Omni-new----------------------------------------------
+if getattr(self.model, "talker", None) is not None and hasattr(self.model, "talker_mtp"):
+    num_tokens_padded_talker_mtp = num_tokens_padded
+    if num_tokens_padded_talker_mtp == self.max_num_tokens:
+        num_tokens_padded_talker_mtp = self.talker_mtp_input_ids.gpu.shape[0]
+    outputs = self.talker_mtp(
+        self.talker_mtp_input_ids.gpu[:num_tokens_padded_talker_mtp],
+        self.talker_mtp_inputs_embeds.gpu[:num_tokens_padded_talker_mtp],
+        self.last_talker_hidden.gpu[:num_tokens_padded_talker_mtp],
+        self.text_step.gpu[:num_tokens_padded_talker_mtp],
+    )
+    self.compilation_config.cache_dir = None
+# ---------------------------------------Omni-new----------------------------------------------
+```
+
+### _dummy_run - Extract Multimodal Outputs
+
+Location: After model forward, before dummy_compute_logits
+
+```python
+# ---------------------------------------Omni-new----------------------------------------------
+hidden_states, multimodal_outputs = self.extract_multimodal_outputs(hidden_states)
+# ---------------------------------------Omni-new----------------------------------------------
+```
+
+### _model_forward - Omni Output Wrapping
+
+```python
+def _model_forward(
+    self,
+    num_tokens_padded: int,
+    input_ids: torch.Tensor | None = None,
+    positions: torch.Tensor | None = None,
+    intermediate_tensors: IntermediateTensors | None = None,
+    inputs_embeds: torch.Tensor | None = None,
+    **model_kwargs: dict[str, Any],
+):
+    """Override to combine NPUModelRunner's signature with OmniGPUModelRunner's logic."""
+    # Omni-specific: build and inject extra model kwargs
+    model_kwargs_extra = self._build_model_kwargs_extra()
+
+    # Call the model forward (same as NPUModelRunner)
+    assert self.model is not None
+    model_output = self.model(
+        input_ids=input_ids,
+        positions=positions,
+        intermediate_tensors=intermediate_tensors,
+        inputs_embeds=inputs_embeds,
+        **model_kwargs,
+        **model_kwargs_extra,
+    )
+
+    # Omni-specific: wrap output if needed
+    if not isinstance(model_output, OmniOutput) and hasattr(self.model, "make_omni_output"):
+        model_output = self.model.make_omni_output(model_output, **model_kwargs_extra)
+
+    # Omni-specific: cache model output for later sample_tokens
+    self._omni_last_model_output = model_output
+
+    # NPU-specific: update full graph params (keep from vllm-ascend)
+    forward_context = get_forward_context()
+    # ... NPU graph update logic ...
+
+    # NPU-specific: all-gather for sequence parallelism (keep from vllm-ascend)
+    if get_forward_context().sp_enabled and not isinstance(model_output, IntermediateTensors):
+        model_output = self._all_gather_hidden_states_and_aux(model_output)
+
+    return model_output
+```
+
+---
+
+## NPUARModelRunner (npu_ar_model_runner.py)
+
+### __init__ - KV Transfer Manager
+
+```python
+def __init__(self, *args, **kwargs):
+    super().__init__(*args, **kwargs)
+    self.input_ids = self._make_buffer(self.max_num_tokens, dtype=torch.int32)
+    # each model stage has its own hidden size
+    self.hidden_size = self.model_config.hf_text_config.hidden_size
+    self.inputs_embeds = self._make_buffer(self.max_num_tokens, self.hidden_size, dtype=self.dtype, numpy=False)
+    # Initialize KV transfer manager (preserve vllm_config fallback behavior)
+    self.kv_transfer_manager = OmniKVTransferManager.from_vllm_config(self.vllm_config, self.model_config)
+```
+
+### execute_model - KV Transfer Before Update States
+
+Location: At the very beginning of execute_model
+
+```python
+#  -------------------------------------- Omni-new -------------------------------------------------
+# [Omni] Handle KV transfer BEFORE updating states (which removes finished requests)
+self.kv_extracted_req_ids = self.kv_transfer_manager.handle_finished_requests_kv_transfer(
+    finished_reqs=getattr(scheduler_output, "finished_requests_needing_kv_transfer", {}),
+    kv_caches=self.kv_caches,
+    block_size=self.cache_config.block_size,
+    cache_dtype=str(self.cache_config.cache_dtype),
+    request_id_resolver=self._resolve_global_request_id,
+)
+#  -------------------------------------- Omni-new -------------------------------------------------
+```
+
+### execute_model - Custom _update_states Call
+
+Location: Inside synchronize_input_prep context
+
+```python
+#  -------------------------------------- Omni-new -------------------------------------------------
+self._update_states(scheduler_output)
+#  ------------------------------------------------------------------------------------------------
+```
+
+### execute_model - Extract Multimodal Outputs
+
+Location: In post process section, after hidden_states assignment
+
+```python
+#  -------------------------------------- Omni-new -------------------------------------------------
+hidden_states, multimodal_outputs = self.extract_multimodal_outputs(hidden_states)
+
+if multimodal_outputs is not None:
+    keys_or_type = (
+        list(multimodal_outputs.keys())
+        if isinstance(multimodal_outputs, dict)
+        else type(multimodal_outputs)
+    )
+    logger.debug(f"[AR] execute_model: multimodal_outputs keys = {keys_or_type}")
+else:
+    logger.debug("[AR] execute_model: multimodal_outputs is None")
+#  -------------------------------------- Omni-new -------------------------------------------------
+```
+
+### execute_model - Compute Logits with sampling_metadata
+
+Location: In both broadcast_pp_output True and False branches
+
+```python
+#  -------------------------------------- Omni-new -------------------------------------------------
+# Try with sampling_metadata first; fall back to without for models that don't support it
+try:
+    logits = self.model.compute_logits(
+        sample_hidden_states, sampling_metadata=self.input_batch.sampling_metadata
+    )
+except TypeError:
+    logits = self.model.compute_logits(sample_hidden_states)
+#  -------------------------------------- Omni-new -------------------------------------------------
+```
+
+### sample_tokens - KV Extracted Req IDs
+
+Location: At the beginning of sample_tokens
+
+```python
+#  -------------------------------------- Omni-new -------------------------------------------------
+kv_extracted_req_ids = getattr(self, "kv_extracted_req_ids", None)
+self.kv_extracted_req_ids = None
+#  -------------------------------------- Omni-new -------------------------------------------------
+```
+
+### sample_tokens - Process Additional Information and Build Output
+
+Location: After bookkeeping sync, replacing the original output construction
+
+```python
+#  -------------------------------------- Omni-new -------------------------------------------------
+hidden_states_cpu = hidden_states.detach().to("cpu").contiguous()
+num_scheduled_tokens_np = getattr(self, "_omni_num_scheduled_tokens_np", None)
+if num_scheduled_tokens_np is None:
+    req_ids = self.input_batch.req_ids
+    num_scheduled_tokens_np = np.array(
+        [scheduler_output.num_scheduled_tokens[rid] for rid in req_ids],
+        dtype=np.int32,
+    )
+
+self._process_additional_information_updates(
+    hidden_states, multimodal_outputs, num_scheduled_tokens_np, scheduler_output
+)
+
+pooler_output: list[dict[str, object]] = []
+for rid in req_ids_output_copy:
+    idx = req_id_to_index_output_copy[rid]
+    start = int(self.query_start_loc.cpu[idx])
+    sched = int(num_scheduled_tokens_np[idx])
+    end = start + sched
+    hidden_slice = hidden_states_cpu[start:end]
+    payload: dict[str, object] = {"hidden": hidden_slice}
+    if isinstance(multimodal_outputs, dict) and multimodal_outputs:
+        ...  # multimodal output slicing logic (elided)
+    pooler_output.append(payload)
+
+model_runner_output = OmniModelRunnerOutput(
+    req_ids=req_ids_output_copy,
+    req_id_to_index=req_id_to_index_output_copy,
+    sampled_token_ids=valid_sampled_token_ids,
+    logprobs=logprobs_lists,
+    prompt_logprobs_dict=prompt_logprobs_dict,
+    pooler_output=(pooler_output if self.vllm_config.model_config.engine_output_type != "text" else None),
+    kv_connector_output=kv_connector_output,
+)
+model_runner_output.kv_extracted_req_ids = kv_extracted_req_ids
+#  -------------------------------------- Omni-new -------------------------------------------------
+```
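+
+The slicing loop above relies on each request owning one contiguous token range in the flattened hidden states. A minimal sketch of that indexing in plain Python, assuming the start offsets are the exclusive prefix sum of the scheduled token counts (the role `query_start_loc` plays above). `slice_per_request` and the sample data are hypothetical:
+
+```python
+from itertools import accumulate
+
+
+def slice_per_request(rows, req_ids, num_scheduled_tokens):
+    """Split a flat list of per-token rows into one slice per request."""
+    # Exclusive prefix sum: starts[i] is the index of request i's first row.
+    starts = [0, *accumulate(num_scheduled_tokens)][:-1]
+    out = {}
+    for rid, start, count in zip(req_ids, starts, num_scheduled_tokens):
+        out[rid] = rows[start:start + count]
+    return out
+
+
+# Six token rows shared by three requests with 3, 1 and 2 scheduled tokens.
+rows = ["t0", "t1", "t2", "t3", "t4", "t5"]
+per_request = slice_per_request(rows, ["req-a", "req-b", "req-c"], [3, 1, 2])
+```
+
+Here `per_request["req-b"]` holds only `"t3"`: the request starts at offset 3 and owns a single token.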
+
+---
+
+## NPUGenerationModelRunner (npu_generation_model_runner.py)
+
+### execute_model - Async Chunk Update
+
+Location: Inside the input preparation section, before `synchronize_input_prep`
+
+```python
+#  -------------------------------------- Omni-new -------------------------------------------------
+if self.model_config.async_chunk and num_scheduled_tokens:
+    self._update_request_states(scheduler_output)
+#  -------------------------------------- Omni-new -------------------------------------------------
+```
+
+### execute_model - Seq Token Counts
+
+Location: After _preprocess call
+
+```python
+# [Omni] Pass token counts per request for code2wav output slicing
+model_kwargs["seq_token_counts"] = tokens
+```
+
+### execute_model - Run Generation Model
+
+Location: Inside forward context
+
+```python
+#  -------------------------------------- Omni-new -------------------------------------------------
+outputs = self._run_generation_model(
+    num_tokens_padded=num_tokens_padded,
+    input_ids=input_ids,
+    positions=positions,
+    intermediate_tensors=intermediate_tensors,
+    inputs_embeds=inputs_embeds,
+    model_kwargs=model_kwargs,
+    logits_indices=logits_indices,
+)
+_, multimodal_outputs = self.extract_multimodal_outputs(outputs)
+# -------------------------------------- Omni-new -------------------------------------------------
+```
+
+### sample_tokens - Multimodal Output Processing
+
+The entire sample_tokens method body is omni-specific for generation models:
+
+```python
+#  -------------------------------------- Omni-new -------------------------------------------------
+pooler_output: list[object] = []
+if isinstance(multimodal_outputs, torch.Tensor):
+    ...  # tensor handling (elided)
+elif isinstance(multimodal_outputs, list):
+    ...  # list handling (elided)
+elif isinstance(multimodal_outputs, dict):
+    ...  # dict handling per request (elided)
+else:
+    raise RuntimeError("Unsupported diffusion output type")
+# [Omni] Copy req_id mappings to avoid async scheduling mutation.
+req_ids_output_copy = self.input_batch.req_ids.copy()
+req_id_to_index_output_copy = self.input_batch.req_id_to_index.copy()
+output = OmniModelRunnerOutput(
+    req_ids=req_ids_output_copy,
+    req_id_to_index=req_id_to_index_output_copy,
+    sampled_token_ids=[],
+    logprobs=None,
+    prompt_logprobs_dict={},
+    pooler_output=pooler_output,
+    kv_connector_output=kv_connector_output,
+    num_nans_in_logits={},
+    ec_connector_output=ec_connector_output if self.supports_mm_inputs else None,
+)
+#  -------------------------------------- Omni-new -------------------------------------------------
+```
+
+### _dummy_run - Model Kwargs Init and Multimodal Extract
+
+Location: Before and after the model forward call
+
+```python
+model_kwargs = self._init_model_kwargs()  # Before forward
+
+# ... forward ...
+
+# -------------------------------------- Omni-new -------------------------------------------------
+hidden_states, _ = self.extract_multimodal_outputs(hidden_states)
+# -------------------------------------- Omni-new -------------------------------------------------
+```
+
+---
+
+## ExecuteModelState Extension
+
+The `ExecuteModelState` NamedTuple is extended for omni:
+
+```python
+class ExecuteModelState(NamedTuple):
+    """Ephemeral cached state transferred between execute_model() and
+    sample_tokens(), after execute_model() returns None."""
+
+    scheduler_output: SchedulerOutput
+    logits: torch.Tensor
+    spec_decode_metadata: SpecDecodeMetadata | None
+    spec_decode_common_attn_metadata: AscendCommonAttentionMetadata | None
+    hidden_states: torch.Tensor
+    sample_hidden_states: torch.Tensor
+    aux_hidden_states: list[torch.Tensor] | None
+    attn_metadata: PerLayerAttnMetadata
+    positions: torch.Tensor
+    ec_connector_output: ECConnectorOutput | None
+    cudagraph_stats: CUDAGraphStat | None
+    multimodal_outputs: Any  # <-- Omni extension
+```
+
+`npu_generation_model_runner.py` must import this extended `ExecuteModelState` from `npu_ar_model_runner`.
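+
+Because `ExecuteModelState` is a `NamedTuple`, every construction and every tuple unpacking of the state has to change together with the field list. A simplified sketch with placeholder field types, assuming the real tuple keeps the tensors and metadata listed above. `MiniExecuteModelState` is hypothetical and keeps only four fields:
+
+```python
+from typing import Any, NamedTuple
+
+
+class MiniExecuteModelState(NamedTuple):
+    scheduler_output: Any
+    logits: Any
+    hidden_states: Any
+    multimodal_outputs: Any  # omni extension, appended last
+
+
+# execute_model() side: keyword construction keeps the field order explicit.
+state = MiniExecuteModelState(
+    scheduler_output="sched",
+    logits=[0.1, 0.9],
+    hidden_states=[[1.0], [2.0]],
+    multimodal_outputs={"audio": [0.5]},
+)
+
+# sample_tokens() side: unpacking must name every field, including the new one.
+scheduler_output, logits, hidden_states, multimodal_outputs = state
+```
+
+Omitting the new field at construction raises `TypeError`, and unpacking into too few names raises `ValueError`, so a missed call site fails loudly at the first `execute_model` or `sample_tokens` call.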
diff --git a/.claude/skills/vllm-omni-npu-upgrade/references/workflow-checklist.md b/.claude/skills/vllm-omni-npu-upgrade/references/workflow-checklist.md
new file mode 100644
index 00000000000..4f184df0ecb
--- /dev/null
+++ b/.claude/skills/vllm-omni-npu-upgrade/references/workflow-checklist.md
@@ -0,0 +1,222 @@
+# NPU Model Runner Upgrade Workflow Checklist
+
+> **Note**: The reference document `omni-specific-blocks.md` may be incomplete. Always grep for `Omni-new` in the GPU implementations to find every omni-specific code block, and update the reference docs when you discover new blocks.
+
+## Pre-Upgrade Preparation
+
+### 1. Version Information
+- [ ] Identify current vllm-omni version: `_________`
+- [ ] Identify target vllm-ascend version: `_________`
+- [ ] Identify target vllm version: `_________`
+- [ ] Date of the last release, used as the `--since` baseline for GPU worker changes: `_________`
+
+### 2. Gather Git History
+```bash
+# GPU-side omni changes since last release
+cd /root/vllm-workspace/vllm-omni
+git log --oneline --since="YYYY-MM-DD" -- vllm_omni/worker/
+
+# vllm-ascend NPUModelRunner changes
+cd /root/vllm-workspace/vllm-ascend
+git log --oneline <from-tag>..<to-tag> -- vllm_ascend/worker/model_runner_v1.py
+```
+
+### 3. Backup Current Files
+- [ ] Create backup of current NPU runners:
+  ```bash
+  cp -r vllm_omni/platforms/npu/worker vllm_omni/platforms/npu/worker.backup
+  ```
+
+---
+
+## OmniNPUModelRunner (npu_model_runner.py)
+
+### Read and Understand
+- [ ] Read current `npu_model_runner.py`
+- [ ] Read latest `vllm_ascend/worker/model_runner_v1.py`
+- [ ] Read latest `vllm_omni/worker/gpu_model_runner.py`
+
+### Method: load_model
+- [ ] Document existing omni-specific logic
+- [ ] Copy latest NPUModelRunner.load_model structure
+- [ ] Re-insert: `enable_sp(self.vllm_config)` call
+- [ ] Re-insert: talker_mtp detection and setup
+- [ ] Replace: `CUDAGraphWrapper` → `ACLGraphWrapper`
+- [ ] Re-insert: Buffer allocations (talker_mtp_input_ids, etc.)
+
+### Method: _dummy_run
+- [ ] Document existing omni-specific logic locations
+- [ ] Copy latest NPUModelRunner._dummy_run
+- [ ] Re-insert: talker_mtp dummy forward block (inside context)
+- [ ] Re-insert: `extract_multimodal_outputs` call
+- [ ] Verify: Comment markers are present
+
+### Method: _model_forward
+- [ ] Copy latest NPUModelRunner._model_forward structure
+- [ ] Re-insert: `_build_model_kwargs_extra()` call
+- [ ] Re-insert: OmniOutput wrapping logic
+- [ ] Re-insert: `_omni_last_model_output` caching
+- [ ] Keep: NPU graph params update
+- [ ] Keep: SP all-gather logic
+
+### Method: _talker_mtp_forward
+- [ ] Verify: Uses `set_ascend_forward_context`
+- [ ] Verify: Uses `ACLGraphWrapper` check
+- [ ] Sync any changes from GPU `_talker_mtp_forward`
+
+### Imports
+- [ ] Update vllm-ascend imports to latest paths
+- [ ] Verify all omni imports are present
+- [ ] Remove any deprecated imports
+
+---
+
+## NPUARModelRunner (npu_ar_model_runner.py)
+
+### Read and Understand
+- [ ] Read current `npu_ar_model_runner.py`
+- [ ] Read latest `vllm_ascend/worker/model_runner_v1.py` execute_model
+- [ ] Read latest `vllm_omni/worker/gpu_ar_model_runner.py`
+
+### Method: __init__
+- [ ] Sync any new initialization from GPU side
+- [ ] Keep: `OmniKVTransferManager` setup
+- [ ] Keep: Custom buffer allocations
+
+### Method: execute_model
+- [ ] Document all omni blocks with line numbers
+- [ ] Copy latest NPUModelRunner.execute_model structure
+- [ ] Re-insert: KV transfer handling (beginning)
+- [ ] Re-insert: Custom `_update_states` call
+- [ ] Re-insert: `extract_multimodal_outputs`
+- [ ] Re-insert: `compute_logits` with sampling_metadata try/except
+- [ ] Update: ExecuteModelState to include multimodal_outputs
+
+### Method: sample_tokens
+- [ ] Document all omni blocks
+- [ ] Copy latest NPUModelRunner.sample_tokens structure
+- [ ] Re-insert: `kv_extracted_req_ids` handling
+- [ ] Re-insert: Hidden states CPU copy
+- [ ] Re-insert: `_process_additional_information_updates`
+- [ ] Re-insert: `OmniModelRunnerOutput` construction
+
+### ExecuteModelState
+- [ ] Verify: `multimodal_outputs` field is present
+- [ ] Verify: Imported/used correctly in execute_model
+
+### Imports
+- [ ] Update all vllm-ascend imports
+- [ ] Keep omni-specific imports
+
+---
+
+## NPUGenerationModelRunner (npu_generation_model_runner.py)
+
+### Read and Understand
+- [ ] Read current `npu_generation_model_runner.py`
+- [ ] Read latest GPU `gpu_generation_model_runner.py`
+
+### Method: _update_request_states
+- [ ] Verify: async_chunk handling is correct
+- [ ] Sync any changes from GPU side
+
+### Method: execute_model
+- [ ] Document all omni blocks
+- [ ] Copy latest NPUModelRunner.execute_model base structure
+- [ ] Re-insert: async_chunk update logic
+- [ ] Re-insert: `seq_token_counts` injection
+- [ ] Re-insert: `_run_generation_model` call
+- [ ] Re-insert: `extract_multimodal_outputs`
+- [ ] Use: ExecuteModelState from npu_ar_model_runner
+
+### Method: sample_tokens
+- [ ] Keep: Entire omni multimodal output processing
+- [ ] Update: Any new output fields needed
+- [ ] Keep: `OmniModelRunnerOutput` construction
+
+### Method: _run_generation_model
+- [ ] Sync any changes from GPU side
+- [ ] Keep: `_model_forward` call with sampler
+
+### Method: _dummy_run
+- [ ] Copy latest NPUModelRunner._dummy_run
+- [ ] Re-insert: `model_kwargs = self._init_model_kwargs()`
+- [ ] Re-insert: `extract_multimodal_outputs` at end
+
+### Imports
+- [ ] Import ExecuteModelState from npu_ar_model_runner
+- [ ] Update vllm-ascend imports
+
+---
+
+## Post-Upgrade Validation
+
+### Syntax Validation
+- [ ] `python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py`
+- [ ] `python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py`
+- [ ] `python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py`
+
+### Import Validation
+- [ ] `python -c "from vllm_omni.platforms.npu.worker.npu_model_runner import OmniNPUModelRunner"`
+- [ ] `python -c "from vllm_omni.platforms.npu.worker.npu_ar_model_runner import NPUARModelRunner"`
+- [ ] `python -c "from vllm_omni.platforms.npu.worker.npu_generation_model_runner import NPUGenerationModelRunner"`
+
+### Comment Markers
+- [ ] Grep for "Omni-new" in all three files
+- [ ] Verify all omni blocks have closing markers
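+
+The marker check can be scripted. A minimal sketch, assuming every omni block is delimited by a pair of comment lines that both contain `Omni-new`, so an odd count points to a missing marker. The function names are hypothetical, and single-line `[Omni]` comments or closing rules written without `Omni-new` are not covered:
+
+```python
+def count_omni_markers(source: str, marker: str = "Omni-new") -> int:
+    """Count comment lines that carry the omni block marker."""
+    return sum(
+        1
+        for line in source.splitlines()
+        if line.lstrip().startswith("#") and marker in line
+    )
+
+
+def markers_balanced(source: str) -> bool:
+    """Opening and closing markers look identical, so the total must be even."""
+    return count_omni_markers(source) % 2 == 0
+
+
+# Two tiny samples: one closed block and one block missing its closing marker.
+good = "# ---- Omni-new ----\nx = 1\n# ---- Omni-new ----\n"
+bad = "# ---- Omni-new ----\nx = 1\n"
+```
+
+To check the real runners, read each file with `pathlib.Path(path).read_text()` and pass the text to `markers_balanced`. An even count is necessary but not sufficient: it does not prove the markers enclose the right code.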
+
+### Code Review
+- [ ] No `CUDAGraphWrapper` references
+- [ ] All `set_forward_context` replaced with `set_ascend_forward_context`
+- [ ] Parameter names are correct (`aclgraph_runtime_mode`, not `cudagraph_runtime_mode`)
+- [ ] No duplicate code blocks
+- [ ] No missing imports
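+
+The first three review items lend themselves to a text scan. A minimal sketch, assuming the identifiers named in the checklist are the only ones of interest. `find_gpu_only_names` and the sample snippet are hypothetical, and a hit inside a comment or string still needs a human look:
+
+```python
+import re
+
+# Names from the checklist that should not survive in the NPU runners.
+GPU_ONLY_PATTERNS = {
+    "CUDAGraphWrapper": re.compile(r"\bCUDAGraphWrapper\b"),
+    "set_forward_context": re.compile(r"\bset_forward_context\b"),
+    "cudagraph_runtime_mode": re.compile(r"\bcudagraph_runtime_mode\b"),
+}
+
+
+def find_gpu_only_names(source: str) -> list[tuple[int, str]]:
+    """Return (line_number, name) for each GPU-only identifier found."""
+    hits = []
+    for lineno, line in enumerate(source.splitlines(), start=1):
+        for name, pattern in GPU_ONLY_PATTERNS.items():
+            if pattern.search(line):
+                hits.append((lineno, name))
+    return hits
+
+
+# The NPU context manager on line 1 is fine; the wrapper on line 3 is not.
+sample = (
+    "with set_ascend_forward_context(attn_metadata, self.vllm_config):\n"
+    "    pass\n"
+    "if isinstance(self.model, CUDAGraphWrapper):\n"
+    "    pass\n"
+)
+```
+
+The word-boundary patterns keep `set_ascend_forward_context` from being reported as `set_forward_context`, so only the leftover `CUDAGraphWrapper` line is returned for `sample`.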
+
+---
+
+## Git Commit
+
+### Commit Message Template
+```
+[NPU] Upgrade model runners to align with vllm-ascend vX.Y.Z
+
+- Update OmniNPUModelRunner with latest NPUModelRunner base
+- Update NPUARModelRunner execute_model and sample_tokens
+- Update NPUGenerationModelRunner for async_chunk changes
+- Sync GPU-side omni changes from vX.Y.Z release
+- Preserve all omni-specific logic (marked with Omni-new comments)
+
+Changes from vllm-ascend:
+- <list key changes>
+
+Changes synced from GPU:
+- <list key GPU-side omni changes>
+```
+
+### Files to Stage
+- [ ] `vllm_omni/platforms/npu/worker/npu_model_runner.py`
+- [ ] `vllm_omni/platforms/npu/worker/npu_ar_model_runner.py`
+- [ ] `vllm_omni/platforms/npu/worker/npu_generation_model_runner.py`
+- [ ] Any other modified files
+
+---
+
+## Troubleshooting
+
+### Import Errors
+- Check if vllm-ascend module paths have changed
+- Verify PYTHONPATH includes both vllm-ascend and vllm-omni
+
+### Type Errors
+- Check method signatures match between GPU and NPU
+- Verify NamedTuple fields match expected structure
+
+### Runtime Errors
+- Enable debug logging: `export VLLM_LOGGING_LEVEL=DEBUG`
+- Check graph capture issues: try `--enforce-eager`
+- Check attention issues: verify AscendAttentionState usage
+
+### Performance Regression
+- Compare against the previous version on the same model
+- Check if graph capture is working: look for ACLGraph logs
+- Verify SP/EP configurations are correct