Merged
26 commits
b4add5b
[CI] Skip test_bagel[parallel_tp_2] and test_wan22_i2v_online_serving…
yenuo26 Apr 17, 2026
64d368d
[Bugfix] fix CI failure (#2884)
RuixiangMa Apr 17, 2026
f2edb81
[Cleanup] Remove dead runtime.defaults config parameters (#2343)
NickCao Apr 17, 2026
1637dba
[skip CI][Docs] Add Qwen3-Omni and Qwen3-TTS performance blog and fig…
Shirley125 Apr 17, 2026
b5ddff7
Nextstep online e2e (#2107)
Joshna-Medisetty Apr 17, 2026
f346f2f
Add Teacache Support for LongCat Image (#1487)
alex-jw-brooks Apr 17, 2026
5a68c21
[skip ci][recipe] draft vllm-omni recipes (#2646)
hsliuustc0106 Apr 18, 2026
4f71f73
[Docs] Update WeChat QR code for community support (#2895)
david6666666 Apr 18, 2026
d2c23d7
[Refactor] Remove resampy dependency (#2891)
NickCao Apr 18, 2026
4124a1f
[Feature]Support audio streaming input and output-phase2 (#2581)
Shirley125 Apr 18, 2026
768931e
[BugFix]: Fix multi-stage cfg bug (#2801)
princepride Apr 18, 2026
fe6cec6
[doc][skip ci] remove redundant content in readme (#2901)
Shirley125 Apr 18, 2026
9cf1fe7
[Feat] cache-dit for GLM-Image (#1399)
RuixiangMa Apr 18, 2026
9313f37
[Agent] Add NPU main2main skill (#2858)
gcanlin Apr 18, 2026
a683b1d
[Bugfix][VoxCPM2] Fix voice-clone decode loop by padding prefill prom…
Sy0307 Apr 18, 2026
a390381
[Config Refactor][2/N] Pipeline + Deploy Config Schema (#2383)
lishunyang12 Apr 19, 2026
26edc7f
[Bugfix][VoxCPM2]: Fix vectorized_gather OOB under concurrent prefill…
Sy0307 Apr 19, 2026
1568451
perf(helios): replace strided RoPE with stack+flatten for contiguous …
willamhou Apr 19, 2026
93beef1
[Bugfix] diffusion end points allow model mismatch (#2805)
xiaohajiayou Apr 19, 2026
68f28f9
[Feat] Support layerwise CPU offloading for more videogen models (#2018)
yuanheng-zhao Apr 19, 2026
cd384d9
[Config Refactor 2.5/N] Centralize pipeline registry (#2915)
lishunyang12 Apr 19, 2026
a2f4c57
Merge origin/main into dev/migrate-MR-v2
Sy0307 Apr 19, 2026
6808a44
[BugFix] Add Qwen2_5Omni to test_init_model_state expected set
Sy0307 Apr 16, 2026
ee2ebb4
[BugFix] Fix MTP buffer size mismatch for Omni Talker models
Sy0307 Apr 16, 2026
cde2903
[BugFix] Propagate finished_req_ids for already_finished_reqs
Sy0307 Apr 19, 2026
f7bada9
Condense comments in MTP and scheduler fixes
Sy0307 Apr 19, 2026
300 changes: 300 additions & 0 deletions .claude/skills/vllm-omni-npu-upgrade/SKILL.md
@@ -0,0 +1,300 @@
---
name: vllm-omni-npu-model-runner-upgrade
description: "Upgrade vllm-omni NPU model runners (OmniNPUModelRunner, NPUARModelRunner, NPUGenerationModelRunner) to align with the latest vllm-ascend NPUModelRunner while preserving omni-specific logic."
---

# vLLM-Omni NPU Model Runner Upgrade Skill

## Overview

This skill guides upgrading vllm-omni's NPU model runners to align with the latest vllm-ascend codebase while preserving omni-specific enhancements. The NPU runners run omni multimodal models (such as Qwen3-Omni, Bagel, and MiMoAudio) on Ascend NPUs.

## File Structure

### NPU Model Runner Files
```
vllm-omni/vllm_omni/platforms/npu/worker/
├── __init__.py
├── npu_model_runner.py # OmniNPUModelRunner (base class)
├── npu_ar_model_runner.py # NPUARModelRunner (autoregressive)
├── npu_ar_worker.py # AR worker
├── npu_generation_model_runner.py # NPUGenerationModelRunner (diffusion/non-AR)
└── npu_generation_worker.py # Generation worker
```

### GPU Reference Files (for omni-specific logic sync)
```
vllm-omni/vllm_omni/worker/
├── __init__.py
├── gpu_model_runner.py # OmniGPUModelRunner
├── gpu_ar_model_runner.py # GPUARModelRunner
├── gpu_ar_worker.py
├── gpu_generation_model_runner.py
├── gpu_generation_worker.py
├── mixins.py
├── base.py
└── gpu_memory_utils.py
```

### vllm-ascend Reference Files
```
vllm-ascend/vllm_ascend/worker/
├── model_runner_v1.py # NPUModelRunner (base class to copy from)
├── npu_input_batch.py
├── block_table.py
├── pcp_utils.py
└── worker.py
```

## Inheritance Hierarchy

```
                    GPUModelRunner (vllm)
                            |
           +----------------+----------------+
           |                                 |
   OmniGPUModelRunner            NPUModelRunner (vllm-ascend)
   (vllm_omni/worker)            (vllm_ascend/worker)
           |                                 |
           +------- OmniNPUModelRunner ------+
                   (multiple inheritance)
                            |
           +----------------+----------------+
           |                                 |
   NPUARModelRunner              NPUGenerationModelRunner
   (autoregressive)              (non-autoregressive/diffusion)
```
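
The diamond above can be sketched with stub classes to see how Python resolves methods. This is a minimal illustration, not the real runners: the stub bodies are invented, and the base-class order of `OmniNPUModelRunner` (omni first, then NPU) is an assumption that should be checked against `npu_model_runner.py`.

```python
# Stub classes only: the real runners live in vllm, vllm-ascend and vllm-omni.
class GPUModelRunner:
    def load_model(self) -> str:
        return "vllm"


class OmniGPUModelRunner(GPUModelRunner):
    """Stands in for the omni-specific GPU runner (adds helpers, no override here)."""


class NPUModelRunner(GPUModelRunner):
    def load_model(self) -> str:
        return "vllm-ascend"


# Assumption: the base order shown here is illustrative only.
class OmniNPUModelRunner(OmniGPUModelRunner, NPUModelRunner):
    """Combines omni helpers with NPU execution through multiple inheritance."""


def mro_names(cls: type) -> list[str]:
    """Return the class names in method resolution order."""
    return [c.__name__ for c in cls.__mro__]


if __name__ == "__main__":
    print(mro_names(OmniNPUModelRunner))
    print(OmniNPUModelRunner().load_model())
```

With this ordering, a method that `OmniGPUModelRunner` does not override falls through to `NPUModelRunner` before reaching `GPUModelRunner`, which is why methods copied from vllm-ascend and omni blocks ported from the GPU side can coexist in one class.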

## Omni-Specific Comment Markers

Omni-specific logic is marked with comment blocks:
```python
# -------------------------------------- Omni-new -------------------------------------------------
# ... omni-specific code ...
# -------------------------------------- Omni-new -------------------------------------------------
```

Or a simpler variation that closes the block with a plain dashed line:
```python
# -------------------------------------- Omni-new -------------------------------------------------
# ------------------------------------------------------------------------------------------------
```

**Important**:
- Always preserve and add these markers when modifying code.
- **The reference documents (`references/omni-specific-blocks.md`) may not be up-to-date.** Always grep for `Omni-new` in the GPU implementations to find the authoritative list of omni-specific blocks.
- When you discover new omni-specific code that is not documented in the references, please update the reference files.

## Key Methods Requiring Attention

### OmniNPUModelRunner (npu_model_runner.py)

| Method | Description | Omni-Specific Logic |
|--------|-------------|---------------------|
| `load_model` | Load model and initialize talker_mtp | Uses `ACLGraphWrapper` instead of `CUDAGraphWrapper`, initializes talker buffers |
| `_dummy_run` | Warmup/profiling run | talker_mtp dummy forward, `extract_multimodal_outputs` |
| `_model_forward` | Forward pass wrapper | Injects `model_kwargs_extra`, wraps with `OmniOutput`, NPU-specific graph updates |
| `_talker_mtp_forward` | Talker MTP forward for Qwen3-Omni | Uses `set_ascend_forward_context` |

### NPUARModelRunner (npu_ar_model_runner.py)

| Method | Description | Omni-Specific Logic |
|--------|-------------|---------------------|
| `__init__` | Initialize with KV transfer manager | `OmniKVTransferManager` setup |
| `execute_model` | Main inference entry | KV transfer handling, `_update_states` override, `extract_multimodal_outputs` |
| `sample_tokens` | Token sampling | Hidden states extraction, multimodal outputs processing, `OmniModelRunnerOutput` |
| `_resolve_global_request_id` | Request ID resolution | For disaggregated inference |

### NPUGenerationModelRunner (npu_generation_model_runner.py)

| Method | Description | Omni-Specific Logic |
|--------|-------------|---------------------|
| `_update_request_states` | Update request states for async chunk | async_chunk handling |
| `execute_model` | Generation forward | async_chunk, `seq_token_counts`, `_run_generation_model` |
| `sample_tokens` | Output processing | multimodal output packaging to `OmniModelRunnerOutput` |
| `_dummy_run` | Dummy run override | model_kwargs initialization, multimodal extraction |
| `_run_generation_model` | Run generation model | Calls `_model_forward` with sampler |
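
As a quick sanity check after an upgrade, the method names in these tables can be verified against the source with the standard `ast` module. The sketch below is a hypothetical helper, not part of vllm-omni; it only sees methods defined directly in the class body, so inherited methods are reported as missing.

```python
import ast


def missing_methods(source: str, class_name: str, required: set[str]) -> set[str]:
    """Return the names in `required` that `class_name` does not define directly."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            defined = {
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            }
            return required - defined
    raise LookupError(f"class {class_name!r} not found")


# Method names taken from the OmniNPUModelRunner table above.
BASE_RUNNER_METHODS = {
    "load_model",
    "_dummy_run",
    "_model_forward",
    "_talker_mtp_forward",
}
```

To use it, read `npu_model_runner.py` as text and call `missing_methods(text, "OmniNPUModelRunner", BASE_RUNNER_METHODS)`; an empty set means every listed override is still present.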

## Upgrade Workflow

### Step 1: Preparation

1. **Identify target versions** (use the `gh` CLI to check):
- vllm-omni: the `main` branch
- Baseline: the last vllm-omni release
- vllm-ascend: the latest local checkout (no specific version is pinned)

2. **Check GPU-side changes** (since last release):
```bash
cd /root/vllm-workspace/vllm-omni
git log --oneline --since="<last-release-date>" -- vllm_omni/worker/
```

3. **Read latest vllm-ascend code**:
- vllm-ascend changes are not tracked individually; use the latest code in `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py` directly
- Copy the relevant methods and re-insert omni-specific blocks

### Step 2: Analyze Omni-Specific Logic

For each NPU model runner file:

1. **Extract existing omni-specific blocks**:
```bash
grep -n "Omni-new" vllm_omni/platforms/npu/worker/npu_model_runner.py
```

2. **Document each omni block**:
- Which method it belongs to
- What functionality it provides
- Dependencies on other omni code

### Step 3: Update Base Class (OmniNPUModelRunner)

**Note**: Always check the GPU implementation `gpu_model_runner.py` for any new omni logic not yet documented in references.

1. **Read the latest vllm-ascend `NPUModelRunner.load_model`**
2. **Copy the method, keeping the structure**
3. **Re-insert omni-specific logic** (check GPU `gpu_model_runner.py` for authoritative list):
- Replace `CUDAGraphWrapper` with `ACLGraphWrapper`
- Keep talker_mtp initialization
- Preserve buffer allocations for talker
- Check for any new omni blocks added since last sync

4. **Update `_dummy_run`**:
- Copy from vllm-ascend
- Compare with GPU `_dummy_run` for omni-specific blocks
- Re-insert all `Omni-new` marked code from GPU version

5. **Update `_model_forward`**:
- Keep the omni wrapper logic
- Update NPU-specific parts (graph params, SP all-gather)
- Check GPU version for any new omni logic

### Step 4: Update AR Model Runner

1. **Compare with GPU `gpu_ar_model_runner.py`** for any new omni features
2. **Copy `execute_model` from vllm-ascend**
3. **Re-insert omni blocks** (reference `references/omni-specific-blocks.md`, but note it may be incomplete):
- **IMPORTANT**: Always check the GPU implementation `gpu_ar_model_runner.py` for all `Omni-new` marked code blocks
- The reference doc may not include newly added omni logic - treat it as a starting point, not exhaustive
- When discovering new omni code blocks, please update `references/omni-specific-blocks.md`
- Common omni blocks include but are not limited to: KV transfer, multimodal outputs, sampling_metadata handling, etc.

4. **Update `sample_tokens`** (also compare with GPU implementation):
- Compare with `gpu_ar_model_runner.py`'s `sample_tokens` method
- Identify all `Omni-new` marked code blocks
- Ensure NPU version includes all omni-specific logic

### Step 5: Update Generation Model Runner

**Note**: Generation model runner may have unique omni logic for diffusion/non-AR models.

1. **Compare with GPU `gpu_generation_model_runner.py`** - grep for all `Omni-new` blocks
2. **Update `execute_model`**:
- Check GPU version for all omni-specific blocks
- Keep async_chunk handling
- Keep `seq_token_counts` injection
- Update forward/context setup from vllm-ascend
- Look for any new omni logic not documented in references

3. **Update `_dummy_run`**:
- Copy from vllm-ascend base
- Compare with GPU `_dummy_run` if exists
- Re-insert all omni-specific logic

### Step 6: Update Imports

Check and update imports at the top of each file:

```python
# Common vllm-ascend imports
from vllm_ascend.ascend_forward_context import get_forward_context, set_ascend_forward_context
from vllm_ascend.attention.attention_v1 import AscendAttentionState
from vllm_ascend.attention.utils import using_paged_attention
from vllm_ascend.compilation.acl_graph import ACLGraphWrapper, update_full_graph_params
from vllm_ascend.ops.rotary_embedding import update_cos_sin
from vllm_ascend.utils import enable_sp, lmhead_tp_enable
from vllm_ascend.worker.model_runner_v1 import SEQ_LEN_WITH_MAX_PA_WORKSPACE, NPUModelRunner

# Omni-specific imports
from vllm_omni.model_executor.models.output_templates import OmniOutput
from vllm_omni.worker.gpu_model_runner import OmniGPUModelRunner
from vllm_omni.outputs import OmniModelRunnerOutput
from vllm_omni.distributed.omni_connectors.kv_transfer_manager import OmniKVTransferManager
```
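
Because vllm-ascend moves quickly, it can help to confirm that each imported name still exists before editing the runners. The helper below is a hypothetical sketch: it imports each module and checks for the listed attributes. The module paths are copied from the list above and are not guaranteed to match the installed vllm-ascend version.

```python
import importlib


def check_imports(targets: dict[str, list[str]]) -> list[str]:
    """Return human-readable problems; an empty list means every name resolved."""
    problems = []
    for module_name, names in targets.items():
        try:
            module = importlib.import_module(module_name)
        except Exception as exc:  # Import can also fail for non-ImportError reasons.
            problems.append(f"{module_name}: cannot import ({exc})")
            continue
        for name in names:
            if not hasattr(module, name):
                problems.append(f"{module_name}: missing {name}")
    return problems


# Module paths and names copied from the import list above.
TARGETS = {
    "vllm_ascend.ascend_forward_context": ["set_ascend_forward_context"],
    "vllm_ascend.compilation.acl_graph": ["ACLGraphWrapper", "update_full_graph_params"],
    "vllm_ascend.worker.model_runner_v1": ["NPUModelRunner"],
    "vllm_omni.outputs": ["OmniModelRunnerOutput"],
}

if __name__ == "__main__":
    for problem in check_imports(TARGETS):
        print(problem)
```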

### Step 7: Sync GPU-Side Omni Changes

1. **Check recent GPU worker changes**:
```bash
git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_model_runner.py
git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_ar_model_runner.py
git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_generation_model_runner.py
```

2. **Identify new omni features** that need to be ported to NPU

3. **Apply corresponding changes** to NPU runners

### Step 8: Validation

1. **Run a syntax check** (`py_compile` only verifies that each file compiles; it does not check types):
```bash
cd /root/vllm-workspace/vllm-omni
python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py
python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py
python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py
```

2. **Run import test**:
```bash
python -c "from vllm_omni.platforms.npu.worker import *"
```

3. **Run model serving test** (if hardware available):
```bash
vllm serve <model-path> --trust-remote-code
```
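
The `py_compile` commands above can also be run from a single Python script, which is convenient when an agent needs a machine-readable result. This sketch uses the built-in `compile()` instead of `py_compile` so that no `.pyc` files are written; the function name is hypothetical.

```python
from pathlib import Path

# Runner files listed in the validation step above.
RUNNER_FILES = [
    "vllm_omni/platforms/npu/worker/npu_model_runner.py",
    "vllm_omni/platforms/npu/worker/npu_ar_model_runner.py",
    "vllm_omni/platforms/npu/worker/npu_generation_model_runner.py",
]


def syntax_errors(paths: list[str], root: str = ".") -> dict[str, str]:
    """Map each path that fails to read or compile to its error message."""
    errors = {}
    for rel in paths:
        path = Path(root) / rel
        try:
            # compile() parses the file without executing or caching it.
            compile(path.read_text(encoding="utf-8"), str(path), "exec")
        except (SyntaxError, ValueError, OSError) as exc:
            errors[rel] = str(exc)
    return errors


if __name__ == "__main__":
    for rel, message in syntax_errors(RUNNER_FILES).items():
        print(f"{rel}: {message}")
```

Run it from the vllm-omni repository root; no output means all three files parsed.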

## Common Pitfalls

### 1. Forward Context Differences
- GPU uses `set_forward_context`
- NPU uses `set_ascend_forward_context`
- Parameters may differ slightly

### 2. Graph Wrapper Differences
- GPU: `CUDAGraphWrapper`
- NPU: `ACLGraphWrapper`
- Constructor parameters may differ

### 3. Buffer Creation
- GPU: `_make_buffer` returns a different structure than the NPU version
- NPU: may need an explicit `numpy=True/False` argument

### 4. Attention Metadata
- GPU: Uses vllm attention metadata builders
- NPU: Uses `AscendCommonAttentionMetadata`

### 5. Sampling
- GPU: Uses vllm sampler
- NPU: Uses `AscendSampler`

## Checklist Before Commit

- [ ] All omni-specific comment markers preserved
- [ ] New omni logic from GPU side synced
- [ ] Imports updated to latest vllm-ascend
- [ ] No `CUDAGraphWrapper` references in NPU code
- [ ] `set_ascend_forward_context` used instead of `set_forward_context`
- [ ] `ACLGraphWrapper` used for talker_mtp wrapping
- [ ] Type hints match vllm-ascend signatures
- [ ] No duplicate code blocks
- [ ] Python syntax valid (py_compile passes)
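
Two of the items above (no `CUDAGraphWrapper`, no bare `set_forward_context`) can be checked mechanically. The sketch below is a hypothetical helper that uses the standard `tokenize` module so that mentions inside comments or strings are ignored; the forbidden set is an assumption derived from this checklist and should be adjusted if an NPU runner legitimately needs one of these names.

```python
import io
import tokenize

# GPU-only names from the checklist above that should not appear in NPU runner code.
FORBIDDEN = {"CUDAGraphWrapper", "set_forward_context"}


def forbidden_names(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for each forbidden identifier used in code."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers only, not comments or string literals.
        if tok.type == tokenize.NAME and tok.string in FORBIDDEN:
            hits.append((tok.start[0], tok.string))
    return hits
```

Because matching is done on whole identifier tokens, `set_ascend_forward_context` is never mistaken for `set_forward_context`.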

## Reference Files for Comparison

When upgrading, keep these files open for reference:

1. **vllm-ascend NPUModelRunner**: `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py`
2. **vllm GPUModelRunner**: `/root/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py`
3. **vllm-omni OmniGPUModelRunner**: `/root/vllm-workspace/vllm-omni/vllm_omni/worker/gpu_model_runner.py`