Merged
40 commits
a5a4998
[Feature] Support Prefill-Decode disaggregation via vLLM KV transfer …
spencerr221 Apr 17, 2026
c0ccbb8
[Model] Add Ming-flash-omni-2.0 Thinker Stage (#1822)
yuanheng-zhao Apr 17, 2026
b7f2398
[Bugfix] Fix RIFE device selection for CPU-transported videos (#2876)
david6666666 Apr 17, 2026
f658bcb
[Bugfix] Limit Qwen-Image-Edit-2511 input image count (#2840)
david6666666 Apr 17, 2026
edb4f2f
[Test] Add ModelRunner V2 with Qwen3-TTS Base E2E Test to CI pipeline…
tzhouam Apr 17, 2026
cf75ae6
[Bugfix] Fix image quality in /v1/images/generations for multi-stage …
RuixiangMa Apr 17, 2026
6b7be88
Fix NoneType error of outputs (#2315)
QiuMike Apr 17, 2026
18ac679
[Refactor] refactor wan2.2 diffuse && add ut (#2672)
bjf-frz Apr 17, 2026
6c57ab7
[Misc] Warn When vLLM / vLLM-Omni Have Mismatched Versions (#2691)
alex-jw-brooks Apr 17, 2026
536f59b
[Bugfix] Fix cache dit for Longcat & LTX2 (#2860)
alex-jw-brooks Apr 17, 2026
b4add5b
[CI] Skip test_bagel[parallel_tp_2] and test_wan22_i2v_online_serving…
yenuo26 Apr 17, 2026
64d368d
[Bugfix] fix CI failure (#2884)
RuixiangMa Apr 17, 2026
f2edb81
[Cleanup] Remove dead runtime.defaults config parameters (#2343)
NickCao Apr 17, 2026
1637dba
[skip CI][Docs] Add Qwen3-Omni and Qwen3-TTS performance blog and fig…
Shirley125 Apr 17, 2026
b5ddff7
Nextstep online e2e (#2107)
Joshna-Medisetty Apr 17, 2026
f346f2f
Add Teacache Support for LongCat Image (#1487)
alex-jw-brooks Apr 17, 2026
5a68c21
[skip ci][recipe] draft vllm-omni recipes (#2646)
hsliuustc0106 Apr 18, 2026
4f71f73
[Docs] Update WeChat QR code for community support (#2895)
david6666666 Apr 18, 2026
d2c23d7
[Refactor] Remove resampy dependency (#2891)
NickCao Apr 18, 2026
4124a1f
[Feature]Support audio streaming input and output-phase2 (#2581)
Shirley125 Apr 18, 2026
768931e
[BugFix]: Fix multi-stage cfg bug (#2801)
princepride Apr 18, 2026
fe6cec6
[doc][skip ci] remove redundant content in readme (#2901)
Shirley125 Apr 18, 2026
9cf1fe7
[Feat] cache-dit for GLM-Image (#1399)
RuixiangMa Apr 18, 2026
9313f37
[Agent] Add NPU main2main skill (#2858)
gcanlin Apr 18, 2026
a683b1d
[Bugfix][VoxCPM2] Fix voice-clone decode loop by padding prefill prom…
Sy0307 Apr 18, 2026
a390381
[Config Refactor][2/N] Pipeline + Deploy Config Schema (#2383)
lishunyang12 Apr 19, 2026
26edc7f
[Bugfix][VoxCPM2]: Fix vectorized_gather OOB under concurrent prefill…
Sy0307 Apr 19, 2026
1568451
perf(helios): replace strided RoPE with stack+flatten for contiguous …
willamhou Apr 19, 2026
93beef1
[Bugfix] diffusion end points allow model mismatch (#2805)
xiaohajiayou Apr 19, 2026
68f28f9
[Feat] Support layerwise CPU offloading for more videogen models (#2018)
yuanheng-zhao Apr 19, 2026
cd384d9
[Config Refactor 2.5/N] Centralize pipeline registry (#2915)
lishunyang12 Apr 19, 2026
78f237e
[Perf] Optimize Wan2.2 device free on image preprocess (#2852)
fan2956 Apr 20, 2026
d435fe0
[Docs] update documents (#2921)
R2-Y Apr 20, 2026
0393c58
[BugFix] Fixed the issue where --no-async-chunk was not working. (#2934)
amy-why-3459 Apr 20, 2026
8a9add1
[CI] Restructure vLLM-Omni Test Layout, Fixture Scope, and Support Mo…
yenuo26 Apr 20, 2026
fb2dab9
Merge pull request #2 from yinpeiqi/stagedev
yinpeiqi Apr 20, 2026
b1c8ca4
Merge remote-tracking branch 'upstream/main' into support-stage-scale…
yinpeiqi Apr 20, 2026
b20aa48
fix precommit
yinpeiqi Apr 20, 2026
1ae5af9
update yaml
yinpeiqi Apr 20, 2026
17a057b
fix precommit
yinpeiqi Apr 20, 2026
2 changes: 1 addition & 1 deletion .buildkite/test-nightly.yml
@@ -552,7 +552,7 @@ steps:
   - label: ":full_moon: Diffusion X2V · Accuracy Test"
     timeout_in_minutes: 180
     commands:
-      - pytest -s -v tests/e2e/accuracy/wan22_i2v/test_wan22_i2v_video_similarity.py --run-level advanced_model
+      - pytest -s -v tests/e2e/accuracy/wan22_i2v/test_wan22_i2v_video_similarity.py -m advanced_model --run-level advanced_model
     agents:
       queue: "mithril-h100-pool"
     plugins:
27 changes: 27 additions & 0 deletions .buildkite/test-ready.yml
@@ -367,6 +367,33 @@ steps:
           volumes:
             - "/fsx/hf_cache:/fsx/hf_cache"

+  - label: "Qwen3-TTS Base E2E Test (ModelRunner V2)"
+    depends_on: upload-ready-pipeline
+    soft_fail:
+      - exit_status: 1
+    commands:
+      - |
+        timeout 20m bash -c '
+          export VLLM_LOGGING_LEVEL=DEBUG
+          export VLLM_WORKER_MULTIPROC_METHOD=spawn
+          export VLLM_ALLOW_LONG_MAX_MODEL_LEN="1"
+          export VLLM_OMNI_USE_V2_RUNNER="1"
+          pytest -s -v tests/e2e/online_serving/test_qwen3_tts_base.py -m "core_model" --run-level "core_model"
+        '
+    agents:
+      queue: "gpu_1_queue"
+    plugins:
+      - docker#v5.2.0:
+          image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+          always-pull: true
+          propagate-environment: true
+          shm-size: "8gb"
+          environment:
+            - "HF_HOME=/fsx/hf_cache"
+            - "HF_TOKEN"
+          volumes:
+            - "/fsx/hf_cache:/fsx/hf_cache"
+
   - label: "Voxtral-TTS E2E Test"
     timeout_in_minutes: 20
     depends_on: upload-ready-pipeline
300 changes: 300 additions & 0 deletions .claude/skills/vllm-omni-npu-upgrade/SKILL.md
@@ -0,0 +1,300 @@
---
name: vllm-omni-npu-model-runner-upgrade
description: "Upgrade vllm-omni NPU model runners (OmniNPUModelRunner, NPUARModelRunner, NPUGenerationModelRunner) to align with the latest vllm-ascend NPUModelRunner while preserving omni-specific logic."
---

# vLLM-Omni NPU Model Runner Upgrade Skill

## Overview

This skill describes how to upgrade vllm-omni's NPU model runners so that they align with the latest vllm-ascend codebase while preserving omni-specific enhancements. The NPU runners run omni multimodal models (such as Qwen3-Omni, Bagel, and MiMoAudio) on Ascend NPUs.

## File Structure

### NPU Model Runner Files
```
vllm-omni/vllm_omni/platforms/npu/worker/
├── __init__.py
├── npu_model_runner.py # OmniNPUModelRunner (base class)
├── npu_ar_model_runner.py # NPUARModelRunner (autoregressive)
├── npu_ar_worker.py # AR worker
├── npu_generation_model_runner.py # NPUGenerationModelRunner (diffusion/non-AR)
└── npu_generation_worker.py # Generation worker
```

### GPU Reference Files (for omni-specific logic sync)
```
vllm-omni/vllm_omni/worker/
├── __init__.py
├── gpu_model_runner.py # OmniGPUModelRunner
├── gpu_ar_model_runner.py # GPUARModelRunner
├── gpu_ar_worker.py
├── gpu_generation_model_runner.py
├── gpu_generation_worker.py
├── mixins.py
├── base.py
└── gpu_memory_utils.py
```

### vllm-ascend Reference Files
```
vllm-ascend/vllm_ascend/worker/
├── model_runner_v1.py # NPUModelRunner (base class to copy from)
├── npu_input_batch.py
├── block_table.py
├── pcp_utils.py
└── worker.py
```

## Inheritance Hierarchy

```
                 GPUModelRunner (vllm)
                           |
          +----------------+----------------+
          |                                 |
 OmniGPUModelRunner           NPUModelRunner (vllm-ascend)
 (vllm_omni/worker)           (vllm_ascend/worker)
          |                                 |
          +----------- OmniNPUModelRunner --+
                     (multiple inheritance)
                           |
           +---------------+---------------+
           |                               |
   NPUARModelRunner            NPUGenerationModelRunner
   (autoregressive)            (non-autoregressive/diffusion)
```
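
The diamond above can be illustrated with stub classes. This is a minimal sketch, not the real runners: the class bodies and the `describe` method are placeholders, and the base-class order of `OmniNPUModelRunner` (omni first, then vllm-ascend) is an assumption made for the example.

```python
# Stub classes only: names mirror the diagram, bodies are placeholders.
class GPUModelRunner:
    def describe(self):
        return ["GPUModelRunner"]


class OmniGPUModelRunner(GPUModelRunner):
    def describe(self):
        return ["OmniGPUModelRunner"] + super().describe()


class NPUModelRunner(GPUModelRunner):
    def describe(self):
        return ["NPUModelRunner"] + super().describe()


# Assumed base order for illustration: omni logic first, then the NPU base.
class OmniNPUModelRunner(OmniGPUModelRunner, NPUModelRunner):
    def describe(self):
        return ["OmniNPUModelRunner"] + super().describe()


class NPUARModelRunner(OmniNPUModelRunner):
    pass


class NPUGenerationModelRunner(OmniNPUModelRunner):
    pass


# Method resolution order decides which parent's method runs first.
mro_names = [cls.__name__ for cls in NPUARModelRunner.__mro__]
call_chain = NPUARModelRunner().describe()
print(mro_names)
print(call_chain)
```

With this base order, a `super()` call inside the omni GPU class resolves to the NPU base class before reaching the shared vllm base, which is why a method copied from vllm-ascend and the omni blocks layered on top can interact in non-obvious ways.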

## Omni-Specific Comment Markers

Omni-specific logic is marked with comment blocks:
```python
# -------------------------------------- Omni-new -------------------------------------------------
# ... omni-specific code ...
# -------------------------------------- Omni-new -------------------------------------------------
```

Or a simpler variation, closed by a plain dashed line:
```python
# -------------------------------------- Omni-new -------------------------------------------------
# ------------------------------------------------------------------------------------------------
```

**Important**:
- Always preserve and add these markers when modifying code.
- **The reference documents (`references/omni-specific-blocks.md`) may not be up-to-date.** Always grep for `Omni-new` in the GPU implementations to find the authoritative list of omni-specific blocks.
- When you discover new omni-specific code that is not documented in the references, please update the reference files.
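
As a rough complement to grepping, the marker pairs can be turned into line ranges. This is a sketch under assumptions: the regexes and the `find_omni_blocks` helper are hypothetical (not part of vllm-omni), and it assumes a block opens with an `Omni-new` line and closes with either another `Omni-new` line or a plain dashed comment line.

```python
import re
from typing import Optional

# Hypothetical patterns for the two marker styles shown above.
OPEN = re.compile(r"^\s*#\s*-{3,}\s*Omni-new\s*-{3,}\s*$")
CLOSE_PLAIN = re.compile(r"^\s*#\s*-{10,}\s*$")


def find_omni_blocks(source: str) -> list[tuple[int, Optional[int]]]:
    """Return 1-indexed (start, end) line pairs; end is None if unterminated."""
    blocks = []
    start = None
    for lineno, line in enumerate(source.splitlines(), start=1):
        is_open = bool(OPEN.match(line))
        is_plain = bool(CLOSE_PLAIN.match(line))
        if start is None:
            if is_open:
                start = lineno  # first marker opens a block
        elif is_open or is_plain:
            blocks.append((start, lineno))  # second marker closes it
            start = None
    if start is not None:
        blocks.append((start, None))  # marker without a closing partner
    return blocks


sample = """\
def _model_forward(self):
    hidden = 1
    # ------ Omni-new ------
    extra = 2
    # ------ Omni-new ------
    out = 3
    # ------ Omni-new ------
    more = 4
    # ------------------------
"""
blocks = find_omni_blocks(sample)
print(blocks)  # [(3, 5), (7, 9)]
```

An unterminated block (an `end` of `None`) is a hint that a marker was dropped during a copy from vllm-ascend.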

## Key Methods Requiring Attention

### OmniNPUModelRunner (npu_model_runner.py)

| Method | Description | Omni-Specific Logic |
|--------|-------------|---------------------|
| `load_model` | Load model and initialize talker_mtp | Uses `ACLGraphWrapper` instead of `CUDAGraphWrapper`, initializes talker buffers |
| `_dummy_run` | Warmup/profiling run | talker_mtp dummy forward, `extract_multimodal_outputs` |
| `_model_forward` | Forward pass wrapper | Injects `model_kwargs_extra`, wraps with `OmniOutput`, NPU-specific graph updates |
| `_talker_mtp_forward` | Talker MTP forward for Qwen3-Omni | Uses `set_ascend_forward_context` |

### NPUARModelRunner (npu_ar_model_runner.py)

| Method | Description | Omni-Specific Logic |
|--------|-------------|---------------------|
| `__init__` | Initialize with KV transfer manager | `OmniKVTransferManager` setup |
| `execute_model` | Main inference entry | KV transfer handling, `_update_states` override, `extract_multimodal_outputs` |
| `sample_tokens` | Token sampling | Hidden states extraction, multimodal outputs processing, `OmniModelRunnerOutput` |
| `_resolve_global_request_id` | Request ID resolution | For disaggregated inference |

### NPUGenerationModelRunner (npu_generation_model_runner.py)

| Method | Description | Omni-Specific Logic |
|--------|-------------|---------------------|
| `_update_request_states` | Update request states for async chunk | async_chunk handling |
| `execute_model` | Generation forward | async_chunk, `seq_token_counts`, `_run_generation_model` |
| `sample_tokens` | Output processing | multimodal output packaging to `OmniModelRunnerOutput` |
| `_dummy_run` | Dummy run override | model_kwargs initialization, multimodal extraction |
| `_run_generation_model` | Run generation model | Calls `_model_forward` with sampler |

## Upgrade Workflow

### Step 1: Preparation

1. **Identify target versions** (use the `gh` CLI to check):
   - vllm-omni: we use the `main` branch
   - Check the last release of vllm-omni
   - vllm-ascend: use the latest local vllm-ascend code directly

2. **Check GPU-side changes** (since last release):
   ```bash
   cd /root/vllm-workspace/vllm-omni
   git log --oneline --since="<last-release-date>" -- vllm_omni/worker/
   ```

3. **Read latest vllm-ascend code**:
   - vllm-ascend changes are not tracked; use the latest code from `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py` directly
   - Copy the relevant methods and re-insert omni-specific blocks

### Step 2: Analyze Omni-Specific Logic

For each NPU model runner file:

1. **Extract existing omni-specific blocks**:
   ```bash
   grep -n "Omni-new" vllm_omni/platforms/npu/worker/npu_model_runner.py
   ```

2. **Document each omni block**:
   - Which method it belongs to
   - What functionality it provides
   - Dependencies on other omni code

### Step 3: Update Base Class (OmniNPUModelRunner)

**Note**: Always check the GPU implementation `gpu_model_runner.py` for any new omni logic not yet documented in references.

1. **Read the latest vllm-ascend `NPUModelRunner.load_model`**
2. **Copy the method, keeping the structure**
3. **Re-insert omni-specific logic** (check GPU `gpu_model_runner.py` for the authoritative list):
   - Replace `CUDAGraphWrapper` with `ACLGraphWrapper`
   - Keep talker_mtp initialization
   - Preserve buffer allocations for talker
   - Check for any new omni blocks added since last sync

4. **Update `_dummy_run`**:
   - Copy from vllm-ascend
   - Compare with GPU `_dummy_run` for omni-specific blocks
   - Re-insert all `Omni-new` marked code from GPU version

5. **Update `_model_forward`**:
   - Keep the omni wrapper logic
   - Update NPU-specific parts (graph params, SP all-gather)
   - Check GPU version for any new omni logic

### Step 4: Update AR Model Runner

1. **Compare with GPU `gpu_ar_model_runner.py`** for any new omni features
2. **Copy `execute_model` from vllm-ascend**
3. **Re-insert omni blocks** (reference `references/omni-specific-blocks.md`, but note it may be incomplete):
   - **IMPORTANT**: Always check the GPU implementation `gpu_ar_model_runner.py` for all `Omni-new` marked code blocks
   - The reference doc may not include newly added omni logic; treat it as a starting point, not an exhaustive list
   - When discovering new omni code blocks, please update `references/omni-specific-blocks.md`
   - Common omni blocks include but are not limited to: KV transfer, multimodal outputs, sampling_metadata handling, etc.

4. **Update `sample_tokens`** (also compare with GPU implementation):
   - Compare with `gpu_ar_model_runner.py`'s `sample_tokens` method
   - Identify all `Omni-new` marked code blocks
   - Ensure NPU version includes all omni-specific logic

### Step 5: Update Generation Model Runner

**Note**: Generation model runner may have unique omni logic for diffusion/non-AR models.

1. **Compare with GPU `gpu_generation_model_runner.py`** and grep for all `Omni-new` blocks
2. **Update `execute_model`**:
   - Check GPU version for all omni-specific blocks
   - Keep async_chunk handling
   - Keep `seq_token_counts` injection
   - Update forward/context setup from vllm-ascend
   - Look for any new omni logic not documented in references

3. **Update `_dummy_run`**:
   - Copy from vllm-ascend base
   - Compare with GPU `_dummy_run` if it exists
   - Re-insert all omni-specific logic

### Step 6: Update Imports

Check and update imports at the top of each file:

```python
# Common vllm-ascend imports
from vllm_ascend.ascend_forward_context import get_forward_context, set_ascend_forward_context
from vllm_ascend.attention.attention_v1 import AscendAttentionState
from vllm_ascend.attention.utils import using_paged_attention
from vllm_ascend.compilation.acl_graph import ACLGraphWrapper, update_full_graph_params
from vllm_ascend.ops.rotary_embedding import update_cos_sin
from vllm_ascend.utils import enable_sp, lmhead_tp_enable
from vllm_ascend.worker.model_runner_v1 import SEQ_LEN_WITH_MAX_PA_WORKSPACE, NPUModelRunner

# Omni-specific imports
from vllm_omni.model_executor.models.output_templates import OmniOutput
from vllm_omni.worker.gpu_model_runner import OmniGPUModelRunner
from vllm_omni.outputs import OmniModelRunnerOutput
from vllm_omni.distributed.omni_connectors.kv_transfer_manager import OmniKVTransferManager
```

### Step 7: Sync GPU-Side Omni Changes

1. **Check recent GPU worker changes**:
   ```bash
   git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_model_runner.py
   git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_ar_model_runner.py
   ```

2. **Identify new omni features** that need to be ported to NPU

3. **Apply corresponding changes** to NPU runners
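
One way to spot omni features that still need porting is to compare which methods carry `Omni-new` markers on each side. The sketch below is a heuristic under assumptions: `methods_with_omni_marker` and `missing_on_npu` are hypothetical helpers, and the enclosing method is taken to be the most recent `def` line, which nested functions or module-level markers can confuse.

```python
import re

# Matches "def name(" or "async def name(" at any indentation.
DEF = re.compile(r"^\s*(?:async\s+)?def\s+(\w+)\s*\(")


def methods_with_omni_marker(source: str) -> set[str]:
    """Names of functions that contain at least one Omni-new marker line."""
    current = None
    found = set()
    for line in source.splitlines():
        match = DEF.match(line)
        if match:
            current = match.group(1)  # remember the latest def seen
        elif "Omni-new" in line and current is not None:
            found.add(current)
    return found


def missing_on_npu(gpu_source: str, npu_source: str) -> list[str]:
    """Methods marked on the GPU side but not on the NPU side."""
    gpu = methods_with_omni_marker(gpu_source)
    npu = methods_with_omni_marker(npu_source)
    return sorted(gpu - npu)


# Tiny made-up sources standing in for the GPU and NPU runner files.
gpu_sample = """\
def execute_model(self):
    # ---- Omni-new ----
    pass
def sample_tokens(self):
    # ---- Omni-new ----
    pass
"""
npu_sample = """\
def execute_model(self):
    # ---- Omni-new ----
    pass
def sample_tokens(self):
    pass
"""
todo = missing_on_npu(gpu_sample, npu_sample)
print(todo)  # ['sample_tokens']
```

In practice the two sources would be read from the matching GPU and NPU runner files; a non-empty result only flags candidates and does not replace reading the `git diff` output.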

### Step 8: Validation

1. **Run a syntax check** (`py_compile` only verifies that the files parse; it does not check types):
   ```bash
   cd /root/vllm-workspace/vllm-omni
   python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py
   python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py
   python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py
   ```

2. **Run import test**:
   ```bash
   python -c "from vllm_omni.platforms.npu.worker import *"
   ```

3. **Run model serving test** (if hardware available):
   ```bash
   vllm serve <model-path> --trust-remote-code
   ```
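
The `py_compile` step can also be scripted so that all three runner files are checked in one pass and failures are collected rather than stopping at the first one. This is a minimal sketch: `check_syntax` is a hypothetical helper, and the paths are the ones listed above, relative to the vllm-omni checkout.

```python
import py_compile

# Paths from the validation step, relative to the vllm-omni repository root.
NPU_RUNNER_FILES = [
    "vllm_omni/platforms/npu/worker/npu_model_runner.py",
    "vllm_omni/platforms/npu/worker/npu_ar_model_runner.py",
    "vllm_omni/platforms/npu/worker/npu_generation_model_runner.py",
]


def check_syntax(paths):
    """Compile each file and return {path: error message} for failures."""
    failures = {}
    for path in paths:
        try:
            # doraise=True turns syntax errors into PyCompileError
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures[str(path)] = exc.msg
    return failures
```

Calling `check_syntax(NPU_RUNNER_FILES)` from the repository root returns an empty dict when every file parses. Like the shell commands, this catches syntax errors only; import errors and signature mismatches still need the import and serving tests.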

## Common Pitfalls

### 1. Forward Context Differences
- GPU uses `set_forward_context`
- NPU uses `set_ascend_forward_context`
- Parameters may differ slightly

### 2. Graph Wrapper Differences
- GPU: `CUDAGraphWrapper`
- NPU: `ACLGraphWrapper`
- Constructor parameters may differ

### 3. Buffer Creation
- GPU: `_make_buffer` returns a different structure than on NPU
- NPU: may need the `numpy=True/False` parameter

### 4. Attention Metadata
- GPU: Uses vllm attention metadata builders
- NPU: Uses `AscendCommonAttentionMetadata`

### 5. Sampling
- GPU: Uses vllm sampler
- NPU: Uses `AscendSampler`

## Checklist Before Commit

- [ ] All omni-specific comment markers preserved
- [ ] New omni logic from GPU side synced
- [ ] Imports updated to latest vllm-ascend
- [ ] No `CUDAGraphWrapper` references in NPU code
- [ ] `set_ascend_forward_context` used instead of `set_forward_context`
- [ ] `ACLGraphWrapper` used for talker_mtp wrapping
- [ ] Type hints match vllm-ascend signatures
- [ ] No duplicate code blocks
- [ ] Python syntax valid (py_compile passes)
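
Two of the checklist items (no `CUDAGraphWrapper`, no bare `set_forward_context`) lend themselves to a mechanical scan. The sketch below is an assumption-laden helper, not an existing tool: `find_forbidden` is hypothetical, and it drops everything after the first `#` on a line, which is naive when `#` appears inside a string literal.

```python
import re

# GPU-only identifiers that the checklist says must not appear in NPU code.
FORBIDDEN = {
    "CUDAGraphWrapper": re.compile(r"\bCUDAGraphWrapper\b"),
    "set_forward_context": re.compile(r"\bset_forward_context\b"),
}


def find_forbidden(source: str) -> list[tuple[int, str]]:
    """Return (line number, identifier) for each GPU-only name used in code."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore comments (naive split)
        for name, pattern in FORBIDDEN.items():
            if pattern.search(code):
                hits.append((lineno, name))
    return hits


# Made-up snippet: lines 4 and 5 use the GPU-only names.
npu_sample = (
    "with set_ascend_forward_context(None, cfg):\n"
    "    pass\n"
    "# Replaces CUDAGraphWrapper on GPU\n"
    "wrapped = CUDAGraphWrapper(model)\n"
    "with set_forward_context(None, cfg):\n"
)
hits = find_forbidden(npu_sample)
print(hits)  # [(4, 'CUDAGraphWrapper'), (5, 'set_forward_context')]
```

`set_ascend_forward_context` is not flagged because the pattern requires the exact identifier `set_forward_context`.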

## Reference Files for Comparison

When upgrading, keep these files open for reference:

1. **vllm-ascend NPUModelRunner**: `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py`
2. **vllm GPUModelRunner**: `/root/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py`
3. **vllm-omni OmniGPUModelRunner**: `/root/vllm-workspace/vllm-omni/vllm_omni/worker/gpu_model_runner.py`