Merged
141 commits
389979e
add async
yinpeiqi Feb 7, 2026
b9bc8e3
init runnable async omni
yinpeiqi Feb 7, 2026
fa1099b
temp
yinpeiqi Feb 10, 2026
2140b90
update async omni
yinpeiqi Feb 11, 2026
65cafdc
refactor init
yinpeiqi Feb 11, 2026
1078d57
add next stage without input processor
yinpeiqi Feb 11, 2026
ad1b313
move input processor to engine
yinpeiqi Feb 11, 2026
378f631
decouple input processor
yinpeiqi Feb 11, 2026
3fd53a7
refactor output processor
yinpeiqi Feb 11, 2026
97bc157
remove omni input processor
yinpeiqi Feb 12, 2026
b25a26f
use orchestrator
yinpeiqi Feb 12, 2026
01319dd
update
yinpeiqi Feb 12, 2026
d76f9b5
add metrics
yinpeiqi Feb 12, 2026
6093ea5
add download
yinpeiqi Feb 24, 2026
fc61104
add support for diffusion model
yinpeiqi Feb 28, 2026
a20f291
add doc
yinpeiqi Feb 28, 2026
0173b6c
update e2e
yinpeiqi Mar 2, 2026
5445e58
add precommit
yinpeiqi Mar 2, 2026
6c689f8
fix main
yinpeiqi Mar 2, 2026
421a46b
[draft] add basic support for bagel
yinpeiqi Mar 2, 2026
f82c940
add async chunk
yinpeiqi Mar 3, 2026
d8d1072
add qwen3 example
yinpeiqi Mar 3, 2026
8290c5d
update test
yinpeiqi Mar 4, 2026
d6e66c6
move init to engine
yinpeiqi Mar 4, 2026
da80711
rename files
yinpeiqi Mar 4, 2026
5f5d6b1
rename output handler
yinpeiqi Mar 5, 2026
9f3e156
add doc
yinpeiqi Mar 5, 2026
63f3b02
cleanup
yinpeiqi Mar 5, 2026
0b65a8b
add test case
yinpeiqi Mar 5, 2026
1659c7b
update doc
yinpeiqi Mar 5, 2026
6992042
add omni base and omni
yinpeiqi Mar 6, 2026
e31cc45
use janus queue
yinpeiqi Mar 6, 2026
67b392b
update doc
yinpeiqi Mar 6, 2026
1fc31b3
update
yinpeiqi Mar 6, 2026
4cd2d0e
update download
yinpeiqi Mar 9, 2026
f1d0ef2
update import
yinpeiqi Mar 9, 2026
d39053d
update test
yinpeiqi Mar 9, 2026
8a39b6c
update test
yinpeiqi Mar 9, 2026
0d349d2
update openai api
yinpeiqi Mar 10, 2026
1cfd827
fix
yinpeiqi Mar 10, 2026
4d84ba2
update e2e
yinpeiqi Mar 10, 2026
803cef2
rebase, update shotdown
yinpeiqi Mar 10, 2026
c5953f9
add pre-commit
yinpeiqi Mar 10, 2026
0be2032
update serve cli
yinpeiqi Mar 10, 2026
9476536
add parallel init
yinpeiqi Mar 10, 2026
e6d69b8
update rebase
yinpeiqi Mar 10, 2026
923c575
update
yinpeiqi Mar 10, 2026
328f033
rebase
yinpeiqi Mar 10, 2026
a68c7ca
rebase
yinpeiqi Mar 10, 2026
20395a3
update
yinpeiqi Mar 10, 2026
03951b3
update setup
yinpeiqi Mar 10, 2026
9e89e52
update and fix
yinpeiqi Mar 10, 2026
6d945b8
refactor
yinpeiqi Mar 10, 2026
0a9beae
rm v1 files
yinpeiqi Mar 10, 2026
4ae402e
update config
yinpeiqi Mar 11, 2026
aa33e8e
update config
yinpeiqi Mar 11, 2026
697187a
remove v0
yinpeiqi Mar 11, 2026
4dcc1f1
rm v1
yinpeiqi Mar 11, 2026
066eb03
delete input processor
yinpeiqi Mar 11, 2026
bae0b83
use weak ref
yinpeiqi Mar 11, 2026
6cbe213
update
yinpeiqi Mar 11, 2026
6d09583
remove deperated
yinpeiqi Mar 11, 2026
0d74355
stage cli (#4)
wuhang2014 Mar 11, 2026
3e92219
update
yinpeiqi Mar 12, 2026
c2ee926
add get supported tasks
yinpeiqi Mar 12, 2026
349344e
fix ci
yinpeiqi Mar 12, 2026
02a57b3
update get config
yinpeiqi Mar 12, 2026
3d678fa
update doc
yinpeiqi Mar 12, 2026
6e5ee42
update tts yaml
yinpeiqi Mar 12, 2026
673a6df
fix pre commit
yinpeiqi Mar 12, 2026
cef860f
update config
yinpeiqi Mar 12, 2026
f515204
fix
yinpeiqi Mar 12, 2026
5e70d3f
fix ci
yinpeiqi Mar 12, 2026
8d01da3
fix
yinpeiqi Mar 12, 2026
cfd1d1d
resolve config (#7)
wuhang2014 Mar 12, 2026
7b5fb26
fix for qwen3 tts
yinpeiqi Mar 13, 2026
c00cb79
fix for diffusion
yinpeiqi Mar 13, 2026
0ced660
fix stage id is none
yinpeiqi Mar 13, 2026
1616281
fix
yinpeiqi Mar 13, 2026
463a96c
fix
yinpeiqi Mar 13, 2026
941205f
rm logs
yinpeiqi Mar 13, 2026
7abead8
update
yinpeiqi Mar 13, 2026
cce6a56
fix
yinpeiqi Mar 13, 2026
ad20d43
rm request output list example
yinpeiqi Mar 13, 2026
20f72c2
fix pre commit
yinpeiqi Mar 13, 2026
8b7d483
change timtout time
yinpeiqi Mar 16, 2026
42c2efe
add factory usage
yinpeiqi Mar 16, 2026
db05cbb
fix
yinpeiqi Mar 16, 2026
eaa254a
update config
yinpeiqi Mar 16, 2026
7f2d1f5
Merge branch 'main' into refactor
fake0fan Mar 16, 2026
efb2e85
fix pre commit
yinpeiqi Mar 16, 2026
8dc3b18
Merge branch 'vllm-project:main' into refactor3
yinpeiqi Mar 16, 2026
8ee93ae
Merge pull request #14 from yinpeiqi/refactor3
yinpeiqi Mar 16, 2026
256cfbc
add comfyui
yinpeiqi Mar 16, 2026
bdb6e30
fix comfyui
yinpeiqi Mar 16, 2026
ca788e0
Merge pull request #15 from yinpeiqi/refactor3
yinpeiqi Mar 16, 2026
e9190c5
update stage
yinpeiqi Mar 16, 2026
58fbf74
fix time sleep
yinpeiqi Mar 16, 2026
36099da
Fix Qwen3-TTS broken on refactor: add pipeline.yaml and fix async_chu…
linyueqian Mar 16, 2026
027a68f
Fix Base voice clone: use actual codec encoder for exact ref_code_len
linyueqian Mar 16, 2026
18f632e
Merge pull request #1 from fake0fan/refactor
yinpeiqi Mar 17, 2026
caab74c
Merge branch 'refactor3' of https://github.com/yinpeiqi/vllm-omni int…
yinpeiqi Mar 17, 2026
6caeea1
add docs for current arch
yinpeiqi Mar 17, 2026
1b06e11
fix description
yinpeiqi Mar 17, 2026
10657a1
Merge pull request #16 from yinpeiqi/refactor3
yinpeiqi Mar 17, 2026
ca88ec0
rm deparated funcs
yinpeiqi Mar 17, 2026
e03fced
rm deparated class
yinpeiqi Mar 17, 2026
2de3bed
Merge branch 'main' into refactor
yinpeiqi Mar 17, 2026
f5492fe
Merge pull request #17 from yinpeiqi/refactor3
yinpeiqi Mar 17, 2026
07f8bfa
mv worker cls utils
yinpeiqi Mar 17, 2026
9d7b905
Fix perf config: add is_comprehension to qwen3_tts stage 0
linyueqian Mar 17, 2026
41414d4
Support auto-detection for TTS perf benchmark (optional stage_config_…
linyueqian Mar 17, 2026
52aba8f
Merge pull request #2 from fake0fan/refactor
yinpeiqi Mar 17, 2026
ff94a97
change stage init to stage init utils
yinpeiqi Mar 17, 2026
df17139
Set gpu_memory_utilization to 0.08 for Qwen3-TTS (1.7B model)
linyueqian Mar 17, 2026
23ddbca
Merge pull request #18 from yinpeiqi/refactor3
yinpeiqi Mar 17, 2026
264dead
refactor
yinpeiqi Mar 17, 2026
4d7dc9e
add kv transfer inject and cfg expand
princepride Mar 17, 2026
1b1acf2
rename stage_init.py -> stage_init_utils.py and align comments with r…
princepride Mar 17, 2026
9d63d40
Merge fake0fan/refactor into fix-bagel-bugs
princepride Mar 17, 2026
a432d99
Merge pull request #19 from princepride/fix-bagel-bugs
fake0fan Mar 17, 2026
db33f8d
fix some bug
princepride Mar 17, 2026
cf99223
remove mutli image output
princepride Mar 17, 2026
2000d6f
fix: use legacy config loading path instead of StageConfigFactory
lishunyang12 Mar 17, 2026
45e8381
Merge pull request #20 from lishunyang12/fix/use-legacy-config-path
fake0fan Mar 17, 2026
c5e22f6
fix: increase gpu_memory_utilization for TTS CI on L4 GPUs
lishunyang12 Mar 17, 2026
9a46667
Merge pull request #22 from princepride/fix-bagel-bugs-2
fake0fan Mar 17, 2026
0faad47
Merge pull request #23 from lishunyang12/fix/tts-ci-gpu-memory
fake0fan Mar 17, 2026
febe9c8
fix pre-commit and glm-image
fake0fan Mar 17, 2026
4282c09
Merge branch 'main' into refactor
fake0fan Mar 18, 2026
6e37c1a
Merge branch 'refactor3' into refactor
yinpeiqi Mar 18, 2026
6687859
Merge pull request #3 from fake0fan/refactor
yinpeiqi Mar 18, 2026
4c32e7a
fix precommit, fix error
yinpeiqi Mar 18, 2026
d50a1b7
add utils for helper function
yinpeiqi Mar 18, 2026
e079342
Merge pull request #25 from yinpeiqi/refactor3
yinpeiqi Mar 18, 2026
5dc6422
fix import
yinpeiqi Mar 18, 2026
fc55262
Merge pull request #26 from yinpeiqi/refactor3
yinpeiqi Mar 18, 2026
a8ea9da
Merge branch 'main' into refactor
yinpeiqi Mar 18, 2026
0309f54
fix is alive, avoid duplicate check
yinpeiqi Mar 18, 2026
fe24400
Merge pull request #27 from yinpeiqi/refactor3
yinpeiqi Mar 18, 2026
239a3f8
Merge branch 'main' into refactor
yinpeiqi Mar 18, 2026
31 changes: 12 additions & 19 deletions .github/ISSUE_TEMPLATE/400-bug-report.yml
Original file line number Diff line number Diff line change
Expand Up @@ -74,28 +74,21 @@ body:
If relevant, add a minimal example so that we can reproduce the error by running the code. It is very important for the snippet to be as succinct (minimal) as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did: avoid any external data, and include the relevant imports, etc. For example:

```python
from vllm_omni import OmniLLM, create_ar_stage_config, create_dit_stage_config

# Create stage configurations
ar_config = create_ar_stage_config(
stage_id=0,
model_path="Qwen/Qwen3-0.6B",
input_modalities=["text"],
output_modalities=["text"]
)
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm import SamplingParams

dit_config = create_dit_stage_config(
stage_id=1,
model_path="stabilityai/stable-diffusion-2-1",
input_modalities=["text"],
output_modalities=["image"]
omni = Omni(
model="Qwen/Qwen-Image",
stage_configs_path="/path/to/stage_configs.yaml",
)

# Initialize OmniLLM
omni = OmniLLM([ar_config, dit_config])

# Generate
outputs = omni.generate(prompt="A scenic watercolor painting of a lighthouse at sunset")
prompts = [{"prompt": "A scenic watercolor painting of a lighthouse at sunset"}]
sampling_params_list = [
SamplingParams(max_tokens=1),
OmniDiffusionSamplingParams(num_outputs_per_prompt=1),
]
outputs = omni.generate(prompts=prompts, sampling_params_list=sampling_params_list)
```

If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.
Expand Down
6 changes: 3 additions & 3 deletions benchmarks/qwen3-omni/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -56,9 +56,9 @@ What it does:
- Runs `examples/offline_inference/qwen3_omni/end2end.py` with `--log-stats`.
- Uses `benchmarks/build_dataset/top100.txt` and writes to:
- Logs: `benchmarks/qwen3-omni/vllm_omni/logs/`
- `omni_llm_pipeline_text.orchestrator.stats.jsonl` — per-stage latency stats.
- `omni_llm_pipeline_text.overall.stats.jsonl` — end-to-end latency/TPS.
- `omni_llm_pipeline_text.stage{0,1,2}.log` — per-stage detailed logs/errors.
- `omni_pipeline_text.orchestrator.stats.jsonl` — per-stage latency stats.
- `omni_pipeline_text.overall.stats.jsonl` — end-to-end latency/TPS.
- `omni_pipeline_text.stage{0,1,2}.log` — per-stage detailed logs/errors.
- Outputs: `benchmarks/qwen3-omni/vllm_omni/outputs/` — ~100 text and `.wav` files.

Key checks:
Expand Down
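The stats files named above are JSON-lines. As a quick sanity check, a small stdlib-only sketch like the following can summarize one latency field from a `.stats.jsonl` file (the `e2e_latency` key is an assumption for illustration — substitute whatever keys the jsonl actually records):

```python
import json
from pathlib import Path


def summarize_stats(path: str, field: str = "e2e_latency"):
    """Read a .stats.jsonl file and report count/mean/max for one numeric field.

    NOTE: the default field name is a placeholder; inspect the jsonl to find
    the real keys emitted by the orchestrator/overall stats writers.
    """
    values = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if field in record:
            values.append(float(record[field]))
    if not values:
        return None
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "max": max(values),
    }
```

For example, `summarize_stats("benchmarks/qwen3-omni/vllm_omni/logs/omni_pipeline_text.overall.stats.jsonl")` would give a one-line view of end-to-end latency across the run.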
12 changes: 6 additions & 6 deletions benchmarks/qwen3-omni/vllm_omni/eval_qwen3_moe_omni.sh
Original file line number Diff line number Diff line change
Expand Up @@ -26,12 +26,12 @@ else
--log-stats \
--log-dir $log_dir
echo "Logs and outputs are saved in ${log_dir} and ${outputs_dir} respectively:"
echo " - omni_llm_pipeline_text run dir/base name"
echo " - omni_llm_pipeline_text.orchestrator.stats.jsonl orchestrator-stage latency stats"
echo " - omni_llm_pipeline_text.overall.stats.jsonl overall latency/TPS stats"
echo " - omni_llm_pipeline_text.stage0.log per-stage detailed logs"
echo " - omni_llm_pipeline_text.stage1.log"
echo " - omni_llm_pipeline_text.stage2.log"
echo " - omni_pipeline_text run dir/base name"
echo " - omni_pipeline_text.orchestrator.stats.jsonl orchestrator-stage latency stats"
echo " - omni_pipeline_text.overall.stats.jsonl overall latency/TPS stats"
echo " - omni_pipeline_text.stage0.log per-stage detailed logs"
echo " - omni_pipeline_text.stage1.log"
echo " - omni_pipeline_text.stage2.log"
echo "Key checks: overall.stats.jsonl for end-to-end latency/TPS; orchestrator.stats.jsonl for stable per-stage latency; stage*.log for errors or long tails."
echo " - outputs/ Generated txt and wav files, there should be 100 text and wav files generated respectively"
fi
10 changes: 1 addition & 9 deletions docs/api/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,19 +6,13 @@ Main entry points for vLLM-Omni inference and serving.

- [vllm_omni.entrypoints.async_omni.AsyncOmni][]
- [vllm_omni.entrypoints.async_omni_diffusion.AsyncOmniDiffusion][]
- [vllm_omni.entrypoints.async_omni_llm.AsyncOmniLLM][]
- [vllm_omni.entrypoints.cli.benchmark.base.OmniBenchmarkSubcommandBase][]
- [vllm_omni.entrypoints.cli.benchmark.main.OmniBenchmarkSubcommand][]
- [vllm_omni.entrypoints.cli.benchmark.serve.OmniBenchmarkServingSubcommand][]
- [vllm_omni.entrypoints.cli.serve.OmniServeCommand][]
- [vllm_omni.entrypoints.client_request_state.ClientRequestState][]
- [vllm_omni.entrypoints.omni.Omni][]
- [vllm_omni.entrypoints.omni.OmniBase][]
- [vllm_omni.entrypoints.omni_diffusion.OmniDiffusion][]
- [vllm_omni.entrypoints.omni_llm.OmniLLM][]
- [vllm_omni.entrypoints.omni_stage.OmniStage][]
- [vllm_omni.entrypoints.stage_utils.OmniStageTaskType][]
- [vllm_omni.entrypoints.zmq_utils.ZmqQueue][]
- [vllm_omni.entrypoints.omni_base.OmniBase][]

## Inputs

Expand Down Expand Up @@ -48,9 +42,7 @@ Engine classes for offline and online inference.
- [vllm_omni.engine.OmniEngineCoreOutputs][]
- [vllm_omni.engine.OmniEngineCoreRequest][]
- [vllm_omni.engine.PromptEmbedsPayload][]
- [vllm_omni.engine.arg_utils.AsyncOmniEngineArgs][]
- [vllm_omni.engine.arg_utils.OmniEngineArgs][]
- [vllm_omni.engine.input_processor.OmniInputProcessor][]
- [vllm_omni.engine.output_processor.MultimodalOutputProcessor][]
- [vllm_omni.engine.output_processor.OmniRequestState][]

Expand Down
4 changes: 2 additions & 2 deletions docs/configuration/stage_configs.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ If users want to modify some part of it. The custom stage_configs file can be in
For offline (Assume necessary dependencies have been imported):
```python
model_name = "Qwen/Qwen2.5-Omni-7B"
omni_llm = Omni(model=model_name, stage_configs_path="/path/to/custom_stage_configs.yaml")
omni = Omni(model=model_name, stage_configs_path="/path/to/custom_stage_configs.yaml")
```

For online serving:
Expand All @@ -30,7 +30,7 @@ vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to

Below is a specific example of stage_configs.yaml in Qwen2.5-omni.
```python
# stage config for running qwen2.5-omni with architecture of OmniLLM.
# stage config for running qwen2.5-omni with AsyncOmniEngine + Orchestrator runtime.
stage_args:
- stage_id: 0 # mark the unique id for each stage
runtime: # The disaggregated configuration
Expand Down
2 changes: 1 addition & 1 deletion docs/configuration/stage_configs/qwen2_5_omni.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# stage config for running qwen2.5-omni with architecture of OmniLLM.
# stage config for running qwen2.5-omni with AsyncOmniEngine + Orchestrator runtime.
stage_args:
- stage_id: 0
runtime:
Expand Down
2 changes: 1 addition & 1 deletion docs/contributing/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,7 +107,7 @@ Only specific types of PRs will be reviewed. The PR title is prefixed appropriat
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Model]` for adding a new model or improving an existing model. Model name should appear in the title.
- `[Frontend]` For changes on the vLLM-Omni frontend (e.g., OpenAI API server, `OmniLLM` class, etc.)
- `[Frontend]` For changes on the vLLM-Omni frontend (e.g., OpenAI API server, `Omni`/`AsyncOmni`, etc.)
- `[Kernel]` for changes affecting CUDA kernels or other compute kernels.
- `[Core]` for changes in the core vLLM-Omni logic (e.g., `OmniProcessor`, `OmniARScheduler`, etc.)
- `[Hardware][Vendor]` for hardware-specific changes. Vendor name should appear in the prefix, such as [Ascend] for Ascend NPUs.
Expand Down
5 changes: 0 additions & 5 deletions docs/contributing/ci/CI_5levels.md
Original file line number Diff line number Diff line change
Expand Up @@ -168,11 +168,6 @@ vllm_omni/ tests/
│ └── arg_utils.py │ └── test_arg_utils.py ⬜
├── entrypoints/ → ├── entrypoints/
│ ├── omni.py │ ├── test_omni.py ⬜ (E2E covered by e2e/offline, e2e/online)
[Collaborator comment] Any new tests for AsyncOmniEngine and Orchestrator?

[Contributor reply] will add later

│ ├── omni_llm.py │ ├── test_omni_llm.py ✅
│ ├── omni_stage.py │ ├── test_omni_stage.py ⬜ (partial in test_omni_stage_diffusion_config.py)
│ ├── omni_diffusion.py │ ├── test_omni_diffusion.py ✅
│ ├── async_omni.py │ ├── test_async_omni.py ✅ actually in e2e/online_serving/test_async_omni.py
│ ├── async_omni_diffusion.py │ ├── test_async_omni_diffusion_config.py ✅
│ ├── stage_utils.py │ ├── test_stage_utils.py ✅
│ ├── cli/ │ ├── cli/ (benchmarks/test_serve_cli.py covers CLI serve)
Expand Down
9 changes: 1 addition & 8 deletions docs/contributing/ci/tests_style.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,8 +24,6 @@ End-to-end tests verify the complete functionality of a system or component. For

- **`tests/e2e/online_serving/`**: Tests for online serving scenarios (e.g., API server tests)

**Example:** The test file for `vllm_omni/entrypoints/omni_llm.py` should be located at `tests/entrypoints/test_omni_llm.py`.

## Test Directory Structure

The ideal directory structure mirrors the source code organization. Legend: `✅` = test exists, `⬜` = suggested to add.
Expand Down Expand Up @@ -75,11 +73,6 @@ vllm_omni/ tests/
│ └── arg_utils.py │ └── test_arg_utils.py ⬜
├── entrypoints/ → ├── entrypoints/
│ ├── omni.py │ ├── test_omni.py ⬜ (E2E covered by e2e/offline, e2e/online)
│ ├── omni_llm.py │ ├── test_omni_llm.py ✅
│ ├── omni_stage.py │ ├── test_omni_stage.py ⬜ (partial in test_omni_stage_diffusion_config.py)
│ ├── omni_diffusion.py │ ├── test_omni_diffusion.py ✅
│ ├── async_omni.py │ ├── test_async_omni.py ✅ actually in e2e/online_serving/test_async_omni.py
│ ├── async_omni_diffusion.py │ ├── test_async_omni_diffusion_config.py ✅
│ ├── stage_utils.py │ ├── test_stage_utils.py ✅
│ ├── cli/ │ ├── cli/ (benchmarks/test_serve_cli.py covers CLI serve)
Expand Down Expand Up @@ -170,7 +163,7 @@ vllm_omni/ tests/

### Naming Conventions

- **Unit Tests**: Use `test_<module_name>.py` format. Example: `omni_llm.py` → `test_omni_llm.py`
- **Unit Tests**: Use `test_<module_name>.py` format. Example: `stage_utils.py` → `test_stage_utils.py`

- **E2E Tests**: Place in `tests/e2e/offline_inference/` or `tests/e2e/online_serving/` with descriptive names. Example: `tests/e2e/offline_inference/test_qwen3_omni.py`, `tests/e2e/offline_inference/test_diffusion_model.py`

Expand Down
62 changes: 27 additions & 35 deletions docs/contributing/model/adding_omni_model.md
Original file line number Diff line number Diff line change
Expand Up @@ -330,54 +330,46 @@ Stage transitions are the mechanism by which outputs from one stage are converte

### Where Stage Transitions Are Called

[Collaborator comment] we should also change the corresponding diffusion models docs

Stage transitions happen automatically in the orchestrator (`OmniLLM` class) during the generation loop. Here's the detailed flow:
Stage transitions happen automatically in the runtime orchestrator. Here's the detailed flow:

1. **Location**: `vllm_omni/entrypoints/omni_llm.py` in the `_run_generation()` method
1. **Location**: `vllm_omni/engine/orchestrator.py` in `_forward_to_next_stage()`
2. **Trigger**: When a stage completes processing and produces outputs
3. **Execution Flow**:
```python
# In omni_llm.py, _run_generation() method (around line 345-460)

# Main orchestrator loop polls each stage for completed requests
for stage_id, stage in enumerate(self.stage_list):
result = stage.try_collect() # Get completed request
if result is None:
continue

# Store outputs from this stage
engine_outputs = _load(result, obj_key="engine_outputs", shm_key="engine_outputs_shm")
stage.set_engine_outputs(engine_outputs)

# Check if there's a next stage to forward to
next_stage_id = stage_id + 1
if next_stage_id < len(self.stage_list):
next_stage: OmniStage = self.stage_list[next_stage_id]

# THIS IS WHERE STAGE TRANSITION HAPPENS
next_inputs = next_stage.process_engine_inputs(
self.stage_list,
[request_id_to_prompt[req_id]]
)

# Submit to next stage
task = {
"type": OmniStageTaskType.GENERATE,
"request_id": req_id,
"engine_inputs": next_inputs[0],
"sampling_params": sampling_params_list[next_stage_id],
}
next_stage.submit(task)
# In orchestrator.py
next_stage_id = stage_id + 1
next_client = self.stage_clients[next_stage_id]
params = req_state.sampling_params_list[next_stage_id]

# Save current stage outputs so stage_input_processors can consume them.
self.stage_clients[stage_id].set_engine_outputs([output])

# THIS IS WHERE STAGE TRANSITION HAPPENS
next_inputs = next_client.process_engine_inputs(
stage_list=self.stage_clients,
prompt=req_state.prompt,
)

# Build and submit request(s) to the next stage.
for next_input in next_inputs:
request = build_engine_core_request_from_tokens(
request_id=req_id,
prompt=next_input,
params=params,
model_config=self.stage_vllm_configs[next_stage_id].model_config,
)
await next_client.add_request_async(request)
```

### How Stage Transitions Work

The stage transition process follows these steps:

1. **Stage Completion**: When a stage finishes processing a request, it stores outputs via `stage.set_engine_outputs(engine_outputs)`
1. **Stage Completion**: When a stage finishes processing a request, the orchestrator stores outputs via `stage_client.set_engine_outputs(...)`

2. **Transition Detection**: The orchestrator checks if there's a next stage and calls `process_engine_inputs()` on it

3. **Input Processing**: The `process_engine_inputs()` method in `OmniStage` (`omni_stage.py`) handles the transition:
3. **Input Processing**: The stage input processor configured in stage YAML (under `vllm_omni/model_executor/stage_input_processors/`) handles the transition:
```python
def process_engine_inputs(
self, stage_list: list[Any], prompt: OmniTokensPrompt | TextPrompt = None
Expand Down
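To make the transition contract above concrete, here is a deliberately minimal, hypothetical input processor. The class name, constructor, and the `token_ids` key are illustrative stand-ins, not the real vllm_omni API — real processors live under `vllm_omni/model_executor/stage_input_processors/` and typically do more work (e.g. mapping hidden states or codec tokens between modalities):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TokensPrompt:
    # Minimal stand-in for an OmniTokensPrompt-style next-stage input.
    prompt_token_ids: list[int] = field(default_factory=list)


class PassthroughInputProcessor:
    """Illustrative transition: forward the previous stage's generated
    token ids as the next stage's prompt, one input per engine output."""

    def __init__(self, prev_stage_id: int):
        self.prev_stage_id = prev_stage_id

    def process_engine_inputs(
        self, stage_list: list[Any], prompt: Any = None
    ) -> list[TokensPrompt]:
        # Read the outputs the orchestrator stored on the previous stage
        # via set_engine_outputs(...).
        outputs = stage_list[self.prev_stage_id].get_engine_outputs()
        return [
            TokensPrompt(prompt_token_ids=list(o["token_ids"])) for o in outputs
        ]
```

The key design point is that the orchestrator stays generic: it only calls `process_engine_inputs()`, and all model-specific knowledge about how stage N's outputs become stage N+1's inputs lives in the processor configured in the stage YAML.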
20 changes: 11 additions & 9 deletions docs/contributing/profiling.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,30 +23,32 @@ export VLLM_PROFILER_MAX_ITERS=1
The profiler defaults to running across all stages, but it is highly recommended to profile specific stages by passing the stages list, to avoid producing overly large trace files:
```python
# Profile all stages
omni_llm.start_profile()
omni.start_profile()

# Only profile Stage 1
omni_llm.start_profile(stages=[1])
omni.start_profile(stages=[1])
```

```python
# Stage 0 (Thinker) and Stage 2 (Audio Decoder) for qwen omni
omni_llm.start_profile(stages=[0, 2])
omni.start_profile(stages=[0, 2])
```

**Python Usage**: Wrap your generation logic with `start_profile()` and `stop_profile()`.

```python
from vllm_omni import omni_llm
from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")

profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

# 1. Start profiling if enabled
if profiler_enabled:
omni_llm.start_profile(stages=[0])
omni.start_profile(stages=[0])

# Initialize generator
omni_generator = omni_llm.generate(prompts, sampling_params_list, py_generator=args.py_generator)
omni_generator = omni.generate(prompts, sampling_params_list, py_generator=args.py_generator)

total_requests = len(prompts)
processed_count = 0
Expand All @@ -57,21 +59,21 @@ for stage_outputs in omni_generator:
# ... [Output processing logic for text/audio would go here] ...

# Update count to track when to stop profiling
processed_count += len(stage_outputs.request_output)
processed_count += 1

# 2. Check if all requests are done to stop the profiler safely
if profiler_enabled and processed_count >= total_requests:
print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")

# Stop the profiler while workers are still active
omni_llm.stop_profile()
omni.stop_profile()

# Wait for traces to flush to disk
print("[Info] Waiting 30s for workers to write trace files to disk...")
time.sleep(30)
print("[Info] Trace export wait time finished.")

omni_llm.close()
omni.close()
```


Expand Down
10 changes: 5 additions & 5 deletions docs/design/architecture_overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -67,12 +67,12 @@ According to analysis for current popular open-source models, most of them have
| Component | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| **OmniRouter** | provide an intelligent router for Omni-modality requests dispatch |
| **EntryPoints** | define the APIs for offline/online serving (APIServer, Omni/AsyncOmni) and provide the OmniStage abstraction for different AR/DiT stages |
| **EntryPoints** | define the APIs for offline/online serving (APIServer, Omni/AsyncOmni), while `AsyncOmniEngine` and `Orchestrator` coordinate multi-stage AR/DiT execution |
| **AR** | adapted for omni-modality models while inheriting efficient features from vLLM, such as cache management |
| **Diffusion** | natively implemented and optimized using acceleration components |
| **OmniConnector** | supports fully disaggregation based on E/P/D/G (Encoding/Processing/Decoding/Generation) disaggregation across stages |

Disaggregated stages are managed through configuration, such as in the Qwen3-Omni example, where stages like Thinker, Talker, and Code2wav are defined as separate OmniStage instances with specific resources and input/output type.
Disaggregated stages are managed through stage configuration. In Qwen3-Omni, Thinker/Talker/Code2wav are declared as separate configured stages, and runtime routing is handled by `Orchestrator` over `StageEngineCoreClient` / `StageDiffusionClient`.

## Main features

Expand Down Expand Up @@ -127,10 +127,10 @@ Taking **Qwen3-Omni** as an example:
The **Omni** class provides a Python interface for offline batched inference. Users initialize the Omni class with a Hugging Face model name and use the generate method, passing inputs that include both text prompts and multi-modal data:

```
# Create an omni_lm with HF model name.
# Create an omni runtime with HF model name.
from vllm_omni.entrypoints.omni import Omni

omni_lm = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")
omni = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")

# Example prompts.
om_inputs = {"prompt": prompt,
Expand All @@ -140,7 +140,7 @@ om_inputs = {"prompt": prompt,
}}

# Generate texts and audio from the multi-modality inputs.
outputs = omni_lm.generate(om_inputs, sampling_params_list)
outputs = omni.generate(om_inputs, sampling_params_list)
```

## Online Serving
Expand Down
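The stage routing described in this overview can be pictured with a toy asyncio pipeline — purely illustrative; `run_stage`, `pipeline`, and the three stage names are stand-ins, not vllm_omni classes. An orchestrator-like chain of queues lets each stage consume requests and forward results to the next, which is the essence of how disaggregated stages compose:

```python
import asyncio


async def run_stage(transform, inbox, outbox):
    # A toy stage: consume requests, apply this stage's work, forward results.
    while True:
        item = await inbox.get()
        if item is None:  # shutdown sentinel propagates down the pipeline
            await outbox.put(None)
            break
        req_id, payload = item
        await outbox.put((req_id, transform(payload)))


async def pipeline(requests):
    # Three toy stages standing in for Thinker -> Talker -> Code2wav.
    q0, q1, q2, q3 = (asyncio.Queue() for _ in range(4))
    tasks = [
        asyncio.create_task(run_stage(lambda x: x + " [text]", q0, q1)),
        asyncio.create_task(run_stage(lambda x: x + " [codes]", q1, q2)),
        asyncio.create_task(run_stage(lambda x: x + " [wav]", q2, q3)),
    ]
    for req_id, payload in enumerate(requests):
        await q0.put((req_id, payload))
    await q0.put(None)
    results = []
    while (item := await q3.get()) is not None:
        results.append(item)
    await asyncio.gather(*tasks)
    return results
```

Running `asyncio.run(pipeline(["hello"]))` returns `[(0, "hello [text] [codes] [wav]")]` — each request flows through every stage in order, while different requests can be in flight in different stages concurrently, which is the property the real orchestrator exploits.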