
[Diffusion] DreamZero world model integration with CFG parallel + OpenPI serving #2162

Open

TKONIY wants to merge 10 commits into vllm-project:main from TKONIY:feature/dreamzero-pipeline

Conversation

@TKONIY (Contributor) commented Mar 25, 2026

Summary

Current PR branch: feature/dreamzero-pipeline
Latest pushed commit: a820f779 ("tests: fix diffusion scheduler mock imports after rebase")

This PR integrates DreamZero into vllm-omni with:

  • DreamZero diffusion pipeline support
  • OpenPI-compatible robot WebSocket serving
  • DreamZero-specific transform pipeline owned by the model
  • DreamZero stage config + model-specific policy server config
  • Online serving example with bundled real videos
  • DROID sim-evals rollout client for the vLLM OpenPI endpoint
  • Explicit optional-dependency errors in the DreamZero example clients
  • Self-contained e2e tests plus optional upstream parity checks

What Changed

Model / Pipeline

  • Added DreamZero stage config: vllm_omni/model_executor/stage_configs/dreamzero.yaml
  • Added DreamZero model defaults in vllm_omni/diffusion/models/dreamzero/utils.py
  • Moved robot transforms into vllm_omni/diffusion/models/dreamzero/transform/
  • Let DreamZeroPipeline consume raw robot_obs and apply the transform inside the model (see the sketch after this list)
  • Kept reset semantics as deferred engine reset via extra_args["reset"]
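
A minimal sketch of the pipeline-owned transform flow described above; the class, method, and key names here are illustrative assumptions, not the actual vllm-omni API:

```python
# Hypothetical sketch: the pipeline owns its robot-observation transform, so
# callers pass raw robot_obs and the model applies DreamZero-specific
# preprocessing internally. All names below are illustrative only.
from dataclasses import dataclass
from typing import Any


@dataclass
class RobotObsTransform:
    """Converts raw robot observations into model-ready inputs."""

    image_size: tuple[int, int] = (224, 224)

    def __call__(self, robot_obs: dict[str, Any]) -> dict[str, Any]:
        # In practice: resize camera frames, normalize joint states, etc.
        return {
            "images": robot_obs["images"],          # camera frames
            "state": robot_obs["state"],            # proprioceptive state vector
            "prompt": robot_obs.get("prompt", ""),  # task instruction
        }


class DreamZeroPipelineSketch:
    def __init__(self) -> None:
        # The transform is owned by the model, not by the serving layer.
        self.transform = RobotObsTransform()

    def generate(self, robot_obs: dict[str, Any], extra_args: dict[str, Any]) -> Any:
        if extra_args.get("reset"):
            self._reset_state()  # deferred engine reset via extra_args["reset"]
        inputs = self.transform(robot_obs)
        return self._run_diffusion(inputs)

    def _reset_state(self) -> None: ...
    def _run_diffusion(self, inputs: dict[str, Any]) -> Any: ...
```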

Serving

  • Added model-specific PolicyServerConfig loading in vllm_omni/entrypoints/openai/realtime/robot/openpi_serving.py
  • Added optional dependency guard, structured errors, and idle timeout in vllm_omni/entrypoints/openai/realtime/robot/openpi_connection.py
  • Added create_policy_server() so OpenPI serving is enabled only when the loaded model provides policy_server_config (gating sketch after this list)
  • Sent OpenPI handshake metadata from model config instead of hardcoding it in the WebSocket connection layer
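
A minimal sketch of the opt-in gating described above, under the assumption of hypothetical names; only models that declare a policy_server_config get an OpenPI policy server, and the handshake metadata comes from the model config:

```python
# Hypothetical sketch of the opt-in gating: OpenPI serving is only wired up
# when the loaded model exposes a policy_server_config. Function and attribute
# names are illustrative, not the real vllm-omni API.
from typing import Any, Optional


def create_policy_server(model_config: Any) -> Optional["PolicyServerSketch"]:
    policy_cfg = getattr(model_config, "policy_server_config", None)
    if policy_cfg is None:
        # Model does not declare a robot policy endpoint; skip OpenPI serving.
        return None
    return PolicyServerSketch(policy_cfg)


class PolicyServerSketch:
    def __init__(self, config: Any) -> None:
        # Handshake metadata is taken from the model config instead of being
        # hardcoded in the WebSocket connection layer.
        self.metadata = getattr(config, "handshake_metadata", {})
```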

Framework Wiring

  • vllm_omni/diffusion/stage_diffusion_proc.py
    • no longer carries DreamZero-specific inline detection
    • preserves explicit model_class_name for unregistered architectures
  • vllm_omni/diffusion/diffusion_engine.py
    • deduplicates multimodal payload slicing for audio / actions
  • vllm_omni/entrypoints/utils.py
    • resolves DreamZero config from model type override
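
A minimal sketch of the model-type-to-stage-config resolution described for vllm_omni/entrypoints/utils.py; the mapping and helper name are illustrative assumptions, not the real implementation:

```python
# Hypothetical sketch: resolve a model-specific stage config from a model-type
# override, falling back to the detected model type. Names are illustrative.
from pathlib import Path

_STAGE_CONFIGS = {
    "dreamzero": Path("vllm_omni/model_executor/stage_configs/dreamzero.yaml"),
}


def resolve_stage_config(model_type: str, override: str | None = None) -> Path:
    key = (override or model_type).lower()
    try:
        return _STAGE_CONFIGS[key]
    except KeyError:
        raise ValueError(f"No stage config registered for model type {key!r}")
```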

Tests / Examples / Docs

  • Added OpenPI unit tests:
    • tests/entrypoints/openai_api/test_openpi_connection.py
    • tests/entrypoints/openai_api/test_openpi_serving.py
  • Added DreamZero e2e / example tests:
    • tests/e2e/online_serving/test_dreamzero.py
    • tests/examples/online_serving/test_dreamzero.py
  • Added optional upstream parity checks (dependency-guard sketch after this list) under:
    • tests/dreamzero/upstream/
  • Added runnable example:
    • examples/online_serving/dreamzero/
  • Added offline prediction-video export helpers:
    • examples/online_serving/dreamzero/export_prediction_video.py
    • examples/online_serving/dreamzero/generate_comparison_videos.py
  • Added DROID sim rollout client:
    • examples/online_serving/dreamzero/droid_sim_eval_client.py
  • Added model docs:
    • docs/models/dreamzero/README.md
    • docs/models/dreamzero/quick_start.md
  • Documented per-script environment requirements for:
    • vllm serve
    • bundled OpenPI client
    • prediction-video export helpers
    • DROID sim-eval client
    • optional upstream parity tests
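
Because the upstream parity checks depend on the external DreamZero repository, they only make sense as optional tests. A minimal sketch of one way such tests can guard that optional dependency; the importable module name dreamzero is an assumption, not the actual package name used in tests/dreamzero/upstream/:

```python
# Hypothetical sketch: skip upstream parity tests when the optional upstream
# DreamZero package is not importable. The module name "dreamzero" is assumed.
import pytest

upstream = pytest.importorskip("dreamzero")


def test_action_head_matches_upstream():
    # Compare vllm-omni outputs against upstream reference outputs here.
    assert upstream is not None
```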

Quick Start

Start the server

From the repository root:

CUDA_VISIBLE_DEVICES=0,1 \
examples/online_serving/dreamzero/run_server.sh

To run on a single GPU:

CUDA_VISIBLE_DEVICES=0 \
CFG_PARALLEL_SIZE=1 \
examples/online_serving/dreamzero/run_server.sh

OpenPI WebSocket endpoint:

  • ws://127.0.0.1:8000/v1/realtime/robot/openpi

Run the client

From the repository root:

python examples/online_serving/dreamzero/openpi_client.py \
  --host 127.0.0.1 \
  --port 8000

Extra client dependencies:

pip install openpi-client websockets opencv-python

The example client uses bundled real videos from:

  • examples/online_serving/dreamzero/assets/

Optional flags:

python examples/online_serving/dreamzero/openpi_client.py \
  --host 127.0.0.1 \
  --port 8000 \
  --video-dir examples/online_serving/dreamzero/assets \
  --session-id demo-session \
  --num-chunks 2
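
For orientation, a minimal client sketch using the openpi-client package is shown below. It assumes the upstream OpenPI WebsocketClientPolicy API and uses illustrative observation keys; the bundled openpi_client.py remains the authoritative example for the exact DreamZero observation schema.

```python
# Hypothetical sketch of a minimal OpenPI client call against the vLLM endpoint.
# Assumes the upstream openpi-client package exposes WebsocketClientPolicy; the
# observation keys below are placeholders, not the exact DreamZero schema.
import numpy as np
from openpi_client import websocket_client_policy

policy = websocket_client_policy.WebsocketClientPolicy(host="127.0.0.1", port=8000)

observation = {
    "image": np.zeros((224, 224, 3), dtype=np.uint8),  # camera frame placeholder
    "state": np.zeros(8, dtype=np.float32),            # proprioceptive state placeholder
    "prompt": "pick up the cube",
}

result = policy.infer(observation)
print(result)  # expected to contain the predicted action chunk
```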

Run DROID sim-eval

Run this from an environment where isaaclab, isaaclab_tasks,
sim_evals, and gymnasium are already importable.

pip install openpi-client websockets opencv-python mediapy

From the vllm-omni repository root, invoke the client through an external
Isaac Lab launcher, for example:

/path/to/isaaclab.sh -p \
  examples/online_serving/dreamzero/droid_sim_eval_client.py \
  --host 127.0.0.1 \
  --port 8000 \
  --scene 1 \
  --episodes 1 \
  --headless \
  --device cuda:0

Validation

Passed locally:

  • PYTHONPATH=. pytest tests/entrypoints/openai_api/test_openpi_serving.py tests/entrypoints/openai_api/test_openpi_connection.py -q
  • OPENPI_E2E_GPUS=0,1 PYTHONPATH=. pytest tests/e2e/online_serving/test_dreamzero.py -q --run-level=advanced_model
  • PYTHONPATH=. .venv/bin/python -m py_compile examples/online_serving/dreamzero/openpi_client.py examples/online_serving/dreamzero/droid_sim_eval_client.py

This confirms:

  • DreamZero OpenPI handshake / serving works
  • DreamZero online e2e works with real bundled videos

Performance Snapshot

Measurement scope for the numbers below:

  • eager mode (--enforce-eager)
  • no torch.compile
  • no DiT cache / dynamic cached schedule in the baseline table
  • single-request OpenPI serving path
  • DreamZero parallelism configured via stage YAML parallel_config, not CLI TP/CFG flags

Hardware environment for these measurements:

  • GPU: 4x NVIDIA RTX PRO 6000 Blackwell Server Edition, 97887 MiB VRAM each
  • GPU driver: 590.48.01
  • CPU: 2x AMD EPYC 9355 32-Core Processor (128 logical CPUs total)
  • Host RAM: 1.5 TiB

VRAM

Interpretation notes:

  • dreamzero.yaml is the default TP=1, CFG=1 baseline
  • all GPUs on this host are the same model, so the table reports only how many GPUs were used and the per-GPU memory numbers
  • TP=2, CFG=2 needs 4 GPUs and is still blocked on this host
| Mode | GPUs used | Per-GPU startup VRAM | Per-GPU peak VRAM (reserved) | Status | Notes |
|------|-----------|----------------------|------------------------------|--------|-------|
| TP=1, CFG=1 | 1 | 43.58 GiB | 52.01 GB | Measured | vllm_omni/model_executor/stage_configs/dreamzero.yaml |
| TP=1, CFG=2 | 2 | 43.58 GiB | 49.61 GB | Measured | True CFG-parallel serving via dreamzero_tp1_cfg2.yaml |
| TP=2, CFG=1 | 2 | 28.88 GiB | 32.65 GB | Measured | True TP serving via dreamzero_tp2_cfg1.yaml |
| TP=2, CFG=2 | 4 | Not measured | Not measured | Pending | Requires 4 GPUs; blocked by unrelated jobs on GPUs 2,3 |
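
For context, a minimal sketch of how per-GPU reserved VRAM can be sampled in-process with PyTorch's allocator statistics; this is illustrative only and not the exact instrumentation behind the table above:

```python
# Illustrative only: sample per-GPU reserved VRAM via PyTorch's CUDA allocator
# statistics. Not the exact reporting path used for the numbers above.
import torch


def reserved_vram_gib(device: int = 0) -> tuple[float, float]:
    current = torch.cuda.memory_reserved(device) / 2**30       # currently reserved
    peak = torch.cuda.max_memory_reserved(device) / 2**30      # peak reserved so far
    return current, peak


if torch.cuda.is_available():
    for dev in range(torch.cuda.device_count()):
        cur, peak = reserved_vram_gib(dev)
        print(f"GPU {dev}: reserved={cur:.2f} GiB, peak reserved={peak:.2f} GiB")
```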

Latency

Interpretation notes:

  • the latency metric is the server-side "DiffusionEngine.step breakdown ... total=... ms" log line
  • each row below used the same OpenPI client workload:
    • 3 action-producing infer requests plus one reset, issued in the order infer, infer, reset, infer
    • the same bundled DreamZero example videos
  • TP=1, CFG=2 reduces wall time versus baseline on this workload
  • TP=2, CFG=1 works correctly but is slower than baseline on this workload
| Mode | GPUs used | Mean latency | P50 latency | Range | Client wall time | Status | Notes |
|------|-----------|--------------|-------------|-------|------------------|--------|-------|
| TP=1, CFG=1 | 1 | 7349.93 ms | 7419.06 ms | 7145.32–7485.42 ms | 22.329 s | Measured | tmp/dreamzero_perf_yaml/tp1_cfg1.log |
| TP=1, CFG=2 | 2 | 4033.93 ms | 3863.00 ms | 3829.45–4409.35 ms | 12.365 s | Measured | tmp/dreamzero_perf_yaml/tp1_cfg2.log |
| TP=2, CFG=1 | 2 | 8636.77 ms | 8526.22 ms | 8451.61–8932.49 ms | 26.196 s | Measured | tmp/dreamzero_perf_yaml/tp2_cfg1.log |
| TP=2, CFG=2 | 4 | Not measured | Not measured | Not measured | Not measured | Pending | Requires 4 GPUs; blocked by unrelated jobs on GPUs 2,3 |
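
For reference, a minimal sketch of how the mean / P50 / range figures above can be derived from the server logs by scraping the DiffusionEngine.step breakdown lines; the exact log format is assumed from the metric description above:

```python
# Illustrative only: extract "total=... ms" values from a server log of
# "DiffusionEngine.step breakdown" lines and compute mean / P50 / range.
# The exact log format may differ from this assumed pattern.
import re
import statistics
from pathlib import Path

PATTERN = re.compile(r"DiffusionEngine\.step breakdown.*?total=([\d.]+)\s*ms")


def summarize(log_path: str) -> None:
    totals = [float(m) for m in PATTERN.findall(Path(log_path).read_text())]
    if not totals:
        print("no step-breakdown lines found")
        return
    print(f"n={len(totals)}")
    print(f"mean={statistics.mean(totals):.2f} ms")
    print(f"p50={statistics.median(totals):.2f} ms")
    print(f"range={min(totals):.2f}-{max(totals):.2f} ms")


# Example: summarize("tmp/dreamzero_perf_yaml/tp1_cfg2.log")
```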

Important Future Work

  • Design a clear and stable API and protocol for robot serving; the current PolicyServerConfig and Transform design is awkward and should be revisited.
  • Manage the KV cache with paged attention.
  • Performance optimizations, e.g., asynchronous pipelined execution.

@TKONIY force-pushed the feature/dreamzero-pipeline branch 3 times, most recently from 8a51735 to 22bfc2f, on March 26, 2026 19:16
@TKONIY force-pushed the feature/dreamzero-pipeline branch 5 times, most recently from 1d6c89f to f5bfbc9, on April 3, 2026 21:25
@TKONIY force-pushed the feature/dreamzero-pipeline branch from f5bfbc9 to 783b0f1 on April 11, 2026 01:15
@TKONIY mentioned this pull request on Apr 11, 2026
@TKONIY force-pushed the feature/dreamzero-pipeline branch from 31a9faf to e6d1229 on April 13, 2026 00:18
@TKONIY marked this pull request as ready for review on April 13, 2026 00:48
@TKONIY requested a review from hsliuustc0106 as a code owner on April 13, 2026 00:48
@chatgpt-codex-connector commented:

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@TKONIY force-pushed the feature/dreamzero-pipeline branch 2 times, most recently from 7542ada to e048473, on April 13, 2026 01:49
@linyueqian (Collaborator) left a comment:

Nice clean architecture with good separation of concerns (connection / serving / transform). A few items to address before merge.

Review threads (several now marked outdated) were opened on:

  • vllm_omni/entrypoints/openai/realtime/robot/openpi_connection.py
  • vllm_omni/entrypoints/openai/realtime/robot/openpi_serving.py
  • vllm_omni/diffusion/models/dreamzero/transform/droid.py
  • vllm_omni/diffusion/stage_diffusion_proc.py
  • vllm_omni/diffusion/diffusion_engine.py
  • vllm_omni/diffusion/models/dreamzero/pipeline_dreamzero.py
  • vllm_omni/diffusion/models/dreamzero/state_dreamzero.py

@hsliuustc0106 (Collaborator) left a comment:

BLOCKING:

  • Merge conflict — This PR is in CONFLICTING state. Please rebase onto latest main before review can proceed.

Non-blocking notes:

  1. PR description TODO says "Clear the comments" — please resolve before merge.
  2. Consider adding concrete latency / VRAM numbers to the PR description (even rough figures from the local validation runs). The current "passed locally" section lists test commands but not their measured outputs.

TKONIY added 7 commits April 19, 2026 19:07
Implement the DreamZero omni serving path as a single clean feature commit
without test- or doc-only files. This keeps the model registry/stage
detection, root-config-driven pipeline initialization, root-checkpoint
weight loading, DreamZero model/state wiring, and OpenPI serving / transform
integration required for the feature branch.

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Add a dedicated DreamZero video-latent decode helper that matches upstream WanVideoVAE decode semantics. The fix keeps forward() output as normalized video latents for serving, but documents the contract clearly and restores exact debug-video parity by inverting latent normalization in bf16 the same way as the upstream source path.

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Add concise environment guidance for DreamZero serving, bundled OpenPI client usage, and DROID sim-eval rollout usage.

Also guard optional client-side imports in the DreamZero example scripts so missing non-core dependencies fail with explicit messages instead of opaque import errors.

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Document the sim-eval launch flow without assuming Isaac Lab lives inside the vllm-omni repo, while still assuming commands are run from the repository root.

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
@TKONIY force-pushed the feature/dreamzero-pipeline branch from 88f2f22 to a820f77 on April 19, 2026 19:49
TKONIY added 3 commits April 20, 2026 18:59
Add offline example scripts to export DreamZero prediction videos and generate TP/CFG comparison outputs without changing the serving path. Document the workflow in the DreamZero quick start and example README, ignore local generated video artifacts, and add stage YAMLs for TP/CFG variants used by the comparison helper. Also update DreamZero weight loading to honor custom parameter weight loaders during remapped checkpoint loading.

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Move the upstream DreamZero policy imports in test_openpi_client_ar behind a helper so the file passes E402 without changing behavior, and restore the BasePolicy import while removing the unused cv2 dependency guard in the DROID sim-eval client.

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Run the same pre-commit --all-files pass used by CI and commit the resulting ruff/format adjustments so the DreamZero PR branch is clean under the repo's global hooks.

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
@TKONIY (Contributor, Author) commented Apr 20, 2026

@yinpeiqi @hsliuustc0106 @linyueqian
There are many unit tests under tests/dreamzero/upstream, plus an end-to-end test, that check whether our implementation aligns with upstream DreamZero's official implementation. They require a dependency on that repo. Do you prefer keeping or removing them?

@TKONIY (Contributor, Author) commented Apr 21, 2026

BLOCKING:

  • Merge conflict — This PR is in CONFLICTING state. Please rebase onto latest main before review can proceed.

Non-blocking notes:

  1. PR description TODO says "Clear the comments" — please resolve before merge.
  2. Consider adding concrete latency / VRAM numbers to the PR description (even rough figures from the local validation runs). The current "passed locally" section lists test commands but not their measured outputs.

All comments have been addressed.
