
[Model] Add LingBot-World I2V support#2073

Draft
pakkah wants to merge 2 commits into vllm-project:main from pakkah:main

Conversation

@pakkah pakkah commented Mar 22, 2026


Purpose

Add LingbotWorldPipeline support for robbyant/lingbot-world-base-cam and integrate it into the existing image-to-video example flow. This PR closes #1045.

Implementation notes:

  • Follow the DreamID-Omni integration pattern: keep the pipeline self-loading, reuse the external dependency repo via a download helper, and keep the vllm-omni-specific model adaptation local.
  • Support LingBot-World control signals through the offline serving --action-path argument, which matches the upstream LingBot-World usage pattern.
  • The API shape for model-specific control signals is still to be determined (see [RFC]: World Model Support #1987), so we leave it to a follow-up PR. In this PR, online serving remains limited to plain I2V without control signals.
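For reference, a minimal argparse sketch of how an offline example script could wire the new flag. This is illustrative only: apart from --action-path, the flag names here are a hypothetical subset of the example's CLI, not the PR's actual code.

```python
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical subset of the offline image_to_video.py CLI;
    # --action-path is the only flag this sketch is really about.
    parser = argparse.ArgumentParser(
        description="LingBot-World I2V offline example (sketch)")
    parser.add_argument("--model", required=True)
    parser.add_argument("--image", required=True)
    parser.add_argument("--prompt", default="")
    parser.add_argument(
        "--action-path", type=Path, default=None,
        help="Directory of LingBot-World control signals; omit for plain I2V")
    return parser

args = build_parser().parse_args([
    "--model", "./lingbot-world-base-cam",
    "--image", "image.jpg",
    "--action-path", "examples/00",
])
# args.action_path is a Path when provided, None for plain I2V
print(args.action_path)
```

Defaulting --action-path to None keeps the same entry point serving both modes, mirroring the two test-plan invocations below (with and without control signals).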

Test Plan

cd examples/offline_inference/image_to_video

python download_lingbot_world.py \
  --model-id robbyant/lingbot-world-base-cam \
  --output-dir ./lingbot-world-base-cam

PROMPT="$(cat /tmp/vllm-omni-dependency/lingbot-world/examples/00/prompt.txt)"

# Run with LingBot-World control signals (--action-path)
python image_to_video.py \
  --model ./lingbot-world-base-cam \
  --image /tmp/vllm-omni-dependency/lingbot-world/examples/00/image.jpg \
  --action-path /tmp/vllm-omni-dependency/lingbot-world/examples/00 \
  --prompt "$PROMPT" \
  --height 480 \
  --width 832 \
  --num-frames 161 \
  --guidance-scale 5.0 \
  --guidance-scale-high 5.0 \
  --num-inference-steps 20 \
  --flow-shift 10.0 \
  --fps 16 \
  --output lingbot_world_base_cam_examples00.mp4

# Run plain I2V without control signals
python image_to_video.py \
  --model ./lingbot-world-base-cam \
  --image /tmp/vllm-omni-dependency/lingbot-world/examples/00/image.jpg \
  --prompt "$PROMPT" \
  --height 480 \
  --width 832 \
  --num-frames 161 \
  --guidance-scale 5.0 \
  --guidance-scale-high 5.0 \
  --num-inference-steps 20 \
  --flow-shift 10.0 \
  --fps 16 \
  --output lingbot_world_base_cam_no_control.mp4
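Both runs render 161 frames at 16 fps, so the expected clip length of the output .mp4 files is easy to sanity-check:

```python
# Expected duration of the rendered clip from the CLI arguments above.
num_frames = 161
fps = 16
duration_s = num_frames / fps  # 161 / 16 = 10.0625
print(f"expected clip length: {duration_s:.4f} s")
```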

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


pakkah and others added 2 commits March 22, 2026 13:11
Signed-off-by: Zhaoxiang Huang <zhaoxiang.huang@outlook.com>
Signed-off-by: asukaqaq-s <1311722138@qq.com>
@pakkah pakkah mentioned this pull request Mar 22, 2026
1 task
@TKONIY TKONIY mentioned this pull request Mar 22, 2026
20 tasks
wjsuijlenh added a commit to zzhang-fr/vllm-omni that referenced this pull request Apr 15, 2026
…up claims

Fix three things the earlier phase-D draft got wrong by re-reading the
DreamZero paper carefully:

1. Decimal separator: the naive baseline is 5.7s per chunk (not 7s),
   and the bimanual action horizon is 1.6s per chunk (not 6s). pypdf
   had silently dropped the leading digit at line breaks.

2. Step-reduction progression: DreamZero does not "stay at 16 steps".
   DiT Caching (velocity reuse within a chunk, based on cosine
   similarity of flow-matching velocities) reduces effective steps
   from 16 to 4. DreamZero-Flash, a training-time noise-schedule
   change, further reduces to 1 step. The paper's Table 3 shows the
   task-progress cost: naive 1-step loses 31 points on table-bussing,
   Flash 1-step loses only 9.

3. The 38x headline is GB200-only. Table 1 shows the cumulative
   speedup caps at 9.6x on H100; NVFP4 and DreamZero-Flash rows are
   dashed for H100. CFG parallelism is also multi-GPU from row 2, so
   the post-baseline config in the paper is never single-GPU.

Also separate DreamZero's "DiT Caching" (intra-chunk velocity reuse)
from RFC vllm-project#1987 / StreamDiffusionV2's "rolling KV cache across chunks"
-- these are different optimizations and should not be conflated.

Drop all cross-baseline speedup attribution from phase_d_cross_check.md
and the journal's Phase D/E sections. Earlier drafts claimed that e.g.
"36x of DreamZero's 38x is accounted for by choosing streaming-point
hyperparameters", which compares across different baselines, hardware,
training regimes, and chunk shapes. That comparison is invalid and is
removed. Our own measurement (53.30s -> 2.585s = 20.6x on Wan-1.3B at
our offline vs our streaming point on 1x A100) is retained as a fact
about our own two configurations, with no claim about how it relates
to any published speedup.

Contribution-target vllm-project#3 (rolling KV cache + blockwise-causal attention)
is downgraded from "~1.5-2x at our streaming operating point" to
"speed-up not measured on our hardware; research commitment, not
quantified target." The measured ~260 ms framework overhead is kept
as the concrete motivating number for target #2 (per-call overhead
reduction, RFC vllm-project#2073).

Add a correction notice at the top of the earlier SOTA-scan DreamZero
subsection pointing readers to the Phase D corrections.
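The retained measurement quoted in the commit above (53.30 s -> 2.585 s on Wan-1.3B, 1x A100) is internally consistent; the 20.6x ratio can be reproduced directly:

```python
# Reproduce the speedup ratio from the two latencies quoted in the
# commit message (offline config vs streaming point, same hardware).
offline_s = 53.30
streaming_s = 2.585
speedup = offline_s / streaming_s  # ~20.6
print(f"{speedup:.1f}x")
```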

Successfully merging this pull request may close these issues.

[New Model]: LingBot-World
