[Diffusion] Implemente the Lingbot World Fast Pipeline to work in vllm-omni by Miguel0312 · Pull Request #3701 · vllm-project/vllm-omni

Miguel0312 · 2026-05-18T14:34:06Z

Purpose

Implement the Lingbot World Fast model. This makes progress toward the Interactive Video scenario of #1987 by resolving #3072.

What Changed

Implement the Lingbot World pipeline and transformer, enabling both offline and online inference
Implement the LingbotWorldState object, which stores data for the model such as the KV cache
Add a script for downloading the model weights, since they are not stored in a layout that diffusers.from_pretrained can retrieve correctly
Add support in vllm-omni for models that require camera position inputs, through a new camera key in multi_modal_data whose value holds two Nx4x4 np.arrays: poses and intrinsics
Improve the websocket configuration of the online serving server by adding the flags --ws-max-size and --ws, which let the user send the larger videos required over a websocket without running into packet-size limits or heartbeat timeouts

API

The new endpoint /v1/realtime/world/camera/ implements the following schema (adapted from the RFC):

client ──────────────────► server: WS upgrade  /v1/realtime/world/camera
server ──msgpack(CameraServerConfig)──► client
                                         (patch_size, chunk_frames,...)

   ┌─── one session, N chunks ──────────────────────────────────────┐
   │                                                                │
   │ client ──msgpack({endpoint:"infer",                            │
   │                   session_id, image?, prompt?,                 │
   │                   cameras:{intrinsics, poses}})──► server      │
   │                                                                │
   │    ┌────  ServingRealtimeWorldCamera.infer()  ──────┐          │
   │    │                                                │          │
   │    │  [first chunk of session only:]                │          │
   │    │   T5 encode(prompt)      ─► cross KV cache     │          │
   │    │   VAE encode(image)      ─► y (first latent)   │          │
   │    │                                                │          │
   │    │  [every chunk:]                                │          │
   │    │   get_plucker(K_slice,                         │          │
   │    │               poses_slice)                     │          │
   │    │                 ─► Plücker emb [B,6,f,H/2,W/2] │          │
   │    │   current_start = state.global_end_index       │          │
   │    │                                                │          │
   │    │   for t in [0, 179, 358]:  (3 noisy steps)     │          │
   │    │     eps = WanModelFast(x, t, plucker,          │          │
   │    │                        self_kv_cache,          │          │
   │    │                        cross_kv_cache,         │          │
   │    │                        current_start, ...)     │          │
   │    │     x0 = flow_pred_to_x0(eps, x, t)            │          │
   │    │     x  = add_noise(x0, randn, next_t)          │          │
   │    │   # last step: keep x0 (no re-noise)           │          │
   │    │                                                │          │
   │    │   # cache-update pass, t ≈ 0, clean x0         │          │
   │    │   WanModelFast(x0, t≈0, plucker,               │          │
   │    │                self_kv_cache, ..., update=True)│          │
   │    │                                                │          │
   │    │   state.append_chunk(x0)                       │          │
   │    │   video = VAE.decode(x0)                       │          │
   │    └────────────────────────────────────────────────┘          │
   │                                                                │
   │ server ──msgpack({type:"frame", chunk_id,                      │
   │                  video:<mp4 bytes>})──► client                 │
   │                                                                │
   │ (repeat for next chunk; session-scoped                         │
   │  LingBotWorldFastState persists self_kv_cache,                 │
   │  cross_kv_cache, frame buffer, current_start)                  │
   │                                                                │
   │                                                                │
   │ client ──msgpack({endpoint:"reset"})──► server                 │
   │   state.reset() clears caches + frame buffer                   │
   │                                                                │
   └────────────────────────────────────────────────────────────────┘
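The per-chunk denoising loop in the diagram can be sketched numerically. The following is a minimal scalar sketch, assuming the usual rectified-flow parameterisation (x_t = (1 - sigma) * x0 + sigma * noise, with the model predicting the velocity noise - x0). The sigma values, the oracle_model stand-in, and the function bodies are assumptions for illustration and are not taken from this PR; the real pipeline works on latent tensors and maps the timesteps shown in the diagram to noise levels through its scheduler.

```python
import random


def add_noise(x0, noise, sigma):
    """Blend a clean sample with noise (assumed rectified-flow interpolation)."""
    return (1.0 - sigma) * x0 + sigma * noise


def flow_pred_to_x0(flow_pred, x_t, sigma):
    """Recover the clean-sample estimate from a velocity prediction."""
    return x_t - sigma * flow_pred


def denoise_chunk(model, x, sigmas, rng):
    """Few-step sampler: predict x0, then re-noise to the next noise level.

    The last step keeps x0 without re-noising, as in the diagram.
    """
    x0 = x
    for i, sigma in enumerate(sigmas):
        flow_pred = model(x, sigma)
        x0 = flow_pred_to_x0(flow_pred, x, sigma)
        if i + 1 < len(sigmas):
            x = add_noise(x0, rng.gauss(0.0, 1.0), sigmas[i + 1])
    return x0


# Hypothetical stand-in for the transformer: an oracle that knows the target.
TARGET = 0.25


def oracle_model(x_t, sigma):
    # With x_t = (1 - sigma) * x0 + sigma * n, the velocity n - x0
    # equals (x_t - x0) / sigma.
    return (x_t - TARGET) / sigma


rng = random.Random(0)
sigmas = [1.0, 0.8, 0.5]  # illustrative noise levels, not the PR's schedule
result = denoise_chunk(oracle_model, rng.gauss(0.0, 1.0), sigmas, rng)
```

With a perfect velocity prediction, every step recovers the target exactly, which is why a small number of steps can be enough for a distilled model. The cache-update pass in the diagram would then run the model once more on the clean x0.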

The CameraServerConfig object sends information about the model to the client. It follows the pattern:

CameraServerConfig = {
  "patch_size":  [1, 2, 2],
  "vae_stride":  [4, 8, 8], 
  "latent_frames_per_chunk":  3, 
  "max_area":  399360   # 480*832
}
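As a rough illustration of how a client might use these fields, the sketch below checks a requested resolution against the config. The divisibility rule (spatial VAE stride times spatial patch size) and the helper names are assumptions based on how latent patchification usually works, not behavior confirmed by this PR.

```python
CAMERA_SERVER_CONFIG = {
    "patch_size": [1, 2, 2],
    "vae_stride": [4, 8, 8],
    "latent_frames_per_chunk": 3,
    "max_area": 399360,  # 480*832
}


def spatial_multiple(config):
    """Pixels covered by one transformer token along (height, width)."""
    _, vae_h, vae_w = config["vae_stride"]
    _, patch_h, patch_w = config["patch_size"]
    return vae_h * patch_h, vae_w * patch_w


def resolution_fits(config, height, width):
    """Assumed rule: dimensions align to the token grid and stay under max_area."""
    mult_h, mult_w = spatial_multiple(config)
    aligned = height % mult_h == 0 and width % mult_w == 0
    return aligned and height * width <= config["max_area"]


def tokens_per_latent_frame(config, height, width):
    """Number of spatial tokens per latent frame at a given resolution."""
    mult_h, mult_w = spatial_multiple(config)
    return (height // mult_h) * (width // mult_w)
```

Under these assumptions, 480x832 fits exactly at max_area, while a larger resolution such as 720x1280 would be rejected.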

Test Plan

The tests under tests/diffusion/models/lingbot_world_fast represent the L1 tests for the model. They are lightweight and work out of the box. Run them with:

pytest -xvs tests/diffusion/models/gr00t/test_protocol_validation.py
pytest -xvs tests/diffusion/test_schedule.py
pytest -xvs tests/entrypoints/openai_api/test_session_state.py

The tests in tests/e2e/*/test_lingbot_world_fast.py are the L2 smoke tests for the model. They check that both the online and offline behaviors of the API are as expected without actually generating any video.

pytest -xvs tests/e2e/offline_inference/test_lingbot_world_fast.py
pytest -xvs tests/e2e/online_serving/test_lingbot_world_fast.py

Finally, the tests in tests/e2e/*/test_lingbot_world_fast_expansion.py load the actual model weights, generate videos, and compare them with golden reference frames. The reference frames should be stored in tests/data/lingbot_world_fast as golden_frames_LENGTH_POS.npy, where LENGTH is "short" or "long" depending on whether the video is 25 or 81 frames long, and POS is "first" or "last", corresponding respectively to the first and last generated frame of the video. Importantly, they should be saved as tensors directly from the reference implementation, rather than by saving the model output as an .mp4 video and extracting its frames later.
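The naming scheme can be enumerated with a small helper. This is only a sketch for checking that the expected files are present before running the tests; the function names are hypothetical and not part of the test suite.

```python
from itertools import product
from pathlib import Path

GOLDEN_DIR = Path("tests/data/lingbot_world_fast")
LENGTHS = {"short": 25, "long": 81}  # label -> number of frames
POSITIONS = ("first", "last")


def golden_frame_paths(base_dir=GOLDEN_DIR):
    """List the four expected golden_frames_LENGTH_POS.npy paths."""
    return [
        base_dir / f"golden_frames_{length}_{pos}.npy"
        for length, pos in product(LENGTHS, POSITIONS)
    ]


def missing_golden_frames(base_dir=GOLDEN_DIR):
    """Return the expected reference files that do not exist on disk."""
    return [path for path in golden_frame_paths(base_dir) if not path.is_file()]
```

Calling missing_golden_frames() from the repository root lists any reference files that still need to be generated.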

Because these L3 tests load the model weights, the model must be downloaded first. The easiest way to do so is by running:

cd examples/offline_inference/lingbot_world_fast
python download_lingbot_world_fast.py

Some additional environment variables are needed so the tests can locate the model weights and the inputs used to generate the video.

export LINGBOT_WORLD_FAST_PATH=examples/offline_inference/lingbot_world_fast/lingbot_world/lingbot-world-base-cam/Lingbot-World-Fast/
export LINGBOT_WORLD_FAST_CAMERA_PATH=path/to/lingbot-world/examples/04/
export LINGBOT_WORLD_FAST_IMAGE=path/to/lingbot-world/examples/04/image.jpg
pytest -xvs tests/e2e/offline_inference/test_lingbot_world_fast_expansion.py
pytest -xvs tests/e2e/online_serving/test_lingbot_world_fast_expansion.py

Test Result

We use a video generated with the original implementation as the reference. The only modification is to force the output video height and width instead of computing them from the aspect ratio of the reference image. This is in line with the behavior of other video generation models already in vllm-omni.

The following results are for 1 A100 GPU with the text encoder on the CPU to reduce VRAM use and enable the generation of longer videos. Offloading is enabled in both the baseline and vllm-omni for a fair comparison.

Metric	Short (25 frames)	Long (81 frames)
SSIM first	1.000	1.000
SSIM last	1.000	1.000
Latency (s) - vllm-omni	63	111
Latency (s) - original	64	112

Without CPU offloading we get the following performance results.

Metric	Short (25 frames)	Medium (65 frames)
Latency (s) - vllm-omni	15	41
Latency (s) - original	20	46

The absolute difference in latency (about 5 s) does not depend on the length of the video, which suggests that the speedup comes from removing fixed overhead (probably thanks to the warmup run, which lets the VAE and text encoder modules run faster) rather than from a faster diffusion process.
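The arithmetic behind this observation can be checked directly from the tables above. The snippet below is only a worked calculation on the reported numbers; the dictionary layout and helper name are illustrative.

```python
# Latencies in seconds, copied from the two tables above.
WITH_OFFLOAD = {
    "short (25 frames)": {"vllm_omni": 63, "original": 64},
    "long (81 frames)": {"vllm_omni": 111, "original": 112},
}
NO_OFFLOAD = {
    "short (25 frames)": {"vllm_omni": 15, "original": 20},
    "medium (65 frames)": {"vllm_omni": 41, "original": 46},
}


def saved_seconds(table):
    """Absolute latency saved by vllm-omni for each video length."""
    return {name: row["original"] - row["vllm_omni"] for name, row in table.items()}


offload_savings = saved_seconds(WITH_OFFLOAD)
no_offload_savings = saved_seconds(NO_OFFLOAD)

# Relative speedup shrinks as the video gets longer, because the saving is constant.
relative_speedup = {
    name: row["original"] / row["vllm_omni"] for name, row in NO_OFFLOAD.items()
}
```

The saving is 5 s for both lengths without offloading and 1 s for both lengths with offloading, so the relative speedup drops from roughly 1.33x to 1.12x as the video grows, which is what a fixed overhead would produce.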

This is the 81-frame long video generated by the original implementation.

reference_video.mp4

This is the 81-frame long video generated by the implementation in vllm-omni.

vllm-omni-video.mp4

There is no baseline for the video extension task, since the original model does not support it. The result of the following command, which makes 3 calls to the server, each requesting 25 frames, two of which are video extensions, is shown below:

python openai_client.py --width 832 --height 480 --fps 16 --num-frames 25 --num-calls 3 --session-id 0 --image /path/to/lingbot-world/examples/04/image.jpg --camera-path /path/to/lingbot-world/examples/04 --prompt "A sweeping cinematic journey along the Great Wall of China, winding through golden autumn hills under a brilliant blue sky — stone pathways stretch into the distance, watchtowers stand sentinel, and vibrant foliage blankets the mountainsides as the camera glides smoothly forward, capturing the grandeur and timeless majesty of this ancient wonder."

lingbot-great-wall-extension.mp4

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

Implement Lingbot World Transformer into vllm-omni

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

Implement KV cache abstraction for Lingbot World Fast
Add script to offline generation using Lingbot World Fast
Implement online serving for Lingbot World and camera-based world models in general

hsliuustc0106

Ready for full review when draft status removed (marked as WIP). Preliminary scan available on request.

chatgpt-codex-connector · 2026-05-27T12:25:14Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

amy-why-3459 · 2026-05-28T14:10:45Z

    await connection.handle_connection()


+@router.websocket("/v1/realtime/world/camera")


Please provide reference documents for the newly added OpenAI external interfaces.

Miguel0312 · 2026-05-29

I added some more information about the API to the PR description. The API is also documented in docs/user_guide/examples/online_serving/lingbot_world_fast.md. If any additional information is needed, please let me know what is missing and where to add it (docs, PR description, etc.).

amy-why-3459 · 2026-05-28T14:13:40Z

+
+logger = init_logger(__name__)
+_DEFAULT_IDLE_TIMEOUT = 30.0
+CHUNK_FRAMES = 4


Can the CHUNK_FRAMES parameter be modified in the request? If users can customize this parameter, please add the option to read the parameter in the request.

Miguel0312 · 2026-05-29

I allowed the client to control this value through the extra_body.frames_per_chunk field of the request. If the server also needs to control it, some top-level changes to the library are required (i.e. adding command line arguments), and I am not sure that is worth it for something used by only one model in one specific case (online serving). But I can do it if needed; the change itself should be simple.
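A minimal sketch of how such an override could be resolved is shown below. The function name and the validation rules are assumptions for illustration; only the extra_body.frames_per_chunk field and the CHUNK_FRAMES default come from the discussion above.

```python
CHUNK_FRAMES = 4  # server-side default shown in the diff above


def resolve_frames_per_chunk(extra_body, default=CHUNK_FRAMES):
    """Return the client's frames_per_chunk if provided, else the default.

    Hypothetical helper: rejects non-integer or non-positive values.
    """
    if not extra_body or "frames_per_chunk" not in extra_body:
        return default
    value = extra_body["frames_per_chunk"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"frames_per_chunk must be a positive integer, got {value!r}")
    return value
```

A request without the field falls back to the default of 4, while a request carrying {"frames_per_chunk": 8} would use 8.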

hsliuustc0106 reviewed May 18, 2026

Miguel0312 and others added 4 commits May 19, 2026 14:49

Implement video continuation

735eb54

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

Remove dependency on external code from Lingbot World repo

010cfc9

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

Implement tests for Lingbot World Fast

0109533

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

Merge branch 'main' into lingbot-fast

69f64e5

Signed-off-by: Miguel Vieira Pereira <52176659+Miguel0312@users.noreply.github.com> Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

Miguel0312 force-pushed the lingbot-fast branch from 167037a to 69f64e5 May 26, 2026 09:44

amy-why-3459 mentioned this pull request May 27, 2026

[RFC]: World Model Support #1987

Open

19 tasks

Update documentation

b68064e

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

Miguel0312 changed the title ~~[Diffusion][WIP] Implemente the Lingbot World Fast Pipeline to work in vllm-omni~~ [Diffusion] Implemente the Lingbot World Fast Pipeline to work in vllm-omni May 27, 2026

Miguel0312 marked this pull request as ready for review May 27, 2026 12:24

Miguel0312 requested review from Gaohan123, Isotr0py, RuixiangMa, SamitHuang, ZJY0516, david6666666, princepride, tzhouam, wtomin, yenuo26 and ywang96 as code owners May 27, 2026 12:24

amy-why-3459 reviewed May 28, 2026

Allow the client to modify the frames_per_chunk value of the response

bfa9b59

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

mnasser02 mentioned this pull request May 30, 2026

[Feature] Temporal Pipeline Parallelism & Stream Batch for Real-Time Video #3099

Open

5 tasks

Improve deploy config to populate the CameraServerConfig response object

16a4c96

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

Miguel0312 force-pushed the lingbot-fast branch from bf26ef6 to 16a4c96 June 1, 2026 13:36

Miguel0312 requested a review from lishunyang12 as a code owner June 1, 2026 13:36

Miguel0312 wants to merge 8 commits into vllm-project:main from zzhang-fr:lingbot-fast

3 participants
