
[Diffusion] Implemente the Lingbot World Fast Pipeline to work in vllm-omni#3701

Open
Miguel0312 wants to merge 8 commits into vllm-project:main from zzhang-fr:lingbot-fast

@Miguel0312 Miguel0312 commented May 18, 2026

Purpose

Implement the Lingbot World Fast model. This makes progress toward the Interactive Video scenario of #1987 by resolving #3072.

What Changed

  • Implement the Lingbot World pipeline and transformer, enabling both offline and online inference
  • Implement the LingbotWorldState object, which stores model data such as the KV cache
  • Add a script for downloading the model weights, since they are not stored in a layout that diffusers.from_pretrained can retrieve correctly
  • Add support in vllm-omni for models that require camera position inputs: a new camera key in multi_modal_data holds two Nx4x4 np.arrays, poses and intrinsics
  • Improve the WebSocket configuration of the online serving server by adding the flags --ws-max-size and --ws, which let the user send the larger videos required over a WebSocket without running into packet-size limits or heartbeat timeouts
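
As an illustration of the camera input described in the list above, the sketch below builds a multi_modal_data entry whose camera key holds poses and intrinsics, each an Nx4x4 array. The helper names make_camera_input and check_camera_input, the identity-matrix placeholders, and the exact dict layout are assumptions made for illustration; the request format in the PR may differ.

```python
import numpy as np


def make_camera_input(num_frames: int) -> dict:
    """Build a placeholder `camera` entry (hypothetical helper, not from the PR)."""
    # One 4x4 matrix per frame; identity matrices stand in for real camera data.
    poses = np.tile(np.eye(4, dtype=np.float32), (num_frames, 1, 1))
    intrinsics = np.tile(np.eye(4, dtype=np.float32), (num_frames, 1, 1))
    return {"camera": {"poses": poses, "intrinsics": intrinsics}}


def check_camera_input(camera: dict) -> int:
    """Validate the Nx4x4 shapes and return N (hypothetical helper)."""
    poses, intrinsics = camera["poses"], camera["intrinsics"]
    for name, array in (("poses", poses), ("intrinsics", intrinsics)):
        if array.ndim != 3 or array.shape[1:] != (4, 4):
            raise ValueError(f"{name} must have shape (N, 4, 4), got {array.shape}")
    if poses.shape[0] != intrinsics.shape[0]:
        raise ValueError("poses and intrinsics must cover the same number of frames")
    return poses.shape[0]


multi_modal_data = make_camera_input(num_frames=25)
print(check_camera_input(multi_modal_data["camera"]))
```

Validating the shapes on the client side before sending a request makes shape mismatches fail early, before any model work starts.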

API

The new endpoint /v1/realtime/world/camera implements the following schema (adapted from the RFC):

client ──────────────────► server: WS upgrade  /v1/realtime/world/camera
server ──msgpack(CameraServerConfig)──► client
                                         (patch_size, chunk_frames,...)

   ┌─── one session, N chunks ──────────────────────────────────────┐
   │                                                                │
   │ client ──msgpack({endpoint:"infer",                            │
   │                   session_id, image?, prompt?,                 │
   │                   cameras:{intrinsics, poses}})──► server      │
   │                                                                │
   │    ┌────  ServingRealtimeWorldCamera.infer()  ──────┐          │
   │    │                                                │          │
   │    │  [first chunk of session only:]                │          │
   │    │   T5 encode(prompt)      ─► cross KV cache     │          │
   │    │   VAE encode(image)      ─► y (first latent)   │          │
   │    │                                                │          │
   │    │  [every chunk:]                                │          │
   │    │   get_plucker(K_slice,                         │          │
   │    │               poses_slice)                     │          │
   │    │                 ─► Plücker emb [B,6,f,H/2,W/2] │          │
   │    │   current_start = state.global_end_index       │          │
   │    │                                                │          │
   │    │   for t in [0, 179, 358]:  (3 noisy steps)     │          │
   │    │     eps = WanModelFast(x, t, plucker,          │          │
   │    │                        self_kv_cache,          │          │
   │    │                        cross_kv_cache,         │          │
   │    │                        current_start, ...)     │          │
   │    │     x0 = flow_pred_to_x0(eps, x, t)            │          │
   │    │     x  = add_noise(x0, randn, next_t)          │          │
   │    │   # last step: keep x0 (no re-noise)           │          │
   │    │                                                │          │
   │    │   # cache-update pass, t ≈ 0, clean x0         │          │
   │    │   WanModelFast(x0, t≈0, plucker,               │          │
   │    │                self_kv_cache, ..., update=True)│          │
   │    │                                                │          │
   │    │   state.append_chunk(x0)                       │          │
   │    │   video = VAE.decode(x0)                       │          │
   │    └────────────────────────────────────────────────┘          │
   │                                                                │
   │ server ──msgpack({type:"frame", chunk_id,                      │
   │                  video:<mp4 bytes>})──► client                 │
   │                                                                │
   │ (repeat for next chunk; session-scoped                         │
   │  LingBotWorldFastState persists self_kv_cache,                 │
   │  cross_kv_cache, frame buffer, current_start)                  │
   │                                                                │
   │                                                                │
   │ client ──msgpack({endpoint:"reset"})──► server                 │
   │   state.reset() clears caches + frame buffer                   │
   │                                                                │
   └────────────────────────────────────────────────────────────────┘
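
The per-chunk denoising loop in the diagram can be sketched in a few lines. This is a toy, pure-Python version under stated assumptions, not the PR's code: it assumes the common rectified-flow convention x_t = (1 - sigma) * x0 + sigma * noise with flow = noise - x0, uses a hypothetical sigma schedule, and replaces WanModelFast with a stand-in predictor that already knows the clean target. Plücker embeddings and both KV caches are omitted.

```python
import random


def add_noise(x0, noise, sigma):
    # Assumed convention: x_t = (1 - sigma) * x0 + sigma * noise.
    return [(1.0 - sigma) * a + sigma * n for a, n in zip(x0, noise)]


def flow_pred_to_x0(flow, x_t, sigma):
    # With flow = noise - x0, the clean sample is x0 = x_t - sigma * flow.
    return [xt - sigma * v for xt, v in zip(x_t, flow)]


def make_toy_model(target):
    # Stand-in for the transformer: returns the exact flow toward `target`.
    def model(x_t, sigma):
        return [(xt - a) / sigma for xt, a in zip(x_t, target)]
    return model


def denoise_chunk(model, sigmas, length, rng):
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # start from pure noise
    x0 = x
    for i, sigma in enumerate(sigmas):
        flow = model(x, sigma)
        x0 = flow_pred_to_x0(flow, x, sigma)
        if i + 1 < len(sigmas):
            # Re-noise the estimate to the next (lower) noise level.
            noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
            x = add_noise(x0, noise, sigmas[i + 1])
        # Last step: keep x0 without re-noising.
    return x0


target = [0.5, -1.0, 2.0]
sigmas = [1.0, 0.75, 0.5]  # hypothetical schedule with 3 noisy steps
x0 = denoise_chunk(make_toy_model(target), sigmas, len(target), random.Random(0))
print(x0)
```

Because the stand-in predictor is exact, every step recovers the target up to floating-point error; with a real model, the repeated re-noising is what lets a few steps refine the estimate.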

The CameraServerConfig object sends information about the model to the client. It follows the pattern:

CameraServerConfig = {
  "patch_size":  [1, 2, 2],
  "vae_stride":  [4, 8, 8], 
  "latent_frames_per_chunk":  3, 
  "max_area":  399360   # 480*832
}
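
A client could use these values to size its requests. The sketch below is an illustration only: it assumes the usual Wan-style meaning of the fields (vae_stride is the temporal/height/width downsampling of the VAE, patch_size is the transformer patch size in latent space), and the helper name tokens_per_chunk is hypothetical.

```python
from math import prod

config = {
    "patch_size": [1, 2, 2],
    "vae_stride": [4, 8, 8],
    "latent_frames_per_chunk": 3,
    "max_area": 399360,  # 480 * 832
}


def tokens_per_chunk(config, height, width):
    """Return the (frames, rows, cols) token grid of one chunk and its size."""
    if height * width > config["max_area"]:
        raise ValueError("requested resolution exceeds max_area")
    _, stride_h, stride_w = config["vae_stride"]
    patch_t, patch_h, patch_w = config["patch_size"]
    # Pixels -> latents (VAE stride) -> tokens (patch size).
    grid = (
        config["latent_frames_per_chunk"] // patch_t,
        height // stride_h // patch_h,
        width // stride_w // patch_w,
    )
    return grid, prod(grid)


grid, tokens = tokens_per_chunk(config, height=480, width=832)
print(grid, tokens)  # (3, 30, 52) 4680
```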

Test Plan

The tests under tests/diffusion/models/lingbot_world_fast are the L1 tests for the model. They are lightweight and run out of the box:

pytest -xvs tests/diffusion/models/gr00t/test_protocol_validation.py
pytest -xvs tests/diffusion/test_schedule.py
pytest -xvs tests/entrypoints/openai_api/test_session_state.py

The tests in tests/e2e/*/test_lingbot_world_fast.py are the L2 smoke tests for the model. They check that both the online and offline behaviors of the API are as expected, without actually generating any video.

pytest -xvs tests/e2e/offline_inference/test_lingbot_world_fast.py
pytest -xvs tests/e2e/online_serving/test_lingbot_world_fast.py

Finally, the tests in tests/e2e/*/test_lingbot_world_fast_expansion.py are the L3 tests: they load the actual model weights, generate videos, and compare them with golden reference frames. The reference frames should be stored in tests/data/lingbot_world_fast as golden_frames_LENGTH_POS.npy, where LENGTH is "short" or "long" (the 25-frame or 81-frame video, respectively) and POS is "first" or "last" (the first or last generated frame of the video). Importantly, the frames should be saved as tensors directly from the reference implementation, not extracted from a .mp4 encoding of the model output.
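
The naming scheme for the golden frames can be made concrete with a small helper. This is a sketch for illustration; the function names golden_frame_path and missing_golden_frames are hypothetical and are not part of the PR's test code.

```python
from itertools import product
from pathlib import Path

GOLDEN_DIR = Path("tests/data/lingbot_world_fast")
LENGTHS = {"short": 25, "long": 81}  # LENGTH -> number of frames in the video
POSITIONS = ("first", "last")        # POS -> which generated frame is stored


def golden_frame_path(length: str, pos: str) -> Path:
    """Return the expected path of one golden reference frame."""
    if length not in LENGTHS or pos not in POSITIONS:
        raise ValueError(f"unknown golden frame: {length!r}, {pos!r}")
    return GOLDEN_DIR / f"golden_frames_{length}_{pos}.npy"


def missing_golden_frames() -> list:
    """List the expected golden files that are not present on disk."""
    paths = [golden_frame_path(l, p) for l, p in product(LENGTHS, POSITIONS)]
    return [path for path in paths if not path.exists()]


for length, pos in product(LENGTHS, POSITIONS):
    print(golden_frame_path(length, pos).as_posix())
```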

Since the L3 tests load the model weights, running them requires downloading the model first. The easiest way to do so is by running:

cd examples/offline_inference/lingbot_world_fast
python download_lingbot_world_fast.py

A few environment variables are also needed so that the tests can locate the model weights and the inputs used to generate the video.

export LINGBOT_WORLD_FAST_PATH=examples/offline_inference/lingbot_world_fast/lingbot_world/lingbot-world-base-cam/Lingbot-World-Fast/
export LINGBOT_WORLD_FAST_CAMERA_PATH=path/to/lingbot-world/examples/04/
export LINGBOT_WORLD_FAST_IMAGE=path/to/lingbot-world/examples/04/image.jpg
pytest -xvs tests/e2e/offline_inference/test_lingbot_world_fast_expansion.py
pytest -xvs tests/e2e/online_serving/test_lingbot_world_fast_expansion.py
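
Since the L3 tests depend on these three variables, a small preflight check can fail fast when one of them is missing. The sketch below is a hypothetical helper, not part of the PR; it only checks that the variables are set, not that the paths exist.

```python
import os

REQUIRED_VARS = (
    "LINGBOT_WORLD_FAST_PATH",
    "LINGBOT_WORLD_FAST_CAMERA_PATH",
    "LINGBOT_WORLD_FAST_IMAGE",
)


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these variables before running the L3 tests:", ", ".join(missing))
else:
    print("All L3 test variables are set.")
```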

Test Result

The reference is a video generated with the original implementation. The only modification is that the output video height and width are forced, instead of being computed from the aspect ratio of the reference image. This matches the behavior of other video generation models already in vllm-omni.

The following results are for 1 A100 GPU with the text encoder on the CPU, which reduces VRAM use and enables the generation of longer videos. The offloading is enabled both in the baseline and in vllm-omni for a fair comparison.

                          Short (25 frames)   Long (81 frames)
SSIM first                1.000               1.000
SSIM last                 1.000               1.000
Latency (s) - vllm-omni   63                  111
Latency (s) - original    64                  112

Without CPU offloading we get the following performance results.

                          Short (25 frames)   Medium (65 frames)
Latency (s) - vllm-omni   15                  41
Latency (s) - original    20                  46

The absolute latency gap (5 s) is the same for both video lengths, which suggests that the speedup comes from removing fixed overhead (probably thanks to the warmup run, which lets the VAE and text encoder modules run faster) rather than from a faster diffusion process.
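
The arithmetic behind this observation, using the latencies measured without CPU offloading:

```python
# Latencies in seconds, copied from the results without CPU offloading.
LATENCY_S = {
    "short (25 frames)": {"vllm_omni": 15, "original": 20},
    "medium (65 frames)": {"vllm_omni": 41, "original": 46},
}


def latency_gap(entry):
    """Return (seconds saved, fraction of the original latency saved)."""
    saved = entry["original"] - entry["vllm_omni"]
    return saved, saved / entry["original"]


for name, entry in LATENCY_S.items():
    saved, fraction = latency_gap(entry)
    print(f"{name}: saved {saved} s ({fraction:.1%})")
# short (25 frames): saved 5 s (25.0%)
# medium (65 frames): saved 5 s (10.9%)
```

The saved time stays at 5 s while the relative gain shrinks as the video gets longer, which is the pattern expected from a fixed per-request overhead.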

This is the 81-frame long video generated by the original implementation.

reference_video.mp4

This is the 81-frame long video generated by the implementation in vllm-omni.

vllm-omni-video.mp4

There is no baseline for the video extension task, since the original model does not support it. The result of the following command, which makes 3 calls to the server, each requesting 25 frames, two of which are video extensions, is shown below:

python openai_client.py --width 832 --height 480 --fps 16 --num-frames 25 --num-calls 3 --session-id 0 --image /path/to/lingbot-world/examples/04/image.jpg --camera-path /path/to/lingbot-world/examples/04 --prompt "A sweeping cinematic journey along the Great Wall of China, winding through golden autumn hills under a brilliant blue sky — stone pathways stretch into the distance, watchtowers stand sentinel, and vibrant foliage blankets the mountainsides as the camera glides smoothly forward, capturing the grandeur and timeless majesty of this ancient wonder."
lingbot-great-wall-extension.mp4

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

Implement Lingbot World Transformer into vllm-omni

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>

Implement KV cache abstraction for Lingbot World Fast

Add script to offline generation using Lingbot World Fast

Implement online serving for Lingbot World and camera-based world models in general
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Ready for full review when draft status removed (marked as WIP). Preliminary scan available on request.

Miguel0312 and others added 4 commits May 19, 2026 14:49
Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>
Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>
Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>
Signed-off-by: Miguel Vieira Pereira <52176659+Miguel0312@users.noreply.github.com>

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>
Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>
@Miguel0312 Miguel0312 changed the title [Diffusion][WIP] Implemente the Lingbot World Fast Pipeline to work in vllm-omni [Diffusion] Implemente the Lingbot World Fast Pipeline to work in vllm-omni May 27, 2026
@Miguel0312 Miguel0312 marked this pull request as ready for review May 27, 2026 12:24
@chatgpt-codex-connector
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

await connection.handle_connection()


@router.websocket("/v1/realtime/world/camera")
Contributor

Please provide reference documents for the newly added OpenAI external interfaces.

Author

I added more information about the API to the PR description. The API is also documented in docs/user_guide/examples/online_serving/lingbot_world_fast.md. If any additional information is needed, please let me know what is missing and where to add it (docs, PR description, etc.).


logger = init_logger(__name__)
_DEFAULT_IDLE_TIMEOUT = 30.0
CHUNK_FRAMES = 4
Contributor

Can the CHUNK_FRAMES parameter be modified in the request? If users can customize this parameter, please add the option to read the parameter in the request.

Author

I allowed the client to control this value through the extra_body.frames_per_chunk field of the request. Letting the server control it as well would require some top-level modifications to the library (i.e., adding command-line arguments), and I am not sure that is worth it for a setting used by only one model in one specific case (online serving). I can do it if needed, though; the modification itself should be simple.
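
A rough sketch of what such a request could look like on the client side follows. The top-level fields mirror the infer message in the PR description; placing extra_body at the top level of that message, the helper name build_infer_message, and the sample values are assumptions made for illustration. The real endpoint exchanges msgpack; JSON is used here only to print the structure.

```python
import json


def build_infer_message(session_id, poses, intrinsics, prompt=None, frames_per_chunk=None):
    """Assemble an `infer` message (hypothetical helper, layout partly assumed)."""
    message = {
        "endpoint": "infer",
        "session_id": session_id,
        "cameras": {"intrinsics": intrinsics, "poses": poses},
    }
    if prompt is not None:
        message["prompt"] = prompt
    if frames_per_chunk is not None:
        if frames_per_chunk <= 0:
            raise ValueError("frames_per_chunk must be positive")
        # Assumption: extra_body sits next to the other request fields.
        message["extra_body"] = {"frames_per_chunk": frames_per_chunk}
    return message


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
message = build_infer_message(
    session_id="0",
    poses=[identity],
    intrinsics=[identity],
    prompt="A walk along a stone wall",
    frames_per_chunk=4,
)
print(json.dumps(message["extra_body"]))
```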

Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>
Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>