[Diffusion] Implement the Lingbot World Fast Pipeline to work in vllm-omni #3701
Miguel0312 wants to merge 8 commits into
Implement Lingbot World Transformer into vllm-omni
Implement KV cache abstraction for Lingbot World Fast
Add script to offline generation using Lingbot World Fast
Implement online serving for Lingbot World and camera-based world models in general
Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>
hsliuustc0106
left a comment
Ready for full review when draft status removed (marked as WIP). Preliminary scan available on request.
await connection.handle_connection()
@router.websocket("/v1/realtime/world/camera")
Please provide reference documents for the newly added OpenAI external interfaces.
I added more information about the API to the PR description. The API is also documented in docs/user_guide/examples/online_serving/lingbot_world_fast.md. If any additional information is needed, please let me know what is missing and where to add it (docs, PR description, etc.).
logger = init_logger(__name__)
_DEFAULT_IDLE_TIMEOUT = 30.0
CHUNK_FRAMES = 4
Can the CHUNK_FRAMES parameter be modified in the request? If users can customize this parameter, please add the option to read the parameter in the request.
I allowed the client to control this value through the extra_body.frames_per_chunk field of the request. Letting the server control it as well would require top-level modifications to the library (i.e. adding command-line arguments), which I am not sure is worth it for a value used by only one model in one specific case (online serving). I can do it if needed; the modification itself should be simple.
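A minimal sketch of the fallback described in this reply, assuming the server receives extra_body as a plain mapping. The helper name resolve_frames_per_chunk and the validation rule are hypothetical and are not taken from the PR; only the default of 4 and the field name frames_per_chunk come from the thread above.

```python
from typing import Any, Mapping, Optional

CHUNK_FRAMES = 4  # server-side default quoted in the review thread


def resolve_frames_per_chunk(extra_body: Optional[Mapping[str, Any]]) -> int:
    """Return the client's frames_per_chunk, or the default when it is absent.

    Hypothetical helper for illustration; the PR may handle this differently.
    """
    if not extra_body or extra_body.get("frames_per_chunk") is None:
        return CHUNK_FRAMES
    value = extra_body["frames_per_chunk"]
    # bool is a subclass of int, so reject it explicitly (assumed rule).
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"frames_per_chunk must be a positive integer, got {value!r}")
    return value


print(resolve_frames_per_chunk({"frames_per_chunk": 8}))
print(resolve_frames_per_chunk(None))
```

Keeping the default in one constant means a later server-side option, if it is ever added, would only need to change where CHUNK_FRAMES comes from.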
Purpose
Implement the Lingbot World Fast model. Makes progress toward the Interactive Video scenario of #1987 by resolving #3072.
What Changed
diffusers.from_pretrained can correctly retrieve it
camera to multi_modal_data, whose values correspond to two Nx4x4 np.arrays: poses and intrinsics
--ws-max-size and --ws that allow the user to send the larger videos required through a websocket without having problems with packet sizes or heartbeat timeouts
API
The new endpoint /v1/realtime/world/camera/ implements the following schema (adapted from the RFC):
The CameraServerConfig object sends information about the model to the client. It follows the pattern:
Test Plan
The tests under tests/diffusion/models/lingbot_world_fast are the L1 tests for the model. They are lightweight and work out of the box. Run them with:
The tests in tests/e2e/*/test_lingbot_world_fast.py are the L2 smoke tests of the model. They check that both the online and offline behaviors of the API are as desired without actually generating any video.
Finally, the tests in tests/e2e/*/test_lingbot_world_fast_expansion.py load the actual weights of the model, generate videos, and compare them with golden reference frames. The reference frames should be stored in tests/data/lingbot_world_fast as golden_frames_LENGTH_POS.npy, where LENGTH is either "short" or "long", depending on whether the video is 25 or 81 frames long, and POS is either "first" or "last", corresponding to the first and last generated frame of the video respectively. Importantly, they should be saved as tensors directly from the reference implementation rather than by saving the model output as a .mp4 video whose frames are later extracted.
Since they require loading the model weights, the L3 tests rely on first downloading the model. The easiest way to do so is by running:
cd examples/offline_inference/lingbot_world_fast
python download_lingbot_world_fast.py
Some additional environment variables are needed so that the model can access the inputs used to generate the video.
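The golden-frame naming convention from the test plan can be sketched as follows. The directory, the file-name pattern, and the 25/81 frame counts come from the description above; the helper names golden_frame_path and all_golden_frame_paths are hypothetical and are not part of the PR's test code.

```python
from pathlib import Path
from typing import List

# Directory and naming pattern as stated in the test plan.
GOLDEN_DIR = Path("tests/data/lingbot_world_fast")
NUM_FRAMES = {"short": 25, "long": 81}  # LENGTH -> video length in frames
POSITIONS = ("first", "last")           # POS -> which generated frame is stored


def golden_frame_path(length: str, pos: str, root: Path = GOLDEN_DIR) -> Path:
    """Build the expected path of one golden reference frame (hypothetical helper)."""
    if length not in NUM_FRAMES:
        raise ValueError(f"LENGTH must be one of {sorted(NUM_FRAMES)}, got {length!r}")
    if pos not in POSITIONS:
        raise ValueError(f"POS must be one of {POSITIONS}, got {pos!r}")
    return root / f"golden_frames_{length}_{pos}.npy"


def all_golden_frame_paths(root: Path = GOLDEN_DIR) -> List[Path]:
    """List the four reference files the expansion tests would look for."""
    return [golden_frame_path(length, pos, root) for length in NUM_FRAMES for pos in POSITIONS]


for path in all_golden_frame_paths():
    print(path)
```

Listing the four expected paths up front makes it easy to check that every reference file exists before the heavier tests load the model weights.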
Test Result
As a reference we use a video generated with the original implementation. The only modification is to force the output video height and width instead of computing them from the aspect ratio of the reference image. This is in line with the behavior of other video generation models already in vllm-omni.
The following results are for 1 A100 GPU with the text encoder on the CPU to reduce VRAM use and enable the generation of longer videos. The offloading is enabled both in the baseline and in vllm-omni for a fair comparison.
Without CPU offloading we get the following performance results.
The difference in performance does not depend on the length of the video, which suggests that the speedup comes from removing fixed overhead (probably thanks to the warmup run, which lets the VAE and text encoder modules run faster) rather than from a faster diffusion process.
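The reasoning above can be made concrete with a small sketch that models generation time as a fixed overhead plus a per-frame cost and fits it through the two video lengths. The linear model is an assumption, and the timings below are placeholders chosen for illustration only; they are not measurements from this PR.

```python
from typing import Tuple

Sample = Tuple[int, float]  # (number of frames, wall-clock seconds)


def split_overhead(short: Sample, long: Sample) -> Tuple[float, float]:
    """Fit t(n) = overhead + per_frame * n through two measurements."""
    (n1, t1), (n2, t2) = short, long
    per_frame = (t2 - t1) / (n2 - n1)
    overhead = t1 - per_frame * n1
    return overhead, per_frame


# Placeholder timings (NOT results from this PR): both runs share the same
# per-frame cost and differ only by a constant offset.
baseline = split_overhead((25, 35.0), (81, 91.0))
candidate = split_overhead((25, 30.0), (81, 86.0))

print("baseline  overhead, per-frame:", baseline)
print("candidate overhead, per-frame:", candidate)
```

If the measured gap between the two implementations is the same at 25 and at 81 frames, the fitted per-frame costs match and the whole difference lands in the overhead term, which is the pattern the paragraph describes.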
This is the 81-frame long video generated by the original implementation.
reference_video.mp4
This is the 81-frame long video generated by the implementation in vllm-omni.
vllm-omni-video.mp4
There is no baseline for the video extension task, since the original model does not support it. The result of the following command, which makes 3 calls to the server to generate 25 frames, two of them for video extension, is shown below:
python openai_client.py --width 832 --height 480 --fps 16 --num-frames 25 --num-calls 3 --session-id 0 --image /path/to/lingbot-world/examples/04/image.jpg --camera-path /path/to/lingbot-world/examples/04 --prompt "A sweeping cinematic journey along the Great Wall of China, winding through golden autumn hills under a brilliant blue sky — stone pathways stretch into the distance, watchtowers stand sentinel, and vibrant foliage blankets the mountainsides as the camera glides smoothly forward, capturing the grandeur and timeless majesty of this ancient wonder."
lingbot-great-wall-extension.mp4