[BugFix] Fix layerwise CPU offloading for LTX2 two-stages pipeline#2935

Closed
Songrui625 wants to merge 3 commits into vllm-project:main from Songrui625:fix-ltx2-2stages-offload

Conversation

@Songrui625
Contributor

@Songrui625 Songrui625 commented Apr 20, 2026

Purpose

This PR fixes layerwise CPU offloading for LTX-2 two-stage pipelines, LTX2TwoStagesPipeline and LTX2ImageToVideoTwoStagesPipeline.

I'm working on adding L4 tests for the LTX-2 diffusion model (#2815). After PR #2018 was merged into the main branch, the layerwise CPU offloading tests for the LTX-2 two-stage pipelines (LTX2TwoStagesPipeline and LTX2ImageToVideoTwoStagesPipeline) started failing.

The server crashes at startup because the DiT modules are not found by the layerwise CPU offloading context, as hinted by the warning in the log below: WARNING 04-19 22:52:46 [layerwise_backend.py:293] No DiT/transformer modules found, skipping layer-wise offloading

$ /app/.venv/bin/python -m vllm_omni.entrypoints.cli.main serve /data00/models/LTX-2-19b-distilled --host 127.0.0.1 --port 58713 --omni --enable-layerwise-offload --stage-init-timeout 600 --init-timeout 900 --model-class-name LTX2TwoStagesPipelin
...
INFO 04-19 22:52:45 [diffusers_loader.py:324] Loading weights took 2.66 seconds
INFO 04-19 22:52:46 [diffusion_model_runner.py:142] Model loading took 28.8962 GiB and 12.764463 seconds
INFO 04-19 22:52:46 [diffusion_model_runner.py:147] Model runner: Model loaded successfully.
INFO 04-19 22:52:46 [diffusion_model_runner.py:159]  Enabling offloader backend: LayerWiseOffloadBackend
WARNING 04-19 22:52:46 [layerwise_backend.py:293] No DiT/transformer modules found, skipping layer-wise offloading # <======== Here is the point!

INFO 04-19 22:52:46 [diffusion_model_runner.py:188] Model runner: Initialization complete.
INFO 04-19 22:52:46 [diffusion_worker.py:175] Worker 0: Process-scoped GPU memory after model loading: 0.00 GiB.
INFO 04-19 22:52:46 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-19 22:52:46 [diffusion_worker.py:91] Worker 0: Initialization complete.
INFO 04-19 22:52:46 [diffusion_worker.py:555] Worker 0: Scheduler loop started.
INFO 04-19 22:52:46 [diffusion_worker.py:478] Worker 0 ready to receive requests via shared memory
(APIServer pid=833315) INFO 04-19 22:52:46 [diffusion_engine.py:443] dummy run to warm up the model
WARNING 04-19 22:52:46 [kv_transfer_manager.py:985] No connector available for receiving KV cache
  0%|                                                                                                                                                                                         | 0/8 [00:00<?, ?it/s]
ERROR 04-19 22:52:47 [diffusion_worker.py:765] Error executing method 'execute_model'. This might cause issues in distributed execution.
ERROR 04-19 22:52:47 [diffusion_worker.py:765] Traceback (most recent call last):
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/vllm-omni/vllm_omni/diffusion/worker/diffusion_worker.py", line 761, in execute_method
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     return func(*args, **kwargs)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/vllm-omni/vllm_omni/diffusion/worker/diffusion_worker.py", line 236, in execute_model
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     output = self.model_runner.execute_model(req)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/vllm-omni/vllm_omni/diffusion/worker/diffusion_model_runner.py", line 276, in execute_model
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     output = self.pipeline.forward(req)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]              ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/vllm-omni/vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py", line 1228, in forward
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     video_latent, audio_latent = self.pipe(
ERROR 04-19 22:52:47 [diffusion_worker.py:765]                                  ^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     return self._call_impl(*args, **kwargs)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     return forward_call(*args, **kwargs)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     return func(*args, **kwargs)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/vllm-omni/vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py", line 1069, in forward
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     noise_pred_video, noise_pred_audio = self.predict_noise_maybe_with_cfg(
ERROR 04-19 22:52:47 [diffusion_worker.py:765]                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/vllm-omni/vllm_omni/diffusion/distributed/cfg_parallel.py", line 133, in predict_noise_maybe_with_cfg
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     positive_noise_pred = _wrap(self.predict_noise(**positive_kwargs))
ERROR 04-19 22:52:47 [diffusion_worker.py:765]                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/vllm-omni/vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py", line 717, in predict_noise
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     noise_pred_video, noise_pred_audio = self.transformer(**kwargs)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     return self._call_impl(*args, **kwargs)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     return forward_call(*args, **kwargs)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/vllm-omni/vllm_omni/diffusion/models/ltx2/ltx2_transformer.py", line 1655, in forward
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     hidden_states = self.proj_in(hidden_states)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     return self._call_impl(*args, **kwargs)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     return forward_call(*args, **kwargs)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765]   File "/app/.venv/lib/python3.12/site-packages/torch/nn/modules/linear.py", line 134, in forward
ERROR 04-19 22:52:47 [diffusion_worker.py:765]     return F.linear(input, self.weight, self.bias)
ERROR 04-19 22:52:47 [diffusion_worker.py:765]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-19 22:52:47 [diffusion_worker.py:765] RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_mm)
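The root cause is easy to reproduce in isolation: the two-stage wrapper holds the inner pipeline under an attribute, so a collector that only inspects top-level attributes never sees the transformer. A minimal sketch with stand-in classes (these are illustrative names, not the actual vllm-omni implementations):

```python
# Illustrative stand-ins only, not the real vllm-omni pipeline classes.
class DiT:
    """Stands in for the transformer (e.g. LTX2VideoTransformer3DModel)."""

class InnerPipeline:
    """Stands in for LTX2Pipeline, which owns the transformer directly."""
    def __init__(self):
        self.transformer = DiT()

class TwoStagesPipeline:
    """Stands in for LTX2TwoStagesPipeline: wraps the inner pipeline."""
    def __init__(self):
        self.pipe = InnerPipeline()

pipeline = TwoStagesPipeline()
# A collector that only checks the top-level pipeline misses the DiT ...
print(hasattr(pipeline, "transformer"))       # False
# ... because it lives one level down, under the wrapper's attribute.
print(hasattr(pipeline.pipe, "transformer"))  # True
```

With no transformer found, no offload hooks are installed, so the DiT weights stay on CPU while the inputs are on cuda:0, producing the device-mismatch RuntimeError above.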

Test Plan

  • The newly added test test_module_collector.py passes.
pytest -v tests/diffusion/offloader/test_module_collector.py
  • The omni server starts for pipeline LTX2TwoStagesPipeline without crashing.
/app/.venv/bin/python -m vllm_omni.entrypoints.cli.main serve /data00/models/LTX-2-19b-distilled --host 127.0.0.1 --port 58713 --omni --enable-layerwise-offload --stage-init-timeout 600 --init-timeout 900 --model-class-name LTX2TwoStagesPipeline

Test Result

  • Both test cases from test_module_collector.py passed
(app) root@iv-ye1ye80vlsxjd1txczgc:/app/vllm-omni/tests/diffusion/offloader# pytest -v test_module_collector.py
=============================================================================================== test session starts ================================================================================================
platform linux -- Python 3.12.12, pytest-9.0.3, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /app/vllm-omni
configfile: pyproject.toml
plugins: mock-3.15.1, anyio-4.12.1
collected 2 items

test_module_collector.py::TestModuleDiscovery::test_discover_basic PASSED                                                                                                                                    [ 50%]
test_module_collector.py::TestModuleDiscovery::test_discover_nested PASSED                                                                                                                                   [100%]

================================================================================================= warnings summary =================================================================================================
../../../vllm_omni/version.py:55
  /app/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
   --> vLLM-Omni version 0.1.dev1338+gf0756914d.d20260420
   --> vLLM version 0.19.0
  This will likely cause compatibility issues.
    warn_if_misaligned_vllm_version()

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../../../.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: 14 warnings
  /app/.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

../../../../.venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1434
  /app/.venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1434: PytestConfigWarning: Unknown config option: asyncio_mode

    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================== 2 passed, 18 warnings in 0.10s ==========================================================================================
  • The omni server starts up successfully:
:/app/vllm-omni# /app/.venv/bin/python -m vllm_omni.entrypoints.cli.main serve /data00/models/LTX-2-19b-distilled --host 127.0.0.1 --port 58713 --omni --enable-layerwise-offload --stage-init-timeout 600 --init-timeout 900 --model-class-name LTX2TwoStagesPipeline
/app/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.1.dev1338+gf0756914d.d20260420
 --> vLLM version 0.19.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
/app/.venv/lib/python3.12/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
INFO 04-19 23:14:31 [serve.py:116] Detected diffusion model: /data00/models/LTX-2-19b-distilled
INFO 04-19 23:14:31 [logo.py:45]        █     █     █▄   ▄█       ▄▀▀▀▀▄ █▄   ▄█ █▄    █ ▀█▀
INFO 04-19 23:14:31 [logo.py:45]  ▄▄ ▄█ █     █     █ ▀▄▀ █  ▄▄▄  █    █ █ ▀▄▀ █ █ ▀▄  █  █
INFO 04-19 23:14:31 [logo.py:45]   █▄█▀ █     █     █     █       █    █ █     █ █   ▀▄█  █
INFO 04-19 23:14:31 [logo.py:45]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀        ▀▀▀▀  ▀     ▀ ▀     ▀ ▀▀▀
INFO 04-19 23:14:31 [logo.py:45]
(APIServer pid=836403) INFO 04-19 23:14:31 [utils.py:299] vLLM server version 0.19.0, serving model /data00/models/LTX-2-19b-distilled
(APIServer pid=836403) INFO 04-19 23:14:31 [utils.py:233] non-default args: {'model_tag': '/data00/models/LTX-2-19b-distilled', 'host': '127.0.0.1', 'port': 58713, 'model': '/data00/models/LTX-2-19b-distilled'}
(APIServer pid=836403) INFO 04-19 23:14:31 [omni_base.py:139] [AsyncOmni] Initializing with model /data00/models/LTX-2-19b-distilled
(APIServer pid=836403) INFO 04-19 23:14:31 [async_omni_engine.py:272] [AsyncOmniEngine] Initializing with model /data00/models/LTX-2-19b-distilled
(APIServer pid=836403) WARNING 04-19 23:14:31 [utils.py:177] Filtered out 1 callable object(s) from base_engine_args that are not compatible with OmegaConf: ['dispatch_function'].
(APIServer pid=836403) INFO 04-19 23:14:31 [async_omni_engine.py:329] [AsyncOmniEngine] Launching Orchestrator thread with 1 stages
(APIServer pid=836403) INFO 04-19 23:14:31 [async_omni_engine.py:748] [AsyncOmniEngine] Initializing stage 0
(APIServer pid=836403) INFO 04-19 23:14:31 [stage_init_utils.py:384] [stage_init] Stage-0 set runtime devices: 0
(APIServer pid=836403) INFO 04-19 23:14:32 [multiproc_executor.py:105] Starting server...
/app/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.1.dev1338+gf0756914d.d20260420
 --> vLLM version 0.19.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
/app/.venv/lib/python3.12/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
INFO 04-19 23:14:40 [diffusion_worker.py:417] Worker 0 created result MessageQueue
INFO 04-19 23:14:40 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-19 23:14:40 [vllm.py:790] Asynchronous scheduling is enabled.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-19 23:14:40 [diffusion_worker.py:127] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-19 23:14:40 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-19 23:14:40 [parallel_state.py:630] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  5.44it/s]
INFO 04-19 23:14:49 [platform.py:77] Defaulting to diffusion attention backend FLASH_ATTN
INFO 04-19 23:14:50 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
Multi-thread loading shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Multi-thread loading shards:  12% Completed | 1/8 [00:00<00:02,  2.40it/s]
Multi-thread loading shards:  25% Completed | 2/8 [00:00<00:02,  2.61it/s]
Multi-thread loading shards:  38% Completed | 3/8 [00:00<00:01,  3.48it/s]
Multi-thread loading shards:  50% Completed | 4/8 [00:01<00:01,  3.27it/s]
Multi-thread loading shards:  62% Completed | 5/8 [00:01<00:00,  3.13it/s]
Multi-thread loading shards:  75% Completed | 6/8 [00:01<00:00,  3.00it/s]
Multi-thread loading shards:  88% Completed | 7/8 [00:02<00:00,  2.93it/s]
Multi-thread loading shards: 100% Completed | 8/8 [00:02<00:00,  3.26it/s]
Multi-thread loading shards: 100% Completed | 8/8 [00:02<00:00,  3.10it/s]

INFO 04-19 23:14:53 [diffusers_loader.py:324] Loading weights took 2.62 seconds
INFO 04-19 23:14:53 [diffusion_model_runner.py:142] Model loading took 28.8962 GiB and 12.712799 seconds
INFO 04-19 23:14:53 [diffusion_model_runner.py:147] Model runner: Model loaded successfully.
INFO 04-19 23:14:53 [diffusion_model_runner.py:159]  Enabling offloader backend: LayerWiseOffloadBackend
INFO 04-19 23:14:53 [layerwise_backend.py:307] Applying layer-wise offloading on ['transformer', 'language_model', 'model']
INFO 04-19 23:14:53 [layerwise_backend.py:313] Applying hooks on transformer (LTX2VideoTransformer3DModel)
INFO 04-19 23:15:09 [layerwise_backend.py:385] Layer-wise offloading enabled on 48 layers (blocks)
INFO 04-19 23:15:09 [layerwise_backend.py:313] Applying hooks on language_model (Gemma3TextModel)
WARNING 04-19 23:15:09 [layerwise_backend.py:443] No _layerwise_offload_blocks_attrs defined for Gemma3TextModel, skipping layerwise offloading
WARNING 04-19 23:15:09 [layerwise_backend.py:318] Target layers (blocks) not found. Skipping offloading on language_model (Gemma3TextModel)
INFO 04-19 23:15:09 [layerwise_backend.py:313] Applying hooks on model (Gemma3Model)
WARNING 04-19 23:15:09 [layerwise_backend.py:443] No _layerwise_offload_blocks_attrs defined for Gemma3Model, skipping layerwise offloading
WARNING 04-19 23:15:09 [layerwise_backend.py:318] Target layers (blocks) not found. Skipping offloading on model (Gemma3Model)
INFO 04-19 23:15:09 [diffusion_model_runner.py:188] Model runner: Initialization complete.
INFO 04-19 23:15:09 [diffusion_worker.py:175] Worker 0: Process-scoped GPU memory after model loading: 0.00 GiB.
INFO 04-19 23:15:09 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-19 23:15:09 [diffusion_worker.py:91] Worker 0: Initialization complete.
INFO 04-19 23:15:09 [diffusion_worker.py:555] Worker 0: Scheduler loop started.
INFO 04-19 23:15:09 [diffusion_worker.py:478] Worker 0 ready to receive requests via shared memory
(APIServer pid=836403) INFO 04-19 23:15:09 [diffusion_engine.py:443] dummy run to warm up the model
WARNING 04-19 23:15:09 [kv_transfer_manager.py:985] No connector available for receiving KV cache
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:11<00:00,  1.41s/it]
INFO 04-19 23:15:21 [pipeline_ltx2.py:922] Got latents of shape [batch_size, latent_dim, latent_frames, latent_height, latent_width], `latent_num_frames`, `latent_height`, `latent_width` will be inferred.
INFO 04-19 23:15:21 [pipeline_ltx2.py:958] Got audio_latents of shape [batch_size, num_channels, audio_length, mel_bins], `audio_num_frames` will be inferred.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.42it/s]
INFO 04-19 23:15:23 [diffusion_model_runner.py:213] Peak GPU memory (this request): 35.98 GB reserved, 33.86 GB allocated, 2.12 GB pool overhead (5.9%)
(APIServer pid=836403) INFO 04-19 23:15:23 [inline_stage_diffusion_client.py:63] [InlineStageDiffusionClient] Stage-0 initialized inline (batch_size=1)
(APIServer pid=836403) INFO 04-19 23:15:23 [async_omni_engine.py:803] [AsyncOmniEngine] Stage 0 initialized (diffusion, batch_size=1)
(APIServer pid=836403) INFO 04-19 23:15:23 [orchestrator.py:185] [Orchestrator] Starting event loop
(APIServer pid=836403) INFO 04-19 23:15:23 [async_omni_engine.py:371] [AsyncOmniEngine] Orchestrator ready with 1 stages
(APIServer pid=836403) INFO 04-19 23:15:23 [omni_base.py:152] [AsyncOmni] AsyncOmniEngine initialized in 51.98 seconds
(APIServer pid=836403) INFO 04-19 23:15:23 [omni_base.py:167] [AsyncOmni] Initialized with 1 stages for model /data00/models/LTX-2-19b-distilled
(APIServer pid=836403) INFO 04-19 23:15:24 [api_server.py:477] Detected pure diffusion mode (single diffusion stage)
(APIServer pid=836403) INFO 04-19 23:15:24 [api_server.py:528] Pure diffusion API server initialized for model: /data00/models/LTX-2-19b-distilled
(APIServer pid=836403) INFO 04-19 23:15:24 [api_server.py:323] Starting vLLM API server (pure diffusion mode) on http://127.0.0.1:58713
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:37] Available routes are:
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/audio/speech, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/audio/speech/batch, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/audio/voices, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/audio/voices, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/audio/voices/{name}, Methods: DELETE
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/images/generations, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/images/edits, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/videos, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/videos/sync, Methods: POST
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/videos, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/videos/{video_id}, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/videos/{video_id}, Methods: DELETE
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:46] Route: /v1/videos/{video_id}/content, Methods: GET
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:57] Route: /v1/audio/speech/stream, Endpoint: streaming_speech
(APIServer pid=836403) INFO 04-19 23:15:24 [launcher.py:57] Route: /v1/realtime, Endpoint: realtime_websocket
(APIServer pid=836403) INFO:     Started server process [836403]
(APIServer pid=836403) INFO:     Waiting for application startup.
(APIServer pid=836403) INFO:     Application startup complete.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


Signed-off-by: Songrui625 <songrui625@gmail.com>

@Songrui625
Contributor Author

@wtomin @hsliuustc0106 @lishunyang12 PTAL. Thanks.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


BLOCKING:

  • Test Coverage — Missing regression test. Please add an automated test that verifies layerwise offloading correctly discovers DiT/transformer modules in nested pipeline structures like LTX2TwoStagesPipeline. The current test plan only provides manual server startup verification.

@hsliuustc0106
Collaborator

@yuanheng-zhao PTAL, this is your domain

@Songrui625 Songrui625 force-pushed the fix-ltx2-2stages-offload branch from 6ae1a95 to 9f6b04b on April 20, 2026 at 11:39
@Songrui625
Contributor Author

BLOCKING:

  • Test Coverage — Missing regression test. Please add an automated test that verifies layerwise offloading correctly discovers DiT/transformer modules in nested pipeline structures like LTX2TwoStagesPipeline. The current test plan only provides manual server startup verification.

Added a simple test case to cover it. CC @yuanheng-zhao

Contributor

@yuanheng-zhao yuanheng-zhao left a comment


This can be considered a temporary fix for the LTX2 two-stage pipelines: LTX2TwoStagesPipeline wraps an LTX2Pipeline instance, so the transformer module collector fails to find it.

Related PR #2427 cc @NickCao

module = find_module_with_attr(pipeline, attr)
if module is None:
    continue
pipeline = module
Contributor


The reassignment to pipeline here means subsequent DiT module lookups descend under the current module, which I think might not be stable, as it discards the root.

Contributor


It would be better to track the path from the outermost wrapper pipeline down to the transformer module that contains the offloadable layers.
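One possible shape for that tracking, as a rough sketch (the function and attribute names here are hypothetical, not the actual vllm-omni code): record the full dotted path from the root while descending, instead of reassigning the root.

```python
# Hypothetical sketch: collect (dotted_path, module) pairs from the root
# pipeline, so the root is never discarded during descent.
def discover_transformers(root, wrapper_attrs=("pipe", "upsample_pipe"),
                          target_attr="transformer"):
    found = []

    def walk(obj, path):
        target = getattr(obj, target_attr, None)
        if target is not None:
            found.append((f"{path}.{target_attr}" if path else target_attr,
                          target))
        for name in wrapper_attrs:
            child = getattr(obj, name, None)
            if child is not None:
                # Descend while keeping the full path from the root.
                walk(child, f"{path}.{name}" if path else name)

    walk(root, "")
    return found

# Tiny demo with stand-in classes:
class DiT: ...
class Inner:
    def __init__(self): self.transformer = DiT()
class Outer:
    def __init__(self):
        self.pipe = Inner()
        self.upsample_pipe = Inner()

paths = [p for p, _ in discover_transformers(Outer())]
print(paths)  # ['pipe.transformer', 'upsample_pipe.transformer']
```

Because every hit carries its path from the root, this also disambiguates the case where both pipe and upsample_pipe own a DiT.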

self.upsample_pipe = DummyPipeline()


class TestModuleDiscovery:
Contributor


What if a DiT is found on both pipe and upsample_pipe? The current resolution seems to fail in that case.

@yuanheng-zhao
Contributor

I think we might prefer a user (developer)-specified way to control how the target transformer(s) should be found. Even #2427 does not handle recursively looking for transformers in child modules.

@NickCao Might you want to add this handling? For example, enable looking for modules in nested children such as A.B.transformer:

_dit_modules: ClassVar[list[str]] = ["pipe.transformer"]
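Resolving such a user-specified dotted path could be as simple as folding getattr over the path segments. A sketch under that assumption (not the actual #2427 implementation):

```python
from functools import reduce

def resolve_dotted_path(root, dotted: str):
    """Resolve e.g. "pipe.transformer" to root.pipe.transformer, or None."""
    try:
        return reduce(getattr, dotted.split("."), root)
    except AttributeError:
        return None

# Stand-in classes for demonstration:
class DiT: ...
class Inner:
    def __init__(self): self.transformer = DiT()
class Outer:
    def __init__(self): self.pipe = Inner()

mod = resolve_dotted_path(Outer(), "pipe.transformer")
print(type(mod).__name__)                        # DiT
print(resolve_dotted_path(Outer(), "pipe.oops"))  # None
```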

@Songrui625
Contributor Author

I think we might prefer a user (developer)-specified way to control how the target transformer(s) should be found. Even #2427 does not handle recursively looking for transformers in child modules.

@NickCao Might you want to add this handling? For example, enable looking for modules in nested children such as A.B.transformer:

_dit_modules: ClassVar[list[str]] = ["pipe.transformer"]

LGTM, totally agree. Let's track this case in PR #2427. Happy to help if needed.

@Songrui625
Contributor Author

This is solved by PR #2427.
