1 change: 1 addition & 0 deletions examples/online_serving/text_to_image/README.md
@@ -116,6 +116,7 @@ Use `extra_body` to pass generation parameters:
| `seed` | int | None | Random seed (reproducible) |
| `negative_prompt` | str | None | Negative prompt |
| `num_outputs_per_prompt` | int | 1 | Number of images to generate |
| `--cfg-parallel-size` | int | 1 | Number of GPUs for CFG parallelism |

## Response Format

5 changes: 3 additions & 2 deletions vllm_omni/entrypoints/async_omni.py
@@ -132,9 +132,10 @@ def _create_default_diffusion_stage_cfg(self, kwargs: dict[str, Any]) -> dict[st
ring_degree = kwargs.get("ring_degree") or 1
sequence_parallel_size = kwargs.get("sequence_parallel_size")
tensor_parallel_size = kwargs.get("tensor_parallel_size") or 1
cfg_parallel_size = kwargs.get("cfg_parallel_size") or 1
if sequence_parallel_size is None:
sequence_parallel_size = ulysses_degree * ring_degree
num_devices = sequence_parallel_size * tensor_parallel_size
num_devices = sequence_parallel_size * tensor_parallel_size * cfg_parallel_size
Comment on lines +135 to +138
[P2] Account for CFG parallel GPUs in stage device locking

The new cfg_parallel_size factor increases the diffusion stage’s device list (num_devices now multiplies by cfg_parallel_size), but the stage worker’s lock calculation still only uses TP/PP/DP/SP (vllm_omni/entrypoints/omni_stage.py lines 470–499). When cfg_parallel_size > 1 and multiple stages/processes initialize concurrently, the extra CFG GPUs won’t be locked, so another stage can initialize on them at the same time, defeating the “lock ALL devices” guarantee and risking memory-calculation/OOM races. Consider including cfg_parallel_size in num_devices_per_stage (or otherwise locking all CUDA_VISIBLE_DEVICES) to keep the lock coverage consistent with the new device list.
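
To make the mismatch concrete, here is a minimal sketch of the two device counts described above. The helper names `devices_requested` and `devices_locked_without_cfg` are hypothetical and do not exist in vllm_omni; they only mirror the two formulas discussed in this comment, and the example sizes are made up.

```python
def devices_requested(sequence_parallel_size: int,
                      tensor_parallel_size: int,
                      cfg_parallel_size: int) -> int:
    """Device count as computed for the default diffusion stage after this change."""
    return sequence_parallel_size * tensor_parallel_size * cfg_parallel_size


def devices_locked_without_cfg(tp: int, pp: int, dp: int, pcp: int, sp: int) -> int:
    """Lock coverage when the stage worker multiplies only TP * PP * DP * PCP * SP."""
    return tp * pp * dp * pcp * sp


# Hypothetical configuration: SP=2, TP=1, CFG=2.
requested = devices_requested(sequence_parallel_size=2,
                              tensor_parallel_size=1,
                              cfg_parallel_size=2)
locked = devices_locked_without_cfg(tp=1, pp=1, dp=1, pcp=1, sp=2)

# Devices that appear in the stage's device list but are not covered by the lock.
unlocked = requested - locked
print(f"requested={requested} locked={locked} unlocked={unlocked}")
```

With these sizes the stage asks for four devices while only two are locked, which is the gap the comment warns about.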

for i in range(1, num_devices):
devices += f",{i}"
parallel_config = DiffusionParallelConfig(
@@ -144,7 +145,7 @@ def _create_default_diffusion_stage_cfg(self, kwargs: dict[str, Any]) -> dict[st
sequence_parallel_size=sequence_parallel_size,
ulysses_degree=ulysses_degree,
ring_degree=ring_degree,
cfg_parallel_size=1,
cfg_parallel_size=cfg_parallel_size,
)
default_stage_cfg = [
{
3 changes: 3 additions & 0 deletions vllm_omni/entrypoints/cli/serve.py
@@ -208,6 +208,9 @@ def subparser_init(self, subparsers: argparse._SubParsersAction) -> FlexibleArgu
default=None,
help="Scheduler flow_shift for video models (e.g., 5.0 for 720p, 12.0 for 480p).",
)
omni_config_group.add_argument(
"--cfg-parallel-size", type=int, default=1, help="Number of GPUs for CFG parallel computation"
Collaborator

keep it as before

Contributor Author

@gDINESH13 gDINESH13 Jan 17, 2026

Do you want me to remove this? That would make the parameter impossible to configure when starting the server, right?

Collaborator

No, I mean keep the line break as before.

Contributor Author

@gDINESH13 gDINESH13 Jan 17, 2026

It fails the pre-commit formatting check if I keep the line break.

)
return serve_parser
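
As a rough illustration of how the new flag could reach the `kwargs.get("cfg_parallel_size")` lookup in `async_omni.py`, the sketch below uses the standard-library `argparse.ArgumentParser` as a stand-in for `FlexibleArgumentParser`. The group title and the `vars(args)` hand-off are assumptions made for the example, not the project's actual wiring.

```python
import argparse

# Stand-in for the real serve parser; the group title is hypothetical.
parser = argparse.ArgumentParser(prog="serve-sketch")
omni_config_group = parser.add_argument_group("omni config")
omni_config_group.add_argument(
    "--cfg-parallel-size", type=int, default=1, help="Number of GPUs for CFG parallel computation"
)

# argparse derives the attribute name from the long option: dashes become underscores.
args = parser.parse_args(["--cfg-parallel-size", "2"])
kwargs = vars(args)  # assumed hand-off: parsed arguments passed along as a plain dict

# Same defaulting pattern as the diff: a missing, None, or 0 value falls back to 1.
cfg_parallel_size = kwargs.get("cfg_parallel_size") or 1

# Omitting the flag yields the declared default.
default_args = parser.parse_args([])
print(cfg_parallel_size, default_args.cfg_parallel_size)
```

The `or 1` fallback also turns an explicit `0` into `1`, which is worth keeping in mind when validating user input.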


6 changes: 5 additions & 1 deletion vllm_omni/entrypoints/omni_stage.py
@@ -474,12 +474,14 @@ def _stage_worker(
data_parallel_size = parallel_config.get("data_parallel_size", 1)
prefill_context_parallel_size = 1 # not used for diffusion
sequence_parallel_size = parallel_config.get("sequence_parallel_size", 1)
cfg_parallel_size = parallel_config.get("cfg_parallel_size", 1)
else:
tensor_parallel_size = engine_args.get("tensor_parallel_size", 1)
pipeline_parallel_size = engine_args.get("pipeline_parallel_size", 1)
data_parallel_size = engine_args.get("data_parallel_size", 1)
prefill_context_parallel_size = engine_args.get("prefill_context_parallel_size", 1)
sequence_parallel_size = 1 # not use in omni model
cfg_parallel_size = 1 # not used in omni model

# Calculate total number of devices needed for this stage
# For a single stage worker:
@@ -488,14 +490,16 @@ def _stage_worker(
# - DP: replicates model, but each replica uses TP devices
# - PCP: context parallelism, typically uses TP devices
# - SP: sequence parallelism, typically uses TP devices
# The number of devices per stage is determined by TP * PP * DP * PCP * SP size
# - CFG: Classifier-Free Guidance parallelism for diffusion models
# The number of devices per stage is determined by TP * PP * DP * PCP * SP * CFG size
# (PP/DP/PCP are higher-level parallelism that don't add devices per stage)
num_devices_per_stage = (
tensor_parallel_size
* pipeline_parallel_size
* data_parallel_size
* prefill_context_parallel_size
* sequence_parallel_size
* cfg_parallel_size
)
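
The product above can be sketched as a small standalone function. This is an illustration only: `num_devices_per_stage_sketch` and the example dictionary are hypothetical, and every size is read from one dict with a default of 1, whereas the real `_stage_worker` takes the values from either `parallel_config` or `engine_args` and hard-codes some of them.

```python
import math

# Parallelism sizes that enter the per-stage device product in this hunk.
SIZE_KEYS = (
    "tensor_parallel_size",
    "pipeline_parallel_size",
    "data_parallel_size",
    "prefill_context_parallel_size",
    "sequence_parallel_size",
    "cfg_parallel_size",
)


def num_devices_per_stage_sketch(parallel_config: dict) -> int:
    """Multiply all parallelism sizes, treating missing keys as 1."""
    return math.prod(int(parallel_config.get(key, 1)) for key in SIZE_KEYS)


# Hypothetical diffusion stage: TP=2, SP=2, CFG=2, everything else defaulted.
example_config = {
    "tensor_parallel_size": 2,
    "sequence_parallel_size": 2,
    "cfg_parallel_size": 2,
}
total = num_devices_per_stage_sketch(example_config)
print(total)
```

Because each factor defaults to 1, a configuration that omits `cfg_parallel_size` produces the same count as before this change.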

# Get physical device IDs from CUDA_VISIBLE_DEVICES