Merged
40 changes: 40 additions & 0 deletions docs/getting_started/quickstart.md
@@ -46,6 +46,46 @@ if __name__ == "__main__":
images[0].save("coffee.png")
```

You can also pass a list of prompts and wait for all of them to finish processing, as shown below.

!!! info

    Passing a list of prompts is not currently recommended: not all models support batch inference,
    and batching requests rarely yields a significant performance improvement, even though it may appear to.
    This feature exists primarily for interface compatibility with vLLM and to allow for future improvements.

```python
from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(
        model="Tongyi-MAI/Z-Image-Turbo",
        # stage_configs_path="./stage-config.yaml",  # See below
    )
    prompts = [
        "a cup of coffee on a table",
        "a toy dinosaur on a sandy beach",
        "a fox waking up in bed and yawning",
    ]
    omni_outputs = omni.generate(prompts)
    for i_prompt, prompt_output in enumerate(omni_outputs):
        this_request_output = prompt_output.request_output[0]
        this_images = this_request_output.images
        for i_image, image in enumerate(this_images):
            image.save(f"p{i_prompt}-img{i_image}.jpg")
            print("saved to", f"p{i_prompt}-img{i_image}.jpg")
            # saved to p0-img0.jpg
            # saved to p1-img0.jpg
            # saved to p2-img0.jpg
```

!!! info

    For diffusion pipelines, the stage config field `stage_args.[].runtime.max_batch_size` is 1 by default, and the input
    list is sliced into single-item requests before feeding into the diffusion pipeline. For models that do internally support
    batched inputs, you can [modify this configuration](../configuration/stage_configs.md) to let the model accept a longer batch of prompts.
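
The slicing behaviour described in the note above can be pictured with a small sketch. It is illustrative only: `slice_into_batches` is a hypothetical helper written for this note, not a vLLM-Omni function, and the library's actual slicing logic may differ.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def slice_into_batches(prompts: Sequence[T], max_batch_size: int = 1) -> Iterator[List[T]]:
    """Yield consecutive chunks containing at most `max_batch_size` prompts."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(prompts), max_batch_size):
        yield list(prompts[start:start + max_batch_size])


prompts = [
    "a cup of coffee on a table",
    "a toy dinosaur on a sandy beach",
    "a fox waking up in bed and yawning",
]

# Default of 1: every prompt becomes its own single-item request.
single = list(slice_into_batches(prompts))

# A larger value (assumed here to be 2) groups prompts into larger batches.
paired = list(slice_into_batches(prompts, max_batch_size=2))
```

With the default of 1, three prompts yield three single-item chunks; with a value of 2, they yield one chunk of two prompts followed by one chunk of one.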

For more usage examples, see [offline inference](../user_guide/examples/offline_inference/qwen2_5_omni.md).

## Online Serving with OpenAI-Completions API
3 changes: 2 additions & 1 deletion docs/user_guide/diffusion/cache_dit_acceleration.md
@@ -18,6 +18,7 @@ Enable cache-dit acceleration by simply setting `cache_backend="cache_dit"`. Cac

```python
from vllm_omni.entrypoints.omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# Simplest way: just enable cache-dit with default parameters
omni = Omni(
@@ -27,7 +28,7 @@ omni = Omni(

images = omni.generate(
    "a beautiful landscape",
-    num_inference_steps=50,
+    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

42 changes: 30 additions & 12 deletions docs/user_guide/diffusion/parallelism_acceleration.md
@@ -67,10 +67,12 @@ omni = Omni(
)

outputs = omni.generate(
-    prompt="a cat reading a book",
-    num_inference_steps=9,
-    width=512,
-    height=512,
+    "a cat reading a book",
+    OmniDiffusionSamplingParams(
+        num_inference_steps=9,
+        width=512,
+        height=512,
+    ),
)
```

@@ -83,6 +85,7 @@ outputs = omni.generate(
An example of offline inference script using [Ulysses-SP](https://arxiv.org/pdf/2309.14509) is shown below:
```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ulysses_degree = 2

@@ -91,7 +94,10 @@ omni = Omni(
    parallel_config=DiffusionParallelConfig(ulysses_degree=2)
)

-outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50, width=2048, height=2048)
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
+)
```

See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example.
@@ -133,6 +139,7 @@ Ring-Attention ([arxiv paper](https://arxiv.org/abs/2310.01889)) splits the inpu
An example of offline inference script using Ring-Attention is shown below:
```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ring_degree = 2

@@ -141,7 +148,10 @@ omni = Omni(
    parallel_config=DiffusionParallelConfig(ring_degree=2)
)

-outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50, width=2048, height=2048)
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
+)
```

See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example.
@@ -183,6 +193,7 @@ You can combine both Ulysses-SP and Ring-Attention for larger scale parallelism.

```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig

# Hybrid: 2 Ulysses × 2 Ring = 4 GPUs total
@@ -191,7 +202,10 @@ omni = Omni(
    parallel_config=DiffusionParallelConfig(ulysses_degree=2, ring_degree=2)
)

-outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50, width=2048, height=2048)
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
+)
```

##### Online Serving
@@ -374,11 +388,15 @@ omni = Omni(
)

outputs = omni.generate(
-    prompt="turn this cat to a dog",
-    negative_prompt="low quality, blurry",
-    true_cfg_scale=4.0,
-    pil_image=input_image,
-    num_inference_steps=50,
+    {
+        "prompt": "turn this cat to a dog",
+        "negative_prompt": "low quality, blurry",
+    },
+    OmniDiffusionSamplingParams(
+        true_cfg_scale=4.0,
+        pil_image=input_image,
+        num_inference_steps=50,
+    ),
)
```
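
The call-style change in this example (prompt text moves into a prompt dictionary, and the remaining options move into a sampling-parameters object) can be sketched as a small migration helper. This is a rough illustration, not library code: `split_legacy_kwargs` and `PROMPT_KEYS` are hypothetical names, and the assumption that only `prompt` and `negative_prompt` belong in the prompt dictionary is inferred from the example above.

```python
# Assumption: only these keys belong in the prompt dictionary
# (inferred from the example above, not from the library source).
PROMPT_KEYS = ("prompt", "negative_prompt")


def split_legacy_kwargs(**kwargs):
    """Split old-style generate() keyword arguments into two plain dicts."""
    # Keys describing the prompt itself go into the prompt dictionary.
    prompt = {key: kwargs[key] for key in PROMPT_KEYS if key in kwargs}
    # Everything else is treated as a sampling parameter.
    sampling = {key: value for key, value in kwargs.items() if key not in PROMPT_KEYS}
    return prompt, sampling


prompt, sampling = split_legacy_kwargs(
    prompt="turn this cat to a dog",
    negative_prompt="low quality, blurry",
    true_cfg_scale=4.0,
    num_inference_steps=50,
)
```

The two dictionaries could then be passed as `omni.generate(prompt, OmniDiffusionSamplingParams(**sampling))`, assuming the sampling-parameters class accepts those fields as keyword arguments, as the examples in this diff suggest.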

16 changes: 14 additions & 2 deletions docs/user_guide/diffusion/teacache.md
@@ -8,6 +8,7 @@ Enable TeaCache by setting `cache_backend` to `"tea_cache"`:

```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# Simple configuration - model_type is automatically extracted from pipeline.__class__.__name__
omni = Omni(
@@ -17,7 +18,12 @@ omni = Omni(
        "rel_l1_thresh": 0.2  # Optional, defaults to 0.2
    }
)
-outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50)
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(
+        num_inference_steps=50,
+    ),
+)
```

### Using Environment Variable
@@ -68,13 +74,19 @@ Controls the balance between speed and quality. Lower values prioritize quality,

```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2}
)
-outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50)
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(
+        num_inference_steps=50,
+    ),
+)
```

## Performance Tuning
48 changes: 40 additions & 8 deletions docs/user_guide/diffusion_acceleration.md
@@ -96,20 +96,27 @@ To measure the parallelism methods, we run benchmarks with **Qwen/Qwen-Image** m

```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2}  # Optional, defaults to 0.2
)

-outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50)
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(
+        num_inference_steps=50,
+    ),
+)
```

### Using Cache-DiT

```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(
    model="Qwen/Qwen-Image",
@@ -123,14 +130,20 @@ omni = Omni(
    }
)

-outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50)
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(
+        num_inference_steps=50,
+    ),
+)
```

### Using Ulysses-SP

Run text-to-image:
```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ulysses_degree = 2

@@ -139,13 +152,17 @@ omni = Omni(
    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree)
)

-outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50, width=2048, height=2048)
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
+)
```


Run image-to-image:
```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ulysses_degree = 2

@@ -154,15 +171,21 @@ omni = Omni(
    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree)
)

-outputs = omni.generate(prompt="turn this cat to a dog",
-                        pil_image=input_image, num_inference_steps=50)
+outputs = omni.generate(
+    {
+        "prompt": "turn this cat to a dog",
+        "multi_modal_data": {"image": input_image}
+    },
+    OmniDiffusionSamplingParams(num_inference_steps=50),
+)
```

### Using Ring-Attention

Run text-to-image:
```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ring_degree = 2

@@ -171,7 +194,10 @@ omni = Omni(
    parallel_config=DiffusionParallelConfig(ring_degree=2)
)

-outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50, width=2048, height=2048)
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
+)
```

### Using CFG-Parallel
@@ -182,6 +208,7 @@ CFG-Parallel splits the CFG positive/negative branches across GPUs. Use it when

```python
from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
cfg_parallel_size = 2

@@ -190,8 +217,13 @@ omni = Omni(
    parallel_config=DiffusionParallelConfig(cfg_parallel_size=cfg_parallel_size)
)

-outputs = omni.generate(prompt="turn this cat to a dog",
-                        pil_image=input_image, num_inference_steps=50, true_cfg_scale=4.0)
+outputs = omni.generate(
+    {
+        "prompt": "turn this cat to a dog",
+        "multi_modal_data": {"image": input_image}
+    },
+    OmniDiffusionSamplingParams(num_inference_steps=50, true_cfg_scale=4.0),
+)
```

## Documentation
37 changes: 36 additions & 1 deletion docs/user_guide/examples/offline_inference/text_to_image.md
@@ -23,7 +23,7 @@ if __name__ == "__main__":
images[0].save("coffee.png")
```

-Or put more than one prompt in a request, processing them sequentially.
+Or put more than one prompt in a request.

```python
from vllm_omni.entrypoints.omni import Omni
@@ -40,6 +40,41 @@ if __name__ == "__main__":
image = output.request_output[0].images[0].save(f"{i}.jpg")
```

!!! info

    Putting more than one prompt in a request is not currently recommended: not all models support batch inference,
    and batching requests rarely yields a significant performance improvement, even though it may appear to.
    This feature exists primarily for interface compatibility with vLLM and to allow for future improvements.

!!! info

    For diffusion pipelines, the stage config field `stage_args.[].runtime.max_batch_size` is 1 by default, and the input
    list is sliced into single-item requests before feeding into the diffusion pipeline. For models that do internally support
    batched inputs, you can [modify this configuration](../../../configuration/stage_configs.md) to let the model accept a longer batch of prompts.

In addition to string prompts, vLLM-Omni also accepts dictionary prompts, in the same style as vLLM.
This is useful for models that support negative prompts.

```python
from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Qwen/Qwen-Image")
    outputs = omni.generate([
        {
            "prompt": "a cup of coffee on a table",
            "negative_prompt": "low resolution"
        },
        {
            "prompt": "a toy dinosaur on a sandy beach",
            "negative_prompt": "cinematic, realistic"
        }
    ])
    for i, output in enumerate(outputs):
        image = output.request_output[0].images[0].save(f"{i}.jpg")
```
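
When a prompt list mixes plain strings and dictionaries, it can help to normalize everything to the dictionary form first. The sketch below is one possible way to do that; `to_prompt_dicts` and its `default_negative_prompt` argument are hypothetical names introduced for illustration and are not part of vLLM-Omni.

```python
def to_prompt_dicts(prompts, default_negative_prompt=None):
    """Return a list of dictionary prompts built from strings and/or dictionaries."""
    normalized = []
    for item in prompts:
        # Wrap strings; copy dictionaries so the caller's objects are not modified.
        entry = {"prompt": item} if isinstance(item, str) else dict(item)
        # Fill in a negative prompt only where none was given.
        if default_negative_prompt is not None:
            entry.setdefault("negative_prompt", default_negative_prompt)
        normalized.append(entry)
    return normalized


mixed = [
    "a cup of coffee on a table",
    {"prompt": "a toy dinosaur on a sandy beach", "negative_prompt": "cinematic, realistic"},
]
prompt_dicts = to_prompt_dicts(mixed, default_negative_prompt="low resolution")
```

The resulting list has the same shape as the list passed to `omni.generate` in the example above.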

## Local CLI Usage

```bash