
Add tensor parallel support to stable audio open 1.0 #1406

Open

akshatvishu wants to merge 19 commits into vllm-project:main from akshatvishu:feature/sao_tensor_parallel

Conversation

akshatvishu (Contributor) commented Feb 19, 2026

Part of #1217

Purpose

Add tensor parallel support to Stable Audio Open 1.0.

Test Plan

All experiments were conducted on Kaggle using 2× Tesla T4 GPUs.
Due to hardware constraints, the maximum tensor parallel size (tp_size) tested was 2.

Environment

GPUs: 2 × Tesla T4

Precision: float16 & float32

Maximum tensor parallel size (TP): 2

Global Test Configuration:

STEPS = 100
AUDIO_LENGTH = 10.0  # seconds

Model Initialization:

parallel_config = DiffusionParallelConfig(
    tensor_parallel_size=tp_size
)

# Initialize Omni model
omni = Omni(
    model=MODEL_PATH,
    dtype="float16",
    parallel_config=parallel_config,
)

generator = torch.Generator(
    device=current_omni_platform.device_type
).manual_seed(SEED)

params = OmniDiffusionSamplingParams(
    num_inference_steps=STEPS,
    guidance_scale=7.0,
    generator=generator,
    extra_args={
        "audio_start_in_s": 0.0,
        "audio_end_in_s": AUDIO_LENGTH,
    },
)

omni.generate(
    {
        "prompt": "A danceable electronic track in the genre of dance.",
        "negative_prompt": "Low quality, noisy",
    },
    params,
)

Test Result

For float16:

| Name | tp_size | Time (s) | Speedup | File |
| --- | --- | --- | --- | --- |
| hf_fp16 | - | 28.67 | - | dance_track.mp3 |
| baseline_tp1_fp16 | 1 | 26.40 | - | TP_1_fp16.mp3 |
| tp2_fp16 | 2 | 20.01 | 1.32x | TP_2_fp16.mp3 |

For float32:

| Name | tp_size | Time (s) | Speedup | File |
| --- | --- | --- | --- | --- |
| hf_fp32 | - | 194.15 | - | dance_track_fp32.mp3 |
| baseline_tp_1_fp32 | 1 | 145.00 | - | TP_1_fp32.mp3 |
| tp_2_fp32 | 2 | 78.01 | 1.86x | TP_2_fp32.mp3 |
  • Files are attached as .mp3 because GitHub does not allow .wav uploads.
  • Entries with the hf_ prefix run the Stable Audio Hugging Face Diffusers pipeline with the same parameters.
  • Entries with the baseline_ prefix are the ones against which the speedup is calculated.

For extended testing and results, please refer to this kaggle-notebook.



akshatvishu force-pushed the feature/sao_tensor_parallel branch from c28f88a to 7e207dc on February 20, 2026 21:44
Add Tensor Parallelism (TP) support for Stable Audio Open.

- Implement fused QKV and GLU weight loading for TP shards
- Add cross-rank SDE synchronization for deterministic sampling
- Preserve compatibility with the existing inference flow

Tested with multi-GPU (2× T4) TP configurations.

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
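
For context on the first bullet, a minimal sketch of per-rank shard loading for a fused QKV projection (the function name and arguments are illustrative, not the actual vllm-omni API):

```python
import torch

def shard_fused_qkv(qkv_weight: torch.Tensor, tp_rank: int, tp_size: int) -> torch.Tensor:
    """Slice a fused [3 * hidden, in_features] QKV checkpoint weight for one TP rank.

    Q, K, and V are each column-parallel, so a rank must take a contiguous
    1/tp_size slice of *each* projection, not a naive slice of the fused matrix.
    """
    hidden = qkv_weight.shape[0] // 3
    q, k, v = qkv_weight.split(hidden, dim=0)
    shards = []
    for w in (q, k, v):
        shard = w.shape[0] // tp_size
        shards.append(w[tp_rank * shard : (tp_rank + 1) * shard])
    return torch.cat(shards, dim=0)
```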
akshatvishu force-pushed the feature/sao_tensor_parallel branch from 7e207dc to 3a5fa86 on February 20, 2026 21:51
lishunyang12 (Collaborator) left a comment


Thanks for the contribution. The TP approach works, but broadcasting after every transformer block effectively defeats the purpose of tensor parallelism. I suspect there is a subtle numerical divergence elsewhere. A few high-level concerns and questions inline.
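
For reference, the Megatron-style communication pattern the reviewer is describing (a conceptual sketch, not the code in this PR): shard the in-projection column-wise and the out-projection row-wise, so each transformer block needs exactly one all-reduce of partial sums instead of a broadcast of full hidden states.

```python
import torch
import torch.distributed as dist

def tp_mlp_block(x, w_in_shard, w_out_shard, tp_group):
    # Column-parallel in-projection: each rank computes its own slice of the
    # intermediate activation, with no communication.
    h = torch.nn.functional.silu(x @ w_in_shard.t())
    # Row-parallel out-projection: each rank produces a partial sum of the
    # full-width output.
    y = h @ w_out_shard.t()
    # One all-reduce per block combines the partial sums; broadcasting the
    # hidden states from a single rank would instead serialize the math.
    dist.all_reduce(y, group=tp_group)
    return y
```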

[10 review comment threads on vllm_omni/diffusion/models/stable_audio/stable_audio_transformer.py; 2 marked Outdated]
- Remove cross-rank hidden state broadcasts to restore true TP All-Reduce.
- Fix nn.Sequential tuple crash by using nn.Linear for cross-attention.
- Refactor Gaussian Fourier embeddings to avoid unsafe distributed init.
- Replace Python assert statements with explicit exceptions.
- Add architectural docstrings for MHA/GQA routing and SwiGLU fusion.
- Pass synchronized generator to SDE scheduler step to fix numerical drift.
- Sync unseeded generation via tp_group.broadcast instead of global RNG mutation.
- Reset scheduler.noise_sampler on forward pass.
- Remove sigma_min configuration override to restore native noise schedule.

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
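
A minimal sketch of the generator synchronization described above, assuming a torch.distributed-style process group for TP (the helper name and call sites are illustrative):

```python
import torch
import torch.distributed as dist

def make_synced_generator(device: torch.device, tp_group) -> torch.Generator:
    """Seed every TP rank's generator identically without mutating global RNG.

    Rank 0 draws a seed and broadcasts it, so the SDE noise sampled from the
    returned generators is bit-identical across ranks and the denoising
    trajectories cannot drift apart.
    """
    seed = torch.empty(1, dtype=torch.int64, device=device)
    if dist.get_rank(group=tp_group) == 0:
        seed.random_()
    dist.broadcast(seed, src=0, group=tp_group)
    return torch.Generator(device=device).manual_seed(int(seed.item()))
```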
…latents

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
…enable TP support.

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
…without OOM

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
akshatvishu marked this pull request as ready for review on February 21, 2026 20:14
chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5dbdf6ca6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

[Review comment threads on vllm_omni/diffusion/models/stable_audio/stable_audio_transformer.py and vllm_omni/diffusion/models/stable_audio/pipeline_stable_audio.py]
fix(transformer): restore legacy checkpoint key mapping in load_weight

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
akshatvishu (Contributor, Author) commented Feb 21, 2026

Thanks for the review! Just a heads-up: I accidentally performed a force-with-lease push on a follow-up after the initial squash. This shifted the hunk headers, so my previous responses to your line-specific comments might point to the wrong code blocks now, but I've manually verified that all requested changes are addressed.

  checkpoint keys in `name_mapping` while retaining the legacy `.linear_x.`
  keys for backward compatibility. This prevents `timestep_proj` and
  `global_proj` from silently failing to load and using random initialization.

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
akshatvishu changed the title from "Add tensor parallel support to stable audio opem" to "Add tensor parallel support to stable audio open 1.0" on Feb 21, 2026
hsliuustc0106 (Collaborator) commented:

@vllm-omni-reviewer

hsliuustc0106 (Collaborator) commented:

Hi @akshatvishu 👋

This tensor parallel support PR for stable audio hasn't been updated for 16 days. Is this still being worked on?

Thanks!

akshatvishu (Contributor, Author) commented Mar 12, 2026

Hey @hsliuustc0106!
I think apart from a minor style fix in one of the comments, this is ready to go from my side! Happy to provide more tests if needed!

Update: Done with the style changes!

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Gaohan123 added this to the v0.18.0 milestone on Mar 17, 2026
linyueqian and others added 4 commits March 24, 2026 01:14
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
…atvishu/vllm-omni into feature/sao_tensor_parallel

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Gaohan123 added this to the v0.20.0 milestone on Apr 14, 2026
lishunyang12 (Collaborator) commented:

Any progress?

akshatvishu (Contributor, Author) commented Apr 20, 2026

> Any progress?

It's ready from my side! Any tests or benchmarks you want me to run?

akshatvishu and others added 2 commits April 20, 2026 20:43
Keep both tensor-parallel-size arg (feature branch) and cache-backend/
tea-cache args (main) in text_to_audio.py example. Pass parallel_config
and cache params to Omni constructor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <33392262+akshatvishu@users.noreply.github.com>
linyueqian added the "ready" label (triggers Buildkite CI) on Apr 24, 2026
linyueqian removed this from the v0.20.0 milestone on Apr 24, 2026

# Memory Optimization: Decode latents in chunks to prevent VAE OOM spikes.
# Note: Safe default for 47s audio on T4.
chunk_size = 1
A Collaborator left a comment:

🟡 [important] VAE chunk_size = 1 serializes decode for every caller, not just T4.

The earlier code did `self.vae.decode(latents_for_vae).sample` in a single call. Hardcoding `chunk_size = 1` here means that once `num_waveforms_per_prompt > 1` or batched inference lands, users on H100 / A100 pay a serial VAE decode that they do not need. The comment says "safe default for 47s audio on T4", but the default is now applied to everyone.

Could we either:

  1. Expose vae_chunk_size as a sampling param / config option, defaulting to latents_for_vae.shape[0] (whole-batch decode, matches old behavior), so T4 users can opt into vae_chunk_size=1, or
  2. Pick the chunk size dynamically from available VRAM or batch size so the common single-waveform path stays a single decode call?

As-is, this is a silent perf regression for non-T4 users.

akshatvishu (Contributor, Author) replied:

Took approach 1: `vae_chunk_size` is now an explicit sampling param with default `latents_for_vae.shape[0]`, so the common path stays a single VAE decode call. `vae_chunk_size=1` is still available for low-VRAM use.

I added a unit test for the chunking behavior. Should I keep that, or remove it?
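
A sketch of the agreed resolution, assuming the `vae_chunk_size` sampling param from the thread (the decode loop itself is illustrative, not the exact pipeline code):

```python
import torch

def decode_latents(vae, latents: torch.Tensor, vae_chunk_size: int | None = None) -> torch.Tensor:
    """Decode latents along the batch dim in chunks of vae_chunk_size.

    Defaults to the whole batch (one VAE call, matching the old behavior);
    low-VRAM users can pass vae_chunk_size=1 to trade speed for peak memory.
    """
    chunk = vae_chunk_size or latents.shape[0]
    pieces = [
        vae.decode(latents[i : i + chunk]).sample
        for i in range(0, latents.shape[0], chunk)
    ]
    return torch.cat(pieces, dim=0)
```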

[Outdated review comment threads on vllm_omni/diffusion/models/stable_audio/stable_audio_transformer.py and vllm_omni/diffusion/models/stable_audio/pipeline_stable_audio.py]
linyueqian (Collaborator) commented:

@princepride PTAL as well, thanks

akshatvishu and others added 3 commits April 24, 2026 23:15
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
linyueqian added this to the v0.20.0 milestone on Apr 28, 2026
linyueqian enabled auto-merge (squash) on April 28, 2026 17:36
akshatvishu (Contributor, Author) commented Apr 29, 2026

Failed tests:

=========================== short test summary info ============================
FAILED tests/diffusion/cache/test_teacache_extractors.py::TestFluxExtractor::test_modulated_input_shape
FAILED tests/diffusion/cache/test_teacache_extractors.py::TestFluxExtractor::test_run_transformer_blocks_callable
FAILED tests/diffusion/cache/test_teacache_extractors.py::TestFluxExtractor::test_postprocess_callable
FAILED tests/diffusion/cache/test_teacache_extractors.py::TestFluxExtractor::test_postprocess_output_shape
FAILED tests/diffusion/cache/test_teacache_extractors.py::TestFluxExtractor::test_postprocess_return_tuple_when_return_dict_false
FAILED tests/diffusion/cache/test_teacache_extractors.py::TestFluxExtractor::test_without_guidance

error:

E NotImplementedError: Could not run 'vllm::rocm_unquantized_gemm' with arguments from the 'CPU' backend

The CI CPU tests for the FLUX extractors are crashing on ROCm builds, probably because ReplicatedLinear dispatches to ROCm GPU kernels. I think we can fix this by using the same monkeypatch logic as in the adalayernorm tests to force the default CPU-compatible GEMM path during test execution.

Happy to raise a PR to fix this if that is indeed the preferred way forward!

ref:
https://github.com/vllm-project/vllm-omni/blob/main/tests/diffusion/layers/test_adalayernorm.py#L37

cc: @linyueqian
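
For illustration, a hypothetical pytest fixture in the spirit of the referenced adalayernorm test; the exact attribute to patch depends on vllm's linear-layer internals:

```python
import pytest
import torch.nn.functional as F

@pytest.fixture(autouse=True)
def force_cpu_gemm(monkeypatch):
    # Hypothetical: route ReplicatedLinear through plain F.linear so CPU-only
    # CI never reaches the ROCm-specific fused GEMM custom op.
    from vllm.model_executor.layers.linear import ReplicatedLinear

    def cpu_forward(self, x):
        bias = getattr(self, "bias", None)
        # vllm linear layers return an (output, output_bias) tuple.
        return F.linear(x, self.weight, bias), None

    monkeypatch.setattr(ReplicatedLinear, "forward", cpu_forward)
```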

Gaohan123 modified the milestones: v0.20.0 → v0.22.0 on May 9, 2026