[Bugfix][StableAudio] Pass model_class_name to Omni() and declare audio class attrs #3406

Open
linyueqian wants to merge 5 commits into vllm-project:main from linyueqian:fix/stable_audio_audio_meta

@linyueqian
Collaborator

Purpose

The L4 nightly run for test_stable_audio_quantization_and_teacache (build 9093) fails with:

AssertionError: assert 'image' == 'audio'
- audio
+ image
tests/e2e/offline_inference/test_stable_audio_expansion.py:61

#2077 added a branch in async_omni_engine._create_default_diffusion_stage_cfg that sets the default stage's final_output_type="audio" when kwargs["model_class_name"] resolves to a pipeline whose support_audio_output flag is True, and tightened the stable-audio assertion to == "audio". The catch: OmniDiffusionConfig.enrich_config() is what auto-resolves model_class_name from model_index.json, and it runs after the default stage cfg is built. So at the time the engine branches on kwargs.get("model_class_name", None) it's still None, the else arm fires, and the outer stage carries final_output_type="image".

The companion tests/e2e/offline_inference/test_audiox_model.py already side-steps this by passing model_class_name="AudioXPipeline" explicitly into Omni(). Mirror the same pattern in the stable-audio test.
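
The ordering problem and the workaround can be illustrated with a small stand-alone sketch. Everything below is a simplified stand-in written for this illustration, not the actual vllm-omni code: `create_default_stage_cfg`, `enrich_config`, and the `_AUDIO_PIPELINES` registry are hypothetical names that only mimic the behaviour described above, and the hard-coded class name replaces the real lookup in model_index.json.

```python
from typing import Any, Dict, Optional

# Hypothetical registry: pipeline class name -> produces audio output.
_AUDIO_PIPELINES = {"StableAudioPipeline": True, "AudioXPipeline": True}


def supports_audio_output(model_class_name: str) -> bool:
    """Stand-in lookup; unknown pipelines are treated as non-audio."""
    return _AUDIO_PIPELINES.get(model_class_name, False)


def create_default_stage_cfg(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Simplified stand-in for _create_default_diffusion_stage_cfg."""
    model_class_name: Optional[str] = kwargs.get("model_class_name", None)
    if model_class_name is not None and supports_audio_output(model_class_name):
        return {"final_output_type": "audio"}
    # The else arm: no class name known yet, so fall back to "image".
    return {"final_output_type": "image"}


def enrich_config(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Simplified stand-in for OmniDiffusionConfig.enrich_config.

    The real method resolves the class name from model_index.json;
    here the resolved value is hard-coded (an assumption).
    """
    enriched = dict(kwargs)
    enriched.setdefault("model_class_name", "StableAudioPipeline")
    return enriched


# Order described for main: the stage cfg is built first, enrichment runs later.
kwargs_main = {"model": "..."}
stage_cfg_main = create_default_stage_cfg(kwargs_main)
enriched_main = enrich_config(kwargs_main)  # too late to affect stage_cfg_main

# Workaround used by the tests: pass the class name explicitly up front.
kwargs_fixed = {"model": "...", "model_class_name": "StableAudioPipeline"}
stage_cfg_fixed = create_default_stage_cfg(kwargs_fixed)

print(stage_cfg_main["final_output_type"])   # image
print(stage_cfg_fixed["final_output_type"])  # audio
```

The sketch shows why the explicit kwarg is enough for the test: the branch only inspects what the caller passed in, so it never sees the value that enrichment resolves afterwards.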

While I was there, also align StableAudioPipeline's class header with AudioXPipeline's by declaring the audio-output contract explicitly:

  • support_audio_output: ClassVar[bool] = True. This is currently inherited from the SupportAudioOutput Protocol, which works because Protocol class attributes carry through subclasses, but making it explicit matches the AudioX/OmniVoice pattern and removes the dependency on Protocol-default-attribute semantics.
  • audio_sample_rate: ClassVar[int] = 44100. This is picked up by diffusion_engine._audio_mm so multimodal_output["audio_sample_rate"] is populated; downstream consumers no longer need to hardcode 44.1 kHz for Stable Audio Open.
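
A minimal sketch of the two declaration styles, for readers unfamiliar with the Protocol behaviour mentioned above. The `SupportAudioOutput` body here is an assumption (the real Protocol definition is not shown in this PR description), the two pipeline classes are hypothetical, and `audio_metadata` is an illustrative helper that only approximates what a reader such as diffusion_engine._audio_mm might do.

```python
from typing import ClassVar, Dict, Protocol


class SupportAudioOutput(Protocol):
    """Assumed shape of the Protocol: a class-level flag with a default."""

    support_audio_output: ClassVar[bool] = True


class InheritedFlagPipeline(SupportAudioOutput):
    """Hypothetical pipeline that relies on the Protocol's default value."""


class ExplicitFlagPipeline(SupportAudioOutput):
    """Hypothetical pipeline that declares the audio contract itself."""

    support_audio_output: ClassVar[bool] = True
    audio_sample_rate: ClassVar[int] = 44100


def audio_metadata(pipeline_cls: type) -> Dict[str, int]:
    """Illustrative helper: report the sample rate only if the class declares one."""
    meta: Dict[str, int] = {}
    rate = getattr(pipeline_cls, "audio_sample_rate", None)
    if rate is not None:
        meta["audio_sample_rate"] = rate
    return meta


# Explicit subclasses of a Protocol inherit its class attributes.
print(InheritedFlagPipeline.support_audio_output)  # True
print(audio_metadata(InheritedFlagPipeline))       # {}
print(audio_metadata(ExplicitFlagPipeline))        # {'audio_sample_rate': 44100}
```

Under these assumptions the flag is available either way, which is consistent with the description of the explicit declaration as defensive; the sample rate, by contrast, is only visible once it is declared on the class.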

Verification

On h20-server-0 against vllm-project/vllm-omni:main (3c85ca55):

| step | result |
| --- | --- |
| upstream main: supports_audio_output("StableAudioPipeline") | True (Protocol inheritance already provided the flag, so the class-attr addition is defensive, not load-bearing for the pass/fail) |
| upstream main: _create_default_diffusion_stage_cfg(kwargs) with kwargs={"model": "..."} | final_output_type="image" because kwargs.get("model_class_name") is None (auto-resolution hasn't run) → matches the failing assertion |
| this PR: same call with kwargs={"model": "...", "model_class_name": "StableAudioPipeline"} | final_output_type="audio" |

Test Plan

tests/e2e/offline_inference/test_stable_audio_expansion.py::test_stable_audio_quantization_and_teacache should now go green on the next L4 nightly. The companion test_audiox_model is unchanged and should still pass.

The full L4 stable-audio inference run was not exercised on h20 (the 16 GB FP8 weights + tea_cache combination is L4-shaped), but the final_output_type mismatch that produced the assertion failure is fully reproducible with a Python-only check of _create_default_diffusion_stage_cfg's output.


linyueqian added 4 commits May 7, 2026 01:11
…io class attrs

The L4 nightly test_stable_audio_quantization_and_teacache fails with 'image' != 'audio'. PR vllm-project#2077 added an engine branch in async_omni_engine._create_default_diffusion_stage_cfg that sets final_output_type='audio' when kwargs has model_class_name pointing at a pipeline whose support_audio_output is True, and tightened the test assertion. The model_class_name auto-resolution from model_index.json runs later (in OmniDiffusionConfig.enrich_config); by the time it runs, the default stage cfg's final_output_type is already locked to 'image'. Mirror the AudioX offline test, which already passes model_class_name='AudioXPipeline' explicitly. Also align StableAudioPipeline with AudioXPipeline by declaring support_audio_output and audio_sample_rate as class attributes (the latter is read by diffusion_engine._audio_mm to populate multimodal_output['audio_sample_rate']).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@linyueqian linyueqian requested a review from hsliuustc0106 as a code owner May 7, 2026 05:58
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@linyueqian
Collaborator Author

Closing+reopening to retrigger RTD with the now-exposed pull/3406/head ref (RTD's earlier attempts raced GitHub's async ref propagation).

@linyueqian linyueqian closed this May 7, 2026
@linyueqian linyueqian reopened this May 7, 2026
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

LGTM. The root cause analysis is clear (auto-resolution of model_class_name happens after the default stage cfg is built), and the fix is minimal and targeted. Adding explicit class attributes for support_audio_output and audio_sample_rate is a nice cleanup that aligns with the AudioXPipeline pattern.

@hsliuustc0106
Collaborator

Hi @linyueqian, friendly reminder — this PR hasn't had any activity (commits or reviews) in the past 9 days. 🕐

Could you please provide an update?

  • If you're still working on it, that's great — just let us know.
  • If you're blocked on something, feel free to ask for help.
  • If this PR is no longer being pursued, please consider closing it so we can keep the review queue manageable.

Thanks for your contribution! 🙏
