feat: Add support for Qwen2-Audio (#2324)
Conversation
|
@yaoyu-33 Hi, would you mind reviewing this PR? Many thanks. What minimal CI/CD tests would you recommend adding? Currently, Megatron's inference results are correct. Is it sufficient to write an e2e generation test similar to tests/functional_tests/models/qwen_vl/test_qwen3_vl_generation.py? |
📝 Walkthrough
Adds complete Qwen2-Audio model integration into Megatron-Bridge: a bridge implementation for HuggingFace-to-Megatron conversion, a MegatronModule wrapper for audio-language generation, provider classes, an example inference script supporting distributed generation with optional audio inputs, and documentation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Script as hf_to_megatron_generate_audio_lm
    participant Tokenizer as Tokenizer/Processor
    participant Model as Qwen2AudioModel
    participant Dist as Distributed(all-gather)
    participant Decoder as Token Decoder
    User->>Script: Provide prompt, audio path, model config
    Script->>Script: Load/convert model (HF or Megatron)
    Script->>Script: Initialize distributed parallelism
    loop Generation Loop (until stop token or max_new_tokens)
        Script->>Tokenizer: Process input (audio + text)
        Tokenizer->>Model: Forward pass with tokens, positions, audio features
        Model->>Model: Encode audio, project to embedding space, integrate with text
        Model->>Dist: Return logits from language model forward
        Dist->>Dist: All-gather across data-parallel stages
        Script->>Script: Select token (greedy)
        Script->>Script: Broadcast token, append to sequence
    end
    Script->>Decoder: Decode generated token sequence
    Decoder->>User: Print prompt, audio path, generated text
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@examples/conversion/hf_to_megatron_generate_audio_lm.py`:
- Line 35: The docstring/README command contains a filename typo: it references
"hf_to_megatron_generate_alm.py" but the actual script is
"hf_to_megatron_generate_audio_lm.py"; update the command string in the example
(the line that currently reads uv run python
examples/conversion/hf_to_megatron_generate_alm.py) to reference the correct
filename hf_to_megatron_generate_audio_lm.py so the example matches the real
script name.
- Line 440: The CLI flag registration for "trust_remote_code" currently uses
action="store_true" which makes its default False and prevents
is_safe_repo(trust_remote_code, ...) from falling back to the SAFE_REPOS list;
update the parser.add_argument call for the "--trust_remote_code" option to pass
default=None so args.trust_remote_code is None when the flag is omitted and
is_safe_repo(...) can consult the safe repository list.
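The tri-state flag described above can be sketched with standalone argparse (an illustration only, not the script's actual parser wiring):

```python
import argparse

parser = argparse.ArgumentParser()
# Keeping action="store_true" but overriding its implicit False default with
# None yields a tri-state: None (flag omitted, so is_safe_repo can fall back
# to the SAFE_REPOS allow-list) vs. True (explicit opt-in by the user).
parser.add_argument("--trust_remote_code", action="store_true", default=None)

print(parser.parse_args([]).trust_remote_code)  # None
print(parser.parse_args(["--trust_remote_code"]).trust_remote_code)  # True
```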
In `@examples/models/audio_lm/qwen2_audio/README.md`:
- Line 31: The README command incorrectly includes a duplicate "python"; update
the invocation that currently reads 'uv run python -m torch.distributed.run
python examples/conversion/hf_to_megatron_generate_audio_lm.py \\' to remove the
extra 'python' so torch.distributed.run is given the script path directly (i.e.,
pass 'examples/conversion/hf_to_megatron_generate_audio_lm.py' to '-m
torch.distributed.run'); edit the README.md line to reflect this corrected
command.
In `@src/megatron/bridge/models/qwen_audio/modeling_qwen2_audio.py`:
- Line 199: The code accesses input_ids.shape when input_features is not None
which will raise AttributeError if input_ids is None; update the guard around
the audio-processing block in modeling_qwen2_audio.py to explicitly check
input_ids is not None before accessing .shape (e.g., change the condition to
check input_ids is not None and input_ids.shape[1] != 1 or raise a clear
ValueError when input_features is provided but input_ids is missing), and
propagate this guard to other places that use input_ids (the audio block
references at the same function where lines ~240, ~255, ~263 access input_ids)
so either handle the inputs_embeds-only path correctly or require/document that
input_ids must be supplied.
🧹 Nitpick comments (12)
src/megatron/bridge/models/qwen_audio/modeling_qwen2_audio.py (5)
1-1: Copyright year should be 2026. The file is being created in February 2026, but the copyright header says 2025.

```diff
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
27-27: Use `T | None` instead of `Optional[T]`. Per coding guidelines, prefer `T | None` over `Optional[T]`.

```diff
-from typing import TYPE_CHECKING, Optional
+from typing import TYPE_CHECKING
```

Then update all usages (lines 108, 161–170) from `Optional[X]` to `X | None`. As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'".
144-148: Monkey-patching `_merge_input_ids_with_audio_features` is fragile. Binding a private method from `Qwen2AudioForConditionalGeneration` onto this instance couples the code tightly to HuggingFace's internal implementation. If the HF method's signature, expected attributes on `self`, or behavior changes across `transformers` versions, this will break at runtime with a hard-to-diagnose error. Consider either:
- Adding a version pin/check against the expected `transformers` version.
- Copying and adapting the logic directly so it's self-contained and testable.
154-156: Missing type hint for the `input_tensor` parameter.

```diff
-    def set_input_tensor(self, input_tensor) -> None:
+    def set_input_tensor(self, input_tensor: torch.Tensor) -> None:
```

As per coding guidelines: "Use type hints for function arguments and return types".
248-261: Prefix unused unpacked variables with `_`. `num_audios` and `embed_dim` are unpacked but never used (flagged by ruff RUF059). Proposed fix:

```diff
-    num_audios, max_audio_tokens, embed_dim = audio_features.shape
+    _num_audios, max_audio_tokens, _embed_dim = audio_features.shape
```

examples/models/audio_lm/qwen2_audio/README.md (1)
41-41: Add a language specifier to the fenced code block. Markdownlint (MD040) flags this. Use ```` ```text ```` for plain output.

````diff
-```
+```text
````

src/megatron/bridge/models/qwen_audio/qwen2_audio_provider.py (1)
29-30: Use `Any | None` instead of `Optional[Any]`. Same guideline as noted in the modeling file: `Optional` should be replaced with the `X | None` syntax.

```diff
-from typing import TYPE_CHECKING, Any, Optional
+from typing import TYPE_CHECKING, Any
```

Then on line 69:

```diff
-    hf_config: Optional[Any] = None
+    hf_config: Any | None = None
```

src/megatron/bridge/models/qwen_audio/qwen2_audio_bridge.py (1)
src/megatron/bridge/models/qwen_audio/qwen2_audio_bridge.py (1)
198-204: Silentexcept Exception: passhides registration failures.A bare
except Exception: passmakes it impossible to diagnose why bridge auto-discovery fails. At minimum, log the exception. Ruff also flags this (S110, BLE001).Proposed fix
+ import logging + + logger = logging.getLogger(__name__) + try: Qwen2AudioBridge = MegatronModelBridge.register_bridge( source=Qwen2AudioForConditionalGeneration, target=Qwen2AudioModel )(Qwen2AudioBridge) - except Exception: - # If registration fails, the bridge will still work manually - pass + except Exception: + logger.debug("Qwen2AudioBridge auto-registration failed; manual usage still available.", exc_info=True)examples/conversion/hf_to_megatron_generate_audio_lm.py (4)
160-163: SSRF risk: validate the URL scheme before opening. `urlopen` will follow any scheme, including `file://`, allowing local file reads from attacker-controlled input. Since this is an example script the risk is low, but it's good practice to restrict to `http`/`https`. In this case, the existing `startswith(("http://", "https://"))` check on line 162 already guards the `urlopen` call, so ruff's S310 warning is a false positive; adding a `# noqa: S310` comment would suppress it.
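For reference, the scheme validation discussed above can be sketched as a small helper (the `check_audio_url` name and return convention are assumptions for illustration, not the script's actual code):

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}

def check_audio_url(audio_path: str) -> bool:
    """Return True if audio_path is a remote URL we allow fetching,
    False if it should be treated as a local file path."""
    scheme = urlparse(audio_path).scheme
    if scheme in ALLOWED_SCHEMES:
        return True
    if scheme:  # file://, ftp://, etc. are rejected outright
        raise ValueError(f"Unsupported URL scheme in: {audio_path}")
    return False  # no scheme: plain local path

print(check_audio_url("https://example.com/a.wav"))  # True
print(check_audio_url("sample.wav"))  # False
```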
313-313: Unused variable `messages`. The unpacked `messages` variable is never used. Prefix it with `_` to signal intent.

```diff
-    input_ids, input_features, feature_attention_mask, messages = process_audio_inputs(processor, audio_path, prompt)
+    input_ids, input_features, feature_attention_mask, _messages = process_audio_inputs(processor, audio_path, prompt)
```
332-352: The audio encoder runs on every generation step, a significant inefficiency. The full audio features (`input_features`) are passed and re-encoded through the audio tower on every generation step because `input_ids` keeps growing (so `input_ids.shape[1] != 1` in the model's forward, line 199 of `modeling_qwen2_audio.py`). For a typical generation of 50 tokens, the audio encoder runs 50 times. Consider encoding the audio features once before the loop, then passing the merged `inputs_embeds` directly (or passing `None` for `input_features` on subsequent steps). Alternatively, a KV-cache approach would avoid the full recomputation entirely. This is acceptable for an example script, but worth a TODO comment.
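The encode-once pattern suggested above can be sketched like this (every callable here is a hypothetical stand-in for the real audio tower and language-model step, not the script's API):

```python
def generate_with_cached_audio(encode_audio, lm_step, input_features,
                               tokens, max_new_tokens, stop_id):
    """Run the audio tower once, then reuse its output on every decode step
    instead of re-encoding the audio features inside the loop."""
    audio_embeds = encode_audio(input_features)  # single audio-tower forward
    for _ in range(max_new_tokens):
        next_id = lm_step(tokens, audio_embeds)  # LM consumes cached embeds
        tokens.append(next_id)
        if next_id == stop_id:
            break
    return tokens

# Toy stand-ins: the "LM" just emits the current sequence length.
out = generate_with_cached_audio(lambda f: f, lambda t, a: len(t),
                                 input_features=[0.0], tokens=[0],
                                 max_new_tokens=5, stop_id=3)
print(out)  # [0, 1, 2, 3]
```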
325-326: TODO: attention mask is always `None`. The `# TODO: add attention mask` comment indicates known unfinished work. With `attention_mask=None`, the Megatron language model relies on its internal causal masking, which is correct for standard autoregressive generation. However, for sequences with padding (e.g., batched inference), this would silently produce incorrect results. Would you like me to open an issue to track this TODO?
|
@yuekaizhang: please rebase on top of #2250. Thanks for your contributions! |
Force-pushed from e9916f0 to 211ad55
Done. Also added a generation test file. |
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Signed-off-by: root <zhangyuekai@foxmail.com>
Force-pushed from 211ad55 to b2f1482
|
/ok to test a5ae02c |
|
/ok to test 6111d52 |
|
/ok to test 7fae585 |
|
/ok to test 30b82a5 |
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
|
@ko3n1g Hi, would you mind enabling CI/CD again? Thanks. |
|
/ok to test e1041d5 |
|
/ok to test 327673f |
|
/ok to test f931d3c |
|
@yuekaizhang Can you please resolve the merge conflict? |
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Done. |
|
@yaoyu-33 @gautham-kollu Would you mind enabling the CI/CD tests? Thanks. |
|
/ok to test e7a053d |
|
/ok to test b83e46e |
|
/ok to test f87cdab |
This PR adds Qwen2-Audio support to Megatron-Bridge.
Summary by CodeRabbit
New Features
- Qwen2-Audio integration into Megatron-Bridge: a HuggingFace-to-Megatron conversion bridge, a MegatronModule wrapper for audio-language generation, provider classes, and an example inference script supporting distributed generation with optional audio inputs.
Documentation
- README for the Qwen2-Audio example.