
feat: Add support for Qwen2-Audio#2324

Merged
yaoyu-33 merged 20 commits into NVIDIA-NeMo:main from yuekaizhang:audio_new
Mar 24, 2026

Conversation

@yuekaizhang
Contributor

@yuekaizhang yuekaizhang commented Feb 11, 2026

This PR adds support for Qwen2-Audio in Megatron-Bridge.

Summary by CodeRabbit

  • New Features

    • Added support for Qwen2-Audio models in the Megatron-Bridge framework.
    • Added audio-language generation example script demonstrating inference from both HuggingFace and Megatron checkpoints.
  • Documentation

    • Added README with end-to-end usage instructions, example commands, and expected outputs for Qwen2-Audio inference workflows.

@copy-pr-bot

copy-pr-bot bot commented Feb 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yuekaizhang
Contributor Author

@yaoyu-33 Hi, would you mind helping review this PR? Many thanks.

Which minimal CI/CD tests would you recommend adding? The Megatron inference results are currently correct; would an e2e generation test similar to tests/functional_tests/models/qwen_vl/test_qwen3_vl_generation.py be sufficient?

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough

Adds complete Qwen2-Audio model integration into Megatron-Bridge with bridge implementation for HuggingFace-to-Megatron conversion, a MegatronModule wrapper for audio-language generation, provider classes, an example inference script supporting distributed generation with optional audio inputs, and documentation.

Changes

Cohort / File(s) / Summary

  • Bridge Core Implementation
    Files: src/megatron/bridge/models/qwen_audio/qwen2_audio_bridge.py, src/megatron/bridge/models/qwen_audio/qwen2_audio_provider.py, src/megatron/bridge/models/qwen_audio/modeling_qwen2_audio.py
    Summary: Bridge class for HF-to-Megatron weight mapping; provider classes (base and 7B variant) with audio-specific configuration and freezing options; and a MegatronModule wrapper integrating the audio encoder, multimodal projector, and language model, with a forward pass handling audio features, tokens, and attention masks.
  • Bridge Integration & Exports
    Files: src/megatron/bridge/models/qwen_audio/__init__.py, src/megatron/bridge/models/__init__.py
    Summary: New module aggregator exposing Qwen2AudioBridge, Qwen2AudioModel, and the provider classes; the parent package is updated to export these entities in its public API.
  • Example & Documentation
    Files: examples/conversion/hf_to_megatron_generate_audio_lm.py, examples/models/audio_lm/qwen2_audio/README.md
    Summary: Example script demonstrating inference with librosa-based audio loading, single-batch iteration, distributed forward passes, all-gather, greedy token selection, and optional audio feature processing; README documenting end-to-end usage workflows.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Script as hf_to_megatron_generate_audio_lm
    participant Tokenizer as Tokenizer/Processor
    participant Model as Qwen2AudioModel
    participant Dist as Distributed(all-gather)
    participant Decoder as Token Decoder

    User->>Script: Provide prompt, audio path, model config
    Script->>Script: Load/convert model (HF or Megatron)
    Script->>Script: Initialize distributed parallelism

    loop Generation Loop (until stop token or max_new_tokens)
        Script->>Tokenizer: Process input (audio + text)
        Tokenizer->>Model: Forward pass with tokens, positions, audio features
        Model->>Model: Encode audio, project to embedding space, integrate with text
        Model->>Dist: Return logits from language model forward
        Dist->>Dist: All-gather across data-parallel stages
        Script->>Script: Select token (greedy)
        Script->>Script: Broadcast token, append to sequence
    end

    Script->>Decoder: Decode generated token sequence
    Decoder->>User: Print prompt, audio path, generated text
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • cuichenx
  • ko3n1g
  • meatybobby
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 68.75%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Test Results For Major Changes ⚠️ Warning: The PR adds 1,240+ lines of code for Qwen2-Audio support but, unlike similar Qwen models, lacks dedicated unit/functional tests and PR testing documentation. Resolution: add unit tests in tests/unit_tests/models/qwen_audio/ and functional tests in tests/functional_tests/models/qwen_audio/ following existing Qwen test patterns, and document test results in the PR description.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title 'feat: Add support for Qwen2-Audio' directly and clearly describes the main change: introducing Qwen2-Audio support to Megatron-Bridge. The changeset includes new bridge classes, model providers, audio processing, and example scripts, all centered on adding this audio-language model support.




Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@examples/conversion/hf_to_megatron_generate_audio_lm.py`:
- Line 35: The docstring command contains a filename typo: it references "hf_to_megatron_generate_alm.py", but the actual script is "hf_to_megatron_generate_audio_lm.py". Update the example command (currently `uv run python examples/conversion/hf_to_megatron_generate_alm.py`) to use the correct filename.
- Line 440: The CLI flag "--trust_remote_code" is registered with action="store_true", which makes its default False and prevents is_safe_repo(trust_remote_code, ...) from ever falling back to the SAFE_REPOS list. Pass default=None to parser.add_argument so that args.trust_remote_code is None when the flag is omitted and is_safe_repo(...) can consult the safe-repository list.
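
A minimal sketch of the suggested flag behavior (is_safe_repo and SAFE_REPOS are names from the script under review; the parser setup here is illustrative):

```python
import argparse

# With default=None, an omitted flag is distinguishable from an explicit
# opt-in, so downstream code can fall back to a safe-repository allowlist
# when the value is None.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--trust_remote_code",
    action="store_true",
    default=None,
    help="Trust remote code when loading the HF checkpoint "
    "(omit to consult the safe-repository allowlist).",
)

print(parser.parse_args([]).trust_remote_code)                       # None when omitted
print(parser.parse_args(["--trust_remote_code"]).trust_remote_code)  # True when passed
```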

In `@examples/models/audio_lm/qwen2_audio/README.md`:
- Line 31: The README command includes a duplicate "python". The invocation currently reads `uv run python -m torch.distributed.run python examples/conversion/hf_to_megatron_generate_audio_lm.py \`; remove the extra "python" so the script path is passed directly to `-m torch.distributed.run`.

In `@src/megatron/bridge/models/qwen_audio/modeling_qwen2_audio.py`:
- Line 199: The code accesses input_ids.shape when input_features is not None, which raises an AttributeError if input_ids is None. Extend the guard around the audio-processing block to check input_ids is not None before accessing .shape (e.g., require input_ids is not None and input_ids.shape[1] != 1, or raise a clear ValueError when input_features is provided without input_ids), and apply the same guard to the other uses of input_ids in this function (around lines ~240, ~255, ~263), so the inputs_embeds-only path is either handled correctly or documented as requiring input_ids.
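
The suggested guard can be sketched as a small helper; the parameter names mirror the HF Qwen2-Audio forward signature, and the shapes are illustrative:

```python
import torch

def audio_prefill_guard(input_ids, input_features):
    """Return True when the audio-merge path should run.

    Raises a clear ValueError, instead of a late AttributeError, when
    audio features arrive without token ids. This is a sketch of the
    suggested fix, not the PR's actual code.
    """
    if input_features is None:
        return False
    if input_ids is None:
        raise ValueError(
            "input_features was provided but input_ids is None; the audio "
            "merge path needs token ids (or pass pre-merged inputs_embeds)."
        )
    # Only merge on the prefill step; single-token decode steps skip it.
    return input_ids.shape[1] != 1

ids = torch.tensor([[101, 102, 103]])          # prefill: seq_len > 1
feats = torch.zeros(1, 128, 3000)              # [batch, n_mels, frames]
print(audio_prefill_guard(ids, feats))         # True

decode_ids = torch.tensor([[104]])             # decode step: seq_len == 1
print(audio_prefill_guard(decode_ids, feats))  # False
```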
🧹 Nitpick comments (12)
src/megatron/bridge/models/qwen_audio/modeling_qwen2_audio.py (5)

1-1: Copyright year should be 2026.

The file is being created in February 2026, but the copyright header says 2025.

-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.

27-27: Use T | None instead of Optional[T].

Per coding guidelines, prefer T | None over Optional[T].

-from typing import TYPE_CHECKING, Optional
+from typing import TYPE_CHECKING

Then update all usages (lines 108, 161–170) from Optional[X] to X | None.

As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'"


144-148: Monkey-patching _merge_input_ids_with_audio_features is fragile.

Binding a private method from Qwen2AudioForConditionalGeneration onto this instance couples the code tightly to HuggingFace's internal implementation. If the HF method's signature, expected attributes on self, or behavior changes across transformers versions, this will break at runtime with a hard-to-diagnose error.

Consider either:

  1. Adding a version pin/check against the expected transformers version.
  2. Copying and adapting the logic directly so it's self-contained and testable.
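
Option 1 can be as simple as comparing numeric version components before binding the method. The version range below is illustrative, not a bound tested in this PR:

```python
# Guard for the monkey-patched private helper: verify the installed
# transformers version falls in a range the patch was verified against.

def _parse(version: str) -> tuple[int, ...]:
    """Extract the leading numeric major.minor.patch components."""
    parts = []
    for piece in version.split(".")[:3]:
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def in_tested_range(installed: str, lo: str = "4.46.0", hi: str = "5.0.0") -> bool:
    """Return True if lo <= installed < hi, comparing numeric components."""
    return _parse(lo) <= _parse(installed) < _parse(hi)

print(in_tested_range("4.48.2"))  # True
print(in_tested_range("5.1.0"))   # False
```

In modeling_qwen2_audio.py this could run once at import time against transformers.__version__, warning or raising when outside the range so a signature drift in _merge_input_ids_with_audio_features surfaces early rather than as a hard-to-diagnose runtime error.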

154-156: Missing type hint for input_tensor parameter.

-    def set_input_tensor(self, input_tensor) -> None:
+    def set_input_tensor(self, input_tensor: torch.Tensor) -> None:

As per coding guidelines: "Use type hints for function arguments and return types"


248-261: Prefix unused unpacked variables with _.

num_audios and embed_dim are unpacked but never used (flagged by ruff RUF059).

Proposed fix
-                    num_audios, max_audio_tokens, embed_dim = audio_features.shape
+                    _num_audios, max_audio_tokens, _embed_dim = audio_features.shape
examples/models/audio_lm/qwen2_audio/README.md (1)

41-41: Add a language specifier to the fenced code block.

Markdownlint (MD040) flags this. Use ```text for plain output.

-```
+```text
src/megatron/bridge/models/qwen_audio/qwen2_audio_provider.py (1)

29-30: Use Any | None instead of Optional[Any].

Same guideline as noted in the modeling file. Optional should be replaced with the X | None syntax.

-from typing import TYPE_CHECKING, Any, Optional
+from typing import TYPE_CHECKING, Any

Then on line 69:

-    hf_config: Optional[Any] = None
+    hf_config: Any | None = None

As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'"

src/megatron/bridge/models/qwen_audio/qwen2_audio_bridge.py (1)

198-204: Silent except Exception: pass hides registration failures.

A bare except Exception: pass makes it impossible to diagnose why bridge auto-discovery fails. At minimum, log the exception. Ruff also flags this (S110, BLE001).

Proposed fix
+    import logging
+
+    logger = logging.getLogger(__name__)
+
     try:
         Qwen2AudioBridge = MegatronModelBridge.register_bridge(
             source=Qwen2AudioForConditionalGeneration, target=Qwen2AudioModel
         )(Qwen2AudioBridge)
-    except Exception:
-        # If registration fails, the bridge will still work manually
-        pass
+    except Exception:
+        logger.debug("Qwen2AudioBridge auto-registration failed; manual usage still available.", exc_info=True)
examples/conversion/hf_to_megatron_generate_audio_lm.py (4)

160-163: Ruff S310 warning on urlopen.

urlopen would follow any scheme, including file://, but the startswith check on line 162 already restricts this call to http/https, so ruff's S310 flag is a false positive here. Suppress it with a noqa comment rather than adding a redundant second check:

Proposed fix
-        audio_data, _ = librosa.load(BytesIO(urlopen(audio_path).read()), sr=sampling_rate)
+        audio_data, _ = librosa.load(BytesIO(urlopen(audio_path).read()), sr=sampling_rate)  # noqa: S310


313-313: Unused variable messages.

The unpacked messages variable is never used. Prefix with _ to signal intent.

-    input_ids, input_features, feature_attention_mask, messages = process_audio_inputs(processor, audio_path, prompt)
+    input_ids, input_features, feature_attention_mask, _messages = process_audio_inputs(processor, audio_path, prompt)

332-352: Audio encoder runs on every generation step — significant inefficiency.

The full audio features (input_features) are passed and re-encoded through the audio tower on every generation step because input_ids keeps growing (so input_ids.shape[1] != 1 in the model's forward, line 199 of modeling_qwen2_audio.py). For a typical generation of 50 tokens, the audio encoder runs 50 times.

Consider encoding audio features once before the loop, then passing the merged inputs_embeds directly (or using None for input_features on subsequent steps). Alternatively, a KV-cache approach would avoid the full recomputation entirely.

This is acceptable for an example script, but worth a TODO comment.
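
One way to realize "encode once" without restructuring the loop is to cache the audio tower's output across decode steps. A sketch with stand-in modules and shapes, not the PR's code:

```python
import torch

class CachedAudioEncoder:
    """Caches the audio tower output so a generation loop that invokes the
    encoder every step pays for only one real forward pass. Assumes the
    audio input is fixed for the duration of one request."""

    def __init__(self, encoder):
        self.encoder = encoder
        self._cache = None
        self.calls = 0  # counts real encoder invocations, for demonstration

    def __call__(self, input_features: torch.Tensor) -> torch.Tensor:
        if self._cache is None:
            self.calls += 1
            self._cache = self.encoder(input_features)
        return self._cache

# Stand-in audio tower: [batch, n_mels, frames] -> [batch, audio_tokens, hidden]
encoder = CachedAudioEncoder(lambda f: torch.zeros(f.shape[0], 10, 8))

feats = torch.zeros(1, 128, 3000)
for _step in range(50):           # fifty decode steps of a generation loop...
    audio_embeds = encoder(feats)

print(encoder.calls)              # ...but the audio tower ran only once: 1
```

A KV cache would go further by also skipping recomputation over the text prefix, but even this wrapper removes the dominant per-step cost for long audio clips.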


325-326: TODO: attention mask is always None.

The # TODO: add attention mask comment indicates this is known unfinished work. With attention_mask=None, the Megatron language model relies on its internal causal masking, which should be correct for standard autoregressive generation. However, for sequences with padding (e.g., batched inference), this would silently produce incorrect results.

Would you like me to open an issue to track this TODO?

@yaoyu-33
Copy link
Copy Markdown
Contributor

yaoyu-33 commented Feb 11, 2026

@yuekaizhang: please rebase on top of #2250; the provider bridge part needs some refactoring.

thanks for your contributions!

@yuekaizhang
Contributor Author

@yuekaizhang: please rebase on top of #2250; the provider bridge part needs some refactoring.

thanks for your contributions!

Done. I also added a generation test file.

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Signed-off-by: root <zhangyuekai@foxmail.com>
@yuekaizhang
Contributor Author

@yaoyu-33 Since #2250 has been merged, I just rebased the current code to the latest main branch. Would you mind helping check the PR? Many thanks.

@yaoyu-33
Contributor

/ok to test a5ae02c

@yaoyu-33
Contributor

yaoyu-33 commented Mar 6, 2026

/ok to test 6111d52

@yaoyu-33
Contributor

yaoyu-33 commented Mar 8, 2026

/ok to test 7fae585

3 similar comments
@yaoyu-33
Contributor

yaoyu-33 commented Mar 8, 2026

/ok to test 7fae585

@yaoyu-33
Contributor

yaoyu-33 commented Mar 9, 2026

/ok to test 7fae585

@yaoyu-33
Contributor

yaoyu-33 commented Mar 9, 2026

/ok to test 7fae585

yaoyu-33 previously approved these changes Mar 10, 2026
@ko3n1g
Contributor

ko3n1g commented Mar 10, 2026

/ok to test 30b82a5

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
@yuekaizhang
Contributor Author

@ko3n1g Hi, would you mind helping enable CI/CD again? Thanks.

@yaoyu-33
Contributor

/ok to test e1041d5

yaoyu-33 previously approved these changes Mar 12, 2026
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
@yaoyu-33
Contributor

/ok to test 327673f

@yaoyu-33
Contributor

/ok to test f931d3c

@gautham-kollu
Contributor

@yuekaizhang Can you please resolve the merge conflict?

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
@yuekaizhang
Contributor Author

@yuekaizhang Can you please resolve the merge conflict?

Done.

@yuekaizhang
Contributor Author

@yaoyu-33 @gautham-kollu Would you mind helping enable CI/CD tests? Thanks.

@yaoyu-33
Contributor

/ok to test e7a053d

@yaoyu-33
Contributor

/ok to test b83e46e

@cuichenx
Contributor

/ok to test f87cdab


Labels

community-request, ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge)


7 participants