feat: Add support for Qwen2-Audio (#2324)
Conversation
|
@yaoyu-33 Hi, would you mind reviewing this PR? Many thanks. What minimal CI/CD tests would you recommend adding? Currently, Megatron's inference results are correct. Is it sufficient to write an e2e generation test similar to tests/functional_tests/models/qwen_vl/test_qwen3_vl_generation.py? |
📝 Walkthrough
Adds complete Qwen2-Audio model integration into Megatron-Bridge: a bridge implementation for HuggingFace-to-Megatron conversion, a MegatronModule wrapper for audio-language generation, provider classes, an example inference script supporting distributed generation with optional audio inputs, and documentation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Script as hf_to_megatron_generate_audio_lm
    participant Tokenizer as Tokenizer/Processor
    participant Model as Qwen2AudioModel
    participant Dist as Distributed(all-gather)
    participant Decoder as Token Decoder
    User->>Script: Provide prompt, audio path, model config
    Script->>Script: Load/convert model (HF or Megatron)
    Script->>Script: Initialize distributed parallelism
    loop Generation Loop (until stop token or max_new_tokens)
        Script->>Tokenizer: Process input (audio + text)
        Tokenizer->>Model: Forward pass with tokens, positions, audio features
        Model->>Model: Encode audio, project to embedding space, integrate with text
        Model->>Dist: Return logits from language model forward
        Dist->>Dist: All-gather across data-parallel stages
        Script->>Script: Select token (greedy)
        Script->>Script: Broadcast token, append to sequence
    end
    Script->>Decoder: Decode generated token sequence
    Decoder->>User: Print prompt, audio path, generated text
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@examples/conversion/hf_to_megatron_generate_audio_lm.py`:
- Line 35: The docstring/README command contains a filename typo: it references
"hf_to_megatron_generate_alm.py" but the actual script is
"hf_to_megatron_generate_audio_lm.py"; update the command string in the example
(the line that currently reads uv run python
examples/conversion/hf_to_megatron_generate_alm.py) to reference the correct
filename hf_to_megatron_generate_audio_lm.py so the example matches the real
script name.
- Line 440: The CLI flag registration for "trust_remote_code" currently uses
action="store_true" which makes its default False and prevents
is_safe_repo(trust_remote_code, ...) from falling back to the SAFE_REPOS list;
update the parser.add_argument call for the "--trust_remote_code" option to pass
default=None so args.trust_remote_code is None when the flag is omitted and
is_safe_repo(...) can consult the safe repository list.
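The tri-state flag described above can be sketched with standalone argparse (an illustration only, not the script's actual parser wiring):

```python
import argparse

parser = argparse.ArgumentParser()
# Keeping action="store_true" but overriding its implicit False default with
# None yields a tri-state: None (flag omitted, so is_safe_repo can fall back
# to the SAFE_REPOS allow-list) vs. True (explicit opt-in by the user).
parser.add_argument("--trust_remote_code", action="store_true", default=None)

print(parser.parse_args([]).trust_remote_code)  # None
print(parser.parse_args(["--trust_remote_code"]).trust_remote_code)  # True
```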
In `@examples/models/audio_lm/qwen2_audio/README.md`:
- Line 31: The README command incorrectly includes a duplicate "python"; update
the invocation that currently reads 'uv run python -m torch.distributed.run
python examples/conversion/hf_to_megatron_generate_audio_lm.py \\' to remove the
extra 'python' so torch.distributed.run is given the script path directly (i.e.,
pass 'examples/conversion/hf_to_megatron_generate_audio_lm.py' to '-m
torch.distributed.run'); edit the README.md line to reflect this corrected
command.
In `@src/megatron/bridge/models/qwen_audio/modeling_qwen2_audio.py`:
- Line 199: The code accesses input_ids.shape when input_features is not None
which will raise AttributeError if input_ids is None; update the guard around
the audio-processing block in modeling_qwen2_audio.py to explicitly check
input_ids is not None before accessing .shape (e.g., change the condition to
check input_ids is not None and input_ids.shape[1] != 1 or raise a clear
ValueError when input_features is provided but input_ids is missing), and
propagate this guard to other places that use input_ids (the audio block
references at the same function where lines ~240, ~255, ~263 access input_ids)
so either handle the inputs_embeds-only path correctly or require/document that
input_ids must be supplied.
🧹 Nitpick comments (12)
src/megatron/bridge/models/qwen_audio/modeling_qwen2_audio.py (5)
1-1: Copyright year should be 2026. The file is being created in February 2026, but the copyright header says 2025.

```diff
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
27-27: Use `T | None` instead of `Optional[T]`. Per coding guidelines, prefer `T | None` over `Optional[T]`.

```diff
-from typing import TYPE_CHECKING, Optional
+from typing import TYPE_CHECKING
```

Then update all usages (lines 108, 161–170) from `Optional[X]` to `X | None`. As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'".
144-148: Monkey-patching `_merge_input_ids_with_audio_features` is fragile. Binding a private method from `Qwen2AudioForConditionalGeneration` onto this instance couples the code tightly to HuggingFace's internal implementation. If the HF method's signature, expected attributes on `self`, or behavior changes across `transformers` versions, this will break at runtime with a hard-to-diagnose error. Consider either:
- Adding a version pin/check against the expected `transformers` version.
- Copying and adapting the logic directly so it's self-contained and testable.
154-156: Missing type hint for the `input_tensor` parameter.

```diff
-    def set_input_tensor(self, input_tensor) -> None:
+    def set_input_tensor(self, input_tensor: torch.Tensor) -> None:
```

As per coding guidelines: "Use type hints for function arguments and return types".
248-261: Prefix unused unpacked variables with `_`. `num_audios` and `embed_dim` are unpacked but never used (flagged by ruff RUF059). Proposed fix:

```diff
-    num_audios, max_audio_tokens, embed_dim = audio_features.shape
+    _num_audios, max_audio_tokens, _embed_dim = audio_features.shape
```

examples/models/audio_lm/qwen2_audio/README.md (1)
41-41: Add a language specifier to the fenced code block. Markdownlint (MD040) flags this. Use ```` ```text ```` for plain output.

````diff
-```
+```text
````

src/megatron/bridge/models/qwen_audio/qwen2_audio_provider.py (1)
29-30: Use `Any | None` instead of `Optional[Any]`. Same guideline as noted in the modeling file: `Optional` should be replaced with the `X | None` syntax.

```diff
-from typing import TYPE_CHECKING, Any, Optional
+from typing import TYPE_CHECKING, Any
```

Then on line 69:

```diff
-    hf_config: Optional[Any] = None
+    hf_config: Any | None = None
```

src/megatron/bridge/models/qwen_audio/qwen2_audio_bridge.py (1)
src/megatron/bridge/models/qwen_audio/qwen2_audio_bridge.py (1)
198-204: Silentexcept Exception: passhides registration failures.A bare
except Exception: passmakes it impossible to diagnose why bridge auto-discovery fails. At minimum, log the exception. Ruff also flags this (S110, BLE001).Proposed fix
+ import logging + + logger = logging.getLogger(__name__) + try: Qwen2AudioBridge = MegatronModelBridge.register_bridge( source=Qwen2AudioForConditionalGeneration, target=Qwen2AudioModel )(Qwen2AudioBridge) - except Exception: - # If registration fails, the bridge will still work manually - pass + except Exception: + logger.debug("Qwen2AudioBridge auto-registration failed; manual usage still available.", exc_info=True)examples/conversion/hf_to_megatron_generate_audio_lm.py (4)
160-163: SSRF risk: validate the URL scheme before opening. `urlopen` will follow any scheme, including `file://`, allowing local file reads from attacker-controlled input. Since this is an example script the risk is low, but it's good practice to restrict to `http`/`https`. In this case, the existing `startswith(("http://", "https://"))` check on line 162 already guards the `urlopen` call, so ruff's S310 warning is a false positive; adding a `# noqa: S310` comment would suppress it.
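For reference, the scheme validation discussed above can be sketched as a small helper (the `check_audio_url` name and return convention are assumptions for illustration, not the script's actual code):

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}

def check_audio_url(audio_path: str) -> bool:
    """Return True if audio_path is a remote URL we allow fetching,
    False if it should be treated as a local file path."""
    scheme = urlparse(audio_path).scheme
    if scheme in ALLOWED_SCHEMES:
        return True
    if scheme:  # file://, ftp://, etc. are rejected outright
        raise ValueError(f"Unsupported URL scheme in: {audio_path}")
    return False  # no scheme: plain local path

print(check_audio_url("https://example.com/a.wav"))  # True
print(check_audio_url("sample.wav"))  # False
```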
313-313: Unused variable `messages`. The unpacked `messages` variable is never used. Prefix it with `_` to signal intent.

```diff
-    input_ids, input_features, feature_attention_mask, messages = process_audio_inputs(processor, audio_path, prompt)
+    input_ids, input_features, feature_attention_mask, _messages = process_audio_inputs(processor, audio_path, prompt)
```
332-352: The audio encoder runs on every generation step, a significant inefficiency. The full audio features (`input_features`) are passed and re-encoded through the audio tower on every generation step because `input_ids` keeps growing (so `input_ids.shape[1] != 1` in the model's forward, line 199 of `modeling_qwen2_audio.py`). For a typical generation of 50 tokens, the audio encoder runs 50 times. Consider encoding the audio features once before the loop, then passing the merged `inputs_embeds` directly (or passing `None` for `input_features` on subsequent steps). Alternatively, a KV-cache approach would avoid the full recomputation entirely. This is acceptable for an example script, but worth a TODO comment.
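The encode-once pattern suggested above can be sketched like this (every callable here is a hypothetical stand-in for the real audio tower and language-model step, not the script's API):

```python
def generate_with_cached_audio(encode_audio, lm_step, input_features,
                               tokens, max_new_tokens, stop_id):
    """Run the audio tower once, then reuse its output on every decode step
    instead of re-encoding the audio features inside the loop."""
    audio_embeds = encode_audio(input_features)  # single audio-tower forward
    for _ in range(max_new_tokens):
        next_id = lm_step(tokens, audio_embeds)  # LM consumes cached embeds
        tokens.append(next_id)
        if next_id == stop_id:
            break
    return tokens

# Toy stand-ins: the "LM" just emits the current sequence length.
out = generate_with_cached_audio(lambda f: f, lambda t, a: len(t),
                                 input_features=[0.0], tokens=[0],
                                 max_new_tokens=5, stop_id=3)
print(out)  # [0, 1, 2, 3]
```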
325-326: TODO: attention mask is always `None`. The `# TODO: add attention mask` comment indicates known unfinished work. With `attention_mask=None`, the Megatron language model relies on its internal causal masking, which is correct for standard autoregressive generation. However, for sequences with padding (e.g., batched inference), this would silently produce incorrect results. Would you like me to open an issue to track this TODO?
|
@yuekaizhang: please rebase on top of #2250. Thanks for your contributions! |
Force-pushed from e9916f0 to 211ad55
Done. Also added a generation test file. |
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Signed-off-by: root <zhangyuekai@foxmail.com>
Force-pushed from 211ad55 to b2f1482
|
/ok to test a5ae02c |
|
/ok to test 6111d52 |
|
/ok to test 7fae585 |
|
/ok to test 30b82a5 |
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
|
@ko3n1g Hi, would you mind enabling CI/CD again? Thanks. |
|
/ok to test e1041d5 |
|
/ok to test 327673f |
|
/ok to test f931d3c |
|
@yuekaizhang Can you please resolve the merge conflict? |
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Done. |
|
@yaoyu-33 @gautham-kollu Would you mind enabling the CI/CD tests? Thanks. |
|
/ok to test e7a053d |
|
/ok to test b83e46e |
|
/ok to test f87cdab |
This PR adds Qwen2-Audio support to Megatron-Bridge.
Summary by CodeRabbit
New Features
- Qwen2-Audio integration into Megatron-Bridge: a HuggingFace-to-Megatron conversion bridge, a MegatronModule wrapper for audio-language generation, provider classes, and an example inference script supporting distributed generation with optional audio inputs.
Documentation
- README for the Qwen2-Audio example.