
Request to vLLM with audio message #1042

Closed
karpnv wants to merge 40 commits into main from audio_bin

Conversation

@karpnv
Collaborator

@karpnv karpnv commented Nov 13, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Added audio transcription and evaluation support with multicriteria scoring for correctness, relevance, completeness, and clarity.
    • Audio content encoding for model inference requests.
  • Documentation

    • Expanded audio evaluation guides with server setup examples and evaluation workflows.
    • Updated result formats and benchmark descriptions.
  • Tests

    • Added audio generation and evaluation test coverage.


@coderabbitai
Contributor

coderabbitai bot commented Dec 3, 2025

📝 Walkthrough


This PR introduces comprehensive audio/speech evaluation support to the NemoSkills framework. Changes encompass dataset preparation updates for MMAU-Pro with revised message formatting and absolute audio path handling, vLLM model extensions for base64 audio encoding and content preprocessing, inference layer updates to strip binary data during postprocessing, multi-criteria scoring mechanisms in metrics calculation, new judge prompt configuration, and accompanying integration and unit tests.

Changes

Cohort / File(s) Summary
Documentation
docs/evaluation/speech-audio.md
Comprehensive update with new overview sections for audio understanding categories, supported server types table, expanded MMAU-Pro benchmark descriptions with subcategories, revised preparation and evaluation workflow guidance with multi-profile examples (vLLM and Megatron), updated result tables reflecting new evaluation outputs, and reorganized "Understanding Results" sections.
Dataset Configuration & Preparation
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
nemo_skills/dataset/mmau-pro/open_ended/__init__.py
Added EVAL_ARGS constant in closed_form module; updated JUDGE_ARGS in open_ended module to switch prompt_config from judge/speechlm to judge/mmau-pro.
Dataset Preparation Logic
nemo_skills/dataset/mmau-pro/prepare.py
Updated MCQ formatting with option text (A) B) ...) and completion prompt; changed audio path handling from per-path with durations to absolute /dataset/mmau-pro/ paths using "audio" (single) or "audios" (list) keys; introduced system message "You are a helpful assistant." and appended " /no_think" for non-open categories to disable reasoning; restructured final messages as two-element list (system + user).
Inference Model Audio Handling
nemo_skills/inference/model/vllm.py
Added audio_file_to_base64() utility for encoding audio to base64 data URLs; extended VLLMModel.init with data_dir parameter; introduced content_text_to_list() method to normalize mixed text/audio message content; applied content preprocessing in _build_chat_request_params.
Inference Output Processing
nemo_skills/inference/generate.py
Added drop_binary_data() method in GenerationTask to strip binary content (audio_url type) from message outputs during postprocessing; initialized data_dir attribute from eval_config.
Evaluation Metrics
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
Introduced extract_multicriteria_scores() function to parse 1-5 scale scores (correctness, relevance, completeness, clarity, overall) from judgement text; added multicriteria_scores attribute to MMAUProMetrics; augmented get_metrics with per-criterion averages/std as percentages, good/poor response rates; exposed multicriteria metrics in metrics_to_print output.
Judge Prompt Configuration
nemo_skills/prompt/config/judge/mmau-pro.yaml (added)
nemo_skills/prompt/config/judge/speechlm.yaml (deleted)
Added new mmau-pro.yaml judge prompt for multi-criteria evaluation across correctness, relevance, completeness, and clarity on 1-5 scale with structured output format; removed legacy speechlm.yaml binary judgment configuration.
Test Infrastructure & Configuration
tests/gpu-tests/run_qwen.sh
tests/gpu-tests/test-local.yaml
Added audio-generation test execution for Qwen/Qwen2.5-Omni-3B model; added vllm-audio container definition (nvcr.io/nvidian/ac-aiapps/vllm-openai-audio:v1.0.0).
Integration Tests
tests/gpu-tests/test_vllm_audio.py
Added test_vllm_audio_generation() integration test validating end-to-end audio transcription flow via vLLM server with JSONL input/output validation.
Unit Tests
tests/test_vllm_audio.py
Added comprehensive unit tests for audio_file_to_base64 encoding/decoding, VLLMModel.content_text_to_list message preprocessing (single and multiple audios), _build_chat_request_params with audio content structure, and async generation with mocked responses.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Dataset
    participant Inference as Inference<br/>(generate.py)
    participant VLLMModel as VLLMModel<br/>(vllm.py)
    participant vLLMServer as vLLM Server
    participant Judge as Judge Model
    participant Metrics as Metrics<br/>(mmau_pro_metrics.py)

    User->>Inference: Input with audio path
    Inference->>VLLMModel: Prepare message with audio
    VLLMModel->>VLLMModel: audio_file_to_base64()
    VLLMModel->>VLLMModel: content_text_to_list()<br/>(normalize to text + audio_url)
    VLLMModel->>vLLMServer: _build_chat_request_params<br/>(send base64 data URL)
    vLLMServer->>vLLMServer: Process audio + text
    vLLMServer-->>VLLMModel: Generation response
    VLLMModel-->>Inference: Model output
    Inference->>Inference: drop_binary_data()<br/>(strip audio_url)
    Inference-->>User: Cleaned output

    User->>Judge: Eval: generation vs reference
    Judge->>Judge: Score on 1-5 scale
    Judge-->>Metrics: Judgement text
    Metrics->>Metrics: extract_multicriteria_scores()
    Metrics->>Metrics: Calculate per-criterion<br/>avg/std/rates
    Metrics-->>User: Multi-criteria metrics
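The normalization step in the diagram can be sketched as follows. This is an illustrative reconstruction based on the walkthrough above, not the PR's exact code; the field names (`audio`, `audios`, `path`) and the hardcoded `audio/wav` MIME type follow the descriptions in this review.

```python
import base64
import os


def content_text_to_list(message: dict, data_dir: str = "") -> dict:
    """Normalize a message so string content becomes a typed list, then append
    any referenced audio files as base64 data URLs (sketch of the flow above)."""
    if "audio" not in message and "audios" not in message:
        return message
    if isinstance(message["content"], str):
        message["content"] = [{"type": "text", "text": message["content"]}]
    audios = [message["audio"]] if "audio" in message else message["audios"]
    for audio in audios:
        with open(os.path.join(data_dir, audio["path"]), "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        message["content"].append(
            {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{b64}"}}
        )
    return message
```

vLLM's OpenAI-compatible endpoint accepts such `audio_url` content parts in chat messages for audio-capable models, which is why the text is lifted into a typed list before the audio parts are appended.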

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

  • Audio encoding/decoding logic (vllm.py audio_file_to_base64, content_text_to_list, base64 data URL construction) — verify correctness of RIFF header detection and base64 encoding safety.
  • Message structure transformations (prepare.py audio path handling, message list restructuring with system/user roles) — ensure backward compatibility and correct format for downstream consumers.
  • Multi-criteria scoring extraction (mmau_pro_metrics.py regex parsing of judgement text for five criteria) — validate robustness of score extraction and fallback logic when fields are missing.
  • Binary data removal (generate.py drop_binary_data filtering) — confirm proper integration with postprocessing pipeline and memory implications.
  • Cross-layer integration — verify data flow consistency between dataset preparation, model inference, and metrics calculation through end-to-end test execution paths.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 66.67%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately describes the main change: adding support for sending audio messages to vLLM endpoints, the core focus across vllm.py, the test files, and the documentation updates.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/inference/generate.py (1)

383-398: data_dir is set but never passed to model factory functions.

self.data_dir is extracted from cfg.eval_config (lines 384-385) but not passed to any of the three model factory calls (get_code_execution_model, get_tool_calling_model, or get_model). Since VLLMModel expects data_dir as a constructor parameter for resolving relative audio paths, this needs to be added: include data_dir=self.data_dir in each factory call.

🧹 Nitpick comments (9)
nemo_skills/inference/model/vllm.py (3)

112-132: Method mutates input message dictionary.

content_text_to_list modifies the message dict in-place, which could cause unintended side effects if the same message object is reused elsewhere. Consider working on a copy.

     def content_text_to_list(self, message):
+        message = message.copy()  # Avoid mutating the original
         if "audio" in message or "audios" in message:
             content = message["content"]
             if isinstance(content, str):

29-33: Add error handling for file operations.

The function doesn't handle FileNotFoundError or IOError, which could lead to cryptic error messages when audio files are missing or inaccessible.

 def audio_file_to_base64(audio_file_path: str):
     """Encodes an audio file into a base64 string."""
-    with open(audio_file_path, "rb") as audio_file:
-        audio_content = audio_file.read()
-        return base64.b64encode(audio_content).decode("utf-8")
+    try:
+        with open(audio_file_path, "rb") as audio_file:
+            audio_content = audio_file.read()
+            return base64.b64encode(audio_content).decode("utf-8")
+    except FileNotFoundError:
+        raise FileNotFoundError(f"Audio file not found: {audio_file_path}")
+    except IOError as e:
+        raise IOError(f"Failed to read audio file {audio_file_path}: {e}")

125-125: Hardcoded MIME type may be incorrect for non-WAV audio files.

The MIME type is hardcoded as audio/wav, but audio files could be in other formats (e.g., MP3, FLAC, OGG). Consider detecting the format from the file extension or magic bytes.
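One way to address this nitpick, sketched with Python's standard `mimetypes` module; this is an assumption about a possible fix, not the PR's code.

```python
import base64
import mimetypes


def audio_file_to_data_url(audio_file_path: str) -> str:
    """Encode an audio file as a data URL, guessing the MIME type from the
    file extension and falling back to audio/wav when it cannot be guessed."""
    mime_type, _ = mimetypes.guess_type(audio_file_path)
    if mime_type is None or not mime_type.startswith("audio/"):
        mime_type = "audio/wav"  # conservative default
    with open(audio_file_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

Sniffing magic bytes (e.g. a `RIFF` header for WAV) would be more robust than extensions, but the extension-based guess covers the common formats with no extra dependencies.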

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)

123-123: Rename unused loop variable per static analysis.

The loop variable agg_mode is not used within the loop body.

-        for agg_mode, agg_metrics in metrics_dict.items():
+        for _agg_mode, agg_metrics in metrics_dict.items():
tests/gpu-tests/test_vllm_audio.py (1)

92-94: Missing output directory cleanup in finally block.

The finally block cleans up the temp input file but doesn't remove output_dir. While the test pre-cleans the directory at the start (lines 32-34), adding cleanup to the finally block ensures cleanup happens even if the test fails before reaching that point in subsequent runs.

     finally:
         # Cleanup temp file
         Path(input_file).unlink(missing_ok=True)
+        # Cleanup output directory
+        if Path(output_dir).exists():
+            shutil.rmtree(output_dir, ignore_errors=True)
tests/test_vllm_audio.py (1)

123-146: Consider testing actual audio preprocessing instead of mocking the entire method.

This test mocks generate_async entirely, which means it doesn't actually exercise the audio preprocessing logic (content_text_to_list, base64 encoding). The real value would be testing that audio messages are correctly transformed before the API call.

Consider either:

  1. Mocking at a lower level (e.g., the HTTP client) to test the actual transformation
  2. Removing this test since test_build_chat_request_with_audio already covers the preprocessing logic
nemo_skills/inference/generate.py (1)

383-385: Simplify the None check.

The isinstance(..., type(None)) check is unconventional. A simpler is not None would be clearer.

         self.data_dir = None
-        if "data_dir" in self.cfg.eval_config and not isinstance(self.cfg.eval_config.get("data_dir"), type(None)):
+        if self.cfg.eval_config.get("data_dir") is not None:
             self.data_dir = self.cfg.eval_config["data_dir"]
docs/evaluation/speech-audio.md (2)

72-92: Convert indented code blocks to fenced syntax for consistency and readability. Static analysis flags indicate these code blocks use indentation-based syntax rather than fenced blocks (triple backticks), which is the modern markdown convention and enables language-specific syntax highlighting.

Convert indented code blocks to fenced blocks with language specification:

-    ```python
-    import os
-    from nemo_skills.pipeline.cli import wrap_arguments, eval
+```python
+import os
+from nemo_skills.pipeline.cli import wrap_arguments, eval

-    eval(
-        ctx=wrap_arguments(""),
+eval(
+    ctx=wrap_arguments(""),
-    )
-    ```
+)
+```

Apply the same pattern to all indented code blocks in the Python API Examples and Alternative command-line usage sections (lines 72–92, 96–116, 120–160).

Also applies to: 96-116, 120-160


263-263: Specify language for fenced code blocks at lines 263 and 320. Static analysis flags indicate these code blocks lack language specification, which prevents syntax highlighting and readability.

Add language specification to the fenced code blocks:

 **Instruction Following:**

-```
+```text
 ----------------------- mmau-pro.instruction_following -------------------------

And similarly for line 320:

 **Overall Aggregate Score:**

-```
+```text
 -------------------------------- mmau-pro -----------------------------------------



Also applies to: 320-320


<details>
<summary>📜 Review details</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 48cbf77596cb6abd6abf65b912ed408368bfd678 and 7287e0ad617ad785e39b1c1f62f116b20900123b.

</details>

<details>
<summary>📒 Files selected for processing (13)</summary>

* `docs/evaluation/speech-audio.md` (7 hunks)
* `nemo_skills/dataset/mmau-pro/closed_form/__init__.py` (1 hunks)
* `nemo_skills/dataset/mmau-pro/open_ended/__init__.py` (1 hunks)
* `nemo_skills/dataset/mmau-pro/prepare.py` (1 hunks)
* `nemo_skills/evaluation/metrics/mmau_pro_metrics.py` (3 hunks)
* `nemo_skills/inference/generate.py` (3 hunks)
* `nemo_skills/inference/model/vllm.py` (4 hunks)
* `nemo_skills/prompt/config/judge/mmau-pro.yaml` (1 hunks)
* `nemo_skills/prompt/config/judge/speechlm.yaml` (0 hunks)
* `tests/gpu-tests/run_qwen.sh` (1 hunks)
* `tests/gpu-tests/test-local.yaml` (1 hunks)
* `tests/gpu-tests/test_vllm_audio.py` (1 hunks)
* `tests/test_vllm_audio.py` (1 hunks)

</details>

<details>
<summary>💤 Files with no reviewable changes (1)</summary>

* nemo_skills/prompt/config/judge/speechlm.yaml

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🧬 Code graph analysis (4)</summary>

<details>
<summary>nemo_skills/evaluation/metrics/mmau_pro_metrics.py (2)</summary><blockquote>

<details>
<summary>nemo_skills/evaluation/metrics/base.py (3)</summary>

* `as_int` (443-446)
* `as_percentage` (437-440)
* `_compute_pass_at_k` (352-423)

</details>
<details>
<summary>nemo_skills/utils.py (1)</summary>

* `get_logger_name` (39-43)

</details>

</blockquote></details>
<details>
<summary>tests/gpu-tests/test_vllm_audio.py (2)</summary><blockquote>

<details>
<summary>tests/gpu-tests/utils.py (1)</summary>

* `require_env_var` (18-23)

</details>
<details>
<summary>nemo_skills/pipeline/utils/declarative.py (1)</summary>

* `run` (346-483)

</details>

</blockquote></details>
<details>
<summary>tests/test_vllm_audio.py (1)</summary><blockquote>

<details>
<summary>nemo_skills/inference/model/vllm.py (2)</summary>

* `audio_file_to_base64` (29-33)
* `content_text_to_list` (112-132)

</details>

</blockquote></details>
<details>
<summary>nemo_skills/dataset/mmau-pro/prepare.py (1)</summary><blockquote>

<details>
<summary>nemo_skills/inference/chat_interface/core.py (1)</summary>

* `get` (136-151)

</details>

</blockquote></details>

</details><details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

<details>
<summary>docs/evaluation/speech-audio.md</summary>

72-72: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)

---

96-96: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)

---

120-120: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)

---

263-263: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

320-320: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>
<details>
<summary>🪛 Ruff (0.14.7)</summary>

<details>
<summary>nemo_skills/evaluation/metrics/mmau_pro_metrics.py</summary>

123-123: Loop control variable `agg_mode` not used within loop body

Rename unused `agg_mode` to `_agg_mode`

(B007)

</details>
<details>
<summary>tests/gpu-tests/test_vllm_audio.py</summary>

31-31: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"

(S108)

---

79-79: `subprocess` call with `shell=True` identified, security issue

(S602)

</details>

</details>

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)</summary>

* GitHub Check: unit-tests
* GitHub Check: pre-commit

</details>

<details>
<summary>🔇 Additional comments (21)</summary><blockquote>

<details>
<summary>nemo_skills/dataset/mmau-pro/open_ended/__init__.py (1)</summary><blockquote>

`26-26`: **LGTM!**

The prompt configuration reference is correctly updated to use the new `judge/mmau-pro` configuration, which aligns with the new prompt file introduced in this PR.

</blockquote></details>
<details>
<summary>nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1)</summary><blockquote>

`19-19`: **LGTM!**

The new `EVAL_ARGS` constant follows the established pattern and properly configures evaluation type for closed-form MMAU-Pro assessments.

</blockquote></details>
<details>
<summary>nemo_skills/prompt/config/judge/mmau-pro.yaml (1)</summary><blockquote>

`1-30`: **LGTM!**

The judge prompt is well-structured with clear multi-criteria scoring instructions. The output format aligns correctly with the `extract_multicriteria_scores` function in `mmau_pro_metrics.py`.

</blockquote></details>
<details>
<summary>nemo_skills/dataset/mmau-pro/prepare.py (3)</summary><blockquote>

`87-91`: **Verify the hardcoded absolute path prefix.**

The path prefix `/dataset/mmau-pro/` is hardcoded, which assumes a specific cluster mount point. This could cause issues if the deployment environment differs or if the dataset is mounted at a different location.

Consider making this configurable via an argument or environment variable for flexibility across different deployment environments.

---

`78-79`: **LGTM!**

The MCQ formatting is improved with clearer option labels using "A) B) ..." format, and the instruction "Respond with the complete text of the correct option, not just the letter" provides explicit guidance to the model.

---

`93-98`: **LGTM!**

The system message structure with conditional `/no_think` for non-open-ended questions appropriately controls reasoning behavior. The two-element message structure (system + user) is a clean approach.
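The two-element structure discussed above can be sketched roughly as below. The function name, the `category != "open"` check, and the option-labeling details are illustrative reconstructions from the change summary, not the PR's exact code.

```python
def build_entry(question: str, options: list, category: str, audio_paths: list) -> dict:
    """Build a system + user message pair with absolute audio paths (sketch)."""
    # Label options as "A) ..." / "B) ..." and ask for the full option text
    labeled = " ".join(f"{chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
    user_text = (
        f"{question} {labeled} "
        "Respond with the complete text of the correct option, not just the letter."
    )
    if category != "open":
        user_text += " /no_think"  # disable reasoning outside open-ended questions

    user_msg = {"role": "user", "content": user_text}
    abs_paths = [f"/dataset/mmau-pro/{p}" for p in audio_paths]
    # "audio" key for a single clip, "audios" for a list of clips
    if len(abs_paths) == 1:
        user_msg["audio"] = {"path": abs_paths[0]}
    else:
        user_msg["audios"] = [{"path": p} for p in abs_paths]

    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            user_msg,
        ]
    }
```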

</blockquote></details>
<details>
<summary>nemo_skills/evaluation/metrics/mmau_pro_metrics.py (2)</summary><blockquote>

`82-91`: **LGTM!**

The scoring logic correctly derives correctness from the multicriteria judgement with the threshold of >= 3.0 for open-ended questions, while maintaining the existing binary correctness path for closed-form evaluations.

---

`130-151`: **LGTM!**

The multi-criteria metric aggregation is well-implemented with proper conversion from 1-5 scale to percentage format, and the good/poor response rate thresholds (>=4 and <=2) provide useful quality indicators.

</blockquote></details>
<details>
<summary>tests/gpu-tests/test-local.yaml (1)</summary><blockquote>

`20-20`: **Container registry access consideration.**

The `nvcr.io/nvidian/` registry is an internal NVIDIA registry. External contributors or CI systems without proper credentials may not be able to pull this image. Consider documenting authentication requirements or providing an alternative public image for broader compatibility.

</blockquote></details>
<details>
<summary>tests/gpu-tests/run_qwen.sh (1)</summary><blockquote>

`16-18`: **LGTM!**

The audio test integration follows the existing pattern correctly. The model switch to `Qwen2.5-Omni-3B` is appropriate for audio testing, and the placement between generation tests and contamination tests is logical.

</blockquote></details>
<details>
<summary>tests/test_vllm_audio.py (5)</summary><blockquote>

`25-39`: **LGTM!**

The `test_audio_file_to_base64` test correctly validates the encoding roundtrip with proper temp file cleanup.

---

`42-48`: **LGTM!**

The fixture properly initializes `VLLMModel` with `data_dir` for relative path resolution in audio handling.

---

`51-66`: **LGTM!**

Good coverage of single audio content conversion with proper assertions on structure and data URL format.

---

`69-90`: **LGTM!**

Tests the multiple audio scenario correctly, verifying that each audio file results in a separate `audio_url` entry.

---

`93-120`: **LGTM!**

Comprehensive test that validates the entire request building flow including base64 encoding verification.

</blockquote></details>
<details>
<summary>nemo_skills/inference/generate.py (2)</summary><blockquote>

`532-544`: **LGTM!**

The `drop_binary_data` method correctly:
1. Guards against missing `messages` key
2. Guards against non-list content (string content in simple messages)
3. Filters out `audio_url` items to reduce output file size

The in-place mutation is appropriate here since the output dict is transient.

---

`561-562`: **LGTM!**

Correct placement of `drop_binary_data` call - after merging original data but before reasoning parsing, ensuring binary data doesn't interfere with text processing.

</blockquote></details>
<details>
<summary>tests/gpu-tests/test_vllm_audio.py (1)</summary><blockquote>

`38-59`: The hardcoded audio paths are part of the intended container setup and do not pose a risk. The audio files (`t2_16.wav` and `t3_16.wav`) exist in the repository at `tests/slurm-tests/asr_nim/wavs/`, and the `/nemo_run/code/` prefix is the standard container mount point path used consistently throughout the test suite. This is expected behavior for container-based tests.

</blockquote></details>
<details>
<summary>docs/evaluation/speech-audio.md (3)</summary><blockquote>

`3-16`: **Clear categorical introduction and server support overview.** The new structure effectively explains the scope (Audio understanding, ASR, AST) and provides a quick reference table for supported server types. Well organized.

---

`34-64`: **Data preparation section is clear and well-documented.** Explains the default behavior (audio download), provides explicit guidance on text-only mode with appropriate warnings, and includes environment variable and flag requirements. The distinction between modes and use cases is helpful.

---

`232-323`: **Results section is comprehensive and well-structured.** Directory structure, output format, and example metrics are clearly documented. The metrics shown across categories are internally consistent and demonstrate realistic variation (e.g., average tokens lower for open-ended than closed-form, appropriate num_entries for each category).

</blockquote></details>

</blockquote></details>

</details>


Comment on lines +54 to +57
    # Fallback: compute overall if missing or still 3.0
    if "overall" not in scores or scores["overall"] == 3.0:
        criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
        scores["overall"] = sum(criteria_scores) / len(criteria_scores)
Contributor


⚠️ Potential issue | 🟠 Major

Fallback logic incorrectly overwrites legitimate score of 3.0.

The condition scores["overall"] == 3.0 will overwrite a legitimately assigned overall score of 3.0 with the computed average. A score of 3.0 could be a valid LLM response.

Consider tracking whether the overall score was actually matched vs. defaulted:

     for criterion, pattern in patterns.items():
         match = re.search(pattern, judgement_text, re.IGNORECASE)
-        scores[criterion] = float(match.group(1)) if match else 3.0
+        if match:
+            scores[criterion] = float(match.group(1))
+        else:
+            scores[criterion] = 3.0
+            if criterion == "overall":
+                scores["_overall_missing"] = True

-    # Fallback: compute overall if missing or still 3.0
-    if "overall" not in scores or scores["overall"] == 3.0:
+    # Fallback: compute overall if missing
+    if scores.get("_overall_missing", False):
         criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
         scores["overall"] = sum(criteria_scores) / len(criteria_scores)
+        del scores["_overall_missing"]

     return scores

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
nemo_skills/evaluation/metrics/mmau_pro_metrics.py around lines 54-57: the
current fallback checks scores["overall"] == 3.0 and thus overwrites a
legitimately assigned 3.0; change the logic to only compute and set overall when
the key is absent or explicitly marked as a default (e.g. use None as the
default overall value or a separate flag like scores["_overall_defaulted"]), so
update callers to set scores["overall"]=None or
scores["_overall_defaulted"]=True when no real overall exists, and then replace
the condition with a check for key absence or None/flag before computing the
average from the four criteria.
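A runnable sketch of the suggested fix, using `None` as the sentinel so a legitimate 3.0 from the judge is never overwritten. The regex patterns are assumptions about the judge's output format, not the PR's exact implementation.

```python
import re

CRITERIA = ["correctness", "relevance", "completeness", "clarity"]


def extract_multicriteria_scores(judgement_text: str) -> dict:
    """Parse 1-5 scores; compute overall only when the judge omitted it."""
    scores = {}
    for criterion in CRITERIA + ["overall"]:
        match = re.search(rf"{criterion}\s*[:=]\s*([1-5](?:\.\d+)?)", judgement_text, re.IGNORECASE)
        scores[criterion] = float(match.group(1)) if match else None
    for criterion in CRITERIA:
        if scores[criterion] is None:
            scores[criterion] = 3.0  # neutral default for missing criteria
    if scores["overall"] is None:  # genuinely missing, not merely equal to 3.0
        scores["overall"] = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    return scores
```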

Comment on lines +122 to +131
        if "audio" in message:
            audio = message["audio"]
            base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
            audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
            message["content"].append(audio_message)
        elif "audios" in message:
            for audio in message["audios"]:
                base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
                audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
                message["content"].append(audio_message)
Contributor


⚠️ Potential issue | 🟠 Major

Absolute paths will bypass data_dir prefix.

When audio["path"] starts with / (as generated by prepare.py with paths like /dataset/mmau-pro/...), os.path.join ignores data_dir entirely. This means the data_dir parameter becomes ineffective for the absolute paths currently being generated.

Either:

  1. Store relative paths in prepare.py and rely on data_dir at inference time, or
  2. Remove data_dir handling if paths are always absolute.
         if "audio" in message:
             audio = message["audio"]
-            base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
+            audio_path = audio["path"]
+            if not os.path.isabs(audio_path):
+                audio_path = os.path.join(self.data_dir, audio_path)
+            base64_audio = audio_file_to_base64(audio_path)
             audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
             message["content"].append(audio_message)
         elif "audios" in message:
             for audio in message["audios"]:
-                base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
+                audio_path = audio["path"]
+                if not os.path.isabs(audio_path):
+                    audio_path = os.path.join(self.data_dir, audio_path)
+                base64_audio = audio_file_to_base64(audio_path)
                 audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
                 message["content"].append(audio_message)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        if "audio" in message:
-            audio = message["audio"]
-            base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
-            audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
-            message["content"].append(audio_message)
-        elif "audios" in message:
-            for audio in message["audios"]:
-                base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
-                audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
-                message["content"].append(audio_message)
+        if "audio" in message:
+            audio = message["audio"]
+            audio_path = audio["path"]
+            if not os.path.isabs(audio_path):
+                audio_path = os.path.join(self.data_dir, audio_path)
+            base64_audio = audio_file_to_base64(audio_path)
+            audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
+            message["content"].append(audio_message)
+        elif "audios" in message:
+            for audio in message["audios"]:
+                audio_path = audio["path"]
+                if not os.path.isabs(audio_path):
+                    audio_path = os.path.join(self.data_dir, audio_path)
+                base64_audio = audio_file_to_base64(audio_path)
+                audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
+                message["content"].append(audio_message)
🤖 Prompt for AI Agents
nemo_skills/inference/model/vllm.py around lines 122-131: the current
os.path.join(self.data_dir, audio["path"]) will ignore data_dir when
audio["path"] is absolute (starts with '/'), making data_dir ineffective; update
both branches to first check if audio["path"] is absolute (os.path.isabs) and
use it directly, otherwise join with self.data_dir, or alternatively normalize
by stripping a leading slash (audio["path"].lstrip("/")) before joining so
prepared absolute-looking paths become relative; apply the same change for the
single "audio" and the "audios" loop.
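The resolution rule suggested above can be factored into a small helper (the function name is illustrative):

```python
import os


def resolve_audio_path(data_dir: str, audio_path: str) -> str:
    """Use absolute paths as-is; join relative paths with data_dir."""
    if os.path.isabs(audio_path):
        return audio_path
    return os.path.join(data_dir, audio_path)
```

Calling this in both the single-`audio` and the `audios` branch avoids duplicating the `isabs` check.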

Collaborator

@gwarmstrong gwarmstrong left a comment


Have a few suggestions/questions for moving forward

class VLLMModel(BaseModel):
-    def __init__(self, **kwargs):
+    def __init__(self, data_dir: str = "", **kwargs):
+        self.data_dir = data_dir
Collaborator


Can we move this out of the VLLMModel class? E.g., we could format the audio file path around here: https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/generate.py#L579

Also, would it be safe to assume that paths are relative to the input_file? That would let us remove a parameter here.

Member

@Jorjeous Jorjeous Dec 10, 2025


Please check again; if this is not outdated, then no: we should keep it here, since we now need to chunk audio.

"extra_body": self._build_request_body(top_k, min_p, repetition_penalty, extra_body=extra_body),
}

def content_text_to_list(self, message):
Collaborator


Can any of this logic be used outside of vLLM? E.g., does it also work for OpenAI APIs?

Collaborator


Even if not, I think it may have more utility if we make a hook for it in generate.py rather than here

continue

# Filter out audio_url items from list-style content
message["content"] = [content for content in message["content"] if content.get("type") != "audio_url"]
Collaborator


Can you make "audio_url" here a list specified in the config? That way it's configurable, and other fields can be included/excluded if desired.
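A sketch of what a configurable version might look like (`drop_content_types` and the function name are illustrative, not existing parameters):

```python
def strip_content_types(messages, drop_content_types=("audio_url",)):
    # Remove list-style content items whose "type" is in drop_content_types;
    # plain string content is left untouched.
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            message["content"] = [
                item for item in content if item.get("type") not in drop_content_types
            ]
    return messages
```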

containers:
trtllm: nvcr.io/nvidia/tensorrt-llm/release:1.0.0
vllm: vllm/vllm-openai:v0.10.1.1
vllm-audio: nvcr.io/nvidian/ac-aiapps/vllm-openai-audio:v1.0.0
Collaborator


Is this org/image gated? That will cause a failure if it is used in CI. I don't actually see it being used, though; would it be used as the server container in the GPU tests?

Member


I am pretty sure this image is org-gated. Yes, I am using it in the GPU tests. What's the recommended workaround in this case?

Collaborator


This will be built into the base vLLM image now; no need to pull the nvcr container.

Member

@Jorjeous Jorjeous left a comment


In vllm.py there is a critical issue: the order in which the model receives the audio and the instruction.

Below is a corrected version of the function.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

clean up

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member

Added postprocessing specific to Qwen models; maybe we should add a check for the model type.

Jorjeous and others added 4 commits December 10, 2025 09:35
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
@melllinia melllinia force-pushed the audio_bin branch 2 times, most recently from 5e9cc99 to 62d5166 on December 16, 2025 10:43
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Collaborator

@gwarmstrong gwarmstrong left a comment


Please see #1137 for a suggested architectural change to help consolidate all the audio-specific logic and clean up the vLLM client here a little bit.

Signed-off-by: George Armstrong <georgea@nvidia.com>
@Jorjeous
Member

Issues with the task type: it seems we lost auto-enabling based on the presence of "data_dir". Working on it.

@Jorjeous
Member

About chunk size: let's leave it at 30s, as most models work fine with it, while a 60s default would significantly affect results.
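For illustration, a 30-second default amounts to fixed-length splitting of the decoded waveform along these lines (a sketch; the PR's actual chunking code is not shown here and may operate on files or bytes instead):

```python
def chunk_samples(samples, sample_rate, chunk_seconds=30):
    # Split a 1-D sequence of PCM samples into consecutive chunks of at most
    # chunk_seconds each; the last chunk may be shorter than the rest.
    chunk_len = int(sample_rate * chunk_seconds)
    return [samples[i : i + chunk_len] for i in range(0, len(samples), chunk_len)]
```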

1) After refactoring, AudioProcessor worked only if we passed the ++audio arg; this is inconvenient.

2) Now audio enabling is based on dataset_group=speechlm or task_type=audio.

All audio-containing benchmarks have one or both.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member

Tested on QWEN, LGTM

@Jorjeous Jorjeous self-requested a review December 23, 2025 16:13
@Jorjeous
Member

Jorjeous commented Jan 8, 2026

Need to fix the import of audio_file_to_base64.
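For context, the helper in question is presumably a thin base64 wrapper along these lines (a sketch; the real implementation and its module location are exactly what the import fix concerns):

```python
import base64


def audio_file_to_base64(audio_path: str) -> str:
    # Read the raw audio bytes and return them as a base64 string,
    # suitable for embedding in a data:audio/wav;base64,... URL.
    with open(audio_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```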

# Don't use /no_think for open-ended questions to allow reasoning
system_content = "You are a helpful assistant."
if category != "open":
system_content += " /no_think"
Collaborator


This is specific to Qwen models; I don't think we should be doing it globally here?

Member


Imho, this is not a problem for other models.

Collaborator


It certainly introduces a little mismatch, whether it's significant or not for accuracy is unclear, but I think it's much safer not to have it. What's the reason why we need to hardcode it here?
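One way to avoid the global behaviour would be to gate the suffix on the model family (a sketch; `build_system_content` and the `model_name` parameter are hypothetical names, not code from the PR):

```python
def build_system_content(category: str, model_name: str) -> str:
    # Only append the /no_think control token for Qwen-family models
    # and non-open categories; other models get the plain system prompt.
    system_content = "You are a helpful assistant."
    if category != "open" and "qwen" in model_name.lower():
        system_content += " /no_think"
    return system_content
```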


# Audio wrapper (preprocesses messages before they reach the model)
# Auto-enable for audio benchmarks on vLLM (eval_type=audio OR dataset_group=speechlm)
should_enable_audio = self.cfg.audio is not None or (
Collaborator


I'd rather configure this through prepare.py of the relevant benchmarks. Otherwise this logic is too coupled with the datasets, while this module is very general. So instead of dataset_group, I'd add a flag should_enable_audio that you can set as a default parameter in the init of those benchmarks. You can still check that vLLM is used as the server and keep the other wrapping logic; just let's remove the dataset_group-based behaviour.

Member


Got your point; when moving to a separate class in PR #1157 this will be auto-resolved.
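The suggestion above amounts to something like the following, with `should_enable_audio` set explicitly per benchmark instead of inferred from dataset_group (names are illustrative):

```python
# Hypothetical per-benchmark defaults, e.g. set in a dataset's __init__.py:
DEFAULT_BENCHMARK_ARGS = {"should_enable_audio": True}


def audio_enabled(cfg: dict) -> bool:
    # Enable audio preprocessing when explicitly requested via ++audio,
    # or when the benchmark opts in through its own flag.
    return cfg.get("audio") is not None or cfg.get("should_enable_audio", False)
```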

for output in outputs:
fout.write(json.dumps(output) + "\n")

def drop_binary_data(self, output):
Collaborator


Maybe just drop_data, or better drop_fields_from_messages or something like this, as it's not limited to binary data but just checks which fields to drop based on the parameter?

Member


Ok, why not

generation_args = f"{eval_args} {generation_args}"
generation_args += f" ++eval_config.split={split} "

# Pass dataset_group from benchmark config if defined
Collaborator


so I'd remove this as well

Member


No harm from one additional field, and it makes it a bit more convenient to separate datasets.
Still want to remove it?

@Jorjeous
Member

Jorjeous commented Jan 9, 2026

@Kipok
Moving this to #1157 with some arch changes; will address comments there.

@Kipok
Collaborator

Kipok commented Jan 10, 2026

should we close this one then @Jorjeous ?

@Jorjeous Jorjeous closed this Jan 10, 2026