
Request to vLLM with audio message #1042

Closed
karpnv wants to merge 40 commits into main from audio_bin

Conversation

@karpnv
Collaborator

@karpnv karpnv commented Nov 13, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Added audio transcription and evaluation support with multicriteria scoring for correctness, relevance, completeness, and clarity.
    • Audio content encoding for model inference requests.
  • Documentation

    • Expanded audio evaluation guides with server setup examples and evaluation workflows.
    • Updated result formats and benchmark descriptions.
  • Tests

    • Added audio generation and evaluation test coverage.


@coderabbitai
Contributor

coderabbitai bot commented Dec 3, 2025

📝 Walkthrough


This PR introduces comprehensive audio/speech evaluation support to the NemoSkills framework. Changes encompass dataset preparation updates for MMAU-Pro with revised message formatting and absolute audio path handling, vLLM model extensions for base64 audio encoding and content preprocessing, inference layer updates to strip binary data during postprocessing, multi-criteria scoring mechanisms in metrics calculation, new judge prompt configuration, and accompanying integration and unit tests.

Changes

Cohort / File(s) Summary
Documentation
docs/evaluation/speech-audio.md
Comprehensive update with new overview sections for audio understanding categories, supported server types table, expanded MMAU-Pro benchmark descriptions with subcategories, revised preparation and evaluation workflow guidance with multi-profile examples (vLLM and Megatron), updated result tables reflecting new evaluation outputs, and reorganized "Understanding Results" sections.
Dataset Configuration & Preparation
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
nemo_skills/dataset/mmau-pro/open_ended/__init__.py
Added EVAL_ARGS constant in closed_form module; updated JUDGE_ARGS in open_ended module to switch prompt_config from judge/speechlm to judge/mmau-pro.
Dataset Preparation Logic
nemo_skills/dataset/mmau-pro/prepare.py
Updated MCQ formatting with option text (A) B) ...) and completion prompt; changed audio path handling from per-path with durations to absolute /dataset/mmau-pro/ paths using "audio" (single) or "audios" (list) keys; introduced system message "You are a helpful assistant." and appended " /no_think" for non-open categories to disable reasoning; restructured final messages as two-element list (system + user).
Inference Model Audio Handling
nemo_skills/inference/model/vllm.py
Added audio_file_to_base64() utility for encoding audio to base64 data URLs; extended VLLMModel.init with data_dir parameter; introduced content_text_to_list() method to normalize mixed text/audio message content; applied content preprocessing in _build_chat_request_params.
Inference Output Processing
nemo_skills/inference/generate.py
Added drop_binary_data() method in GenerationTask to strip binary content (audio_url type) from message outputs during postprocessing; initialized data_dir attribute from eval_config.
Evaluation Metrics
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
Introduced extract_multicriteria_scores() function to parse 1-5 scale scores (correctness, relevance, completeness, clarity, overall) from judgement text; added multicriteria_scores attribute to MMAUProMetrics; augmented get_metrics with per-criterion averages/std as percentages, good/poor response rates; exposed multicriteria metrics in metrics_to_print output.
Judge Prompt Configuration
nemo_skills/prompt/config/judge/mmau-pro.yaml (added)
nemo_skills/prompt/config/judge/speechlm.yaml (deleted)
Added new mmau-pro.yaml judge prompt for multi-criteria evaluation across correctness, relevance, completeness, and clarity on 1-5 scale with structured output format; removed legacy speechlm.yaml binary judgment configuration.
Test Infrastructure & Configuration
tests/gpu-tests/run_qwen.sh
tests/gpu-tests/test-local.yaml
Added audio-generation test execution for Qwen/Qwen2.5-Omni-3B model; added vllm-audio container definition (nvcr.io/nvidian/ac-aiapps/vllm-openai-audio:v1.0.0).
Integration Tests
tests/gpu-tests/test_vllm_audio.py
Added test_vllm_audio_generation() integration test validating end-to-end audio transcription flow via vLLM server with JSONL input/output validation.
Unit Tests
tests/test_vllm_audio.py
Added comprehensive unit tests for audio_file_to_base64 encoding/decoding, VLLMModel.content_text_to_list message preprocessing (single and multiple audios), _build_chat_request_params with audio content structure, and async generation with mocked responses.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Dataset
    participant Inference as Inference<br/>(generate.py)
    participant VLLMModel as VLLMModel<br/>(vllm.py)
    participant vLLMServer as vLLM Server
    participant Judge as Judge Model
    participant Metrics as Metrics<br/>(mmau_pro_metrics.py)

    User->>Inference: Input with audio path
    Inference->>VLLMModel: Prepare message with audio
    VLLMModel->>VLLMModel: audio_file_to_base64()
    VLLMModel->>VLLMModel: content_text_to_list()<br/>(normalize to text + audio_url)
    VLLMModel->>vLLMServer: _build_chat_request_params<br/>(send base64 data URL)
    vLLMServer->>vLLMServer: Process audio + text
    vLLMServer-->>VLLMModel: Generation response
    VLLMModel-->>Inference: Model output
    Inference->>Inference: drop_binary_data()<br/>(strip audio_url)
    Inference-->>User: Cleaned output

    User->>Judge: Eval: generation vs reference
    Judge->>Judge: Score on 1-5 scale
    Judge-->>Metrics: Judgement text
    Metrics->>Metrics: extract_multicriteria_scores()
    Metrics->>Metrics: Calculate per-criterion<br/>avg/std/rates
    Metrics-->>User: Multi-criteria metrics
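The normalization step in the diagram can be sketched as follows. This is an illustrative reconstruction based on the walkthrough above, not the PR's exact code; the field names (`audio`, `audios`, `path`) and the hardcoded `audio/wav` MIME type follow the descriptions in this review.

```python
import base64
import os


def content_text_to_list(message: dict, data_dir: str = "") -> dict:
    """Normalize a message so string content becomes a typed list, then append
    any referenced audio files as base64 data URLs (sketch of the flow above)."""
    if "audio" not in message and "audios" not in message:
        return message
    if isinstance(message["content"], str):
        message["content"] = [{"type": "text", "text": message["content"]}]
    audios = [message["audio"]] if "audio" in message else message["audios"]
    for audio in audios:
        with open(os.path.join(data_dir, audio["path"]), "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        message["content"].append(
            {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{b64}"}}
        )
    return message
```

vLLM's OpenAI-compatible endpoint accepts such `audio_url` content parts in chat messages for audio-capable models, which is why the text is lifted into a typed list before the audio parts are appended.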

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

  • Audio encoding/decoding logic (vllm.py audio_file_to_base64, content_text_to_list, base64 data URL construction) — verify correctness of RIFF header detection and base64 encoding safety.
  • Message structure transformations (prepare.py audio path handling, message list restructuring with system/user roles) — ensure backward compatibility and correct format for downstream consumers.
  • Multi-criteria scoring extraction (mmau_pro_metrics.py regex parsing of judgement text for five criteria) — validate robustness of score extraction and fallback logic when fields are missing.
  • Binary data removal (generate.py drop_binary_data filtering) — confirm proper integration with postprocessing pipeline and memory implications.
  • Cross-layer integration — verify data flow consistency between dataset preparation, model inference, and metrics calculation through end-to-end test execution paths.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 66.67%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately describes the main change: adding support for sending audio messages to vLLM endpoints, the core focus across vllm.py, the test files, and the documentation updates.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/inference/generate.py (1)

383-398: data_dir is set but never passed to model factory functions.

self.data_dir is extracted from cfg.eval_config (lines 384-385) but not passed to any of the three model factory calls (get_code_execution_model, get_tool_calling_model, or get_model). Since VLLMModel expects data_dir as a constructor parameter for resolving relative audio paths, this needs to be added: include data_dir=self.data_dir in each factory call.

🧹 Nitpick comments (9)
nemo_skills/inference/model/vllm.py (3)

112-132: Method mutates input message dictionary.

content_text_to_list modifies the message dict in-place, which could cause unintended side effects if the same message object is reused elsewhere. Consider working on a copy.

     def content_text_to_list(self, message):
+        message = message.copy()  # Avoid mutating the original
         if "audio" in message or "audios" in message:
             content = message["content"]
             if isinstance(content, str):

29-33: Add error handling for file operations.

The function doesn't handle FileNotFoundError or IOError, which could lead to cryptic error messages when audio files are missing or inaccessible.

 def audio_file_to_base64(audio_file_path: str):
     """Encodes an audio file into a base64 string."""
-    with open(audio_file_path, "rb") as audio_file:
-        audio_content = audio_file.read()
-        return base64.b64encode(audio_content).decode("utf-8")
+    try:
+        with open(audio_file_path, "rb") as audio_file:
+            audio_content = audio_file.read()
+            return base64.b64encode(audio_content).decode("utf-8")
+    except FileNotFoundError:
+        raise FileNotFoundError(f"Audio file not found: {audio_file_path}")
+    except IOError as e:
+        raise IOError(f"Failed to read audio file {audio_file_path}: {e}")

125-125: Hardcoded MIME type may be incorrect for non-WAV audio files.

The MIME type is hardcoded as audio/wav, but audio files could be in other formats (e.g., MP3, FLAC, OGG). Consider detecting the format from the file extension or magic bytes.
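One way to address this nitpick, sketched with Python's standard `mimetypes` module; this is an assumption about a possible fix, not the PR's code.

```python
import base64
import mimetypes


def audio_file_to_data_url(audio_file_path: str) -> str:
    """Encode an audio file as a data URL, guessing the MIME type from the
    file extension and falling back to audio/wav when it cannot be guessed."""
    mime_type, _ = mimetypes.guess_type(audio_file_path)
    if mime_type is None or not mime_type.startswith("audio/"):
        mime_type = "audio/wav"  # conservative default
    with open(audio_file_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

Sniffing magic bytes (e.g. a `RIFF` header for WAV) would be more robust than extensions, but the extension-based guess covers the common formats with no extra dependencies.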

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)

123-123: Rename unused loop variable per static analysis.

The loop variable agg_mode is not used within the loop body.

-        for agg_mode, agg_metrics in metrics_dict.items():
+        for _agg_mode, agg_metrics in metrics_dict.items():
tests/gpu-tests/test_vllm_audio.py (1)

92-94: Missing output directory cleanup in finally block.

The finally block cleans up the temp input file but doesn't remove output_dir. While the test pre-cleans the directory at the start (lines 32-34), adding cleanup to the finally block ensures cleanup happens even if the test fails before reaching that point in subsequent runs.

     finally:
         # Cleanup temp file
         Path(input_file).unlink(missing_ok=True)
+        # Cleanup output directory
+        if Path(output_dir).exists():
+            shutil.rmtree(output_dir, ignore_errors=True)
tests/test_vllm_audio.py (1)

123-146: Consider testing actual audio preprocessing instead of mocking the entire method.

This test mocks generate_async entirely, which means it doesn't actually exercise the audio preprocessing logic (content_text_to_list, base64 encoding). The real value would be testing that audio messages are correctly transformed before the API call.

Consider either:

  1. Mocking at a lower level (e.g., the HTTP client) to test the actual transformation
  2. Removing this test since test_build_chat_request_with_audio already covers the preprocessing logic
nemo_skills/inference/generate.py (1)

383-385: Simplify the None check.

The isinstance(..., type(None)) check is unconventional. A simpler is not None would be clearer.

         self.data_dir = None
-        if "data_dir" in self.cfg.eval_config and not isinstance(self.cfg.eval_config.get("data_dir"), type(None)):
+        if self.cfg.eval_config.get("data_dir") is not None:
             self.data_dir = self.cfg.eval_config["data_dir"]
docs/evaluation/speech-audio.md (2)

72-92: Convert indented code blocks to fenced syntax for consistency and readability. Static analysis flags indicate these code blocks use indentation-based syntax rather than fenced blocks (triple backticks), which is the modern markdown convention and enables language-specific syntax highlighting.

Convert indented code blocks to fenced blocks with language specification:

-    ```python
-    import os
-    from nemo_skills.pipeline.cli import wrap_arguments, eval
+```python
+import os
+from nemo_skills.pipeline.cli import wrap_arguments, eval

-    eval(
-        ctx=wrap_arguments(""),
+eval(
+    ctx=wrap_arguments(""),
-    )
-    ```
+)
+```

Apply the same pattern to all indented code blocks in the Python API Examples and Alternative command-line usage sections (lines 72–92, 96–116, 120–160).

Also applies to: 96-116, 120-160


263-263: Specify language for fenced code blocks at lines 263 and 320. Static analysis flags indicate these code blocks lack language specification, which prevents syntax highlighting and readability.

Add language specification to the fenced code blocks:

 **Instruction Following:**

-```
+```text
 ----------------------- mmau-pro.instruction_following -------------------------

And similarly for line 320:

 **Overall Aggregate Score:**

-```
+```text
 -------------------------------- mmau-pro -----------------------------------------



Also applies to: 320-320


<details>
<summary>📜 Review details</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 48cbf77596cb6abd6abf65b912ed408368bfd678 and 7287e0ad617ad785e39b1c1f62f116b20900123b.

</details>

<details>
<summary>📒 Files selected for processing (13)</summary>

* `docs/evaluation/speech-audio.md` (7 hunks)
* `nemo_skills/dataset/mmau-pro/closed_form/__init__.py` (1 hunks)
* `nemo_skills/dataset/mmau-pro/open_ended/__init__.py` (1 hunks)
* `nemo_skills/dataset/mmau-pro/prepare.py` (1 hunks)
* `nemo_skills/evaluation/metrics/mmau_pro_metrics.py` (3 hunks)
* `nemo_skills/inference/generate.py` (3 hunks)
* `nemo_skills/inference/model/vllm.py` (4 hunks)
* `nemo_skills/prompt/config/judge/mmau-pro.yaml` (1 hunks)
* `nemo_skills/prompt/config/judge/speechlm.yaml` (0 hunks)
* `tests/gpu-tests/run_qwen.sh` (1 hunks)
* `tests/gpu-tests/test-local.yaml` (1 hunks)
* `tests/gpu-tests/test_vllm_audio.py` (1 hunks)
* `tests/test_vllm_audio.py` (1 hunks)

</details>

<details>
<summary>💤 Files with no reviewable changes (1)</summary>

* nemo_skills/prompt/config/judge/speechlm.yaml

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🧬 Code graph analysis (4)</summary>

<details>
<summary>nemo_skills/evaluation/metrics/mmau_pro_metrics.py (2)</summary><blockquote>

<details>
<summary>nemo_skills/evaluation/metrics/base.py (3)</summary>

* `as_int` (443-446)
* `as_percentage` (437-440)
* `_compute_pass_at_k` (352-423)

</details>
<details>
<summary>nemo_skills/utils.py (1)</summary>

* `get_logger_name` (39-43)

</details>

</blockquote></details>
<details>
<summary>tests/gpu-tests/test_vllm_audio.py (2)</summary><blockquote>

<details>
<summary>tests/gpu-tests/utils.py (1)</summary>

* `require_env_var` (18-23)

</details>
<details>
<summary>nemo_skills/pipeline/utils/declarative.py (1)</summary>

* `run` (346-483)

</details>

</blockquote></details>
<details>
<summary>tests/test_vllm_audio.py (1)</summary><blockquote>

<details>
<summary>nemo_skills/inference/model/vllm.py (2)</summary>

* `audio_file_to_base64` (29-33)
* `content_text_to_list` (112-132)

</details>

</blockquote></details>
<details>
<summary>nemo_skills/dataset/mmau-pro/prepare.py (1)</summary><blockquote>

<details>
<summary>nemo_skills/inference/chat_interface/core.py (1)</summary>

* `get` (136-151)

</details>

</blockquote></details>

</details><details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

<details>
<summary>docs/evaluation/speech-audio.md</summary>

72-72: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)

---

96-96: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)

---

120-120: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)

---

263-263: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

320-320: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>
<details>
<summary>🪛 Ruff (0.14.7)</summary>

<details>
<summary>nemo_skills/evaluation/metrics/mmau_pro_metrics.py</summary>

123-123: Loop control variable `agg_mode` not used within loop body

Rename unused `agg_mode` to `_agg_mode`

(B007)

</details>
<details>
<summary>tests/gpu-tests/test_vllm_audio.py</summary>

31-31: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"

(S108)

---

79-79: `subprocess` call with `shell=True` identified, security issue

(S602)

</details>

</details>

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)</summary>

* GitHub Check: unit-tests
* GitHub Check: pre-commit

</details>

<details>
<summary>🔇 Additional comments (21)</summary><blockquote>

<details>
<summary>nemo_skills/dataset/mmau-pro/open_ended/__init__.py (1)</summary><blockquote>

`26-26`: **LGTM!**

The prompt configuration reference is correctly updated to use the new `judge/mmau-pro` configuration, which aligns with the new prompt file introduced in this PR.

</blockquote></details>
<details>
<summary>nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1)</summary><blockquote>

`19-19`: **LGTM!**

The new `EVAL_ARGS` constant follows the established pattern and properly configures evaluation type for closed-form MMAU-Pro assessments.

</blockquote></details>
<details>
<summary>nemo_skills/prompt/config/judge/mmau-pro.yaml (1)</summary><blockquote>

`1-30`: **LGTM!**

The judge prompt is well-structured with clear multi-criteria scoring instructions. The output format aligns correctly with the `extract_multicriteria_scores` function in `mmau_pro_metrics.py`.

</blockquote></details>
<details>
<summary>nemo_skills/dataset/mmau-pro/prepare.py (3)</summary><blockquote>

`87-91`: **Verify the hardcoded absolute path prefix.**

The path prefix `/dataset/mmau-pro/` is hardcoded, which assumes a specific cluster mount point. This could cause issues if the deployment environment differs or if the dataset is mounted at a different location.

Consider making this configurable via an argument or environment variable for flexibility across different deployment environments.

---

`78-79`: **LGTM!**

The MCQ formatting is improved with clearer option labels using "A) B) ..." format, and the instruction "Respond with the complete text of the correct option, not just the letter" provides explicit guidance to the model.

---

`93-98`: **LGTM!**

The system message structure with conditional `/no_think` for non-open-ended questions appropriately controls reasoning behavior. The two-element message structure (system + user) is a clean approach.
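The two-element structure discussed above can be sketched roughly as below. The function name, the `category != "open"` check, and the option-labeling details are illustrative reconstructions from the change summary, not the PR's exact code.

```python
def build_entry(question: str, options: list, category: str, audio_paths: list) -> dict:
    """Build a system + user message pair with absolute audio paths (sketch)."""
    # Label options as "A) ..." / "B) ..." and ask for the full option text
    labeled = " ".join(f"{chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
    user_text = (
        f"{question} {labeled} "
        "Respond with the complete text of the correct option, not just the letter."
    )
    if category != "open":
        user_text += " /no_think"  # disable reasoning outside open-ended questions

    user_msg = {"role": "user", "content": user_text}
    abs_paths = [f"/dataset/mmau-pro/{p}" for p in audio_paths]
    # "audio" key for a single clip, "audios" for a list of clips
    if len(abs_paths) == 1:
        user_msg["audio"] = {"path": abs_paths[0]}
    else:
        user_msg["audios"] = [{"path": p} for p in abs_paths]

    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            user_msg,
        ]
    }
```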

</blockquote></details>
<details>
<summary>nemo_skills/evaluation/metrics/mmau_pro_metrics.py (2)</summary><blockquote>

`82-91`: **LGTM!**

The scoring logic correctly derives correctness from the multicriteria judgement with the threshold of >= 3.0 for open-ended questions, while maintaining the existing binary correctness path for closed-form evaluations.

---

`130-151`: **LGTM!**

The multi-criteria metric aggregation is well-implemented with proper conversion from 1-5 scale to percentage format, and the good/poor response rate thresholds (>=4 and <=2) provide useful quality indicators.

</blockquote></details>
<details>
<summary>tests/gpu-tests/test-local.yaml (1)</summary><blockquote>

`20-20`: **Container registry access consideration.**

The `nvcr.io/nvidian/` registry is an internal NVIDIA registry. External contributors or CI systems without proper credentials may not be able to pull this image. Consider documenting authentication requirements or providing an alternative public image for broader compatibility.

</blockquote></details>
<details>
<summary>tests/gpu-tests/run_qwen.sh (1)</summary><blockquote>

`16-18`: **LGTM!**

The audio test integration follows the existing pattern correctly. The model switch to `Qwen2.5-Omni-3B` is appropriate for audio testing, and the placement between generation tests and contamination tests is logical.

</blockquote></details>
<details>
<summary>tests/test_vllm_audio.py (5)</summary><blockquote>

`25-39`: **LGTM!**

The `test_audio_file_to_base64` test correctly validates the encoding roundtrip with proper temp file cleanup.

---

`42-48`: **LGTM!**

The fixture properly initializes `VLLMModel` with `data_dir` for relative path resolution in audio handling.

---

`51-66`: **LGTM!**

Good coverage of single audio content conversion with proper assertions on structure and data URL format.

---

`69-90`: **LGTM!**

Tests the multiple audio scenario correctly, verifying that each audio file results in a separate `audio_url` entry.

---

`93-120`: **LGTM!**

Comprehensive test that validates the entire request building flow including base64 encoding verification.

</blockquote></details>
<details>
<summary>nemo_skills/inference/generate.py (2)</summary><blockquote>

`532-544`: **LGTM!**

The `drop_binary_data` method correctly:
1. Guards against missing `messages` key
2. Guards against non-list content (string content in simple messages)
3. Filters out `audio_url` items to reduce output file size

The in-place mutation is appropriate here since the output dict is transient.

---

`561-562`: **LGTM!**

Correct placement of `drop_binary_data` call - after merging original data but before reasoning parsing, ensuring binary data doesn't interfere with text processing.

</blockquote></details>
<details>
<summary>tests/gpu-tests/test_vllm_audio.py (1)</summary><blockquote>

`38-59`: The hardcoded audio paths are part of the intended container setup and do not pose a risk. The audio files (`t2_16.wav` and `t3_16.wav`) exist in the repository at `tests/slurm-tests/asr_nim/wavs/`, and the `/nemo_run/code/` prefix is the standard container mount point path used consistently throughout the test suite. This is expected behavior for container-based tests.

</blockquote></details>
<details>
<summary>docs/evaluation/speech-audio.md (3)</summary><blockquote>

`3-16`: **Clear categorical introduction and server support overview.** The new structure effectively explains the scope (Audio understanding, ASR, AST) and provides a quick reference table for supported server types. Well organized.

---

`34-64`: **Data preparation section is clear and well-documented.** Explains the default behavior (audio download), provides explicit guidance on text-only mode with appropriate warnings, and includes environment variable and flag requirements. The distinction between modes and use cases is helpful.

---

`232-323`: **Results section is comprehensive and well-structured.** Directory structure, output format, and example metrics are clearly documented. The metrics shown across categories are internally consistent and demonstrate realistic variation (e.g., average tokens lower for open-ended than closed-form, appropriate num_entries for each category).

</blockquote></details>

</blockquote></details>

</details>


Comment on lines +54 to +57
    # Fallback: compute overall if missing or still 3.0
    if "overall" not in scores or scores["overall"] == 3.0:
        criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
        scores["overall"] = sum(criteria_scores) / len(criteria_scores)
Contributor


⚠️ Potential issue | 🟠 Major

Fallback logic incorrectly overwrites legitimate score of 3.0.

The condition scores["overall"] == 3.0 will overwrite a legitimately assigned overall score of 3.0 with the computed average. A score of 3.0 could be a valid LLM response.

Consider tracking whether the overall score was actually matched vs. defaulted:

     for criterion, pattern in patterns.items():
         match = re.search(pattern, judgement_text, re.IGNORECASE)
-        scores[criterion] = float(match.group(1)) if match else 3.0
+        if match:
+            scores[criterion] = float(match.group(1))
+        else:
+            scores[criterion] = 3.0
+            if criterion == "overall":
+                scores["_overall_missing"] = True

-    # Fallback: compute overall if missing or still 3.0
-    if "overall" not in scores or scores["overall"] == 3.0:
+    # Fallback: compute overall if missing
+    if scores.get("_overall_missing", False):
         criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
         scores["overall"] = sum(criteria_scores) / len(criteria_scores)
+        del scores["_overall_missing"]

     return scores

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
nemo_skills/evaluation/metrics/mmau_pro_metrics.py around lines 54-57: the
current fallback checks scores["overall"] == 3.0 and thus overwrites a
legitimately assigned 3.0; change the logic to only compute and set overall when
the key is absent or explicitly marked as a default (e.g. use None as the
default overall value or a separate flag like scores["_overall_defaulted"]), so
update callers to set scores["overall"]=None or
scores["_overall_defaulted"]=True when no real overall exists, and then replace
the condition with a check for key absence or None/flag before computing the
average from the four criteria.
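A runnable sketch of the suggested fix, using `None` as the sentinel so a legitimate 3.0 from the judge is never overwritten. The regex patterns are assumptions about the judge's output format, not the PR's exact implementation.

```python
import re

CRITERIA = ["correctness", "relevance", "completeness", "clarity"]


def extract_multicriteria_scores(judgement_text: str) -> dict:
    """Parse 1-5 scores; compute overall only when the judge omitted it."""
    scores = {}
    for criterion in CRITERIA + ["overall"]:
        match = re.search(rf"{criterion}\s*[:=]\s*([1-5](?:\.\d+)?)", judgement_text, re.IGNORECASE)
        scores[criterion] = float(match.group(1)) if match else None
    for criterion in CRITERIA:
        if scores[criterion] is None:
            scores[criterion] = 3.0  # neutral default for missing criteria
    if scores["overall"] is None:  # genuinely missing, not merely equal to 3.0
        scores["overall"] = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    return scores
```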

Comment on lines +122 to +131
        if "audio" in message:
            audio = message["audio"]
            base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
            audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
            message["content"].append(audio_message)
        elif "audios" in message:
            for audio in message["audios"]:
                base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
                audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
                message["content"].append(audio_message)
Contributor


⚠️ Potential issue | 🟠 Major

Absolute paths will bypass data_dir prefix.

When audio["path"] starts with / (as generated by prepare.py with paths like /dataset/mmau-pro/...), os.path.join ignores data_dir entirely. This means the data_dir parameter becomes ineffective for the absolute paths currently being generated.

Either:

  1. Store relative paths in prepare.py and rely on data_dir at inference time, or
  2. Remove data_dir handling if paths are always absolute.
         if "audio" in message:
             audio = message["audio"]
-            base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
+            audio_path = audio["path"]
+            if not os.path.isabs(audio_path):
+                audio_path = os.path.join(self.data_dir, audio_path)
+            base64_audio = audio_file_to_base64(audio_path)
             audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
             message["content"].append(audio_message)
         elif "audios" in message:
             for audio in message["audios"]:
-                base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
+                audio_path = audio["path"]
+                if not os.path.isabs(audio_path):
+                    audio_path = os.path.join(self.data_dir, audio_path)
+                base64_audio = audio_file_to_base64(audio_path)
                 audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
                 message["content"].append(audio_message)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        if "audio" in message:
-            audio = message["audio"]
-            base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
-            audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
-            message["content"].append(audio_message)
-        elif "audios" in message:
-            for audio in message["audios"]:
-                base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
-                audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
-                message["content"].append(audio_message)
+        if "audio" in message:
+            audio = message["audio"]
+            audio_path = audio["path"]
+            if not os.path.isabs(audio_path):
+                audio_path = os.path.join(self.data_dir, audio_path)
+            base64_audio = audio_file_to_base64(audio_path)
+            audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
+            message["content"].append(audio_message)
+        elif "audios" in message:
+            for audio in message["audios"]:
+                audio_path = audio["path"]
+                if not os.path.isabs(audio_path):
+                    audio_path = os.path.join(self.data_dir, audio_path)
+                base64_audio = audio_file_to_base64(audio_path)
+                audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
+                message["content"].append(audio_message)
🤖 Prompt for AI Agents
nemo_skills/inference/model/vllm.py around lines 122-131: the current
os.path.join(self.data_dir, audio["path"]) will ignore data_dir when
audio["path"] is absolute (starts with '/'), making data_dir ineffective; update
both branches to first check if audio["path"] is absolute (os.path.isabs) and
use it directly, otherwise join with self.data_dir, or alternatively normalize
by stripping a leading slash (audio["path"].lstrip("/")) before joining so
prepared absolute-looking paths become relative; apply the same change for the
single "audio" and the "audios" loop.
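The resolution rule suggested above can be factored into a small helper (the function name is illustrative):

```python
import os


def resolve_audio_path(data_dir: str, audio_path: str) -> str:
    """Use absolute paths as-is; join relative paths with data_dir."""
    if os.path.isabs(audio_path):
        return audio_path
    return os.path.join(data_dir, audio_path)
```

Calling this in both the single-`audio` and the `audios` branch avoids duplicating the `isabs` check.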

Collaborator

@gwarmstrong gwarmstrong left a comment


Have a few suggestions/questions for moving forward

class VLLMModel(BaseModel):
-    def __init__(self, **kwargs):
+    def __init__(self, data_dir: str = "", **kwargs):
+        self.data_dir = data_dir
Collaborator


Can we move this out of the VLLMModel class? E.g., we could format the audio file path around here: https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/generate.py#L579

Also, would it be safe to assume that paths are relative to the input_file? That would let us remove a parameter here.

Member

@Jorjeous Jorjeous Dec 10, 2025


Please check again; if this is not outdated, then no: we should keep it here, since we now need to chunk audio.

"extra_body": self._build_request_body(top_k, min_p, repetition_penalty, extra_body=extra_body),
}

def content_text_to_list(self, message):
Collaborator


Can any of this logic be used outside of vLLM? E.g., does it also work for OpenAI APIs?

Collaborator


Even if not, I think it may have more utility if we make a hook for it in generate.py rather than here

continue

# Filter out audio_url items from list-style content
message["content"] = [content for content in message["content"] if content.get("type") != "audio_url"]
Collaborator


Can you make "audio_url" here a list specified in the config? That way it's configurable, and other fields can be included/excluded if desired.
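A sketch of what a configurable version might look like (`drop_content_types` and the function name are illustrative, not existing parameters):

```python
def strip_content_types(messages, drop_content_types=("audio_url",)):
    # Remove list-style content items whose "type" is in drop_content_types;
    # plain string content is left untouched.
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            message["content"] = [
                item for item in content if item.get("type") not in drop_content_types
            ]
    return messages
```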

containers:
trtllm: nvcr.io/nvidia/tensorrt-llm/release:1.0.0
vllm: vllm/vllm-openai:v0.10.1.1
vllm-audio: nvcr.io/nvidian/ac-aiapps/vllm-openai-audio:v1.0.0
Collaborator


Is this org/image gated? That will cause a failure if it is used in CI. I don't actually see it being used, though; would it be used as the server container in the GPU tests?

Member


I am pretty sure this image is org-gated. Yes, I am using it in the GPU tests. What's the recommended workaround in this case?

Collaborator


This will be built into the base vLLM image now; no need to pull the nvcr container.

Member

@Jorjeous Jorjeous left a comment


In vllm.py there is a critical issue: the order in which the model receives the audio and the instruction.

Below is a corrected version of the function.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

clean up

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member

Added postprocessing specific to Qwen models; maybe we should add a check for the model type.

Jorjeous and others added 4 commits December 10, 2025 09:35
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
@melllinia melllinia force-pushed the audio_bin branch 2 times, most recently from 5e9cc99 to 62d5166 on December 16, 2025 10:43
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Collaborator

@gwarmstrong gwarmstrong left a comment


Please see #1137 for a suggested architectural change to help consolidate all the audio-specific logic and clean up the vLLM client here a little bit.

Signed-off-by: George Armstrong <georgea@nvidia.com>
@Jorjeous
Member

Issues with the task type: it seems we lost auto-enabling based on the presence of "data_dir". Working on it.

@Jorjeous
Member

About chunk size: let's leave it at 30s, as most models work fine with it, while a 60s default would significantly affect results.
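For illustration, a 30-second default amounts to fixed-length splitting of the decoded waveform along these lines (a sketch; the PR's actual chunking code is not shown here and may operate on files or bytes instead):

```python
def chunk_samples(samples, sample_rate, chunk_seconds=30):
    # Split a 1-D sequence of PCM samples into consecutive chunks of at most
    # chunk_seconds each; the last chunk may be shorter than the rest.
    chunk_len = int(sample_rate * chunk_seconds)
    return [samples[i : i + chunk_len] for i in range(0, len(samples), chunk_len)]
```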

1) After refactoring, AudioProcessor worked only if we passed the ++audio arg; this is inconvenient.

2) Now audio enabling is based on dataset_group=speechlm or task_type=audio.

All audio-containing benchmarks have one or both.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member

Tested on QWEN, LGTM

@Jorjeous Jorjeous self-requested a review December 23, 2025 16:13
@Jorjeous
Member

Jorjeous commented Jan 8, 2026

Need to fix the import of audio_file_to_base64.
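For context, the helper in question is presumably a thin base64 wrapper along these lines (a sketch; the real implementation and its module location are exactly what the import fix concerns):

```python
import base64


def audio_file_to_base64(audio_path: str) -> str:
    # Read the raw audio bytes and return them as a base64 string,
    # suitable for embedding in a data:audio/wav;base64,... URL.
    with open(audio_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```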

# Don't use /no_think for open-ended questions to allow reasoning
system_content = "You are a helpful assistant."
if category != "open":
system_content += " /no_think"
Collaborator


This is specific to Qwen models; I don't think we should be doing it globally here?

Member


Imho, this is not a problem for other models.

Collaborator


It certainly introduces a little mismatch, whether it's significant or not for accuracy is unclear, but I think it's much safer not to have it. What's the reason why we need to hardcode it here?
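One way to avoid the global behaviour would be to gate the suffix on the model family (a sketch; `build_system_content` and the `model_name` parameter are hypothetical names, not code from the PR):

```python
def build_system_content(category: str, model_name: str) -> str:
    # Only append the /no_think control token for Qwen-family models
    # and non-open categories; other models get the plain system prompt.
    system_content = "You are a helpful assistant."
    if category != "open" and "qwen" in model_name.lower():
        system_content += " /no_think"
    return system_content
```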


# Audio wrapper (preprocesses messages before they reach the model)
# Auto-enable for audio benchmarks on vLLM (eval_type=audio OR dataset_group=speechlm)
should_enable_audio = self.cfg.audio is not None or (
Collaborator


I'd rather configure this through prepare.py of the relevant benchmarks. Otherwise this logic is too coupled with the datasets, while this module is very general. So instead of dataset_group, I'd add a flag should_enable_audio that you can set as a default parameter in the init of those benchmarks. You can still check that vLLM is used as the server and keep the other wrapping logic; just let's remove the dataset_group-based behaviour.

Member


Got your point; when moving to a separate class in PR #1157 this will be auto-resolved.
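The suggestion above amounts to something like the following, with `should_enable_audio` set explicitly per benchmark instead of inferred from dataset_group (names are illustrative):

```python
# Hypothetical per-benchmark defaults, e.g. set in a dataset's __init__.py:
DEFAULT_BENCHMARK_ARGS = {"should_enable_audio": True}


def audio_enabled(cfg: dict) -> bool:
    # Enable audio preprocessing when explicitly requested via ++audio,
    # or when the benchmark opts in through its own flag.
    return cfg.get("audio") is not None or cfg.get("should_enable_audio", False)
```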

for output in outputs:
fout.write(json.dumps(output) + "\n")

def drop_binary_data(self, output):
Collaborator


Maybe just drop_data, or better drop_fields_from_messages or something like this, as it's not limited to binary data but just checks which fields to drop based on the parameter?

Member


Ok, why not

generation_args = f"{eval_args} {generation_args}"
generation_args += f" ++eval_config.split={split} "

# Pass dataset_group from benchmark config if defined
Collaborator


so I'd remove this as well

Member


No harm from one additional field, and it makes it a bit more convenient to separate datasets.
Still want to remove it?

@Jorjeous
Member

Jorjeous commented Jan 9, 2026

@Kipok
Moving this to #1157 with some arch changes; will address comments there.

@Kipok
Collaborator

Kipok commented Jan 10, 2026

should we close this one then @Jorjeous ?

@Jorjeous Jorjeous closed this Jan 10, 2026