
Audio OpenAI format #1144

Closed
Jorjeous wants to merge 43 commits into main from audio_openai_format

Conversation


@Jorjeous Jorjeous commented Dec 25, 2025

Builds on top of the vLLM audio support by adding OpenAI-API-format audio requests to the inference server.

Summary by CodeRabbit

  • New Features
    • Added audio preprocessing support for multimodal models with configurable audio formats
    • Introduced multi-criteria evaluation metrics for benchmark assessments (correctness, relevance, completeness, clarity)
    • Added tool schema customization capability
  • Bug Fixes
    • Updated evaluation configuration for MMAU-Pro dataset
    • Enhanced NVIDIA API key detection for inference endpoints
    • Improved multiple choice question formatting
  • Tests
    • Added audio generation integration tests


karpnv and others added 30 commits December 5, 2025 18:07
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Jorjeous and others added 12 commits December 10, 2025 09:07
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

clean up

Signed-off-by: George Armstrong <georgea@nvidia.com>
1) After refactoring, AudioProcessor only worked if the ++audio arg was passed, which is inconvenient.

2) Audio support is now enabled based on dataset_group=speechlm or task_type=audio.

All audio-containing benchmarks set one or both of these.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

coderabbitai bot commented Dec 25, 2025

📝 Walkthrough

This PR adds audio processing support to the inference pipeline, introduces multi-criteria scoring for MMAU-Pro benchmark evaluation, updates dataset configurations to use a new judge prompt (mmau-pro instead of speechlm), and refactors data preparation with improved choice formatting and system/user message structure.

Changes

Cohort / File(s) Summary
Dataset Configuration
nemo_skills/dataset/mmau-pro/closed_form/__init__.py, nemo_skills/dataset/mmau-pro/open_ended/__init__.py
Added EVAL_ARGS constant for closed-form; updated JUDGE_ARGS to reference new mmau-pro judge prompt instead of speechlm
Data Preparation
nemo_skills/dataset/mmau-pro/prepare.py
Refactored choice display format from "A. B" to "(A) B"; appended instruction text for multi-choice answers; reworked audio path handling to use absolute paths; introduced system/user message structure with conditional /no_think suffix for non-open questions
Evaluation Metrics
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
Added extract_multicriteria_scores() function to parse 1–5 scores for four criteria (correctness, relevance, completeness, clarity) plus overall; updated metrics computation to track per-criterion statistics and derive correctness from multicriteria scores for open-ended predictions
Judge Prompt Configuration
nemo_skills/prompt/config/judge/mmau-pro.yaml, nemo_skills/prompt/config/judge/speechlm.yaml
Added new mmau-pro judge prompt requiring multi-criteria scoring (1–5 scale) with justifications; removed deprecated speechlm prompt
Inference Core
nemo_skills/inference/generate.py
Added config fields for schema overrides, content type dropping, audio format, and dataset group; integrated AudioProcessor wrapper when audio processing is enabled; added audio field propagation to messages; introduced preprocess_prompt and drop_binary_data hooks
Inference Model Layer
nemo_skills/inference/model/__init__.py, nemo_skills/inference/model/audio_processor.py
Added get_audio_model() factory function; extended get_tool_calling_model() with schema_overrides support; introduced AudioProcessor class with chunking support for long audio, base64 encoding, and per-chunk generation aggregation
Model Implementation
nemo_skills/inference/model/openai.py, nemo_skills/inference/model/vllm.py
Narrowed NVIDIA API key condition to include inference-api.nvidia.com; removed explicit VLLMModel init and added docstring noting AudioProcessor wrapping for audio support
Pipeline Utilities
nemo_skills/pipeline/utils/eval.py
Added logic to propagate optional dataset_group from benchmark module into generation arguments
Tests
tests/gpu-tests/run_qwen.sh, tests/gpu-tests/test_vllm_audio.py, tests/test_vllm_audio.py
Added GPU-based audio generation test using Qwen2.5-Omni; added unit tests validating audio file encoding and message construction with audio content
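The audio content blocks referenced throughout these changes follow the OpenAI chat-completions convention for audio input. A minimal sketch of what such message construction looks like; the helper names and the audio-before-text ordering mirror the summaries above, but the repo's exact signatures may differ:

```python
def make_audio_content_block(audio_b64: str, audio_format: str = "wav") -> dict:
    # OpenAI chat-completions style audio content part: base64 data plus format
    return {
        "type": "input_audio",
        "input_audio": {"data": audio_b64, "format": audio_format},
    }


def build_messages(question: str, audio_b64: str) -> list[dict]:
    # Audio is placed before the text part; per the review notes below,
    # some models (e.g. Qwen Audio) require this ordering to transcribe correctly.
    return [
        {
            "role": "user",
            "content": [
                make_audio_content_block(audio_b64),
                {"type": "text", "text": question},
            ],
        }
    ]
```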

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client Code
    participant Gen as generate.py
    participant AP as AudioProcessor
    participant Base as Base LLM
    participant Backend as Backend Model

    Client->>Gen: process_single_datapoint(data_point)
    
    rect rgb(200, 230, 255)
    Note over Gen,AP: Audio Initialization
    Gen->>Gen: fill_prompt(data_point)
    Note over Gen: Extract audio field from data_point
    Gen->>Gen: preprocess_prompt(prompt)
    Gen->>AP: Initialize AudioProcessor if audio enabled
    AP->>AP: Infer data_dir from eval_type
    end
    
    rect rgb(230, 220, 255)
    Note over AP,Backend: Audio Processing with Chunking
    AP->>AP: Detect if audio needs chunking<br/>(duration > threshold)
    alt Chunking Enabled
        AP->>AP: chunk_audio(audio_array,<br/>sampling_rate, chunk_duration)
        loop For Each Chunk
            AP->>AP: save_audio_chunk_to_base64(chunk)
            AP->>AP: make_audio_content_block(base64)
            AP->>AP: Generate prompt for chunk
            AP->>Base: generate_async(prompt_chunk)
            Base->>Backend: inference
            Backend-->>Base: chunk_result
            Base-->>AP: chunk_result
            AP->>AP: Accumulate chunk_result
        end
        AP->>AP: Aggregate results<br/>metadata: num_chunks, duration
    else Single Audio
        AP->>AP: audio_file_to_base64(audio_path)
        AP->>AP: make_audio_content_block(base64)
        AP->>Base: generate_async(prompt)
        Base->>Backend: inference
        Backend-->>Base: response
        Base-->>AP: response
    end
    end
    
    rect rgb(230, 255, 230)
    Note over Gen,AP: Output Processing
    AP-->>Gen: final_output (combined)
    Gen->>Gen: postprocess_single_output(output)
    Gen->>Gen: drop_binary_data(output)<br/>Remove audio_url, input_audio
    Gen-->>Client: processed_output
    end
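The chunking branch of the diagram can be sketched roughly as follows; the `chunk_audio` helper, the 30-second threshold, and the space-joined aggregation are illustrative assumptions, not the PR's exact implementation:

```python
import asyncio

CHUNK_THRESHOLD_SEC = 30.0


def chunk_audio(audio, sampling_rate, chunk_duration_sec):
    # Split a 1-D sample sequence into fixed-duration chunks (last one may be shorter)
    samples_per_chunk = int(chunk_duration_sec * sampling_rate)
    return [audio[i:i + samples_per_chunk] for i in range(0, len(audio), samples_per_chunk)]


async def generate_with_chunking(llm, prompt, audio, sampling_rate):
    duration = len(audio) / sampling_rate
    if duration <= CHUNK_THRESHOLD_SEC:
        # Single-audio path: one inference call
        return await llm.generate_async(prompt, audio=audio)
    # Chunked path: generate per chunk, then aggregate with metadata
    results = []
    for chunk in chunk_audio(audio, sampling_rate, CHUNK_THRESHOLD_SEC):
        results.append(await llm.generate_async(prompt, audio=chunk))
    return {
        "generation": " ".join(r["generation"] for r in results),
        "num_chunks": len(results),
        "duration": duration,
    }
```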

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

run GPU tests

Suggested reviewers

  • gwarmstrong
  • melllinia

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 76.32% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'Audio OpenAI format' is vague and does not clearly describe the specific changes made. While audio-related changes are present, the title lacks clarity about whether it adds, modifies, or enhances audio support, making it difficult for teammates to understand the primary purpose from the title alone. Consider revising the title to be more specific and descriptive, such as 'Add OpenAI-compatible audio processing for vLLM inference' or 'Support audio input in OpenAI API-style requests to vLLM servers' to better convey the changeset's intent.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (3)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)

123-128: Rename unused loop variable per static analysis.

The agg_mode variable is not used within the loop body.

🔎 Proposed fix
-        for agg_mode, agg_metrics in metrics_dict.items():
+        for _agg_mode, agg_metrics in metrics_dict.items():
nemo_skills/inference/model/audio_processor.py (2)

199-204: Fix implicit Optional type hints.

Per PEP 484, use explicit T | None instead of = None default without the type annotation.

🔎 Proposed fix
     async def generate_async(
         self,
         prompt: str | list[dict] | None = None,
-        task_type: str = None,
+        task_type: str | None = None,
         **kwargs,
     ) -> dict:

353-384: Rename unused loop variable.

The chunk_idx variable is not used within the loop.

🔎 Proposed fix
-        for chunk_idx, audio_chunk in enumerate(chunks):
+        for _chunk_idx, audio_chunk in enumerate(chunks):

Alternatively, if chunk indexing might be useful for debugging, consider adding it to the log:

-        for chunk_idx, audio_chunk in enumerate(chunks):
+        for chunk_idx, audio_chunk in enumerate(chunks):
             chunk_messages = []
+            LOG.debug("Processing chunk %d/%d", chunk_idx + 1, len(chunks))
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e85e995 and 71db50f.

📒 Files selected for processing (15)
  • nemo_skills/dataset/mmau-pro/closed_form/__init__.py
  • nemo_skills/dataset/mmau-pro/open_ended/__init__.py
  • nemo_skills/dataset/mmau-pro/prepare.py
  • nemo_skills/evaluation/metrics/mmau_pro_metrics.py
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/model/__init__.py
  • nemo_skills/inference/model/audio_processor.py
  • nemo_skills/inference/model/openai.py
  • nemo_skills/inference/model/vllm.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/prompt/config/judge/mmau-pro.yaml
  • nemo_skills/prompt/config/judge/speechlm.yaml
  • tests/gpu-tests/run_qwen.sh
  • tests/gpu-tests/test_vllm_audio.py
  • tests/test_vllm_audio.py
💤 Files with no reviewable changes (1)
  • nemo_skills/prompt/config/judge/speechlm.yaml
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.

Applied to files:

  • nemo_skills/prompt/config/judge/mmau-pro.yaml
🧬 Code graph analysis (5)
tests/gpu-tests/test_vllm_audio.py (2)
tests/gpu-tests/utils.py (1)
  • require_env_var (18-23)
nemo_skills/pipeline/utils/declarative.py (1)
  • run (346-483)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py (2)
nemo_skills/evaluation/metrics/base.py (2)
  • as_int (443-446)
  • as_percentage (437-440)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
nemo_skills/inference/generate.py (4)
nemo_skills/inference/model/audio_processor.py (2)
  • AudioProcessor (144-406)
  • AudioProcessorConfig (34-51)
nemo_skills/inference/chat_interface/core.py (3)
  • cfg (181-182)
  • sandbox (177-178)
  • get (136-151)
nemo_skills/inference/model/__init__.py (1)
  • get_model (75-95)
recipes/asr_tts/riva_generate.py (1)
  • fill_prompt (130-134)
tests/test_vllm_audio.py (2)
nemo_skills/inference/model/vllm.py (1)
  • VLLMModel (27-151)
nemo_skills/inference/model/audio_processor.py (1)
  • audio_file_to_base64 (54-58)
nemo_skills/inference/model/__init__.py (1)
nemo_skills/inference/model/audio_processor.py (2)
  • AudioProcessor (144-406)
  • AudioProcessorConfig (34-51)
🪛 Ruff (0.14.10)
tests/gpu-tests/test_vllm_audio.py

31-31: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"

(S108)


66-66: subprocess call with shell=True identified, security issue

(S602)

nemo_skills/evaluation/metrics/mmau_pro_metrics.py

123-123: Loop control variable agg_mode not used within loop body

Rename unused agg_mode to _agg_mode

(B007)

nemo_skills/inference/generate.py

578-578: Unused method argument: data_points

(ARG002)

nemo_skills/inference/model/audio_processor.py

202-202: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


265-265: Avoid specifying long messages outside the exception class

(TRY003)


289-289: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


320-321: try-except-pass detected, consider logging the exception

(S110)


320-320: Do not catch blind exception: Exception

(BLE001)


353-353: Loop control variable chunk_idx not used within loop body

Rename unused chunk_idx to _chunk_idx

(B007)


369-371: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

tests/test_vllm_audio.py

45-45: Unused lambda argument: self

(ARG005)


45-45: Unused lambda argument: kwargs

(ARG005)

🔇 Additional comments (25)
nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1)

19-19: LGTM!

The EVAL_ARGS constant addition is consistent with the module's pattern and aligns with the MMAU-Pro evaluation flow introduced across the repository.

nemo_skills/inference/model/openai.py (1)

51-51: LGTM!

The extension to support "inference-api.nvidia.com" alongside "api.nvidia.com" for NVIDIA API key detection is appropriate and maintains the same security posture.

tests/gpu-tests/run_qwen.sh (1)

23-25: LGTM!

The audio test workflow addition follows the established pattern in the file and appropriately sets the model for Qwen2.5-Omni audio testing.

nemo_skills/dataset/mmau-pro/open_ended/__init__.py (1)

26-26: LGTM!

The judge prompt configuration update from speechlm to mmau-pro aligns with the new multi-criteria evaluation approach introduced in this PR.

nemo_skills/pipeline/utils/eval.py (1)

134-138: LGTM!

The dataset_group propagation follows the established pattern in this function and enables proper audio dataset handling in the generation pipeline.

tests/test_vllm_audio.py (4)

25-39: LGTM!

The test correctly validates base64 encoding and decoding round-trip for audio files, with proper temporary file cleanup.


42-48: LGTM!

The mock fixture appropriately bypasses VLLMModel initialization for isolated testing of audio preprocessing logic. The unused lambda arguments flagged by static analysis are intentional for the mocking pattern.


51-72: LGTM!

The test properly validates that audio content is prepended before text content in the message list, which is critical for Qwen Audio models to transcribe correctly.


74-101: LGTM!

The test correctly validates handling of multiple audio files, ensuring all audio entries precede the text content as required by Qwen Audio.

nemo_skills/prompt/config/judge/mmau-pro.yaml (1)

1-30: LGTM!

The multi-criteria judge prompt is well-structured with clear evaluation criteria and output format requirements. The 4-criteria evaluation (Correctness, Relevance, Completeness, Clarity) with 1-5 scoring and brief justifications provides comprehensive assessment for MMAU-Pro open-ended responses.

nemo_skills/inference/model/vllm.py (1)

28-32: LGTM!

The added docstring clearly describes the VLLMModel's purpose and appropriately documents that audio processing capabilities are provided through the AudioProcessor wrapper, maintaining clear separation of concerns.

nemo_skills/inference/model/__init__.py (2)

98-123: Clean factory function for audio model wrapping.

The get_audio_model function follows the established factory pattern and correctly handles the three cases for audio_config (None, dict, or AudioProcessorConfig).


161-178: LGTM on schema_overrides addition.

The schema_overrides parameter is properly threaded through to ToolCallingWrapper.

nemo_skills/dataset/mmau-pro/prepare.py (1)

93-98: System message structure looks good.

The /no_think directive for non-open categories is a reasonable approach to control model reasoning behavior. The message structure with system + user messages follows OpenAI conventions.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)

130-161: Multicriteria aggregation logic is well-structured.

The conversion from 1-5 scale to 0-100 percentage is clear, and the good/poor response rate thresholds (≥4.0 and ≤2.0) are reasonable. The rounding at the end ensures consistent output formatting.
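The scale conversion and the good/poor thresholds mentioned above can be sketched as follows; the function names are hypothetical, not the repo's:

```python
def scores_to_percentages(scores: dict[str, float]) -> dict[str, float]:
    # Map a 1-5 judge score onto 0-100: 1 -> 0, 5 -> 100
    return {k: (v - 1.0) / 4.0 * 100.0 for k, v in scores.items()}


def response_quality(overall: float) -> str:
    # Thresholds from the review: >= 4.0 is a good response, <= 2.0 is poor
    if overall >= 4.0:
        return "good"
    if overall <= 2.0:
        return "poor"
    return "mixed"
```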

nemo_skills/inference/model/audio_processor.py (3)

54-58: LGTM on base64 encoding utility.

Simple and correct implementation for audio file encoding.
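For reference, such a utility is typically just the following; this is a sketch, and the repo's `audio_file_to_base64` may differ in details:

```python
import base64


def audio_file_to_base64(path: str) -> str:
    # Read the raw audio bytes and return them as a base64 string
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```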


112-141: Good use of tempfile with proper cleanup.

The save_audio_chunk_to_base64 function correctly uses a temporary file with cleanup in a finally block. The check for file existence before unlinking is a good defensive practice.
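The pattern described here, temp file, encode, cleanup in a `finally` block, looks roughly like this sketch; the stdlib `wave` module and 16-bit mono PCM are assumptions, not necessarily what the repo uses:

```python
import base64
import os
import tempfile
import wave


def save_audio_chunk_to_base64(samples: bytes, sampling_rate: int) -> str:
    # Write 16-bit mono PCM samples to a temp WAV file, encode it, then clean up
    fd, tmp_path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        with wave.open(tmp_path, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)  # 16-bit
            wf.setframerate(sampling_rate)
            wf.writeframes(samples)
        with open(tmp_path, "rb") as f:
            return base64.b64encode(f.read()).decode("utf-8")
    finally:
        # Defensive existence check before unlinking, as noted in the review
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
```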


241-287: Audio message preparation handles ordering correctly.

The comment on line 247 about Qwen models requiring audio before text is important, and the implementation correctly places audio items before text content. Good documentation of this requirement.

nemo_skills/inference/generate.py (7)

40-49: LGTM on new imports.

The AudioProcessor and AudioProcessorConfig imports are correctly added to support the audio processing feature.


193-204: New audio-related config fields are well-documented.

The drop_content_types, audio_format, and dataset_group fields provide good flexibility for audio processing configuration. The defaults are sensible.


559-567: Audio field propagation in fill_prompt.

The code correctly copies the audio field from data_point to the user message. Note that this only handles the singular audio key; the audios (plural) case is not handled here, but that appears to be intentional based on the AudioProcessor handling both cases.


582-596: drop_binary_data implementation is correct.

The method properly handles edge cases (missing messages, non-list content) and filters content types based on configuration. The in-place mutation pattern is consistent with postprocess_single_output.


578-580: The data_points parameter is kept for subclass compatibility.

The static analysis flags data_points as unused, but this is intentional - the method signature must remain consistent for subclasses that may override dump_outputs and need access to data_points. This is a false positive.


646-658: Prompt preprocessing and task_type handling look correct.

The preprocessing hook allows for custom logic in subclasses, and the task_type is correctly passed to AudioProcessor when available.


409-431: AudioProcessor wrapping logic is well-implemented and appropriately specific.

The auto-enabling logic based on server_type and eval_type/dataset_group is clear, with the dual condition providing good defensive coverage. The logging provides good visibility into when audio processing is enabled. The current condition self.cfg.eval_type == "audio" is appropriately specific to the single audio evaluation type used in the codebase; different audio evaluation tasks are handled within the AudioEvaluator configuration rather than through separate eval_type values.

Comment on lines 85 to +92
if entry.get("audio_path"):
audio_path = entry["audio_path"]

if isinstance(audio_path, list) and audio_path:
user_message["audios"] = [{"path": path, "duration": 10.0} for path in audio_path]
elif isinstance(audio_path, str):
user_message["audio"] = {"path": audio_path, "duration": 10.0}

formatted_entry["messages"] = [user_message]
# Prepend /dataset/mmau-pro/ to make paths absolute for cluster
if len(audio_path) == 1:
user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
else:
user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]


⚠️ Potential issue | 🟡 Minor

Potential IndexError if audio_path is an empty list.

The code checks if entry.get("audio_path") which is truthy for a non-empty list, but if audio_path could be an empty list [], accessing audio_path[0] would fail. While likely not an issue with the dataset, consider adding explicit length check for robustness.

🔎 Suggested defensive check
     if entry.get("audio_path"):
         audio_path = entry["audio_path"]
         # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
-        if len(audio_path) == 1:
+        if audio_path and len(audio_path) == 1:
             user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
-        else:
+        elif audio_path:
             user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]
🤖 Prompt for AI Agents
In nemo_skills/dataset/mmau-pro/prepare.py around lines 85 to 92, the code
assumes audio_path[0] exists when entry.get("audio_path") is truthy, which can
raise IndexError for an empty list; update the logic to explicitly check the
length before indexing—if audio_path is a non-empty list use the existing
single/multiple-path handling, and if it's empty skip adding "audio"/"audios"
(or set to an empty list) to user_message to handle the empty-list case safely.

Comment on lines +54 to +58
# Fallback: compute overall if missing or still 3.0
if "overall" not in scores or scores["overall"] == 3.0:
criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
scores["overall"] = sum(criteria_scores) / len(criteria_scores)


⚠️ Potential issue | 🟡 Minor

Fallback logic may incorrectly overwrite a valid "overall" score of exactly 3.0.

The condition scores["overall"] == 3.0 will trigger the fallback even when the LLM judge explicitly rated "overall" as 3.0. This could lead to inconsistent behavior.

🔎 Suggested fix using a flag to track if score was parsed
+    overall_parsed = False
     for criterion, pattern in patterns.items():
         match = re.search(pattern, judgement_text, re.IGNORECASE)
-        scores[criterion] = float(match.group(1)) if match else 3.0
+        if match:
+            scores[criterion] = float(match.group(1))
+            if criterion == "overall":
+                overall_parsed = True
+        else:
+            scores[criterion] = 3.0

     # Fallback: compute overall if missing or still 3.0
-    if "overall" not in scores or scores["overall"] == 3.0:
+    if not overall_parsed:
         criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
         scores["overall"] = sum(criteria_scores) / len(criteria_scores)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In nemo_skills/evaluation/metrics/mmau_pro_metrics.py around lines 54-58, the
fallback currently triggers when scores["overall"] == 3.0 which incorrectly
overrides an explicitly parsed overall score of 3.0; change the logic to only
compute the fallback when the overall score is absent or when a parsing flag
indicates overall was not successfully parsed—introduce a boolean (e.g.,
overall_parsed) set true when you successfully parse a provided overall value,
and replace the condition with if "overall" not in scores or not overall_parsed
then compute overall as the mean of the criteria.
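Putting the suggested `overall_parsed` flag together, a self-contained sketch of such a parser might look like this; the `Criterion: N` line format and the regex are assumptions about the judge output, not the PR's exact code:

```python
import re

CRITERIA = ["correctness", "relevance", "completeness", "clarity", "overall"]


def extract_multicriteria_scores(judgement_text: str) -> dict[str, float]:
    # Parse "Correctness: 4"-style lines; default to 3.0 when a criterion is missing
    scores = {}
    overall_parsed = False
    for criterion in CRITERIA:
        match = re.search(
            rf"{criterion}\s*[:\-]\s*([1-5](?:\.\d+)?)", judgement_text, re.IGNORECASE
        )
        if match:
            scores[criterion] = float(match.group(1))
            if criterion == "overall":
                overall_parsed = True
        else:
            scores[criterion] = 3.0
    # Only fall back to the mean when "overall" was genuinely not parsed,
    # so an explicit 3.0 from the judge is preserved
    if not overall_parsed:
        crit = [scores[k] for k in CRITERIA[:-1]]
        scores["overall"] = sum(crit) / len(crit)
    return scores
```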

Comment on lines +289 to +322
def _check_chunking_needed(self, messages: list[dict], task_type: str = None) -> tuple[bool, str, float]:
"""Check if audio in messages needs chunking.

Returns:
Tuple of (needs_chunking, audio_path, duration)
"""
if not self.config.enable_chunking:
return False, None, 0.0

# Check if task type should be chunked (if filter is specified)
if self.config.chunk_task_types is not None:
if task_type not in self.config.chunk_task_types:
return False, None, 0.0

# Find audio in messages
for msg in messages:
if msg.get("role") == "user":
audio_info = msg.get("audio") or (msg.get("audios", [{}])[0] if msg.get("audios") else {})
if audio_info and "path" in audio_info:
audio_path = os.path.join(self.data_dir, audio_info["path"])

if not os.path.exists(audio_path):
return False, None, 0.0

# Load audio to check duration
try:
audio_array, sampling_rate = load_audio_file(audio_path)
duration = len(audio_array) / sampling_rate

if duration > self.config.chunk_threshold_sec:
return True, audio_path, duration
except Exception:
pass


⚠️ Potential issue | 🟡 Minor

Bare exception catch silently swallows errors.

The except Exception: pass at lines 320-321 hides potential issues with audio file loading. Consider logging the error or at least handling specific exceptions.

🔎 Suggested improvement with logging
-    def _check_chunking_needed(self, messages: list[dict], task_type: str = None) -> tuple[bool, str, float]:
+    def _check_chunking_needed(self, messages: list[dict], task_type: str | None = None) -> tuple[bool, str, float]:
         """Check if audio in messages needs chunking.

         Returns:
             Tuple of (needs_chunking, audio_path, duration)
         """
         if not self.config.enable_chunking:
             return False, None, 0.0

         # Check if task type should be chunked (if filter is specified)
         if self.config.chunk_task_types is not None:
             if task_type not in self.config.chunk_task_types:
                 return False, None, 0.0

         # Find audio in messages
         for msg in messages:
             if msg.get("role") == "user":
                 audio_info = msg.get("audio") or (msg.get("audios", [{}])[0] if msg.get("audios") else {})
                 if audio_info and "path" in audio_info:
                     audio_path = os.path.join(self.data_dir, audio_info["path"])

                     if not os.path.exists(audio_path):
                         return False, None, 0.0

                     # Load audio to check duration
                     try:
                         audio_array, sampling_rate = load_audio_file(audio_path)
                         duration = len(audio_array) / sampling_rate

                         if duration > self.config.chunk_threshold_sec:
                             return True, audio_path, duration
-                    except Exception:
-                        pass
+                    except Exception as e:
+                        LOG.debug("Failed to load audio for chunking check: %s - %s", audio_path, e)

         return False, None, 0.0

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.10)

289-289: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


320-321: try-except-pass detected, consider logging the exception

(S110)


320-320: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In nemo_skills/inference/model/audio_processor.py around lines 289 to 322, the
bare `except Exception: pass` swallows errors when loading audio; replace it
with targeted exception handling and logging: catch specific exceptions (e.g.,
FileNotFoundError, OSError, any codec/format error your audio loader raises) and
log a clear error message including the audio_path and exception details via the
class logger (or a provided logger) so failures are visible, then continue
processing (return False, None, 0.0 or skip to next message) rather than
silently ignoring all exceptions.

Comment on lines +25 to +83
@pytest.mark.gpu
def test_vllm_audio_generation():
    """Integration test: Generate with vLLM server using audio input."""
    model_path = require_env_var("NEMO_SKILLS_TEST_HF_MODEL")
    model_type = require_env_var("NEMO_SKILLS_TEST_MODEL_TYPE")

    output_dir = f"/tmp/nemo-skills-tests/{model_type}/vllm-audio-generation"
    # Clean up output directory
    if Path(output_dir).exists():
        shutil.rmtree(output_dir)

    # Create test input file with audio
    with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f:
        test_data = [
            {
                "question": "Transcribe this audio",
                "audio": {"path": "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t2_16.wav"},
            },
            {
                "question": "What is in this audio?",
                "audio": {"path": "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t3_16.wav"},
            },
        ]
        for item in test_data:
            f.write(json.dumps(item) + "\n")
        input_file = f.name

    try:
        cmd = (
            f"ns generate "
            f" --cluster test-local --config_dir {Path(__file__).absolute().parent} "
            f" --model {model_path} "
            f" --output_dir {output_dir} "
            f" --server_type vllm "
            f" --server_gpus 1 "
            f" --server_nodes 1 "
            f" --server_args '--enforce-eager' "
            f" --input_file={input_file} "
            f" ++prompt_config=generic/default "
            f" ++skip_filled=False "
        )
        subprocess.run(cmd, shell=True, check=True)

        # Verify output exists and has audio-related generation
        with open(f"{output_dir}/output.jsonl") as fin:
            lines = fin.readlines()

        assert len(lines) == 2, "Should have 2 output lines"

        for line in lines:
            data = json.loads(line)
            assert "generation" in data, "Should have generation field"
            assert len(data["generation"]) > 0, "Generation should not be empty"
            # If model supports audio, generation should contain something
            print(f"Generated: {data['generation']}")

    finally:
        # Cleanup temp file
        Path(input_file).unlink(missing_ok=True)
Copy link
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the audio files are referenced in the codebase
rg -n "t2_16\.wav|t3_16\.wav" --type py

Repository: NVIDIA-NeMo/Skills

Length of output: 325


🏁 Script executed:

# Check if the audio directory structure exists
find . -type f -name "*.wav" 2>/dev/null | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 147


🏁 Script executed:

# Check the actual test file to verify the line numbers and content match the review
wc -l tests/gpu-tests/test_vllm_audio.py 2>/dev/null || echo "File not found in expected location"
git ls-files | grep -E "test.*audio|audio.*test" | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 159


🏁 Script executed:

# Look for any test data or fixtures directories
find . -type d -name "wavs" -o -type d -name "test_data" -o -type d -name "fixtures" 2>/dev/null | grep -v ".git"

Repository: NVIDIA-NeMo/Skills

Length of output: 94


🏁 Script executed:

# Check how other tests reference test data/fixtures
rg -B 2 -A 2 "tests/slurm-tests|tests/gpu-tests" --type py | head -50

Repository: NVIDIA-NeMo/Skills

Length of output: 3457


🏁 Script executed:

# Look for any pattern of dynamic path resolution in tests
rg -n "Path(__file__)" tests/ --type py | head -10

Repository: NVIDIA-NeMo/Skills

Length of output: 44


🏁 Script executed:

# Check if there's a pattern for base path configuration or environment variables
rg -n "nemo_run|NEMO.*PATH|TEST.*PATH" tests/ --type py | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 2148


🏁 Script executed:

# Look at the test file itself more closely, especially around path handling
cat -n tests/gpu-tests/test_vllm_audio.py | head -50

Repository: NVIDIA-NeMo/Skills

Length of output: 2201


The integration test follows established codebase patterns, with room for improvement.

The test structure is well-organized with proper cleanup in the finally block. The hardcoded /nemo_run/code/ paths follow the pattern used throughout the test suite for containerized/cluster environments, though they do limit portability to local development setups.

Consider these enhancements:

  1. Hardcoded audio paths (lines 41, 45): While this pattern is consistent with other tests in the codebase, consider using an environment variable or Path(__file__).parent to make the test work in both cluster and local development environments.

  2. Test assertions are minimal: The test only verifies that output exists and has non-empty generations. Consider adding assertions for audio-specific metadata to better validate audio processing.

  3. No negative test cases: Consider adding tests for missing audio files or invalid audio formats to ensure proper error handling.

🧰 Tools
🪛 Ruff (0.14.10)

31-31: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"

(S108)


66-66: subprocess call with shell=True identified, security issue

(S602)

🤖 Prompt for AI Agents
In tests/gpu-tests/test_vllm_audio.py around lines 25-83, the test uses
hardcoded container paths for audio files and has minimal assertions; update it
to obtain audio file locations from an environment variable (e.g.,
NEMO_TEST_ASSETS) or relative to Path(__file__).parent so it works both in
CI/container and local dev, replace the literal "/nemo_run/code/..." paths with
a resolved Path built from that env var or the test file's parent, strengthen
assertions to check audio-related metadata in each output (e.g., presence of
audio tokens/flags, expected transcript substrings or an audio_processing_status
field) and add at least one negative subtest or input case that uses a
missing/invalid audio path and asserts the tool returns a handled error or
specific failure field; keep the existing cleanup logic for the temporary input
file.

@vmendelev
Copy link
Collaborator

I thought we agreed to move audio-related stuff into another class (vllm_multimodal)? https://nvidia.slack.com/archives/C0A0VRM2S5A/p1766173279180119

@Kipok
Copy link
Collaborator

Kipok commented Jan 13, 2026

@Jorjeous does this overlap with #1157? Should we close this one, or do we need both?

@Jorjeous
Copy link
Member Author

Hi @Kipok
No, this is not overlapping; this is about the usage of the inference server (just the accepted request format).
The current form is built on top of the previous way of processing, without vllm multimodal.

Yes, we won't merge this one and will create just a tiny PR after merging #1157

@Jorjeous
Copy link
Member Author

Closing, on top of #1157

@Jorjeous Jorjeous closed this Jan 13, 2026