
Audio OpenAI format #1144

Closed
Jorjeous wants to merge 43 commits into main from audio_openai_format

Conversation


@Jorjeous Jorjeous commented Dec 25, 2025

Builds on top of the vLLM audio support by adding OpenAI-API-format audio requests to the inference server.

Summary by CodeRabbit

  • New Features
    • Added audio preprocessing support for multimodal models with configurable audio formats
    • Introduced multi-criteria evaluation metrics for benchmark assessments (correctness, relevance, completeness, clarity)
    • Added tool schema customization capability
  • Bug Fixes
    • Updated evaluation configuration for MMAU-Pro dataset
    • Enhanced NVIDIA API key detection for inference endpoints
    • Improved multiple choice question formatting
  • Tests
    • Added audio generation integration tests


karpnv and others added 30 commits December 5, 2025 18:07
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Jorjeous and others added 12 commits December 10, 2025 09:07
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

clean up

Signed-off-by: George Armstrong <georgea@nvidia.com>
1) After refactoring, AudioProcessor only worked if the ++audio arg was passed, which is inconvenient.

2) Audio support is now enabled based on dataset_group=speechlm or task_type=audio.

All audio-containing benchmarks set one or both of these.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

coderabbitai bot commented Dec 25, 2025

📝 Walkthrough

This PR adds audio processing support to the inference pipeline, introduces multi-criteria scoring for MMAU-Pro benchmark evaluation, updates dataset configurations to use a new judge prompt (mmau-pro instead of speechlm), and refactors data preparation with improved choice formatting and system/user message structure.

Changes

Cohort / File(s) Summary
Dataset Configuration
nemo_skills/dataset/mmau-pro/closed_form/__init__.py, nemo_skills/dataset/mmau-pro/open_ended/__init__.py
Added EVAL_ARGS constant for closed-form; updated JUDGE_ARGS to reference new mmau-pro judge prompt instead of speechlm
Data Preparation
nemo_skills/dataset/mmau-pro/prepare.py
Refactored choice display format from "A. B" to "(A) B"; appended instruction text for multi-choice answers; reworked audio path handling to use absolute paths; introduced system/user message structure with conditional /no_think suffix for non-open questions
Evaluation Metrics
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
Added extract_multicriteria_scores() function to parse 1–5 scores for four criteria (correctness, relevance, completeness, clarity) plus overall; updated metrics computation to track per-criterion statistics and derive correctness from multicriteria scores for open-ended predictions
Judge Prompt Configuration
nemo_skills/prompt/config/judge/mmau-pro.yaml, nemo_skills/prompt/config/judge/speechlm.yaml
Added new mmau-pro judge prompt requiring multi-criteria scoring (1–5 scale) with justifications; removed deprecated speechlm prompt
Inference Core
nemo_skills/inference/generate.py
Added config fields for schema overrides, content type dropping, audio format, and dataset group; integrated AudioProcessor wrapper when audio processing is enabled; added audio field propagation to messages; introduced preprocess_prompt and drop_binary_data hooks
Inference Model Layer
nemo_skills/inference/model/__init__.py, nemo_skills/inference/model/audio_processor.py
Added get_audio_model() factory function; extended get_tool_calling_model() with schema_overrides support; introduced AudioProcessor class with chunking support for long audio, base64 encoding, and per-chunk generation aggregation
Model Implementation
nemo_skills/inference/model/openai.py, nemo_skills/inference/model/vllm.py
Narrowed NVIDIA API key condition to include inference-api.nvidia.com; removed explicit VLLMModel init and added docstring noting AudioProcessor wrapping for audio support
Pipeline Utilities
nemo_skills/pipeline/utils/eval.py
Added logic to propagate optional dataset_group from benchmark module into generation arguments
Tests
tests/gpu-tests/run_qwen.sh, tests/gpu-tests/test_vllm_audio.py, tests/test_vllm_audio.py
Added GPU-based audio generation test using Qwen2.5-Omni; added unit tests validating audio file encoding and message construction with audio content
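The audio content blocks referenced throughout these changes follow the OpenAI chat-completions convention for audio input. A minimal sketch of what such message construction looks like; the helper names and the audio-before-text ordering mirror the summaries above, but the repo's exact signatures may differ:

```python
def make_audio_content_block(audio_b64: str, audio_format: str = "wav") -> dict:
    # OpenAI chat-completions style audio content part: base64 data plus format
    return {
        "type": "input_audio",
        "input_audio": {"data": audio_b64, "format": audio_format},
    }


def build_messages(question: str, audio_b64: str) -> list[dict]:
    # Audio is placed before the text part; per the review notes below,
    # some models (e.g. Qwen Audio) require this ordering to transcribe correctly.
    return [
        {
            "role": "user",
            "content": [
                make_audio_content_block(audio_b64),
                {"type": "text", "text": question},
            ],
        }
    ]
```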

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client Code
    participant Gen as generate.py
    participant AP as AudioProcessor
    participant Base as Base LLM
    participant Backend as Backend Model

    Client->>Gen: process_single_datapoint(data_point)
    
    rect rgb(200, 230, 255)
    Note over Gen,AP: Audio Initialization
    Gen->>Gen: fill_prompt(data_point)
    Note over Gen: Extract audio field from data_point
    Gen->>Gen: preprocess_prompt(prompt)
    Gen->>AP: Initialize AudioProcessor if audio enabled
    AP->>AP: Infer data_dir from eval_type
    end
    
    rect rgb(230, 220, 255)
    Note over AP,Backend: Audio Processing with Chunking
    AP->>AP: Detect if audio needs chunking<br/>(duration > threshold)
    alt Chunking Enabled
        AP->>AP: chunk_audio(audio_array,<br/>sampling_rate, chunk_duration)
        loop For Each Chunk
            AP->>AP: save_audio_chunk_to_base64(chunk)
            AP->>AP: make_audio_content_block(base64)
            AP->>AP: Generate prompt for chunk
            AP->>Base: generate_async(prompt_chunk)
            Base->>Backend: inference
            Backend-->>Base: chunk_result
            Base-->>AP: chunk_result
            AP->>AP: Accumulate chunk_result
        end
        AP->>AP: Aggregate results<br/>metadata: num_chunks, duration
    else Single Audio
        AP->>AP: audio_file_to_base64(audio_path)
        AP->>AP: make_audio_content_block(base64)
        AP->>Base: generate_async(prompt)
        Base->>Backend: inference
        Backend-->>Base: response
        Base-->>AP: response
    end
    end
    
    rect rgb(230, 255, 230)
    Note over Gen,AP: Output Processing
    AP-->>Gen: final_output (combined)
    Gen->>Gen: postprocess_single_output(output)
    Gen->>Gen: drop_binary_data(output)<br/>Remove audio_url, input_audio
    Gen-->>Client: processed_output
    end
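The chunking branch of the diagram can be sketched roughly as follows; the `chunk_audio` helper, the 30-second threshold, and the space-joined aggregation are illustrative assumptions, not the PR's exact implementation:

```python
import asyncio

CHUNK_THRESHOLD_SEC = 30.0


def chunk_audio(audio, sampling_rate, chunk_duration_sec):
    # Split a 1-D sample sequence into fixed-duration chunks (last one may be shorter)
    samples_per_chunk = int(chunk_duration_sec * sampling_rate)
    return [audio[i:i + samples_per_chunk] for i in range(0, len(audio), samples_per_chunk)]


async def generate_with_chunking(llm, prompt, audio, sampling_rate):
    duration = len(audio) / sampling_rate
    if duration <= CHUNK_THRESHOLD_SEC:
        # Single-audio path: one inference call
        return await llm.generate_async(prompt, audio=audio)
    # Chunked path: generate per chunk, then aggregate with metadata
    results = []
    for chunk in chunk_audio(audio, sampling_rate, CHUNK_THRESHOLD_SEC):
        results.append(await llm.generate_async(prompt, audio=chunk))
    return {
        "generation": " ".join(r["generation"] for r in results),
        "num_chunks": len(results),
        "duration": duration,
    }
```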

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

run GPU tests

Suggested reviewers

  • gwarmstrong
  • melllinia

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 76.32% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'Audio OpenAI format' is vague and does not clearly describe the specific changes made. While audio-related changes are present, the title lacks clarity about whether it adds, modifies, or enhances audio support, making it difficult for teammates to understand the primary purpose from the title alone. Consider revising the title to be more specific and descriptive, such as 'Add OpenAI-compatible audio processing for vLLM inference' or 'Support audio input in OpenAI API-style requests to vLLM servers' to better convey the changeset's intent.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (3)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)

123-128: Rename unused loop variable per static analysis.

The agg_mode variable is not used within the loop body.

🔎 Proposed fix
-        for agg_mode, agg_metrics in metrics_dict.items():
+        for _agg_mode, agg_metrics in metrics_dict.items():
nemo_skills/inference/model/audio_processor.py (2)

199-204: Fix implicit Optional type hints.

Per PEP 484, use explicit T | None instead of = None default without the type annotation.

🔎 Proposed fix
     async def generate_async(
         self,
         prompt: str | list[dict] | None = None,
-        task_type: str = None,
+        task_type: str | None = None,
         **kwargs,
     ) -> dict:

353-384: Rename unused loop variable.

The chunk_idx variable is not used within the loop.

🔎 Proposed fix
-        for chunk_idx, audio_chunk in enumerate(chunks):
+        for _chunk_idx, audio_chunk in enumerate(chunks):

Alternatively, if chunk indexing might be useful for debugging, consider adding it to the log:

-        for chunk_idx, audio_chunk in enumerate(chunks):
+        for chunk_idx, audio_chunk in enumerate(chunks):
             chunk_messages = []
+            LOG.debug("Processing chunk %d/%d", chunk_idx + 1, len(chunks))
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e85e995 and 71db50f.

📒 Files selected for processing (15)
  • nemo_skills/dataset/mmau-pro/closed_form/__init__.py
  • nemo_skills/dataset/mmau-pro/open_ended/__init__.py
  • nemo_skills/dataset/mmau-pro/prepare.py
  • nemo_skills/evaluation/metrics/mmau_pro_metrics.py
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/model/__init__.py
  • nemo_skills/inference/model/audio_processor.py
  • nemo_skills/inference/model/openai.py
  • nemo_skills/inference/model/vllm.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/prompt/config/judge/mmau-pro.yaml
  • nemo_skills/prompt/config/judge/speechlm.yaml
  • tests/gpu-tests/run_qwen.sh
  • tests/gpu-tests/test_vllm_audio.py
  • tests/test_vllm_audio.py
💤 Files with no reviewable changes (1)
  • nemo_skills/prompt/config/judge/speechlm.yaml
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.

Applied to files:

  • nemo_skills/prompt/config/judge/mmau-pro.yaml
🧬 Code graph analysis (5)
tests/gpu-tests/test_vllm_audio.py (2)
tests/gpu-tests/utils.py (1)
  • require_env_var (18-23)
nemo_skills/pipeline/utils/declarative.py (1)
  • run (346-483)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py (2)
nemo_skills/evaluation/metrics/base.py (2)
  • as_int (443-446)
  • as_percentage (437-440)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
nemo_skills/inference/generate.py (4)
nemo_skills/inference/model/audio_processor.py (2)
  • AudioProcessor (144-406)
  • AudioProcessorConfig (34-51)
nemo_skills/inference/chat_interface/core.py (3)
  • cfg (181-182)
  • sandbox (177-178)
  • get (136-151)
nemo_skills/inference/model/__init__.py (1)
  • get_model (75-95)
recipes/asr_tts/riva_generate.py (1)
  • fill_prompt (130-134)
tests/test_vllm_audio.py (2)
nemo_skills/inference/model/vllm.py (1)
  • VLLMModel (27-151)
nemo_skills/inference/model/audio_processor.py (1)
  • audio_file_to_base64 (54-58)
nemo_skills/inference/model/__init__.py (1)
nemo_skills/inference/model/audio_processor.py (2)
  • AudioProcessor (144-406)
  • AudioProcessorConfig (34-51)
🪛 Ruff (0.14.10)
tests/gpu-tests/test_vllm_audio.py

31-31: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"

(S108)


66-66: subprocess call with shell=True identified, security issue

(S602)

nemo_skills/evaluation/metrics/mmau_pro_metrics.py

123-123: Loop control variable agg_mode not used within loop body

Rename unused agg_mode to _agg_mode

(B007)

nemo_skills/inference/generate.py

578-578: Unused method argument: data_points

(ARG002)

nemo_skills/inference/model/audio_processor.py

202-202: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


265-265: Avoid specifying long messages outside the exception class

(TRY003)


289-289: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


320-321: try-except-pass detected, consider logging the exception

(S110)


320-320: Do not catch blind exception: Exception

(BLE001)


353-353: Loop control variable chunk_idx not used within loop body

Rename unused chunk_idx to _chunk_idx

(B007)


369-371: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

tests/test_vllm_audio.py

45-45: Unused lambda argument: self

(ARG005)


45-45: Unused lambda argument: kwargs

(ARG005)

🔇 Additional comments (25)
nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1)

19-19: LGTM!

The EVAL_ARGS constant addition is consistent with the module's pattern and aligns with the MMAU-Pro evaluation flow introduced across the repository.

nemo_skills/inference/model/openai.py (1)

51-51: LGTM!

The extension to support "inference-api.nvidia.com" alongside "api.nvidia.com" for NVIDIA API key detection is appropriate and maintains the same security posture.

tests/gpu-tests/run_qwen.sh (1)

23-25: LGTM!

The audio test workflow addition follows the established pattern in the file and appropriately sets the model for Qwen2.5-Omni audio testing.

nemo_skills/dataset/mmau-pro/open_ended/__init__.py (1)

26-26: LGTM!

The judge prompt configuration update from speechlm to mmau-pro aligns with the new multi-criteria evaluation approach introduced in this PR.

nemo_skills/pipeline/utils/eval.py (1)

134-138: LGTM!

The dataset_group propagation follows the established pattern in this function and enables proper audio dataset handling in the generation pipeline.

tests/test_vllm_audio.py (4)

25-39: LGTM!

The test correctly validates base64 encoding and decoding round-trip for audio files, with proper temporary file cleanup.


42-48: LGTM!

The mock fixture appropriately bypasses VLLMModel initialization for isolated testing of audio preprocessing logic. The unused lambda arguments flagged by static analysis are intentional for the mocking pattern.


51-72: LGTM!

The test properly validates that audio content is prepended before text content in the message list, which is critical for Qwen Audio models to transcribe correctly.


74-101: LGTM!

The test correctly validates handling of multiple audio files, ensuring all audio entries precede the text content as required by Qwen Audio.

nemo_skills/prompt/config/judge/mmau-pro.yaml (1)

1-30: LGTM!

The multi-criteria judge prompt is well-structured with clear evaluation criteria and output format requirements. The 4-criteria evaluation (Correctness, Relevance, Completeness, Clarity) with 1-5 scoring and brief justifications provides comprehensive assessment for MMAU-Pro open-ended responses.

nemo_skills/inference/model/vllm.py (1)

28-32: LGTM!

The added docstring clearly describes the VLLMModel's purpose and appropriately documents that audio processing capabilities are provided through the AudioProcessor wrapper, maintaining clear separation of concerns.

nemo_skills/inference/model/__init__.py (2)

98-123: Clean factory function for audio model wrapping.

The get_audio_model function follows the established factory pattern and correctly handles the three cases for audio_config (None, dict, or AudioProcessorConfig).


161-178: LGTM on schema_overrides addition.

The schema_overrides parameter is properly threaded through to ToolCallingWrapper.

nemo_skills/dataset/mmau-pro/prepare.py (1)

93-98: System message structure looks good.

The /no_think directive for non-open categories is a reasonable approach to control model reasoning behavior. The message structure with system + user messages follows OpenAI conventions.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)

130-161: Multicriteria aggregation logic is well-structured.

The conversion from 1-5 scale to 0-100 percentage is clear, and the good/poor response rate thresholds (≥4.0 and ≤2.0) are reasonable. The rounding at the end ensures consistent output formatting.
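The scale conversion and the good/poor thresholds mentioned above can be sketched as follows; the function names are hypothetical, not the repo's:

```python
def scores_to_percentages(scores: dict[str, float]) -> dict[str, float]:
    # Map a 1-5 judge score onto 0-100: 1 -> 0, 5 -> 100
    return {k: (v - 1.0) / 4.0 * 100.0 for k, v in scores.items()}


def response_quality(overall: float) -> str:
    # Thresholds from the review: >= 4.0 is a good response, <= 2.0 is poor
    if overall >= 4.0:
        return "good"
    if overall <= 2.0:
        return "poor"
    return "mixed"
```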

nemo_skills/inference/model/audio_processor.py (3)

54-58: LGTM on base64 encoding utility.

Simple and correct implementation for audio file encoding.
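For reference, such a utility is typically just the following; this is a sketch, and the repo's `audio_file_to_base64` may differ in details:

```python
import base64


def audio_file_to_base64(path: str) -> str:
    # Read the raw audio bytes and return them as a base64 string
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```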


112-141: Good use of tempfile with proper cleanup.

The save_audio_chunk_to_base64 function correctly uses a temporary file with cleanup in a finally block. The check for file existence before unlinking is a good defensive practice.
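The pattern described here, temp file, encode, cleanup in a `finally` block, looks roughly like this sketch; the stdlib `wave` module and 16-bit mono PCM are assumptions, not necessarily what the repo uses:

```python
import base64
import os
import tempfile
import wave


def save_audio_chunk_to_base64(samples: bytes, sampling_rate: int) -> str:
    # Write 16-bit mono PCM samples to a temp WAV file, encode it, then clean up
    fd, tmp_path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        with wave.open(tmp_path, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)  # 16-bit
            wf.setframerate(sampling_rate)
            wf.writeframes(samples)
        with open(tmp_path, "rb") as f:
            return base64.b64encode(f.read()).decode("utf-8")
    finally:
        # Defensive existence check before unlinking, as noted in the review
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
```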


241-287: Audio message preparation handles ordering correctly.

The comment on line 247 about Qwen models requiring audio before text is important, and the implementation correctly places audio items before text content. Good documentation of this requirement.

nemo_skills/inference/generate.py (7)

40-49: LGTM on new imports.

The AudioProcessor and AudioProcessorConfig imports are correctly added to support the audio processing feature.


193-204: New audio-related config fields are well-documented.

The drop_content_types, audio_format, and dataset_group fields provide good flexibility for audio processing configuration. The defaults are sensible.


559-567: Audio field propagation in fill_prompt.

The code correctly copies the audio field from data_point to the user message. Note that this only handles the singular audio key; the audios (plural) case is not handled here, but that appears to be intentional based on the AudioProcessor handling both cases.


582-596: drop_binary_data implementation is correct.

The method properly handles edge cases (missing messages, non-list content) and filters content types based on configuration. The in-place mutation pattern is consistent with postprocess_single_output.


578-580: The data_points parameter is kept for subclass compatibility.

The static analysis flags data_points as unused, but this is intentional - the method signature must remain consistent for subclasses that may override dump_outputs and need access to data_points. This is a false positive.


646-658: Prompt preprocessing and task_type handling look correct.

The preprocessing hook allows for custom logic in subclasses, and the task_type is correctly passed to AudioProcessor when available.


409-431: AudioProcessor wrapping logic is well-implemented and appropriately specific.

The auto-enabling logic based on server_type and eval_type/dataset_group is clear, with the dual condition providing good defensive coverage. The logging provides good visibility into when audio processing is enabled. The current condition self.cfg.eval_type == "audio" is appropriately specific to the single audio evaluation type used in the codebase; different audio evaluation tasks are handled within the AudioEvaluator configuration rather than through separate eval_type values.

Comment on lines 85 to +92
if entry.get("audio_path"):
audio_path = entry["audio_path"]

if isinstance(audio_path, list) and audio_path:
user_message["audios"] = [{"path": path, "duration": 10.0} for path in audio_path]
elif isinstance(audio_path, str):
user_message["audio"] = {"path": audio_path, "duration": 10.0}

formatted_entry["messages"] = [user_message]
# Prepend /dataset/mmau-pro/ to make paths absolute for cluster
if len(audio_path) == 1:
user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
else:
user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]


⚠️ Potential issue | 🟡 Minor

Potential IndexError if audio_path is an empty list.

The code checks if entry.get("audio_path") which is truthy for a non-empty list, but if audio_path could be an empty list [], accessing audio_path[0] would fail. While likely not an issue with the dataset, consider adding explicit length check for robustness.

🔎 Suggested defensive check
     if entry.get("audio_path"):
         audio_path = entry["audio_path"]
         # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
-        if len(audio_path) == 1:
+        if audio_path and len(audio_path) == 1:
             user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
-        else:
+        elif audio_path:
             user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]
🤖 Prompt for AI Agents
In nemo_skills/dataset/mmau-pro/prepare.py around lines 85 to 92, the code
assumes audio_path[0] exists when entry.get("audio_path") is truthy, which can
raise IndexError for an empty list; update the logic to explicitly check the
length before indexing—if audio_path is a non-empty list use the existing
single/multiple-path handling, and if it's empty skip adding "audio"/"audios"
(or set to an empty list) to user_message to handle the empty-list case safely.

Comment on lines +54 to +58
# Fallback: compute overall if missing or still 3.0
if "overall" not in scores or scores["overall"] == 3.0:
criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
scores["overall"] = sum(criteria_scores) / len(criteria_scores)


⚠️ Potential issue | 🟡 Minor

Fallback logic may incorrectly overwrite a valid "overall" score of exactly 3.0.

The condition scores["overall"] == 3.0 will trigger the fallback even when the LLM judge explicitly rated "overall" as 3.0. This could lead to inconsistent behavior.

🔎 Suggested fix using a flag to track if score was parsed
+    overall_parsed = False
     for criterion, pattern in patterns.items():
         match = re.search(pattern, judgement_text, re.IGNORECASE)
-        scores[criterion] = float(match.group(1)) if match else 3.0
+        if match:
+            scores[criterion] = float(match.group(1))
+            if criterion == "overall":
+                overall_parsed = True
+        else:
+            scores[criterion] = 3.0

     # Fallback: compute overall if missing or still 3.0
-    if "overall" not in scores or scores["overall"] == 3.0:
+    if not overall_parsed:
         criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
         scores["overall"] = sum(criteria_scores) / len(criteria_scores)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In nemo_skills/evaluation/metrics/mmau_pro_metrics.py around lines 54-58, the
fallback currently triggers when scores["overall"] == 3.0 which incorrectly
overrides an explicitly parsed overall score of 3.0; change the logic to only
compute the fallback when the overall score is absent or when a parsing flag
indicates overall was not successfully parsed—introduce a boolean (e.g.,
overall_parsed) set true when you successfully parse a provided overall value,
and replace the condition with if "overall" not in scores or not overall_parsed
then compute overall as the mean of the criteria.
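Putting the suggested `overall_parsed` flag together, a self-contained sketch of such a parser might look like this; the `Criterion: N` line format and the regex are assumptions about the judge output, not the PR's exact code:

```python
import re

CRITERIA = ["correctness", "relevance", "completeness", "clarity", "overall"]


def extract_multicriteria_scores(judgement_text: str) -> dict[str, float]:
    # Parse "Correctness: 4"-style lines; default to 3.0 when a criterion is missing
    scores = {}
    overall_parsed = False
    for criterion in CRITERIA:
        match = re.search(
            rf"{criterion}\s*[:\-]\s*([1-5](?:\.\d+)?)", judgement_text, re.IGNORECASE
        )
        if match:
            scores[criterion] = float(match.group(1))
            if criterion == "overall":
                overall_parsed = True
        else:
            scores[criterion] = 3.0
    # Only fall back to the mean when "overall" was genuinely not parsed,
    # so an explicit 3.0 from the judge is preserved
    if not overall_parsed:
        crit = [scores[k] for k in CRITERIA[:-1]]
        scores["overall"] = sum(crit) / len(crit)
    return scores
```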

Comment on lines +289 to +322
def _check_chunking_needed(self, messages: list[dict], task_type: str = None) -> tuple[bool, str, float]:
"""Check if audio in messages needs chunking.

Returns:
Tuple of (needs_chunking, audio_path, duration)
"""
if not self.config.enable_chunking:
return False, None, 0.0

# Check if task type should be chunked (if filter is specified)
if self.config.chunk_task_types is not None:
if task_type not in self.config.chunk_task_types:
return False, None, 0.0

# Find audio in messages
for msg in messages:
if msg.get("role") == "user":
audio_info = msg.get("audio") or (msg.get("audios", [{}])[0] if msg.get("audios") else {})
if audio_info and "path" in audio_info:
audio_path = os.path.join(self.data_dir, audio_info["path"])

if not os.path.exists(audio_path):
return False, None, 0.0

# Load audio to check duration
try:
audio_array, sampling_rate = load_audio_file(audio_path)
duration = len(audio_array) / sampling_rate

if duration > self.config.chunk_threshold_sec:
return True, audio_path, duration
except Exception:
pass


⚠️ Potential issue | 🟡 Minor

Bare exception catch silently swallows errors.

The except Exception: pass at lines 320-321 hides potential issues with audio file loading. Consider logging the error or at least handling specific exceptions.

🔎 Suggested improvement with logging
-    def _check_chunking_needed(self, messages: list[dict], task_type: str = None) -> tuple[bool, str, float]:
+    def _check_chunking_needed(self, messages: list[dict], task_type: str | None = None) -> tuple[bool, str, float]:
         """Check if audio in messages needs chunking.

         Returns:
             Tuple of (needs_chunking, audio_path, duration)
         """
         if not self.config.enable_chunking:
             return False, None, 0.0

         # Check if task type should be chunked (if filter is specified)
         if self.config.chunk_task_types is not None:
             if task_type not in self.config.chunk_task_types:
                 return False, None, 0.0

         # Find audio in messages
         for msg in messages:
             if msg.get("role") == "user":
                 audio_info = msg.get("audio") or (msg.get("audios", [{}])[0] if msg.get("audios") else {})
                 if audio_info and "path" in audio_info:
                     audio_path = os.path.join(self.data_dir, audio_info["path"])

                     if not os.path.exists(audio_path):
                         return False, None, 0.0

                     # Load audio to check duration
                     try:
                         audio_array, sampling_rate = load_audio_file(audio_path)
                         duration = len(audio_array) / sampling_rate

                         if duration > self.config.chunk_threshold_sec:
                             return True, audio_path, duration
-                    except Exception:
-                        pass
+                    except Exception as e:
+                        LOG.debug("Failed to load audio for chunking check: %s - %s", audio_path, e)

         return False, None, 0.0

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.10)

289-289: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


320-321: try-except-pass detected, consider logging the exception

(S110)


320-320: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In nemo_skills/inference/model/audio_processor.py around lines 289 to 322, the
bare `except Exception: pass` swallows errors when loading audio; replace it
with targeted exception handling and logging: catch specific exceptions (e.g.,
FileNotFoundError, OSError, any codec/format error your audio loader raises) and
log a clear error message including the audio_path and exception details via the
class logger (or a provided logger) so failures are visible, then continue
processing (return False, None, 0.0 or skip to next message) rather than
silently ignoring all exceptions.

Comment on lines +25 to +83
@pytest.mark.gpu
def test_vllm_audio_generation():
    """Integration test: Generate with vLLM server using audio input."""
    model_path = require_env_var("NEMO_SKILLS_TEST_HF_MODEL")
    model_type = require_env_var("NEMO_SKILLS_TEST_MODEL_TYPE")

    output_dir = f"/tmp/nemo-skills-tests/{model_type}/vllm-audio-generation"
    # Clean up output directory
    if Path(output_dir).exists():
        shutil.rmtree(output_dir)

    # Create test input file with audio
    with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f:
        test_data = [
            {
                "question": "Transcribe this audio",
                "audio": {"path": "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t2_16.wav"},
            },
            {
                "question": "What is in this audio?",
                "audio": {"path": "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t3_16.wav"},
            },
        ]
        for item in test_data:
            f.write(json.dumps(item) + "\n")
        input_file = f.name

    try:
        cmd = (
            f"ns generate "
            f" --cluster test-local --config_dir {Path(__file__).absolute().parent} "
            f" --model {model_path} "
            f" --output_dir {output_dir} "
            f" --server_type vllm "
            f" --server_gpus 1 "
            f" --server_nodes 1 "
            f" --server_args '--enforce-eager' "
            f" --input_file={input_file} "
            f" ++prompt_config=generic/default "
            f" ++skip_filled=False "
        )
        subprocess.run(cmd, shell=True, check=True)

        # Verify output exists and has audio-related generation
        with open(f"{output_dir}/output.jsonl") as fin:
            lines = fin.readlines()

        assert len(lines) == 2, "Should have 2 output lines"

        for line in lines:
            data = json.loads(line)
            assert "generation" in data, "Should have generation field"
            assert len(data["generation"]) > 0, "Generation should not be empty"
            # If model supports audio, generation should contain something
            print(f"Generated: {data['generation']}")

    finally:
        # Cleanup temp file
        Path(input_file).unlink(missing_ok=True)
Copy link
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the audio files are referenced in the codebase
rg -n "t2_16\.wav|t3_16\.wav" --type py

Repository: NVIDIA-NeMo/Skills

Length of output: 325


🏁 Script executed:

# Check if the audio directory structure exists
find . -type f -name "*.wav" 2>/dev/null | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 147


🏁 Script executed:

# Check the actual test file to verify the line numbers and content match the review
wc -l tests/gpu-tests/test_vllm_audio.py 2>/dev/null || echo "File not found in expected location"
git ls-files | grep -E "test.*audio|audio.*test" | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 159


🏁 Script executed:

# Look for any test data or fixtures directories
find . -type d -name "wavs" -o -type d -name "test_data" -o -type d -name "fixtures" 2>/dev/null | grep -v ".git"

Repository: NVIDIA-NeMo/Skills

Length of output: 94


🏁 Script executed:

# Check how other tests reference test data/fixtures
rg -B 2 -A 2 "tests/slurm-tests|tests/gpu-tests" --type py | head -50

Repository: NVIDIA-NeMo/Skills

Length of output: 3457


🏁 Script executed:

# Look for any pattern of dynamic path resolution in tests
rg -n "Path(__file__)" tests/ --type py | head -10

Repository: NVIDIA-NeMo/Skills

Length of output: 44


🏁 Script executed:

# Check if there's a pattern for base path configuration or environment variables
rg -n "nemo_run|NEMO.*PATH|TEST.*PATH" tests/ --type py | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 2148


🏁 Script executed:

# Look at the test file itself more closely, especially around path handling
cat -n tests/gpu-tests/test_vllm_audio.py | head -50

Repository: NVIDIA-NeMo/Skills

Length of output: 2201


The integration test follows established codebase patterns, with room for improvement.

The test structure is well-organized with proper cleanup in the finally block. The hardcoded /nemo_run/code/ paths follow the pattern used throughout the test suite for containerized/cluster environments, though they do limit portability to local development setups.

Consider these enhancements:

  1. Hardcoded audio paths (lines 41, 45): While this pattern is consistent with other tests in the codebase, consider using an environment variable or Path(__file__).parent to make the test work in both cluster and local development environments.

  2. Test assertions are minimal: The test only verifies that output exists and has non-empty generations. Consider adding assertions for audio-specific metadata to better validate audio processing.

  3. No negative test cases: Consider adding tests for missing audio files or invalid audio formats to ensure proper error handling.

🧰 Tools
🪛 Ruff (0.14.10)

31-31: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"

(S108)


66-66: subprocess call with shell=True identified, security issue

(S602)

🤖 Prompt for AI Agents
In tests/gpu-tests/test_vllm_audio.py around lines 25-83, the test uses
hardcoded container paths for audio files and has minimal assertions; update it
to obtain audio file locations from an environment variable (e.g.,
NEMO_TEST_ASSETS) or relative to Path(__file__).parent so it works both in
CI/container and local dev, replace the literal "/nemo_run/code/..." paths with
a resolved Path built from that env var or the test file's parent, strengthen
assertions to check audio-related metadata in each output (e.g., presence of
audio tokens/flags, expected transcript substrings or an audio_processing_status
field) and add at least one negative subtest or input case that uses a
missing/invalid audio path and asserts the tool returns a handled error or
specific failure field; keep the existing cleanup logic for the temporary input
file.

@vmendelev
Copy link
Collaborator

I thought we agreed to move audio-related stuff into another class (vllm_multimodal)? https://nvidia.slack.com/archives/C0A0VRM2S5A/p1766173279180119

@Kipok
Copy link
Collaborator

Kipok commented Jan 13, 2026

@Jorjeous does this overlap with #1157? Should we close this one, or do we need both?

@Jorjeous
Copy link
Member Author

Hi @Kipok
No, this is not overlapping; this is about the usage of the inference server (just the accepted request format).
The current form is built on top of the previous way of processing, without vllm multimodal.

Yes, we won't merge this one and will create just a tiny PR after merging #1157

@Jorjeous
Copy link
Member Author

Closing, on top of #1157

@Jorjeous Jorjeous closed this Jan 13, 2026