Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
📝 Walkthrough
Adds audio support across datasets, evaluation, inference, and metrics: new audio flags, normalization modes, audio utilities and chunking, VLLMMultimodal generation with chunked audio I/O, updated MMAU-Pro judge prompt and multi-criteria metrics, plus unit and GPU tests for audio flows.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Generate as Generate Pipeline
participant VLLM as VLLMMultimodalModel
participant AudioUtils
participant vLLMServer as vLLM Server
Client->>Generate: request with audio + enable_audio=true
Generate->>Generate: setup_llm() detects needs_audio -> selects vllm_multimodal
Generate->>VLLM: instantiate with audio chunking config
VLLM->>VLLM: _needs_audio_chunking(messages, task_type)
alt chunking needed
VLLM->>AudioUtils: load_audio_file(path)
AudioUtils-->>VLLM: (array, sr)
VLLM->>AudioUtils: chunk_audio(array, sr, threshold)
AudioUtils-->>VLLM: chunks[]
loop per chunk
VLLM->>AudioUtils: save_audio_chunk_to_base64(chunk)
AudioUtils-->>VLLM: base64_chunk
VLLM->>VLLM: content_text_to_list(messages + base64_chunk)
VLLM->>vLLMServer: generate_async(messages_chunk)
vLLMServer-->>VLLM: generation_result
end
VLLM->>VLLM: aggregate chunk results + metadata
else no chunking
VLLM->>VLLM: content_text_to_list(messages)
VLLM->>vLLMServer: generate_async(messages)
vLLMServer-->>VLLM: generation_result
end
VLLM->>Generate: return generation + audio metadata
Generate->>Generate: drop_fields_from_messages(output)
Generate-->>Client: final_output
sequenceDiagram
participant Evaluator
participant MMAU as MMAUProMetrics
participant Parser as extract_multicriteria_scores
Evaluator->>MMAU: update(prediction)
MMAU->>Parser: extract_multicriteria_scores(judgement_text)
Parser-->>MMAU: {correctness,relevance,completeness,clarity,overall}
MMAU->>MMAU: append scores to multicriteria_buffers
Evaluator->>MMAU: get_metrics()
MMAU->>MMAU: compute avg/std per criterion, scale to 0-100, derive good/poor rates
MMAU-->>Evaluator: aggregated_metrics
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Greptile Overview
Greptile Summary
This PR enhances the VLLM multimodal class with audio input capabilities while maintaining existing audio output functionality. The changes introduce audio preprocessing, chunking for long audio files, and integration with multiple audio benchmarks (MMAU-Pro, AudioBench, ASR-Leaderboard, LibriSpeech-PC).
Key Changes
Audio Input Support:
- Added audio_utils.py module with utilities for audio encoding, loading, and chunking
- Extended VLLMMultimodalModel with audio preprocessing (content_text_to_list) to convert audio file references to base64-encoded data URIs
- Implemented audio chunking for long files (>30s threshold) to handle model duration limits
- Added model-specific preprocessing (strips /no_think from system messages for Qwen compatibility)
Evaluation Updates:
- Migrated MMAU-Pro open-ended evaluation from binary Yes/No judgements to multi-criteria 1-5 scoring (correctness, relevance, completeness, clarity)
- Created new judge/mmau-pro.yaml prompt config with structured multi-criteria format
- Updated metrics to extract and report scores as percentages with good/poor response rate tracking
Configuration:
- Added should_enable_audio flag to enable audio processing in benchmark configs
- Introduced audio chunking configuration (enable_audio_chunking, chunk_audio_threshold_sec, audio_chunk_task_types)
- Implemented drop_content_types to remove base64 audio from output files (reduces file size)
Integration:
- Updated 6 dataset configs to enable audio processing
- Modified generate.py to automatically switch to vllm_multimodal server when audio is needed
Critical Issue
The generate_async method implementation in VLLMMultimodalModel is commented out (lines 412-445). This makes all audio input functionality completely non-functional. Without this method, audio files won't be converted to base64, chunking won't occur, and the model will receive raw audio file paths instead of processed audio data.
Confidence Score: 0/5
- This PR cannot be merged safely - the core audio input feature is completely non-functional due to commented-out code.
- The entire generate_async method in VLLMMultimodalModel is commented out, which is the entry point for all audio input processing. This means the primary feature this PR introduces (audio input with chunking) will not work at all. When users attempt to use audio inputs, the model will fall back to the parent class behavior without any audio preprocessing, causing failures. Additionally, there are multiple logic errors in the audio processing code that would cause runtime errors if uncommented (IndexError in split operations, potential issues with empty audios list).
- nemo_skills/inference/model/vllm_multimodal.py - commented-out generate_async method must be uncommented and logic errors fixed before merge
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/inference/model/vllm_multimodal.py | 0/5 | Added audio input support with chunking, but entire generate_async method is commented out, making the feature non-functional. Multiple logic issues in audio processing. |
| nemo_skills/inference/model/audio_utils.py | 5/5 | New utility module for audio processing (base64 encoding, chunking, loading). Clean implementation with proper error handling. |
| nemo_skills/inference/generate.py | 4/5 | Added audio config parameters and drop_fields_from_messages to remove base64 audio from output. Logic is sound but drops audio by default. |
| nemo_skills/evaluation/metrics/mmau_pro_metrics.py | 5/5 | Changed from binary Yes/No judgement to multi-criteria 1-5 scoring. Well-structured with proper score extraction and percentage conversion. |
| tests/test_vllm_audio.py | 4/5 | New unit tests for audio utilities and preprocessing. Tests won't catch the commented-out generate_async issue since they mock the model. |
| tests/gpu-tests/test_vllm_audio.py | 3/5 | New GPU integration test for audio generation. Will fail due to commented-out generate_async implementation. |
Sequence Diagram
sequenceDiagram
participant User
participant GenerationTask
participant VLLMMultimodalModel
participant AudioUtils
participant VLLMServer
participant FileSystem
Note over User,FileSystem: Audio INPUT Flow (when should_enable_audio=true)
User->>GenerationTask: Request generation with audio
GenerationTask->>GenerationTask: Check should_enable_audio flag
GenerationTask->>GenerationTask: Switch server_type to vllm_multimodal
GenerationTask->>VLLMMultimodalModel: Initialize with audio config
GenerationTask->>VLLMMultimodalModel: generate_async(messages with audio)
Note over VLLMMultimodalModel: ❌ BLOCKED: generate_async is commented out
Note over VLLMMultimodalModel: Expected flow (if uncommented):
VLLMMultimodalModel->>VLLMMultimodalModel: _needs_audio_chunking()
VLLMMultimodalModel->>FileSystem: Load audio file
VLLMMultimodalModel->>AudioUtils: load_audio_file(audio_path)
AudioUtils-->>VLLMMultimodalModel: audio_array, sampling_rate
alt Audio duration > threshold
VLLMMultimodalModel->>AudioUtils: chunk_audio()
AudioUtils-->>VLLMMultimodalModel: audio chunks
loop For each chunk
VLLMMultimodalModel->>AudioUtils: save_audio_chunk_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64_audio
VLLMMultimodalModel->>VLLMMultimodalModel: Build chunk messages
VLLMMultimodalModel->>VLLMServer: Generate with chunk
VLLMServer-->>VLLMMultimodalModel: chunk result
VLLMMultimodalModel->>VLLMMultimodalModel: _postprocess_chunk_generation()
end
VLLMMultimodalModel->>VLLMMultimodalModel: Aggregate results
else No chunking needed
VLLMMultimodalModel->>VLLMMultimodalModel: content_text_to_list()
VLLMMultimodalModel->>AudioUtils: audio_file_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64 audio
VLLMMultimodalModel->>VLLMMultimodalModel: Insert audio before text
VLLMMultimodalModel->>VLLMServer: Generate with full audio
VLLMServer-->>VLLMMultimodalModel: result
end
Note over User,FileSystem: Audio OUTPUT Flow
VLLMServer->>VLLMMultimodalModel: Response with audio data
VLLMMultimodalModel->>VLLMMultimodalModel: _parse_chat_completion_response()
VLLMMultimodalModel->>VLLMMultimodalModel: _process_audio_response()
VLLMMultimodalModel->>FileSystem: Save audio to output_dir/audio/
FileSystem-->>VLLMMultimodalModel: Audio file path
VLLMMultimodalModel-->>GenerationTask: Result with audio path
GenerationTask->>GenerationTask: drop_fields_from_messages()
Note over GenerationTask: Remove base64 audio_url from output
GenerationTask->>FileSystem: Write final output.jsonl
FileSystem-->>User: Results with audio references
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/inference/generate.py (1)
417-456: This relies on VLLMMultimodalModel.generate_async actually preprocessing audio; currently it doesn't.
As written, setup_llm will route audio workloads to vllm_multimodal, but the model's generate_async override (in nemo_skills/inference/model/vllm_multimodal.py) is currently commented out and returns None.
🤖 Fix all issues with AI agents
In @nemo_skills/evaluation/evaluator/audio.py:
- Line 127: The project now imports EnglishTextNormalizer (from
whisper_normalizer.english) in preprocess_asr_text, but whisper_normalizer is
missing from requirements/main.txt causing ModuleNotFoundError at runtime; add
the whisper_normalizer package to requirements/main.txt (pin a version if known
or use a compatible latest), run pip install -r requirements/main.txt, and
confirm the API (EnglishTextNormalizer) matches how preprocess_asr_text
constructs and calls it (constructor args and normalize methods) adjusting the
call sites in preprocess_asr_text if the package uses a different class or
method names.
In @nemo_skills/evaluation/metrics/mmau_pro_metrics.py:
- Around line 108-110: The list comprehension that builds predicted_answers in
update() calls .strip() on pred.get("generation", None), which crashes when
generation is None; change it to safely handle missing values by first
retrieving gen = pred.get("generation") and then setting answer = gen.strip() if
gen is not None else None, and convert empty strings to None (e.g., answer =
answer or None); update the predicted_answers assignment accordingly before
calling _compute_pass_at_k and _compute_majority_at_k so both methods receive
None for missing/blank generations.
- Around line 54-58: The code currently overwrites an explicit overall score of
3.0 by treating 3.0 as "missing"; change the fallback so it only computes
overall when the "overall" key is absent. In the block that uses scores and the
"overall" key (the snippet assigning scores["overall"]), remove the comparison
to 3.0 and use only if "overall" not in scores: (compute criteria_scores from
correctness/relevance/completeness/clarity and set scores["overall"]
accordingly) so an explicitly provided overall=3.0 is preserved.
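Both fixes above can be sketched as small helpers (names are illustrative, not the actual methods in mmau_pro_metrics.py):

```python
def safe_answer(pred):
    """Per the first fix: never call .strip() on a missing generation."""
    gen = pred.get("generation")
    if gen is None:
        return None
    return gen.strip() or None  # blank strings also become None

def fill_overall(scores):
    """Per the second fix: compute 'overall' only when the key is absent."""
    if "overall" not in scores:
        criteria = ("correctness", "relevance", "completeness", "clarity")
        vals = [scores.get(c, 3.0) for c in criteria]
        scores["overall"] = sum(vals) / len(vals)
    return scores
```

With `"overall" not in scores` as the guard, an explicit judge score of 3.0 survives instead of being recomputed from the other criteria.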
In @nemo_skills/inference/model/audio_utils.py:
- Around line 59-82: The chunk_audio function must validate inputs to avoid
divide-by-zero/zero-length chunks: inside chunk_audio (function name
chunk_audio) check that sampling_rate and chunk_duration_sec are > 0 and that
audio_array is not None/has positive length, and if any check fails raise a
ValueError with a clear message; compute chunk_samples only after validation and
ensure chunk_samples > 0 before using it (raise if not), so the subsequent
num_chunks = int(np.ceil(len(audio_array) / chunk_samples)) and loop cannot blow
up or produce zero-length chunks.
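A minimal sketch of the validated chunk_audio described above, using math.ceil in place of np.ceil so the example is dependency-free:

```python
import math

def chunk_audio(audio_array, sampling_rate, chunk_duration_sec):
    """Split audio into fixed-duration chunks, validating inputs up front."""
    if audio_array is None or len(audio_array) == 0:
        raise ValueError("audio_array must be non-empty")
    if sampling_rate <= 0 or chunk_duration_sec <= 0:
        raise ValueError("sampling_rate and chunk_duration_sec must be positive")
    chunk_samples = int(sampling_rate * chunk_duration_sec)
    if chunk_samples <= 0:
        raise ValueError("chunk_duration_sec too small for this sampling_rate")
    num_chunks = math.ceil(len(audio_array) / chunk_samples)
    return [
        audio_array[i * chunk_samples:(i + 1) * chunk_samples]
        for i in range(num_chunks)
    ]
```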
In @nemo_skills/inference/model/vllm_multimodal.py:
- Around line 156-200: The content_text_to_list function (and the similar code
block at lines 228-266) currently uses os.path.join(self.data_dir,
audio["path"]) which allows path traversal; fix by resolving the final path
using os.path.abspath/os.path.realpath (e.g., resolved =
os.path.realpath(os.path.join(self.data_dir, audio["path"]))) and verify it is
inside self.data_dir using os.path.commonpath or by ensuring
resolved.startswith(os.path.realpath(self.data_dir) + os.sep); if the check
fails, raise a ValueError/PermissionError instead of reading the file. Refactor
the resolution/check into a small helper like _resolve_data_path(path) and use
it wherever audio paths are handled.
- Around line 390-426: The generate_async method is currently commented out and
returns None; restore its implementation so it always returns the parent's dict
result: when prompt is a list, set messages=prompt, call
self._needs_audio_chunking(messages, task_type) and if needs_chunking return
await self._generate_with_chunking(messages, audio_path, duration,
tokens_to_generate, **kwargs); otherwise convert messages =
[self.content_text_to_list(msg.copy()) for msg in messages], then messages =
self._preprocess_messages_for_model(messages), set prompt=messages; finally call
and return await super().generate_async(prompt=prompt,
tokens_to_generate=tokens_to_generate, **kwargs) so the method adheres to the
BaseModel contract.
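The two fixes for this file can be sketched together: a traversal-safe path resolver, and a skeleton of the restored generate_async contract with the collaborating methods stubbed out (class and helper names here are illustrative, not the real implementation):

```python
import os

def resolve_data_path(data_dir, rel_path):
    """Resolve rel_path under data_dir and reject traversal outside it."""
    base = os.path.realpath(data_dir)
    resolved = os.path.realpath(os.path.join(base, rel_path))
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError(f"audio path escapes data_dir: {rel_path!r}")
    return resolved

class MultimodalSketch:
    """Skeleton of the restored generate_async contract; helpers are stubs."""

    def _needs_audio_chunking(self, messages, task_type):
        return False  # stub: would inspect audio duration vs threshold

    async def _generate_with_chunking(self, messages, **kwargs):
        return {"generation": "chunked"}  # stub

    def content_text_to_list(self, msg):
        return msg  # stub: would expand audio refs into base64 content blocks

    def _preprocess_messages_for_model(self, messages):
        return messages  # stub: would strip /no_think for Qwen, etc.

    async def _super_generate_async(self, prompt, **kwargs):
        return {"generation": "ok"}  # stands in for super().generate_async(...)

    async def generate_async(self, prompt, task_type=None, tokens_to_generate=None, **kwargs):
        # Always return the parent's dict result, per the BaseModel contract
        if isinstance(prompt, list):
            messages = prompt
            if self._needs_audio_chunking(messages, task_type):
                return await self._generate_with_chunking(
                    messages, tokens_to_generate=tokens_to_generate, **kwargs
                )
            messages = [self.content_text_to_list(m.copy()) for m in messages]
            prompt = self._preprocess_messages_for_model(messages)
        return await self._super_generate_async(
            prompt, tokens_to_generate=tokens_to_generate, **kwargs
        )
```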
In @tests/gpu-tests/test_vllm_audio.py:
- Around line 33-37: The test uses string shell commands (e.g., subprocess.run
or similar) which pass through shell=True; update
tests/gpu-tests/test_vllm_audio.py to construct explicit argv lists and invoke
the command without a shell (shell=False). Locate the invocations that build
command strings (around the output_dir handling and the other commands in the
55-69 block), replace string commands with lists of program and args and call
subprocess.run(...) or subprocess.check_call(...) with shell=False, quoting each
arg separately and preserving the same environment/redirects.
- Around line 38-49: The hard-coded CI paths in the test_vllm_audio.py test_data
make the test environment-coupled; update the test to either check for the WAVs'
existence with Path("/nemo_run/code/...").exists() and call pytest.skip(...)
when missing, or replace those entries by generating lightweight temporary WAV
files (e.g., create small silent WAVs into tempfile.NamedTemporaryFile and use
their .name) and reference them in the test_data variable so the test is
self-contained; modify the block around the NamedTemporaryFile creation and the
test_data list to implement one of these two options and ensure cleanup if
generating temp files.
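Both suggested test fixes can be sketched as follows: invoking a command as an explicit argv list so no shell is involved, and generating a self-contained silent WAV with the stdlib wave module (names are illustrative):

```python
import subprocess
import sys
import tempfile
import wave

def run_no_shell(output_dir):
    """Run a command as an argv list (shell=False); args with spaces need no quoting."""
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", output_dir]
    return subprocess.run(cmd, shell=False, capture_output=True, text=True, check=True)

def make_silent_wav(duration_sec=0.5, sample_rate=16000):
    """Create a small silent 16-bit mono WAV so the test needs no CI-specific files."""
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    with wave.open(tmp.name, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * int(sample_rate * duration_sec))
    return tmp.name
```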
In @tests/test_vllm_audio.py:
- Around line 45-56: The patched VLLMMultimodalModel.__init__ in the
mock_vllm_multimodal_model fixture is too specific and triggers Ruff ARG005;
change the patch to accept arbitrary positional and keyword args (e.g., use
lambda *_args, **_kwargs: None or a tiny def __init__(*args, **kwargs): return
None) so the test remains robust to signature changes and stops lint noise while
keeping the rest of the fixture (model.data_dir, output_dir, output_audio_dir,
enable_audio_chunking, audio_chunk_task_types, chunk_audio_threshold_sec)
intact.
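A sketch of the suggested fixture patch, with a stand-in class since the real VLLMMultimodalModel constructor is heavy:

```python
from unittest.mock import patch

class Dummy:
    """Stands in for VLLMMultimodalModel, whose real __init__ is heavy."""
    def __init__(self, server_url, model_name):
        raise RuntimeError("would start a server")

def make_mock_model():
    # Accept arbitrary positional/keyword args so the patch survives signature changes
    with patch.object(Dummy, "__init__", lambda *_args, **_kwargs: None):
        model = Dummy("unused", "unused")  # heavy __init__ is bypassed
    # Attributes are then set manually, as in the fixture
    model.data_dir = "/data"
    model.enable_audio_chunking = True
    return model
```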
🧹 Nitpick comments (4)
nemo_skills/inference/model/__init__.py (1)
22-30: Consider adding __all__ to make the public API explicit (and stable).
Right now these helpers are "public by import side-effect"; __all__ would prevent accidental exports as this module grows.
nemo_skills/inference/model/audio_utils.py (1)
116-133: Consider taking audio_mime/format instead of hardcoding wav.
Right now callers can only emit data:audio/wav;base64,... and {format: "wav"} even if the input is mp3/flac.
nemo_skills/inference/model/vllm_multimodal.py (2)
256-265: Avoid silent exception swallowing in _needs_audio_chunking() (hard to debug).
At least LOG.debug(..., exc_info=True) on failure; also consider catching narrower exceptions than Exception.
183-199: Reuse make_audio_content_block() instead of duplicating "audio_url" formatting.
This prevents drift between helpers and keeps “audio_url vs input_audio” handling consistent.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
nemo_skills/dataset/asr-leaderboard/__init__.py
nemo_skills/dataset/audiobench/judge/__init__.py
nemo_skills/dataset/audiobench/nonjudge/__init__.py
nemo_skills/dataset/librispeech-pc/__init__.py
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
nemo_skills/dataset/mmau-pro/open_ended/__init__.py
nemo_skills/evaluation/evaluator/audio.py
nemo_skills/evaluation/evaluator/mmau_pro.py
nemo_skills/evaluation/metrics/map_metrics.py
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
nemo_skills/inference/generate.py
nemo_skills/inference/model/__init__.py
nemo_skills/inference/model/audio_utils.py
nemo_skills/inference/model/vllm_multimodal.py
nemo_skills/prompt/config/judge/mmau-pro.yaml
nemo_skills/prompt/config/judge/speechlm.yaml
tests/gpu-tests/test_vllm_audio.py
tests/test_vllm_audio.py
💤 Files with no reviewable changes (1)
- nemo_skills/prompt/config/judge/speechlm.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
Applied to files:
nemo_skills/dataset/audiobench/nonjudge/__init__.py
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
nemo_skills/prompt/config/judge/mmau-pro.yaml
nemo_skills/dataset/audiobench/judge/__init__.py
nemo_skills/dataset/mmau-pro/open_ended/__init__.py
🧬 Code graph analysis (7)
tests/gpu-tests/test_vllm_audio.py (1)
tests/gpu-tests/utils.py (1)
require_env_var(18-23)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/audio_metrics.py (1)
AudioMetrics(43-345)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
as_int(443-446), as_percentage(437-440)
nemo_skills/inference/model/audio_utils.py (2)
nemo_skills/utils.py (1)
get_logger_name(39-43)
nemo_skills/inference/patch_litellm_logging.py (1)
start(37-38)
nemo_skills/inference/model/__init__.py (1)
nemo_skills/inference/model/audio_utils.py (5)
audio_file_to_base64(30-41), chunk_audio(59-81), load_audio_file(44-56), make_audio_content_block(116-133), save_audio_chunk_to_base64(84-113)
tests/test_vllm_audio.py (2)
nemo_skills/inference/model/audio_utils.py (1)
audio_file_to_base64(30-41)
nemo_skills/inference/model/vllm_multimodal.py (5)
VLLMMultimodalModel(44-445), content_text_to_list(156-200), _preprocess_messages_for_model(202-226), _needs_audio_chunking(228-266), _postprocess_chunk_generation(353-388)
nemo_skills/inference/model/vllm_multimodal.py (3)
nemo_skills/inference/model/audio_utils.py (4)
audio_file_to_base64(30-41), chunk_audio(59-81), load_audio_file(44-56), save_audio_chunk_to_base64(84-113)
nemo_skills/inference/chat_interface/core.py (1)
get(136-151)
nemo_skills/inference/model/base.py (1)
generate_async(218-325)
🪛 Ruff (0.14.10)
tests/gpu-tests/test_vllm_audio.py
33-33: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"
(S108)
68-68: subprocess call with shell=True identified, security issue
(S602)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
123-123: Loop control variable agg_mode not used within loop body
Rename unused agg_mode to _agg_mode
(B007)
tests/test_vllm_audio.py
48-48: Unused lambda argument: self
(ARG005)
48-48: Unused lambda argument: kwargs
(ARG005)
nemo_skills/inference/model/vllm_multimodal.py
179-179: Avoid specifying long messages outside the exception class
(TRY003)
228-228: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
263-264: try-except-pass detected, consider logging the exception
(S110)
263-263: Do not catch blind exception: Exception
(BLE001)
296-296: Loop control variable chunk_idx not used within loop body
Rename unused chunk_idx to _chunk_idx
(B007)
312-314: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
394-394: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (12)
nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1)
18-19: LGTM - Audio enablement flags added correctly. The addition of ++should_enable_audio=true to GENERATION_ARGS and the new EVAL_ARGS constant align with the PR's objective to enable audio input support for the vllm multimodal component.
nemo_skills/evaluation/metrics/map_metrics.py (1)
53-53: LGTM - Backward compatibility alias added. The "speechlm" alias maintains backward compatibility while routing to the unified AudioMetrics class. This allows existing references to "speechlm" metric type to continue working without modification.
nemo_skills/dataset/librispeech-pc/__init__.py (1)
28-29: LGTM - Audio enablement and cleanup. The changes correctly add the audio enablement flag and remove trailing whitespace. This is consistent with the broader PR changes to enable audio input support.
nemo_skills/evaluation/evaluator/mmau_pro.py (1)
34-34: LGTM - Documentation updated to reflect new judge configuration. The docstring correctly references the new judge/mmau-pro prompt config, aligning with the broader migration from speechlm to mmau-pro judge configuration across the codebase.
nemo_skills/dataset/asr-leaderboard/__init__.py (1)
21-21: LGTM - Audio enablement flag added correctly. The addition of ++should_enable_audio=true to GENERATION_ARGS aligns with the PR objective to enable audio input support. The flag is placed appropriately alongside the existing audio-related arguments.
nemo_skills/dataset/audiobench/nonjudge/__init__.py (1)
28-31: LGTM - Audio flag added and formatting cleaned up. The changes consistently add the ++should_enable_audio=true flag and remove trailing spaces from both EVAL_ARGS and GENERATION_ARGS, aligning with the PR's audio enablement objective.
nemo_skills/dataset/mmau-pro/open_ended/__init__.py (2)
18-18: Audio enablement flag added correctly. The addition of ++should_enable_audio=true to GENERATION_ARGS is consistent with the PR's objective and matches the pattern applied across other dataset configurations.
26-26: Judge path migration to judge/mmau-pro is complete and correct. The judge config path has been successfully updated from judge/speechlm to judge/mmau-pro. Verification confirms no broken references to judge/speechlm exist in the codebase, and the mmau-pro.yaml configuration file is properly in place. The code change is sound.
nemo_skills/dataset/audiobench/judge/__init__.py (1)
29-30: LGTM - Consistent audio enablement and cleanup. The changes add the ++should_enable_audio=true flag to GENERATION_ARGS and remove trailing spaces from both constants, maintaining consistency with the audio enablement pattern applied across the codebase.
nemo_skills/prompt/config/judge/mmau-pro.yaml (1)
1-30: Format specification correctly matches the evaluation metrics parser. The output format in the prompt (CORRECTNESS, RELEVANCE, COMPLETENESS, CLARITY, OVERALL with numeric scores) aligns perfectly with the regex patterns in extract_multicriteria_scores(). The parser uses case-insensitive matching and safely defaults to 3.0 for any missing fields, making it robust against minor format variations.
nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)
123-152: Multi-criteria aggregation ignores per-aggregation buckets.
self.multicriteria_scores is accumulated once and then applied to every agg_metrics entry; if metrics_dict contains per-subgroup metrics, this will smear scores across groups. If MMAU-Pro never uses subgrouping, OK, but otherwise consider collecting multicriteria per agg key.
590-607: drop_fields_from_messages() is a good safeguard, but ensure consumers tolerate empty content lists.
If a message contains only dropped content types, it becomes []. That's probably fine for persisted outputs, but worth confirming downstream readers don't assume non-empty content.
Also applies to: 624-627
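To illustrate how a message can end up with content == [], here is a hypothetical drop helper (not the actual drop_fields_from_messages implementation):

```python
def drop_content_types(messages, drop_types=("audio_url",)):
    """Remove content blocks of the given types from each message's content list."""
    out = []
    for msg in messages:
        msg = dict(msg)  # shallow copy so the input is not mutated
        content = msg.get("content")
        if isinstance(content, list):
            # A message whose content is only dropped types ends up with []
            msg["content"] = [c for c in content if c.get("type") not in drop_types]
        out.append(msg)
    return out
```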
Greptile Overview
Greptile Summary
This PR enhances the VLLMMultimodalModel to support audio input processing while maintaining existing audio output capabilities.
Key Changes
Audio Input Processing:
- New audio_utils.py module provides utilities for encoding audio files to base64, chunking long audio, and temporary file handling
- VLLMMultimodalModel now accepts audio chunking configuration (threshold, task type filtering)
- Implements automatic chunking for audio files exceeding duration limits, processing each chunk separately and aggregating results
- Converts audio file paths to base64-encoded data URIs in the proper format for vLLM/Qwen models
- Audio content is placed before text content (required for Qwen models)
Evaluation Enhancements:
- mmau_pro_metrics.py now supports multi-criteria scoring (correctness, relevance, completeness, clarity, overall) on 1-5 scale for open-ended questions
- New strip_helpful_prefixes() function in audio.py removes LLM-generated prefixes like "The audio says:" from transcriptions before WER calculation
- Judge prompt configuration changed from speechlm to mmau-pro for multi-criteria evaluation
Pipeline Integration:
- generate.py adds audio configuration options and drop_fields_from_messages() to remove base64 audio data from output files
- Multiple dataset configs updated with should_enable_audio=true flag to enable audio processing
Architecture
The audio input flow follows this pattern:
- Check if audio exceeds duration threshold
- If yes: chunk audio → process each chunk → aggregate results
- If no: encode entire audio to base64 → process normally
- Audio responses from server are saved to disk and base64 data replaced with file paths
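The save-and-replace step at the end of this flow can be sketched like this (function name and directory layout are assumptions based on the description above):

```python
import base64
import os

def save_audio_response(b64_data, output_dir, filename):
    """Decode base64 audio from a server response and save it under output_dir/audio/."""
    audio_dir = os.path.join(output_dir, "audio")
    os.makedirs(audio_dir, exist_ok=True)
    path = os.path.join(audio_dir, filename)
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_data))
    # The result dict would then carry this path instead of the base64 payload
    return path
```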
Confidence Score: 2/5
- Has critical bugs that will cause runtime failures and test failures
- Found P0 IndexError bug in string splitting (line 367-370), P1 implementation-documentation mismatch causing test failures, P2 shallow copy issue, and P2 fallback logic ambiguity. The IndexError could trigger during chunk generation post-processing, and the /no_think test will definitely fail.
- nemo_skills/inference/model/vllm_multimodal.py (has critical bugs in _postprocess_chunk_generation and _preprocess_messages_for_model), nemo_skills/evaluation/metrics/mmau_pro_metrics.py (ambiguous fallback logic)
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/inference/model/vllm_multimodal.py | 2/5 | Added audio input processing (chunking, base64 encoding) and enhanced output handling. Found critical bugs: _preprocess_messages_for_model doesn't strip /no_think as documented and tested, potential IndexError in _postprocess_chunk_generation, and shallow copy issue in _generate_with_chunking. |
| nemo_skills/inference/model/audio_utils.py | 5/5 | New utility module for audio processing (base64 encoding, chunking, file I/O). Implementation is clean and straightforward with proper resource cleanup. |
| nemo_skills/inference/generate.py | 4/5 | Added audio configuration parameters and drop_fields_from_messages() to reduce output size. Clean integration with minimal changes. |
| nemo_skills/evaluation/metrics/mmau_pro_metrics.py | 3/5 | Enhanced with multi-criteria scoring (1-5 scale) for open-ended questions. Found logic bug in extract_multicriteria_scores where fallback condition will not distinguish between actual score of 3.0 and missing score. |
| nemo_skills/evaluation/evaluator/audio.py | 4/5 | Added strip_helpful_prefixes() to remove LLM-generated prefixes from transcriptions. Implementation looks sound with proper fallback strategies. |
| tests/test_vllm_audio.py | 3/5 | New unit tests for audio utilities and multimodal model. Test expects /no_think stripping behavior that the implementation doesn't provide, exposing a bug. |
Sequence Diagram
sequenceDiagram
participant User
participant GenerationTask
participant VLLMMultimodalModel
participant AudioUtils
participant VLLMServer
User->>GenerationTask: Run inference with audio
GenerationTask->>GenerationTask: Configure audio settings (chunking, etc)
GenerationTask->>VLLMMultimodalModel: Initialize model with audio config
alt Audio chunking needed
VLLMMultimodalModel->>AudioUtils: load_audio_file()
AudioUtils-->>VLLMMultimodalModel: audio_array, sampling_rate
VLLMMultimodalModel->>AudioUtils: chunk_audio()
AudioUtils-->>VLLMMultimodalModel: List of chunks
loop For each chunk
VLLMMultimodalModel->>AudioUtils: save_audio_chunk_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64_audio
VLLMMultimodalModel->>VLLMMultimodalModel: content_text_to_list() [inline]
VLLMMultimodalModel->>VLLMMultimodalModel: _preprocess_messages_for_model()
VLLMMultimodalModel->>VLLMServer: send chunk request
VLLMServer-->>VLLMMultimodalModel: chunk response
VLLMMultimodalModel->>VLLMMultimodalModel: _postprocess_chunk_generation()
end
VLLMMultimodalModel->>VLLMMultimodalModel: Aggregate chunk results
else No chunking
VLLMMultimodalModel->>AudioUtils: audio_file_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64_audio
VLLMMultimodalModel->>VLLMMultimodalModel: content_text_to_list()
VLLMMultimodalModel->>VLLMMultimodalModel: _preprocess_messages_for_model()
VLLMMultimodalModel->>VLLMServer: send request
VLLMServer-->>VLLMMultimodalModel: response
end
VLLMMultimodalModel->>VLLMMultimodalModel: _parse_chat_completion_response()
alt Audio in response
VLLMMultimodalModel->>VLLMMultimodalModel: _process_audio_response()
VLLMMultimodalModel->>VLLMMultimodalModel: Save audio to disk
end
VLLMMultimodalModel-->>GenerationTask: result with generation
GenerationTask->>GenerationTask: drop_fields_from_messages()
GenerationTask-->>User: Final output
Greptile Summary
This PR adds comprehensive audio input/output support to the vLLM multimodal inference pipeline, enabling audio processing for benchmarks like MMAU-Pro, AudioBench, and ASR-Leaderboard.
Key Changes:
Architecture:
Previous Thread Issues Addressed:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Generate
participant VLLMMultimodalModel
participant AudioUtils
participant VLLMServer
User->>Generate: Input with audio path
Generate->>Generate: Check enable_audio flag
Generate->>Generate: Auto-switch vllm to vllm_multimodal if needed
Generate->>VLLMMultimodalModel: Initialize with audio config
alt Audio chunking enabled
VLLMMultimodalModel->>VLLMMultimodalModel: _needs_audio_chunking()
VLLMMultimodalModel->>AudioUtils: load_audio_file()
AudioUtils-->>VLLMMultimodalModel: audio_array, sampling_rate
VLLMMultimodalModel->>VLLMMultimodalModel: Check duration > threshold
alt Duration exceeds threshold
VLLMMultimodalModel->>AudioUtils: chunk_audio()
AudioUtils-->>VLLMMultimodalModel: List of audio chunks
loop For each chunk
VLLMMultimodalModel->>AudioUtils: save_audio_chunk_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64 encoded chunk
VLLMMultimodalModel->>VLLMMultimodalModel: content_text_to_list()
VLLMMultimodalModel->>VLLMMultimodalModel: _preprocess_messages_for_model()
VLLMMultimodalModel->>VLLMServer: generate_async() with chunk
VLLMServer-->>VLLMMultimodalModel: Chunk generation result
VLLMMultimodalModel->>VLLMMultimodalModel: Accumulate statistics
end
VLLMMultimodalModel->>VLLMMultimodalModel: Aggregate chunk results
VLLMMultimodalModel-->>Generate: Final result with summed stats
end
end
alt No chunking needed
VLLMMultimodalModel->>AudioUtils: audio_file_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64 encoded audio
VLLMMultimodalModel->>VLLMMultimodalModel: content_text_to_list()
Note over VLLMMultimodalModel: Audio items prepended before text
VLLMMultimodalModel->>VLLMMultimodalModel: _preprocess_messages_for_model()
VLLMMultimodalModel->>VLLMServer: generate_async() with full audio
VLLMServer-->>VLLMMultimodalModel: Generation result
alt Response contains audio output
VLLMMultimodalModel->>VLLMMultimodalModel: _process_audio_response()
VLLMMultimodalModel->>VLLMMultimodalModel: Save audio to output_dir/audio/
VLLMMultimodalModel->>VLLMMultimodalModel: Replace base64 with file path
end
VLLMMultimodalModel-->>Generate: Final result
end
Generate->>Generate: Drop content_types if configured
Generate-->>User: Output with generation
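The content-list shape produced by the content_text_to_list step in the diagram can be sketched as follows. This is a minimal illustration: the `audio_url` type and the `data:audio/wav;base64,` prefix match the assertions quoted later from tests/test_vllm_audio.py, while the helper name itself is hypothetical.

```python
import base64

def make_audio_message(audio_bytes: bytes, text: str) -> dict:
    """Build a chat message whose content list carries the audio item
    before the text item, as the diagram's note describes."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{b64}"}},
            {"type": "text", "text": text},
        ],
    }

msg = make_audio_message(b"RIFF....WAVEfmt ", "Transcribe this audio")
```

The audio item is deliberately placed first so that the model sees the audio context before the instruction text.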
Additional Comments (1)
Consider a more precise matching strategy that preserves word boundaries:
@melllinia Can you please check the mmau-pro part?
Kipok
left a comment
thanks, just a few minor comments! I didn't check the underlying logic for audio processing too much; if you need a thorough review there, please add relevant people as reviewers.
Testing is WIP.
Additional Comments (6)
```python
if not audio_duration or audio_duration <= 0:
    return {
        "hallucination_rate": 0.0,
        "char_rate": 0.0,
        "is_correct": True,  # ← This should be False
        "error": "missing_audio_duration",
    }
```

When evaluation cannot be performed due to missing data, the sample should not be counted as correct.
For example, if reference is "Hello World Test" and hypothesis is "Hello World", the function returns 0.0 cap_accuracy even though 2 out of 2 words in the hypothesis are correctly capitalized. A better approach would be to check capitalization for the overlapping portion:

```python
min_len = min(len(ref_words), len(hyp_words))
if min_len > 0:
    cap_matches = sum(
        1 for r, h in zip(ref_words[:min_len], hyp_words[:min_len])
        if r and h and r[0].isupper() == h[0].isupper()
    )
    cap_accuracy = cap_matches / min_len
else:
    cap_accuracy = 0.0
```
However, if reference is empty but hypothesis is not (or vice versa), this creates artificial WER. For example:

This distorts metrics. Better approach:

```python
if not ref and not hyp:
    return {"wer": 0.0, "is_correct": True}   # Both empty = perfect match
if not ref or not hyp:
    return {"wer": 1.0, "is_correct": False}  # One empty = complete mismatch
```
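For completeness, a standalone sketch of a WER computation that applies these empty-string guards. The edit distance here is a plain word-level Levenshtein implementation; the real evaluator uses jiwer.

```python
def word_error_rate(ref: str, hyp: str) -> dict:
    """Word error rate with explicit empty-string handling, as suggested above."""
    ref_words, hyp_words = ref.split(), hyp.split()
    if not ref_words and not hyp_words:
        return {"wer": 0.0, "is_correct": True}   # both empty = perfect match
    if not ref_words or not hyp_words:
        return {"wer": 1.0, "is_correct": False}  # one empty = complete mismatch
    # standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    wer = d[-1][-1] / len(ref_words)
    return {"wer": wer, "is_correct": wer == 0.0}
```

With the guards in place, two empty strings score a perfect match instead of an artificial error.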
This was already identified in a previous review comment. The function needs word boundary checking:

```python
def has_postscript(text, marker):
    # Remove punctuation from marker for flexible matching
    clean_marker = re.sub(r'[^a-zA-Z\s]', '', marker).strip()
    # Use word boundaries to avoid substring matches
    pattern = r'\b' + re.escape(clean_marker) + r'\b'
    return bool(re.search(pattern, text, re.IGNORECASE))
```
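A quick sanity check of the word-boundary behavior, running the suggested function as written (example inputs are illustrative):

```python
import re

def has_postscript(text, marker):
    # Remove punctuation from marker for flexible matching
    clean_marker = re.sub(r'[^a-zA-Z\s]', '', marker).strip()
    # Use word boundaries to avoid substring matches
    pattern = r'\b' + re.escape(clean_marker) + r'\b'
    return bool(re.search(pattern, text, re.IGNORECASE))

has_postscript("I will attend the meeting.", "tend")   # -> False (substring only)
has_postscript("P.S. See you tomorrow.", "see you!")   # -> True (punctuation stripped)
```

The `\b` anchors are what prevent "tend" from matching inside "attend", which a plain substring check would accept.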
Better implementation:

```python
if "serialized_output" in result:
    for item in result["serialized_output"]:
        if isinstance(item, dict) and "audio" in item:
            # Remove entire audio data to save space, keep only metadata
            if isinstance(item["audio"], dict) and "data" in item["audio"]:
                # Keep only essential metadata
                item["audio"] = {
                    k: v for k, v in item["audio"].items()
                    if k != "data"
                }
```
For example, in a 100-character text:
A better approach would be:
Additionally, the code counts each punctuation mark only once (line 403).
e7e27c7 to 79978f3
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@coderabbitai help
✅ Actions performed

Full review triggered.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 69-84: The sequential quote-extraction on the variable result
(using re.search(r'"((?:\\.|[^"\\])*)"'), the colon-quote checks ":'" / ": '",
and re.search(r"'(.*)'")) can over-strip nested quotes (e.g., "He said:
'hello'"); change the flow so only one extraction applies: after a successful
double-quote match, or after a colon-quote branch, or after a single-quote
match, immediately accept that extraction and skip the remaining handlers
(either by converting the blocks into an if/elif chain or by returning the
extracted result from the surrounding function), and add a short comment
documenting the chosen precedence order for clarity.
- Line 475: The code uses direct dict access generation =
sample["generation"].strip() which can raise KeyError if "generation" is
missing; change this to defensively access the key (e.g., use
sample.get("generation", "") and ensure you handle None/empty string before
calling .strip()) so malformed samples won't crash the evaluator; update the
occurrence(s) of generation = sample["generation"].strip() to use the safe
lookup and preserve existing behavior when the field is present.
In `@nemo_skills/evaluation/metrics/mmau_pro_metrics.py`:
- Around line 140-145: The conversion from a 1-5 scale to percent in the block
that computes avg_score and std_score and writes to agg_metrics (keys like
f"avg_{criterion}" and f"std_{criterion}") is wrong: (score / 5.0) * 100 maps
1→20 not 0. If you intend 0–100, change the conversion to ((value - 1) / 4.0) *
100 for both avg_score and std_score; otherwise update the comment to state that
the mapping yields 20–100. Ensure you adjust both assignments that set
agg_metrics[f"avg_{criterion}"] and agg_metrics[f"std_{criterion}"] (using the
local names avg_score, std_score, and scores).
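The intended affine rescaling can be written out explicitly (a sketch; the function name is illustrative):

```python
def scale_to_percent(score: float) -> float:
    """Map a 1-5 judge score onto 0-100: 1 -> 0, 3 -> 50, 5 -> 100.
    The naive score / 5.0 * 100 would map 1 -> 20, as noted above."""
    return (score - 1) / 4.0 * 100
```

Note that this shift-and-scale applies to score values themselves; a spread measure like the standard deviation would only be scaled, not shifted.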
In `@nemo_skills/inference/model/audio_utils.py`:
- Around line 59-96: In chunk_audio, add input validation to ensure
sampling_rate and chunk_duration_sec are positive (e.g., > 0) before computing
chunk_samples; if not, raise a ValueError with a clear message. Specifically
validate the sampling_rate and chunk_duration_sec parameters at the start of
chunk_audio (and/or assert chunk_samples > 0) to avoid chunk_samples being zero
or negative and subsequent infinite/invalid loops or divide-by-zero issues, then
proceed with existing logic that computes chunk_samples and min_chunk_samples
and builds chunks.
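A minimal sketch of chunk_audio with the requested validation. The signature is simplified, and the real implementation also merges a too-short trailing chunk via min_chunk_samples, which is omitted here.

```python
def chunk_audio(audio, sampling_rate, chunk_duration_sec=30.0):
    """Split a 1-D sample sequence into fixed-duration chunks,
    validating parameters up front so chunk_samples can never be <= 0."""
    if sampling_rate <= 0:
        raise ValueError(f"sampling_rate must be positive, got {sampling_rate}")
    if chunk_duration_sec <= 0:
        raise ValueError(f"chunk_duration_sec must be positive, got {chunk_duration_sec}")
    chunk_samples = int(sampling_rate * chunk_duration_sec)
    return [audio[i:i + chunk_samples] for i in range(0, len(audio), chunk_samples)]
```

Without the guards, a zero `chunk_samples` would make the slicing step degenerate, which is exactly the failure mode the comment above describes.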
♻️ Duplicate comments (8)
nemo_skills/inference/model/audio_utils.py (1)
114-128: Potential resource leak if an exception occurs before the try block.

If an exception occurs between creating the temp file (line 114) and entering the try block (line 117), the file won't be cleaned up.
Safer structure
```diff
-    # Create temporary file
-    tmp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
-    tmp_path = tmp_file.name
-
-    try:
-        tmp_file.close()
+    tmp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+    tmp_path = tmp_file.name
+    tmp_file.close()
+
+    try:
         sf.write(tmp_path, audio_chunk, sampling_rate)
```

This ensures `tmp_file.close()` completes before entering the try block, and any failure in `sf.write` or later is properly caught with cleanup.

tests/test_vllm_audio.py (1)
tmp_file.close()completes before entering the try block, and any failure insf.writeor later is properly caught with cleanup.tests/test_vllm_audio.py (1)
45-57: Tighten the__init__patch to suppress lint warnings.The lambda triggers ARG005 because
selfandkwargsare unused. Use*_args, **_kwargsto indicate intentional non-use.♻️ Proposed fix
- with patch.object(VLLMMultimodalModel, "__init__", lambda self, **kwargs: None): + with patch.object(VLLMMultimodalModel, "__init__", lambda *_args, **_kwargs: None):tests/gpu-tests/test_vllm_audio.py (2)
55-68: Avoid `shell=True` with string commands.

Using `shell=True` with a string command introduces potential command injection risks and is flagged by security linters. Convert to an argument list.

♻️ Proposed fix using argument list

```diff
-    cmd = (
-        f"ns generate "
-        f" --cluster test-local --config_dir {Path(__file__).absolute().parent} "
-        f" --model {model_path} "
-        f" --output_dir {output_dir} "
-        f" --server_type vllm_multimodal "
-        f" --server_gpus 1 "
-        f" --server_nodes 1 "
-        f" --server_args '--enforce-eager' "
-        f" --input_file={input_file} "
-        f" ++prompt_format=openai "
-        f" ++skip_filled=False "
-    )
-    subprocess.run(cmd, shell=True, check=True)
+    cmd = [
+        "ns", "generate",
+        "--cluster", "test-local",
+        "--config_dir", str(Path(__file__).absolute().parent),
+        "--model", model_path,
+        "--output_dir", output_dir,
+        "--server_type", "vllm_multimodal",
+        "--server_gpus", "1",
+        "--server_nodes", "1",
+        "--server_args", "--enforce-eager",
+        f"--input_file={input_file}",
+        "++prompt_format=openai",
+        "++skip_filled=False",
+    ]
+    subprocess.run(cmd, check=True)
```
38-52: Add existence check for CI-specific audio files.

The hardcoded paths (`/nemo_run/code/tests/slurm-tests/asr_nim/wavs/...`) are CI-specific. Consider adding a guard to skip the test when files are missing.

♻️ Proposed fix with existence check

```diff
     # Create test input file with audio
+    audio_files = [
+        "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t2_16.wav",
+        "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t3_16.wav",
+    ]
+    for audio_file in audio_files:
+        if not Path(audio_file).exists():
+            pytest.skip(f"Test audio file not found: {audio_file}")
+
     with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f:
         test_data = [
             {
                 "problem": "Transcribe this audio",
-                "audio": {"path": "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t2_16.wav"},
+                "audio": {"path": audio_files[0]},
             },
             {
                 "problem": "What is in this audio?",
-                "audio": {"path": "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t3_16.wav"},
+                "audio": {"path": audio_files[1]},
             },
         ]
```

nemo_skills/inference/model/vllm_multimodal.py (1)
184-197: Path traversal vulnerability when resolving audio paths.

Using `os.path.join(self.data_dir, audio["path"])` allows directory traversal if `audio["path"]` contains `../` sequences. A malicious input could read arbitrary files on the system.

Consider validating that the resolved path stays within `data_dir`:

🔒 Proposed fix to prevent path traversal

```diff
+    def _resolve_safe_path(self, relative_path: str) -> str:
+        """Resolve path ensuring it stays within data_dir."""
+        resolved = os.path.realpath(os.path.join(self.data_dir, relative_path))
+        data_dir_resolved = os.path.realpath(self.data_dir)
+        if not resolved.startswith(data_dir_resolved + os.sep):
+            raise ValueError(f"Path traversal detected: {relative_path}")
+        return resolved

         if "audio" in message:
             audio = message["audio"]
-            audio_path = os.path.join(self.data_dir, audio["path"])
+            audio_path = self._resolve_safe_path(audio["path"])
             base64_audio = audio_file_to_base64(audio_path)
```

nemo_skills/evaluation/evaluator/audio.py (2)
183-217: Digit-to-word mapping is incomplete for compound numbers.

As noted in prior reviews, this handles 0-20 and decade multiples (30, 40, etc.) but not compound numbers like 21-29, 31-39. This is acceptable if documented, but numbers like "25" will remain as digits after normalization, causing WER discrepancies when references spell them out.

287-311: Verify the `whisper_normalizer` dependency is declared.

The import of `EnglishTextNormalizer` from `whisper_normalizer.english` requires the `whisper_normalizer` package. Per prior discussion, this was addressed in PR #1140; please confirm that dependency is merged or included in this PR's requirements.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)
133-134: Add a comment explaining the division by 2.0.

This magic number was flagged in previous reviews but still lacks explanation. Please document why `no_answer` is divided by 2.0 to prevent confusion for future maintainers.
🧹 Nitpick comments (10)
nemo_skills/inference/model/vllm_multimodal.py (3)

218-218: Fix implicit `Optional` type annotation.

Per PEP 484, `task_type: str = None` should be `task_type: str | None = None` to explicitly indicate the parameter accepts `None`.

♻️ Proposed fix

```diff
-    def _needs_audio_chunking(self, messages: list[dict], task_type: str = None) -> tuple[bool, str, float]:
+    def _needs_audio_chunking(self, messages: list[dict], task_type: str | None = None) -> tuple[bool, str, float]:
```

357-357: Fix implicit `Optional` type annotation.

Same issue as `_needs_audio_chunking`: explicitly annotate as `str | None`.

♻️ Proposed fix

```diff
-        task_type: str = None,
+        task_type: str | None = None,
```

294-294: Consider using `chunk_idx` or renaming to indicate non-use.

The loop variable `chunk_idx` is declared but unused. Either use it for logging (e.g., `LOG.debug(f"Processing chunk {chunk_idx + 1}/{len(chunks)}")`) or rename to `_chunk_idx` to suppress the lint warning.

tests/test_vllm_audio.py (2)
73-78: Add assertion to verify the original `audio` field was removed.

The test verifies the content structure but doesn't check that the original `audio` key was deleted after conversion (per `vllm_multimodal.py` line 190).

♻️ Proposed fix

```diff
     assert result["content"][0]["type"] == "audio_url"
     assert result["content"][0]["audio_url"]["url"].startswith("data:audio/wav;base64,")
     assert result["content"][1]["type"] == "text"
+    assert "audio" not in result  # Verify original field was removed
```

100-105: Add assertion to verify the original `audios` field was removed.

Similar to the single audio test, verify that the `audios` key was deleted after conversion.

♻️ Proposed fix

```diff
     assert result["content"][0]["type"] == "audio_url"
     assert result["content"][1]["type"] == "audio_url"
     assert result["content"][2]["type"] == "text"
+    assert "audios" not in result  # Verify original field was removed
```

nemo_skills/evaluation/evaluator/audio.py (3)
42-48: Consider using compiled regex patterns for performance.

These patterns are used on every sample evaluation. Pre-compiling them at module load would avoid repeated compilation overhead.

♻️ Suggested improvement

```diff
-_FAILURE_RESPONSES = [
-    r"the speech is in audio format and needs to be transcribed",
-    r"i do not have access to audio",
-    r"i cannot access audio",
-    r"i'm sorry.*i do not have access",
-    r"as an ai language model.*i do not have access",
-]
+_FAILURE_RESPONSES = [
+    re.compile(r"the speech is in audio format and needs to be transcribed", re.IGNORECASE),
+    re.compile(r"i do not have access to audio", re.IGNORECASE),
+    re.compile(r"i cannot access audio", re.IGNORECASE),
+    re.compile(r"i'm sorry.*i do not have access", re.IGNORECASE),
+    re.compile(r"as an ai language model.*i do not have access", re.IGNORECASE),
+]
```

Then update usage:

```diff
     for failure_pattern in _FAILURE_RESPONSES:
-        if re.search(failure_pattern, result, flags=re.IGNORECASE):
+        if failure_pattern.search(result):
             return ""
```

220-248: Contraction expansion may conflict with jiwer's `ExpandCommonEnglishContractions`.

In `preprocess_asr_text` for "audiobench" mode (lines 297-304), `_expand_contractions` is called before `jiwer.ExpandCommonEnglishContractions()`. Since both expand contractions, the jiwer transform becomes a no-op. This is harmless but redundant; consider removing one.

271-274: Static analysis: Consider a custom exception class for invalid normalization modes.

Per TRY003, long messages in `raise ValueError(...)` are better encapsulated in a custom exception. However, this is a minor style concern and acceptable for a configuration validation error.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (2)
110-123: Good fix for the `None.strip()` crash.

Line 114 now safely handles missing/None generation values.

Note: `extract_multicriteria_scores` may be called twice per prediction with a judgement, once in `_get_score_dict` (via the parent's `update`) and again in lines 121-123. If performance becomes a concern, consider caching the parsed result.

129-129: Unused loop variable.

`agg_mode` is not used within the loop body.

Suggested fix

```diff
-        for agg_mode, agg_metrics in metrics_dict.items():
+        for _agg_mode, agg_metrics in metrics_dict.items():
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- nemo_skills/dataset/asr-leaderboard/__init__.py
- nemo_skills/dataset/audiobench/judge/__init__.py
- nemo_skills/dataset/audiobench/nonjudge/__init__.py
- nemo_skills/dataset/librispeech-pc/__init__.py
- nemo_skills/dataset/mmau-pro/closed_form/__init__.py
- nemo_skills/dataset/mmau-pro/open_ended/__init__.py
- nemo_skills/evaluation/evaluator/audio.py
- nemo_skills/evaluation/evaluator/mmau_pro.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/mmau_pro_metrics.py
- nemo_skills/inference/generate.py
- nemo_skills/inference/model/__init__.py
- nemo_skills/inference/model/audio_utils.py
- nemo_skills/inference/model/vllm_multimodal.py
- nemo_skills/prompt/config/judge/mmau-pro.yaml
- nemo_skills/prompt/config/judge/speechlm.yaml
- tests/gpu-tests/test_vllm_audio.py
- tests/test_vllm_audio.py
💤 Files with no reviewable changes (1)
- nemo_skills/prompt/config/judge/speechlm.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
Applied to files:
- nemo_skills/dataset/mmau-pro/open_ended/__init__.py
- nemo_skills/dataset/audiobench/nonjudge/__init__.py
- nemo_skills/dataset/mmau-pro/closed_form/__init__.py
- nemo_skills/prompt/config/judge/mmau-pro.yaml
- nemo_skills/dataset/audiobench/judge/__init__.py
- nemo_skills/evaluation/metrics/mmau_pro_metrics.py
🧬 Code graph analysis (7)
tests/gpu-tests/test_vllm_audio.py (1)
tests/gpu-tests/utils.py (1)
require_env_var(18-23)
nemo_skills/inference/model/audio_utils.py (1)
nemo_skills/utils.py (1)
get_logger_name(39-43)
tests/test_vllm_audio.py (2)
nemo_skills/inference/model/audio_utils.py (1)
- audio_file_to_base64(30-41)

nemo_skills/inference/model/vllm_multimodal.py (3)

- content_text_to_list(157-203)
- _preprocess_messages_for_model(205-216)
- _needs_audio_chunking(218-256)
nemo_skills/evaluation/evaluator/audio.py (1)
nemo_skills/evaluation/metrics/audio_metrics.py (1)
update(177-228)
nemo_skills/inference/model/__init__.py (1)
nemo_skills/inference/model/audio_utils.py (5)
- audio_file_to_base64(30-41)
- chunk_audio(59-96)
- load_audio_file(44-56)
- make_audio_content_block(133-150)
- save_audio_chunk_to_base64(99-130)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/audio_metrics.py (1)
AudioMetrics(43-352)
nemo_skills/inference/model/vllm_multimodal.py (1)
nemo_skills/inference/model/audio_utils.py (4)
- audio_file_to_base64(30-41)
- chunk_audio(59-96)
- load_audio_file(44-56)
- save_audio_chunk_to_base64(99-130)
🪛 Ruff (0.14.11)
tests/gpu-tests/test_vllm_audio.py
33-33: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"
(S108)
68-68: subprocess call with shell=True identified, security issue
(S602)
nemo_skills/inference/model/audio_utils.py
78-80: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
129-129: Loop control variable agg_mode not used within loop body
Rename unused agg_mode to _agg_mode
(B007)
tests/test_vllm_audio.py
48-48: Unused lambda argument: self
(ARG005)
48-48: Unused lambda argument: kwargs
(ARG005)
nemo_skills/evaluation/evaluator/audio.py
272-274: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/inference/model/vllm_multimodal.py
180-180: Avoid specifying long messages outside the exception class
(TRY003)
218-218: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
284-284: Avoid specifying long messages outside the exception class
(TRY003)
294-294: Loop control variable chunk_idx not used within loop body
Rename unused chunk_idx to _chunk_idx
(B007)
310-312: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
340-340: Avoid specifying long messages outside the exception class
(TRY003)
357-357: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (26)
nemo_skills/evaluation/metrics/map_metrics.py (1)
53-54: LGTM! Clean alias addition.

The new `"speechlm"` alias correctly maps to `AudioMetrics`, consistent with the existing `"audio"` entry. This provides flexibility for configurations that reference the metric type by either name.

nemo_skills/dataset/librispeech-pc/__init__.py (1)
28-29: LGTM!

The configuration changes correctly enable audio support for LibriSpeech-PC. The `++enable_audio=true` flag and trailing whitespace cleanup align with the standardization across other dataset configurations in this PR.

nemo_skills/dataset/audiobench/nonjudge/__init__.py (1)
28-31: LGTM!

The updated `EVAL_ARGS` with `normalization_mode=audiobench` and `GENERATION_ARGS` with `enable_audio=true` correctly configure this dataset for audio-enabled evaluation and generation, consistent with other AudioBench configurations in the PR.

nemo_skills/dataset/audiobench/judge/__init__.py (1)
29-30: LGTM!

The configuration aligns with the nonjudge variant and correctly enables audio support with the AudioBench-specific normalization mode for judge-based evaluation tasks.
nemo_skills/dataset/asr-leaderboard/__init__.py (1)
20-21: LGTM!

The `enable_audio=true` flag naming aligns with the agreed convention from previous review feedback. The `hf_leaderboard` normalization mode is appropriate per the file's documentation about HuggingFace leaderboard preprocessing.

nemo_skills/inference/model/audio_utils.py (3)
30-41: LGTM with upstream validation assumption.

The function correctly encodes audio files to base64. Ensure callers that construct `audio_file_path` from user input validate paths to prevent traversal attacks (e.g., the resolved path must stay under an expected base directory).

44-56: LGTM!

Clean wrapper with lazy import for `soundfile`.

133-150: LGTM!

Clean implementation supporting both vLLM/Qwen and OpenAI/NVIDIA API audio content formats.
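The two content-block shapes such a helper supports might look like this. The vLLM/Qwen `audio_url` shape matches this PR's tests; the OpenAI-style `input_audio` shape is an assumption based on the OpenAI chat format, and the function body is a sketch, not the actual implementation.

```python
def make_audio_content_block(base64_audio: str, api_format: str = "vllm") -> dict:
    """Return an audio content block in one of two shapes (sketch)."""
    if api_format == "vllm":
        # vLLM/Qwen style: a data URL under audio_url
        return {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
    if api_format == "openai":
        # OpenAI style: raw base64 plus an explicit container format
        return {"type": "input_audio", "input_audio": {"data": base64_audio, "format": "wav"}}
    raise ValueError(f"unknown api_format: {api_format}")
```

Keeping both shapes behind one helper lets the rest of the pipeline stay format-agnostic.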
nemo_skills/inference/model/__init__.py (1)
22-30: LGTM - Audio utility exports are well-organized.

The new audio utilities are properly imported and re-exported from the module's public interface. The imports are placed in alphabetical order and the functions (`audio_file_to_base64`, `chunk_audio`, `load_audio_file`, `make_audio_content_block`, `save_audio_chunk_to_base64`) align with the implementations in `audio_utils.py`.

nemo_skills/inference/model/vllm_multimodal.py (2)
353-387: LGTM - `generate_async` implementation properly handles audio input.

The method correctly:

- Checks if chunking is needed via `_needs_audio_chunking`
- Delegates to `_generate_with_chunking` for long audio
- Converts audio fields to base64 format for non-chunked audio
- Falls back to the parent's `generate_async` for final processing

The preprocessing order (content_text_to_list → _preprocess_messages_for_model) is consistent with the chunking path.
286-351: Good implementation of chunked audio generation with proper statistics aggregation.

The cumulative statistics tracking (lines 289-292, 328-331, 347-349) correctly sums `input_tokens`, `generated_tokens`, and `time_elapsed` across all chunks, addressing the concern about only preserving last-chunk metadata.

nemo_skills/inference/generate.py (3)
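The summation described here can be sketched as follows (field names are taken from this comment; the helper is a simplified stand-in for the real per-chunk accumulation):

```python
def aggregate_chunk_stats(chunk_results: list[dict]) -> dict:
    """Sum per-chunk usage statistics across all audio chunks."""
    totals = {"input_tokens": 0, "generated_tokens": 0, "time_elapsed": 0.0}
    for result in chunk_results:
        for key in totals:
            totals[key] += result.get(key, 0)
    return totals
```

Summing rather than overwriting is what guarantees the final result reflects the whole audio file, not just the last chunk.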
208-215: Audio configuration fields are well-documented.

The new configuration options (`drop_content_types`, `enable_audio`, `enable_audio_chunking`, `audio_chunk_task_types`, `chunk_audio_threshold_sec`) provide good control over audio processing behavior. The defaults are sensible for typical workflows.
417-434: Audio processing setup looks correct.

The logic properly:
- Detects when audio processing is needed based on
`enable_audio` or `vllm_multimodal` server type
- Auto-switches to
`vllm_multimodal` with an appropriate warning
- Passes audio chunking configuration to the server
The warning message at line 425 is helpful for users who may be surprised by the auto-switch behavior.
592-609: Clean implementation of content type filtering.

The
`drop_fields_from_messages` method correctly:
- Handles missing
`messages` key gracefully
- Skips non-list content (e.g., string content in system messages)
- Filters out specified content types to reduce output file size
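A minimal sketch consistent with the three behaviors listed above (a simplified stand-in for the actual method; the function and field names follow this review comment):

```python
def drop_fields_from_messages(output: dict, drop_content_types: set[str]) -> dict:
    """Remove selected content types (e.g. bulky base64 audio) from each
    message's content list; string content and a missing key pass through."""
    for message in output.get("messages", []):  # missing "messages" -> no-op
        content = message.get("content")
        if isinstance(content, list):  # skip plain-string content
            message["content"] = [
                item for item in content if item.get("type") not in drop_content_types
            ]
    return output
```

Dropping `audio_url` items before writing results is what keeps output files from ballooning with base64 payloads.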
tests/test_vllm_audio.py (1)
108-155: Good test coverage for edge cases and chunking behavior.The tests appropriately cover:
- Messages without audio returning unchanged
/no_thinksuffix preservation in system messages- Chunking disabled behavior
- Task type filtering with graceful handling of non-existent files
tests/gpu-tests/test_vllm_audio.py (1)
70-85: Test validation and cleanup look appropriate.

The test properly validates output existence and content, and cleans up the temporary input file in the
`finally` block.

nemo_skills/evaluation/evaluator/audio.py (2)
37-38: LGTM! New configuration fields are well-designed.

The new
`strip_helpful_prefixes` and `normalization_mode` fields with sensible defaults provide good flexibility for different ASR evaluation scenarios.
495-517: LGTM! Normalization mode propagation is consistent.

The pattern
`config.normalization_mode if config.apply_whisper_normalization else "none"` correctly respects both the legacy flag and the new mode setting across ASR, ASR-PC, and ASR_LEADERBOARD task types.

nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1)
18-19: LGTM! Configuration updates align with the audio integration pattern.

The addition of
`++enable_audio=true` to `GENERATION_ARGS` and the new `EVAL_ARGS` for mmau-pro evaluation are consistent with similar updates across other dataset configurations in this PR.

nemo_skills/evaluation/evaluator/mmau_pro.py (1)
34-34: LGTM! Docstring correctly updated to reflect the new judge config.

The documentation now accurately points to
the `judge/mmau-pro` prompt config, which aligns with the new `mmau-pro.yaml` file added in this PR.

nemo_skills/dataset/mmau-pro/open_ended/__init__.py (1)
18-26: LGTM! Open-ended configuration correctly updated for audio and the new judge prompt.

The updates to
`GENERATION_ARGS` (enabling audio) and `JUDGE_ARGS` (pointing to the new `judge/mmau-pro` config) are consistent with the closed-form configuration and the new judge prompt file.

nemo_skills/prompt/config/judge/mmau-pro.yaml (1)
1-30: Well-structured judge prompt with proper decimal score handling.

The multi-criteria evaluation (Correctness, Relevance, Completeness, Clarity) with explicit formatting instructions produces parseable judgements. The OVERALL line correctly asks for an "average score," which may naturally be a decimal (e.g., 3.75). The extraction logic in
`mmau_pro_metrics.py` (lines 44-48) uses the regex pattern `\d+(?:\.\d+)?` on all criteria to explicitly capture both integer and decimal scores, converting them to float. This design properly supports the prompt's format and handles the averaging case correctly.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (4)
26-65: Good fix for the OVERALL score overwrite issue.

The
`found_overall` flag correctly distinguishes between a missing OVERALL score and an explicitly provided score of 3.0. This addresses the previously flagged logic bug. One minor consideration: the regex accepts any numeric value, so scores outside the expected 1-5 range would be accepted without validation. If malformed judge output is a concern, you could add bounds checking (clamping to 1-5).
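A standalone sketch of the extraction described here. The `\d+(?:\.\d+)?` pattern is quoted from the review; the missing-criterion fallback to None below is a simplification of the real found_overall handling.

```python
import re

CRITERIA = ("correctness", "relevance", "completeness", "clarity", "overall")

def extract_multicriteria_scores(judgement: str) -> dict:
    """Parse 'CRITERION: <number>' lines from a judge response.
    The pattern accepts both integer and decimal scores."""
    scores = {}
    for criterion in CRITERIA:
        match = re.search(rf"{criterion}\s*:\s*(\d+(?:\.\d+)?)", judgement, re.IGNORECASE)
        scores[criterion] = float(match.group(1)) if match else None
    return scores
```

Because every capture is converted with `float`, an OVERALL line such as "OVERALL: 3.75" parses the same way as an integer score.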
75-82: LGTM!

The `multicriteria_scores` structure is well-organized for accumulating per-criterion scores.
84-98: LGTM!

The scoring logic correctly differentiates between open-ended (judgement-based) and closed-form (binary is_correct) questions, with an appropriate threshold of >= 3.0 for correctness on the 1-5 scale.
170-195: LGTM!

The `metrics_to_print` method correctly extends the output to include multi-criteria metrics when available.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Additional Comments (1)
- nemo_skills/evaluation/evaluator/audio.py, lines 340-341 (link): logic: Undefined variables `text` and `pred_text` will cause a NameError. Should use `ref` and `hyp` instead.
18 files reviewed, 1 comment
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous @melllinia let me know when all looks good on your side and I can do another review and hopefully we can soon merge |
@Kipok Let's proceed. All looking good.
@Kipok Could you please take a look today? |
Kipok
left a comment
just a small comment, but otherwise lgtm!
… audio processing Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
This PR enhances the vLLM multimodal class with audio input, while maintaining its output capabilities.
Summary by CodeRabbit
New Features
Chores
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.