Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
📝 Walkthrough
Adds audio support across datasets, evaluation, inference, and metrics: new audio flags, normalization modes, audio utilities and chunking, VLLMMultimodal generation with chunked audio I/O, updated MMAU-Pro judge prompt and multi-criteria metrics, plus unit and GPU tests for audio flows.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Generate as Generate Pipeline
participant VLLM as VLLMMultimodalModel
participant AudioUtils
participant vLLMServer as vLLM Server
Client->>Generate: request with audio + enable_audio=true
Generate->>Generate: setup_llm() detects needs_audio -> selects vllm_multimodal
Generate->>VLLM: instantiate with audio chunking config
VLLM->>VLLM: _needs_audio_chunking(messages, task_type)
alt chunking needed
VLLM->>AudioUtils: load_audio_file(path)
AudioUtils-->>VLLM: (array, sr)
VLLM->>AudioUtils: chunk_audio(array, sr, threshold)
AudioUtils-->>VLLM: chunks[]
loop per chunk
VLLM->>AudioUtils: save_audio_chunk_to_base64(chunk)
AudioUtils-->>VLLM: base64_chunk
VLLM->>VLLM: content_text_to_list(messages + base64_chunk)
VLLM->>vLLMServer: generate_async(messages_chunk)
vLLMServer-->>VLLM: generation_result
end
VLLM->>VLLM: aggregate chunk results + metadata
else no chunking
VLLM->>VLLM: content_text_to_list(messages)
VLLM->>vLLMServer: generate_async(messages)
vLLMServer-->>VLLM: generation_result
end
VLLM->>Generate: return generation + audio metadata
Generate->>Generate: drop_fields_from_messages(output)
Generate-->>Client: final_output
sequenceDiagram
participant Evaluator
participant MMAU as MMAUProMetrics
participant Parser as extract_multicriteria_scores
Evaluator->>MMAU: update(prediction)
MMAU->>Parser: extract_multicriteria_scores(judgement_text)
Parser-->>MMAU: {correctness,relevance,completeness,clarity,overall}
MMAU->>MMAU: append scores to multicriteria_buffers
Evaluator->>MMAU: get_metrics()
MMAU->>MMAU: compute avg/std per criterion, scale to 0-100, derive good/poor rates
MMAU-->>Evaluator: aggregated_metrics
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Greptile Overview
Greptile Summary
This PR enhances the VLLM multimodal class with audio input capabilities while maintaining existing audio output functionality. The changes introduce audio preprocessing, chunking for long audio files, and integration with multiple audio benchmarks (MMAU-Pro, AudioBench, ASR-Leaderboard, LibriSpeech-PC).
Key Changes
Audio Input Support:
- Added audio_utils.py module with utilities for audio encoding, loading, and chunking
- Extended VLLMMultimodalModel with audio preprocessing (content_text_to_list) to convert audio file references to base64-encoded data URIs
- Implemented audio chunking for long files (>30s threshold) to handle model duration limits
- Added model-specific preprocessing (strips /no_think from system messages for Qwen compatibility)
Evaluation Updates:
- Migrated MMAU-Pro open-ended evaluation from binary Yes/No judgements to multi-criteria 1-5 scoring (correctness, relevance, completeness, clarity)
- Created new judge/mmau-pro.yaml prompt config with structured multi-criteria format
- Updated metrics to extract and report scores as percentages with good/poor response rate tracking
Configuration:
- Added should_enable_audio flag to enable audio processing in benchmark configs
- Introduced audio chunking configuration (enable_audio_chunking, chunk_audio_threshold_sec, audio_chunk_task_types)
- Implemented drop_content_types to remove base64 audio from output files (reduces file size)
Integration:
- Updated 6 dataset configs to enable audio processing
- Modified generate.py to automatically switch to vllm_multimodal server when audio is needed
Critical Issue
The generate_async method implementation in VLLMMultimodalModel is commented out (lines 412-445). This makes all audio input functionality completely non-functional. Without this method, audio files won't be converted to base64, chunking won't occur, and the model will receive raw audio file paths instead of processed audio data.
Confidence Score: 0/5
- This PR cannot be merged safely - the core audio input feature is completely non-functional due to commented-out code.
- The entire generate_async method in VLLMMultimodalModel is commented out, which is the entry point for all audio input processing. This means the primary feature this PR introduces (audio input with chunking) will not work at all. When users attempt to use audio inputs, the model will fall back to the parent class behavior without any audio preprocessing, causing failures. Additionally, there are multiple logic errors in the audio processing code that would cause runtime errors if uncommented (IndexError in split operations, potential issues with empty audios list).
- nemo_skills/inference/model/vllm_multimodal.py - commented-out generate_async method must be uncommented and logic errors fixed before merge
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/inference/model/vllm_multimodal.py | 0/5 | Added audio input support with chunking, but entire generate_async method is commented out, making the feature non-functional. Multiple logic issues in audio processing. |
| nemo_skills/inference/model/audio_utils.py | 5/5 | New utility module for audio processing (base64 encoding, chunking, loading). Clean implementation with proper error handling. |
| nemo_skills/inference/generate.py | 4/5 | Added audio config parameters and drop_fields_from_messages to remove base64 audio from output. Logic is sound but drops audio by default. |
| nemo_skills/evaluation/metrics/mmau_pro_metrics.py | 5/5 | Changed from binary Yes/No judgement to multi-criteria 1-5 scoring. Well-structured with proper score extraction and percentage conversion. |
| tests/test_vllm_audio.py | 4/5 | New unit tests for audio utilities and preprocessing. Tests won't catch the commented-out generate_async issue since they mock the model. |
| tests/gpu-tests/test_vllm_audio.py | 3/5 | New GPU integration test for audio generation. Will fail due to commented-out generate_async implementation. |
Sequence Diagram
sequenceDiagram
participant User
participant GenerationTask
participant VLLMMultimodalModel
participant AudioUtils
participant VLLMServer
participant FileSystem
Note over User,FileSystem: Audio INPUT Flow (when should_enable_audio=true)
User->>GenerationTask: Request generation with audio
GenerationTask->>GenerationTask: Check should_enable_audio flag
GenerationTask->>GenerationTask: Switch server_type to vllm_multimodal
GenerationTask->>VLLMMultimodalModel: Initialize with audio config
GenerationTask->>VLLMMultimodalModel: generate_async(messages with audio)
Note over VLLMMultimodalModel: ❌ BLOCKED: generate_async is commented out
Note over VLLMMultimodalModel: Expected flow (if uncommented):
VLLMMultimodalModel->>VLLMMultimodalModel: _needs_audio_chunking()
VLLMMultimodalModel->>FileSystem: Load audio file
VLLMMultimodalModel->>AudioUtils: load_audio_file(audio_path)
AudioUtils-->>VLLMMultimodalModel: audio_array, sampling_rate
alt Audio duration > threshold
VLLMMultimodalModel->>AudioUtils: chunk_audio()
AudioUtils-->>VLLMMultimodalModel: audio chunks
loop For each chunk
VLLMMultimodalModel->>AudioUtils: save_audio_chunk_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64_audio
VLLMMultimodalModel->>VLLMMultimodalModel: Build chunk messages
VLLMMultimodalModel->>VLLMServer: Generate with chunk
VLLMServer-->>VLLMMultimodalModel: chunk result
VLLMMultimodalModel->>VLLMMultimodalModel: _postprocess_chunk_generation()
end
VLLMMultimodalModel->>VLLMMultimodalModel: Aggregate results
else No chunking needed
VLLMMultimodalModel->>VLLMMultimodalModel: content_text_to_list()
VLLMMultimodalModel->>AudioUtils: audio_file_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64 audio
VLLMMultimodalModel->>VLLMMultimodalModel: Insert audio before text
VLLMMultimodalModel->>VLLMServer: Generate with full audio
VLLMServer-->>VLLMMultimodalModel: result
end
Note over User,FileSystem: Audio OUTPUT Flow
VLLMServer->>VLLMMultimodalModel: Response with audio data
VLLMMultimodalModel->>VLLMMultimodalModel: _parse_chat_completion_response()
VLLMMultimodalModel->>VLLMMultimodalModel: _process_audio_response()
VLLMMultimodalModel->>FileSystem: Save audio to output_dir/audio/
FileSystem-->>VLLMMultimodalModel: Audio file path
VLLMMultimodalModel-->>GenerationTask: Result with audio path
GenerationTask->>GenerationTask: drop_fields_from_messages()
Note over GenerationTask: Remove base64 audio_url from output
GenerationTask->>FileSystem: Write final output.jsonl
FileSystem-->>User: Results with audio references
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/inference/generate.py (1)
417-456: This relies on VLLMMultimodalModel.generate_async actually preprocessing audio; currently it doesn't.
As written, setup_llm will route audio workloads to vllm_multimodal, but the model's generate_async override (in nemo_skills/inference/model/vllm_multimodal.py) is currently commented out and returns None.
🤖 Fix all issues with AI agents
In @nemo_skills/evaluation/evaluator/audio.py:
- Line 127: The project now imports EnglishTextNormalizer (from
whisper_normalizer.english) in preprocess_asr_text, but whisper_normalizer is
missing from requirements/main.txt causing ModuleNotFoundError at runtime; add
the whisper_normalizer package to requirements/main.txt (pin a version if known
or use a compatible latest), run pip install -r requirements/main.txt, and
confirm the API (EnglishTextNormalizer) matches how preprocess_asr_text
constructs and calls it (constructor args and normalize methods) adjusting the
call sites in preprocess_asr_text if the package uses a different class or
method names.
In @nemo_skills/evaluation/metrics/mmau_pro_metrics.py:
- Around line 108-110: The list comprehension that builds predicted_answers in
update() calls .strip() on pred.get("generation", None), which crashes when
generation is None; change it to safely handle missing values by first
retrieving gen = pred.get("generation") and then setting answer = gen.strip() if
gen is not None else None, and convert empty strings to None (e.g., answer =
answer or None); update the predicted_answers assignment accordingly before
calling _compute_pass_at_k and _compute_majority_at_k so both methods receive
None for missing/blank generations.
- Around line 54-58: The code currently overwrites an explicit overall score of
3.0 by treating 3.0 as "missing"; change the fallback so it only computes
overall when the "overall" key is absent. In the block that uses scores and the
"overall" key (the snippet assigning scores["overall"]), remove the comparison
to 3.0 and use only if "overall" not in scores: (compute criteria_scores from
correctness/relevance/completeness/clarity and set scores["overall"]
accordingly) so an explicitly provided overall=3.0 is preserved.
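Both fixes above can be sketched as small helpers (names are illustrative, not the actual methods in mmau_pro_metrics.py):

```python
def safe_answer(pred):
    """Per the first fix: never call .strip() on a missing generation."""
    gen = pred.get("generation")
    if gen is None:
        return None
    return gen.strip() or None  # blank strings also become None

def fill_overall(scores):
    """Per the second fix: compute 'overall' only when the key is absent."""
    if "overall" not in scores:
        criteria = ("correctness", "relevance", "completeness", "clarity")
        vals = [scores.get(c, 3.0) for c in criteria]
        scores["overall"] = sum(vals) / len(vals)
    return scores
```

With `"overall" not in scores` as the guard, an explicit judge score of 3.0 survives instead of being recomputed from the other criteria.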
In @nemo_skills/inference/model/audio_utils.py:
- Around line 59-82: The chunk_audio function must validate inputs to avoid
divide-by-zero/zero-length chunks: inside chunk_audio (function name
chunk_audio) check that sampling_rate and chunk_duration_sec are > 0 and that
audio_array is not None/has positive length, and if any check fails raise a
ValueError with a clear message; compute chunk_samples only after validation and
ensure chunk_samples > 0 before using it (raise if not), so the subsequent
num_chunks = int(np.ceil(len(audio_array) / chunk_samples)) and loop cannot blow
up or produce zero-length chunks.
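A minimal sketch of the validated chunk_audio described above, using math.ceil in place of np.ceil so the example is dependency-free:

```python
import math

def chunk_audio(audio_array, sampling_rate, chunk_duration_sec):
    """Split audio into fixed-duration chunks, validating inputs up front."""
    if audio_array is None or len(audio_array) == 0:
        raise ValueError("audio_array must be non-empty")
    if sampling_rate <= 0 or chunk_duration_sec <= 0:
        raise ValueError("sampling_rate and chunk_duration_sec must be positive")
    chunk_samples = int(sampling_rate * chunk_duration_sec)
    if chunk_samples <= 0:
        raise ValueError("chunk_duration_sec too small for this sampling_rate")
    num_chunks = math.ceil(len(audio_array) / chunk_samples)
    return [
        audio_array[i * chunk_samples:(i + 1) * chunk_samples]
        for i in range(num_chunks)
    ]
```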
In @nemo_skills/inference/model/vllm_multimodal.py:
- Around line 156-200: The content_text_to_list function (and the similar code
block at lines 228-266) currently uses os.path.join(self.data_dir,
audio["path"]) which allows path traversal; fix by resolving the final path
using os.path.abspath/os.path.realpath (e.g., resolved =
os.path.realpath(os.path.join(self.data_dir, audio["path"]))) and verify it is
inside self.data_dir using os.path.commonpath or by ensuring
resolved.startswith(os.path.realpath(self.data_dir) + os.sep); if the check
fails, raise a ValueError/PermissionError instead of reading the file. Refactor
the resolution/check into a small helper like _resolve_data_path(path) and use
it wherever audio paths are handled.
- Around line 390-426: The generate_async method is currently commented out and
returns None; restore its implementation so it always returns the parent's dict
result: when prompt is a list, set messages=prompt, call
self._needs_audio_chunking(messages, task_type) and if needs_chunking return
await self._generate_with_chunking(messages, audio_path, duration,
tokens_to_generate, **kwargs); otherwise convert messages =
[self.content_text_to_list(msg.copy()) for msg in messages], then messages =
self._preprocess_messages_for_model(messages), set prompt=messages; finally call
and return await super().generate_async(prompt=prompt,
tokens_to_generate=tokens_to_generate, **kwargs) so the method adheres to the
BaseModel contract.
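The two fixes for this file can be sketched together: a traversal-safe path resolver, and a skeleton of the restored generate_async contract with the collaborating methods stubbed out (class and helper names here are illustrative, not the real implementation):

```python
import os

def resolve_data_path(data_dir, rel_path):
    """Resolve rel_path under data_dir and reject traversal outside it."""
    base = os.path.realpath(data_dir)
    resolved = os.path.realpath(os.path.join(base, rel_path))
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError(f"audio path escapes data_dir: {rel_path!r}")
    return resolved

class MultimodalSketch:
    """Skeleton of the restored generate_async contract; helpers are stubs."""

    def _needs_audio_chunking(self, messages, task_type):
        return False  # stub: would inspect audio duration vs threshold

    async def _generate_with_chunking(self, messages, **kwargs):
        return {"generation": "chunked"}  # stub

    def content_text_to_list(self, msg):
        return msg  # stub: would expand audio refs into base64 content blocks

    def _preprocess_messages_for_model(self, messages):
        return messages  # stub: would strip /no_think for Qwen, etc.

    async def _super_generate_async(self, prompt, **kwargs):
        return {"generation": "ok"}  # stands in for super().generate_async(...)

    async def generate_async(self, prompt, task_type=None, tokens_to_generate=None, **kwargs):
        # Always return the parent's dict result, per the BaseModel contract
        if isinstance(prompt, list):
            messages = prompt
            if self._needs_audio_chunking(messages, task_type):
                return await self._generate_with_chunking(
                    messages, tokens_to_generate=tokens_to_generate, **kwargs
                )
            messages = [self.content_text_to_list(m.copy()) for m in messages]
            prompt = self._preprocess_messages_for_model(messages)
        return await self._super_generate_async(
            prompt, tokens_to_generate=tokens_to_generate, **kwargs
        )
```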
In @tests/gpu-tests/test_vllm_audio.py:
- Around line 33-37: The test uses string shell commands (e.g., subprocess.run
or similar) which pass through shell=True; update
tests/gpu-tests/test_vllm_audio.py to construct explicit argv lists and invoke
the command without a shell (shell=False). Locate the invocations that build
command strings (around the output_dir handling and the other commands in the
55-69 block), replace string commands with lists of program and args and call
subprocess.run(...) or subprocess.check_call(...) with shell=False, quoting each
arg separately and preserving the same environment/redirects.
- Around line 38-49: The hard-coded CI paths in the test_vllm_audio.py test_data
make the test environment-coupled; update the test to either check for the WAVs'
existence with Path("/nemo_run/code/...").exists() and call pytest.skip(...)
when missing, or replace those entries by generating lightweight temporary WAV
files (e.g., create small silent WAVs into tempfile.NamedTemporaryFile and use
their .name) and reference them in the test_data variable so the test is
self-contained; modify the block around the NamedTemporaryFile creation and the
test_data list to implement one of these two options and ensure cleanup if
generating temp files.
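Both suggested test fixes can be sketched as follows: invoking a command as an explicit argv list so no shell is involved, and generating a self-contained silent WAV with the stdlib wave module (names are illustrative):

```python
import subprocess
import sys
import tempfile
import wave

def run_no_shell(output_dir):
    """Run a command as an argv list (shell=False); args with spaces need no quoting."""
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", output_dir]
    return subprocess.run(cmd, shell=False, capture_output=True, text=True, check=True)

def make_silent_wav(duration_sec=0.5, sample_rate=16000):
    """Create a small silent 16-bit mono WAV so the test needs no CI-specific files."""
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    with wave.open(tmp.name, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * int(sample_rate * duration_sec))
    return tmp.name
```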
In @tests/test_vllm_audio.py:
- Around line 45-56: The patched VLLMMultimodalModel.__init__ in the
mock_vllm_multimodal_model fixture is too specific and triggers Ruff ARG005;
change the patch to accept arbitrary positional and keyword args (e.g., use
lambda *_args, **_kwargs: None or a tiny def __init__(*args, **kwargs): return
None) so the test remains robust to signature changes and stops lint noise while
keeping the rest of the fixture (model.data_dir, output_dir, output_audio_dir,
enable_audio_chunking, audio_chunk_task_types, chunk_audio_threshold_sec)
intact.
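A sketch of the suggested fixture patch, with a stand-in class since the real VLLMMultimodalModel constructor is heavy:

```python
from unittest.mock import patch

class Dummy:
    """Stands in for VLLMMultimodalModel, whose real __init__ is heavy."""
    def __init__(self, server_url, model_name):
        raise RuntimeError("would start a server")

def make_mock_model():
    # Accept arbitrary positional/keyword args so the patch survives signature changes
    with patch.object(Dummy, "__init__", lambda *_args, **_kwargs: None):
        model = Dummy("unused", "unused")  # heavy __init__ is bypassed
    # Attributes are then set manually, as in the fixture
    model.data_dir = "/data"
    model.enable_audio_chunking = True
    return model
```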
🧹 Nitpick comments (4)
nemo_skills/inference/model/__init__.py (1)
22-30: Consider adding __all__ to make the public API explicit (and stable).
Right now these helpers are "public by import side-effect"; __all__ would prevent accidental exports as this module grows.
nemo_skills/inference/model/audio_utils.py (1)
116-133: Consider taking audio_mime/format instead of hardcoding wav.
Right now callers can only emit data:audio/wav;base64,... and {format: "wav"} even if the input is mp3/flac.
nemo_skills/inference/model/vllm_multimodal.py (2)
256-265: Avoid silent exception swallowing in _needs_audio_chunking() (hard to debug).
At least LOG.debug(..., exc_info=True) on failure; also consider catching narrower exceptions than Exception.
183-199: Reuse make_audio_content_block() instead of duplicating "audio_url" formatting.
This prevents drift between helpers and keeps “audio_url vs input_audio” handling consistent.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
nemo_skills/dataset/asr-leaderboard/__init__.py
nemo_skills/dataset/audiobench/judge/__init__.py
nemo_skills/dataset/audiobench/nonjudge/__init__.py
nemo_skills/dataset/librispeech-pc/__init__.py
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
nemo_skills/dataset/mmau-pro/open_ended/__init__.py
nemo_skills/evaluation/evaluator/audio.py
nemo_skills/evaluation/evaluator/mmau_pro.py
nemo_skills/evaluation/metrics/map_metrics.py
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
nemo_skills/inference/generate.py
nemo_skills/inference/model/__init__.py
nemo_skills/inference/model/audio_utils.py
nemo_skills/inference/model/vllm_multimodal.py
nemo_skills/prompt/config/judge/mmau-pro.yaml
nemo_skills/prompt/config/judge/speechlm.yaml
tests/gpu-tests/test_vllm_audio.py
tests/test_vllm_audio.py
💤 Files with no reviewable changes (1)
- nemo_skills/prompt/config/judge/speechlm.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
Applied to files:
nemo_skills/dataset/audiobench/nonjudge/__init__.py
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
nemo_skills/prompt/config/judge/mmau-pro.yaml
nemo_skills/dataset/audiobench/judge/__init__.py
nemo_skills/dataset/mmau-pro/open_ended/__init__.py
🧬 Code graph analysis (7)
tests/gpu-tests/test_vllm_audio.py (1)
tests/gpu-tests/utils.py (1)
require_env_var(18-23)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/audio_metrics.py (1)
AudioMetrics(43-345)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
as_int(443-446), as_percentage(437-440)
nemo_skills/inference/model/audio_utils.py (2)
nemo_skills/utils.py (1)
get_logger_name(39-43)
nemo_skills/inference/patch_litellm_logging.py (1)
start(37-38)
nemo_skills/inference/model/__init__.py (1)
nemo_skills/inference/model/audio_utils.py (5)
audio_file_to_base64(30-41), chunk_audio(59-81), load_audio_file(44-56), make_audio_content_block(116-133), save_audio_chunk_to_base64(84-113)
tests/test_vllm_audio.py (2)
nemo_skills/inference/model/audio_utils.py (1)
audio_file_to_base64(30-41)
nemo_skills/inference/model/vllm_multimodal.py (5)
VLLMMultimodalModel(44-445), content_text_to_list(156-200), _preprocess_messages_for_model(202-226), _needs_audio_chunking(228-266), _postprocess_chunk_generation(353-388)
nemo_skills/inference/model/vllm_multimodal.py (3)
nemo_skills/inference/model/audio_utils.py (4)
audio_file_to_base64(30-41), chunk_audio(59-81), load_audio_file(44-56), save_audio_chunk_to_base64(84-113)
nemo_skills/inference/chat_interface/core.py (1)
get(136-151)
nemo_skills/inference/model/base.py (1)
generate_async(218-325)
🪛 Ruff (0.14.10)
tests/gpu-tests/test_vllm_audio.py
33-33: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"
(S108)
68-68: subprocess call with shell=True identified, security issue
(S602)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
123-123: Loop control variable agg_mode not used within loop body
Rename unused agg_mode to _agg_mode
(B007)
tests/test_vllm_audio.py
48-48: Unused lambda argument: self
(ARG005)
48-48: Unused lambda argument: kwargs
(ARG005)
nemo_skills/inference/model/vllm_multimodal.py
179-179: Avoid specifying long messages outside the exception class
(TRY003)
228-228: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
263-264: try-except-pass detected, consider logging the exception
(S110)
263-263: Do not catch blind exception: Exception
(BLE001)
296-296: Loop control variable chunk_idx not used within loop body
Rename unused chunk_idx to _chunk_idx
(B007)
312-314: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
394-394: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (12)
nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1)
18-19: LGTM - Audio enablement flags added correctly. The addition of ++should_enable_audio=true to GENERATION_ARGS and the new EVAL_ARGS constant align with the PR's objective to enable audio input support for the vllm multimodal component.
nemo_skills/evaluation/metrics/map_metrics.py (1)
53-53: LGTM - Backward compatibility alias added. The "speechlm" alias maintains backward compatibility while routing to the unified AudioMetrics class. This allows existing references to "speechlm" metric type to continue working without modification.
nemo_skills/dataset/librispeech-pc/__init__.py (1)
28-29: LGTM - Audio enablement and cleanup. The changes correctly add the audio enablement flag and remove trailing whitespace. This is consistent with the broader PR changes to enable audio input support.
nemo_skills/evaluation/evaluator/mmau_pro.py (1)
34-34: LGTM - Documentation updated to reflect new judge configuration. The docstring correctly references the new judge/mmau-pro prompt config, aligning with the broader migration from speechlm to mmau-pro judge configuration across the codebase.
nemo_skills/dataset/asr-leaderboard/__init__.py (1)
21-21: LGTM - Audio enablement flag added correctly. The addition of ++should_enable_audio=true to GENERATION_ARGS aligns with the PR objective to enable audio input support. The flag is placed appropriately alongside the existing audio-related arguments.
nemo_skills/dataset/audiobench/nonjudge/__init__.py (1)
28-31: LGTM - Audio flag added and formatting cleaned up. The changes consistently add the ++should_enable_audio=true flag and remove trailing spaces from both EVAL_ARGS and GENERATION_ARGS, aligning with the PR's audio enablement objective.
nemo_skills/dataset/mmau-pro/open_ended/__init__.py (2)
18-18: Audio enablement flag added correctly. The addition of ++should_enable_audio=true to GENERATION_ARGS is consistent with the PR's objective and matches the pattern applied across other dataset configurations.
26-26: Judge path migration to judge/mmau-pro is complete and correct. The judge config path has been successfully updated from judge/speechlm to judge/mmau-pro. Verification confirms no broken references to judge/speechlm exist in the codebase, and the mmau-pro.yaml configuration file is properly in place. The code change is sound.
nemo_skills/dataset/audiobench/judge/__init__.py (1)
29-30: LGTM - Consistent audio enablement and cleanup. The changes add the ++should_enable_audio=true flag to GENERATION_ARGS and remove trailing spaces from both constants, maintaining consistency with the audio enablement pattern applied across the codebase.
nemo_skills/prompt/config/judge/mmau-pro.yaml (1)
1-30: Format specification correctly matches the evaluation metrics parser. The output format in the prompt (CORRECTNESS, RELEVANCE, COMPLETENESS, CLARITY, OVERALL with numeric scores) aligns perfectly with the regex patterns in extract_multicriteria_scores(). The parser uses case-insensitive matching and safely defaults to 3.0 for any missing fields, making it robust against minor format variations.
nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)
123-152: Multi-criteria aggregation ignores per-aggregation buckets.
self.multicriteria_scores is accumulated once and then applied to every agg_metrics entry; if metrics_dict contains per-subgroup metrics, this will smear scores across groups. If MMAU-Pro never uses subgrouping, OK, but otherwise consider collecting multicriteria per agg key.
590-607: drop_fields_from_messages() is a good safeguard, but ensure consumers tolerate empty content lists.
If a message contains only dropped content types, it becomes []. That's probably fine for persisted outputs, but worth confirming downstream readers don't assume non-empty content.
Also applies to: 624-627
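To illustrate how a message can end up with content == [], here is a hypothetical drop helper (not the actual drop_fields_from_messages implementation):

```python
def drop_content_types(messages, drop_types=("audio_url",)):
    """Remove content blocks of the given types from each message's content list."""
    out = []
    for msg in messages:
        msg = dict(msg)  # shallow copy so the input is not mutated
        content = msg.get("content")
        if isinstance(content, list):
            # A message whose content is only dropped types ends up with []
            msg["content"] = [c for c in content if c.get("type") not in drop_types]
        out.append(msg)
    return out
```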
Greptile Overview
Greptile Summary
This PR enhances the VLLMMultimodalModel to support audio input processing while maintaining existing audio output capabilities.
Key Changes
Audio Input Processing:
- New audio_utils.py module provides utilities for encoding audio files to base64, chunking long audio, and temporary file handling
- VLLMMultimodalModel now accepts audio chunking configuration (threshold, task type filtering)
- Implements automatic chunking for audio files exceeding duration limits, processing each chunk separately and aggregating results
- Converts audio file paths to base64-encoded data URIs in the proper format for vLLM/Qwen models
- Audio content is placed before text content (required for Qwen models)
Evaluation Enhancements:
- mmau_pro_metrics.py now supports multi-criteria scoring (correctness, relevance, completeness, clarity, overall) on 1-5 scale for open-ended questions
- New strip_helpful_prefixes() function in audio.py removes LLM-generated prefixes like "The audio says:" from transcriptions before WER calculation
- Judge prompt configuration changed from speechlm to mmau-pro for multi-criteria evaluation
Pipeline Integration:
- generate.py adds audio configuration options and drop_fields_from_messages() to remove base64 audio data from output files
- Multiple dataset configs updated with should_enable_audio=true flag to enable audio processing
Architecture
The audio input flow follows this pattern:
- Check if audio exceeds duration threshold
- If yes: chunk audio → process each chunk → aggregate results
- If no: encode entire audio to base64 → process normally
- Audio responses from server are saved to disk and base64 data replaced with file paths
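The save-and-replace step at the end of this flow can be sketched like this (function name and directory layout are assumptions based on the description above):

```python
import base64
import os

def save_audio_response(b64_data, output_dir, filename):
    """Decode base64 audio from a server response and save it under output_dir/audio/."""
    audio_dir = os.path.join(output_dir, "audio")
    os.makedirs(audio_dir, exist_ok=True)
    path = os.path.join(audio_dir, filename)
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_data))
    # The result dict would then carry this path instead of the base64 payload
    return path
```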
Confidence Score: 2/5
- Has critical bugs that will cause runtime failures and test failures
- Found P0 IndexError bug in string splitting (line 367-370), P1 implementation-documentation mismatch causing test failures, P2 shallow copy issue, and P2 fallback logic ambiguity. The IndexError could trigger during chunk generation post-processing, and the /no_think test will definitely fail.
- nemo_skills/inference/model/vllm_multimodal.py (has critical bugs in _postprocess_chunk_generation and _preprocess_messages_for_model), nemo_skills/evaluation/metrics/mmau_pro_metrics.py (ambiguous fallback logic)
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/inference/model/vllm_multimodal.py | 2/5 | Added audio input processing (chunking, base64 encoding) and enhanced output handling. Found critical bugs: _preprocess_messages_for_model doesn't strip /no_think as documented and tested, potential IndexError in _postprocess_chunk_generation, and shallow copy issue in _generate_with_chunking. |
| nemo_skills/inference/model/audio_utils.py | 5/5 | New utility module for audio processing (base64 encoding, chunking, file I/O). Implementation is clean and straightforward with proper resource cleanup. |
| nemo_skills/inference/generate.py | 4/5 | Added audio configuration parameters and drop_fields_from_messages() to reduce output size. Clean integration with minimal changes. |
| nemo_skills/evaluation/metrics/mmau_pro_metrics.py | 3/5 | Enhanced with multi-criteria scoring (1-5 scale) for open-ended questions. Found logic bug in extract_multicriteria_scores where fallback condition will not distinguish between actual score of 3.0 and missing score. |
| nemo_skills/evaluation/evaluator/audio.py | 4/5 | Added strip_helpful_prefixes() to remove LLM-generated prefixes from transcriptions. Implementation looks sound with proper fallback strategies. |
| tests/test_vllm_audio.py | 3/5 | New unit tests for audio utilities and multimodal model. Test expects /no_think stripping behavior that the implementation doesn't provide, exposing a bug. |
Sequence Diagram
sequenceDiagram
participant User
participant GenerationTask
participant VLLMMultimodalModel
participant AudioUtils
participant VLLMServer
User->>GenerationTask: Run inference with audio
GenerationTask->>GenerationTask: Configure audio settings (chunking, etc)
GenerationTask->>VLLMMultimodalModel: Initialize model with audio config
alt Audio chunking needed
VLLMMultimodalModel->>AudioUtils: load_audio_file()
AudioUtils-->>VLLMMultimodalModel: audio_array, sampling_rate
VLLMMultimodalModel->>AudioUtils: chunk_audio()
AudioUtils-->>VLLMMultimodalModel: List of chunks
loop For each chunk
VLLMMultimodalModel->>AudioUtils: save_audio_chunk_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64_audio
VLLMMultimodalModel->>VLLMMultimodalModel: content_text_to_list() [inline]
VLLMMultimodalModel->>VLLMMultimodalModel: _preprocess_messages_for_model()
VLLMMultimodalModel->>VLLMServer: send chunk request
VLLMServer-->>VLLMMultimodalModel: chunk response
VLLMMultimodalModel->>VLLMMultimodalModel: _postprocess_chunk_generation()
end
VLLMMultimodalModel->>VLLMMultimodalModel: Aggregate chunk results
else No chunking
VLLMMultimodalModel->>AudioUtils: audio_file_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64_audio
VLLMMultimodalModel->>VLLMMultimodalModel: content_text_to_list()
VLLMMultimodalModel->>VLLMMultimodalModel: _preprocess_messages_for_model()
VLLMMultimodalModel->>VLLMServer: send request
VLLMServer-->>VLLMMultimodalModel: response
end
VLLMMultimodalModel->>VLLMMultimodalModel: _parse_chat_completion_response()
alt Audio in response
VLLMMultimodalModel->>VLLMMultimodalModel: _process_audio_response()
VLLMMultimodalModel->>VLLMMultimodalModel: Save audio to disk
end
VLLMMultimodalModel-->>GenerationTask: result with generation
GenerationTask->>GenerationTask: drop_fields_from_messages()
GenerationTask-->>User: Final output
Greptile Summary
This PR adds comprehensive audio input/output support to the vLLM multimodal inference pipeline, enabling audio processing for benchmarks like MMAU-Pro, AudioBench, and ASR-Leaderboard.
Key Changes:
Architecture:
Previous Thread Issues Addressed:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Generate
participant VLLMMultimodalModel
participant AudioUtils
participant VLLMServer
User->>Generate: Input with audio path
Generate->>Generate: Check enable_audio flag
Generate->>Generate: Auto-switch vllm to vllm_multimodal if needed
Generate->>VLLMMultimodalModel: Initialize with audio config
alt Audio chunking enabled
VLLMMultimodalModel->>VLLMMultimodalModel: _needs_audio_chunking()
VLLMMultimodalModel->>AudioUtils: load_audio_file()
AudioUtils-->>VLLMMultimodalModel: audio_array, sampling_rate
VLLMMultimodalModel->>VLLMMultimodalModel: Check duration > threshold
alt Duration exceeds threshold
VLLMMultimodalModel->>AudioUtils: chunk_audio()
AudioUtils-->>VLLMMultimodalModel: List of audio chunks
loop For each chunk
VLLMMultimodalModel->>AudioUtils: save_audio_chunk_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64 encoded chunk
VLLMMultimodalModel->>VLLMMultimodalModel: content_text_to_list()
VLLMMultimodalModel->>VLLMMultimodalModel: _preprocess_messages_for_model()
VLLMMultimodalModel->>VLLMServer: generate_async() with chunk
VLLMServer-->>VLLMMultimodalModel: Chunk generation result
VLLMMultimodalModel->>VLLMMultimodalModel: Accumulate statistics
end
VLLMMultimodalModel->>VLLMMultimodalModel: Aggregate chunk results
VLLMMultimodalModel-->>Generate: Final result with summed stats
end
end
alt No chunking needed
VLLMMultimodalModel->>AudioUtils: audio_file_to_base64()
AudioUtils-->>VLLMMultimodalModel: base64 encoded audio
VLLMMultimodalModel->>VLLMMultimodalModel: content_text_to_list()
Note over VLLMMultimodalModel: Audio items prepended before text
VLLMMultimodalModel->>VLLMMultimodalModel: _preprocess_messages_for_model()
VLLMMultimodalModel->>VLLMServer: generate_async() with full audio
VLLMServer-->>VLLMMultimodalModel: Generation result
alt Response contains audio output
VLLMMultimodalModel->>VLLMMultimodalModel: _process_audio_response()
VLLMMultimodalModel->>VLLMMultimodalModel: Save audio to output_dir/audio/
VLLMMultimodalModel->>VLLMMultimodalModel: Replace base64 with file path
end
VLLMMultimodalModel-->>Generate: Final result
end
Generate->>Generate: Drop content_types if configured
Generate-->>User: Output with generation
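The content-list shape produced by the content_text_to_list step in the diagram can be sketched as follows. This is a minimal illustration: the `audio_url` type and the `data:audio/wav;base64,` prefix match the assertions quoted later from tests/test_vllm_audio.py, while the helper name itself is hypothetical.

```python
import base64

def make_audio_message(audio_bytes: bytes, text: str) -> dict:
    """Build a chat message whose content list carries the audio item
    before the text item, as the diagram's note describes."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{b64}"}},
            {"type": "text", "text": text},
        ],
    }

msg = make_audio_message(b"RIFF....WAVEfmt ", "Transcribe this audio")
```

The audio item is deliberately placed first so that the model sees the audio context before the instruction text.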
Additional Comments (1)
Consider a more precise matching strategy that preserves word boundaries:
@melllinia Can you please check the mmau-pro part?
Kipok
left a comment
thanks, just a few minor comments! I didn't check the underlying logic for audio processing too much; if you need a thorough review there, please add relevant people as reviewers.
Testing is WIP.
Additional Comments (6)
```python
if not audio_duration or audio_duration <= 0:
    return {
        "hallucination_rate": 0.0,
        "char_rate": 0.0,
        "is_correct": True,  # ← This should be False
        "error": "missing_audio_duration",
    }
```

When evaluation cannot be performed due to missing data, the sample should not be counted as correct.
For example, if reference is "Hello World Test" and hypothesis is "Hello World", the function returns 0.0 cap_accuracy even though 2 out of 2 words in the hypothesis are correctly capitalized. A better approach would be to check capitalization for the overlapping portion:

```python
min_len = min(len(ref_words), len(hyp_words))
if min_len > 0:
    cap_matches = sum(
        1 for r, h in zip(ref_words[:min_len], hyp_words[:min_len])
        if r and h and r[0].isupper() == h[0].isupper()
    )
    cap_accuracy = cap_matches / min_len
else:
    cap_accuracy = 0.0
```
However, if reference is empty but hypothesis is not (or vice versa), this creates artificial WER. For example:

This distorts metrics. Better approach:

```python
if not ref and not hyp:
    return {"wer": 0.0, "is_correct": True}   # Both empty = perfect match
if not ref or not hyp:
    return {"wer": 1.0, "is_correct": False}  # One empty = complete mismatch
```
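For completeness, a standalone sketch of a WER computation that applies these empty-string guards. The edit distance here is a plain word-level Levenshtein implementation; the real evaluator uses jiwer.

```python
def word_error_rate(ref: str, hyp: str) -> dict:
    """Word error rate with explicit empty-string handling, as suggested above."""
    ref_words, hyp_words = ref.split(), hyp.split()
    if not ref_words and not hyp_words:
        return {"wer": 0.0, "is_correct": True}   # both empty = perfect match
    if not ref_words or not hyp_words:
        return {"wer": 1.0, "is_correct": False}  # one empty = complete mismatch
    # standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    wer = d[-1][-1] / len(ref_words)
    return {"wer": wer, "is_correct": wer == 0.0}
```

With the guards in place, two empty strings score a perfect match instead of an artificial error.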
This was already identified in a previous review comment. The function needs word boundary checking:

```python
def has_postscript(text, marker):
    # Remove punctuation from marker for flexible matching
    clean_marker = re.sub(r'[^a-zA-Z\s]', '', marker).strip()
    # Use word boundaries to avoid substring matches
    pattern = r'\b' + re.escape(clean_marker) + r'\b'
    return bool(re.search(pattern, text, re.IGNORECASE))
```
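A quick sanity check of the word-boundary behavior, running the suggested function as written (example inputs are illustrative):

```python
import re

def has_postscript(text, marker):
    # Remove punctuation from marker for flexible matching
    clean_marker = re.sub(r'[^a-zA-Z\s]', '', marker).strip()
    # Use word boundaries to avoid substring matches
    pattern = r'\b' + re.escape(clean_marker) + r'\b'
    return bool(re.search(pattern, text, re.IGNORECASE))

has_postscript("I will attend the meeting.", "tend")   # -> False (substring only)
has_postscript("P.S. See you tomorrow.", "see you!")   # -> True (punctuation stripped)
```

The `\b` anchors are what prevent "tend" from matching inside "attend", which a plain substring check would accept.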
Better implementation:

```python
if "serialized_output" in result:
    for item in result["serialized_output"]:
        if isinstance(item, dict) and "audio" in item:
            # Remove entire audio data to save space, keep only metadata
            if isinstance(item["audio"], dict) and "data" in item["audio"]:
                # Keep only essential metadata
                item["audio"] = {
                    k: v for k, v in item["audio"].items()
                    if k != "data"
                }
```
For example, in a 100-character text:
A better approach would be:
Additionally, the code counts each punctuation mark only once (line 403).
e7e27c7 to 79978f3
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@coderabbitai help
✅ Actions performed

Full review triggered.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 69-84: The sequential quote-extraction on the variable result
(using re.search(r'"((?:\\.|[^"\\])*)"'), the colon-quote checks ":'" / ": '",
and re.search(r"'(.*)'")) can over-strip nested quotes (e.g., "He said:
'hello'"); change the flow so only one extraction applies: after a successful
double-quote match, or after a colon-quote branch, or after a single-quote
match, immediately accept that extraction and skip the remaining handlers
(either by converting the blocks into an if/elif chain or by returning the
extracted result from the surrounding function), and add a short comment
documenting the chosen precedence order for clarity.
- Line 475: The code uses direct dict access generation =
sample["generation"].strip() which can raise KeyError if "generation" is
missing; change this to defensively access the key (e.g., use
sample.get("generation", "") and ensure you handle None/empty string before
calling .strip()) so malformed samples won't crash the evaluator; update the
occurrence(s) of generation = sample["generation"].strip() to use the safe
lookup and preserve existing behavior when the field is present.
In `@nemo_skills/evaluation/metrics/mmau_pro_metrics.py`:
- Around line 140-145: The conversion from a 1-5 scale to percent in the block
that computes avg_score and std_score and writes to agg_metrics (keys like
f"avg_{criterion}" and f"std_{criterion}") is wrong: (score / 5.0) * 100 maps
1→20 not 0. If you intend 0–100, change the conversion to ((value - 1) / 4.0) *
100 for both avg_score and std_score; otherwise update the comment to state that
the mapping yields 20–100. Ensure you adjust both assignments that set
agg_metrics[f"avg_{criterion}"] and agg_metrics[f"std_{criterion}"] (using the
local names avg_score, std_score, and scores).
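The intended affine rescaling can be written out explicitly (a sketch; the function name is illustrative):

```python
def scale_to_percent(score: float) -> float:
    """Map a 1-5 judge score onto 0-100: 1 -> 0, 3 -> 50, 5 -> 100.
    The naive score / 5.0 * 100 would map 1 -> 20, as noted above."""
    return (score - 1) / 4.0 * 100
```

Note that this shift-and-scale applies to score values themselves; a spread measure like the standard deviation would only be scaled, not shifted.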
In `@nemo_skills/inference/model/audio_utils.py`:
- Around line 59-96: In chunk_audio, add input validation to ensure
sampling_rate and chunk_duration_sec are positive (e.g., > 0) before computing
chunk_samples; if not, raise a ValueError with a clear message. Specifically
validate the sampling_rate and chunk_duration_sec parameters at the start of
chunk_audio (and/or assert chunk_samples > 0) to avoid chunk_samples being zero
or negative and subsequent infinite/invalid loops or divide-by-zero issues, then
proceed with existing logic that computes chunk_samples and min_chunk_samples
and builds chunks.
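A minimal sketch of chunk_audio with the requested validation. The signature is simplified, and the real implementation also merges a too-short trailing chunk via min_chunk_samples, which is omitted here.

```python
def chunk_audio(audio, sampling_rate, chunk_duration_sec=30.0):
    """Split a 1-D sample sequence into fixed-duration chunks,
    validating parameters up front so chunk_samples can never be <= 0."""
    if sampling_rate <= 0:
        raise ValueError(f"sampling_rate must be positive, got {sampling_rate}")
    if chunk_duration_sec <= 0:
        raise ValueError(f"chunk_duration_sec must be positive, got {chunk_duration_sec}")
    chunk_samples = int(sampling_rate * chunk_duration_sec)
    return [audio[i:i + chunk_samples] for i in range(0, len(audio), chunk_samples)]
```

Without the guards, a zero `chunk_samples` would make the slicing step degenerate, which is exactly the failure mode the comment above describes.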
♻️ Duplicate comments (8)
nemo_skills/inference/model/audio_utils.py (1)
114-128: Potential resource leak if an exception occurs before the try block.

If an exception occurs between creating the temp file (line 114) and entering the try block (line 117), the file won't be cleaned up.
Safer structure
```diff
-    # Create temporary file
-    tmp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
-    tmp_path = tmp_file.name
-
-    try:
-        tmp_file.close()
+    tmp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+    tmp_path = tmp_file.name
+    tmp_file.close()
+
+    try:
         sf.write(tmp_path, audio_chunk, sampling_rate)
```

This ensures `tmp_file.close()` completes before entering the try block, and any failure in `sf.write` or later is properly caught with cleanup.

tests/test_vllm_audio.py (1)
tmp_file.close()completes before entering the try block, and any failure insf.writeor later is properly caught with cleanup.tests/test_vllm_audio.py (1)
45-57: Tighten the__init__patch to suppress lint warnings.The lambda triggers ARG005 because
selfandkwargsare unused. Use*_args, **_kwargsto indicate intentional non-use.♻️ Proposed fix
- with patch.object(VLLMMultimodalModel, "__init__", lambda self, **kwargs: None): + with patch.object(VLLMMultimodalModel, "__init__", lambda *_args, **_kwargs: None):tests/gpu-tests/test_vllm_audio.py (2)
55-68: Avoid `shell=True` with string commands.

Using `shell=True` with a string command introduces potential command injection risks and is flagged by security linters. Convert to an argument list.

♻️ Proposed fix using argument list

```diff
-    cmd = (
-        f"ns generate "
-        f" --cluster test-local --config_dir {Path(__file__).absolute().parent} "
-        f" --model {model_path} "
-        f" --output_dir {output_dir} "
-        f" --server_type vllm_multimodal "
-        f" --server_gpus 1 "
-        f" --server_nodes 1 "
-        f" --server_args '--enforce-eager' "
-        f" --input_file={input_file} "
-        f" ++prompt_format=openai "
-        f" ++skip_filled=False "
-    )
-    subprocess.run(cmd, shell=True, check=True)
+    cmd = [
+        "ns", "generate",
+        "--cluster", "test-local",
+        "--config_dir", str(Path(__file__).absolute().parent),
+        "--model", model_path,
+        "--output_dir", output_dir,
+        "--server_type", "vllm_multimodal",
+        "--server_gpus", "1",
+        "--server_nodes", "1",
+        "--server_args", "--enforce-eager",
+        f"--input_file={input_file}",
+        "++prompt_format=openai",
+        "++skip_filled=False",
+    ]
+    subprocess.run(cmd, check=True)
```
38-52: Add existence check for CI-specific audio files.

The hardcoded paths (`/nemo_run/code/tests/slurm-tests/asr_nim/wavs/...`) are CI-specific. Consider adding a guard to skip the test when files are missing.

♻️ Proposed fix with existence check

```diff
     # Create test input file with audio
+    audio_files = [
+        "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t2_16.wav",
+        "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t3_16.wav",
+    ]
+    for audio_file in audio_files:
+        if not Path(audio_file).exists():
+            pytest.skip(f"Test audio file not found: {audio_file}")
+
     with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f:
         test_data = [
             {
                 "problem": "Transcribe this audio",
-                "audio": {"path": "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t2_16.wav"},
+                "audio": {"path": audio_files[0]},
             },
             {
                 "problem": "What is in this audio?",
-                "audio": {"path": "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t3_16.wav"},
+                "audio": {"path": audio_files[1]},
             },
         ]
```

nemo_skills/inference/model/vllm_multimodal.py (1)
184-197: Path traversal vulnerability when resolving audio paths.

Using `os.path.join(self.data_dir, audio["path"])` allows directory traversal if `audio["path"]` contains `../` sequences. A malicious input could read arbitrary files on the system.

Consider validating that the resolved path stays within `data_dir`:

🔒 Proposed fix to prevent path traversal

```diff
+    def _resolve_safe_path(self, relative_path: str) -> str:
+        """Resolve path ensuring it stays within data_dir."""
+        resolved = os.path.realpath(os.path.join(self.data_dir, relative_path))
+        data_dir_resolved = os.path.realpath(self.data_dir)
+        if not resolved.startswith(data_dir_resolved + os.sep):
+            raise ValueError(f"Path traversal detected: {relative_path}")
+        return resolved

         if "audio" in message:
             audio = message["audio"]
-            audio_path = os.path.join(self.data_dir, audio["path"])
+            audio_path = self._resolve_safe_path(audio["path"])
             base64_audio = audio_file_to_base64(audio_path)
```

nemo_skills/evaluation/evaluator/audio.py (2)
183-217: Digit-to-word mapping is incomplete for compound numbers.

As noted in prior reviews, this handles 0-20 and decade multiples (30, 40, etc.) but not compound numbers like 21-29, 31-39. This is acceptable if documented, but numbers like "25" will remain as digits after normalization, causing WER discrepancies when references spell them out.

287-311: Verify the `whisper_normalizer` dependency is declared.

The import of `EnglishTextNormalizer` from `whisper_normalizer.english` requires the `whisper_normalizer` package. Per prior discussion, this was addressed in PR #1140; please confirm that dependency is merged or included in this PR's requirements.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)
133-134: Add a comment explaining the division by 2.0.

This magic number was flagged in previous reviews but still lacks explanation. Please document why `no_answer` is divided by 2.0 to prevent confusion for future maintainers.
🧹 Nitpick comments (10)
nemo_skills/inference/model/vllm_multimodal.py (3)

218-218: Fix implicit `Optional` type annotation.

Per PEP 484, `task_type: str = None` should be `task_type: str | None = None` to explicitly indicate the parameter accepts `None`.

♻️ Proposed fix

```diff
-    def _needs_audio_chunking(self, messages: list[dict], task_type: str = None) -> tuple[bool, str, float]:
+    def _needs_audio_chunking(self, messages: list[dict], task_type: str | None = None) -> tuple[bool, str, float]:
```

357-357: Fix implicit `Optional` type annotation.

Same issue as `_needs_audio_chunking`: explicitly annotate as `str | None`.

♻️ Proposed fix

```diff
-        task_type: str = None,
+        task_type: str | None = None,
```

294-294: Consider using `chunk_idx` or renaming to indicate non-use.

The loop variable `chunk_idx` is declared but unused. Either use it for logging (e.g., `LOG.debug(f"Processing chunk {chunk_idx + 1}/{len(chunks)}")`) or rename to `_chunk_idx` to suppress the lint warning.

tests/test_vllm_audio.py (2)
73-78: Add assertion to verify the original `audio` field was removed.

The test verifies the content structure but doesn't check that the original `audio` key was deleted after conversion (per `vllm_multimodal.py` line 190).

♻️ Proposed fix

```diff
     assert result["content"][0]["type"] == "audio_url"
     assert result["content"][0]["audio_url"]["url"].startswith("data:audio/wav;base64,")
     assert result["content"][1]["type"] == "text"
+    assert "audio" not in result  # Verify original field was removed
```

100-105: Add assertion to verify the original `audios` field was removed.

Similar to the single audio test, verify that the `audios` key was deleted after conversion.

♻️ Proposed fix

```diff
     assert result["content"][0]["type"] == "audio_url"
     assert result["content"][1]["type"] == "audio_url"
     assert result["content"][2]["type"] == "text"
+    assert "audios" not in result  # Verify original field was removed
```

nemo_skills/evaluation/evaluator/audio.py (3)
42-48: Consider using compiled regex patterns for performance.

These patterns are used on every sample evaluation. Pre-compiling them at module load would avoid repeated compilation overhead.

♻️ Suggested improvement

```diff
-_FAILURE_RESPONSES = [
-    r"the speech is in audio format and needs to be transcribed",
-    r"i do not have access to audio",
-    r"i cannot access audio",
-    r"i'm sorry.*i do not have access",
-    r"as an ai language model.*i do not have access",
-]
+_FAILURE_RESPONSES = [
+    re.compile(r"the speech is in audio format and needs to be transcribed", re.IGNORECASE),
+    re.compile(r"i do not have access to audio", re.IGNORECASE),
+    re.compile(r"i cannot access audio", re.IGNORECASE),
+    re.compile(r"i'm sorry.*i do not have access", re.IGNORECASE),
+    re.compile(r"as an ai language model.*i do not have access", re.IGNORECASE),
+]
```

Then update usage:

```diff
     for failure_pattern in _FAILURE_RESPONSES:
-        if re.search(failure_pattern, result, flags=re.IGNORECASE):
+        if failure_pattern.search(result):
             return ""
```

220-248: Contraction expansion may conflict with jiwer's `ExpandCommonEnglishContractions`.

In `preprocess_asr_text` for "audiobench" mode (lines 297-304), `_expand_contractions` is called before `jiwer.ExpandCommonEnglishContractions()`. Since both expand contractions, the jiwer transform becomes a no-op. This is harmless but redundant; consider removing one.

271-274: Static analysis: Consider a custom exception class for invalid normalization modes.

Per TRY003, long messages in `raise ValueError(...)` are better encapsulated in a custom exception. However, this is a minor style concern and acceptable for a configuration validation error.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (2)
110-123: Good fix for the `None.strip()` crash.

Line 114 now safely handles missing/None generation values.

Note: `extract_multicriteria_scores` may be called twice per prediction with a judgement, once in `_get_score_dict` (via the parent's `update`) and again in lines 121-123. If performance becomes a concern, consider caching the parsed result.

129-129: Unused loop variable.

`agg_mode` is not used within the loop body.

Suggested fix

```diff
-        for agg_mode, agg_metrics in metrics_dict.items():
+        for _agg_mode, agg_metrics in metrics_dict.items():
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- nemo_skills/dataset/asr-leaderboard/__init__.py
- nemo_skills/dataset/audiobench/judge/__init__.py
- nemo_skills/dataset/audiobench/nonjudge/__init__.py
- nemo_skills/dataset/librispeech-pc/__init__.py
- nemo_skills/dataset/mmau-pro/closed_form/__init__.py
- nemo_skills/dataset/mmau-pro/open_ended/__init__.py
- nemo_skills/evaluation/evaluator/audio.py
- nemo_skills/evaluation/evaluator/mmau_pro.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/mmau_pro_metrics.py
- nemo_skills/inference/generate.py
- nemo_skills/inference/model/__init__.py
- nemo_skills/inference/model/audio_utils.py
- nemo_skills/inference/model/vllm_multimodal.py
- nemo_skills/prompt/config/judge/mmau-pro.yaml
- nemo_skills/prompt/config/judge/speechlm.yaml
- tests/gpu-tests/test_vllm_audio.py
- tests/test_vllm_audio.py
💤 Files with no reviewable changes (1)
- nemo_skills/prompt/config/judge/speechlm.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
Applied to files:
- nemo_skills/dataset/mmau-pro/open_ended/__init__.py
- nemo_skills/dataset/audiobench/nonjudge/__init__.py
- nemo_skills/dataset/mmau-pro/closed_form/__init__.py
- nemo_skills/prompt/config/judge/mmau-pro.yaml
- nemo_skills/dataset/audiobench/judge/__init__.py
- nemo_skills/evaluation/metrics/mmau_pro_metrics.py
🧬 Code graph analysis (7)
tests/gpu-tests/test_vllm_audio.py (1)
tests/gpu-tests/utils.py (1)
require_env_var(18-23)
nemo_skills/inference/model/audio_utils.py (1)
nemo_skills/utils.py (1)
get_logger_name(39-43)
tests/test_vllm_audio.py (2)
nemo_skills/inference/model/audio_utils.py (1)
- audio_file_to_base64(30-41)

nemo_skills/inference/model/vllm_multimodal.py (3)

- content_text_to_list(157-203)
- _preprocess_messages_for_model(205-216)
- _needs_audio_chunking(218-256)
nemo_skills/evaluation/evaluator/audio.py (1)
nemo_skills/evaluation/metrics/audio_metrics.py (1)
update(177-228)
nemo_skills/inference/model/__init__.py (1)
nemo_skills/inference/model/audio_utils.py (5)
- audio_file_to_base64(30-41)
- chunk_audio(59-96)
- load_audio_file(44-56)
- make_audio_content_block(133-150)
- save_audio_chunk_to_base64(99-130)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/audio_metrics.py (1)
AudioMetrics(43-352)
nemo_skills/inference/model/vllm_multimodal.py (1)
nemo_skills/inference/model/audio_utils.py (4)
- audio_file_to_base64(30-41)
- chunk_audio(59-96)
- load_audio_file(44-56)
- save_audio_chunk_to_base64(99-130)
🪛 Ruff (0.14.11)
tests/gpu-tests/test_vllm_audio.py
33-33: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"
(S108)
68-68: subprocess call with shell=True identified, security issue
(S602)
nemo_skills/inference/model/audio_utils.py
78-80: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
129-129: Loop control variable agg_mode not used within loop body
Rename unused agg_mode to _agg_mode
(B007)
tests/test_vllm_audio.py
48-48: Unused lambda argument: self
(ARG005)
48-48: Unused lambda argument: kwargs
(ARG005)
nemo_skills/evaluation/evaluator/audio.py
272-274: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/inference/model/vllm_multimodal.py
180-180: Avoid specifying long messages outside the exception class
(TRY003)
218-218: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
284-284: Avoid specifying long messages outside the exception class
(TRY003)
294-294: Loop control variable chunk_idx not used within loop body
Rename unused chunk_idx to _chunk_idx
(B007)
310-312: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
340-340: Avoid specifying long messages outside the exception class
(TRY003)
357-357: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (26)
nemo_skills/evaluation/metrics/map_metrics.py (1)
53-54: LGTM! Clean alias addition.

The new `"speechlm"` alias correctly maps to `AudioMetrics`, consistent with the existing `"audio"` entry. This provides flexibility for configurations that reference the metric type by either name.

nemo_skills/dataset/librispeech-pc/__init__.py (1)
28-29: LGTM!

The configuration changes correctly enable audio support for LibriSpeech-PC. The `++enable_audio=true` flag and trailing whitespace cleanup align with the standardization across other dataset configurations in this PR.

nemo_skills/dataset/audiobench/nonjudge/__init__.py (1)
28-31: LGTM!

The updated `EVAL_ARGS` with `normalization_mode=audiobench` and `GENERATION_ARGS` with `enable_audio=true` correctly configure this dataset for audio-enabled evaluation and generation, consistent with other AudioBench configurations in the PR.

nemo_skills/dataset/audiobench/judge/__init__.py (1)
29-30: LGTM!

The configuration aligns with the nonjudge variant and correctly enables audio support with the AudioBench-specific normalization mode for judge-based evaluation tasks.
nemo_skills/dataset/asr-leaderboard/__init__.py (1)
20-21: LGTM!

The `enable_audio=true` flag naming aligns with the agreed convention from previous review feedback. The `hf_leaderboard` normalization mode is appropriate per the file's documentation about HuggingFace leaderboard preprocessing.

nemo_skills/inference/model/audio_utils.py (3)
30-41: LGTM with upstream validation assumption.

The function correctly encodes audio files to base64. Ensure callers that construct `audio_file_path` from user input validate paths to prevent traversal attacks (e.g., the resolved path must stay under an expected base directory).

44-56: LGTM!

Clean wrapper with lazy import for `soundfile`.

133-150: LGTM!

Clean implementation supporting both vLLM/Qwen and OpenAI/NVIDIA API audio content formats.
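The two content-block shapes such a helper supports might look like this. The vLLM/Qwen `audio_url` shape matches this PR's tests; the OpenAI-style `input_audio` shape is an assumption based on the OpenAI chat format, and the function body is a sketch, not the actual implementation.

```python
def make_audio_content_block(base64_audio: str, api_format: str = "vllm") -> dict:
    """Return an audio content block in one of two shapes (sketch)."""
    if api_format == "vllm":
        # vLLM/Qwen style: a data URL under audio_url
        return {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
    if api_format == "openai":
        # OpenAI style: raw base64 plus an explicit container format
        return {"type": "input_audio", "input_audio": {"data": base64_audio, "format": "wav"}}
    raise ValueError(f"unknown api_format: {api_format}")
```

Keeping both shapes behind one helper lets the rest of the pipeline stay format-agnostic.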
nemo_skills/inference/model/__init__.py (1)
22-30: LGTM - Audio utility exports are well-organized.

The new audio utilities are properly imported and re-exported from the module's public interface. The imports are placed in alphabetical order and the functions (`audio_file_to_base64`, `chunk_audio`, `load_audio_file`, `make_audio_content_block`, `save_audio_chunk_to_base64`) align with the implementations in `audio_utils.py`.

nemo_skills/inference/model/vllm_multimodal.py (2)
353-387: LGTM - `generate_async` implementation properly handles audio input.

The method correctly:

- Checks if chunking is needed via `_needs_audio_chunking`
- Delegates to `_generate_with_chunking` for long audio
- Converts audio fields to base64 format for non-chunked audio
- Falls back to the parent's `generate_async` for final processing

The preprocessing order (content_text_to_list → _preprocess_messages_for_model) is consistent with the chunking path.
286-351: Good implementation of chunked audio generation with proper statistics aggregation.

The cumulative statistics tracking (lines 289-292, 328-331, 347-349) correctly sums `input_tokens`, `generated_tokens`, and `time_elapsed` across all chunks, addressing the concern about only preserving last-chunk metadata.

nemo_skills/inference/generate.py (3)
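The summation described here can be sketched as follows (field names are taken from this comment; the helper is a simplified stand-in for the real per-chunk accumulation):

```python
def aggregate_chunk_stats(chunk_results: list[dict]) -> dict:
    """Sum per-chunk usage statistics across all audio chunks."""
    totals = {"input_tokens": 0, "generated_tokens": 0, "time_elapsed": 0.0}
    for result in chunk_results:
        for key in totals:
            totals[key] += result.get(key, 0)
    return totals
```

Summing rather than overwriting is what guarantees the final result reflects the whole audio file, not just the last chunk.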
208-215: Audio configuration fields are well-documented.

The new configuration options (`drop_content_types`, `enable_audio`, `enable_audio_chunking`, `audio_chunk_task_types`, `chunk_audio_threshold_sec`) provide good control over audio processing behavior. The defaults are sensible for typical workflows.
417-434: Audio processing setup looks correct.

The logic properly:
- Detects when audio processing is needed based on
`enable_audio` or `vllm_multimodal` server type
- Auto-switches to
`vllm_multimodal` with an appropriate warning
- Passes audio chunking configuration to the server
The warning message at line 425 is helpful for users who may be surprised by the auto-switch behavior.
592-609: Clean implementation of content type filtering.

The
`drop_fields_from_messages` method correctly:
- Handles missing
`messages` key gracefully
- Skips non-list content (e.g., string content in system messages)
- Filters out specified content types to reduce output file size
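A minimal sketch consistent with the three behaviors listed above (a simplified stand-in for the actual method; the function and field names follow this review comment):

```python
def drop_fields_from_messages(output: dict, drop_content_types: set[str]) -> dict:
    """Remove selected content types (e.g. bulky base64 audio) from each
    message's content list; string content and a missing key pass through."""
    for message in output.get("messages", []):  # missing "messages" -> no-op
        content = message.get("content")
        if isinstance(content, list):  # skip plain-string content
            message["content"] = [
                item for item in content if item.get("type") not in drop_content_types
            ]
    return output
```

Dropping `audio_url` items before writing results is what keeps output files from ballooning with base64 payloads.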
tests/test_vllm_audio.py (1)
108-155: Good test coverage for edge cases and chunking behavior.The tests appropriately cover:
- Messages without audio returning unchanged
/no_thinksuffix preservation in system messages- Chunking disabled behavior
- Task type filtering with graceful handling of non-existent files
tests/gpu-tests/test_vllm_audio.py (1)
70-85: Test validation and cleanup look appropriate.

The test properly validates output existence and content, and cleans up the temporary input file in the
`finally` block.

nemo_skills/evaluation/evaluator/audio.py (2)
37-38: LGTM! New configuration fields are well-designed.

The new
`strip_helpful_prefixes` and `normalization_mode` fields with sensible defaults provide good flexibility for different ASR evaluation scenarios.
495-517: LGTM! Normalization mode propagation is consistent.

The pattern
`config.normalization_mode if config.apply_whisper_normalization else "none"` correctly respects both the legacy flag and the new mode setting across ASR, ASR-PC, and ASR_LEADERBOARD task types.

nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1)
18-19: LGTM! Configuration updates align with the audio integration pattern.

The addition of
`++enable_audio=true` to `GENERATION_ARGS` and the new `EVAL_ARGS` for mmau-pro evaluation are consistent with similar updates across other dataset configurations in this PR.

nemo_skills/evaluation/evaluator/mmau_pro.py (1)
34-34: LGTM! Docstring correctly updated to reflect the new judge config.

The documentation now accurately points to
the `judge/mmau-pro` prompt config, which aligns with the new `mmau-pro.yaml` file added in this PR.

nemo_skills/dataset/mmau-pro/open_ended/__init__.py (1)
18-26: LGTM! Open-ended configuration correctly updated for audio and the new judge prompt.

The updates to
`GENERATION_ARGS` (enabling audio) and `JUDGE_ARGS` (pointing to the new `judge/mmau-pro` config) are consistent with the closed-form configuration and the new judge prompt file.

nemo_skills/prompt/config/judge/mmau-pro.yaml (1)
1-30: Well-structured judge prompt with proper decimal score handling.

The multi-criteria evaluation (Correctness, Relevance, Completeness, Clarity) with explicit formatting instructions produces parseable judgements. The OVERALL line correctly asks for an "average score," which may naturally be a decimal (e.g., 3.75). The extraction logic in
`mmau_pro_metrics.py` (lines 44-48) uses the regex pattern `\d+(?:\.\d+)?` on all criteria to explicitly capture both integer and decimal scores, converting them to float. This design properly supports the prompt's format and handles the averaging case correctly.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (4)
26-65: Good fix for the OVERALL score overwrite issue.

The
`found_overall` flag correctly distinguishes between a missing OVERALL score and an explicitly provided score of 3.0. This addresses the previously flagged logic bug. One minor consideration: the regex accepts any numeric value, so scores outside the expected 1-5 range would be accepted without validation. If malformed judge output is a concern, you could add bounds checking (clamping to 1-5).
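A standalone sketch of the extraction described here. The `\d+(?:\.\d+)?` pattern is quoted from the review; the missing-criterion fallback to None below is a simplification of the real found_overall handling.

```python
import re

CRITERIA = ("correctness", "relevance", "completeness", "clarity", "overall")

def extract_multicriteria_scores(judgement: str) -> dict:
    """Parse 'CRITERION: <number>' lines from a judge response.
    The pattern accepts both integer and decimal scores."""
    scores = {}
    for criterion in CRITERIA:
        match = re.search(rf"{criterion}\s*:\s*(\d+(?:\.\d+)?)", judgement, re.IGNORECASE)
        scores[criterion] = float(match.group(1)) if match else None
    return scores
```

Because every capture is converted with `float`, an OVERALL line such as "OVERALL: 3.75" parses the same way as an integer score.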
75-82: LGTM!

The `multicriteria_scores` structure is well-organized for accumulating per-criterion scores.
84-98: LGTM!

The scoring logic correctly differentiates between open-ended (judgement-based) and closed-form (binary is_correct) questions, with an appropriate threshold of >= 3.0 for correctness on the 1-5 scale.
170-195: LGTM!

The `metrics_to_print` method correctly extends the output to include multi-criteria metrics when available.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Additional Comments (1)
- nemo_skills/evaluation/evaluator/audio.py, lines 340-341 (link): logic: Undefined variables `text` and `pred_text` will cause a NameError. Should use `ref` and `hyp` instead.
18 files reviewed, 1 comment
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous @melllinia let me know when all looks good on your side and I can do another review and hopefully we can soon merge |
@Kipok Let's proceed. All looking good.
@Kipok Could you please take a look today? |
Kipok
left a comment
just a small comment, but otherwise lgtm!
… audio processing Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
This PR enhances the vLLM multimodal class with audio input, while maintaining its output capabilities.
Summary by CodeRabbit
New Features
Chores
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.