Conversation
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com> Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
clean up
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
1) After refactoring, AudioProcessor worked only if the ++audio arg was passed, which is inconvenient. 2) Audio enabling is now based on dataset_group=speechlm or task_type=audio; all audio-containing benchmarks have one or both. Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
📝 Walkthrough

This PR adds audio processing support to the inference pipeline, introduces multi-criteria scoring for MMAU-Pro benchmark evaluation, updates dataset configurations to use a new judge prompt (mmau-pro instead of speechlm), and refactors data preparation with improved choice formatting and system/user message structure.
Sequence Diagram(s)

sequenceDiagram
participant Client as Client Code
participant Gen as generate.py
participant AP as AudioProcessor
participant Base as Base LLM
participant Backend as Backend Model
Client->>Gen: process_single_datapoint(data_point)
rect rgb(200, 230, 255)
Note over Gen,AP: Audio Initialization
Gen->>Gen: fill_prompt(data_point)
Note over Gen: Extract audio field from data_point
Gen->>Gen: preprocess_prompt(prompt)
Gen->>AP: Initialize AudioProcessor if audio enabled
AP->>AP: Infer data_dir from eval_type
end
rect rgb(230, 220, 255)
Note over AP,Backend: Audio Processing with Chunking
AP->>AP: Detect if audio needs chunking<br/>(duration > threshold)
alt Chunking Enabled
AP->>AP: chunk_audio(audio_array,<br/>sampling_rate, chunk_duration)
loop For Each Chunk
AP->>AP: save_audio_chunk_to_base64(chunk)
AP->>AP: make_audio_content_block(base64)
AP->>AP: Generate prompt for chunk
AP->>Base: generate_async(prompt_chunk)
Base->>Backend: inference
Backend-->>Base: chunk_result
Base-->>AP: chunk_result
AP->>AP: Accumulate chunk_result
end
AP->>AP: Aggregate results<br/>metadata: num_chunks, duration
else Single Audio
AP->>AP: audio_file_to_base64(audio_path)
AP->>AP: make_audio_content_block(base64)
AP->>Base: generate_async(prompt)
Base->>Backend: inference
Backend-->>Base: response
Base-->>AP: response
end
end
rect rgb(230, 255, 230)
Note over Gen,AP: Output Processing
AP-->>Gen: final_output (combined)
Gen->>Gen: postprocess_single_output(output)
Gen->>Gen: drop_binary_data(output)<br/>Remove audio_url, input_audio
Gen-->>Client: processed_output
end
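The chunking branch of the diagram can be sketched in a few lines. This is a minimal illustration only: the real `chunk_audio` in `audio_processor.py` may use overlap or other parameters, so the fixed-size split below is an assumption.

```python
def chunk_audio(samples: list[float], sampling_rate: int, chunk_duration_sec: float) -> list[list[float]]:
    """Split mono audio samples into fixed-duration chunks (the last chunk may be shorter)."""
    n = int(chunk_duration_sec * sampling_rate)
    return [samples[i : i + n] for i in range(0, len(samples), n)]

# 25 s of audio at 16 kHz, split into 10 s chunks -> chunks of 10 s, 10 s, and 5 s
audio = [0.0] * (25 * 16000)
chunks = chunk_audio(audio, 16000, 10.0)
print([len(c) for c in chunks])  # [160000, 160000, 80000]
```

Each chunk would then be base64-encoded and sent as its own request, with per-chunk results accumulated and aggregated as the diagram shows.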
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks: ❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 4
🧹 Nitpick comments (3)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)
123-128: Rename unused loop variable per static analysis. The `agg_mode` variable is not used within the loop body.

🔎 Proposed fix:

```diff
- for agg_mode, agg_metrics in metrics_dict.items():
+ for _agg_mode, agg_metrics in metrics_dict.items():
```

nemo_skills/inference/model/audio_processor.py (2)
199-204: Fix implicit Optional type hints. Per PEP 484, use an explicit `T | None` instead of a `= None` default without the type annotation.

🔎 Proposed fix:

```diff
 async def generate_async(
     self,
     prompt: str | list[dict] | None = None,
-    task_type: str = None,
+    task_type: str | None = None,
     **kwargs,
 ) -> dict:
```
353-384: Rename unused loop variable. The `chunk_idx` variable is not used within the loop.

🔎 Proposed fix:

```diff
- for chunk_idx, audio_chunk in enumerate(chunks):
+ for _chunk_idx, audio_chunk in enumerate(chunks):
```

Alternatively, if chunk indexing might be useful for debugging, consider adding it to the log:

```diff
  for chunk_idx, audio_chunk in enumerate(chunks):
      chunk_messages = []
+     LOG.debug("Processing chunk %d/%d", chunk_idx + 1, len(chunks))
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- nemo_skills/dataset/mmau-pro/closed_form/__init__.py
- nemo_skills/dataset/mmau-pro/open_ended/__init__.py
- nemo_skills/dataset/mmau-pro/prepare.py
- nemo_skills/evaluation/metrics/mmau_pro_metrics.py
- nemo_skills/inference/generate.py
- nemo_skills/inference/model/__init__.py
- nemo_skills/inference/model/audio_processor.py
- nemo_skills/inference/model/openai.py
- nemo_skills/inference/model/vllm.py
- nemo_skills/pipeline/utils/eval.py
- nemo_skills/prompt/config/judge/mmau-pro.yaml
- nemo_skills/prompt/config/judge/speechlm.yaml
- tests/gpu-tests/run_qwen.sh
- tests/gpu-tests/test_vllm_audio.py
- tests/test_vllm_audio.py
💤 Files with no reviewable changes (1)
- nemo_skills/prompt/config/judge/speechlm.yaml
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
Applied to files:
nemo_skills/prompt/config/judge/mmau-pro.yaml
🧬 Code graph analysis (5)
tests/gpu-tests/test_vllm_audio.py (2)
- tests/gpu-tests/utils.py (1): `require_env_var` (18-23)
- nemo_skills/pipeline/utils/declarative.py (1): `run` (346-483)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py (2)
- nemo_skills/evaluation/metrics/base.py (2): `as_int` (443-446), `as_percentage` (437-440)
- nemo_skills/utils.py (1): `get_logger_name` (39-43)
nemo_skills/inference/generate.py (4)
- nemo_skills/inference/model/audio_processor.py (2): `AudioProcessor` (144-406), `AudioProcessorConfig` (34-51)
- nemo_skills/inference/chat_interface/core.py (3): `cfg` (181-182), `sandbox` (177-178), `get` (136-151)
- nemo_skills/inference/model/__init__.py (1): `get_model` (75-95)
- recipes/asr_tts/riva_generate.py (1): `fill_prompt` (130-134)
tests/test_vllm_audio.py (2)
- nemo_skills/inference/model/vllm.py (1): `VLLMModel` (27-151)
- nemo_skills/inference/model/audio_processor.py (1): `audio_file_to_base64` (54-58)
nemo_skills/inference/model/__init__.py (1)
- nemo_skills/inference/model/audio_processor.py (2): `AudioProcessor` (144-406), `AudioProcessorConfig` (34-51)
🪛 Ruff (0.14.10)
tests/gpu-tests/test_vllm_audio.py
31-31: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"
(S108)
66-66: subprocess call with shell=True identified, security issue
(S602)
nemo_skills/evaluation/metrics/mmau_pro_metrics.py
123-123: Loop control variable agg_mode not used within loop body
Rename unused agg_mode to _agg_mode
(B007)
nemo_skills/inference/generate.py
578-578: Unused method argument: data_points
(ARG002)
nemo_skills/inference/model/audio_processor.py
202-202: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
265-265: Avoid specifying long messages outside the exception class
(TRY003)
289-289: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
320-321: try-except-pass detected, consider logging the exception
(S110)
320-320: Do not catch blind exception: Exception
(BLE001)
353-353: Loop control variable chunk_idx not used within loop body
Rename unused chunk_idx to _chunk_idx
(B007)
369-371: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
tests/test_vllm_audio.py
45-45: Unused lambda argument: self
(ARG005)
45-45: Unused lambda argument: kwargs
(ARG005)
🔇 Additional comments (25)
nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1)

19-19: LGTM! The `EVAL_ARGS` constant addition is consistent with the module's pattern and aligns with the MMAU-Pro evaluation flow introduced across the repository.
nemo_skills/inference/model/openai.py (1)

51-51: LGTM! The extension to support "inference-api.nvidia.com" alongside "api.nvidia.com" for NVIDIA API key detection is appropriate and maintains the same security posture.
tests/gpu-tests/run_qwen.sh (1)

23-25: LGTM! The audio test workflow addition follows the established pattern in the file and appropriately sets the model for Qwen2.5-Omni audio testing.
nemo_skills/dataset/mmau-pro/open_ended/__init__.py (1)

26-26: LGTM! The judge prompt configuration update from speechlm to mmau-pro aligns with the new multi-criteria evaluation approach introduced in this PR.
nemo_skills/pipeline/utils/eval.py (1)

134-138: LGTM! The `dataset_group` propagation follows the established pattern in this function and enables proper audio dataset handling in the generation pipeline.
tests/test_vllm_audio.py (4)

25-39: LGTM! The test correctly validates the base64 encoding/decoding round trip for audio files, with proper temporary file cleanup.

42-48: LGTM! The mock fixture appropriately bypasses VLLMModel initialization for isolated testing of the audio preprocessing logic. The unused lambda arguments flagged by static analysis are intentional for the mocking pattern.

51-72: LGTM! The test properly validates that audio content is prepended before text content in the message list, which is critical for Qwen Audio models to transcribe correctly.

74-101: LGTM! The test correctly validates handling of multiple audio files, ensuring all audio entries precede the text content as required by Qwen Audio.
nemo_skills/prompt/config/judge/mmau-pro.yaml (1)

1-30: LGTM! The multi-criteria judge prompt is well-structured, with clear evaluation criteria and output format requirements. The four-criteria evaluation (Correctness, Relevance, Completeness, Clarity) with 1-5 scoring and brief justifications provides a comprehensive assessment for MMAU-Pro open-ended responses.
nemo_skills/inference/model/vllm.py (1)

28-32: LGTM! The added docstring clearly describes the VLLMModel's purpose and appropriately documents that audio processing capabilities are provided through the AudioProcessor wrapper, maintaining a clear separation of concerns.
nemo_skills/inference/model/__init__.py (2)

98-123: Clean factory function for audio model wrapping. The `get_audio_model` function follows the established factory pattern and correctly handles the three cases for `audio_config` (None, dict, or AudioProcessorConfig).

161-178: LGTM on the `schema_overrides` addition. The `schema_overrides` parameter is properly threaded through to `ToolCallingWrapper`.

nemo_skills/dataset/mmau-pro/prepare.py (1)
93-98: The system message structure looks good. The `/no_think` directive for non-open categories is a reasonable approach to controlling model reasoning behavior. The message structure with system + user messages follows OpenAI conventions.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)
130-161: The multicriteria aggregation logic is well-structured. The conversion from the 1-5 scale to a 0-100 percentage is clear, and the good/poor response rate thresholds (≥4.0 and ≤2.0) are reasonable. The rounding at the end ensures consistent output formatting.
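As a quick illustration of that aggregation (a sketch only: the exact metric names and the linear 1-to-0% / 5-to-100% mapping are assumptions, not the file's literal code):

```python
def aggregate(scores: list[float]) -> dict:
    """Aggregate 1-5 judge scores into a 0-100 percentage plus good/poor response rates."""
    n = len(scores)
    mean = sum(scores) / n
    return {
        "overall_pct": round((mean - 1.0) / 4.0 * 100, 2),  # map 1..5 onto 0..100
        "good_response_rate": round(100 * sum(s >= 4.0 for s in scores) / n, 2),
        "poor_response_rate": round(100 * sum(s <= 2.0 for s in scores) / n, 2),
    }

print(aggregate([5.0, 4.0, 3.0, 2.0]))
# {'overall_pct': 62.5, 'good_response_rate': 50.0, 'poor_response_rate': 25.0}
```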
nemo_skills/inference/model/audio_processor.py (3)

54-58: LGTM on the base64 encoding utility. Simple and correct implementation for audio file encoding.

112-141: Good use of tempfile with proper cleanup. The `save_audio_chunk_to_base64` function correctly uses a temporary file with cleanup in a `finally` block. The check for file existence before unlinking is good defensive practice.

241-287: Audio message preparation handles ordering correctly. The comment on line 247 about Qwen models requiring audio before text is important, and the implementation correctly places audio items before text content. Good documentation of this requirement.
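A minimal sketch of that ordering requirement (the helper names here are hypothetical; the content-block shape follows the OpenAI-style `input_audio` format, and the real `make_audio_content_block` signature may differ):

```python
import base64

def make_audio_content_block(b64_data: str, audio_format: str = "wav") -> dict:
    # OpenAI-style input_audio content block carrying base64 audio data
    return {"type": "input_audio", "input_audio": {"data": b64_data, "format": audio_format}}

def build_user_message(text: str, audio_b64_list: list[str]) -> dict:
    # Qwen audio models expect the audio blocks to come BEFORE the text block
    content = [make_audio_content_block(b64) for b64 in audio_b64_list]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

b64 = base64.b64encode(b"\x00\x01\x02").decode("utf-8")
msg = build_user_message("Transcribe this audio", [b64])
print([block["type"] for block in msg["content"]])  # ['input_audio', 'text']
```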
nemo_skills/inference/generate.py (7)

40-49: LGTM on the new imports. The AudioProcessor and AudioProcessorConfig imports are correctly added to support the audio processing feature.

193-204: The new audio-related config fields are well-documented. The `drop_content_types`, `audio_format`, and `dataset_group` fields provide good flexibility for audio processing configuration. The defaults are sensible.

559-567: Audio field propagation in `fill_prompt`. The code correctly copies the `audio` field from `data_point` to the user message. Note that this only handles the singular `audio` key; the `audios` (plural) case is not handled here, but that appears to be intentional since the AudioProcessor handles both cases.

582-596: The `drop_binary_data` implementation is correct. The method properly handles edge cases (missing messages, non-list content) and filters content types based on configuration. The in-place mutation pattern is consistent with `postprocess_single_output`.

578-580: The `data_points` parameter is kept for subclass compatibility. Static analysis flags `data_points` as unused, but this is intentional: the method signature must remain consistent for subclasses that may override `dump_outputs` and need access to `data_points`. This is a false positive.

646-658: Prompt preprocessing and `task_type` handling look correct. The preprocessing hook allows for custom logic in subclasses, and the `task_type` is correctly passed to AudioProcessor when available.

409-431: The AudioProcessor wrapping logic is well-implemented and appropriately specific. The auto-enabling logic based on `server_type` and `eval_type`/`dataset_group` is clear, with the dual condition providing good defensive coverage. The logging gives good visibility into when audio processing is enabled. The current condition `self.cfg.eval_type == "audio"` is appropriately specific to the single audio evaluation type used in the codebase; different audio evaluation tasks are handled within the AudioEvaluator configuration rather than through separate `eval_type` values.
```diff
 if entry.get("audio_path"):
     audio_path = entry["audio_path"]
-    if isinstance(audio_path, list) and audio_path:
-        user_message["audios"] = [{"path": path, "duration": 10.0} for path in audio_path]
-    elif isinstance(audio_path, str):
-        user_message["audio"] = {"path": audio_path, "duration": 10.0}
-
-    formatted_entry["messages"] = [user_message]
+    # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
+    if len(audio_path) == 1:
+        user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
+    else:
+        user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]
```
Potential IndexError if audio_path is an empty list.
The code checks `if entry.get("audio_path")`, which is truthy for a non-empty list, but if `audio_path` could be an empty list `[]`, accessing `audio_path[0]` would fail. While likely not an issue with this dataset, consider adding an explicit length check for robustness.
🔎 Suggested defensive check:

```diff
 if entry.get("audio_path"):
     audio_path = entry["audio_path"]
     # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
-    if len(audio_path) == 1:
+    if audio_path and len(audio_path) == 1:
         user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
-    else:
+    elif audio_path:
         user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]
```
```python
# Fallback: compute overall if missing or still 3.0
if "overall" not in scores or scores["overall"] == 3.0:
    criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
    scores["overall"] = sum(criteria_scores) / len(criteria_scores)
```
Fallback logic may incorrectly overwrite a valid "overall" score of exactly 3.0.
The condition `scores["overall"] == 3.0` will trigger the fallback even when the LLM judge explicitly rated "overall" as 3.0. This could lead to inconsistent behavior.
🔎 Suggested fix using a flag to track whether the score was parsed:

```diff
+ overall_parsed = False
  for criterion, pattern in patterns.items():
      match = re.search(pattern, judgement_text, re.IGNORECASE)
-     scores[criterion] = float(match.group(1)) if match else 3.0
+     if match:
+         scores[criterion] = float(match.group(1))
+         if criterion == "overall":
+             overall_parsed = True
+     else:
+         scores[criterion] = 3.0
  # Fallback: compute overall if missing or still 3.0
- if "overall" not in scores or scores["overall"] == 3.0:
+ if not overall_parsed:
      criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
      scores["overall"] = sum(criteria_scores) / len(criteria_scores)
```

Committable suggestion skipped: line range outside the PR's diff.
```python
def _check_chunking_needed(self, messages: list[dict], task_type: str = None) -> tuple[bool, str, float]:
    """Check if audio in messages needs chunking.

    Returns:
        Tuple of (needs_chunking, audio_path, duration)
    """
    if not self.config.enable_chunking:
        return False, None, 0.0

    # Check if task type should be chunked (if filter is specified)
    if self.config.chunk_task_types is not None:
        if task_type not in self.config.chunk_task_types:
            return False, None, 0.0

    # Find audio in messages
    for msg in messages:
        if msg.get("role") == "user":
            audio_info = msg.get("audio") or (msg.get("audios", [{}])[0] if msg.get("audios") else {})
            if audio_info and "path" in audio_info:
                audio_path = os.path.join(self.data_dir, audio_info["path"])

                if not os.path.exists(audio_path):
                    return False, None, 0.0

                # Load audio to check duration
                try:
                    audio_array, sampling_rate = load_audio_file(audio_path)
                    duration = len(audio_array) / sampling_rate

                    if duration > self.config.chunk_threshold_sec:
                        return True, audio_path, duration
                except Exception:
                    pass
```
Bare exception catch silently swallows errors.
The `except Exception: pass` at lines 320-321 hides potential issues with audio file loading. Consider logging the error or at least handling specific exceptions.
🔎 Suggested improvement with logging:

```diff
- def _check_chunking_needed(self, messages: list[dict], task_type: str = None) -> tuple[bool, str, float]:
+ def _check_chunking_needed(self, messages: list[dict], task_type: str | None = None) -> tuple[bool, str, float]:
      """Check if audio in messages needs chunking.

      Returns:
          Tuple of (needs_chunking, audio_path, duration)
      """
      if not self.config.enable_chunking:
          return False, None, 0.0

      # Check if task type should be chunked (if filter is specified)
      if self.config.chunk_task_types is not None:
          if task_type not in self.config.chunk_task_types:
              return False, None, 0.0

      # Find audio in messages
      for msg in messages:
          if msg.get("role") == "user":
              audio_info = msg.get("audio") or (msg.get("audios", [{}])[0] if msg.get("audios") else {})
              if audio_info and "path" in audio_info:
                  audio_path = os.path.join(self.data_dir, audio_info["path"])

                  if not os.path.exists(audio_path):
                      return False, None, 0.0

                  # Load audio to check duration
                  try:
                      audio_array, sampling_rate = load_audio_file(audio_path)
                      duration = len(audio_array) / sampling_rate

                      if duration > self.config.chunk_threshold_sec:
                          return True, audio_path, duration
-                 except Exception:
-                     pass
+                 except Exception as e:
+                     LOG.debug("Failed to load audio for chunking check: %s - %s", audio_path, e)

      return False, None, 0.0
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.10)
289-289: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
320-321: try-except-pass detected, consider logging the exception
(S110)
320-320: Do not catch blind exception: Exception
(BLE001)
```python
@pytest.mark.gpu
def test_vllm_audio_generation():
    """Integration test: Generate with vLLM server using audio input."""
    model_path = require_env_var("NEMO_SKILLS_TEST_HF_MODEL")
    model_type = require_env_var("NEMO_SKILLS_TEST_MODEL_TYPE")

    output_dir = f"/tmp/nemo-skills-tests/{model_type}/vllm-audio-generation"
    # Clean up output directory
    if Path(output_dir).exists():
        shutil.rmtree(output_dir)

    # Create test input file with audio
    with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f:
        test_data = [
            {
                "question": "Transcribe this audio",
                "audio": {"path": "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t2_16.wav"},
            },
            {
                "question": "What is in this audio?",
                "audio": {"path": "/nemo_run/code/tests/slurm-tests/asr_nim/wavs/t3_16.wav"},
            },
        ]
        for item in test_data:
            f.write(json.dumps(item) + "\n")
        input_file = f.name

    try:
        cmd = (
            f"ns generate "
            f" --cluster test-local --config_dir {Path(__file__).absolute().parent} "
            f" --model {model_path} "
            f" --output_dir {output_dir} "
            f" --server_type vllm "
            f" --server_gpus 1 "
            f" --server_nodes 1 "
            f" --server_args '--enforce-eager' "
            f" --input_file={input_file} "
            f" ++prompt_config=generic/default "
            f" ++skip_filled=False "
        )
        subprocess.run(cmd, shell=True, check=True)

        # Verify output exists and has audio-related generation
        with open(f"{output_dir}/output.jsonl") as fin:
            lines = fin.readlines()

        assert len(lines) == 2, "Should have 2 output lines"

        for line in lines:
            data = json.loads(line)
            assert "generation" in data, "Should have generation field"
            assert len(data["generation"]) > 0, "Generation should not be empty"
            # If model supports audio, generation should contain something
            print(f"Generated: {data['generation']}")

    finally:
        # Cleanup temp file
        Path(input_file).unlink(missing_ok=True)
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's check if the audio files are referenced in the codebase
rg -n "t2_16\.wav|t3_16\.wav" --type py
```

Repository: NVIDIA-NeMo/Skills
Length of output: 325

🏁 Script executed:

```shell
# Check if the audio directory structure exists
find . -type f -name "*.wav" 2>/dev/null | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 147

🏁 Script executed:

```shell
# Check the actual test file to verify the line numbers and content match the review
wc -l tests/gpu-tests/test_vllm_audio.py 2>/dev/null || echo "File not found in expected location"
git ls-files | grep -E "test.*audio|audio.*test" | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 159

🏁 Script executed:

```shell
# Look for any test data or fixtures directories
find . -type d -name "wavs" -o -type d -name "test_data" -o -type d -name "fixtures" 2>/dev/null | grep -v ".git"
```

Repository: NVIDIA-NeMo/Skills
Length of output: 94

🏁 Script executed:

```shell
# Check how other tests reference test data/fixtures
rg -B 2 -A 2 "tests/slurm-tests|tests/gpu-tests" --type py | head -50
```

Repository: NVIDIA-NeMo/Skills
Length of output: 3457

🏁 Script executed:

```shell
# Look for any pattern of dynamic path resolution in tests
rg -n "Path(__file__)" tests/ --type py | head -10
```

Repository: NVIDIA-NeMo/Skills
Length of output: 44

🏁 Script executed:

```shell
# Check if there's a pattern for base path configuration or environment variables
rg -n "nemo_run|NEMO.*PATH|TEST.*PATH" tests/ --type py | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 2148

🏁 Script executed:

```shell
# Look at the test file itself more closely, especially around path handling
cat -n tests/gpu-tests/test_vllm_audio.py | head -50
```

Repository: NVIDIA-NeMo/Skills
Length of output: 2201
Integration test follows established codebase patterns with room for improvement.
The test structure is well-organized with proper cleanup in the finally block. The hardcoded /nemo_run/code/ paths follow the pattern used throughout the test suite for containerized/cluster environments, though they do limit portability to local development setups.
Consider these enhancements:

- Hardcoded audio paths (lines 41, 45): While this pattern is consistent with other tests in the codebase, consider using an environment variable or `Path(__file__).parent` to make the test work in both cluster and local development environments.
- Test assertions are minimal: The test only verifies that output exists and has non-empty generations. Consider adding assertions for audio-specific metadata to better validate audio processing.
- No negative test cases: Consider adding tests for missing audio files or invalid audio formats to ensure proper error handling.
🧰 Tools
🪛 Ruff (0.14.10)
31-31: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/"
(S108)
66-66: subprocess call with shell=True identified, security issue
(S602)
🤖 Prompt for AI Agents
In tests/gpu-tests/test_vllm_audio.py around lines 25-83, the test uses
hardcoded container paths for audio files and has minimal assertions; update it
to obtain audio file locations from an environment variable (e.g.,
NEMO_TEST_ASSETS) or relative to Path(__file__).parent so it works both in
CI/container and local dev, replace the literal "/nemo_run/code/..." paths with
a resolved Path built from that env var or the test file's parent, strengthen
assertions to check audio-related metadata in each output (e.g., presence of
audio tokens/flags, expected transcript substrings or an audio_processing_status
field) and add at least one negative subtest or input case that uses a
missing/invalid audio path and asserts the tool returns a handled error or
specific failure field; keep the existing cleanup logic for the temporary input
file.
I thought we agreed to move audio related stuff into another class (vllm_multimodal)? https://nvidia.slack.com/archives/C0A0VRM2S5A/p1766173279180119
Closing, on top of #1157 |
Addition on top of vLLM audio support, with OpenAI API requests to the inference server carrying audio.