Conversation
**📝 Walkthrough**

This PR introduces comprehensive audio/speech evaluation support to the NemoSkills framework. Changes encompass dataset preparation updates for MMAU-Pro with revised message formatting and absolute audio path handling, vLLM model extensions for base64 audio encoding and content preprocessing, inference layer updates to strip binary data during postprocessing, multi-criteria scoring mechanisms in metrics calculation, a new judge prompt configuration, and accompanying integration and unit tests.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant User as User/Dataset
    participant Inference as Inference<br/>(generate.py)
    participant VLLMModel as VLLMModel<br/>(vllm.py)
    participant vLLMServer as vLLM Server
    participant Judge as Judge Model
    participant Metrics as Metrics<br/>(mmau_pro_metrics.py)
    User->>Inference: Input with audio path
    Inference->>VLLMModel: Prepare message with audio
    VLLMModel->>VLLMModel: audio_file_to_base64()
    VLLMModel->>VLLMModel: content_text_to_list()<br/>(normalize to text + audio_url)
    VLLMModel->>vLLMServer: _build_chat_request_params<br/>(send base64 data URL)
    vLLMServer->>vLLMServer: Process audio + text
    vLLMServer-->>VLLMModel: Generation response
    VLLMModel-->>Inference: Model output
    Inference->>Inference: drop_binary_data()<br/>(strip audio_url)
    Inference-->>User: Cleaned output
    User->>Judge: Eval: generation vs reference
    Judge->>Judge: Score on 1-5 scale
    Judge-->>Metrics: Judgement text
    Metrics->>Metrics: extract_multicriteria_scores()
    Metrics->>Metrics: Calculate per-criterion<br/>avg/std/rates
    Metrics-->>User: Multi-criteria metrics
```
**Estimated code review effort**: 🎯 4 (Complex) | ⏱️ ~55 minutes

**Pre-merge checks and finishing touches**

- ❌ Failed checks (1 warning)
- ✅ Passed checks (2 passed)
- ✨ Finishing touches
- 🧪 Generate unit tests (beta)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
**Caution**: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/inference/generate.py (1)
**383-398: `data_dir` is set but never passed to model factory functions.**

`self.data_dir` is extracted from `cfg.eval_config` (lines 384-385) but not passed to any of the three model factory calls (`get_code_execution_model`, `get_tool_calling_model`, or `get_model`). Since `VLLMModel` expects `data_dir` as a constructor parameter for resolving relative audio paths, this needs to be added: include `data_dir=self.data_dir` in each factory call.
🧹 Nitpick comments (9)
nemo_skills/inference/model/vllm.py (3)
**112-132: Method mutates input message dictionary.**

`content_text_to_list` modifies the `message` dict in-place, which could cause unintended side effects if the same message object is reused elsewhere. Consider working on a copy:

```diff
 def content_text_to_list(self, message):
+    message = message.copy()  # Avoid mutating the original
     if "audio" in message or "audios" in message:
         content = message["content"]
         if isinstance(content, str):
```
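Worth noting when applying the suggestion above: a shallow `dict.copy()` duplicates only the top-level mapping. If the method later appends to `message["content"]` (a list), the caller's list is still mutated; a `copy.deepcopy` (or copying the content list itself) is needed for full isolation. A standalone sketch of the difference, not the project's code:

```python
import copy

original = {"role": "user", "content": [{"type": "text", "text": "hi"}]}

# Shallow copy: the top-level dict is new, but "content" still points
# at the same list object as the original.
shallow = original.copy()
shallow["content"].append({"type": "audio_url"})
print(len(original["content"]))  # 2 -- the original was mutated anyway

# Deep copy: nested containers are duplicated too.
original2 = {"role": "user", "content": [{"type": "text", "text": "hi"}]}
deep = copy.deepcopy(original2)
deep["content"].append({"type": "audio_url"})
print(len(original2["content"]))  # 1 -- the original is untouched
```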
**29-33: Add error handling for file operations.**

The function doesn't handle `FileNotFoundError` or `IOError`, which could lead to cryptic error messages when audio files are missing or inaccessible:

```diff
 def audio_file_to_base64(audio_file_path: str):
     """Encodes an audio file into a base64 string."""
-    with open(audio_file_path, "rb") as audio_file:
-        audio_content = audio_file.read()
-    return base64.b64encode(audio_content).decode("utf-8")
+    try:
+        with open(audio_file_path, "rb") as audio_file:
+            audio_content = audio_file.read()
+        return base64.b64encode(audio_content).decode("utf-8")
+    except FileNotFoundError:
+        raise FileNotFoundError(f"Audio file not found: {audio_file_path}")
+    except IOError as e:
+        raise IOError(f"Failed to read audio file {audio_file_path}: {e}")
```
**125-125: Hardcoded MIME type may be incorrect for non-WAV audio files.**

The MIME type is hardcoded as `audio/wav`, but audio files could be in other formats (e.g., MP3, FLAC, OGG). Consider detecting the format from the file extension or magic bytes.
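If extension-based detection is enough, Python's standard `mimetypes` module can supply the type, with WAV as a fallback. A sketch, not the project's implementation:

```python
import mimetypes

def guess_audio_mime(path: str, default: str = "audio/wav") -> str:
    """Guess an audio MIME type from the file extension, falling back to WAV."""
    mime, _ = mimetypes.guess_type(path)
    # Only trust the guess if it is actually an audio type.
    if mime and mime.startswith("audio/"):
        return mime
    return default

print(guess_audio_mime("clip.mp3"))  # audio/mpeg
print(guess_audio_mime("clip.bin"))  # audio/wav (fallback, .bin is not an audio type)
```

Magic-byte sniffing would be more robust for mislabeled files, but needs a third-party library.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (1)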
**123-123: Rename unused loop variable per static analysis.**

The loop variable `agg_mode` is not used within the loop body:

```diff
-    for agg_mode, agg_metrics in metrics_dict.items():
+    for _agg_mode, agg_metrics in metrics_dict.items():
```

tests/gpu-tests/test_vllm_audio.py (1)
**92-94: Missing output directory cleanup in finally block.**

The `finally` block cleans up the temp input file but doesn't remove `output_dir`. While the test pre-cleans the directory at the start (lines 32-34), adding cleanup to the `finally` block ensures cleanup happens even if the test fails before reaching that point in subsequent runs:

```diff
 finally:
     # Cleanup temp file
     Path(input_file).unlink(missing_ok=True)
+    # Cleanup output directory
+    if Path(output_dir).exists():
+        shutil.rmtree(output_dir, ignore_errors=True)
```

tests/test_vllm_audio.py (1)
**123-146: Consider testing actual audio preprocessing instead of mocking the entire method.**

This test mocks `generate_async` entirely, which means it doesn't actually exercise the audio preprocessing logic (`content_text_to_list`, base64 encoding). The real value would be testing that audio messages are correctly transformed before the API call. Consider either:

- Mocking at a lower level (e.g., the HTTP client) to test the actual transformation
- Removing this test, since `test_build_chat_request_with_audio` already covers the preprocessing logic

nemo_skills/inference/generate.py (1)
**383-385: Simplify the None check.**

The `isinstance(..., type(None))` check is unconventional; a simple `is not None` would be clearer:

```diff
 self.data_dir = None
-if "data_dir" in self.cfg.eval_config and not isinstance(self.cfg.eval_config.get("data_dir"), type(None)):
+if self.cfg.eval_config.get("data_dir") is not None:
     self.data_dir = self.cfg.eval_config["data_dir"]
```

docs/evaluation/speech-audio.md (2)
**72-92: Convert indented code blocks to fenced syntax for consistency and readability.**

Static analysis flags indicate these code blocks use indentation-based syntax rather than fenced blocks (triple backticks), which is the modern markdown convention and enables language-specific syntax highlighting. Convert them to fenced blocks with a language specification:

````diff
-    import os
-    from nemo_skills.pipeline.cli import wrap_arguments, eval
-
-    eval(
-        ctx=wrap_arguments(""),
-    )
+```python
+import os
+from nemo_skills.pipeline.cli import wrap_arguments, eval
+
+eval(
+    ctx=wrap_arguments(""),
+)
+```
````

Apply the same pattern to all indented code blocks in the Python API Examples and Alternative command-line usage sections (lines 72-92, 96-116, 120-160).

Also applies to: 96-116, 120-160
**263-263: Specify language for fenced code blocks at lines 263 and 320.**

Static analysis flags indicate these code blocks lack a language specification, which prevents syntax highlighting. Add one:

````diff
 **Instruction Following:**
-```
+```text
 ----------------------- mmau-pro.instruction_following -------------------------
````

And similarly for line 320:

````diff
 **Overall Aggregate Score:**
-```
+```text
 -------------------------------- mmau-pro -----------------------------------------
````

Also applies to: 320-320

**📜 Review details**

**Configuration used**: Path: .coderabbit.yaml
**Review profile**: CHILL
**Plan**: Pro

**📥 Commits**

Reviewing files that changed from the base of the PR and between 48cbf77596cb6abd6abf65b912ed408368bfd678 and 7287e0ad617ad785e39b1c1f62f116b20900123b.

**📒 Files selected for processing (13)**

* docs/evaluation/speech-audio.md (7 hunks)
* nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1 hunks)
* nemo_skills/dataset/mmau-pro/open_ended/__init__.py (1 hunks)
* nemo_skills/dataset/mmau-pro/prepare.py (1 hunks)
* nemo_skills/evaluation/metrics/mmau_pro_metrics.py (3 hunks)
* nemo_skills/inference/generate.py (3 hunks)
* nemo_skills/inference/model/vllm.py (4 hunks)
* nemo_skills/prompt/config/judge/mmau-pro.yaml (1 hunks)
* nemo_skills/prompt/config/judge/speechlm.yaml (0 hunks)
* tests/gpu-tests/run_qwen.sh (1 hunks)
* tests/gpu-tests/test-local.yaml (1 hunks)
* tests/gpu-tests/test_vllm_audio.py (1 hunks)
* tests/test_vllm_audio.py (1 hunks)

**💤 Files with no reviewable changes (1)**

* nemo_skills/prompt/config/judge/speechlm.yaml

**🧰 Additional context used: 🧬 Code graph analysis (4)**

* nemo_skills/evaluation/metrics/mmau_pro_metrics.py uses nemo_skills/evaluation/metrics/base.py: `as_int` (443-446), `as_percentage` (437-440), `_compute_pass_at_k` (352-423); nemo_skills/utils.py: `get_logger_name` (39-43)
* tests/gpu-tests/test_vllm_audio.py uses tests/gpu-tests/utils.py: `require_env_var` (18-23); nemo_skills/pipeline/utils/declarative.py: `run` (346-483)
* tests/test_vllm_audio.py uses nemo_skills/inference/model/vllm.py: `audio_file_to_base64` (29-33), `content_text_to_list` (112-132)
* nemo_skills/dataset/mmau-pro/prepare.py uses nemo_skills/inference/chat_interface/core.py: `get` (136-151)

**🪛 markdownlint-cli2 (0.18.1)**: docs/evaluation/speech-audio.md

* 72-72, 96-96, 120-120: Code block style. Expected: fenced; Actual: indented (MD046, code-block-style)
* 263-263, 320-320: Fenced code blocks should have a language specified (MD040, fenced-code-language)

**🪛 Ruff (0.14.7)**

* nemo_skills/evaluation/metrics/mmau_pro_metrics.py, 123-123: Loop control variable `agg_mode` not used within loop body; rename unused `agg_mode` to `_agg_mode` (B007)
* tests/gpu-tests/test_vllm_audio.py, 31-31: Probable insecure usage of temporary file or directory: "/tmp/nemo-skills-tests/" (S108); 79-79: `subprocess` call with `shell=True` identified, security issue (S602)

**⏰ Context from checks skipped due to timeout of 90000ms (2)**

You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).

* GitHub Check: unit-tests
* GitHub Check: pre-commit

**🔇 Additional comments (21)**

nemo_skills/dataset/mmau-pro/open_ended/__init__.py (1)

* `26-26`: **LGTM!** The prompt configuration reference is correctly updated to use the new `judge/mmau-pro` configuration, which aligns with the new prompt file introduced in this PR.

nemo_skills/dataset/mmau-pro/closed_form/__init__.py (1)

* `19-19`: **LGTM!** The new `EVAL_ARGS` constant follows the established pattern and properly configures the evaluation type for closed-form MMAU-Pro assessments.

nemo_skills/prompt/config/judge/mmau-pro.yaml (1)

* `1-30`: **LGTM!** The judge prompt is well-structured with clear multi-criteria scoring instructions. The output format aligns correctly with the `extract_multicriteria_scores` function in `mmau_pro_metrics.py`.

nemo_skills/dataset/mmau-pro/prepare.py (3)

* `87-91`: **Verify the hardcoded absolute path prefix.** The path prefix `/dataset/mmau-pro/` is hardcoded, which assumes a specific cluster mount point. This could cause issues if the deployment environment differs or if the dataset is mounted at a different location. Consider making this configurable via an argument or environment variable for flexibility across deployment environments.
* `78-79`: **LGTM!** The MCQ formatting is improved with clearer option labels using the "A) B) ..." format, and the instruction "Respond with the complete text of the correct option, not just the letter" provides explicit guidance to the model.
* `93-98`: **LGTM!** The system message structure with conditional `/no_think` for non-open-ended questions appropriately controls reasoning behavior. The two-element message structure (system + user) is a clean approach.

nemo_skills/evaluation/metrics/mmau_pro_metrics.py (2)

* `82-91`: **LGTM!** The scoring logic correctly derives correctness from the multicriteria judgement with a threshold of >= 3.0 for open-ended questions, while maintaining the existing binary correctness path for closed-form evaluations.
* `130-151`: **LGTM!** The multi-criteria metric aggregation is well implemented, with proper conversion from the 1-5 scale to percentage format, and the good/poor response rate thresholds (>=4 and <=2) provide useful quality indicators.

tests/gpu-tests/test-local.yaml (1)

* `20-20`: **Container registry access consideration.** The `nvcr.io/nvidian/` registry is an internal NVIDIA registry. External contributors or CI systems without proper credentials may not be able to pull this image. Consider documenting authentication requirements or providing an alternative public image for broader compatibility.

tests/gpu-tests/run_qwen.sh (1)

* `16-18`: **LGTM!** The audio test integration follows the existing pattern correctly. The model switch to `Qwen2.5-Omni-3B` is appropriate for audio testing, and the placement between generation tests and contamination tests is logical.

tests/test_vllm_audio.py (5)

* `25-39`: **LGTM!** The `test_audio_file_to_base64` test correctly validates the encoding roundtrip with proper temp file cleanup.
* `42-48`: **LGTM!** The fixture properly initializes `VLLMModel` with `data_dir` for relative path resolution in audio handling.
* `51-66`: **LGTM!** Good coverage of single-audio content conversion with proper assertions on structure and data URL format.
* `69-90`: **LGTM!** Tests the multiple-audio scenario correctly, verifying that each audio file results in a separate `audio_url` entry.
* `93-120`: **LGTM!** Comprehensive test that validates the entire request-building flow, including base64 encoding verification.

nemo_skills/inference/generate.py (2)

* `532-544`: **LGTM!** The `drop_binary_data` method correctly: 1) guards against a missing `messages` key; 2) guards against non-list content (string content in simple messages); 3) filters out `audio_url` items to reduce output file size. The in-place mutation is appropriate here since the output dict is transient.
* `561-562`: **LGTM!** Correct placement of the `drop_binary_data` call, after merging original data but before reasoning parsing, ensuring binary data doesn't interfere with text processing.

tests/gpu-tests/test_vllm_audio.py (1)

* `38-59`: The hardcoded audio paths are part of the intended container setup and do not pose a risk. The audio files (`t2_16.wav` and `t3_16.wav`) exist in the repository at `tests/slurm-tests/asr_nim/wavs/`, and the `/nemo_run/code/` prefix is the standard container mount-point path used consistently throughout the test suite. This is expected behavior for container-based tests.

docs/evaluation/speech-audio.md (3)

* `3-16`: **Clear categorical introduction and server support overview.** The new structure effectively explains the scope (audio understanding, ASR, AST) and provides a quick reference table for supported server types. Well organized.
* `34-64`: **Data preparation section is clear and well documented.** Explains the default behavior (audio download), provides explicit guidance on text-only mode with appropriate warnings, and includes environment variable and flag requirements. The distinction between modes and use cases is helpful.
* `232-323`: **Results section is comprehensive and well structured.** Directory structure, output format, and example metrics are clearly documented. The metrics shown across categories are internally consistent and demonstrate realistic variation (e.g., average tokens lower for open-ended than closed-form, appropriate num_entries for each category).
```python
# Fallback: compute overall if missing or still 3.0
if "overall" not in scores or scores["overall"] == 3.0:
    criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
    scores["overall"] = sum(criteria_scores) / len(criteria_scores)
```
**Fallback logic incorrectly overwrites a legitimate score of 3.0.**

The condition `scores["overall"] == 3.0` will overwrite a legitimately assigned overall score of 3.0 with the computed average. A score of 3.0 could be a valid LLM response. Consider tracking whether the overall score was actually matched vs. defaulted:
```diff
 for criterion, pattern in patterns.items():
     match = re.search(pattern, judgement_text, re.IGNORECASE)
-    scores[criterion] = float(match.group(1)) if match else 3.0
+    if match:
+        scores[criterion] = float(match.group(1))
+    else:
+        scores[criterion] = 3.0
+        if criterion == "overall":
+            scores["_overall_missing"] = True

-# Fallback: compute overall if missing or still 3.0
-if "overall" not in scores or scores["overall"] == 3.0:
+# Fallback: compute overall if missing
+if scores.get("_overall_missing", False):
     criteria_scores = [scores.get(k, 3.0) for k in ["correctness", "relevance", "completeness", "clarity"]]
     scores["overall"] = sum(criteria_scores) / len(criteria_scores)
+    del scores["_overall_missing"]
 return scores
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
nemo_skills/evaluation/metrics/mmau_pro_metrics.py around lines 54-57: the
current fallback checks scores["overall"] == 3.0 and thus overwrites a
legitimately assigned 3.0; change the logic to only compute and set overall when
the key is absent or explicitly marked as a default (e.g. use None as the
default overall value or a separate flag like scores["_overall_defaulted"]), so
update callers to set scores["overall"]=None or
scores["_overall_defaulted"]=True when no real overall exists, and then replace
the condition with a check for key absence or None/flag before computing the
average from the four criteria.
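To make the sentinel idea concrete, here is a minimal self-contained sketch; the criterion names come from the review thread, but the exact regex and score format are assumptions, not the repository's code:

```python
import re

CRITERIA = ["correctness", "relevance", "completeness", "clarity", "overall"]

def extract_scores(judgement_text: str) -> dict:
    """Parse 'criterion: N' scores; compute 'overall' only when it is truly absent."""
    scores = {}
    overall_missing = False
    for criterion in CRITERIA:
        match = re.search(rf"{criterion}\s*:\s*([1-5])", judgement_text, re.IGNORECASE)
        if match:
            scores[criterion] = float(match.group(1))
        else:
            scores[criterion] = 3.0  # per-criterion default
            if criterion == "overall":
                overall_missing = True
    if overall_missing:
        sub = [scores[k] for k in CRITERIA if k != "overall"]
        scores["overall"] = sum(sub) / len(sub)
    return scores

# An explicit overall of 3.0 is kept, not overwritten by the average:
text = "correctness: 5\nrelevance: 5\ncompleteness: 5\nclarity: 5\noverall: 3"
print(extract_scores(text)["overall"])  # 3.0
```

With the original `== 3.0` check, the example above would have been silently replaced by the 5.0 average.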
nemo_skills/inference/model/vllm.py (Outdated)

```python
if "audio" in message:
    audio = message["audio"]
    base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
    audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
    message["content"].append(audio_message)
elif "audios" in message:
    for audio in message["audios"]:
        base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
        audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
        message["content"].append(audio_message)
```
**Absolute paths will bypass the data_dir prefix.**

When `audio["path"]` starts with `/` (as generated by prepare.py with paths like `/dataset/mmau-pro/...`), `os.path.join` ignores `data_dir` entirely. This means the `data_dir` parameter becomes ineffective for the absolute paths currently being generated. Either:

- Store relative paths in `prepare.py` and rely on `data_dir` at inference time, or
- Remove the `data_dir` handling if paths are always absolute.
if "audio" in message:
audio = message["audio"]
- base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
+ audio_path = audio["path"]
+ if not os.path.isabs(audio_path):
+ audio_path = os.path.join(self.data_dir, audio_path)
+ base64_audio = audio_file_to_base64(audio_path)
audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
message["content"].append(audio_message)
elif "audios" in message:
for audio in message["audios"]:
- base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"]))
+ audio_path = audio["path"]
+ if not os.path.isabs(audio_path):
+ audio_path = os.path.join(self.data_dir, audio_path)
+ base64_audio = audio_file_to_base64(audio_path)
audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}}
message["content"].append(audio_message)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if "audio" in message: | |
| audio = message["audio"] | |
| base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"])) | |
| audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}} | |
| message["content"].append(audio_message) | |
| elif "audios" in message: | |
| for audio in message["audios"]: | |
| base64_audio = audio_file_to_base64(os.path.join(self.data_dir, audio["path"])) | |
| audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}} | |
| message["content"].append(audio_message) | |
| if "audio" in message: | |
| audio = message["audio"] | |
| audio_path = audio["path"] | |
| if not os.path.isabs(audio_path): | |
| audio_path = os.path.join(self.data_dir, audio_path) | |
| base64_audio = audio_file_to_base64(audio_path) | |
| audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}} | |
| message["content"].append(audio_message) | |
| elif "audios" in message: | |
| for audio in message["audios"]: | |
| audio_path = audio["path"] | |
| if not os.path.isabs(audio_path): | |
| audio_path = os.path.join(self.data_dir, audio_path) | |
| base64_audio = audio_file_to_base64(audio_path) | |
| audio_message = {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{base64_audio}"}} | |
| message["content"].append(audio_message) |
**🤖 Prompt for AI Agents**

```text
In nemo_skills/inference/model/vllm.py around lines 122-131: the current
os.path.join(self.data_dir, audio["path"]) will ignore data_dir when
audio["path"] is absolute (starts with '/'), making data_dir ineffective; update
both branches to first check if audio["path"] is absolute (os.path.isabs) and
use it directly, otherwise join with self.data_dir, or alternatively normalize
by stripping a leading slash (audio["path"].lstrip("/")) before joining so
prepared absolute-looking paths become relative; apply the same change for the
single "audio" and the "audios" loop.
```
Force-pushed from f3c7729 to 933db61.
**gwarmstrong** left a comment:

Have a few suggestions/questions for moving forward.
nemo_skills/inference/model/vllm.py (Outdated)

```diff
 class VLLMModel(BaseModel):
-    def __init__(self, **kwargs):
+    def __init__(self, data_dir: str = "", **kwargs):
+        self.data_dir = data_dir
```
Can we move this out of the VLLMModel class? E.g., if we format the audio file path around here: https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/generate.py#L579

Also, would it be safe to assume paths are relative to the input_file? Then we could drop a parameter here.
Please check again; if not outdated, then no: we should keep it here, as we now need to chunk audio.
nemo_skills/inference/model/vllm.py
Outdated
| "extra_body": self._build_request_body(top_k, min_p, repetition_penalty, extra_body=extra_body), | ||
| } | ||
|
|
||
| def content_text_to_list(self, message): |
Is any of this logic usable outside of vLLM? E.g., does it also work for OpenAI APIs?
Even if not, I think it may have more utility if we make a hook for it in generate.py rather than here.
nemo_skills/inference/generate.py (Outdated)

```python
        continue

    # Filter out audio_url items from list-style content
    message["content"] = [content for content in message["content"] if content.get("type") != "audio_url"]
```
Can you make "audio_url" here a list that is specified in the config? That way it is configurable, and other fields can be included or excluded if desired.
tests/gpu-tests/test-local.yaml (Outdated)

```yaml
containers:
  trtllm: nvcr.io/nvidia/tensorrt-llm/release:1.0.0
  vllm: vllm/vllm-openai:v0.10.1.1
  vllm-audio: nvcr.io/nvidian/ac-aiapps/vllm-openai-audio:v1.0.0
```
Is this org/image gated? That will cause a failure if it is used in CI. I don't actually see it being used, though; would it be used as the server container in the GPU tests?
I am pretty sure this image is org-gated. Yes, I am using it in the GPU tests. What's the recommended workaround in this case?
This will be built into the base vLLM image now; no need to pull the nvcr container.
Force-pushed from 630e2a1 to 4d8e8c9.
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

clean up

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Added postprocessing specific to Qwen models; maybe we should add a check for the model type.
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Force-pushed from 5e9cc99 to 62d5166.
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
**gwarmstrong** left a comment:

Please see #1137 for a suggested architectural change to help consolidate all the audio-specific logic and clean up the vLLM client here a little bit.
Signed-off-by: George Armstrong <georgea@nvidia.com>
Issues with task type.
About chunking size: let's leave it at 30s, as most models work fine with it, while a 60s default would significantly affect results.
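For readers unfamiliar with the chunking discussion: fixed-window chunking just slices the sample stream into 30-second windows. A generic sketch; the function name and sample rate are assumptions, not the project's implementation:

```python
def chunk_boundaries(num_samples: int, sample_rate: int = 16000, chunk_seconds: float = 30.0):
    """Yield (start, end) sample indices for fixed-length audio chunks."""
    chunk_len = int(chunk_seconds * sample_rate)
    for start in range(0, num_samples, chunk_len):
        yield start, min(start + chunk_len, num_samples)

# A 70-second clip at 16 kHz splits into two 30 s chunks and one 10 s tail:
bounds = list(chunk_boundaries(70 * 16000))
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```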
1) After refactoring, AudioProcessor worked only if we passed the ++audio arg, which is inconvenient. 2) Audio enabling is now based on dataset_group=speechlm or task_type=audio; all audio-containing benchmarks have one or both. Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Tested on Qwen, LGTM.
Need to fix the import of audio_file_to_base64.
```python
# Don't use /no_think for open-ended questions to allow reasoning
system_content = "You are a helpful assistant."
if category != "open":
    system_content += " /no_think"
```
This is specific to Qwen models; I don't think we should be doing it globally here?
Imho, this is not a problem for other models.
It certainly introduces a little mismatch; whether it's significant for accuracy is unclear, but I think it's much safer not to have it. What's the reason we need to hardcode it here?
```python
# Audio wrapper (preprocesses messages before they reach the model)
# Auto-enable for audio benchmarks on vLLM (eval_type=audio OR dataset_group=speechlm)
should_enable_audio = self.cfg.audio is not None or (
```
I'd rather configure this through prepare.py of the relevant benchmarks. Otherwise this logic is too coupled with datasets, while this module is very general. So instead of dataset_group, I'd add a should_enable_audio flag that you can set as a default parameter in the init of those benchmarks. You can still check that vLLM is the server and keep the other wrapping logic; just let's remove the dataset_group-based behaviour.
Got your point; when moving to a separate class in PR #1157, this would be auto-resolved.
```python
    for output in outputs:
        fout.write(json.dumps(output) + "\n")

def drop_binary_data(self, output):
```
Maybe just drop_data, or better drop_fields_from_messages or something like that, as it's not limited to binary data but just checks what to drop based on the parameter?
```python
generation_args = f"{eval_args} {generation_args}"
generation_args += f" ++eval_config.split={split} "

# Pass dataset_group from benchmark config if defined
```
No harm from one additional field, but it makes it a bit more convenient to separate datasets.

Still want to remove?
Should we close this one then, @Jorjeous?
**Summary by CodeRabbit (Release Notes)**

- New Features
- Documentation
- Tests