Audiobench and LibriSpeech-PC Benchmarks Evaluation#1060
Conversation
Force-pushed from 502ba8b to c20a232
Resolve conflict
Author did not sign commit. This reverts commit ecfafd1.
- Add comprehensive documentation for LibriSpeech-PC benchmark in speech-audio.md
- Fix jiwer import to be lazy (only import when needed for ASR evaluation)
This reverts commit e5f0c2f.
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Force-pushed from 91423e6 to a908156
gwarmstrong left a comment:
A few comments/questions about the general approach. Also it looks like there are a lot of unrelated changes. Can you fix those?
        Returns:
            True if successful, False otherwise
        """
        audiobench_url = "https://github.com/AudioLLMs/AudioBench.git"
Is it possible to clone it in the Dockerfile instead (Dockerfile.nemo-skills)?
Probably possible, but why?
We use it as a source for dataset preparation; what's the point of cloning it in the Dockerfile?
It's a more predictable user and dev experience.
From the user side, e.g.
Skills/nemo_skills/dataset/audiobench/prepare.py, lines 254 to 259 at a908156
If it is in the Dockerfile, we wouldn't need to have the user check the dependencies, and we'd have a better idea of whether it's set up properly.
From the dev side, if this repo changes and we need to pin it, or it has a compatibility issue with other dependencies, it's much easier to track down if all the dependencies are in one place (like the Dockerfile and requirements.txt). I don't usually expect my scripts to start pulling git repos and failing with import errors if those don't get pulled properly. And reproducing for debug wouldn't be great either, because the dependencies aren't static in the image: I would have to start the container and git pull with the script here, which is a trickier workflow than just having it saved to a known location from the start.
Actually we do not use deps from this repo; if a more stable solution is needed I can hardcode the sources.
PS
Moved this PR to new #1103, as too many irrelevant changes emerged here.
Leaving this open just to keep the comments and conversations.
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
📝 Walkthrough

This PR introduces AudioBench and LibriSpeech-PC benchmark support to the nemo-skills evaluation framework. It adds dataset modules with configuration constants, data preparation scripts for downloading/processing audio data, evaluation logic for ASR and translation tasks with automatic metrics and LLM judge scoring, and integrates these into the existing evaluator and metrics registries.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50-70 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🧹 Nitpick comments (12)
nemo_skills/inference/eval/swebench.py (3)
402-403: Redundant null check. This check is already performed in `__init__` at line 222. If reached here, `agent_framework_repo` is already set. Consider removing this duplicate check.
471-472: Redundant null check. Same as above; this check is already performed in `__init__` at line 236.
580-583: Duplicate `output_dir` initialization. This logic duplicates what's already done in `__init__` (lines 191-194). Since `process_single_datapoint` is called after initialization, `self.output_dir` should already be set. If this is intentional (e.g., for async safety), consider adding a comment to clarify.

        async def process_single_datapoint(self, data_point, data):
            """Will do all necessary generations to get a single answer for the data point."""
    -       self.output_dir = Path(self.cfg.output_file).parent
    -       if self.cfg.inference.random_seed is not None:
    -           self.output_dir = self.output_dir / f"rs{self.cfg.inference.random_seed}"
    -       self.output_dir.mkdir(exist_ok=True)
    +       # Note: self.output_dir is already set in __init__

tests/test_datasets.py (1)
60-61: New audio datasets wired into DATASETS correctly. Adding `audiobench` and `librispeech-pc` here will ensure the default init checks run for them, which is good. One minor note: LibriSpeech-PC's `DEFAULT_SPLIT` is `test-clean`, while this list uses `"test"`; if these split names ever get used by tests or tooling in this file, consider aligning them to avoid confusion.

nemo_skills/pipeline/prepare_data.py (1)
34-34: Adding audiobench and librispeech-pc to DATASETS_REQUIRE_DATA_DIR is reasonable. Marking these new audio benchmarks as requiring `data_dir` is consistent with how other large datasets (e.g., `mmau-pro`) are handled and should prevent accidentally copying large blobs into each job. You may eventually want to honor the existing TODO and derive this list from dataset configs instead of hard-coding.

nemo_skills/dataset/librispeech-pc/__init__.py (1)
15-26: LibriSpeech-PC dataset config aligns with the new speechlm stack. `DATASET_GROUP="speechlm"` and `METRICS_TYPE="speechlm"` match the new metrics registration, and routing via `++eval_type=audiobench` is consistent with the evaluator map. Using `test-clean` as `DEFAULT_SPLIT` also matches the documented LibriSpeech-PC splits; just keep in mind tests list `"test"` as the only split for this dataset, so if those ever start validating splits, you may want to align the naming.

nemo_skills/dataset/audiobench/prepare.py (3)
231-233: Type hint inconsistency: mixing built-in `tuple` with `typing.List`. The return type uses lowercase `tuple` (Python 3.9+ built-in generic) but `List` from `typing`. For consistency, either use `tuple[int, list[dict]]` or `Tuple[int, List[Dict]]`.

    -) -> tuple[int, List[Dict]]:
    +) -> tuple[int, list[dict]]:
250-259: Consider using exception chaining for better debugging. When re-raising exceptions within an `except` block, use `raise ... from e` to preserve the original traceback context.

        try:
            from dataset import Dataset
        except ImportError as e:
            raise ImportError(
                f"Failed to import AudioBench Dataset class: {e}\n"
                f"AudioBench path: {audiobench_path}\n"
                f"Make sure AudioBench repository is properly set up."
    -       )
    +       ) from e
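As a quick illustration of why `from e` matters, here is a standalone sketch; the module name is hypothetical and the function is a stub, not the actual prepare.py code:

```python
def load_dataset_class(path):
    try:
        # stand-in for the AudioBench import; this module intentionally does not exist
        import audiobench_dataset_module  # noqa: F401
    except ImportError as e:
        # `from e` chains the original ImportError so tracebacks show both errors
        raise RuntimeError(f"Failed to set up AudioBench at {path}") from e

try:
    load_dataset_class("/tmp/AudioBench")
    chained = None
except RuntimeError as err:
    chained = err.__cause__  # the original ImportError is preserved here

print(type(chained).__name__)  # ModuleNotFoundError
```

Without `from e`, the original `ImportError` would only appear as implicit context ("During handling of the above exception..."), and `__cause__` would be `None`.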
265-267: Use exception chaining for re-raised exceptions.

        except Exception as e:
    -       raise Exception(f"Failed to load dataset {dataset_name}: {e}")
    +       raise RuntimeError(f"Failed to load dataset {dataset_name}: {e}") from e

nemo_skills/evaluation/metrics/speechlm_metrics.py (1)
107-107: Rename unused loop variables. Per Python convention, prefix unused loop variables with an underscore.

    -   for agg_mode, agg_metrics in metrics_dict.items():
    +   for _agg_mode, agg_metrics in metrics_dict.items():

And in `compute_score`:

    -   for benchmark_name, benchmark_data in benchmarks.items():
    +   for _benchmark_name, benchmark_data in benchmarks.items():

Also applies to: 199-199
nemo_skills/evaluation/evaluator/audiobench.py (2)
240-240: Unused `config` parameter. The `config` parameter is passed but never used in `evaluate_sample`. Either use it or remove it from the signature if it's reserved for future use.

    -def evaluate_sample(sample: dict[str, Any], config: AudioBenchEvaluatorConfig) -> dict[str, Any]:
    +def evaluate_sample(sample: dict[str, Any], _config: AudioBenchEvaluatorConfig) -> dict[str, Any]:

Or if intended for future use, add a comment:

    +   # config is reserved for future judge evaluation settings
    +   _ = config
159-164: Empty string handling with "empty" placeholder. Using "empty" as a placeholder to avoid jiwer errors is a pragmatic workaround. Consider adding a brief comment explaining why this is done.

    -   # Handle empty strings
    +   # Handle empty strings - jiwer requires non-empty inputs
        if not ref:
            ref = "empty"
        if not hyp:
            hyp = "empty"
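A minimal sketch of the guarded pattern this comment suggests; `guard_pair` is a hypothetical helper name (the real code inlines the checks before calling jiwer):

```python
def guard_pair(ref: str, hyp: str) -> tuple[str, str]:
    """Substitute a placeholder because jiwer rejects empty inputs.

    "empty" never matches real words, so an empty reference against a
    non-empty hypothesis still scores as fully wrong, which is the intent.
    """
    return (ref.strip() or "empty", hyp.strip() or "empty")

print(guard_pair("", "hello world"))   # ('empty', 'hello world')
print(guard_pair("the cat sat", ""))   # ('the cat sat', 'empty')
```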
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
- .gitignore (1 hunks)
- docs/evaluation/code.md (1 hunks)
- docs/evaluation/speech-audio.md (4 hunks)
- nemo_skills/dataset/audiobench/__init__.py (1 hunks)
- nemo_skills/dataset/audiobench/judge/__init__.py (1 hunks)
- nemo_skills/dataset/audiobench/nonjudge/__init__.py (1 hunks)
- nemo_skills/dataset/audiobench/prepare.py (1 hunks)
- nemo_skills/dataset/icpc25/__init__.py (1 hunks)
- nemo_skills/dataset/librispeech-pc/__init__.py (1 hunks)
- nemo_skills/dataset/librispeech-pc/prepare.py (1 hunks)
- nemo_skills/dataset/mmau-pro/prepare.py (1 hunks)
- nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
- nemo_skills/evaluation/evaluator/audiobench.py (1 hunks)
- nemo_skills/evaluation/metrics/icpc_metrics.py (0 hunks)
- nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
- nemo_skills/evaluation/metrics/speechlm_metrics.py (1 hunks)
- nemo_skills/inference/eval/swebench.py (7 hunks)
- nemo_skills/inference/generate.py (0 hunks)
- nemo_skills/pipeline/prepare_data.py (1 hunks)
- nemo_skills/prompt/config/judge/audiobench.yaml (1 hunks)
- tests/test_datasets.py (1 hunks)
💤 Files with no reviewable changes (2)
- nemo_skills/inference/generate.py
- nemo_skills/evaluation/metrics/icpc_metrics.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-02T21:26:17.342Z
Learnt from: CR
Repo: NVIDIA-NeMo/Skills PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-02T21:26:17.342Z
Learning: Update documentation to reflect changes made in the codebase
Applied to files:
docs/evaluation/code.md
nemo_skills/dataset/icpc25/__init__.py
📚 Learning: 2025-12-02T21:26:17.342Z
Learnt from: CR
Repo: NVIDIA-NeMo/Skills PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-02T21:26:17.342Z
Learning: Follow the existing code style and conventions in the Nemo-Skills project
Applied to files:
nemo_skills/dataset/audiobench/prepare.py
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/audiobench.py (1)
nemo_skills/utils.py (2)
get_logger_name (39-43), nested_dataclass (69-102)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/speechlm_metrics.py (1)
SpeechLMMetrics(23-163)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/audiobench.py (1)
eval_audiobench(200-237)
nemo_skills/inference/eval/swebench.py (1)
nemo_skills/inference/chat_interface/core.py (1)
cfg(181-182)
🪛 LanguageTool
docs/evaluation/speech-audio.md
[style] ~321-~321: Using many exclamation marks might seem excessive (in this case: 9 exclamation marks for a text that’s 5206 characters long)
Context: ... ## Running LibriSpeech-PC Evaluation !!! note Currently supports only Megatr...
(EN_EXCESSIVE_EXCLAMATION)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/speech-audio.md
280-280: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
288-288: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
313-313: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
320-320: Link text should be descriptive
(MD059, descriptive-link-text)
327-327: Link text should be descriptive
(MD059, descriptive-link-text)
332-332: Link text should be descriptive
(MD059, descriptive-link-text)
337-337: Link text should be descriptive
(MD059, descriptive-link-text)
342-342: Link text should be descriptive
(MD059, descriptive-link-text)
347-347: Link text should be descriptive
(MD059, descriptive-link-text)
352-352: Link text should be descriptive
(MD059, descriptive-link-text)
🪛 Ruff (0.14.8)
nemo_skills/evaluation/metrics/speechlm_metrics.py
107-107: Loop control variable agg_mode not used within loop body
Rename unused agg_mode to _agg_mode
(B007)
199-199: Loop control variable benchmark_name not used within loop body
Rename unused benchmark_name to _benchmark_name
(B007)
nemo_skills/evaluation/evaluator/audiobench.py
188-191: Consider moving this statement to an else block
(TRY300)
192-192: Do not catch blind exception: Exception
(BLE001)
240-240: Unused function argument: config
(ARG001)
nemo_skills/dataset/audiobench/prepare.py
210-210: subprocess call: check for execution of untrusted input
(S603)
211-211: Starting a process with a partial executable path
(S607)
217-217: Consider moving this statement to an else block
(TRY300)
231-231: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
255-259: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
255-259: Avoid specifying long messages outside the exception class
(TRY003)
266-266: Do not catch blind exception: Exception
(BLE001)
267-267: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
267-267: Create your own exception
(TRY002)
267-267: Avoid specifying long messages outside the exception class
(TRY003)
333-333: Do not catch blind exception: Exception
(BLE001)
351-351: Do not catch blind exception: Exception
(BLE001)
498-498: Do not catch blind exception: Exception
(BLE001)
nemo_skills/dataset/librispeech-pc/prepare.py
39-39: Unused function argument: blocknum
(ARG001)
44-44: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (23)
.gitignore (1)
48-49: LGTM! The `.gitignore` entry for `AudioBench/` with a descriptive comment is appropriate to prevent the auto-cloned benchmark repository from being tracked.

nemo_skills/dataset/audiobench/nonjudge/__init__.py (1)
15-31: LGTM! The configuration constants are well-defined with clear documentation. The trailing spaces in `EVAL_ARGS` and `GENERATION_ARGS` appear intentional for argument concatenation.

nemo_skills/dataset/audiobench/judge/__init__.py (1)
31-39: LGTM with a note on hardcoded GPU count. The judge configuration aligns with the AudioBench official implementation. The `server_gpus: 8` default is documented as overridable in run scripts, which is acceptable for flexibility.
414-424: Verify SWE-agent installation commands. The installation flow looks correct: installs `uv`, clones the repo, checks out the commit, creates a venv, and installs dependencies. This duplicates installation per container execution, but is necessary since containers start fresh.
513-525: OpenHands installation uses `make build`, which may be heavyweight. The comment at line 262 in `__init__` states "we no longer use 'make build' because it installs lots of unnecessary dependencies," yet this container command still uses `make build`. Consider aligning the approach or verifying this discrepancy is intentional.
621-631: SWE-bench evaluation harness installation in container. The installation commands for the eval harness look correct and follow the same pattern as agent installation.
nemo_skills/dataset/icpc25/__init__.py (1)
15-17: LGTM! Documentation update correctly references "ICPC25" to match the module name.
nemo_skills/evaluation/metrics/map_metrics.py (1)
40-41: SpeechLM metrics registration looks consistent. Importing `SpeechLMMetrics` and adding the `"speechlm"` entry to `METRICS_MAP` aligns with the new `METRICS_TYPE="speechlm"` datasets and keeps the registry pattern consistent. No issues from this file.

Also applies to: 70-70
nemo_skills/prompt/config/judge/audiobench.yaml (1)
1-29: AudioBench judge prompt is consistent with yes/no scoring. The prompt cleanly defines the reference/model/question sections and asks for a `Judgement: Yes or No`, which matches the expected yes/no parsing in `SpeechLMMetrics`. This should work well for LLM-judge evaluation.

nemo_skills/evaluation/evaluator/__init__.py (1)
18-18: Audiobench / LibriSpeech-PC evaluator wiring looks correct. Importing `eval_audiobench` and registering both `"audiobench"` and `"librispeech-pc"` in `EVALUATOR_MAP` is consistent with the new datasets and avoids any clash with class-based evaluators. This should make both eval types available via the unified `evaluate` entrypoint.

Also applies to: 44-62
nemo_skills/dataset/audiobench/__init__.py (1)
15-36: LGTM! The module metadata and documentation are well-structured. The docstring clearly explains the AudioBench benchmark tasks and evaluation categories, and the constants are appropriately defined for integration with the nemo-skills framework.
docs/evaluation/speech-audio.md (2)
5-8: LGTM! The consolidated warning block at the top is well-positioned and clearly explains the `--no-audio` and `--skip_data_dir_check` flags.
273-400: Comprehensive LibriSpeech-PC documentation. The documentation section is thorough, covering dataset location, available splits, data preparation, evaluation examples, metrics explanation, and output format. This provides users with everything needed to use the benchmark.
nemo_skills/dataset/audiobench/prepare.py (2)
369-412: CLI argument structure is well-designed. The argument parser provides flexible options for processing specific datasets, categories, or all datasets with appropriate defaults and help text.

480-501: Robust error handling for the dataset processing loop. The try/except pattern here appropriately catches failures for individual datasets while allowing the script to continue processing remaining datasets, with proper tracking and summary reporting.
nemo_skills/dataset/librispeech-pc/prepare.py (3)
35-44: LGTM! The download progress implementation is clean. The `blocknum` parameter is required by `urllib.request.urlretrieve`'s reporthook signature even if not used directly.
89-144: Well-structured data processing with proper validation. The function handles missing entries gracefully, creates proper nemo-skills format with messages structure, and cleans up intermediate manifest files after successful processing.
75-86: No action required. The audio paths are handled consistently: the underscore conversion in `download_audio` (line 77) matches the directory structure created by tar extraction, and the manifest's `audio_filepath` values already reference these correct paths from the OpenSLR source.

nemo_skills/evaluation/metrics/speechlm_metrics.py (3)
107-109: Verify the intent of dividing `no_answer` by 2.0. This halving of `no_answer` appears unusual and may be a bug. If this is intentional (e.g., compensating for double-counting), please add a clarifying comment explaining the rationale.
117-128: WER/BLEU metrics computation looks correct. The averaging and percentage conversion are properly implemented, and the metrics are conditionally added only when scores exist.
166-223: Aggregation function handles weighted averaging correctly. The `compute_score` function properly weights metrics by `num_entries` when combining sub-benchmarks, which is the correct approach for aggregate statistics.

nemo_skills/evaluation/evaluator/audiobench.py (2)
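As an aside, the count-weighted aggregation described above boils down to this pattern; the field names `wer` and `num_entries` mirror the review discussion but the function itself is an illustrative sketch, not the real `compute_score`:

```python
def aggregate_metric(sub_benchmarks, key="wer"):
    """Combine per-benchmark scores, weighting each by its sample count."""
    total = sum(b["num_entries"] for b in sub_benchmarks)
    if total == 0:
        return 0.0
    return sum(b[key] * b["num_entries"] for b in sub_benchmarks) / total

benches = [
    {"wer": 10.0, "num_entries": 100},  # small, easy subset
    {"wer": 40.0, "num_entries": 300},  # larger, harder subset
]
# An unweighted mean would give 25.0; weighting by num_entries gives 32.5,
# which matches the WER you would get over the pooled samples.
print(aggregate_metric(benches))
```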
56-96: PER calculation using edit distance is correctly implemented. The dynamic programming approach for computing Punctuation Error Rate follows the arXiv:2310.02943 formula correctly. The handling of empty punctuation sequences (returning 0.0) is appropriate.
99-131: Comprehensive ASR-PC evaluation with multiple WER variants. The implementation correctly computes WER (standard), WER_C (with capitalization), WER_PC (full punctuation + capitalization), and PER as separate metrics, providing granular insight into model performance.
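For intuition, an edit-distance-based PER can be sketched as below. This is an illustrative version only: the punctuation set, the empty-reference handling, and the normalization are assumptions, and the exact formula used in arXiv:2310.02943 and in the reviewed code may differ.

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance over two sequences
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            # prev holds the diagonal (old dp[j-1]); dp[j] is the old "up" value
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def punctuation_error_rate(ref: str, hyp: str, marks=".,?!") -> float:
    ref_p = [c for c in ref if c in marks]
    hyp_p = [c for c in hyp if c in marks]
    if not ref_p and not hyp_p:
        return 0.0  # no punctuation anywhere: nothing to penalize
    if not ref_p:
        return 1.0  # assumption: all inserted punctuation counts as a full error
    return edit_distance(ref_p, hyp_p) / len(ref_p)

# One comma dropped out of two reference marks -> PER = 0.5
print(punctuation_error_rate("Hello, world.", "Hello world."))
```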
- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning eval_harness_repo. Defaults to `HEAD`, i.e. the latest commit.
- **++setup_timeout:** The timeout for downloading & installing the agent framework and the evaluation harness, in seconds. Defaults to 1200, i.e. 20 minutes.
- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning agent_harness_repo. Defaults to `HEAD`, i.e. the latest commit.
Fix parameter name in ++eval_harness_commit description
++eval_harness_commit currently says “after cloning agent_harness_repo”, but there is no such parameter; the actual flag is ++eval_harness_repo. This is likely a copy‑paste typo and can confuse users; suggest changing it to:
    - **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning `eval_harness_repo`. Defaults to `HEAD`, i.e. the latest commit.

The updated ++max_retries description (covering inference and evaluation retries) looks aligned with the new SWE-bench flow.
Based on learnings, keeping docs tightly aligned with flags helps avoid user confusion.
Also applies to: 89-89
🤖 Prompt for AI Agents
In docs/evaluation/code.md around line 85 (also apply same change at line 89),
the description for ++eval_harness_commit incorrectly references cloning
"agent_harness_repo" — change that reference to "eval_harness_repo" so the text
reads that the commit/branch/tag is checked out after cloning eval_harness_repo;
keep the rest of the sentence (Defaults to HEAD) unchanged and ensure both
occurrences (lines 85 and 89) are updated to avoid the copy‑paste confusion.
    if entry.get("audio_path"):
        audio_path = entry["audio_path"]

        if isinstance(audio_path, list) and audio_path:
            user_message["audios"] = [{"path": path, "duration": 10.0} for path in audio_path]
        elif isinstance(audio_path, str):
            user_message["audio"] = {"path": audio_path, "duration": 10.0}

        formatted_entry["messages"] = [user_message]

        # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
        if len(audio_path) == 1:
            user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
        else:
            user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]

        formatted_entry["messages"] = [
            {"role": "system", "content": "You are a helpful assistant. /no_think"},
            user_message,
        ]
Honor the with_audio flag when attaching audio paths
Right now format_entry ignores the with_audio argument: even when --no-audio is used (so with_audio=False), you still add audio/audios fields pointing to /dataset/mmau-pro/.... Since audio download is skipped in that mode, these paths are likely invalid and can break downstream evaluation or inference that assumes existing audio when the field is present.
You can keep the new absolute-path behavior while respecting the flag by gating this block:
- if entry.get("audio_path"):
- audio_path = entry["audio_path"]
- # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
- if len(audio_path) == 1:
- user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
- else:
- user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]
+ if with_audio and entry.get("audio_path"):
+ audio_path = entry["audio_path"]
+ # Prepend /dataset/mmau-pro/ to make paths absolute for cluster
+ if len(audio_path) == 1:
+ user_message["audio"] = {"path": f"/dataset/mmau-pro/{audio_path[0]}"}
+ else:
+ user_message["audios"] = [{"path": f"/dataset/mmau-pro/{path}"} for path in audio_path]This restores the intent of --no-audio while preserving the cluster‑friendly absolute paths when audio is enabled.
🤖 Prompt for AI Agents
In nemo_skills/dataset/mmau-pro/prepare.py around lines 85 to 96, the code
always attaches audio paths to user_message even when with_audio is False;
update the block so it only adds the "audio" or "audios" fields when with_audio
is True. Specifically, wrap the existing if entry.get("audio_path") ... logic
inside a conditional checking if with_audio is truthy, preserving the current
absolute-path construction and single-vs-multiple handling, and leaving
formatted_entry["messages"] unchanged outside that gated block.
    samples_already_evaluated = sum(1 for sample in data if "is_correct" in sample)

    if samples_already_evaluated > 0:
        LOG.info(f"Resuming evaluation: {samples_already_evaluated}/{len(data)} samples already evaluated")

    for idx, sample in enumerate(tqdm(data, desc="Evaluating samples")):
        data[idx] = evaluate_sample(sample, eval_config)
Resume logic logs progress but doesn't skip already-evaluated samples.
The code counts already-evaluated samples and logs resume info, but the loop still re-evaluates all samples. If resumption is intended, samples with is_correct already set should be skipped.
for idx, sample in enumerate(tqdm(data, desc="Evaluating samples")):
+ if "is_correct" in sample:
+ continue # Skip already evaluated samples
data[idx] = evaluate_sample(sample, eval_config)🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/audiobench.py around lines 224-230, the
resume log notes already-evaluated samples but the loop still re-processes them;
change the loop to skip any sample that already contains "is_correct" (or other
evaluation markers) so previously evaluated entries are left unchanged, e.g.,
check if "is_correct" in sample at start of the loop and continue if true,
keeping the same tqdm description and index handling so data[idx] is only
overwritten for newly evaluated samples.
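The suggested skip-on-resume behavior boils down to the pattern below; `evaluate_sample` here is a stub standing in for the real audiobench evaluator:

```python
def evaluate_sample(sample):
    # stub standing in for the real per-sample evaluation
    return {**sample, "is_correct": True}

data = [
    {"id": 1, "is_correct": False},  # evaluated in a previous run: keep as-is
    {"id": 2},                       # not yet evaluated
]

for idx, sample in enumerate(data):
    if "is_correct" in sample:
        continue  # skip already evaluated samples instead of redoing them
    data[idx] = evaluate_sample(sample)

# the first sample's verdict is untouched; only the second was evaluated
print(data)
```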
New version of this PR exists: #1103
This PR is the implementation of AudioBench and LibriSpeech-PC benchmarks evaluation.