Adding clan PR with AudioBench and Librispeech PC.#1103
Conversation
…nified asr metrics file Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
📝 WalkthroughWalkthroughThis PR introduces support for AudioBench and LibriSpeech-PC audio datasets, including dataset preparation scripts that download, process, and convert audio samples into nemo-skills manifest format. It adds corresponding configuration modules, updates evaluation metrics to handle judge-based rating extraction, and registers the datasets in the preparation pipeline. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~35 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/audio.py (1)
328-334: Missing-generation payload is inconsistent forTranslation(and likelyAST/CER)Line 328 expands the guard to
Translation, but the early return still hard-codes"wer": 1.0. For translation tasks, this can (a) publish misleading metrics and/or (b) break downstream aggregation expecting"bleu"(and for CER expecting"cer").Consider returning task-specific “worst” metrics based on
task_type, e.g.:- if task_type in ["ASR", "ASR-PC", "AST", "Translation", "CER", "ASR_LEADERBOARD"] and not generation: - return { - "is_correct": False, - "wer": 1.0, - "error": "missing_generation", - "predicted_answer": "", - } + if task_type in ["ASR", "ASR-PC", "ASR_LEADERBOARD", "AST", "Translation", "CER"] and not generation: + base = { + "is_correct": False, + "error": "missing_generation", + "predicted_answer": "", + } + if task_type in ["AST", "Translation"]: + return {**base, "bleu": 0.0} + if task_type == "CER": + return {**base, "cer": 1.0} + # ASR / ASR-PC / ASR_LEADERBOARD + return {**base, "wer": 1.0}
🧹 Nitpick comments (5)
.gitignore (1)
48-51: Remove redundant.ideaignore entry (optional cleanup).Line 48 (
.idea) is redundant given Line 44-45 already ignore.idea/and.idea/*. Consider dropping it to avoid confusion.@@ -.idea - # AudioBench repository (auto-cloned during data preparation) AudioBench/nemo_skills/pipeline/prepare_data.py (1)
33-35: Good safety check expansion; consider making matching less brittle.Adding
librispeech-pcandaudiobenchtoDATASETS_REQUIRE_DATA_DIR(Line 34) matches the intent. Theif dataset in extra_argumentssubstring match can mis-detect; consider tokenizing/parsing args (or using a regex with word boundaries) and using asetfor membership.docs/evaluation/speech-audio.md (1)
275-338: Make CLI examples consistent about required--clusterwhen--data_diris used.Some examples include
--data_dirwithout--cluster(e.g., Line 330-337). Ifns prepare_dataenforcesclusterwhendata_diris provided, mirror that in docs to avoid copy/paste failures.nemo_skills/dataset/audiobench/judge/__init__.py (1)
32-39: Consider lowering the default judge serving footprint (8 GPUs) or making it “unset by default.”
Hardcoding 8 GPUs is a frequent footgun for anyone trying to run quick smoke tests locally/CI.JUDGE_PIPELINE_ARGS = { "model": "meta-llama/Meta-Llama-3.1-70B-Instruct", "server_type": "vllm", - "server_gpus": 8, + # Prefer a conservative default; allow run scripts to override for the official setup. + "server_gpus": 1, "server_args": "--max-model-len 8192 --gpu-memory-utilization 0.95", }nemo_skills/dataset/audiobench/prepare.py (1)
194-227:git clonesubprocess: add a timeout + prefer resolvinggitpath (ruff S603/S607).
Inputs are currently constant, but this still spawns an external process with no timeout.@@ - subprocess.run( - ["git", "clone", audiobench_url, str(target_dir)], + git = shutil.which("git") + if not git: + print("✗ git command not found. Please install git or manually clone AudioBench.") + return False + subprocess.run( + [git, "clone", audiobench_url, str(target_dir)], check=True, capture_output=True, text=True, + timeout=1800, )
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
.gitignore(1 hunks)docs/evaluation/speech-audio.md(3 hunks)nemo_skills/dataset/audiobench/__init__.py(1 hunks)nemo_skills/dataset/audiobench/judge/__init__.py(1 hunks)nemo_skills/dataset/audiobench/nonjudge/__init__.py(1 hunks)nemo_skills/dataset/audiobench/prepare.py(1 hunks)nemo_skills/dataset/librispeech-pc/__init__.py(1 hunks)nemo_skills/dataset/librispeech-pc/prepare.py(1 hunks)nemo_skills/evaluation/evaluator/audio.py(2 hunks)nemo_skills/pipeline/prepare_data.py(1 hunks)nemo_skills/prompt/config/judge/audiobench.yaml(1 hunks)nemo_skills/prompt/config/judge/audiobench_binary.yaml(1 hunks)tests/test_datasets.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
nemo_skills/dataset/audiobench/prepare.py
213-213: subprocess call: check for execution of untrusted input
(S603)
214-214: Starting a process with a partial executable path
(S607)
220-220: Consider moving this statement to an else block
(TRY300)
234-234: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
258-262: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
258-262: Avoid specifying long messages outside the exception class
(TRY003)
269-269: Do not catch blind exception: Exception
(BLE001)
270-270: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
270-270: Create your own exception
(TRY002)
270-270: Avoid specifying long messages outside the exception class
(TRY003)
336-336: Do not catch blind exception: Exception
(BLE001)
354-354: Do not catch blind exception: Exception
(BLE001)
511-511: Do not catch blind exception: Exception
(BLE001)
nemo_skills/dataset/librispeech-pc/prepare.py
39-39: Unused function argument: blocknum
(ARG001)
44-44: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (6)
nemo_skills/evaluation/evaluator/audio.py (1)
353-357: RoutingTranslationthrough the translation evaluation path looks correctLine 353 cleanly aligns
TranslationwithASTto shareevaluate_translation(...)and update"predicted_answer"consistently.nemo_skills/dataset/audiobench/nonjudge/__init__.py (1)
15-32: LGTM: clear dataset config constants for non-judge AudioBench tasks.Constants are minimal and consistent (
DATASET_GROUP="speechlm",METRICS_TYPE="audio", args strings).tests/test_datasets.py (1)
58-62: LGTM: adds coverage for new datasets in the import/validation loop.These additions should ensure the new dataset modules expose required defaults.
docs/evaluation/speech-audio.md (1)
48-53: Doc update aligns with new--skip_data_dir_checkbehavior.Including
--skip_data_dir_checkfor the--no-audiopath is consistent with the new global guidance.nemo_skills/dataset/librispeech-pc/__init__.py (1)
15-27: LGTM: minimal, consistent LibriSpeech-PC dataset defaults.Defaults (
DEFAULT_SPLIT="test-clean",METRICS_TYPE="audio", prompt/eval args) look aligned with the audio evaluation pipeline.nemo_skills/dataset/audiobench/__init__.py (1)
15-36: Config-only initializer looks consistent and clear.
Docstring +DATASET_GROUP/SCORE_MODULE/BENCHMARKSlayout is straightforward and should integrate cleanly with dataset discovery.
| def get_audio_duration(audio_array: np.ndarray, sampling_rate: int) -> float: | ||
| """Compute audio duration in seconds from array and sampling rate.""" | ||
| if audio_array is None or len(audio_array) == 0: | ||
| return 0.0 | ||
| return float(len(audio_array) / sampling_rate) | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the full file to understand context and verify the issues
wc -l nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 107
🏁 Script executed:
# Check the actual code at the specified line ranges
sed -n '111,116p' nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 337
🏁 Script executed:
# Check lines 229-370 and 475-513
sed -n '229,370p' nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 5062
🏁 Script executed:
# Check lines 475-513
sed -n '475,513p' nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 1644
🏁 Script executed:
# Check imports and Python version requirements
head -30 nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 1267
🏁 Script executed:
# Search for JUDGE_DATASETS and NONJUDGE_DATASETS definitions
grep -n "JUDGE_DATASETS\|NONJUDGE_DATASETS" nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 707
🏁 Script executed:
# Check if sys.path is cleaned up or if there's a sys.path manipulation issue
grep -n -A5 "sys.path" nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 303
🏁 Script executed:
# Check if there's cleanup after sys.path.insert
sed -n '254,270p' nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 716
🏁 Script executed:
# Check the template files - do they exist in the package?
find nemo_skills/dataset/audiobench -name "__init__.py"Repository: NVIDIA-NeMo/Skills
Length of output: 205
🏁 Script executed:
# Check the full context around category logging issue
sed -n '485,505p' nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 933
🏁 Script executed:
# Check what JUDGE_DATASETS and NONJUDGE_DATASETS contain
sed -n '41,120p' nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 2323
🏁 Script executed:
# Verify the init.py template files exist
ls -la nemo_skills/dataset/audiobench/ | grep -E "judge|nonjudge|__init__"Repository: NVIDIA-NeMo/Skills
Length of output: 241
🏁 Script executed:
# Check directory structure
find nemo_skills/dataset/audiobench -type dRepository: NVIDIA-NeMo/Skills
Length of output: 169
🏁 Script executed:
# Verify what's actually in the judge/__init__.py and nonjudge/__init__.py
cat nemo_skills/dataset/audiobench/judge/__init__.pyRepository: NVIDIA-NeMo/Skills
Length of output: 1605
🏁 Script executed:
# Check how the code actually copies init files (lines 479-483)
sed -n '475,485p' nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 486
🏁 Script executed:
# Verify the actual sys.path behavior - check if there's any cleanup elsewhere
grep -n "sys.path.pop\|sys.path.remove\|sys.path =" nemo_skills/dataset/audiobench/prepare.pyRepository: NVIDIA-NeMo/Skills
Length of output: 44
Guard sampling_rate against zero/falsy values; fix sys.path pollution; correct template __init__.py copy logic; use normalized dataset_name for category logging.
These are correctness footguns causing runtime failures (division by zero), module import state corruption across datasets, broken template initialization, and misleading logs.
sampling_rate guard (line 115): Add zero check before division:
def get_audio_duration(audio_array: np.ndarray, sampling_rate: int) -> float:
"""Compute audio duration in seconds from array and sampling rate."""
if audio_array is None or len(audio_array) == 0:
return 0.0
+ if not sampling_rate:
+ return 0.0
return float(len(audio_array) / sampling_rate)sys.path cleanup (line 254): Add finally block to remove inserted path:
sys.path.insert(0, str(audiobench_path / "src"))
try:
from dataset import Dataset
except ImportError as e:
raise ImportError(
f"Failed to import AudioBench Dataset class: {e}\n"
f"AudioBench path: {audiobench_path}\n"
f"Make sure AudioBench repository is properly set up."
)
+ finally:
+ if sys.path and sys.path[0] == str(audiobench_path / "src"):
+ sys.path.pop(0)Template init copy logic (lines 479–483): Copy from package template, not from output directory:
for category in ["judge", "nonjudge"]:
category_dir = output_dir / category
category_dir.mkdir(exist_ok=True)
# Copy category __init__.py if missing
init_file = category_dir / "__init__.py"
- template_init = output_dir / category / "__init__.py"
+ template_init = Path(__file__).parent / category / "__init__.py"
if not init_file.exists() and template_init.exists():
shutil.copy2(template_init, init_file)Category logging (line 497): Use normalized dataset_name instead of loop variable name:
- category = "judge" if name in JUDGE_DATASETS else "nonjudge"
+ category = "judge" if dataset_name in JUDGE_DATASETS else "nonjudge"
+ if dataset_name not in JUDGE_DATASETS and dataset_name not in NONJUDGE_DATASETS:
+ category = "unknown"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def get_audio_duration(audio_array: np.ndarray, sampling_rate: int) -> float: | |
| """Compute audio duration in seconds from array and sampling rate.""" | |
| if audio_array is None or len(audio_array) == 0: | |
| return 0.0 | |
| return float(len(audio_array) / sampling_rate) | |
| def get_audio_duration(audio_array: np.ndarray, sampling_rate: int) -> float: | |
| """Compute audio duration in seconds from array and sampling rate.""" | |
| if audio_array is None or len(audio_array) == 0: | |
| return 0.0 | |
| if not sampling_rate: | |
| return 0.0 | |
| return float(len(audio_array) / sampling_rate) |
🤖 Prompt for AI Agents
In nemo_skills/dataset/audiobench/prepare.py around lines 111-116, 254, 479-483
and 497: guard against a zero or falsy sampling_rate by returning 0.0 or raising
a clear error before performing the division in get_audio_duration to avoid
division-by-zero; when temporarily inserting a path into sys.path, ensure you
remove that path in a finally block (or use context management) so sys.path is
not polluted after imports; when copying the template __init__.py, copy from the
package template source location (the template package path) rather than from
the output directory so the correct template is used; and for category logging
replace the loop variable use with the normalized dataset_name so logs reference
the normalized name rather than the iterator variable.
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (13)
nemo_skills/prompt/config/judge/audiobench_binary.yaml (1)
5-29: Keep the “rating” wording as-is (per prior clarification).
No further changes suggested here.nemo_skills/dataset/librispeech-pc/prepare.py (5)
35-45: Harden downloader: enforce https + fix tqdm accounting.def download_with_progress(url: str, output_path: Path, desc: str): """Download file with tqdm progress bar.""" + if not url.startswith("https://"): + raise ValueError(f"Refusing to download non-https URL: {url}") with tqdm(unit="B", unit_scale=True, unit_divisor=1024, desc=desc) as pbar: def reporthook(blocknum, blocksize, totalsize): if pbar.total != totalsize: pbar.total = totalsize - pbar.update(blocksize) + downloaded = blocknum * blocksize + pbar.update(max(0, downloaded - pbar.n)) urllib.request.urlretrieve(url, output_path, reporthook)
57-73: Fix manifest presence check + removetarfile(..., filter="data")incompat + add safe extraction.
filter="data"will break on some supported Python versions, andextract*needs traversal hardening. Also, check both manifests before early return.def download_manifests(output_dir: Path) -> Path: """Download LibriSpeech-PC manifests if not already present.""" - if (output_dir / "test-clean.json").exists(): + output_dir.mkdir(parents=True, exist_ok=True) + if (output_dir / "test-clean.json").exists() and (output_dir / "test-other.json").exists(): return output_dir + + def safe_extract_member(tar: tarfile.TarFile, member: tarfile.TarInfo, dest: Path) -> None: + dest = dest.resolve() + target = (dest / member.name).resolve() + if not str(target).startswith(str(dest) + os.sep): + raise ValueError(f"Refusing to extract path outside destination: {member.name}") + tar.extract(member, dest) @@ with tarfile.open(tar_path, "r:gz") as tar: for member in tar.getmembers(): if member.name in ["test-clean.json", "test-other.json"]: - tar.extract(member, output_dir, filter="data") + safe_extract_member(tar, member, output_dir)
75-86: Fix split directory check + avoidfilter="data"in extractall.
split.replace("-", "_")likely won’t match the extractedLibriSpeech/test-cleandirectory name, causing repeated downloads.def download_audio(split: str, audio_dir: Path): """Download LibriSpeech audio files if not already present.""" - split_dir = audio_dir / "LibriSpeech" / split.replace("-", "_") + split_dir = audio_dir / "LibriSpeech" / split if split_dir.exists(): return @@ with tarfile.open(tar_path, "r:gz") as tar: - tar.extractall(audio_dir, filter="data") + # Avoid `filter=...` for broader Python compatibility; consider adding + # a safe-extract wrapper (like in download_manifests) if you want traversal protection here too. + tar.extractall(audio_dir)
89-145: Fix container path composition (hardcoded root + likely double “LibriSpeech/” prefix).
Make container root configurable and normalizeaudio_filepathfrom the manifest to a relative path underLibriSpeech/.audio_id = Path(audio_filepath).stem - container_path = f"/dataset/librispeech-pc/LibriSpeech/{audio_filepath}" + container_root = os.getenv("NEMO_SKILLS_DATASET_ROOT", "/dataset") + parts = Path(audio_filepath).parts + if "LibriSpeech" in parts: + rel = Path(*parts[parts.index("LibriSpeech") + 1 :]) + else: + rel = Path(audio_filepath.lstrip("/")) + container_path = f"{container_root}/librispeech-pc/LibriSpeech/{rel.as_posix()}"
147-172: Add--data_dir(docs mention it) and don’t write into the installed package dir by default.def main(): parser = argparse.ArgumentParser(description="Prepare LibriSpeech-PC for ASR evaluation") + parser.add_argument( + "--data_dir", + type=str, + default=os.getenv("NEMO_SKILLS_DATA_DIR"), + help="Base data dir. Output goes under <data_dir>/librispeech-pc.", + ) @@ args = parser.parse_args() - data_dir = Path(__file__).parent - audio_dir = data_dir - audio_dir.mkdir(exist_ok=True) + if not args.data_dir: + raise SystemExit("Missing --data_dir and NEMO_SKILLS_DATA_DIR is not set.") + data_dir = (Path(args.data_dir).expanduser().resolve() / "librispeech-pc") + audio_dir = data_dir + audio_dir.mkdir(parents=True, exist_ok=True)nemo_skills/dataset/audiobench/prepare.py (7)
111-116: Guard against zero/falsy sampling_rate (division-by-zero).def get_audio_duration(audio_array: np.ndarray, sampling_rate: int) -> float: """Compute audio duration in seconds from array and sampling rate.""" if audio_array is None or len(audio_array) == 0: return 0.0 + if not sampling_rate: + return 0.0 return float(len(audio_array) / sampling_rate)
124-176: Make manifest audio root configurable (don’t hardcode/data).
Right now the script writes underoutput_dir/...but the manifest always points at/data/audiobench/....def create_manifest_entry( @@ sample_id: int, category: str, + audio_root: str, ) -> Dict: @@ - audio_rel_path = f"/data/audiobench/{category}/audio/{dataset_name}/{audio_filename}" + audio_rel_path = f"{audio_root.rstrip('/')}/audiobench/{category}/audio/{dataset_name}/{audio_filename}" @@ entry = { @@ }
253-263: Avoid leaking sys.path mutations (add finally cleanup).- sys.path.insert(0, str(audiobench_path / "src")) + inserted = str(audiobench_path / "src") + sys.path.insert(0, inserted) try: from dataset import Dataset except ImportError as e: raise ImportError( @@ f"Make sure AudioBench repository is properly set up." ) + finally: + if sys.path and sys.path[0] == inserted: + sys.path.pop(0)
372-426: Fix default output_dir (don’t write into site-packages); align with CLI help.parser.add_argument( "--output_dir", type=str, default=None, help="Output directory (defaults to $NEMO_SKILLS_DATA_DIR/audiobench)", ) + parser.add_argument( + "--audio-root", + type=str, + default=os.getenv("NEMO_SKILLS_AUDIO_ROOT", "/data"), + help="Container-visible root prefix for audio paths (default: /data).", + ) @@ - if args.output_dir: - output_dir = Path(args.output_dir) - else: - # Use dataset directory as output (files will be in nemo_skills/dataset/audiobench/) - output_dir = Path(__file__).parent + if args.output_dir: + output_dir = Path(args.output_dir) + else: + base = os.getenv("NEMO_SKILLS_DATA_DIR") + if not base: + raise SystemExit( + "Missing --output_dir and NEMO_SKILLS_DATA_DIR is not set. " + "Refusing to write into the installed package directory." + ) + output_dir = Path(base) / "audiobench"
474-483: Fix init.py template copy (current code copies from the destination to itself).for category in ["judge", "nonjudge"]: category_dir = output_dir / category category_dir.mkdir(exist_ok=True) @@ init_file = category_dir / "__init__.py" - template_init = output_dir / category / "__init__.py" + template_init = Path(__file__).parent / category / "__init__.py" if not init_file.exists() and template_init.exists(): shutil.copy2(template_init, init_file)
488-498: Use normalized dataset_name for category logging (and handle unknown).- category = "judge" if name in JUDGE_DATASETS else "nonjudge" + if dataset_name in JUDGE_DATASETS: + category = "judge" + elif dataset_name in NONJUDGE_DATASETS: + category = "nonjudge" + else: + category = "unknown"
341-350: Pass the new audio_root through to manifest entries.
Follow-up to making/dataconfigurable.entry = create_manifest_entry( sample=sample, audio_filename=audio_filename, duration=duration, dataset_name=dataset_name, sample_id=idx, category=category, + audio_root=os.getenv("NEMO_SKILLS_AUDIO_ROOT", "/data"), )
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/audio.py (1)
328-339: Nice missing-generation normalization; consider a constant/set for task types.
Current logic looks correct; using aset(or module-level constant) would reduce churn risk as more task types are added.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
nemo_skills/dataset/audiobench/__init__.py(1 hunks)nemo_skills/dataset/audiobench/judge/__init__.py(1 hunks)nemo_skills/dataset/audiobench/nonjudge/__init__.py(1 hunks)nemo_skills/dataset/audiobench/prepare.py(1 hunks)nemo_skills/dataset/librispeech-pc/__init__.py(1 hunks)nemo_skills/dataset/librispeech-pc/prepare.py(1 hunks)nemo_skills/evaluation/evaluator/audio.py(2 hunks)nemo_skills/evaluation/metrics/audio_metrics.py(5 hunks)nemo_skills/prompt/config/judge/audiobench.yaml(1 hunks)nemo_skills/prompt/config/judge/audiobench_binary.yaml(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- nemo_skills/dataset/audiobench/init.py
- nemo_skills/prompt/config/judge/audiobench.yaml
- nemo_skills/dataset/audiobench/judge/init.py
- nemo_skills/dataset/librispeech-pc/init.py
- nemo_skills/dataset/audiobench/nonjudge/init.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:48.719Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
📚 Learning: 2025-12-12T16:09:48.719Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:48.719Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
Applied to files:
nemo_skills/dataset/audiobench/prepare.pynemo_skills/evaluation/metrics/audio_metrics.pynemo_skills/prompt/config/judge/audiobench_binary.yaml
🪛 Ruff (0.14.8)
nemo_skills/dataset/librispeech-pc/prepare.py
39-39: Unused function argument: blocknum
(ARG001)
44-44: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
nemo_skills/dataset/audiobench/prepare.py
213-213: subprocess call: check for execution of untrusted input
(S603)
214-214: Starting a process with a partial executable path
(S607)
220-220: Consider moving this statement to an else block
(TRY300)
234-234: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
258-262: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
258-262: Avoid specifying long messages outside the exception class
(TRY003)
269-269: Do not catch blind exception: Exception
(BLE001)
270-270: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
270-270: Create your own exception
(TRY002)
270-270: Avoid specifying long messages outside the exception class
(TRY003)
336-336: Do not catch blind exception: Exception
(BLE001)
354-354: Do not catch blind exception: Exception
(BLE001)
511-511: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/evaluation/evaluator/audio.py (1)
358-362: Treating Translation like AST is consistent with BLEU-based evaluation.
LGTM.nemo_skills/evaluation/metrics/audio_metrics.py (3)
215-219: Good: only collect judge_ratings when judgement exists + category is open.
LGTM.
245-250: judge_score computation is fine once ratings are consistently 0–5.
After the binary mapping fix above,* 20is consistent.
312-315: Conditional printing of judge_score is clean.
LGTM.
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
1 similar comment
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (6)
nemo_skills/dataset/librispeech-pc/prepare.py (1)
36-46: Consider validating URL scheme before download.
urllib.request.urlretrieveaccepts arbitrary schemes includingfile://. Since the URLs are constants in this file, the risk is low, but adding a scheme check provides defense-in-depth.def download_with_progress(url: str, output_path: Path, desc: str): """Download file with tqdm progress bar.""" + if not url.startswith(("https://", "http://")): + raise ValueError(f"Refusing to download non-http(s) URL: {url}") with tqdm(unit="B", unit_scale=True, unit_divisor=1024, desc=desc) as pbar:docs/evaluation/speech-audio.md (1)
5-8: Fix grammar in warning block.Line 6: "If you want to evaluation…" → "If you want to evaluate…".
- If you want to evaluation without audio files (not recommended) use + If you want to evaluate without audio files (not recommended) usenemo_skills/dataset/audiobench/prepare.py (4)
109-113: Guard against zerosampling_rateto prevent division by zero.If
sampling_rateis 0 or falsy (e.g., missing from audio metadata), this will raise aZeroDivisionError.def get_audio_duration(audio_array: np.ndarray, sampling_rate: int) -> float: """Compute audio duration in seconds from array and sampling rate.""" if audio_array is None or len(audio_array) == 0: return 0.0 + if not sampling_rate: + return 0.0 return float(len(audio_array) / sampling_rate)
526-531: Defaultoutput_dirwrites into the package directory (often read-only).When
--output_diris not provided, the script defaults toPath(__file__).parent, which is the installed package directory (e.g.,site-packages/.../nemo_skills/dataset/audiobench/). This location is typically read-only and inappropriate for runtime data.# Determine output directory if args.output_dir: output_dir = Path(args.output_dir) else: - # Use dataset directory as output (files will be in nemo_skills/dataset/audiobench/) - output_dir = Path(__file__).parent + base = os.getenv("NEMO_SKILLS_DATA_DIR") + if not base: + raise SystemExit( + "Missing --output_dir and NEMO_SKILLS_DATA_DIR is not set. " + "Refusing to write into the installed package directory." + ) + output_dir = Path(base) / "audiobench"
556-565: Template__init__.pycopy logic references output directory instead of package source.Lines 563-564 set
template_init = output_dir / category / "__init__.py", which is the same location asinit_file. The template should be copied from the package's source directory.# Copy category __init__.py init_file = category_dir / "__init__.py" - template_init = output_dir / category / "__init__.py" + template_init = Path(__file__).parent / category / "__init__.py" if not init_file.exists() and template_init.exists(): shutil.copy2(template_init, init_file)
578-579: Use normalizeddataset_namefor category determination, not loop variablename.After normalization (lines 572-576) may add
_testsuffix, the category check at line 579 still uses the originalname. This could cause incorrect category assignment.# Determine category for logging - category = "judge" if name in JUDGE_DATASETS else "nonjudge" + category = "judge" if dataset_name in JUDGE_DATASETS else "nonjudge"
🧹 Nitpick comments (3)
nemo_skills/dataset/audiobench/prepare.py (3)
161-163: Hardcoded/data/prefix may not match local or alternative deployment roots.The audio path prefix is hardcoded to
/data/audiobench/.... Consider making this configurable via an environment variable or CLI argument for flexibility across different deployment scenarios.Please verify if this hardcoded path is intentional for the target deployment environment, or if it should be configurable similar to how
librispeech-pc/prepare.pyusesNEMO_SKILLS_AUDIO_ROOT.
229-234: Preserve exception chain when re-raising.Use
raise ... from eto maintain the exception chain for better debugging.try: from datasets import load_dataset except Exception as e: raise ImportError( f"Failed to import HuggingFace 'datasets'. Please ensure it is installed.\nOriginal error: {e}" - ) + ) from e
379-388: Preserve exception chain when re-raising.Use
raise ... from eto maintain the exception chain for debugging.except Exception as e: raise Exception( "Failed to load AudioBench dataset via HuggingFace.\n" f"- Requested dataset_name: {dataset_name}\n" ... - ) + ) from e
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
docs/evaluation/speech-audio.md(4 hunks)nemo_skills/dataset/audiobench/prepare.py(1 hunks)nemo_skills/dataset/librispeech-pc/prepare.py(1 hunks)nemo_skills/evaluation/metrics/audio_metrics.py(5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
Applied to files:
nemo_skills/evaluation/metrics/audio_metrics.pynemo_skills/dataset/audiobench/prepare.py
🪛 Ruff (0.14.8)
nemo_skills/dataset/librispeech-pc/prepare.py
46-46: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
93-93: Uses of tarfile.extractall()
(S202)
181-181: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/dataset/audiobench/prepare.py
231-231: Do not catch blind exception: Exception
(BLE001)
232-234: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
232-234: Avoid specifying long messages outside the exception class
(TRY003)
371-371: Do not catch blind exception: Exception
(BLE001)
379-379: Do not catch blind exception: Exception
(BLE001)
380-388: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
380-388: Create your own exception
(TRY002)
380-388: Avoid specifying long messages outside the exception class
(TRY003)
449-449: Do not catch blind exception: Exception
(BLE001)
467-467: Do not catch blind exception: Exception
(BLE001)
592-592: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (8)
docs/evaluation/speech-audio.md (1)
347-409: LGTM!The new AudioBench and LibriSpeech-PC documentation sections are well-structured and follow the existing documentation patterns. They clearly document dataset locations, preparation commands, and available splits.
nemo_skills/evaluation/metrics/audio_metrics.py (3)
77-113: LGTM! Judge result extraction properly handles both formats.The implementation correctly:
- Parses decimal ratings with
[0-9]+(?:\.[0-9]+)?- Clamps ratings to [0, 5] range
- Maps legacy Yes/No to 5.0/0.0 for consistent 0-100 scaling when multiplied by 20
246-250: LGTM!The judge_score computation correctly converts the 0-5 rating scale to 0-100 by multiplying the average by 20, and the empty-list guard prevents division by zero.
313-315: LGTM!The judge_score is correctly formatted as a plain number (0-100 scale) rather than as a percentage, which aligns with the scaling applied in
get_metrics().nemo_skills/dataset/librispeech-pc/prepare.py (4)
67-73: LGTM! Tarfile filter compatibility handled correctly.The version guard
sys.version_info >= (3, 11, 4)properly enables thefilter="data"security feature on supported Python versions while maintaining compatibility with Python 3.10.
126-130: LGTM! Audio path construction handles prefix correctly.The logic properly strips leading slashes and the
LibriSpeech/prefix to avoid double-prefixing, and usesNEMO_SKILLS_AUDIO_ROOTenvironment variable for configurability.
180-184: LGTM! Data directory handling is robust.The script correctly requires
--data_dirorNEMO_SKILLS_DATA_DIRenvironment variable, preventing accidental writes to the package installation directory.
89-93: Acceptable risk:extractallon trusted source.The
tarfile.extractall()call is flagged by static analysis (S202), but the risk is mitigated by:
- The archive source is the official OpenSLR repository (trusted)
- The
filter="data"security feature is enabled on Python 3.11.4+For Python 3.10/3.11.0-3.11.3, manual member filtering could further harden this, but given the trusted source this is acceptable.
gwarmstrong
left a comment
There was a problem hiding this comment.
Quick Q about merge confict.
Also, I suspect we will need to add these datasets to the exclude list for gpu-tests: https://github.com/NVIDIA-NeMo/Skills/blob/main/tests/gpu-tests/test_eval.py#L29-L47
Re-running slurm tests to confirm.
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
|
Yep, probably we should add this to exlude list |
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Rebasing audiobench and audio tasks to unified asr metrics
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.