add musan dataset by Jorjeous · Pull Request #1139 · NVIDIA-NeMo/Skills

Jorjeous · 2025-12-22T10:14:24Z

Summary by CodeRabbit

New Features
- Added MUSAN dataset module with audio benchmarking configuration.
- Implemented dataset preparation tool supporting multiple sources (HuggingFace, Kaggle, OpenSLR).
- Added audio processing and manifest generation for dataset standardization and evaluation workflows.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

coderabbitai · 2025-12-22T10:17:59Z

📝 Walkthrough

Walkthrough

Introduces MUSAN dataset integration with configuration metadata and preparation utilities. Adds constants for dataset group, evaluation settings, and benchmarking configuration. Provides preparation module supporting multiple data sources (HuggingFace, Kaggle, OpenSLR) with audio processing, manifest generation, and command-line interface.

Changes

Cohort / File(s)	Summary
Dataset Configuration `nemo_skills/dataset/musan/__init__.py`	Defines dataset metadata and benchmarking constants: `DATASET_GROUP`, `IS_BENCHMARK_GROUP`, `SCORE_MODULE`, `METRICS_TYPE`, `EVAL_ARGS`, `GENERATION_ARGS`, and `BENCHMARKS` dictionary.
Dataset Preparation `nemo_skills/dataset/musan/prepare.py`	Implements dataset preparation with support for HuggingFace, Kaggle, and OpenSLR sources. Includes functions for downloading (Kaggle/OpenSLR), audio processing (duration calculation, WAV file handling), manifest generation with structured payloads, per-category processing, and CLI argument parsing.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / main()
    participant Load as load_dataset_from_source()
    participant HF as HuggingFace Dataset
    participant KG as Kaggle/OpenSLR
    participant Process as process_category()
    participant Audio as Audio Processing
    participant Manifest as Manifest Writer

    CLI->>Load: load(source, output_dir)
    alt Source == "huggingface"
        Load->>HF: datasets.load_dataset()
        HF-->>Load: dataset object
    else Source == "kaggle" or "openslr"
        Load->>KG: download_from_kaggle/openslr()
        KG-->>Load: local dataset path
    end
    
    CLI->>Process: for each category
    Process->>Audio: get_audio_duration()
    Audio-->>Process: duration (float)
    
    alt save_audio == True
        Process->>Audio: save_audio_file()
        Audio->>Manifest: write WAV files
    end
    
    Process->>Process: create_manifest_entry()
    Process-->>Manifest: per-sample entry
    
    Manifest->>Manifest: write test.jsonl (per-category)
    Manifest-->>CLI: consolidated manifest

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Multiple data source handling paths: Requires understanding branching logic for HuggingFace, Kaggle, and OpenSLR sources with different download/load mechanisms
Audio processing logic: Includes WAV file reading, duration calculation, sample rate handling, and conditional file writing
Manifest generation complexity: Structured payload creation with role-based messages and Hallucination task type specification
Error handling coverage: Per-sample error handling with counters and logging across multiple processing stages
File I/O operations: Path validation, directory creation, and batch manifest writing

Suggested reviewers

gwarmstrong

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'add musan dataset' directly and clearly summarizes the main change—adding a new MUSAN dataset module with configuration, preparation utilities, and manifest generation capabilities.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch musan-hallucination-eval

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (7)

nemo_skills/dataset/musan/prepare.py (7)
43-53: Inconsistent label mappings.

CATEGORY_LABELS maps "music" to 1, but LABEL_TO_CATEGORY maps 1 back to "other" instead of "music". This asymmetry appears intentional but could cause confusion. Additionally, LABEL_TO_CATEGORY is defined but never used in the code.
Consider removing unused constant or aligning mappings

If LABEL_TO_CATEGORY is not needed, remove it. Otherwise, align the mappings:
 LABEL_TO_CATEGORY = {
-    1: "other",
+    1: "music",
 }
55-69: Unused output_dir parameter and missing exception chaining.

The output_dir parameter is never used since kagglehub manages download locations internally. Also, exceptions should be chained with from e for better stack traces.
Proposed fix
-def download_from_kaggle(output_dir: Path) -> Path:
+def download_from_kaggle() -> Path:
     """Download MUSAN dataset from Kaggle using kagglehub."""
     try:
         import kagglehub
     except ImportError:
-        raise ImportError("kagglehub not installed. Run: pip install kagglehub")
+        raise ImportError("kagglehub not installed. Run: pip install kagglehub") from None
     
     print("Downloading from Kaggle (requires API key in ~/.kaggle/kaggle.json)")
     
     try:
         path = kagglehub.dataset_download("dogrose/musan-dataset")
         print(f"Downloaded to: {path}")
         return Path(path)
     except Exception as e:
-        raise Exception(f"Kaggle download failed: {e}")
+        raise RuntimeError(f"Kaggle download failed: {e}") from e
Note: If you remove the parameter, update the call site at line 116 accordingly.
106-142: LGTM with minor suggestion.

The function handles all three sources well with appropriate logging. Consider adding a return type annotation for better documentation.
-def load_dataset_from_source(source: str, output_dir: Path):
+def load_dataset_from_source(source: str, output_dir: Path) -> tuple[Path | Any, str]:
158-194: Hardcoded path prefix may limit portability.

The audio_rel_path uses a hardcoded prefix "/data/musan/" which assumes a specific deployment structure. Consider making this configurable or documenting this requirement.
Consider parameterizing the base path
 def create_manifest_entry(
     audio_filename: str,
     duration: float,
     category: str,
     sample_id: int,
     label: str,
+    base_path: str = "/data/musan",
 ) -> Dict:
     """Create nemo-skills manifest entry."""
-    audio_rel_path = f"/data/musan/{category}/audio/{audio_filename}"
+    audio_rel_path = f"{base_path}/{category}/audio/{audio_filename}"
197-204: Unused split parameter.

The split parameter is defined but never used in this function. Either remove it or implement split-based logic if intended.
Remove unused parameter
 def process_category_from_files(
     category: str,
     dataset_path: Path,
     output_dir: Path,
     save_audio: bool = True,
-    split: str = "train",
     max_samples: int = -1,
 ) -> tuple[int, List[Dict]]:
Update the call site accordingly at lines 287-294.
420-426: Minor: Unnecessary f-string prefix.

Line 421 has an f-string without placeholders.
     print("\n" + "=" * 60)
-    print(f"MUSAN Dataset Preparation")
+    print("MUSAN Dataset Preparation")
     print("=" * 60)
399-399: Unused --split argument.

The --split argument is parsed but the split parameter is never used in processing functions. Either implement split-based behavior (e.g., naming output files differently) or remove this argument to avoid user confusion.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7205c43 and 3734ed3.

📒 Files selected for processing (2)

nemo_skills/dataset/musan/__init__.py
nemo_skills/dataset/musan/prepare.py

🧰 Additional context used

🪛 Ruff (0.14.8)

nemo_skills/dataset/musan/prepare.py

55-55: Unused function argument: output_dir

(ARG001)

60-60: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

60-60: Avoid specifying long messages outside the exception class

(TRY003)

68-68: Do not catch blind exception: Exception

(BLE001)

69-69: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

69-69: Create your own exception

(TRY002)

69-69: Avoid specifying long messages outside the exception class

(TRY003)

78-78: f-string without any placeholders

Remove extraneous f prefix

(F541)

89-89: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

98-98: Uses of tarfile.extractall()

(S202)

119-119: Avoid specifying long messages outside the exception class

(TRY003)

142-142: Avoid specifying long messages outside the exception class

(TRY003)

202-202: Unused function argument: split

(ARG001)

208-208: Avoid specifying long messages outside the exception class

(TRY003)

239-239: Do not catch blind exception: Exception

(BLE001)

255-255: Do not catch blind exception: Exception

(BLE001)

357-357: Do not catch blind exception: Exception

(BLE001)

374-374: Do not catch blind exception: Exception

(BLE001)

421-421: f-string without any placeholders

Remove extraneous f prefix

(F541)

430-430: Do not catch blind exception: Exception

(BLE001)

454-454: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: pre-commit
GitHub Check: unit-tests

🔇 Additional comments (2)

nemo_skills/dataset/musan/__init__.py (1)

43-57: LGTM!

The configuration constants are well-structured and appropriately define the dataset metadata for benchmarking integration. The module docstring provides clear documentation of download options with useful metrics.

nemo_skills/dataset/musan/prepare.py (1)

272-388: LGTM with observation.

The function properly routes processing based on source type. There's some code duplication with process_category_from_files (directory creation, manifest writing, error counting), which could be refactored into shared helpers, but the current structure is readable and acceptable.

coderabbitai · 2025-12-22T10:18:03Z

nemo_skills/dataset/musan/prepare.py

+def download_from_openslr(output_dir: Path) -> Path:
+    """Download MUSAN dataset from OpenSLR (11 GB)."""
+    url = "https://www.openslr.org/resources/17/musan.tar.gz"
+    download_path = output_dir / "musan.tar.gz"
+    extract_path = output_dir / "musan_openslr"
+
+    print(f"Downloading from OpenSLR (~11 GB)")
+    print(f"URL: {url}")
+
+    if not download_path.exists():
+        def reporthook(block_num, block_size, total_size):
+            downloaded = block_num * block_size
+            percent = min(downloaded / total_size * 100, 100)
+            mb_downloaded = downloaded / (1024 * 1024)
+            mb_total = total_size / (1024 * 1024)
+            print(f"\r{percent:.1f}% ({mb_downloaded:.1f}/{mb_total:.1f} MB)", end='')
+
+        urllib.request.urlretrieve(url, download_path, reporthook)
+        print("\nDownload complete")
+    else:
+        print(f"Using cached archive: {download_path}")
+
+    if not extract_path.exists():
+        print(f"Extracting to {extract_path}...")
+        extract_path.mkdir(parents=True, exist_ok=True)
+        with tarfile.open(download_path, 'r:gz') as tar:
+            tar.extractall(extract_path)
+        print("Extraction complete")
+    else:
+        print(f"Using extracted data: {extract_path}")
+
+    return extract_path / "musan"


⚠️ Potential issue | 🟡 Minor

Security and reliability concerns with archive extraction and network request.

Line 78: Unnecessary f prefix on string without placeholders.

Line 89: urlretrieve has no timeout, which could hang indefinitely on network issues.

Line 98: tarfile.extractall() is vulnerable to path traversal attacks. While the OpenSLR source is trusted, using a safer extraction pattern is recommended.

Proposed improvements

- print(f"Downloading from OpenSLR (~11 GB)") + print("Downloading from OpenSLR (~11 GB)")

For safer tarfile extraction (Python 3.12+ has filter parameter, for older versions):

def safe_extract(tar, path): for member in tar.getmembers(): member_path = os.path.join(path, member.name) if not os.path.abspath(member_path).startswith(os.path.abspath(path)): raise ValueError(f"Attempted path traversal: {member.name}") tar.extractall(path)

🧰 Tools

🪛 Ruff (0.14.8)

78-78: f-string without any placeholders

Remove extraneous f prefix

(F541)

89-89: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

98-98: Uses of tarfile.extractall()

(S202)

🤖 Prompt for AI Agents

In nemo_skills/dataset/musan/prepare.py around lines 72-103, address three issues: remove the unnecessary f-string on line 78 (use a plain string), replace the blocking urllib.request.urlretrieve call on line 89 with a network request that supports a timeout (e.g., use urllib.request.urlopen with a timeout and stream the response to download_path) to avoid hanging indefinitely, and replace tar.extractall() on line 98 with a safe extraction routine that iterates tar.getmembers(), computes each member's destination path, verifies that os.path.abspath(dest).startswith(os.path.abspath(extract_path)) to prevent path traversal, and only then extracts that member (creating parent directories as needed).

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

karpnv

need to add to tests/gpu-tests/test_eval.py and tests/test_datasets.py

karpnv

add to nemo_skills/pipeline/prepare_data.py

melllinia · 2025-12-30T15:16:45Z

can you please add this benchmark to gpu test exclusion list, so that gpu tests will not run so long?

greptile-apps

Greptile Overview

Greptile Summary

Adds MUSAN dataset module supporting three download sources (HuggingFace, Kaggle, OpenSLR) with audio processing and manifest generation for hallucination detection benchmarks. Fixes hallucination detection metric calculation in audio evaluator from chars/second to chars/minute with corresponding threshold adjustment from 25 to 1500. Updates audio metrics aggregation to compute char_rate across all hallucination samples using cumulative totals instead of per-sample averages.

Confidence Score: 2/5

Contains security vulnerability in tarball extraction that could enable path traversal attacks
The tarball extraction in prepare.py line 98 uses unsafe extractall() without filter parameter, allowing malicious archives to write files outside the intended directory. This is a known security issue (CVE-2007-4559) that the codebase already addresses correctly in librispeech-pc/prepare.py with version-specific filtering. The metric calculation fixes appear sound with proper unit conversions.
nemo_skills/dataset/musan/prepare.py requires fixing tarfile extraction safety

Important Files Changed

File Analysis

Filename	Score	Overview
nemo_skills/dataset/musan/prepare.py	2/5	Implements MUSAN dataset preparation with multi-source support; missing tarfile extraction safety check
nemo_skills/evaluation/evaluator/audio.py	4/5	Corrects hallucination detection metric from chars/second to chars/minute with proper threshold adjustment
nemo_skills/evaluation/metrics/audio_metrics.py	4/5	Updates char_rate calculation to aggregate across all Hallucination samples using total chars/minutes

Sequence Diagram

sequenceDiagram
    participant User
    participant prepare.py
    participant Source as Data Source
    participant FS as File System
    participant Evaluator as audio.py
    participant Metrics as audio_metrics.py

    User->>prepare.py: Run with --source kaggle/openslr/huggingface
    
    alt Kaggle/OpenSLR
        prepare.py->>Source: Download dataset (10-11 GB)
        Source-->>prepare.py: Return dataset path
        prepare.py->>FS: Extract tarball (unsafe)
    else HuggingFace
        prepare.py->>Source: Load via datasets library
        Source-->>prepare.py: Return dataset object
    end
    
    prepare.py->>prepare.py: Process categories (music/speech/noise)
    
    loop For each audio file
        prepare.py->>FS: Read audio file
        prepare.py->>FS: Save as standardized WAV
        prepare.py->>FS: Write manifest entry (JSONL)
    end
    
    Note over User,Metrics: Evaluation Phase
    
    User->>Evaluator: Evaluate MUSAN samples
    Evaluator->>Evaluator: evaluate_hallucination()<br/>(char_rate in chars/minute)
    Evaluator-->>Metrics: Return metrics per sample
    Metrics->>Metrics: Accumulate total_hallucinated_chars<br/>and total_audio_seconds
    Metrics->>Metrics: Compute final char_rate<br/>(chars/minute)
    Metrics-->>User: Return aggregated metrics

nemo_skills/dataset/musan/prepare.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>

greptile-apps

Greptile Overview

Greptile Summary

This PR adds the MUSAN dataset (Music, Speech, and Noise Corpus) for hallucination detection in audio models, and standardizes character rate metrics to use chars/minute instead of chars/second across the audio evaluation framework.

Major Changes

New MUSAN dataset preparation script supporting HuggingFace, Kaggle, and OpenSLR sources
Dataset configured for hallucination detection (expects empty transcription for non-speech audio)
Character rate calculations converted from chars/second to chars/minute in evaluator and metrics
Metric aggregation updated to accumulate total characters and duration for overall char_rate

Critical Issues Found

Missing sys import: Line 97 uses sys.version_info but sys is never imported, will cause NameError
Undefined tar variable: Lines 98-100 attempt to use tar.extractall() but the tarfile is never opened with tarfile.open(), will cause NameError
Incomplete category mapping: CATEGORY_LABELS only includes 'noise' and 'music', missing 'speech' category - HuggingFace source will silently fail for speech data
Dead code: LABEL_TO_CATEGORY constant defined but never used
Unused parameter: split parameter accepted but never used in processing functions

Confidence Score: 1/5

This PR has critical runtime errors that will cause immediate crashes when using the OpenSLR source
Score reflects two critical syntax/logic bugs (missing import and undefined variable) that will cause NameError exceptions at runtime, plus a logic bug with missing category mapping that breaks HuggingFace source for speech data. The audio evaluation changes are correct, but the dataset preparation script cannot function without fixes.
Pay close attention to nemo_skills/dataset/musan/prepare.py - requires immediate fixes for missing import, undefined variable, and incomplete category mapping before it can be used

Important Files Changed

File Analysis

Filename	Score	Overview
nemo_skills/dataset/musan/prepare.py	0/5	New MUSAN dataset preparation script with critical bugs: missing `sys` import (line 97), undefined `tar` variable (lines 98-100), incomplete `CATEGORY_LABELS` dict missing 'speech' category, and unused dead code
nemo_skills/dataset/musan/init.py	5/5	New MUSAN dataset configuration file with proper metadata and benchmark setup, following established patterns from other audio datasets
nemo_skills/evaluation/evaluator/audio.py	5/5	Converts character rate calculations from chars/second to chars/minute for better readability, mathematically correct changes with consistent threshold updates
nemo_skills/evaluation/metrics/audio_metrics.py	5/5	Updates metric aggregation for char_rate to use chars/minute instead of chars/second, adds `as_float` import for proper formatting

Sequence Diagram

sequenceDiagram
    participant User
    participant prepare.py
    participant Source as Data Source
    participant Evaluator as audio.py
    participant Metrics as audio_metrics.py
    
    User->>prepare.py: Run with --source & --categories
    
    alt HuggingFace
        prepare.py->>Source: load_dataset()
        Source-->>prepare.py: Dataset samples
        prepare.py->>prepare.py: Filter by CATEGORY_LABELS
    else Kaggle/OpenSLR
        prepare.py->>Source: Download & extract files
        Source-->>prepare.py: WAV files
        prepare.py->>prepare.py: Read WAV files
    end
    
    prepare.py->>prepare.py: create_manifest_entry()<br/>(task_type: Hallucination)
    prepare.py->>prepare.py: Save to test.jsonl
    
    Note over User,Metrics: Evaluation Phase
    
    User->>Evaluator: Run evaluation
    Evaluator->>Evaluator: evaluate_hallucination()<br/>(check char_rate in chars/min)
    Evaluator->>Metrics: Pass results
    Metrics->>Metrics: Aggregate char_rate<br/>(total_chars / total_minutes)
    Metrics-->>User: Final metrics

nemo_skills/dataset/musan/prepare.py

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

…lls into musan-hallucination-eval

greptile-apps

Greptile Overview

Greptile Summary

This PR adds the MUSAN dataset for audio hallucination detection and updates char_rate metrics to use chars/minute instead of chars/second.

Major Changes:

New MUSAN dataset module supporting 3 sources (HuggingFace, Kaggle, OpenSLR)
Hallucination detection via speaking rate analysis (expects empty transcription for non-speech audio)
Metrics refactoring: char_rate now aggregated across samples in chars/minute

Critical Issues Found:

Missing import sys will cause NameError when using OpenSLR source
Undefined tar variable will cause NameError during extraction
Missing speech label in CATEGORY_LABELS will break HuggingFace source for speech category

Style Note:
The LABEL_TO_CATEGORY dictionary on line 48-51 appears unused throughout the codebase.

Confidence Score: 1/5

NOT safe to merge - contains critical runtime errors that will cause crashes
Three critical bugs in prepare.py will cause immediate NameError exceptions: missing sys import (line 97), undefined tar variable (lines 98-100), and missing speech category label (line 303). These are blocking issues that make the OpenSLR source and HuggingFace source completely non-functional.
Critical attention needed for nemo_skills/dataset/musan/prepare.py - fix import statement, tarfile context manager, and category labels before merging

Important Files Changed

File Analysis

Filename	Score	Overview
nemo_skills/dataset/musan/prepare.py	1/5	Critical bugs: missing `sys` import causes NameError, undefined `tar` variable causes NameError, missing `speech` category label will cause HuggingFace source to fail for speech category
nemo_skills/evaluation/evaluator/audio.py	5/5	Unit conversion from chars/second to chars/minute is correct and consistent throughout
nemo_skills/evaluation/metrics/audio_metrics.py	5/5	Refactored char_rate calculation to aggregate across all samples, correctly imports `as_float`

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareData as prepare_data.py
    participant Prepare as musan/prepare.py
    participant Source as Data Source<br/>(HF/Kaggle/OpenSLR)
    participant Evaluator as audio.py
    participant Metrics as audio_metrics.py

    User->>PrepareData: ns prepare_data musan
    PrepareData->>Prepare: Execute prepare.py
    Prepare->>Source: Download dataset
    Source-->>Prepare: Audio files
    Prepare->>Prepare: Process categories<br/>(music/speech/noise)
    Prepare->>Prepare: Create manifest entries<br/>(task_type=Hallucination)
    Prepare-->>PrepareData: test.jsonl files

    User->>Evaluator: ns eval musan
    Evaluator->>Evaluator: Read test.jsonl
    Evaluator->>Evaluator: Generate predictions
    Evaluator->>Evaluator: evaluate_hallucination()<br/>(check char_rate)
    Evaluator-->>Metrics: Predictions with metrics
    Metrics->>Metrics: Aggregate char_rate<br/>across all samples
    Metrics-->>User: Final metrics report

nemo_skills/dataset/musan/prepare.py

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

greptile-apps

Greptile Overview

Greptile Summary

Added MUSAN (Music, Speech, and Noise Corpus) dataset support with multi-source download capabilities and improved audio hallucination detection metrics by standardizing character rate calculations to chars/minute.

Major Changes:

Implemented MUSAN dataset preparation with support for HuggingFace, Kaggle, and OpenSLR data sources
Created manifest generation for hallucination detection tasks (expects empty transcription for non-speech audio)
Refactored character rate calculation from chars/second to chars/minute across evaluator and metrics (multiplied by 60)
Updated hallucination threshold from 25 chars/sec to 1500 chars/min (equivalent but more readable)
Changed metrics aggregation from averaging individual char_rate scores to computing total chars / total minutes for hallucination tasks
Added MUSAN to dataset pipeline and test configurations

Issues Found:

CATEGORY_LABELS in prepare.py is missing "speech" label mapping - HuggingFace path will fail silently for speech category

Confidence Score: 3/5

This PR is mostly safe to merge with one critical logic issue that needs fixing
The PR adds substantial new functionality with proper error handling and multiple data source support. However, there's a logic bug where the CATEGORY_LABELS dictionary is missing the "speech" label, which will cause the HuggingFace processing path to silently return 0 samples when users request the speech category. The refactored metrics calculations appear correct and well-documented. Test coverage is appropriate with the dataset excluded from GPU tests.
Pay close attention to nemo_skills/dataset/musan/prepare.py - fix the missing "speech" label mapping before merging

Important Files Changed

File Analysis

Filename	Score	Overview
nemo_skills/dataset/musan/prepare.py	3/5	Implemented MUSAN dataset preparation with support for HuggingFace/Kaggle/OpenSLR sources - missing speech label mapping for HuggingFace path
nemo_skills/evaluation/evaluator/audio.py	5/5	Fixed character rate calculation from chars/second to chars/minute (multiplied by 60) for better readability and consistency
nemo_skills/evaluation/metrics/audio_metrics.py	5/5	Refactored char_rate metric to aggregate total chars and duration for hallucination tasks, changed from individual scores to total-based calculation

Sequence Diagram

sequenceDiagram
    participant User
    participant prepare.py
    participant DataSource as Data Source<br/>(HF/Kaggle/OpenSLR)
    participant FileSystem
    participant audio.py as Audio Evaluator
    participant audio_metrics.py as Audio Metrics

    Note over User,audio_metrics.py: Dataset Preparation Phase
    User->>prepare.py: Run with --source, --categories
    prepare.py->>DataSource: load_dataset_from_source()
    alt HuggingFace
        DataSource-->>prepare.py: HF dataset object
    else Kaggle/OpenSLR
        DataSource-->>prepare.py: Path to WAV files
    end
    
    prepare.py->>prepare.py: process_category()
    loop For each category (noise/music/speech)
        alt HuggingFace source
            prepare.py->>prepare.py: Filter by CATEGORY_LABELS
        else Kaggle/OpenSLR
            prepare.py->>prepare.py: process_category_from_files()
            prepare.py->>FileSystem: Read WAV files
        end
        prepare.py->>prepare.py: create_manifest_entry()
        Note right of prepare.py: task_type: "Hallucination"<br/>expected_answer: ""
        prepare.py->>FileSystem: Save audio + test.jsonl
    end

    Note over User,audio_metrics.py: Evaluation Phase
    User->>audio.py: evaluate_sample()
    audio.py->>audio.py: evaluate_hallucination()
    Note right of audio.py: char_rate = (chars/duration) * 60<br/>threshold: 1500 chars/min
    audio.py-->>User: Sample metrics
    
    User->>audio_metrics.py: aggregate_predictions()
    loop For each prediction
        alt task_type == "Hallucination"
            audio_metrics.py->>audio_metrics.py: Accumulate total_hallucinated_chars<br/>and total_audio_seconds
        end
    end
    audio_metrics.py->>audio_metrics.py: get_metrics()
    Note right of audio_metrics.py: char_rate = total_chars / total_minutes
    audio_metrics.py-->>User: Aggregated metrics

nemo_skills/dataset/musan/prepare.py

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

greptile-apps · 2026-01-12T11:59:57Z

Greptile Overview

Greptile Summary

This PR adds the MUSAN (Music, Speech, and Noise Corpus) dataset to nemo-skills for audio hallucination detection benchmarking. The implementation includes:

Key Components

Dataset Preparation (musan/prepare.py):

Supports 3 data sources: HuggingFace (incomplete), Kaggle (recommended), and OpenSLR (official)
Processes audio files from music, speech, and noise categories
Creates manifest entries with task_type="Hallucination" and empty expected transcriptions
Saves audio files in standardized format with metadata

Audio Evaluation Improvements:

Updated char_rate metric from chars/second to chars/minute for better readability (threshold: 1500 chars/min)
Refactored char_rate aggregation from per-sample average to global aggregate calculation (total_chars/total_minutes)
Maintains hallucination detection via speaking rate anomaly (>1500 chars/min indicates likely hallucination)

Infrastructure Updates:

Added "musan" to DATASETS_REQUIRE_DATA_DIR list
Added musan to test exclusions (requires large data files)
Registered dataset with speechlm group and audio metrics

Critical Issues Identified

Missing speech category mapping: CATEGORY_LABELS dictionary only defines mappings for "noise" (0) and "music" (1), but "speech" is an allowed category. This will cause HuggingFace source to fail when processing speech samples.
Inconsistent label mapping: LABEL_TO_CATEGORY maps label 1 to "other" instead of "music", creating a mismatch with CATEGORY_LABELS.
Broken manifest entries with save_audio=False: When audio saving is disabled, manifest entries are still created with paths to non-existent audio files, which will cause runtime failures during evaluation.
Unclear speech category intent: All categories (including "speech") have empty expected transcriptions. It's unclear whether speech samples should be transcribed or treated as hallucination test cases like noise/music.
Unused split parameter: The split parameter is accepted but never used in actual processing - all output files are named "test.jsonl" regardless of split value.

Confidence Score: 2/5

This PR contains critical bugs that will cause runtime failures
Score reflects multiple critical logic errors in the core dataset preparation code: (1) Missing category mapping will cause immediate failure when processing speech with HuggingFace source, (2) Inconsistent label mappings will produce incorrect category labels, (3) save_audio=False option creates broken manifest entries that reference non-existent files. The audio evaluation improvements are solid, but the dataset preparation has fundamental issues that must be fixed before merging.
Critical attention needed for nemo_skills/dataset/musan/prepare.py - contains category mapping bugs, broken save_audio=False logic, and unclear intent for speech category handling

Important Files Changed

File Analysis

Filename	Score	Overview
nemo_skills/dataset/musan/prepare.py	2/5	New MUSAN dataset preparation script with critical bugs in category mappings and unused parameters
nemo_skills/evaluation/evaluator/audio.py	5/5	Improved char_rate calculation from chars/second to chars/minute for better readability
nemo_skills/evaluation/metrics/audio_metrics.py	5/5	Refactored char_rate metric to use global aggregation instead of per-sample average

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant DataSource as Data Source<br/>(HF/Kaggle/OpenSLR)
    participant FileSystem as File System
    participant Evaluator as audio.py
    participant Metrics as audio_metrics.py

    User->>PrepareScript: Run dataset preparation
    PrepareScript->>DataSource: Load MUSAN dataset
    DataSource-->>PrepareScript: Return audio files

    loop For each category (music/speech/noise)
        PrepareScript->>PrepareScript: Filter samples by category
        loop For each audio sample
            PrepareScript->>PrepareScript: Read audio array & compute duration
            alt save_audio=True
                PrepareScript->>FileSystem: Save WAV file
            end
            PrepareScript->>PrepareScript: Create manifest entry with<br/>task_type="Hallucination"<br/>expected_answer=""
        end
        PrepareScript->>FileSystem: Write test.jsonl manifest
    end

    User->>Evaluator: Run evaluation
    loop For each sample in manifest
        Evaluator->>Evaluator: Load audio & generate transcription
        Evaluator->>Evaluator: evaluate_hallucination()<br/>(check char_rate > 1500 chars/min)
        Evaluator->>Metrics: Update metrics with results
    end

    Metrics->>Metrics: Aggregate char_rate globally<br/>(total_chars / total_minutes)
    Metrics-->>User: Return evaluation results

greptile-apps

_{3 files reviewed, 3 comments}

_{Edit Code Review Agent Settings | Greptile}

nemo_skills/dataset/musan/prepare.py

nemo_skills/evaluation/metrics/audio_metrics.py

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

greptile-apps

_{5 files reviewed, 5 comments}

_{Edit Code Review Agent Settings | Greptile}

nemo_skills/dataset/musan/prepare.py

karpnv

LGTM

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: dgitman <dgitman@nvidia.com>

add musan dataset

3734ed3

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

coderabbitai bot reviewed Dec 22, 2025

View reviewed changes

Jorjeous added 3 commits December 22, 2025 03:28

Pre commit and converting from chars/sec to chars/min

7b35079

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

pre commit, upd to mins

f8d71a6

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

using standart float

15245ac

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Jorjeous self-assigned this Dec 22, 2025

Jorjeous requested review from gwarmstrong and karpnv December 22, 2025 15:26

karpnv requested changes Dec 26, 2025

View reviewed changes

Merge branch 'main' into musan-hallucination-eval

ad2d899

greptile-apps bot reviewed Jan 8, 2026

View reviewed changes

nemo_skills/dataset/musan/prepare.py Outdated Show resolved Hide resolved

Update nemo_skills/dataset/musan/prepare.py

ed9d5e3

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>

greptile-apps bot reviewed Jan 8, 2026

View reviewed changes

Jorjeous removed their assignment Jan 8, 2026

Jorjeous added 2 commits January 8, 2026 09:42

add to test and exclude

73b8074

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Merge branch 'musan-hallucination-eval' of github.com:NVIDIA-NeMo/Ski…

1938132

…lls into musan-hallucination-eval

greptile-apps bot reviewed Jan 8, 2026

View reviewed changes

nemo_skills/dataset/musan/prepare.py Show resolved Hide resolved

nemo_skills/dataset/musan/prepare.py Outdated Show resolved Hide resolved

nemo_skills/dataset/musan/prepare.py Show resolved Hide resolved

pre commit and revert git proposal

e51990f

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

greptile-apps bot reviewed Jan 8, 2026

View reviewed changes

nemo_skills/dataset/musan/prepare.py Show resolved Hide resolved

Jorjeous added the run GPU tests label Jan 8, 2026

Jorjeous requested a review from karpnv January 10, 2026 09:17

karpnv requested a review from melllinia January 12, 2026 05:48

Merge remote-tracking branch 'origin/main' into musan-hallucination-eval

b858b26

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

greptile-apps bot reviewed Jan 12, 2026

View reviewed changes

nemo_skills/dataset/musan/prepare.py Show resolved Hide resolved

nemo_skills/dataset/musan/prepare.py Outdated Show resolved Hide resolved

nemo_skills/evaluation/metrics/audio_metrics.py Show resolved Hide resolved

update hardcoded to NEMO_SKILLS_AUDIO_ROOT

c1b6cd8

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

greptile-apps bot reviewed Jan 12, 2026

View reviewed changes

karpnv approved these changes Jan 12, 2026

View reviewed changes

karpnv merged commit 079b106 into main Jan 12, 2026
5 checks passed

karpnv deleted the musan-hallucination-eval branch January 12, 2026 16:51

coderabbitai bot mentioned this pull request Feb 19, 2026

Add aime26 #1256

Merged

Conversation

Jorjeous commented Dec 22, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 22, 2025

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

karpnv left a comment

Choose a reason for hiding this comment

Uh oh!

karpnv left a comment

Choose a reason for hiding this comment

Uh oh!

melllinia commented Dec 30, 2025

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Major Changes

Critical Issues Found

Confidence Score: 1/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Confidence Score: 1/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

greptile-apps bot commented Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Overview

Greptile Summary

Key Components

Critical Issues Identified

Confidence Score: 2/5

Important Files Changed

Sequence Diagram

Jorjeous commented Dec 22, 2025 •

edited by coderabbitai bot

Loading

greptile-apps bot commented Jan 12, 2026 •

edited

Loading