Skip to content

add musan dataset#1139

Merged
karpnv merged 11 commits intomainfrom
musan-hallucination-eval
Jan 12, 2026
Merged

add musan dataset#1139
karpnv merged 11 commits intomainfrom
musan-hallucination-eval

Conversation

@Jorjeous
Copy link
Copy Markdown
Member

@Jorjeous Jorjeous commented Dec 22, 2025

Summary by CodeRabbit

  • New Features
    • Added MUSAN dataset module with audio benchmarking configuration.
    • Implemented dataset preparation tool supporting multiple sources (HuggingFace, Kaggle, OpenSLR).
    • Added audio processing and manifest generation for dataset standardization and evaluation workflows.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Dec 22, 2025

📝 Walkthrough

Walkthrough

Introduces MUSAN dataset integration with configuration metadata and preparation utilities. Adds constants for dataset group, evaluation settings, and benchmarking configuration. Provides preparation module supporting multiple data sources (HuggingFace, Kaggle, OpenSLR) with audio processing, manifest generation, and command-line interface.

Changes

Cohort / File(s) Summary
Dataset Configuration
nemo_skills/dataset/musan/__init__.py
Defines dataset metadata and benchmarking constants: DATASET_GROUP, IS_BENCHMARK_GROUP, SCORE_MODULE, METRICS_TYPE, EVAL_ARGS, GENERATION_ARGS, and BENCHMARKS dictionary.
Dataset Preparation
nemo_skills/dataset/musan/prepare.py
Implements dataset preparation with support for HuggingFace, Kaggle, and OpenSLR sources. Includes functions for downloading (Kaggle/OpenSLR), audio processing (duration calculation, WAV file handling), manifest generation with structured payloads, per-category processing, and CLI argument parsing.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / main()
    participant Load as load_dataset_from_source()
    participant HF as HuggingFace Dataset
    participant KG as Kaggle/OpenSLR
    participant Process as process_category()
    participant Audio as Audio Processing
    participant Manifest as Manifest Writer

    CLI->>Load: load(source, output_dir)
    alt Source == "huggingface"
        Load->>HF: datasets.load_dataset()
        HF-->>Load: dataset object
    else Source == "kaggle" or "openslr"
        Load->>KG: download_from_kaggle/openslr()
        KG-->>Load: local dataset path
    end
    
    CLI->>Process: for each category
    Process->>Audio: get_audio_duration()
    Audio-->>Process: duration (float)
    
    alt save_audio == True
        Process->>Audio: save_audio_file()
        Audio->>Manifest: write WAV files
    end
    
    Process->>Process: create_manifest_entry()
    Process-->>Manifest: per-sample entry
    
    Manifest->>Manifest: write test.jsonl (per-category)
    Manifest-->>CLI: consolidated manifest
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Multiple data source handling paths: Requires understanding branching logic for HuggingFace, Kaggle, and OpenSLR sources with different download/load mechanisms
  • Audio processing logic: Includes WAV file reading, duration calculation, sample rate handling, and conditional file writing
  • Manifest generation complexity: Structured payload creation with role-based messages and Hallucination task type specification
  • Error handling coverage: Per-sample error handling with counters and logging across multiple processing stages
  • File I/O operations: Path validation, directory creation, and batch manifest writing

Suggested reviewers

  • gwarmstrong

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'add musan dataset' directly and clearly summarizes the main change—adding a new MUSAN dataset module with configuration, preparation utilities, and manifest generation capabilities.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch musan-hallucination-eval

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (7)
nemo_skills/dataset/musan/prepare.py (7)

43-53: Inconsistent label mappings.

CATEGORY_LABELS maps "music" to 1, but LABEL_TO_CATEGORY maps 1 back to "other" instead of "music". This asymmetry appears intentional but could cause confusion. Additionally, LABEL_TO_CATEGORY is defined but never used in the code.

Consider removing unused constant or aligning mappings

If LABEL_TO_CATEGORY is not needed, remove it. Otherwise, align the mappings:

 LABEL_TO_CATEGORY = {
-    1: "other",
+    1: "music",
 }

55-69: Unused output_dir parameter and missing exception chaining.

The output_dir parameter is never used since kagglehub manages download locations internally. Also, exceptions should be chained with from e for better stack traces.

Proposed fix
-def download_from_kaggle(output_dir: Path) -> Path:
+def download_from_kaggle() -> Path:
     """Download MUSAN dataset from Kaggle using kagglehub."""
     try:
         import kagglehub
     except ImportError:
-        raise ImportError("kagglehub not installed. Run: pip install kagglehub")
+        raise ImportError("kagglehub not installed. Run: pip install kagglehub") from None
     
     print("Downloading from Kaggle (requires API key in ~/.kaggle/kaggle.json)")
     
     try:
         path = kagglehub.dataset_download("dogrose/musan-dataset")
         print(f"Downloaded to: {path}")
         return Path(path)
     except Exception as e:
-        raise Exception(f"Kaggle download failed: {e}")
+        raise RuntimeError(f"Kaggle download failed: {e}") from e

Note: If you remove the parameter, update the call site at line 116 accordingly.


106-142: LGTM with minor suggestion.

The function handles all three sources well with appropriate logging. Consider adding a return type annotation for better documentation.

-def load_dataset_from_source(source: str, output_dir: Path):
+def load_dataset_from_source(source: str, output_dir: Path) -> tuple[Path | Any, str]:

158-194: Hardcoded path prefix may limit portability.

The audio_rel_path uses a hardcoded prefix "/data/musan/" which assumes a specific deployment structure. Consider making this configurable or documenting this requirement.

Consider parameterizing the base path
 def create_manifest_entry(
     audio_filename: str,
     duration: float,
     category: str,
     sample_id: int,
     label: str,
+    base_path: str = "/data/musan",
 ) -> Dict:
     """Create nemo-skills manifest entry."""
-    audio_rel_path = f"/data/musan/{category}/audio/{audio_filename}"
+    audio_rel_path = f"{base_path}/{category}/audio/{audio_filename}"

197-204: Unused split parameter.

The split parameter is defined but never used in this function. Either remove it or implement split-based logic if intended.

Remove unused parameter
 def process_category_from_files(
     category: str,
     dataset_path: Path,
     output_dir: Path,
     save_audio: bool = True,
-    split: str = "train",
     max_samples: int = -1,
 ) -> tuple[int, List[Dict]]:

Update the call site accordingly at lines 287-294.


420-426: Minor: Unnecessary f-string prefix.

Line 421 has an f-string without placeholders.

     print("\n" + "=" * 60)
-    print(f"MUSAN Dataset Preparation")
+    print("MUSAN Dataset Preparation")
     print("=" * 60)

399-399: Unused --split argument.

The --split argument is parsed but the split parameter is never used in processing functions. Either implement split-based behavior (e.g., naming output files differently) or remove this argument to avoid user confusion.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7205c43 and 3734ed3.

📒 Files selected for processing (2)
  • nemo_skills/dataset/musan/__init__.py
  • nemo_skills/dataset/musan/prepare.py
🧰 Additional context used
🪛 Ruff (0.14.8)
nemo_skills/dataset/musan/prepare.py

55-55: Unused function argument: output_dir

(ARG001)


60-60: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


60-60: Avoid specifying long messages outside the exception class

(TRY003)


68-68: Do not catch blind exception: Exception

(BLE001)


69-69: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


69-69: Create your own exception

(TRY002)


69-69: Avoid specifying long messages outside the exception class

(TRY003)


78-78: f-string without any placeholders

Remove extraneous f prefix

(F541)


89-89: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


98-98: Uses of tarfile.extractall()

(S202)


119-119: Avoid specifying long messages outside the exception class

(TRY003)


142-142: Avoid specifying long messages outside the exception class

(TRY003)


202-202: Unused function argument: split

(ARG001)


208-208: Avoid specifying long messages outside the exception class

(TRY003)


239-239: Do not catch blind exception: Exception

(BLE001)


255-255: Do not catch blind exception: Exception

(BLE001)


357-357: Do not catch blind exception: Exception

(BLE001)


374-374: Do not catch blind exception: Exception

(BLE001)


421-421: f-string without any placeholders

Remove extraneous f prefix

(F541)


430-430: Do not catch blind exception: Exception

(BLE001)


454-454: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/musan/__init__.py (1)

43-57: LGTM!

The configuration constants are well-structured and appropriately define the dataset metadata for benchmarking integration. The module docstring provides clear documentation of download options with useful metrics.

nemo_skills/dataset/musan/prepare.py (1)

272-388: LGTM with observation.

The function properly routes processing based on source type. There's some code duplication with process_category_from_files (directory creation, manifest writing, error counting), which could be refactored into shared helpers, but the current structure is readable and acceptable.

Comment on lines +72 to +103
def download_from_openslr(output_dir: Path) -> Path:
"""Download MUSAN dataset from OpenSLR (11 GB)."""
url = "https://www.openslr.org/resources/17/musan.tar.gz"
download_path = output_dir / "musan.tar.gz"
extract_path = output_dir / "musan_openslr"

print(f"Downloading from OpenSLR (~11 GB)")
print(f"URL: {url}")

if not download_path.exists():
def reporthook(block_num, block_size, total_size):
downloaded = block_num * block_size
percent = min(downloaded / total_size * 100, 100)
mb_downloaded = downloaded / (1024 * 1024)
mb_total = total_size / (1024 * 1024)
print(f"\r{percent:.1f}% ({mb_downloaded:.1f}/{mb_total:.1f} MB)", end='')

urllib.request.urlretrieve(url, download_path, reporthook)
print("\nDownload complete")
else:
print(f"Using cached archive: {download_path}")

if not extract_path.exists():
print(f"Extracting to {extract_path}...")
extract_path.mkdir(parents=True, exist_ok=True)
with tarfile.open(download_path, 'r:gz') as tar:
tar.extractall(extract_path)
print("Extraction complete")
else:
print(f"Using extracted data: {extract_path}")

return extract_path / "musan"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Security and reliability concerns with archive extraction and network request.

  1. Line 78: Unnecessary f prefix on string without placeholders.
  2. Line 89: urlretrieve has no timeout, which could hang indefinitely on network issues.
  3. Line 98: tarfile.extractall() is vulnerable to path traversal attacks. While the OpenSLR source is trusted, using a safer extraction pattern is recommended.
Proposed improvements
-    print(f"Downloading from OpenSLR (~11 GB)")
+    print("Downloading from OpenSLR (~11 GB)")

For safer tarfile extraction (Python 3.12+ has filter parameter, for older versions):

def safe_extract(tar, path):
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not os.path.abspath(member_path).startswith(os.path.abspath(path)):
            raise ValueError(f"Attempted path traversal: {member.name}")
    tar.extractall(path)
🧰 Tools
🪛 Ruff (0.14.8)

78-78: f-string without any placeholders

Remove extraneous f prefix

(F541)


89-89: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


98-98: Uses of tarfile.extractall()

(S202)

🤖 Prompt for AI Agents
In nemo_skills/dataset/musan/prepare.py around lines 72-103, address three
issues: remove the unnecessary f-string on line 78 (use a plain string), replace
the blocking urllib.request.urlretrieve call on line 89 with a network request
that supports a timeout (e.g., use urllib.request.urlopen with a timeout and
stream the response to download_path) to avoid hanging indefinitely, and replace
tar.extractall() on line 98 with a safe extraction routine that iterates
tar.getmembers(), computes each member's destination path, verifies that
os.path.abspath(dest).startswith(os.path.abspath(extract_path)) to prevent path
traversal, and only then extracts that member (creating parent directories as
needed).

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous Jorjeous self-assigned this Dec 22, 2025
Copy link
Copy Markdown
Collaborator

@karpnv karpnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add to tests/gpu-tests/test_eval.py and tests/test_datasets.py

Copy link
Copy Markdown
Collaborator

@karpnv karpnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add to nemo_skills/pipeline/prepare_data.py

@melllinia
Copy link
Copy Markdown
Member

can you please add this benchmark to gpu test exclusion list, so that gpu tests will not run so long?

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Adds MUSAN dataset module supporting three download sources (HuggingFace, Kaggle, OpenSLR) with audio processing and manifest generation for hallucination detection benchmarks. Fixes hallucination detection metric calculation in audio evaluator from chars/second to chars/minute with corresponding threshold adjustment from 25 to 1500. Updates audio metrics aggregation to compute char_rate across all hallucination samples using cumulative totals instead of per-sample averages.

Confidence Score: 2/5

  • Contains security vulnerability in tarball extraction that could enable path traversal attacks
  • The tarball extraction in prepare.py line 98 uses unsafe extractall() without filter parameter, allowing malicious archives to write files outside the intended directory. This is a known security issue (CVE-2007-4559) that the codebase already addresses correctly in librispeech-pc/prepare.py with version-specific filtering. The metric calculation fixes appear sound with proper unit conversions.
  • nemo_skills/dataset/musan/prepare.py requires fixing tarfile extraction safety

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/musan/prepare.py 2/5 Implements MUSAN dataset preparation with multi-source support; missing tarfile extraction safety check
nemo_skills/evaluation/evaluator/audio.py 4/5 Corrects hallucination detection metric from chars/second to chars/minute with proper threshold adjustment
nemo_skills/evaluation/metrics/audio_metrics.py 4/5 Updates char_rate calculation to aggregate across all Hallucination samples using total chars/minutes

Sequence Diagram

sequenceDiagram
    participant User
    participant prepare.py
    participant Source as Data Source
    participant FS as File System
    participant Evaluator as audio.py
    participant Metrics as audio_metrics.py

    User->>prepare.py: Run with --source kaggle/openslr/huggingface
    
    alt Kaggle/OpenSLR
        prepare.py->>Source: Download dataset (10-11 GB)
        Source-->>prepare.py: Return dataset path
        prepare.py->>FS: Extract tarball (unsafe)
    else HuggingFace
        prepare.py->>Source: Load via datasets library
        Source-->>prepare.py: Return dataset object
    end
    
    prepare.py->>prepare.py: Process categories (music/speech/noise)
    
    loop For each audio file
        prepare.py->>FS: Read audio file
        prepare.py->>FS: Save as standardized WAV
        prepare.py->>FS: Write manifest entry (JSONL)
    end
    
    Note over User,Metrics: Evaluation Phase
    
    User->>Evaluator: Evaluate MUSAN samples
    Evaluator->>Evaluator: evaluate_hallucination()<br/>(char_rate in chars/minute)
    Evaluator-->>Metrics: Return metrics per sample
    Metrics->>Metrics: Accumulate total_hallucinated_chars<br/>and total_audio_seconds
    Metrics->>Metrics: Compute final char_rate<br/>(chars/minute)
    Metrics-->>User: Return aggregated metrics
Loading

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR adds the MUSAN dataset (Music, Speech, and Noise Corpus) for hallucination detection in audio models, and standardizes character rate metrics to use chars/minute instead of chars/second across the audio evaluation framework.

Major Changes

  • New MUSAN dataset preparation script supporting HuggingFace, Kaggle, and OpenSLR sources
  • Dataset configured for hallucination detection (expects empty transcription for non-speech audio)
  • Character rate calculations converted from chars/second to chars/minute in evaluator and metrics
  • Metric aggregation updated to accumulate total characters and duration for overall char_rate

Critical Issues Found

  • Missing sys import: Line 97 uses sys.version_info but sys is never imported, will cause NameError
  • Undefined tar variable: Lines 98-100 attempt to use tar.extractall() but the tarfile is never opened with tarfile.open(), will cause NameError
  • Incomplete category mapping: CATEGORY_LABELS only includes 'noise' and 'music', missing 'speech' category - HuggingFace source will silently fail for speech data
  • Dead code: LABEL_TO_CATEGORY constant defined but never used
  • Unused parameter: split parameter accepted but never used in processing functions

Confidence Score: 1/5

  • This PR has critical runtime errors that will cause immediate crashes when using the OpenSLR source
  • Score reflects two critical syntax/logic bugs (missing import and undefined variable) that will cause NameError exceptions at runtime, plus a logic bug with missing category mapping that breaks HuggingFace source for speech data. The audio evaluation changes are correct, but the dataset preparation script cannot function without fixes.
  • Pay close attention to nemo_skills/dataset/musan/prepare.py - requires immediate fixes for missing import, undefined variable, and incomplete category mapping before it can be used

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/musan/prepare.py 0/5 New MUSAN dataset preparation script with critical bugs: missing sys import (line 97), undefined tar variable (lines 98-100), incomplete CATEGORY_LABELS dict missing 'speech' category, and unused dead code
nemo_skills/dataset/musan/init.py 5/5 New MUSAN dataset configuration file with proper metadata and benchmark setup, following established patterns from other audio datasets
nemo_skills/evaluation/evaluator/audio.py 5/5 Converts character rate calculations from chars/second to chars/minute for better readability, mathematically correct changes with consistent threshold updates
nemo_skills/evaluation/metrics/audio_metrics.py 5/5 Updates metric aggregation for char_rate to use chars/minute instead of chars/second, adds as_float import for proper formatting

Sequence Diagram

sequenceDiagram
    participant User
    participant prepare.py
    participant Source as Data Source
    participant Evaluator as audio.py
    participant Metrics as audio_metrics.py
    
    User->>prepare.py: Run with --source & --categories
    
    alt HuggingFace
        prepare.py->>Source: load_dataset()
        Source-->>prepare.py: Dataset samples
        prepare.py->>prepare.py: Filter by CATEGORY_LABELS
    else Kaggle/OpenSLR
        prepare.py->>Source: Download & extract files
        Source-->>prepare.py: WAV files
        prepare.py->>prepare.py: Read WAV files
    end
    
    prepare.py->>prepare.py: create_manifest_entry()<br/>(task_type: Hallucination)
    prepare.py->>prepare.py: Save to test.jsonl
    
    Note over User,Metrics: Evaluation Phase
    
    User->>Evaluator: Run evaluation
    Evaluator->>Evaluator: evaluate_hallucination()<br/>(check char_rate in chars/min)
    Evaluator->>Metrics: Pass results
    Metrics->>Metrics: Aggregate char_rate<br/>(total_chars / total_minutes)
    Metrics-->>User: Final metrics
Loading

@Jorjeous Jorjeous removed their assignment Jan 8, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR adds the MUSAN dataset for audio hallucination detection and updates char_rate metrics to use chars/minute instead of chars/second.

Major Changes:

  • New MUSAN dataset module supporting 3 sources (HuggingFace, Kaggle, OpenSLR)
  • Hallucination detection via speaking rate analysis (expects empty transcription for non-speech audio)
  • Metrics refactoring: char_rate now aggregated across samples in chars/minute

Critical Issues Found:

  • Missing import sys will cause NameError when using OpenSLR source
  • Undefined tar variable will cause NameError during extraction
  • Missing speech label in CATEGORY_LABELS will break HuggingFace source for speech category

Style Note:
The LABEL_TO_CATEGORY dictionary on line 48-51 appears unused throughout the codebase.

Confidence Score: 1/5

  • NOT safe to merge - contains critical runtime errors that will cause crashes
  • Three critical bugs in prepare.py will cause immediate NameError exceptions: missing sys import (line 97), undefined tar variable (lines 98-100), and missing speech category label (line 303). These are blocking issues that make the OpenSLR source and HuggingFace source completely non-functional.
  • Critical attention needed for nemo_skills/dataset/musan/prepare.py - fix import statement, tarfile context manager, and category labels before merging

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/musan/prepare.py 1/5 Critical bugs: missing sys import causes NameError, undefined tar variable causes NameError, missing speech category label will cause HuggingFace source to fail for speech category
nemo_skills/evaluation/evaluator/audio.py 5/5 Unit conversion from chars/second to chars/minute is correct and consistent throughout
nemo_skills/evaluation/metrics/audio_metrics.py 5/5 Refactored char_rate calculation to aggregate across all samples, correctly imports as_float

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareData as prepare_data.py
    participant Prepare as musan/prepare.py
    participant Source as Data Source<br/>(HF/Kaggle/OpenSLR)
    participant Evaluator as audio.py
    participant Metrics as audio_metrics.py

    User->>PrepareData: ns prepare_data musan
    PrepareData->>Prepare: Execute prepare.py
    Prepare->>Source: Download dataset
    Source-->>Prepare: Audio files
    Prepare->>Prepare: Process categories<br/>(music/speech/noise)
    Prepare->>Prepare: Create manifest entries<br/>(task_type=Hallucination)
    Prepare-->>PrepareData: test.jsonl files

    User->>Evaluator: ns eval musan
    Evaluator->>Evaluator: Read test.jsonl
    Evaluator->>Evaluator: Generate predictions
    Evaluator->>Evaluator: evaluate_hallucination()<br/>(check char_rate)
    Evaluator-->>Metrics: Predictions with metrics
    Metrics->>Metrics: Aggregate char_rate<br/>across all samples
    Metrics-->>User: Final metrics report
Loading

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Added MUSAN (Music, Speech, and Noise Corpus) dataset support with multi-source download capabilities and improved audio hallucination detection metrics by standardizing character rate calculations to chars/minute.

Major Changes:

  • Implemented MUSAN dataset preparation with support for HuggingFace, Kaggle, and OpenSLR data sources
  • Created manifest generation for hallucination detection tasks (expects empty transcription for non-speech audio)
  • Refactored character rate calculation from chars/second to chars/minute across evaluator and metrics (multiplied by 60)
  • Updated hallucination threshold from 25 chars/sec to 1500 chars/min (equivalent but more readable)
  • Changed metrics aggregation from averaging individual char_rate scores to computing total chars / total minutes for hallucination tasks
  • Added MUSAN to dataset pipeline and test configurations

Issues Found:

  • CATEGORY_LABELS in prepare.py is missing "speech" label mapping - HuggingFace path will fail silently for speech category

Confidence Score: 3/5

  • This PR is mostly safe to merge with one critical logic issue that needs fixing
  • The PR adds substantial new functionality with proper error handling and multiple data source support. However, there's a logic bug where the CATEGORY_LABELS dictionary is missing the "speech" label, which will cause the HuggingFace processing path to silently return 0 samples when users request the speech category. The refactored metrics calculations appear correct and well-documented. Test coverage is appropriate with the dataset excluded from GPU tests.
  • Pay close attention to nemo_skills/dataset/musan/prepare.py - fix the missing "speech" label mapping before merging

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/musan/prepare.py 3/5 Implemented MUSAN dataset preparation with support for HuggingFace/Kaggle/OpenSLR sources - missing speech label mapping for HuggingFace path
nemo_skills/evaluation/evaluator/audio.py 5/5 Fixed character rate calculation from chars/second to chars/minute (multiplied by 60) for better readability and consistency
nemo_skills/evaluation/metrics/audio_metrics.py 5/5 Refactored char_rate metric to aggregate total chars and duration for hallucination tasks, changed from individual scores to total-based calculation

Sequence Diagram

sequenceDiagram
    participant User
    participant prepare.py
    participant DataSource as Data Source<br/>(HF/Kaggle/OpenSLR)
    participant FileSystem
    participant audio.py as Audio Evaluator
    participant audio_metrics.py as Audio Metrics

    Note over User,audio_metrics.py: Dataset Preparation Phase
    User->>prepare.py: Run with --source, --categories
    prepare.py->>DataSource: load_dataset_from_source()
    alt HuggingFace
        DataSource-->>prepare.py: HF dataset object
    else Kaggle/OpenSLR
        DataSource-->>prepare.py: Path to WAV files
    end
    
    prepare.py->>prepare.py: process_category()
    loop For each category (noise/music/speech)
        alt HuggingFace source
            prepare.py->>prepare.py: Filter by CATEGORY_LABELS
        else Kaggle/OpenSLR
            prepare.py->>prepare.py: process_category_from_files()
            prepare.py->>FileSystem: Read WAV files
        end
        prepare.py->>prepare.py: create_manifest_entry()
        Note right of prepare.py: task_type: "Hallucination"<br/>expected_answer: ""
        prepare.py->>FileSystem: Save audio + test.jsonl
    end

    Note over User,audio_metrics.py: Evaluation Phase
    User->>audio.py: evaluate_sample()
    audio.py->>audio.py: evaluate_hallucination()
    Note right of audio.py: char_rate = (chars/duration) * 60<br/>threshold: 1500 chars/min
    audio.py-->>User: Sample metrics
    
    User->>audio_metrics.py: aggregate_predictions()
    loop For each prediction
        alt task_type == "Hallucination"
            audio_metrics.py->>audio_metrics.py: Accumulate total_hallucinated_chars<br/>and total_audio_seconds
        end
    end
    audio_metrics.py->>audio_metrics.py: get_metrics()
    Note right of audio_metrics.py: char_rate = total_chars / total_minutes
    audio_metrics.py-->>User: Aggregated metrics
Loading

@Jorjeous Jorjeous requested a review from karpnv January 10, 2026 09:17
@karpnv karpnv requested a review from melllinia January 12, 2026 05:48
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Jan 12, 2026

Greptile Overview

Greptile Summary

This PR adds the MUSAN (Music, Speech, and Noise Corpus) dataset to nemo-skills for audio hallucination detection benchmarking. The implementation includes:

Key Components

Dataset Preparation (musan/prepare.py):

  • Supports 3 data sources: HuggingFace (incomplete), Kaggle (recommended), and OpenSLR (official)
  • Processes audio files from music, speech, and noise categories
  • Creates manifest entries with task_type="Hallucination" and empty expected transcriptions
  • Saves audio files in standardized format with metadata

Audio Evaluation Improvements:

  • Updated char_rate metric from chars/second to chars/minute for better readability (threshold: 1500 chars/min)
  • Refactored char_rate aggregation from per-sample average to global aggregate calculation (total_chars/total_minutes)
  • Maintains hallucination detection via speaking rate anomaly (>1500 chars/min indicates likely hallucination)

Infrastructure Updates:

  • Added "musan" to DATASETS_REQUIRE_DATA_DIR list
  • Added musan to test exclusions (requires large data files)
  • Registered dataset with speechlm group and audio metrics

Critical Issues Identified

  1. Missing speech category mapping: CATEGORY_LABELS dictionary only defines mappings for "noise" (0) and "music" (1), but "speech" is an allowed category. This will cause HuggingFace source to fail when processing speech samples.

  2. Inconsistent label mapping: LABEL_TO_CATEGORY maps label 1 to "other" instead of "music", creating a mismatch with CATEGORY_LABELS.

  3. Broken manifest entries with save_audio=False: When audio saving is disabled, manifest entries are still created with paths to non-existent audio files, which will cause runtime failures during evaluation.

  4. Unclear speech category intent: All categories (including "speech") have empty expected transcriptions. It's unclear whether speech samples should be transcribed or treated as hallucination test cases like noise/music.

  5. Unused split parameter: The split parameter is accepted but never used in actual processing - all output files are named "test.jsonl" regardless of split value.

Confidence Score: 2/5

  • This PR contains critical bugs that will cause runtime failures
  • Score reflects multiple critical logic errors in the core dataset preparation code: (1) Missing category mapping will cause immediate failure when processing speech with HuggingFace source, (2) Inconsistent label mappings will produce incorrect category labels, (3) save_audio=False option creates broken manifest entries that reference non-existent files. The audio evaluation improvements are solid, but the dataset preparation has fundamental issues that must be fixed before merging.
  • Critical attention needed for nemo_skills/dataset/musan/prepare.py - contains category mapping bugs, broken save_audio=False logic, and unclear intent for speech category handling

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/musan/prepare.py 2/5 New MUSAN dataset preparation script with critical bugs in category mappings and unused parameters
nemo_skills/evaluation/evaluator/audio.py 5/5 Improved char_rate calculation from chars/second to chars/minute for better readability
nemo_skills/evaluation/metrics/audio_metrics.py 5/5 Refactored char_rate metric to use global aggregation instead of per-sample average

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant DataSource as Data Source<br/>(HF/Kaggle/OpenSLR)
    participant FileSystem as File System
    participant Evaluator as audio.py
    participant Metrics as audio_metrics.py

    User->>PrepareScript: Run dataset preparation
    PrepareScript->>DataSource: Load MUSAN dataset
    DataSource-->>PrepareScript: Return audio files

    loop For each category (music/speech/noise)
        PrepareScript->>PrepareScript: Filter samples by category
        loop For each audio sample
            PrepareScript->>PrepareScript: Read audio array & compute duration
            alt save_audio=True
                PrepareScript->>FileSystem: Save WAV file
            end
            PrepareScript->>PrepareScript: Create manifest entry with<br/>task_type="Hallucination"<br/>expected_answer=""
        end
        PrepareScript->>FileSystem: Write test.jsonl manifest
    end

    User->>Evaluator: Run evaluation
    loop For each sample in manifest
        Evaluator->>Evaluator: Load audio & generate transcription
        Evaluator->>Evaluator: evaluate_hallucination()<br/>(check char_rate > 1500 chars/min)
        Evaluator->>Metrics: Update metrics with results
    end

    Metrics->>Metrics: Aggregate char_rate globally<br/>(total_chars / total_minutes)
    Metrics-->>User: Return evaluation results
Loading

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

Copy link
Copy Markdown
Collaborator

@karpnv karpnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@karpnv karpnv merged commit 079b106 into main Jan 12, 2026
5 checks passed
@karpnv karpnv deleted the musan-hallucination-eval branch January 12, 2026 16:51
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
arnavkomaragiri pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
arnavkomaragiri pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Feb 19, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants