Conversation
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
📝 Walkthrough

Introduces MUSAN dataset integration with configuration metadata and preparation utilities. Adds constants for dataset group, evaluation settings, and benchmarking configuration. Provides a preparation module supporting multiple data sources (HuggingFace, Kaggle, OpenSLR) with audio processing, manifest generation, and a command-line interface.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI / main()
    participant Load as load_dataset_from_source()
    participant HF as HuggingFace Dataset
    participant KG as Kaggle/OpenSLR
    participant Process as process_category()
    participant Audio as Audio Processing
    participant Manifest as Manifest Writer
    CLI->>Load: load(source, output_dir)
    alt Source == "huggingface"
        Load->>HF: datasets.load_dataset()
        HF-->>Load: dataset object
    else Source == "kaggle" or "openslr"
        Load->>KG: download_from_kaggle/openslr()
        KG-->>Load: local dataset path
    end
    CLI->>Process: for each category
    Process->>Audio: get_audio_duration()
    Audio-->>Process: duration (float)
    alt save_audio == True
        Process->>Audio: save_audio_file()
        Audio->>Manifest: write WAV files
    end
    Process->>Process: create_manifest_entry()
    Process-->>Manifest: per-sample entry
    Manifest->>Manifest: write test.jsonl (per-category)
    Manifest-->>CLI: consolidated manifest
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (7)
nemo_skills/dataset/musan/prepare.py (7)
43-53: Inconsistent label mappings.
`CATEGORY_LABELS` maps `"music"` to `1`, but `LABEL_TO_CATEGORY` maps `1` back to `"other"` instead of `"music"`. This asymmetry appears intentional but could cause confusion. Additionally, `LABEL_TO_CATEGORY` is defined but never used in the code.

Consider removing the unused constant or aligning the mappings. If `LABEL_TO_CATEGORY` is not needed, remove it. Otherwise, align the mappings:

```diff
 LABEL_TO_CATEGORY = {
-    1: "other",
+    1: "music",
 }
```
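One low-maintenance way to keep the two mappings from drifting apart, if both are kept, is to derive the reverse map from the forward one. A sketch (the category set shown here is assumed for illustration, not taken from the actual file):

```python
# Hypothetical forward mapping; deriving the reverse map from it guarantees
# the two dictionaries can never disagree the way the review describes.
CATEGORY_LABELS = {"noise": 0, "music": 1, "speech": 2}
LABEL_TO_CATEGORY = {label: category for category, label in CATEGORY_LABELS.items()}
```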
55-69: Unused `output_dir` parameter and missing exception chaining.

The `output_dir` parameter is never used since `kagglehub` manages download locations internally. Also, exceptions should be chained with `from e` for better stack traces.

Proposed fix:

```diff
-def download_from_kaggle(output_dir: Path) -> Path:
+def download_from_kaggle() -> Path:
     """Download MUSAN dataset from Kaggle using kagglehub."""
     try:
         import kagglehub
     except ImportError:
-        raise ImportError("kagglehub not installed. Run: pip install kagglehub")
+        raise ImportError("kagglehub not installed. Run: pip install kagglehub") from None
     print("Downloading from Kaggle (requires API key in ~/.kaggle/kaggle.json)")
     try:
         path = kagglehub.dataset_download("dogrose/musan-dataset")
         print(f"Downloaded to: {path}")
         return Path(path)
     except Exception as e:
-        raise Exception(f"Kaggle download failed: {e}")
+        raise RuntimeError(f"Kaggle download failed: {e}") from e
```

Note: if you remove the parameter, update the call site at line 116 accordingly.
106-142: LGTM with minor suggestion.

The function handles all three sources well with appropriate logging. Consider adding a return type annotation for better documentation.

```diff
-def load_dataset_from_source(source: str, output_dir: Path):
+def load_dataset_from_source(source: str, output_dir: Path) -> tuple[Path | Any, str]:
```
158-194: Hardcoded path prefix may limit portability.

The `audio_rel_path` uses a hardcoded prefix `"/data/musan/"`, which assumes a specific deployment structure. Consider making this configurable or documenting this requirement.

Consider parameterizing the base path:

```diff
 def create_manifest_entry(
     audio_filename: str,
     duration: float,
     category: str,
     sample_id: int,
     label: str,
+    base_path: str = "/data/musan",
 ) -> Dict:
     """Create nemo-skills manifest entry."""
-    audio_rel_path = f"/data/musan/{category}/audio/{audio_filename}"
+    audio_rel_path = f"{base_path}/{category}/audio/{audio_filename}"
```
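For reference, a manifest file in this style holds one JSON object per line. The field names below are illustrative, assembled from details discussed in this review (absolute audio path under `/data/musan`, duration in seconds, empty expected answer for non-speech audio), and are not copied from the actual implementation:

```python
import json

# Hypothetical manifest entry for a hallucination-detection sample.
entry = {
    "audio_filepath": "/data/musan/noise/audio/noise_0001.wav",
    "duration": 4.2,                # seconds, as returned by a duration probe
    "expected_answer": "",          # non-speech audio should transcribe to nothing
    "task_type": "Hallucination",
}
line = json.dumps(entry)  # one line of test.jsonl
```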
197-204: Unused `split` parameter.

The `split` parameter is defined but never used in this function. Either remove it or implement split-based logic if intended.

Remove the unused parameter:

```diff
 def process_category_from_files(
     category: str,
     dataset_path: Path,
     output_dir: Path,
     save_audio: bool = True,
-    split: str = "train",
     max_samples: int = -1,
 ) -> tuple[int, List[Dict]]:
```

Update the call site accordingly at lines 287-294.
420-426: Minor: unnecessary f-string prefix.

Line 421 has an f-string without placeholders.

```diff
 print("\n" + "=" * 60)
-print(f"MUSAN Dataset Preparation")
+print("MUSAN Dataset Preparation")
 print("=" * 60)
```
399-399: Unused `--split` argument.

The `--split` argument is parsed but the `split` parameter is never used in the processing functions. Either implement split-based behavior (e.g., naming output files differently) or remove this argument to avoid user confusion.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/dataset/musan/__init__.py
- nemo_skills/dataset/musan/prepare.py
🧰 Additional context used
🪛 Ruff (0.14.8)
nemo_skills/dataset/musan/prepare.py
55-55: Unused function argument: output_dir
(ARG001)
60-60: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
60-60: Avoid specifying long messages outside the exception class
(TRY003)
68-68: Do not catch blind exception: Exception
(BLE001)
69-69: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
69-69: Create your own exception
(TRY002)
69-69: Avoid specifying long messages outside the exception class
(TRY003)
78-78: f-string without any placeholders
Remove extraneous f prefix
(F541)
89-89: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
98-98: Uses of tarfile.extractall()
(S202)
119-119: Avoid specifying long messages outside the exception class
(TRY003)
142-142: Avoid specifying long messages outside the exception class
(TRY003)
202-202: Unused function argument: split
(ARG001)
208-208: Avoid specifying long messages outside the exception class
(TRY003)
239-239: Do not catch blind exception: Exception
(BLE001)
255-255: Do not catch blind exception: Exception
(BLE001)
357-357: Do not catch blind exception: Exception
(BLE001)
374-374: Do not catch blind exception: Exception
(BLE001)
421-421: f-string without any placeholders
Remove extraneous f prefix
(F541)
430-430: Do not catch blind exception: Exception
(BLE001)
454-454: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/musan/__init__.py (1)
43-57: LGTM!

The configuration constants are well-structured and appropriately define the dataset metadata for benchmarking integration. The module docstring provides clear documentation of download options with useful metrics.
nemo_skills/dataset/musan/prepare.py (1)
272-388: LGTM with observation.

The function properly routes processing based on source type. There is some code duplication with `process_category_from_files` (directory creation, manifest writing, error counting), which could be refactored into shared helpers, but the current structure is readable and acceptable.
```python
def download_from_openslr(output_dir: Path) -> Path:
    """Download MUSAN dataset from OpenSLR (11 GB)."""
    url = "https://www.openslr.org/resources/17/musan.tar.gz"
    download_path = output_dir / "musan.tar.gz"
    extract_path = output_dir / "musan_openslr"

    print(f"Downloading from OpenSLR (~11 GB)")
    print(f"URL: {url}")

    if not download_path.exists():
        def reporthook(block_num, block_size, total_size):
            downloaded = block_num * block_size
            percent = min(downloaded / total_size * 100, 100)
            mb_downloaded = downloaded / (1024 * 1024)
            mb_total = total_size / (1024 * 1024)
            print(f"\r{percent:.1f}% ({mb_downloaded:.1f}/{mb_total:.1f} MB)", end='')

        urllib.request.urlretrieve(url, download_path, reporthook)
        print("\nDownload complete")
    else:
        print(f"Using cached archive: {download_path}")

    if not extract_path.exists():
        print(f"Extracting to {extract_path}...")
        extract_path.mkdir(parents=True, exist_ok=True)
        with tarfile.open(download_path, 'r:gz') as tar:
            tar.extractall(extract_path)
        print("Extraction complete")
    else:
        print(f"Using extracted data: {extract_path}")

    return extract_path / "musan"
```
Security and reliability concerns with archive extraction and network request.

- Line 78: unnecessary `f` prefix on a string without placeholders.
- Line 89: `urlretrieve` has no timeout, which could hang indefinitely on network issues.
- Line 98: `tarfile.extractall()` is vulnerable to path traversal attacks. While the OpenSLR source is trusted, using a safer extraction pattern is recommended.

Proposed improvements:

```diff
-    print(f"Downloading from OpenSLR (~11 GB)")
+    print("Downloading from OpenSLR (~11 GB)")
```

For safer tarfile extraction (Python 3.12+ has a `filter` parameter; for older versions):

```python
def safe_extract(tar, path):
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not os.path.abspath(member_path).startswith(os.path.abspath(path)):
            raise ValueError(f"Attempted path traversal: {member.name}")
    tar.extractall(path)
```

🧰 Tools
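A version-aware sketch of the idea the review alludes to, combining both approaches: on Python 3.12+ (and the security backports that gained the `filter` argument) the stdlib rejects traversal members itself; on older versions, each member's destination is checked before extraction. The function name is illustrative:

```python
import os
import sys
import tarfile

def safe_extractall(tar: tarfile.TarFile, path: str) -> None:
    """Extract `tar` into `path`, refusing members that escape `path`."""
    if sys.version_info >= (3, 12):
        # The "data" filter rejects absolute names, "..", and special files.
        tar.extractall(path, filter="data")
        return
    # Fallback: verify every member resolves inside `path` before extracting.
    base = os.path.abspath(path)
    for member in tar.getmembers():
        dest = os.path.abspath(os.path.join(path, member.name))
        if os.path.commonpath([base, dest]) != base:
            raise ValueError(f"Blocked path traversal: {member.name}")
    tar.extractall(path)
```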
🪛 Ruff (0.14.8)
78-78: f-string without any placeholders
Remove extraneous f prefix
(F541)
89-89: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
98-98: Uses of tarfile.extractall()
(S202)
🤖 Prompt for AI Agents
In nemo_skills/dataset/musan/prepare.py around lines 72-103, address three
issues: remove the unnecessary f-string on line 78 (use a plain string), replace
the blocking urllib.request.urlretrieve call on line 89 with a network request
that supports a timeout (e.g., use urllib.request.urlopen with a timeout and
stream the response to download_path) to avoid hanging indefinitely, and replace
tar.extractall() on line 98 with a safe extraction routine that iterates
tar.getmembers(), computes each member's destination path, verifies that
os.path.abspath(dest).startswith(os.path.abspath(extract_path)) to prevent path
traversal, and only then extracts that member (creating parent directories as
needed).
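A minimal sketch of the download replacement the prompt describes (the function name is illustrative): `urllib.request.urlopen` accepts a `timeout` that bounds each socket operation, and `shutil.copyfileobj` streams the body to disk in chunks, so a stalled connection raises instead of hanging forever.

```python
import shutil
import urllib.request
from pathlib import Path

def download_with_timeout(url: str, dest: Path, timeout: float = 30.0) -> None:
    # Unlike urlretrieve without a timeout, a stalled connection here
    # raises a timeout error rather than blocking indefinitely.
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as f:
        shutil.copyfileobj(resp, f)
```

A progress hook like the existing `reporthook` could be re-added by reading fixed-size chunks in a loop instead of using `copyfileobj`.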
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
karpnv
left a comment
need to add to tests/gpu-tests/test_eval.py and tests/test_datasets.py
karpnv
left a comment
add to nemo_skills/pipeline/prepare_data.py
Can you please add this benchmark to the GPU test exclusion list, so that the GPU tests will not run so long?
Greptile Overview
Greptile Summary
Adds MUSAN dataset module supporting three download sources (HuggingFace, Kaggle, OpenSLR) with audio processing and manifest generation for hallucination detection benchmarks. Fixes hallucination detection metric calculation in audio evaluator from chars/second to chars/minute with corresponding threshold adjustment from 25 to 1500. Updates audio metrics aggregation to compute char_rate across all hallucination samples using cumulative totals instead of per-sample averages.
Confidence Score: 2/5
- Contains security vulnerability in tarball extraction that could enable path traversal attacks
- The tarball extraction in `prepare.py` line 98 uses unsafe `extractall()` without a `filter` parameter, allowing malicious archives to write files outside the intended directory. This is a known security issue (CVE-2007-4559) that the codebase already addresses correctly in `librispeech-pc/prepare.py` with version-specific filtering. The metric calculation fixes appear sound, with proper unit conversions.
- `nemo_skills/dataset/musan/prepare.py` requires fixing tarfile extraction safety
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/musan/prepare.py | 2/5 | Implements MUSAN dataset preparation with multi-source support; missing tarfile extraction safety check |
| nemo_skills/evaluation/evaluator/audio.py | 4/5 | Corrects hallucination detection metric from chars/second to chars/minute with proper threshold adjustment |
| nemo_skills/evaluation/metrics/audio_metrics.py | 4/5 | Updates char_rate calculation to aggregate across all Hallucination samples using total chars/minutes |
Sequence Diagram
sequenceDiagram
participant User
participant prepare.py
participant Source as Data Source
participant FS as File System
participant Evaluator as audio.py
participant Metrics as audio_metrics.py
User->>prepare.py: Run with --source kaggle/openslr/huggingface
alt Kaggle/OpenSLR
prepare.py->>Source: Download dataset (10-11 GB)
Source-->>prepare.py: Return dataset path
prepare.py->>FS: Extract tarball (unsafe)
else HuggingFace
prepare.py->>Source: Load via datasets library
Source-->>prepare.py: Return dataset object
end
prepare.py->>prepare.py: Process categories (music/speech/noise)
loop For each audio file
prepare.py->>FS: Read audio file
prepare.py->>FS: Save as standardized WAV
prepare.py->>FS: Write manifest entry (JSONL)
end
Note over User,Metrics: Evaluation Phase
User->>Evaluator: Evaluate MUSAN samples
Evaluator->>Evaluator: evaluate_hallucination()<br/>(char_rate in chars/minute)
Evaluator-->>Metrics: Return metrics per sample
Metrics->>Metrics: Accumulate total_hallucinated_chars<br/>and total_audio_seconds
Metrics->>Metrics: Compute final char_rate<br/>(chars/minute)
Metrics-->>User: Return aggregated metrics
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Greptile Overview
Greptile Summary
This PR adds the MUSAN dataset (Music, Speech, and Noise Corpus) for hallucination detection in audio models, and standardizes character rate metrics to use chars/minute instead of chars/second across the audio evaluation framework.
Major Changes
- New MUSAN dataset preparation script supporting HuggingFace, Kaggle, and OpenSLR sources
- Dataset configured for hallucination detection (expects empty transcription for non-speech audio)
- Character rate calculations converted from chars/second to chars/minute in evaluator and metrics
- Metric aggregation updated to accumulate total characters and duration for overall char_rate
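The aggregation change in the last bullet can be illustrated with a toy computation (the numbers are invented): the overall rate weights samples by duration, whereas averaging per-sample rates over-weights short clips.

```python
# Hypothetical (transcription, duration_seconds) pairs for hallucination samples.
samples = [("hello", 30.0), ("", 60.0), ("noise noise", 30.0)]

# New scheme: accumulate totals, then divide once.
total_chars = sum(len(text) for text, _ in samples)
total_seconds = sum(dur for _, dur in samples)
char_rate = total_chars / (total_seconds / 60.0)  # chars per minute

# Contrast: the mean of per-sample rates gives a different, length-biased number.
mean_of_rates = sum(len(t) / (d / 60.0) for t, d in samples) / len(samples)
```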
Critical Issues Found
- Missing `sys` import: line 97 uses `sys.version_info` but `sys` is never imported, which will cause a `NameError`
- Undefined `tar` variable: lines 98-100 attempt to use `tar.extractall()` but the tarfile is never opened with `tarfile.open()`, which will cause a `NameError`
- Incomplete category mapping: `CATEGORY_LABELS` only includes 'noise' and 'music', missing the 'speech' category; the HuggingFace source will silently fail for speech data
- Dead code: `LABEL_TO_CATEGORY` constant defined but never used
- Unused parameter: `split` parameter accepted but never used in processing functions
Confidence Score: 1/5
- This PR has critical runtime errors that will cause immediate crashes when using the OpenSLR source
- Score reflects two critical syntax/logic bugs (missing import and undefined variable) that will cause `NameError` exceptions at runtime, plus a logic bug with a missing category mapping that breaks the HuggingFace source for speech data. The audio evaluation changes are correct, but the dataset preparation script cannot function without fixes.
- Pay close attention to `nemo_skills/dataset/musan/prepare.py`: it requires immediate fixes for the missing import, undefined variable, and incomplete category mapping before it can be used
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/musan/prepare.py | 0/5 | New MUSAN dataset preparation script with critical bugs: missing sys import (line 97), undefined tar variable (lines 98-100), incomplete CATEGORY_LABELS dict missing 'speech' category, and unused dead code |
| nemo_skills/dataset/musan/init.py | 5/5 | New MUSAN dataset configuration file with proper metadata and benchmark setup, following established patterns from other audio datasets |
| nemo_skills/evaluation/evaluator/audio.py | 5/5 | Converts character rate calculations from chars/second to chars/minute for better readability, mathematically correct changes with consistent threshold updates |
| nemo_skills/evaluation/metrics/audio_metrics.py | 5/5 | Updates metric aggregation for char_rate to use chars/minute instead of chars/second, adds as_float import for proper formatting |
Sequence Diagram
sequenceDiagram
participant User
participant prepare.py
participant Source as Data Source
participant Evaluator as audio.py
participant Metrics as audio_metrics.py
User->>prepare.py: Run with --source & --categories
alt HuggingFace
prepare.py->>Source: load_dataset()
Source-->>prepare.py: Dataset samples
prepare.py->>prepare.py: Filter by CATEGORY_LABELS
else Kaggle/OpenSLR
prepare.py->>Source: Download & extract files
Source-->>prepare.py: WAV files
prepare.py->>prepare.py: Read WAV files
end
prepare.py->>prepare.py: create_manifest_entry()<br/>(task_type: Hallucination)
prepare.py->>prepare.py: Save to test.jsonl
Note over User,Metrics: Evaluation Phase
User->>Evaluator: Run evaluation
Evaluator->>Evaluator: evaluate_hallucination()<br/>(check char_rate in chars/min)
Evaluator->>Metrics: Pass results
Metrics->>Metrics: Aggregate char_rate<br/>(total_chars / total_minutes)
Metrics-->>User: Final metrics
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
…lls into musan-hallucination-eval
Greptile Overview
Greptile Summary
This PR adds the MUSAN dataset for audio hallucination detection and updates char_rate metrics to use chars/minute instead of chars/second.
Major Changes:
- New MUSAN dataset module supporting 3 sources (HuggingFace, Kaggle, OpenSLR)
- Hallucination detection via speaking rate analysis (expects empty transcription for non-speech audio)
- Metrics refactoring: char_rate now aggregated across samples in chars/minute
Critical Issues Found:
- Missing `import sys` will cause a `NameError` when using the OpenSLR source
- Undefined `tar` variable will cause a `NameError` during extraction
- Missing `speech` label in `CATEGORY_LABELS` will break the HuggingFace source for the speech category
Style Note:
The `LABEL_TO_CATEGORY` dictionary on lines 48-51 appears unused throughout the codebase.
Confidence Score: 1/5
- NOT safe to merge - contains critical runtime errors that will cause crashes
- Three critical bugs in `prepare.py` will cause immediate `NameError` exceptions: missing `sys` import (line 97), undefined `tar` variable (lines 98-100), and missing `speech` category label (line 303). These are blocking issues that make the OpenSLR source and HuggingFace source completely non-functional.
- Critical attention needed for `nemo_skills/dataset/musan/prepare.py`: fix the import statement, tarfile context manager, and category labels before merging
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/musan/prepare.py | 1/5 | Critical bugs: missing sys import causes NameError, undefined tar variable causes NameError, missing speech category label will cause HuggingFace source to fail for speech category |
| nemo_skills/evaluation/evaluator/audio.py | 5/5 | Unit conversion from chars/second to chars/minute is correct and consistent throughout |
| nemo_skills/evaluation/metrics/audio_metrics.py | 5/5 | Refactored char_rate calculation to aggregate across all samples, correctly imports as_float |
Sequence Diagram
sequenceDiagram
participant User
participant PrepareData as prepare_data.py
participant Prepare as musan/prepare.py
participant Source as Data Source<br/>(HF/Kaggle/OpenSLR)
participant Evaluator as audio.py
participant Metrics as audio_metrics.py
User->>PrepareData: ns prepare_data musan
PrepareData->>Prepare: Execute prepare.py
Prepare->>Source: Download dataset
Source-->>Prepare: Audio files
Prepare->>Prepare: Process categories<br/>(music/speech/noise)
Prepare->>Prepare: Create manifest entries<br/>(task_type=Hallucination)
Prepare-->>PrepareData: test.jsonl files
User->>Evaluator: ns eval musan
Evaluator->>Evaluator: Read test.jsonl
Evaluator->>Evaluator: Generate predictions
Evaluator->>Evaluator: evaluate_hallucination()<br/>(check char_rate)
Evaluator-->>Metrics: Predictions with metrics
Metrics->>Metrics: Aggregate char_rate<br/>across all samples
Metrics-->>User: Final metrics report
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Greptile Overview
Greptile Summary
Added MUSAN (Music, Speech, and Noise Corpus) dataset support with multi-source download capabilities and improved audio hallucination detection metrics by standardizing character rate calculations to chars/minute.
Major Changes:
- Implemented MUSAN dataset preparation with support for HuggingFace, Kaggle, and OpenSLR data sources
- Created manifest generation for hallucination detection tasks (expects empty transcription for non-speech audio)
- Refactored character rate calculation from chars/second to chars/minute across evaluator and metrics (multiplied by 60)
- Updated hallucination threshold from 25 chars/sec to 1500 chars/min (equivalent but more readable)
- Changed metrics aggregation from averaging individual char_rate scores to computing total chars / total minutes for hallucination tasks
- Added MUSAN to dataset pipeline and test configurations
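The unit change in the bullets above can be sanity-checked with a small sketch (the function and constant names here are illustrative, not the actual API): 25 chars/sec and 1500 chars/min are the same threshold, just in more readable units.

```python
HALLUCINATION_THRESHOLD_CPM = 1500  # chars/minute; identical to the old 25 chars/sec

def char_rate_per_minute(text: str, duration_seconds: float) -> float:
    # chars/second * 60 = chars/minute
    return len(text) / duration_seconds * 60.0

def looks_hallucinated(text: str, duration_seconds: float) -> bool:
    # Non-speech audio should transcribe to (near) nothing; a high character
    # rate on such a clip flags a likely hallucination.
    return char_rate_per_minute(text, duration_seconds) > HALLUCINATION_THRESHOLD_CPM
```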
Issues Found:
- `CATEGORY_LABELS` in `prepare.py` is missing the "speech" label mapping; the HuggingFace path will fail silently for the speech category
Confidence Score: 3/5
- This PR is mostly safe to merge with one critical logic issue that needs fixing
- The PR adds substantial new functionality with proper error handling and multiple data source support. However, there is a logic bug where the `CATEGORY_LABELS` dictionary is missing the "speech" label, which will cause the HuggingFace processing path to silently return 0 samples when users request the speech category. The refactored metrics calculations appear correct and well-documented. Test coverage is appropriate, with the dataset excluded from GPU tests.
- Pay close attention to `nemo_skills/dataset/musan/prepare.py`: fix the missing "speech" label mapping before merging
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/musan/prepare.py | 3/5 | Implemented MUSAN dataset preparation with support for HuggingFace/Kaggle/OpenSLR sources - missing speech label mapping for HuggingFace path |
| nemo_skills/evaluation/evaluator/audio.py | 5/5 | Fixed character rate calculation from chars/second to chars/minute (multiplied by 60) for better readability and consistency |
| nemo_skills/evaluation/metrics/audio_metrics.py | 5/5 | Refactored char_rate metric to aggregate total chars and duration for hallucination tasks, changed from individual scores to total-based calculation |
Sequence Diagram
sequenceDiagram
participant User
participant prepare.py
participant DataSource as Data Source<br/>(HF/Kaggle/OpenSLR)
participant FileSystem
participant audio.py as Audio Evaluator
participant audio_metrics.py as Audio Metrics
Note over User,audio_metrics.py: Dataset Preparation Phase
User->>prepare.py: Run with --source, --categories
prepare.py->>DataSource: load_dataset_from_source()
alt HuggingFace
DataSource-->>prepare.py: HF dataset object
else Kaggle/OpenSLR
DataSource-->>prepare.py: Path to WAV files
end
prepare.py->>prepare.py: process_category()
loop For each category (noise/music/speech)
alt HuggingFace source
prepare.py->>prepare.py: Filter by CATEGORY_LABELS
else Kaggle/OpenSLR
prepare.py->>prepare.py: process_category_from_files()
prepare.py->>FileSystem: Read WAV files
end
prepare.py->>prepare.py: create_manifest_entry()
Note right of prepare.py: task_type: "Hallucination"<br/>expected_answer: ""
prepare.py->>FileSystem: Save audio + test.jsonl
end
Note over User,audio_metrics.py: Evaluation Phase
User->>audio.py: evaluate_sample()
audio.py->>audio.py: evaluate_hallucination()
Note right of audio.py: char_rate = (chars/duration) * 60<br/>threshold: 1500 chars/min
audio.py-->>User: Sample metrics
User->>audio_metrics.py: aggregate_predictions()
loop For each prediction
alt task_type == "Hallucination"
audio_metrics.py->>audio_metrics.py: Accumulate total_hallucinated_chars<br/>and total_audio_seconds
end
end
audio_metrics.py->>audio_metrics.py: get_metrics()
Note right of audio_metrics.py: char_rate = total_chars / total_minutes
audio_metrics.py-->>User: Aggregated metrics
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Greptile Overview

Greptile Summary

This PR adds the MUSAN (Music, Speech, and Noise Corpus) dataset to nemo-skills for audio hallucination detection benchmarking. The implementation includes:

Key Components

Dataset Preparation (…)
Audio Evaluation Improvements:
Infrastructure Updates:
Critical Issues Identified
Confidence Score: 2/5
Important Files Changed

File Analysis
Sequence Diagram

sequenceDiagram
participant User
participant PrepareScript as prepare.py
participant DataSource as Data Source<br/>(HF/Kaggle/OpenSLR)
participant FileSystem as File System
participant Evaluator as audio.py
participant Metrics as audio_metrics.py
User->>PrepareScript: Run dataset preparation
PrepareScript->>DataSource: Load MUSAN dataset
DataSource-->>PrepareScript: Return audio files
loop For each category (music/speech/noise)
PrepareScript->>PrepareScript: Filter samples by category
loop For each audio sample
PrepareScript->>PrepareScript: Read audio array & compute duration
alt save_audio=True
PrepareScript->>FileSystem: Save WAV file
end
PrepareScript->>PrepareScript: Create manifest entry with<br/>task_type="Hallucination"<br/>expected_answer=""
end
PrepareScript->>FileSystem: Write test.jsonl manifest
end
User->>Evaluator: Run evaluation
loop For each sample in manifest
Evaluator->>Evaluator: Load audio & generate transcription
Evaluator->>Evaluator: evaluate_hallucination()<br/>(check char_rate > 1500 chars/min)
Evaluator->>Metrics: Update metrics with results
end
Metrics->>Metrics: Aggregate char_rate globally<br/>(total_chars / total_minutes)
Metrics-->>User: Return evaluation results
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit