HF ASR Leaderboard Evaluation #1104

Merged
gwarmstrong merged 8 commits into main from hf-asr-leaderboard
Dec 15, 2025
Conversation

@melllinia (Member) commented Dec 12, 2025

Summary by CodeRabbit

  • New Features
    • Added ASR Leaderboard dataset integration with automatic downloading and preparation of audio speech datasets
    • Enabled audio file processing and standardized formatting for dataset evaluation workflows
    • Supports optional audio saving and configurable preprocessing for multiple dataset sources


Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
coderabbitai bot (Contributor) commented Dec 12, 2025

📝 Walkthrough

This PR introduces a new ASR Leaderboard dataset integration module with configuration constants and dataset preparation utilities for downloading, formatting, and exporting ASR datasets from HuggingFace, including audio normalization, filtering, and JSONL export functionality.

Changes

ASR Leaderboard Configuration
nemo_skills/dataset/asr-leaderboard/__init__.py
Adds module configuration constants: DATASET_GROUP ("speechlm"), METRICS_TYPE ("audio"), and GENERATION_ARGS for audio evaluation with HuggingFace preprocessing; includes license header and module documentation for ASR_LEADERBOARD task type with WER calculation.
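From that summary, the configuration module plausibly reduces to a few module-level constants. A minimal sketch, assuming a Hydra-style override string for GENERATION_ARGS (only the constant names and the "speechlm"/"audio" values come from the PR summary; the GENERATION_ARGS value is a hypothetical placeholder):

```python
# Sketch of nemo_skills/dataset/asr-leaderboard/__init__.py.
# DATASET_GROUP and METRICS_TYPE values are stated in the PR summary;
# the GENERATION_ARGS string below is an illustrative placeholder.
DATASET_GROUP = "speechlm"
METRICS_TYPE = "audio"
GENERATION_ARGS = "++preprocess_dataset=huggingface"  # hypothetical override
```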
ASR Leaderboard Dataset Preparation
nemo_skills/dataset/asr-leaderboard/prepare.py
Introduces the dataset preparation pipeline with a registry mapping dataset names to HuggingFace identifiers and configs; implements save_audio_and_format_entry() for text normalization, audio extraction, and ASR_LEADERBOARD format conversion with task_type and expected_answer fields; provides prepare_dataset() for streaming/full dataset loading, per-dataset JSONL export, and audio file organization; includes a main() CLI workflow for batch processing, aggregation into a combined test.jsonl, and runtime logging.
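The per-entry conversion described above can be sketched as follows. This is not the PR's implementation: the MIN_AUDIO_DURATION value and output keys beyond task_type/expected_answer are assumptions; the duration filter comes from the sequence diagram below and the text-field fallback chain is quoted verbatim later in the review thread.

```python
# Sketch of save_audio_and_format_entry(); threshold and extra keys are
# illustrative assumptions, not taken from the diff.
MIN_AUDIO_DURATION = 0.1  # seconds; hypothetical threshold

def save_audio_and_format_entry(entry, sample_id, sampling_rate=16000):
    """Return an ASR_LEADERBOARD-format dict, or None if the clip is too short."""
    audio = entry.get("audio", {})
    duration = len(audio.get("array", [])) / sampling_rate
    if duration < MIN_AUDIO_DURATION:
        return None  # filtered out, the "Skip (None)" branch in the diagram
    # Text-field fallback chain quoted in the review thread:
    text = (
        entry.get("text", "")                # ami, LS, gigaspeech, tedlium
        or entry.get("normalized_text", "")  # voxpopuli
        or entry.get("transcript", "")       # spgispeech
        or entry.get("transcription", "")    # earnings22
    )
    return {
        "task_type": "ASR_LEADERBOARD",
        "expected_answer": text.strip(),
        "sample_id": sample_id,  # hypothetical extra field
    }
```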

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / User
    participant Prep as prepare.py
    participant HF as HuggingFace<br/>Dataset API
    participant Fmt as Format<br/>Engine
    participant FS as File System
    participant Output as Output<br/>JSONL

    CLI->>Prep: main(dataset_names, output_dir, with_audio)
    
    rect rgb(200, 220, 255)
    note over Prep,HF: Per-Dataset Processing
    loop For each dataset
        Prep->>HF: Load dataset (streaming or full)
        HF-->>Prep: Dataset entries
        
        loop For each entry
            Prep->>Fmt: save_audio_and_format_entry()
            
            rect rgb(240, 200, 200)
            note over Fmt: Validate & Filter
            Fmt->>Fmt: Check audio duration<br/>≥ MIN_AUDIO_DURATION
            alt Duration < threshold
                Fmt-->>Prep: Skip (None)
            else Duration valid
                Fmt->>Fmt: Normalize text fields<br/>Extract/transcribe
            end
            end
            
            alt with_audio enabled
                Fmt->>FS: Save audio to<br/>data/{dataset}/*
                FS-->>Fmt: audio_path
            end
            
            Fmt->>Fmt: Build ASR_LEADERBOARD<br/>format dict
            Fmt-->>Prep: Formatted entry
        end
        
        Prep->>Output: Write per-dataset JSONL
        Output-->>Prep: Saved (dataset.jsonl)
    end
    end
    
    rect rgb(200, 255, 200)
    note over Prep,Output: Aggregation
    Prep->>Output: Concatenate all JSONL files<br/>→ test.jsonl
    Output-->>CLI: Final aggregated output
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Audio processing logic and duration filtering: Verify minimum duration threshold application and audio extraction correctness
  • Dataset configuration and HuggingFace integration: Review dataset registry mappings and streaming vs. full load behavior
  • Format conversion and field validation: Ensure ASR_LEADERBOARD format compliance and required field presence (task_type, expected_answer, messages)
  • File I/O and directory structure: Validate audio directory creation, path handling, and per-dataset JSONL organization
  • Error handling: Check robustness of dataset loading failures and missing audio edge cases

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title 'HF ASR Leaderboard Evaluation' is directly related to the main changes, which add a complete ASR Leaderboard dataset integration module with dataset preparation and evaluation configuration.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
nemo_skills/dataset/asr-leaderboard/prepare.py (2)

107-114: Consider more specific exception handling.

The broad Exception catch may hide specific issues during dataset loading. While the current approach provides resilience, more targeted exception handling would improve debuggability.

As per static analysis hints, consider catching specific exceptions:

     print(f"Loading {dataset_name} from {hf_dataset} (streaming={streaming})...")
     try:
         if hf_config:
             dataset = load_dataset(hf_dataset, hf_config, split=hf_split, trust_remote_code=True, streaming=streaming)
         else:
             dataset = load_dataset(hf_dataset, split=hf_split, trust_remote_code=True, streaming=streaming)
-    except Exception as e:
+    except (ValueError, OSError, RuntimeError) as e:
         print(f"Warning: Failed to load {dataset_name}: {e}")
         return 0

149-155: Consider using spread operator for list construction.

As per static analysis hints, the choices list can be constructed more cleanly using the spread operator.

Apply this diff:

     parser.add_argument(
         "--datasets",
         nargs="+",
         default=["all"],
-        choices=list(DATASET_CONFIGS.keys()) + ["all"],
+        choices=[*DATASET_CONFIGS.keys(), "all"],
         help="Datasets to prepare (default: all)",
     )
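For context, the two forms are equivalent; a quick check with a stand-in registry (the real DATASET_CONFIGS lives in prepare.py):

```python
# Stand-in registry just to demonstrate the equivalence suggested by RUF005.
DATASET_CONFIGS = {"ami": {}, "voxpopuli": {}}

concat = list(DATASET_CONFIGS.keys()) + ["all"]  # original form
spread = [*DATASET_CONFIGS.keys(), "all"]        # suggested form
assert concat == spread == ["ami", "voxpopuli", "all"]
```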
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 281c487 and 1915a44.

📒 Files selected for processing (2)
  • nemo_skills/dataset/asr-leaderboard/__init__.py (1 hunks)
  • nemo_skills/dataset/asr-leaderboard/prepare.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
nemo_skills/dataset/asr-leaderboard/prepare.py

102-102: Avoid specifying long messages outside the exception class

(TRY003)


112-112: Do not catch blind exception: Exception

(BLE001)


153-153: Consider [*list(DATASET_CONFIGS.keys()), "all"] instead of concatenation

Replace with [*list(DATASET_CONFIGS.keys()), "all"]

(RUF005)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/dataset/asr-leaderboard/__init__.py (1)

19-21: LGTM! Configuration constants are well-defined.

The module constants appropriately configure the ASR Leaderboard evaluation pipeline with audio-specific settings and OpenAI prompt format.

nemo_skills/dataset/asr-leaderboard/prepare.py (3)

26-35: LGTM! Imports and constants are appropriate.

The dependencies and threshold constants are well-suited for ASR dataset preparation.


38-47: LGTM! Dataset registry is well-structured.

The configuration clearly maps dataset names to their HuggingFace identifiers with appropriate streaming flags for large datasets.
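The registry shape the reviewer describes might look like the sketch below. The dataset names come from the field-fallback comments quoted later in the thread; the HF repo IDs, configs, splits, and streaming flags are illustrative placeholders, not values from the PR.

```python
# Hypothetical shape of the DATASET_CONFIGS registry; all values below
# are placeholders for illustration only.
DATASET_CONFIGS = {
    "ami": {"hf_dataset": "org/ami", "hf_config": None,
            "hf_split": "test", "streaming": False},
    "voxpopuli": {"hf_dataset": "org/voxpopuli", "hf_config": "en",
                  "hf_split": "test", "streaming": False},
    "gigaspeech": {"hf_dataset": "org/gigaspeech", "hf_config": "xs",
                   "hf_split": "test", "streaming": True},  # large -> stream
}
```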


179-195: LGTM! Dataset aggregation logic is sound.

The combined JSONL creation correctly concatenates per-dataset outputs while preserving dataset boundaries and providing clear progress logging.

Comment on lines +76 to +77
if with_audio:
sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate)
coderabbitai bot commented Dec 12, 2025
🛠️ Refactor suggestion | 🟠 Major

Add error handling for audio file writes.

The sf.write call could fail due to disk space, permissions, or invalid audio data, potentially causing the entire dataset preparation to crash.

Apply this diff to add error handling:

         if with_audio:
-            sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate)
+            try:
+                sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate)
+            except Exception as e:
+                print(f"Warning: Failed to save audio for sample {sample_id}: {e}")
+                return None
🤖 Prompt for AI Agents
In nemo_skills/dataset/asr-leaderboard/prepare.py around lines 76-77, wrap the
sf.write call in a try/except to prevent the whole preparation from crashing on
disk/permission/invalid-data errors: catch Exception, log the failure with
contextual details (full audio path, sampling_rate, audio_filename, and the
exception/traceback) using the module logger (create one if missing), and then
either skip this file (continue) or raise a controlled error with a clear
message depending on the surrounding flow; ensure no bare prints are used and
that the handler provides enough detail for debugging.

@Jorjeous (Member) replied:
Since this is in the prepare stage, it's not a problem.

coderabbitai bot replied:

@Jorjeous, understood! You're right that for a data preparation script, failing fast on errors can be more appropriate than silently skipping problematic samples. It makes debugging easier during the setup phase.

@melllinia self-assigned this Dec 12, 2025
@Jorjeous (Member) commented:

LGTM

entry.get("text", "") # ami, LS, gigaspeech, tedlium
or entry.get("normalized_text", "") # voxpopuli
or entry.get("transcript", "") # spgispeech
or entry.get("transcription", "") # earnings22
Member:

Just a quick question: is there any chance that some dataset has more than one fitting field?

If not, then we're good to go.
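To the question above: even if an entry did carry more than one fitting field, Python's `or` chain short-circuits, so the first non-empty field in the chain wins and the rest are never consulted. A quick illustration with a hypothetical entry:

```python
# Hypothetical entry with two fitting fields; `or` short-circuits on the
# first truthy (non-empty) value, so "text" wins over "transcript".
entry = {"text": "hello world", "transcript": "HELLO WORLD."}

result = (
    entry.get("text", "")
    or entry.get("normalized_text", "")
    or entry.get("transcript", "")
    or entry.get("transcription", "")
)
assert result == "hello world"  # "transcript" is ignored
```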

@gwarmstrong (Collaborator) commented:

Let's just run the integration tests to see if we need to include this in the ignored datasets. I am using those CI runners for some development today. I'll run the tests EOD and merge once complete.

@gwarmstrong gwarmstrong merged commit 294b552 into main Dec 15, 2025
5 checks passed
@gwarmstrong gwarmstrong deleted the hf-asr-leaderboard branch December 15, 2025 22:46
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@coderabbitai coderabbitai bot mentioned this pull request Dec 23, 2025
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Jan 22, 2026
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>