HF ASR Leaderboard Evaluation #1104

Merged
gwarmstrong merged 8 commits into main from hf-asr-leaderboard
Dec 15, 2025
Conversation

@melllinia (Member) commented Dec 12, 2025

Summary by CodeRabbit

  • New Features
    • Added ASR Leaderboard dataset integration with automatic downloading and preparation of audio speech datasets
    • Enabled audio file processing and standardized formatting for dataset evaluation workflows
    • Supports optional audio saving and configurable preprocessing for multiple dataset sources


Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
coderabbitai bot (Contributor) commented Dec 12, 2025

📝 Walkthrough

This PR introduces a new ASR Leaderboard dataset integration module with configuration constants and dataset preparation utilities for downloading, formatting, and exporting ASR datasets from HuggingFace, including audio normalization, filtering, and JSONL export functionality.

Changes

ASR Leaderboard Configuration
nemo_skills/dataset/asr-leaderboard/__init__.py
Adds module configuration constants: DATASET_GROUP ("speechlm"), METRICS_TYPE ("audio"), and GENERATION_ARGS for audio evaluation with HuggingFace preprocessing; includes license header and module documentation for ASR_LEADERBOARD task type with WER calculation.
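From that summary, the configuration module plausibly reduces to a few module-level constants. A minimal sketch, assuming a Hydra-style override string for GENERATION_ARGS (only the constant names and the "speechlm"/"audio" values come from the PR summary; the GENERATION_ARGS value is a hypothetical placeholder):

```python
# Sketch of nemo_skills/dataset/asr-leaderboard/__init__.py.
# DATASET_GROUP and METRICS_TYPE values are stated in the PR summary;
# the GENERATION_ARGS string below is an illustrative placeholder.
DATASET_GROUP = "speechlm"
METRICS_TYPE = "audio"
GENERATION_ARGS = "++preprocess_dataset=huggingface"  # hypothetical override
```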
ASR Leaderboard Dataset Preparation
nemo_skills/dataset/asr-leaderboard/prepare.py
Introduces the dataset preparation pipeline with a registry mapping dataset names to HuggingFace identifiers and configs; implements save_audio_and_format_entry() for text normalization, audio extraction, and ASR_LEADERBOARD format conversion with task_type and expected_answer fields; provides prepare_dataset() for streaming/full dataset loading, per-dataset JSONL export, and audio file organization; includes a main() CLI workflow for batch processing, aggregation into a combined test.jsonl, and runtime logging.
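The per-entry conversion described above can be sketched as follows. This is not the PR's implementation: the MIN_AUDIO_DURATION value and output keys beyond task_type/expected_answer are assumptions; the duration filter comes from the sequence diagram below and the text-field fallback chain is quoted verbatim later in the review thread.

```python
# Sketch of save_audio_and_format_entry(); threshold and extra keys are
# illustrative assumptions, not taken from the diff.
MIN_AUDIO_DURATION = 0.1  # seconds; hypothetical threshold

def save_audio_and_format_entry(entry, sample_id, sampling_rate=16000):
    """Return an ASR_LEADERBOARD-format dict, or None if the clip is too short."""
    audio = entry.get("audio", {})
    duration = len(audio.get("array", [])) / sampling_rate
    if duration < MIN_AUDIO_DURATION:
        return None  # filtered out, the "Skip (None)" branch in the diagram
    # Text-field fallback chain quoted in the review thread:
    text = (
        entry.get("text", "")                # ami, LS, gigaspeech, tedlium
        or entry.get("normalized_text", "")  # voxpopuli
        or entry.get("transcript", "")       # spgispeech
        or entry.get("transcription", "")    # earnings22
    )
    return {
        "task_type": "ASR_LEADERBOARD",
        "expected_answer": text.strip(),
        "sample_id": sample_id,  # hypothetical extra field
    }
```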

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / User
    participant Prep as prepare.py
    participant HF as HuggingFace<br/>Dataset API
    participant Fmt as Format<br/>Engine
    participant FS as File System
    participant Output as Output<br/>JSONL

    CLI->>Prep: main(dataset_names, output_dir, with_audio)
    
    rect rgb(200, 220, 255)
    note over Prep,HF: Per-Dataset Processing
    loop For each dataset
        Prep->>HF: Load dataset (streaming or full)
        HF-->>Prep: Dataset entries
        
        loop For each entry
            Prep->>Fmt: save_audio_and_format_entry()
            
            rect rgb(240, 200, 200)
            note over Fmt: Validate & Filter
            Fmt->>Fmt: Check audio duration<br/>≥ MIN_AUDIO_DURATION
            alt Duration < threshold
                Fmt-->>Prep: Skip (None)
            else Duration valid
                Fmt->>Fmt: Normalize text fields<br/>Extract/transcribe
            end
            end
            
            alt with_audio enabled
                Fmt->>FS: Save audio to<br/>data/{dataset}/*
                FS-->>Fmt: audio_path
            end
            
            Fmt->>Fmt: Build ASR_LEADERBOARD<br/>format dict
            Fmt-->>Prep: Formatted entry
        end
        
        Prep->>Output: Write per-dataset JSONL
        Output-->>Prep: Saved (dataset.jsonl)
    end
    end
    
    rect rgb(200, 255, 200)
    note over Prep,Output: Aggregation
    Prep->>Output: Concatenate all JSONL files<br/>→ test.jsonl
    Output-->>CLI: Final aggregated output
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Audio processing logic and duration filtering: Verify minimum duration threshold application and audio extraction correctness
  • Dataset configuration and HuggingFace integration: Review dataset registry mappings and streaming vs. full load behavior
  • Format conversion and field validation: Ensure ASR_LEADERBOARD format compliance and required field presence (task_type, expected_answer, messages)
  • File I/O and directory structure: Validate audio directory creation, path handling, and per-dataset JSONL organization
  • Error handling: Check robustness of dataset loading failures and missing audio edge cases

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title 'HF ASR Leaderboard Evaluation' is directly related to the main changes, which add a complete ASR Leaderboard dataset integration module with dataset preparation and evaluation configuration.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
nemo_skills/dataset/asr-leaderboard/prepare.py (2)

107-114: Consider more specific exception handling.

The broad Exception catch may hide specific issues during dataset loading. While the current approach provides resilience, more targeted exception handling would improve debuggability.

As per static analysis hints, consider catching specific exceptions:

     print(f"Loading {dataset_name} from {hf_dataset} (streaming={streaming})...")
     try:
         if hf_config:
             dataset = load_dataset(hf_dataset, hf_config, split=hf_split, trust_remote_code=True, streaming=streaming)
         else:
             dataset = load_dataset(hf_dataset, split=hf_split, trust_remote_code=True, streaming=streaming)
-    except Exception as e:
+    except (ValueError, OSError, RuntimeError) as e:
         print(f"Warning: Failed to load {dataset_name}: {e}")
         return 0

149-155: Consider using spread operator for list construction.

As per static analysis hints, the choices list can be constructed more cleanly using the spread operator.

Apply this diff:

     parser.add_argument(
         "--datasets",
         nargs="+",
         default=["all"],
-        choices=list(DATASET_CONFIGS.keys()) + ["all"],
+        choices=[*DATASET_CONFIGS.keys(), "all"],
         help="Datasets to prepare (default: all)",
     )
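For context, the two forms are equivalent; a quick check with a stand-in registry (the real DATASET_CONFIGS lives in prepare.py):

```python
# Stand-in registry just to demonstrate the equivalence suggested by RUF005.
DATASET_CONFIGS = {"ami": {}, "voxpopuli": {}}

concat = list(DATASET_CONFIGS.keys()) + ["all"]  # original form
spread = [*DATASET_CONFIGS.keys(), "all"]        # suggested form
assert concat == spread == ["ami", "voxpopuli", "all"]
```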
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 281c487 and 1915a44.

📒 Files selected for processing (2)
  • nemo_skills/dataset/asr-leaderboard/__init__.py (1 hunks)
  • nemo_skills/dataset/asr-leaderboard/prepare.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
nemo_skills/dataset/asr-leaderboard/prepare.py

102-102: Avoid specifying long messages outside the exception class

(TRY003)


112-112: Do not catch blind exception: Exception

(BLE001)


153-153: Consider [*list(DATASET_CONFIGS.keys()), "all"] instead of concatenation

Replace with [*list(DATASET_CONFIGS.keys()), "all"]

(RUF005)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/dataset/asr-leaderboard/__init__.py (1)

19-21: LGTM! Configuration constants are well-defined.

The module constants appropriately configure the ASR Leaderboard evaluation pipeline with audio-specific settings and OpenAI prompt format.

nemo_skills/dataset/asr-leaderboard/prepare.py (3)

26-35: LGTM! Imports and constants are appropriate.

The dependencies and threshold constants are well-suited for ASR dataset preparation.


38-47: LGTM! Dataset registry is well-structured.

The configuration clearly maps dataset names to their HuggingFace identifiers with appropriate streaming flags for large datasets.
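The registry shape the reviewer describes might look like the sketch below. The dataset names come from the field-fallback comments quoted later in the thread; the HF repo IDs, configs, splits, and streaming flags are illustrative placeholders, not values from the PR.

```python
# Hypothetical shape of the DATASET_CONFIGS registry; all values below
# are placeholders for illustration only.
DATASET_CONFIGS = {
    "ami": {"hf_dataset": "org/ami", "hf_config": None,
            "hf_split": "test", "streaming": False},
    "voxpopuli": {"hf_dataset": "org/voxpopuli", "hf_config": "en",
                  "hf_split": "test", "streaming": False},
    "gigaspeech": {"hf_dataset": "org/gigaspeech", "hf_config": "xs",
                   "hf_split": "test", "streaming": True},  # large -> stream
}
```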


179-195: LGTM! Dataset aggregation logic is sound.

The combined JSONL creation correctly concatenates per-dataset outputs while preserving dataset boundaries and providing clear progress logging.

Comment on lines +76 to +77
if with_audio:
sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate)
coderabbitai bot commented Dec 12, 2025
🛠️ Refactor suggestion | 🟠 Major

Add error handling for audio file writes.

The sf.write call could fail due to disk space, permissions, or invalid audio data, potentially causing the entire dataset preparation to crash.

Apply this diff to add error handling:

         if with_audio:
-            sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate)
+            try:
+                sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate)
+            except Exception as e:
+                print(f"Warning: Failed to save audio for sample {sample_id}: {e}")
+                return None
🤖 Prompt for AI Agents
In nemo_skills/dataset/asr-leaderboard/prepare.py around lines 76-77, wrap the
sf.write call in a try/except to prevent the whole preparation from crashing on
disk/permission/invalid-data errors: catch Exception, log the failure with
contextual details (full audio path, sampling_rate, audio_filename, and the
exception/traceback) using the module logger (create one if missing), and then
either skip this file (continue) or raise a controlled error with a clear
message depending on the surrounding flow; ensure no bare prints are used and
that the handler provides enough detail for debugging.

@Jorjeous (Member) replied:
Since this is in the prepare stage, it's not a problem.

coderabbitai bot replied:

@Jorjeous, understood! You're right that for a data preparation script, failing fast on errors can be more appropriate than silently skipping problematic samples. It makes debugging easier during the setup phase.

@melllinia self-assigned this Dec 12, 2025
@Jorjeous (Member) commented:

LGTM

entry.get("text", "") # ami, LS, gigaspeech, tedlium
or entry.get("normalized_text", "") # voxpopuli
or entry.get("transcript", "") # spgispeech
or entry.get("transcription", "") # earnings22
Member:

Just a quick question: is there any chance that some dataset has more than one fitting field?

If not, then we're good to go.
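To the question above: even if an entry did carry more than one fitting field, Python's `or` chain short-circuits, so the first non-empty field in the chain wins and the rest are never consulted. A quick illustration with a hypothetical entry:

```python
# Hypothetical entry with two fitting fields; `or` short-circuits on the
# first truthy (non-empty) value, so "text" wins over "transcript".
entry = {"text": "hello world", "transcript": "HELLO WORLD."}

result = (
    entry.get("text", "")
    or entry.get("normalized_text", "")
    or entry.get("transcript", "")
    or entry.get("transcription", "")
)
assert result == "hello world"  # "transcript" is ignored
```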

@gwarmstrong (Collaborator) commented:

Let's just run the integration tests to see if we need to include this in the ignored datasets. I am using those CI runners for some development today. I'll run the tests EOD and merge once complete.

@gwarmstrong gwarmstrong merged commit 294b552 into main Dec 15, 2025
5 checks passed
@gwarmstrong gwarmstrong deleted the hf-asr-leaderboard branch December 15, 2025 22:46
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@coderabbitai coderabbitai bot mentioned this pull request Dec 23, 2025
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Jan 22, 2026
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>