Conversation
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
📝 Walkthrough

This PR introduces a new ASR Leaderboard dataset integration module with configuration constants and dataset preparation utilities for downloading, formatting, and exporting ASR datasets from HuggingFace, including audio normalization, filtering, and JSONL export functionality.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI / User
    participant Prep as prepare.py
    participant HF as HuggingFace<br/>Dataset API
    participant Fmt as Format<br/>Engine
    participant FS as File System
    participant Output as Output<br/>JSONL
    CLI->>Prep: main(dataset_names, output_dir, with_audio)
    rect rgb(200, 220, 255)
        note over Prep,HF: Per-Dataset Processing
        loop For each dataset
            Prep->>HF: Load dataset (streaming or full)
            HF-->>Prep: Dataset entries
            loop For each entry
                Prep->>Fmt: save_audio_and_format_entry()
                rect rgb(240, 200, 200)
                    note over Fmt: Validate & Filter
                    Fmt->>Fmt: Check audio duration<br/>≥ MIN_AUDIO_DURATION
                    alt Duration < threshold
                        Fmt-->>Prep: Skip (None)
                    else Duration valid
                        Fmt->>Fmt: Normalize text fields<br/>Extract/transcribe
                    end
                end
                alt with_audio enabled
                    Fmt->>FS: Save audio to<br/>data/{dataset}/*
                    FS-->>Fmt: audio_path
                end
                Fmt->>Fmt: Build ASR_LEADERBOARD<br/>format dict
                Fmt-->>Prep: Formatted entry
            end
            Prep->>Output: Write per-dataset JSONL
            Output-->>Prep: Saved (dataset.jsonl)
        end
    end
    rect rgb(200, 255, 200)
        note over Prep,Output: Aggregation
        Prep->>Output: Concatenate all JSONL files<br/>→ test.jsonl
        Output-->>CLI: Final aggregated output
    end
```
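The aggregation step at the bottom of the diagram (per-dataset JSONL files concatenated into `test.jsonl`) can be sketched as follows. This is a hypothetical illustration, not the PR's actual code — the real `prepare.py` may differ in file naming and ordering:

```python
import json
from pathlib import Path


def aggregate_jsonl(output_dir: str, combined_name: str = "test.jsonl") -> int:
    """Concatenate every per-dataset *.jsonl in output_dir into one combined file.

    Returns the total number of entries written. Sketch only: the real
    prepare.py may use different names and iteration order.
    """
    out_dir = Path(output_dir)
    combined = out_dir / combined_name
    total = 0
    with combined.open("w", encoding="utf-8") as dst:
        for jsonl_file in sorted(out_dir.glob("*.jsonl")):
            if jsonl_file.name == combined_name:
                continue  # don't re-read our own output file
            with jsonl_file.open(encoding="utf-8") as src:
                for line in src:
                    json.loads(line)  # validate each entry is well-formed JSON
                    dst.write(line)
                    total += 1
    return total
```

Sorting the glob keeps dataset boundaries deterministic across runs, which matches the "preserving dataset boundaries" behavior noted in the review.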
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
nemo_skills/dataset/asr-leaderboard/prepare.py (2)
107-114: Consider more specific exception handling. The broad `Exception` catch may hide specific issues during dataset loading. While the current approach provides resilience, more targeted exception handling would improve debuggability. As per static analysis hints, consider catching specific exceptions:

```diff
     print(f"Loading {dataset_name} from {hf_dataset} (streaming={streaming})...")
     try:
         if hf_config:
             dataset = load_dataset(hf_dataset, hf_config, split=hf_split, trust_remote_code=True, streaming=streaming)
         else:
             dataset = load_dataset(hf_dataset, split=hf_split, trust_remote_code=True, streaming=streaming)
-    except Exception as e:
+    except (ValueError, OSError, RuntimeError) as e:
         print(f"Warning: Failed to load {dataset_name}: {e}")
         return 0
```
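If resilience against transient HuggingFace failures is still wanted while keeping the exception types narrow, one option is a small retry wrapper along these lines. This is a hypothetical sketch, not part of the PR; `load_with_retries` and its parameters are illustrative names:

```python
import time


def load_with_retries(loader, retries=3, delay=2.0,
                      exceptions=(ValueError, OSError, RuntimeError)):
    """Call `loader` (e.g. a functools.partial around load_dataset), retrying
    only on the listed exception types; anything else propagates immediately.
    Raises the last caught exception once all attempts are exhausted."""
    last = None
    for attempt in range(1, retries + 1):
        try:
            return loader()
        except exceptions as e:
            last = e
            print(f"Warning: attempt {attempt}/{retries} failed: {e}")
            if attempt < retries:
                time.sleep(delay)
    raise last
```

This keeps genuinely unexpected errors (e.g. a `TypeError` from a bad argument) loud, while tolerating the flaky network-level failures the broad catch was presumably guarding against.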
149-155: Consider using the spread operator for list construction. As per static analysis hints, the choices list can be constructed more cleanly using the spread operator.

Apply this diff:

```diff
     parser.add_argument(
         "--datasets",
         nargs="+",
         default=["all"],
-        choices=list(DATASET_CONFIGS.keys()) + ["all"],
+        choices=[*DATASET_CONFIGS.keys(), "all"],
         help="Datasets to prepare (default: all)",
     )
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/dataset/asr-leaderboard/__init__.py (1 hunks)
- nemo_skills/dataset/asr-leaderboard/prepare.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
nemo_skills/dataset/asr-leaderboard/prepare.py
102-102: Avoid specifying long messages outside the exception class
(TRY003)
112-112: Do not catch blind exception: Exception
(BLE001)
153-153: Consider [*list(DATASET_CONFIGS.keys()), "all"] instead of concatenation
Replace with [*list(DATASET_CONFIGS.keys()), "all"]
(RUF005)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/dataset/asr-leaderboard/__init__.py (1)
19-21: LGTM! Configuration constants are well-defined.The module constants appropriately configure the ASR Leaderboard evaluation pipeline with audio-specific settings and OpenAI prompt format.
nemo_skills/dataset/asr-leaderboard/prepare.py (3)
26-35: LGTM! Imports and constants are appropriate.The dependencies and threshold constants are well-suited for ASR dataset preparation.
38-47: LGTM! Dataset registry is well-structured.The configuration clearly maps dataset names to their HuggingFace identifiers with appropriate streaming flags for large datasets.
179-195: LGTM! Dataset aggregation logic is sound.The combined JSONL creation correctly concatenates per-dataset outputs while preserving dataset boundaries and providing clear progress logging.
```python
if with_audio:
    sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate)
```
🛠️ Refactor suggestion | 🟠 Major
Add error handling for audio file writes.
The sf.write call could fail due to disk space, permissions, or invalid audio data, potentially causing the entire dataset preparation to crash.
Apply this diff to add error handling:
```diff
 if with_audio:
-    sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate)
+    try:
+        sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate)
+    except Exception as e:
+        print(f"Warning: Failed to save audio for sample {sample_id}: {e}")
+        return None
```

🤖 Prompt for AI Agents
+ return None🤖 Prompt for AI Agents
In nemo_skills/dataset/asr-leaderboard/prepare.py around lines 76-77, wrap the
sf.write call in a try/except to prevent the whole preparation from crashing on
disk/permission/invalid-data errors: catch Exception, log the failure with
contextual details (full audio path, sampling_rate, audio_filename, and the
exception/traceback) using the module logger (create one if missing), and then
either skip this file (continue) or raise a controlled error with a clear
message depending on the surrounding flow; ensure no bare prints are used and
that the handler provides enough detail for debugging.
Since this is in the prepare stage, it's not a problem.
@Jorjeous, understood! You're right that for a data preparation script, failing fast on errors can be more appropriate than silently skipping problematic samples. It makes debugging easier during the setup phase.
LGTM
```python
entry.get("text", "")  # ami, LS, gigaspeech, tedlium
or entry.get("normalized_text", "")  # voxpopuli
or entry.get("transcript", "")  # spgispeech
or entry.get("transcription", "")  # earnings22
```
Just a quick question - is there a chance that some sets have more than one fitting field?
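One way to answer that empirically is to count how many of the candidate transcript fields are non-empty per entry and flag anything with more than one, since the `or`-chain silently prefers the first match. A hypothetical diagnostic sketch, not part of the PR:

```python
CANDIDATE_FIELDS = ("text", "normalized_text", "transcript", "transcription")


def populated_transcript_fields(entry: dict) -> list:
    """Return the candidate transcript fields that are non-empty in this entry."""
    return [f for f in CANDIDATE_FIELDS if entry.get(f)]


def find_ambiguous_entries(entries):
    """Yield (entry, fields) pairs where more than one transcript field is
    populated, i.e. where the or-chain's precedence actually matters."""
    for entry in entries:
        fields = populated_transcript_fields(entry)
        if len(fields) > 1:
            yield entry, fields
```

Running this over a small streamed sample of each dataset would show whether any set carries two populated fields whose contents disagree.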
Let's just run the integration tests to see if we need to include this in the ignored datasets. I am using those CI runners for some development today. I'll run the tests EOD and merge once complete.
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit