Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

The PR refactors the ASR evaluation pipeline by centralizing package dependencies into a requirements file, separating evaluation and generation arguments, enhancing audio preprocessing with filtering and metadata exposure, and restructuring the audio evaluator to use lazy-loaded normalization with unified return payloads.

Changes

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧹 Recent nitpick comments
📜 Recent review details

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (6)
🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-11-23T17:56:57.556Z, applied to files:
📚 Learning: 2025-12-12T16:09:53.870Z, applied to files:
🧬 Code graph analysis (2)

nemo_skills/evaluation/evaluator/compute_eval.py (1)
nemo_skills/evaluation/evaluator/audio.py (1)

🪛 Ruff (0.14.11)

nemo_skills/evaluation/evaluator/audio.py

177-182: Consider moving this statement to an else block (TRY300)

183-183: Do not catch blind exception: Exception (BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (12)
✏️ Tip: You can disable this entire section in your review settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- README.md
- docs/evaluation/index.md
- docs/evaluation/speech-audio.md
- docs/index.md
- nemo_skills/dataset/asr-leaderboard/__init__.py
- nemo_skills/dataset/asr-leaderboard/prepare.py
- nemo_skills/evaluation/evaluator/audio.py
- nemo_skills/pipeline/prepare_data.py
- tests/gpu-tests/test_eval.py
🧰 Additional context used
🪛 LanguageTool
docs/evaluation/speech-audio.md
[style] ~37-~37: Using many exclamation marks might seem excessive (in this case: 6 exclamation marks for a text that’s 2899 characters long)
Context: ...default** to ensure proper evaluation. !!! warning "Running without audio files" ...
(EN_EXCESSIVE_EXCLAMATION)
🪛 Ruff (0.14.10)
nemo_skills/dataset/asr-leaderboard/prepare.py
115-115: Avoid specifying long messages outside the exception class
(TRY003)
125-125: Do not catch blind exception: Exception
(BLE001)
171-171: Consider [*list(DATASET_CONFIGS.keys()), "all"] instead of concatenation
Replace with [*list(DATASET_CONFIGS.keys()), "all"]
(RUF005)
🔇 Additional comments (11)
tests/gpu-tests/test_eval.py (1)
46-46: LGTM! Appropriately excludes the new asr-leaderboard dataset from automated test coverage, consistent with the pattern for other datasets requiring explicit parameters or heavy preparation.
docs/index.md (1)
25-25: LGTM! Documentation correctly reflects the addition of asr-leaderboard as a new Speech & Audio benchmark alongside mmau-pro.
README.md (1)
22-22: LGTM! Documentation update is consistent with the corresponding changes in docs/index.md.
docs/evaluation/index.md (1)
13-13: LGTM! Documentation update is consistent with the broader changes introducing asr-leaderboard support.
nemo_skills/dataset/asr-leaderboard/__init__.py (1)
19-21: LGTM! Configuration constants follow the standard pattern used by other datasets in the codebase. The settings appropriately specify the dataset group, metrics type, and default generation arguments for ASR Leaderboard evaluation.
nemo_skills/dataset/asr-leaderboard/prepare.py (3)
112-162: LGTM! The dataset preparation logic is well-structured with appropriate error handling, filtering for invalid content (short audio, non-speech, specific speaker IDs), and clear logging. The bare exception catch at line 125 is acceptable for gracefully handling dataset loading failures.
165-218: LGTM! The CLI implementation correctly handles dataset selection, audio saving options, and combines individual dataset JSONL files into a unified test.jsonl for evaluation.
63-109: Path format is consistent with documentation; verification recommended for the deployment environment only. The hardcoded audio path at line 93 matches the module docstring (line 18) and is correctly designed for the /dataset mount point documented in speech-audio.md. However, note that the ASR_LEADERBOARD evaluator only compares text transcriptions and does not load audio from this path. If the inference model needs to load audio files, ensure the deployment environment mounts data at /dataset/asr-leaderboard/data/ as specified.

nemo_skills/pipeline/prepare_data.py (1)
34-34: LGTM!Appropriately adds asr-leaderboard to the list of datasets requiring a data directory, consistent with its audio file handling requirements and the pattern used for other large datasets like mmau-pro.
docs/evaluation/speech-audio.md (2)
10-19: No issues found; dataset configuration aligns with implementation. All 8 datasets (librispeech_clean, librispeech_other, voxpopuli, tedlium, gigaspeech, spgispeech, earnings22, ami) are correctly defined in nemo_skills/dataset/asr-leaderboard/prepare.py with matching names and order. The __init__.py file is properly configured with audio metrics and WER evaluation as documented.
1-256: All referenced implementation files verified as existing and correctly configured. The documentation accurately references:

- nemo_skills/dataset/asr-leaderboard/__init__.py and prepare.py exist and define the 8 datasets (librispeech_clean, librispeech_other, voxpopuli, tedlium, gigaspeech, spgispeech, earnings22, ami) exactly as documented
- nemo_skills/evaluation/evaluator/audio.py exists
- nemo_skills/pipeline/prepare_data.py exists
- MMAU-Pro structure correctly implements the three benchmark categories (closed_form, open_ended, instruction_following) as documented
Force-pushed from b41e2db to 2db4594 (Compare)
Force-pushed from 2db4594 to d497d8b (Compare)
Greptile Summary

This PR refactors the ASR Leaderboard evaluation to use task_type="ASR" (previously ASR_LEADERBOARD).

Key improvements:

Most previously identified issues have been addressed or acknowledged by the development team.

Confidence Score: 4/5

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HFDataset as HuggingFace Dataset
    participant AudioEval as audio.py
    participant WhisperNorm as whisper-normalizer
    participant Jiwer as jiwer
    User->>PrepareScript: Run dataset preparation
    PrepareScript->>HFDataset: Load dataset (librispeech, voxpopuli, etc.)
    HFDataset-->>PrepareScript: Return audio + transcription entries
    loop For each entry
        PrepareScript->>PrepareScript: Check audio duration > 0.1s
        PrepareScript->>PrepareScript: Filter speaker_id in SKIP_SPEAKER_IDS
        PrepareScript->>PrepareScript: Filter non-speech tokens (SIL, MUSIC, etc.)
        PrepareScript->>PrepareScript: Validate audio array not empty
        PrepareScript->>PrepareScript: Save audio file + create JSONL entry
        Note over PrepareScript: task_type="ASR" (was ASR_LEADERBOARD)
    end
    User->>AudioEval: Evaluate ASR predictions
    AudioEval->>AudioEval: Check task_type="ASR"
    AudioEval->>AudioEval: evaluate_asr(reference, hypothesis)
    AudioEval->>WhisperNorm: preprocess_asr_text(reference)
    WhisperNorm-->>AudioEval: Normalized reference text
    AudioEval->>WhisperNorm: preprocess_asr_text(hypothesis)
    WhisperNorm-->>AudioEval: Normalized hypothesis text
    AudioEval->>AudioEval: Replace empty strings with "empty"
    AudioEval->>Jiwer: Calculate WER
    Jiwer-->>AudioEval: WER score
    AudioEval-->>User: Return {wer, is_correct, text, pred_text}
```
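The WER step at the end of the diagram boils down to a word-level edit distance. A minimal pure-Python sketch of the metric that jiwer computes (illustrative only, not the library's implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word tokens via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One insertion against a 3-word reference: WER of 1/3
print(word_error_rate("the cat sat", "the cat sat on"))
```

Note the division by the reference length: an empty reference would divide by zero, which is one reason the pipeline substitutes a placeholder for empty strings before scoring.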
Additional Comments (2)
Force-pushed from d497d8b to f42dd55 (Compare)
Force-pushed from aa825fe to b49281b (Compare)
Force-pushed from b49281b to 2d67aa4 (Compare)
Additional Comments (1)
Consider sanitizing the sample_id:
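A sketch of what the suggested sample_id sanitization might look like (hypothetical helper; the actual field format is not shown in this thread):

```python
import re

def sanitize_sample_id(sample_id: str) -> str:
    """Replace filesystem-unsafe characters so the id can be used in file names."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", sample_id)

print(sanitize_sample_id("spk1/utt 42?.wav"))  # slashes, spaces, '?' become '_'
```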
```python
# Store normalized texts before empty substitution
text = ref
pred_text = hyp
```
Potential issue with storing empty normalized text. The code stores the normalized text before replacing empty strings with "empty", which means text and pred_text could be empty strings in the returned dictionary.
This could cause issues downstream if code expects non-empty strings. The pattern is inconsistent with other evaluation functions in this file:
- evaluate_asr_pc() (lines 122-123) stores the normalized text BEFORE empty substitution
- evaluate_translation() (lines 171-172) stores the original .strip() text
- evaluate_cer() (lines 201-202) stores the original text without any empty handling
The inconsistency suggests this might not be intentional. If empty strings in text/pred_text fields are acceptable, this is fine. Otherwise, consider storing after the empty substitution or storing the original text like other functions do.
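For illustration, the substitution pattern under discussion can be sketched as follows (hypothetical helper name; that the downstream scorer needs non-empty references is an assumption drawn from the pattern's purpose):

```python
def substitute_empty(text: str, placeholder: str = "empty") -> str:
    """Replace an empty or whitespace-only string with a placeholder token
    so the downstream WER computation always sees a non-empty string."""
    return text if text.strip() else placeholder

print(substitute_empty("hello world"))  # unchanged
print(substitute_empty(""))             # "empty"
```

Storing text/pred_text after this call (rather than before) would guarantee non-empty values in the returned dictionary.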
Jorjeous left a comment
version of jiwer
other LGTM, or will be overwritten later
Force-pushed from 2d67aa4 to 722e79c (Compare)
```python
@lru_cache(maxsize=1)
def _get_english_normalizer():
    """Lazily initialize and cache the English text normalizer."""
    from whisper_normalizer.english import EnglishTextNormalizer

    return EnglishTextNormalizer()
```
Import inside a function with lru_cache could cause issues: if the import fails after the first call, the cached normalizer becomes stale. Move the import to module level or inside the returned function.
That's a problem to be solved by the user.
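As background for the thread above, a stdlib-only sketch of how lru_cache(maxsize=1) memoizes a lazily constructed object (a generic example, not the repository's normalizer):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_expensive_object():
    """Build the object on the first call; later calls return the cached instance."""
    print("initializing")  # executed only once
    return object()

a = get_expensive_object()
b = get_expensive_object()
print(a is b)  # True: both names refer to the cached instance
```

Note that lru_cache only caches successful returns: a call that raises (for example, a failed import) stores nothing, so the next call re-executes the function body.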
```python
speaker_id = entry.get("speaker_id", "")
expected = formatted["expected_answer"]
if expected and speaker_id not in SKIP_SPEAKER_IDS and not is_nonspeech_only(expected):
```
The logic checks expected for truthiness, but is_nonspeech_only() already handles empty strings correctly (it returns False for empty), so the expected check is redundant for the is_nonspeech_only() branch.
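A hedged sketch of the logic in question, assuming is_nonspeech_only treats a transcript as non-speech only when every token is a marker such as SIL or MUSIC (the real token set and helper live in prepare.py and may differ):

```python
NONSPEECH_TOKENS = {"SIL", "MUSIC", "NOISE"}  # assumed marker set, for illustration

def is_nonspeech_only(text: str) -> bool:
    """True only when the transcript is non-empty and every token is a marker."""
    tokens = text.split()
    return bool(tokens) and all(t.strip("<>[]") in NONSPEECH_TOKENS for t in tokens)

# An empty string returns False, which is why the separate truthiness
# check on expected is redundant for this branch.
print(is_nonspeech_only(""))             # False
print(is_nonspeech_only("<SIL> <SIL>"))  # True
print(is_nonspeech_only("hello world"))  # False
```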
Additional Comments (2)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Force-pushed from 7285847 to 98ffac2 (Compare)
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Force-pushed from 98ffac2 to b262889 (Compare)
Greptile found no issues! From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
@coderabbitai help
Chat

There are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
Status, support, documentation and community
@coderabbitai full review
✅ Actions performed

Full review triggered.
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
New Features
Improvements
✏️ Tip: You can customize this high-level summary in your review settings.