Conversation
📝 Walkthrough

This pull request introduces comprehensive audio evaluation capabilities: a new audio evaluator module with task-specific evaluation functions (ASR, ASR-PC, translation, CER, hallucination detection, punctuation/capitalization analysis), an AudioMetrics class for aggregating evaluation metrics, and registration of these components in the existing evaluator and metrics registries.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant eval_audio as eval_audio()
    participant evaluate_sample as evaluate_sample()
    participant TaskEval as Task-Specific<br/>Evaluators
    participant FileOps as File I/O
    User->>eval_audio: cfg (AudioEvaluatorConfig)
    activate eval_audio
    eval_audio->>FileOps: Read JSONL input
    activate FileOps
    FileOps-->>eval_audio: sample[]
    deactivate FileOps
    loop For each sample
        eval_audio->>evaluate_sample: sample, config
        activate evaluate_sample
        evaluate_sample->>evaluate_sample: Inspect task_type
        alt Task: ASR
            evaluate_sample->>TaskEval: evaluate_asr(ref, hyp)
        else Task: ASR-PC
            evaluate_sample->>TaskEval: evaluate_asr_pc(ref, hyp)
        else Task: Translation
            evaluate_sample->>TaskEval: evaluate_translation(ref, hyp)
        else Task: CER
            evaluate_sample->>TaskEval: evaluate_cer(ref, hyp)
        else Task: Hallucination
            evaluate_sample->>TaskEval: evaluate_hallucination(ref, hyp, context)
        else Task: PC-Rate
            evaluate_sample->>TaskEval: evaluate_pc_rate(ref, hyp)
        end
        activate TaskEval
        TaskEval-->>evaluate_sample: metrics dict
        deactivate TaskEval
        evaluate_sample->>evaluate_sample: Augment sample<br/>(is_correct, char_rate, etc.)
        evaluate_sample-->>eval_audio: enriched sample
        deactivate evaluate_sample
    end
    eval_audio->>FileOps: Write results to JSONL
    deactivate eval_audio
    FileOps-->>User: Updated file
```
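The per-sample dispatch the diagram describes can be sketched as a table of task-specific evaluators. The snippet below is a toy stand-in: the evaluator bodies use exact match and word overlap instead of real WER/BLEU, and the signatures only mirror the diagram, not the actual module code.

```python
# Toy sketch of the per-sample dispatch shown in the diagram; the
# evaluator bodies are stand-ins, not real WER/BLEU computations.
def evaluate_asr(ref, hyp):
    return {"is_correct": ref.strip().lower() == hyp.strip().lower()}


def evaluate_translation(ref, hyp):
    overlap = len(set(ref.split()) & set(hyp.split()))
    return {"bleu_proxy": overlap / max(len(ref.split()), 1)}


TASK_EVALUATORS = {
    "asr": evaluate_asr,
    "translation": evaluate_translation,
}


def evaluate_sample(sample):
    evaluator = TASK_EVALUATORS.get(sample["task_type"])
    if evaluator is None:
        # mirrors the diagram's fallback for unknown task types
        return {**sample, "error": f"unknown task_type: {sample['task_type']}"}
    metrics = evaluator(sample["reference"], sample["generation"])
    return {**sample, **metrics}  # enriched sample, as in the diagram


sample = {"task_type": "asr", "reference": "hello world", "generation": "Hello world"}
print(evaluate_sample(sample)["is_correct"])  # True
```

A dict-based dispatch like this keeps adding a new task type to a one-line registration, which is the extensibility point the PR description emphasizes.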
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
🧹 Nitpick comments (5)
nemo_skills/evaluation/metrics/audio_metrics.py (3)

43-76: Consider adding a `reset` method to clear metric score lists.

The `__init__` initializes various score lists (`wer_scores`, `cer_scores`, etc.), but there's no `reset` method to clear them. If `BaseMetrics.reset()` is called, only the parent's state is reset while these lists retain stale data.

Consider adding:

```diff
+    def reset(self):
+        """Reset all metric tracking lists."""
+        super().reset()
+        self.wer_scores = []
+        self.wer_c_scores = []
+        self.wer_pc_scores = []
+        self.per_scores = []
+        self.bleu_scores = []
+        self.cer_scores = []
+        self.hallucination_scores = []
+        self.pc_rate_scores = []
+        self.punct_f1_scores = []
+        self.cap_accuracy_scores = []
+        self.char_rate_scores = []
```
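To illustrate the failure mode behind this suggestion, here is a toy base/subclass pair; the `BaseMetrics` below is a minimal hypothetical stand-in, not the real nemo_skills class.

```python
# Toy illustration of the stale-state issue; BaseMetrics here is a
# hypothetical stand-in, not the real nemo_skills class.
class BaseMetrics:
    def __init__(self):
        self.total = 0

    def reset(self):
        self.total = 0


class AudioMetricsNoReset(BaseMetrics):
    def __init__(self):
        super().__init__()
        self.wer_scores = []


class AudioMetricsWithReset(AudioMetricsNoReset):
    def reset(self):
        super().reset()        # clears the parent's state
        self.wer_scores = []   # clears this class's lists too


m = AudioMetricsNoReset()
m.wer_scores.append(0.12)
m.reset()                 # only BaseMetrics.reset runs
print(m.wer_scores)       # [0.12] -- stale score survives the reset

m2 = AudioMetricsWithReset()
m2.wer_scores.append(0.12)
m2.reset()
print(m2.wer_scores)      # []
```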
77-95: Consider moving the `re` import to module level.

The `re` module is imported inside this method, but `re` is already used at the module level in other parts of the codebase. Since this is a standard library module with minimal overhead, importing at module level improves clarity.

```diff
 import logging
+import re

 from nemo_skills.evaluation.metrics.base import BaseMetrics, as_int, as_percentage
```

Then remove the local import:

```diff
-        import re
-
         if re.search(r"\byes\b", judgement_text, re.IGNORECASE):
```
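The word-boundary pattern in that judgement check is worth a quick illustration: `\byes\b` matches "yes" only as a standalone word, so substrings like "eyes" or "yesterday" do not count as a positive judgement. A minimal sketch (the helper name `judged_yes` is hypothetical):

```python
import re

# \byes\b needs word boundaries on both sides, so substrings such as
# "eyes" or "yesterday" do not trigger a positive judgement.
def judged_yes(judgement_text: str) -> bool:  # hypothetical helper name
    return re.search(r"\byes\b", judgement_text, re.IGNORECASE) is not None


print(judged_yes("Yes, this is a hallucination."))  # True
print(judged_yes("The eyes have it."))              # False
print(judged_yes("yesterday"))                      # False
```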
346-359: Rename unused loop variable per linter recommendation.

The `benchmark_name` variable is not used in the loop body.

```diff
-    for benchmark_name, benchmark_data in benchmarks.items():
+    for _benchmark_name, benchmark_data in benchmarks.items():
```

nemo_skills/evaluation/evaluator/audio.py (2)
190-209: Avoid catching broad `Exception`.

Catching all exceptions masks specific errors (e.g., import failures vs. computation errors). Consider catching more specific exceptions from `sacrebleu`.

```diff
     try:
         import sacrebleu

         ref = [reference.strip()]
         hyp = hypothesis.strip()
         bleu = sacrebleu.sentence_bleu(hyp, ref)
         bleu_score = bleu.score / 100.0
         return {
             "bleu": bleu_score,
             "is_correct": bleu_score > 0.3,
         }
-    except Exception as e:
+    except (ImportError, ValueError, TypeError) as e:
         return {
             "bleu": 0.0,
             "is_correct": False,
             "error": str(e),
         }
```
302-316: In-place file overwrite risks data loss on failure.

If the process crashes during the write phase (lines 312-314), the original data could be corrupted. Consider writing to a temporary file first, then atomically renaming.

```diff
+import os
+import tempfile
+
 with open(jsonl_file, "rt", encoding="utf-8") as fin:
     data = [json.loads(line) for line in fin]
 ...
-with open(jsonl_file, "wt", encoding="utf-8") as fout:
-    for sample in data:
-        fout.write(json.dumps(sample) + "\n")
+# Write to a temp file first, then rename atomically
+dirname = os.path.dirname(jsonl_file) or "."
+with tempfile.NamedTemporaryFile(mode="wt", dir=dirname, delete=False, suffix=".tmp") as fout:
+    for sample in data:
+        fout.write(json.dumps(sample) + "\n")
+    temp_path = fout.name
+os.replace(temp_path, jsonl_file)
```
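The temp-file-then-rename pattern suggested above can also be packaged as a standalone helper. The sketch below is one way to get the atomic swap with `tempfile.mkstemp` plus `os.replace`; the function name `rewrite_jsonl_atomically` is hypothetical, not part of the PR.

```python
import json
import os
import tempfile


def rewrite_jsonl_atomically(path, samples):
    """Write samples to a temp file in the same directory, then swap it
    over the target with os.replace(); the original file is never left
    half-written, even if the process dies mid-write."""
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wt", encoding="utf-8") as fout:
            for sample in samples:
                fout.write(json.dumps(sample) + "\n")
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise


demo_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
rewrite_jsonl_atomically(demo_path, [{"id": 1, "is_correct": True}])
print(open(demo_path, encoding="utf-8").read().strip())  # {"id": 1, "is_correct": true}
```

Creating the temp file in the same directory as the target matters: `os.replace` is only guaranteed atomic within a filesystem, and the default temp directory may live on a different one.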
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- README.md (1 hunks)
- nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
- nemo_skills/evaluation/evaluator/audio.py (1 hunks)
- nemo_skills/evaluation/metrics/audio_metrics.py (1 hunks)
- nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-02T21:26:17.342Z
Learnt from: CR
Repo: NVIDIA-NeMo/Skills PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-02T21:26:17.342Z
Learning: Follow the existing code style and conventions in the Nemo-Skills project
Applied to files:
README.md
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)

- nemo_skills/evaluation/evaluator/audio.py (1)
  - eval_audio (293-316)

nemo_skills/evaluation/metrics/map_metrics.py (1)

- nemo_skills/evaluation/metrics/audio_metrics.py (1)
  - AudioMetrics (43-306)

nemo_skills/evaluation/metrics/audio_metrics.py (2)

- nemo_skills/evaluation/metrics/base.py (1)
  - BaseMetrics (23-434)
- nemo_skills/utils.py (1)
  - get_logger_name (39-43)

nemo_skills/evaluation/evaluator/audio.py (2)

- nemo_skills/utils.py (2)
  - get_logger_name (39-43)
  - nested_dataclass (69-102)
- nemo_skills/evaluation/metrics/audio_metrics.py (1)
  - update (157-198)
🪛 Ruff (0.14.8)
nemo_skills/evaluation/metrics/audio_metrics.py
211-211: Loop control variable agg_mode not used within loop body
Rename unused agg_mode to _agg_mode
(B007)
346-346: Loop control variable benchmark_name not used within loop body
Rename unused benchmark_name to _benchmark_name
(B007)
nemo_skills/evaluation/evaluator/audio.py
200-203: Consider moving this statement to an else block
(TRY300)
204-204: Do not catch blind exception: Exception
(BLE001)
223-223: Unused function argument: reference
(ARG001)
223-223: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
277-277: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (8)
README.md (1)
5-5: Minor formatting change looks fine.

This is just an extra blank line addition with no impact on content.
nemo_skills/evaluation/evaluator/__init__.py (2)
18-18: Import follows existing conventions.

The import of `eval_audio` is correctly placed and follows the existing pattern for function-based evaluators.
60-61: Registration in EVALUATOR_MAP looks correct.

The audio evaluator is properly registered with a descriptive comment, consistent with other entries in the map.
nemo_skills/evaluation/metrics/map_metrics.py (2)
22-22: Import correctly placed.

The `AudioMetrics` import maintains alphabetical ordering with other metrics imports.

51-51: Metrics registration is consistent.

The "audio" key correctly maps to `AudioMetrics` and aligns with the evaluator registration in `EVALUATOR_MAP`.

nemo_skills/evaluation/evaluator/audio.py (3)
53-87: PER calculation logic looks correct.

The Punctuation Error Rate implementation using dynamic programming for edit distance is well-structured. The handling of edge cases (both empty punctuation sequences) is appropriate.

319-394: Sample evaluation dispatch logic is well-structured.

The function handles multiple task types cleanly with appropriate fallbacks for missing generation and unknown task types. The additional char_rate metrics provide useful diagnostic information.

124-143: Preprocessing functions are well-implemented.

The lazy import of `EnglishTextNormalizer` inside `preprocess_asr_text` is appropriate since `whisper` may be an optional dependency. Both normalization functions follow consistent patterns.
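The edit-distance dynamic program that WER/PER-style metrics build on can be sketched as follows. This is the textbook Levenshtein recurrence, not the module's actual implementation, and the empty-reference convention shown is one common choice among several.

```python
def edit_distance(ref, hyp):
    """Textbook Levenshtein DP over token sequences -- the recurrence
    that WER/PER-style error rates are built on."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete everything
    for j in range(n + 1):
        dp[0][j] = j  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution / match
            )
    return dp[m][n]


def error_rate(ref_tokens, hyp_tokens):
    # Convention for an empty reference varies; here both-empty -> 0.0.
    if not ref_tokens:
        return 0.0 if not hyp_tokens else 1.0
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


print(edit_distance(list("kitten"), list("sitting")))  # 3
print(round(error_rate("the cat sat".split(), "the cat sit".split()), 3))  # 0.333
```

Running the same recurrence over words gives WER, over characters gives CER, and over extracted punctuation sequences gives the PER the comment above refers to.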
gwarmstrong
left a comment
Overall this looks pretty good; just one note: please try to use the BaseEvaluator. There is an open issue to convert other evaluations (#829), but I think it will be helpful to have it here from the start, if possible.
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
…ize_robustness generic for more benchmarks, update docstrings. (#1079) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
- Make AudioEvaluatorConfig inherit from BaseEvaluatorConfig
- Create AudioEvaluator class with async eval_single() method
- Refactor evaluate_sample() to return updates dict instead of modified sample
- Register AudioEvaluator in EVALUATOR_CLASS_MAP
- Keep eval_audio() function for backward compatibility
- Enables interleaved evaluation with inference for better GPU utilization

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Ivan <imoshkov@nvidia.com>
Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
This PR is part of the bulk improvements proposed in #1072.
This one unifies all audio-related metrics.
All audio-related tasks can now be evaluated with a single evaluator, making it easier to add new datasets.
Summary by CodeRabbit