
Option to pass a template to format input#883

Open
wasiahmad wants to merge 21 commits into main from apply_input_template

Conversation

@wasiahmad
Collaborator

@wasiahmad wasiahmad commented Oct 3, 2025

Summary by CodeRabbit

  • New Features

    • Optional input templating for SFT datasets: supply a YAML template to produce a consolidated formatted_input used during preprocessing and message construction for train/validation.
  • Bug Fixes

    • ASR evaluation now uses unified Whisper-style text normalization for more consistent WER results; ensure samples are marked as ASR where applicable.
  • Chores

    • Simplified default ASR evaluation/generation flags for clearer behavior.
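The normalization bullet above is easier to follow with a concrete (and deliberately crude) example. The PR relies on Whisper-style English text normalization; the function below is only a rough, hypothetical stand-in that lowercases and strips punctuation, to illustrate why normalizing both reference and hypothesis stabilizes WER — it is not the normalizer the evaluator actually uses.

```python
import re

def normalize(text: str) -> str:
    # Crude approximation: lowercase, drop punctuation, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())

ref = "Hello, world!"
hyp = "hello world"
# Without normalization these strings would count as a word error;
# after normalization they compare equal.
print(normalize(ref) == normalize(hyp))
```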

Note

Adds optional YAML-driven input templating to combine multiple fields into a formatted_input used for message construction, configurable via data.input_template_path.

  • Data preprocessing (PromptResponseDataset):
    • Optional templating: load YAML via load_prompt_config and render formatted_input from multiple fields (semicolon-separated input_key). Requires a user key in the template.
    • Message construction now uses templated formatted_input when provided; add_messages_key updated to accept input_key parameter.
    • New apply_input_template map step applied before message creation; asserts no pre-existing messages.
  • Configuration wiring:
    • data.input_template_path accepted and passed into dataset setup.
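The flow described in the note can be sketched end to end. Only the semicolon-separated input_key convention and the formatted_input column name are taken from the PR; the template string and the field names (question, context) are illustrative assumptions.

```python
# Stand-in for the "user" field loaded from the YAML template file.
input_template = "Question: {question}\nContext: {context}"
input_key = "question;context"  # semicolon-separated, per the PR

# A batched examples dict, as HuggingFace Dataset.map(batched=True) provides.
examples = {
    "question": ["What is WER?", "Define SFT."],
    "context": ["ASR metrics", "LLM training"],
}

keys = [k.strip() for k in input_key.split(";")]
examples["formatted_input"] = [
    input_template.format(**{k: examples[k][i] for k in keys})
    for i in range(len(examples[keys[0]]))
]
print(examples["formatted_input"][0])
```

Downstream, message construction would then read from "formatted_input" instead of the raw input_key.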

Written by Cursor Bugbot for commit 2353efe. This will update automatically on new commits. Configure here.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@coderabbitai
Contributor

coderabbitai bot commented Oct 3, 2025

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds optional input templating to SFT dataset processing: PromptResponseDataset accepts an input_template_path, loads a YAML template when provided, generates a formatted_input via apply_input_template, switches the effective input_key to "formatted_input" for downstream processing, and setup_data forwards the template path. (48 words)

Changes

Cohort / File(s) Summary
SFT input templating support
nemo_skills/training/nemo_rl/start_sft.py
Added input_template_path to PromptResponseDataset.__init__; loads YAML prompt template and validates it. New public apply_input_template renders formatted_input from one or more input keys (semicolon-separated). load_or_process_split now applies the template when present and sets current_input_key="formatted_input". add_messages_key signature changed to accept input_key and uses it for message construction. setup_data forwards input_template_path when creating PromptResponseDataset.
ASR leaderboard defaults
nemo_skills/dataset/asr-leaderboard/__init__.py
Updated evaluator description to Whisper-style normalization and clarified requirement for task_type="ASR". Reduced default args: simplified EVAL_ARGS to eval_type=audio; removed normalization/audio flags and set GENERATION_ARGS prompt_format=openai.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Trainer as setup_data
  participant PRD as PromptResponseDataset
  participant DS as Dataset.map

  User->>Trainer: call setup_data(..., input_template_path)
  Trainer->>PRD: PromptResponseDataset(..., input_template_path)
  alt input_template provided
    PRD->>PRD: load_prompt_config (YAML) -> input_template
    PRD->>DS: map(apply_input_template)
    DS-->>PRD: examples with `formatted_input`
    PRD->>PRD: set current_input_key = "formatted_input"
  else no template
    PRD->>PRD: keep configured input_key
  end
  PRD->>DS: map(add_messages_key, fn_kwargs={"input_key": current_input_key})
  DS-->>PRD: dataset with messages
  PRD-->>Trainer: return prepared dataset

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • gwarmstrong
  • Jorjeous
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Option to pass a template to format input' accurately summarizes the main change: adding an input_template_path parameter to enable optional YAML-driven input templating.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch apply_input_template

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@wasiahmad wasiahmad marked this pull request as draft October 3, 2025 04:25
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c6edcf2 and ec7a4f6.

📒 Files selected for processing (1)
  • nemo_skills/training/nemo_rl/start_sft.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (2)
nemo_skills/training/nemo_rl/start_sft.py (2)

89-89: LGTM!

The parameter addition maintains backward compatibility with an appropriate default value.


259-259: LGTM!

Safe retrieval of optional configuration parameter with appropriate default.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@wasiahmad wasiahmad marked this pull request as ready for review October 3, 2025 05:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
nemo_skills/training/nemo_rl/start_sft.py (2)

143-143: Add descriptive assertion message.

The assertion lacks an explanatory message. When it fails, users won't understand why template application is incompatible with message-formatted data.

Apply this diff:

-        assert "messages" not in dataset.column_names
+        assert "messages" not in dataset.column_names, (
+            "Cannot apply input_template to datasets that already have 'messages' format. "
+            "Input templates are only supported for input/output format datasets."
+        )

181-186: Add error handling, validation, and documentation.

The method has several issues identified in past reviews that remain unaddressed:

  1. Undocumented multi-key feature: Line 182 splits input_key by comma but this isn't documented.
  2. Missing validation: No check that keys exist in examples or that batch lengths are consistent.
  3. Missing error handling: Lines 183-185 can raise KeyError, IndexError, or ValueError without context.

Apply this diff to add comprehensive error handling:

 def apply_input_template(self, examples: dict[str, list[Any]]) -> dict[str, list[str]]:
+    """Apply input template to examples, supporting comma-separated input keys.
+    
+    Args:
+        examples: Batched examples dict with keys specified in self.input_key
+        
+    Returns:
+        Dict with "formatted_input" key containing formatted strings
+        
+    Raises:
+        KeyError: If template references keys not in examples
+        ValueError: If examples are empty or template format is invalid
+    """
     keys = [k.strip() for k in self.input_key.split(",")]
+    
+    # Validate all keys exist
+    missing_keys = [k for k in keys if k not in examples]
+    if missing_keys:
+        raise KeyError(
+            f"Template references missing keys: {missing_keys}. "
+            f"Available keys: {list(examples.keys())}"
+        )
+    
+    # Validate non-empty
+    if not examples[keys[0]]:
+        return {"formatted_input": []}
+    
+    try:
-        examples["formatted_input"] = [
-            self.input_template.format(**{k: examples[k][i] for k in keys}) for i in range(len(examples[keys[0]]))
-        ]
+            examples["formatted_input"] = [
+                self.input_template.format(**{k: examples[k][i] for k in keys}) 
+                for i in range(len(examples[keys[0]]))
+            ]
+    except (KeyError, ValueError, IndexError) as e:
+        raise ValueError(
+            f"Failed to apply template: {e}. "
+            f"Template: {self.input_template[:100]}..."
+        ) from e
 
     return examples
🧹 Nitpick comments (2)
nemo_skills/training/nemo_rl/start_sft.py (2)

103-109: LGTM! Template loading is well-implemented.

The error handling for template file operations is comprehensive, including proper encoding specification and validation of the required "user" key in the YAML structure.

Optional: Consider adding a check for empty template values to catch configuration errors earlier:

                 if "user" not in data:
                     raise KeyError(f"'user' key is missing in the YAML file: {input_template_path}")
                 self.input_template = data["user"]
+                if not self.input_template or not self.input_template.strip():
+                    raise ValueError(f"Template 'user' field is empty in: {input_template_path}")

168-179: LGTM! Critical bug fixed.

The method signature change correctly addresses the critical self.input_key mutation bug identified in past reviews by accepting input_key as a parameter.

Consider adding strict=True to zip() for data integrity.

The zip() call on line 177 should use strict=True (Python 3.10+) to catch mismatched input/output list lengths, which could indicate data quality issues.

Apply this diff:

             "messages": [
                 [
                     {"role": "user", "content": input_},
                     {"role": "assistant", "content": output},
                 ]
-                for input_, output in zip(examples[input_key], examples[self.output_key])
+                for input_, output in zip(examples[input_key], examples[self.output_key], strict=True)
             ]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec7a4f6 and a08fcdd.

📒 Files selected for processing (1)
  • nemo_skills/training/nemo_rl/start_sft.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.13.2)
nemo_skills/training/nemo_rl/start_sft.py

108-108: Avoid specifying long messages outside the exception class

(TRY003)


177-177: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🔇 Additional comments (1)
nemo_skills/training/nemo_rl/start_sft.py (1)

266-266: LGTM!

The propagation of input_template_path from configuration to PromptResponseDataset is implemented correctly with appropriate default handling.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
nemo_skills/training/nemo_rl/start_sft.py (1)

181-186: Previous review comment still applies.

The concerns raised in the previous review regarding missing error handling, undocumented multi-key support (line 182 now uses semicolon separator), and lack of validation remain valid.

🧹 Nitpick comments (2)
nemo_skills/training/nemo_rl/start_sft.py (2)

103-109: Add error handling for YAML parsing failures.

While the code checks for the 'user' key, it doesn't handle potential YAML parsing errors or empty template content. If the YAML file is malformed or the template is empty, users will encounter unclear errors during dataset processing.

Apply this diff to add validation:

     self.input_template = None
     if input_template_path:
-        with open(input_template_path, "rt", encoding="utf-8") as fin:
-            data = yaml.safe_load(fin)
-            if "user" not in data:
-                raise KeyError(f"'user' key is missing in the YAML file: {input_template_path}")
-            self.input_template = data["user"]
+        try:
+            with open(input_template_path, "rt", encoding="utf-8") as fin:
+                data = yaml.safe_load(fin)
+                if not data or not isinstance(data, dict):
+                    raise ValueError(f"Template file must contain a YAML dictionary: {input_template_path}")
+                if "user" not in data:
+                    raise KeyError(f"'user' key is missing in the YAML file: {input_template_path}")
+                self.input_template = data["user"]
+                if not self.input_template or not self.input_template.strip():
+                    raise ValueError(f"Template 'user' field is empty: {input_template_path}")
+        except (FileNotFoundError, yaml.YAMLError) as e:
+            raise ValueError(f"Failed to load template file {input_template_path}: {e}")

168-170: LGTM: Signature change is correct.

The addition of the input_key parameter is necessary and correctly enables dynamic input key selection when templates are used.

Optional: Consider adding strict=True to the zip() call at line 177 to ensure batch consistency:

-                for input_, output in zip(examples[input_key], examples[self.output_key])
+                for input_, output in zip(examples[input_key], examples[self.output_key], strict=True)

This provides an extra safeguard, though HuggingFace datasets batching already ensures equal-length lists.

Also applies to: 177-177

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a08fcdd and 545f09e.

📒 Files selected for processing (1)
  • nemo_skills/training/nemo_rl/start_sft.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.13.2)
nemo_skills/training/nemo_rl/start_sft.py

108-108: Avoid specifying long messages outside the exception class

(TRY003)


177-177: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (3)
nemo_skills/training/nemo_rl/start_sft.py (3)

25-25: LGTM!

The yaml import is necessary for loading the template file and is used correctly at line 106.


141-149: LGTM: Critical mutation issue resolved.

The introduction of the current_input_key local variable correctly addresses the previous critical issue where self.input_key was mutated during processing, which would have caused incorrect behavior with validation datasets.

Note: The assertion at line 143 still lacks a descriptive message, as flagged in a previous review comment.


266-266: LGTM!

The parameter forwarding correctly passes the input_template_path from the data configuration to the PromptResponseDataset constructor with a safe default.

output_key: str = "output",
num_proc: int | None = None,
force_reprocess: bool = False,
input_template_path: str | None = None,
Collaborator


also wouldn't this fail for typical cases when e.g. you have keys like problem and generation in your file, but the logic for setting defaults checks for input / output or messages? So that part should also be updated

Collaborator Author


@Kipok I didn't understand this. When a user passes input_template_path, their provided input_key field must match the keys present in the input_template. Why would having keys like problem and generation cause a failure?

Collaborator


I mean the logic in detect_data_format function
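To make the concern concrete: the real detect_data_format lives elsewhere in the repo, but a toy version of a detector that only recognizes canonical keys illustrates why custom keys like problem and generation could fall through the default-setting logic. This is a hypothetical illustration, not the actual function.

```python
def detect_data_format(columns):
    # Toy detector: only recognizes the canonical key sets.
    if "messages" in columns:
        return "messages"
    if "input" in columns and "output" in columns:
        return "input_output"
    return None  # custom keys are not recognized

fmt = detect_data_format(["problem", "generation"])
print(fmt)
```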

@wasiahmad wasiahmad enabled auto-merge (squash) October 4, 2025 17:40
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/training/nemo_rl/start_sft.py (1)

181-186: Add error handling and document multi-key support.

The method can fail with cryptic errors if the template references missing keys, examples are empty, or the template syntax is invalid. Additionally, the semicolon-separated multi-key support is undocumented.

Apply this diff to add error handling and documentation:

 def apply_input_template(self, examples: dict[str, list[Any]]) -> dict[str, list[str]]:
+    """Apply input template to examples, supporting semicolon-separated input keys.
+    
+    Args:
+        examples: Batched examples dict with input keys
+        
+    Returns:
+        Dict with "formatted_input" key containing formatted strings
+        
+    Raises:
+        KeyError: If template references keys not in examples
+        ValueError: If examples are empty or template syntax is invalid
+    """
     keys = [k.strip() for k in self.input_key.split(";")]
+    
+    # Validate all keys exist
+    missing_keys = [k for k in keys if k not in examples]
+    if missing_keys:
+        raise KeyError(
+            f"Template references missing keys: {missing_keys}. "
+            f"Available keys: {list(examples.keys())}"
+        )
+    
+    # Validate non-empty
+    if not examples or not examples[keys[0]]:
+        return {"formatted_input": []}
+    
+    try:
-        examples["formatted_input"] = [
-            self.input_template.format(**{k: examples[k][i] for k in keys}) for i in range(len(examples[keys[0]]))
-        ]
+            examples["formatted_input"] = [
+                self.input_template.format(**{k: examples[k][i] for k in keys}) 
+                for i in range(len(examples[keys[0]]))
+            ]
+    except (KeyError, ValueError) as e:
+        raise ValueError(
+            f"Failed to apply template: {e}. Template: {self.input_template[:100]}..."
+        ) from e
 
     return examples
🧹 Nitpick comments (2)
nemo_skills/training/nemo_rl/start_sft.py (2)

141-149: Critical bug fixed! Consider adding assertion message for better debugging.

Excellent fix! The code now correctly uses a local variable current_input_key instead of mutating self.input_key, which resolves the critical bug from previous reviews where validation datasets would fail.

Optional improvement: Add a descriptive message to the assertion at line 143 for better debugging:

-            assert "messages" not in dataset.column_names
+            assert "messages" not in dataset.column_names, (
+                "Cannot apply input_template to datasets that already have 'messages' format. "
+                "Input templates are only supported for input/output format datasets."
+            )

177-177: LGTM! Consider adding strict=True to zip() for safety.

The code correctly uses the passed input_key parameter.

Optional improvement: Add strict=True to catch length mismatches between input and output data:

-                for input_, output in zip(examples[input_key], examples[self.output_key])
+                for input_, output in zip(examples[input_key], examples[self.output_key], strict=True)

Note: This requires Python 3.10+. If targeting earlier versions, this change should be skipped.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 545f09e and 62164ed.

📒 Files selected for processing (1)
  • nemo_skills/training/nemo_rl/start_sft.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
nemo_skills/training/nemo_rl/start_sft.py

108-108: Avoid specifying long messages outside the exception class

(TRY003)


177-177: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (5)
nemo_skills/training/nemo_rl/start_sft.py (5)

39-40: LGTM!

The import correctly uses nemo-skills utilities as requested, and the alias avoids naming conflicts with the existing load_config import.


91-91: LGTM!

The parameter is well-named and appropriately optional.


156-156: LGTM!

Correctly passes the input key to add_messages_key via fn_kwargs, completing the fix for the validation dataset issue.


168-170: LGTM!

The signature correctly accepts input_key as a parameter, making the method more flexible and fixing the validation dataset bug.


266-266: LGTM!

The template path is correctly propagated from the data config to the dataset constructor.

Comment on lines +104 to +109
self.input_template = None
if input_template_path:
input_template_config = load_prompt_config(input_template_path)
if "user" not in input_template_config:
raise KeyError(f"'user' key is missing in the YAML file: {input_template_path}")
self.input_template = input_template_config["user"]
Contributor


🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Verify error handling in load_prompt_config.

The template loading logic is sound, but ensure that load_prompt_config properly handles file errors (missing files, permission issues, invalid YAML) to provide clear error messages to users.

Run the following script to examine the error handling in load_prompt_config:

Note: Line 108 has a long error message in the exception. Consider extracting it to a variable for better style compliance.

-            if "user" not in input_template_config:
-                raise KeyError(f"'user' key is missing in the YAML file: {input_template_path}")
+            if "user" not in input_template_config:
+                msg = f"'user' key is missing in the YAML file: {input_template_path}"
+                raise KeyError(msg)

🏁 Script executed:

#!/bin/bash
# Description: Check error handling in load_prompt_config function

# Find and display the load_config function implementation
ast-grep --pattern $'def load_config($$$) {
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Locate the implementation of load_prompt_config in the Python codebase
rg -nP --type=py "def load_prompt_config" -C3

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for load_prompt_config references across the repository
rg -n "load_prompt_config"

Length of output: 237


🏁 Script executed:

#!/bin/bash
# Display the implementation of load_config in the prompt utils module
rg -n "def load_config" -C10 nemo_skills/prompt

Length of output: 1538


🏁 Script executed:

#!/bin/bash
# Show full implementation of load_config to inspect error handling
sed -n '350,450p' nemo_skills/prompt/utils.py

Length of output: 1814


Add error handling in load_config and extract literal messages to variables

  • In nemo_skills/prompt/utils.py (load_config), wrap the open(config_path) and yaml.safe_load calls in a try/except that catches FileNotFoundError, PermissionError, and yaml.YAMLError, then rethrows with a clear message including config_path.
  • In nemo_skills/training/nemo_rl/start_sft.py (lines 108–109), extract the KeyError message into a msg variable before raising.
🧰 Tools
🪛 Ruff (0.13.3)

108-108: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In nemo_skills/prompt/utils.py (around the load_config function) wrap the file
open and yaml.safe_load calls in a try/except that catches FileNotFoundError,
PermissionError, and yaml.YAMLError and rethrows a new exception (or raise) with
a clear message that includes the config_path; ensure the original exception is
chained. In nemo_skills/training/nemo_rl/start_sft.py around lines 104–109,
extract the KeyError message into a local variable msg (e.g. msg = f"...") and
raise KeyError(msg) instead of inlining the formatted string in the raise
statement.

cursor[bot]

This comment was marked as outdated.

Collaborator

@Kipok Kipok left a comment


added a few more comments. If you want to merge this please also add a new gpu test in https://github.com/NVIDIA-NeMo/Skills/blob/main/tests/gpu-tests/test_train.py to ensure this is being tested in the ci

output_key: str = "output",
num_proc: int | None = None,
force_reprocess: bool = False,
input_template_path: str | None = None,
Collaborator


I mean the logic in detect_data_format function

print(f"[Map] Processing {split_name} dataset from: {path}")
dataset = load_dataset("json", data_files=str(path))["train"]

current_input_key = self.input_key
Collaborator


@wasiahmad this is an important thing to fix if you want to merge this

@Kipok
Collaborator

Kipok commented Jan 8, 2026

@wasiahmad do you still need this feature? If not, maybe we close this without merging for now as some comments are still unaddressed here

Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +184 to +185
examples["formatted_input"] = [
self.input_template.format(**{k: examples[k][i] for k in keys}) for i in range(len(examples[keys[0]]))
Contributor


accessing examples[k][i] will fail if any key k doesn't exist in the dataset

Suggested change
-        examples["formatted_input"] = [
-            self.input_template.format(**{k: examples[k][i] for k in keys}) for i in range(len(examples[keys[0]]))
-        ]
+        formatted_inputs = []
+        for i in range(len(examples[keys[0]])):
+            format_dict = {k: examples[k][i] for k in keys}
+            formatted_inputs.append(self.input_template.format(**format_dict))
+        examples["formatted_input"] = formatted_inputs

@greptile-apps
Contributor

greptile-apps bot commented Jan 30, 2026

Additional Comments (2)

nemo_skills/training/nemo_rl/start_sft.py
cache doesn't account for changes in input_template_path - if the template file changes or the path changes, stale cached data will be used

    def load_or_process_split(self, path: str, split_name: str) -> Dataset:
        data_path = Path(path)
        cache_dir = data_path.parent / ".cache" / f"{split_name}_{data_path.stem}"
        sig_file = cache_dir / "signature.json"
        file_size = str(data_path.stat().st_size)
        template_sig = str(Path(self.input_template_path).stat().st_size) if hasattr(self, 'input_template_path') and self.input_template_path else "none"
        current_sig = {"size": file_size, "template": template_sig}
        if cache_dir.exists() and sig_file.exists() and not self.force_reprocess:
            with open(sig_file) as f:
                old_sig = json.load(f)
            if old_sig == current_sig:
                print(f"[Cache] Loading {split_name} dataset from: {cache_dir}")
                return load_from_disk(str(cache_dir))
            else:
                print(f"[Cache] Invalidated (signature changed): {path}")

nemo_skills/training/nemo_rl/start_sft.py
cache signature should include template information - currently if template changes or is added/removed, stale cache will be used

Store input_template_path in __init__ and include it in signature
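The stale-cache fix Greptile asks for can be sketched as a signature builder. Greptile's snippet keys on the template's file size; the variant below hashes the template contents instead (my own assumption, since a same-size edit would otherwise go undetected). Function and field names are illustrative.

```python
import hashlib
import tempfile
from pathlib import Path

def build_signature(data_path, input_template_path=None):
    # Data file size plus a hash of the template contents, so editing the
    # template (or adding/removing it) invalidates the cache even when the
    # data file itself is unchanged.
    sig = {"data_size": Path(data_path).stat().st_size, "template_sha": None}
    if input_template_path:
        sig["template_sha"] = hashlib.sha256(
            Path(input_template_path).read_bytes()
        ).hexdigest()
    return sig

with tempfile.TemporaryDirectory() as d:
    data = Path(d) / "train.jsonl"
    data.write_text('{"input": "x"}\n')
    tmpl = Path(d) / "template.yaml"
    tmpl.write_text("user: '{question}'\n")
    sig_a = build_signature(data, tmpl)
    tmpl.write_text("user: 'Q: {question}'\n")  # edit the template
    sig_b = build_signature(data, tmpl)
    print(sig_a["template_sha"] != sig_b["template_sha"])
```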

Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Mar 3, 2026

Caution

Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted.

Error details: GitHub returned HTTP 500 on the PATCH request to update the walkthrough comment. The failed update contained a refreshed walkthrough covering the SFT input templating, the ASR leaderboard default-argument changes, and a refactor of nemo_skills/evaluation/evaluator/audio.py to a cached Whisper-style English normalizer (preprocess_asr_text).
                            |\n|     Title check    | ✅ Passed | The title clearly and concisely summarizes the main feature introduced in the PR: adding an optional template parameter to format input data. |\n| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.                                                          |\n\n</details>\n\n<sub>✏️ Tip: You can configure your own custom pre-merge checks in the settings.</sub>\n\n</details>\n\n<!-- pre_merge_checks_walkthrough_end -->\n\n<!-- finishing_touch_checkbox_start -->\n\n<details>\n<summary>✨ Finishing Touches</summary>\n\n- [ ] <!-- {\"checkboxId\": \"7962f53c-55bc-4827-bfbf-6a18da830691\"} --> 📝 Generate docstrings (stacked PR)\n- [ ] <!-- {\"checkboxId\": \"3e1879ae-f29b-4d0d-8e06-d12b7ba33d98\"} --> 📝 Generate docstrings (commit on current branch)\n<details>\n<summary>🧪 Generate unit tests (beta)</summary>\n\n- [ ] <!-- {\"checkboxId\": \"f47ac10b-58cc-4372-a567-0e02b2c3d479\", \"radioGroupId\": \"utg-output-choice-group-unknown_comment_id\"} -->   Create PR with unit tests\n- [ ] <!-- {\"checkboxId\": \"07f1e7d6-8a8e-4e23-9900-8731c2c87f58\", \"radioGroupId\": \"utg-output-choice-group-unknown_comment_id\"} -->   Post copyable unit tests in a comment\n- [ ] <!-- {\"checkboxId\": \"6ba7b810-9dad-11d1-80b4-00c04fd430c8\", \"radioGroupId\": \"utg-output-choice-group-unknown_comment_id\"} -->   Commit unit tests in branch `apply_input_template`\n\n</details>\n\n</details>\n\n<!-- finishing_touch_checkbox_end -->\n\n<!-- announcements_start -->\n\n> [!TIP]\n> Try [Coding Plans](https://www.coderabbit.ai/issue-planner). 
Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).\n> Share your feedback on [Discord](https://discord.com/invite/coderabbit).\n\n<!-- announcements_end -->\n\n<!-- tips_start -->\n\n---\n\nThanks for using [CodeRabbit](https://coderabbit.ai?utm_source=oss&utm_medium=github&utm_campaign=NVIDIA-NeMo/Skills&utm_content=883)! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.\n\n<details>\n<summary>❤️ Share</summary>\n\n- [X](https://twitter.com/intent/tweet?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A&url=https%3A//coderabbit.ai)\n- [Mastodon](https://mastodon.social/share?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A%20https%3A%2F%2Fcoderabbit.ai)\n- [Reddit](https://www.reddit.com/submit?title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&text=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code.%20Check%20it%20out%3A%20https%3A//coderabbit.ai)\n- [LinkedIn](https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fcoderabbit.ai&mini=true&title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&summary=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code)\n\n</details>\n\n<sub>Comment `@coderabbitai help` to get the list of available commands and usage tips.</sub>\n\n<!-- tips_end -->\n\n<!-- internal state start -->\n\n\n<!-- 
DwQgtGAEAqAWCWBnSTIEMB26CuAXA9mAOYCmGJATmriQCaQDG+Ats2bgFyQAOFk+AIwBWJBrngA3EsgEBPRvlqU0AgfFwA6NPEgQAfACgjoCEYDEZyAAUASpETZWaCrIPR1AGxJcA8t3H4WAQ8aIjIaJA0zNwe1CSR+JAAZvgUzNQoGNx4BgByjgKUXAAcxQDMBgCqNgAyXLC4uNyIHAD0rUTqsNgCGkzMrbkAagCSACIjAIJguSQAsvitAMoA1vAeHoit2RutpRWViEWQAO6h8Giw6bQGS/jYFAzxAlQYDLBcaNwxsgD68Fk8L8ojE4m5nKRcJAXph3lx0gDbrhqNgWvxuGQDABhCgkOL0ahcABMAAYiQBWMAARhJYBJZWgJIALBwKRxmQAtIxjaQMCjwfzwQIcAxQSa0WjIfCCwJoDyQACakzmNTAtH5UiwAOyUJBsXEGCICQUzDU5EgzGwHnEMXiSXgJA8ksywQiKTS1BotH+gKhqLoyVSFukiDQpAUGEQuAo2DEQowABoI/aiA8VF5IBILpBaNQ0BptUC9XFftxqLANKLIGM8zxcbx8E8wgCjQAKKwUFj+GzSbiBI415FHXAAShFkF0kD8AQwcsiJGi+pbXA8+DQ9CVKsz2dX69LneiuF+TAwKfQGHouIvlEDHsadB9OuSB4tVptGftjudraOzHgTFXDAwCOMsqC9TIdV+FYSFkEcNEgHsAEdsHgXFkH9PhoPkAF50XOJKwnKA5hDMN4hPKMYzjQJIAwfATkgf1kGLcD3XSe9vULKETlgMg63wLMlFoABudAJV+NgwlIxAoJghjuFzcDXQYJ5/AgoEsJCKg2BoCgCMnWZ6K+H5HyLBdQRoC0vnsGhuHQb4PAdehCndeIJNDcM+TxGcRNCI4KFwZBaLrEgwBIAAPJADSNNypMrKAsUCFM0xnU5UOXAxCJzPMC19YEzP1EhS3LdBlJIfwA0wegyzCAMAWCBTQhIKFh2wbhKyqeT8X4YRRHEKRkAcJwXEgVtsDeWBMFIehwpUqF60SaoagUChcTEOgxzcTxvCnGUgkSarwlw8z4mCViMk4vICmOfYqlqepGmaNoOi6Ho+hYQZRgmaZZgWZY1g2LYdg8PZyiqPyuDORALiuddbnuR5nleOE7OMzi8rwmhwQoSFoSRj5LMRJZkVwVEuGlTEcS8irOEgUkKWpWl6UZFk2U5cxLAS1h2AGxx0hcKseyzEgThvABpAV8BWdBEGg+huKani+FwHicGVoMo3WeVyDoJiEGQJIvIeeJKvsbAiFIKMAwYVcocNSIVdsVK1bwYNsZbFAkhQXAAHJAsSQDSD4bXBKTWiovsFgyJYNgMACyBcQRLAxvXdUQzoeDW2I2P4wDDCuAAAUh6HrhHdnIE5v8oRii3TaG1w4FQI4qK1KVK/AwoGDQf17fiBsBKt6P1GDSTw1QUhyH5BhWjG3ECoJC8c3waQaPwKEU5ddhoRgwJ6GV+JHaURA+QFGd2v0YxwCgMh6HwT2u4IYgyGUcD+hjmneG6kQ4366F5CYJQqCqHUFoHQ58TBQAbsgVAmBVaEHHs/AeXNY5cCoPRQafN5ByAUAAlQahNDaF0GAQwF9TAGHIMwfAvwZaay2NGbQGAWytHIZQigwMozOCPIgJImhuCyBFAAIkEQYCwkBJgjEfhPLq6DnDyFvowCahppBGCgCMWOnZaCxmXpxI6S47YOG+KkKEWYIgRHIPRNGzFCplmVppNA2kbzBA7F2XAPZEB9kjCQQcjVNC/B9OoXxSZdy0HdhETcS1LEjWMZAIJ+5nHHkSvAIgI5zz0CjKkd2Q9Qj2EdEkHKkFLEiQkHKeACll68GkGQJ4/A76QH4RhfhkAsJxUgAAdR4lgGBFj8pxBQMgPuJS6CBLXN6VIsSmzSTcQ5KEtF6JXgAYdM67ETJQk/E6X+KMPCyBCfZP4XSMYnUSHvLKQ4mpJhNogE46h3jLyOSQJIBtv7xDRhpYI/DFlemW
Q090psBBHGQpvBs4yWzNMqJ1GgyAGrDj4kCu2SRVz0VdDs28bFw7OVSPEBRtAHJ22rlompETUDlKOLHM5C8Do3JVkZTsvALgWWebJV0YlcXSQ0lEpIGAoJnGxogZp4pnSmOFjwHoDkGDBjVgSHZyz0bHWNHMm87yHzaKwUZTZ7sjlBIDBE4IRy3GiHgJ+eg9LZA/mST+GIQ8sG/n/PgQCiA4JVgWMEz84QmUkQtjJeQUMiCzhJriY0aBSqqSNSkhiRxdYZBeYkcKdCxBqTjt83FEYKKxhSgCS265qnx0dFsu21EjgeFyUa5pTiyxEDiFA3Kliio2OVp2M2sBskk24L8BqxpyLRhTdYA83Zez9k8XmKFlya0qzMbY+xFAkxkHTGq7p4ccJ0IBO7VtgKQzArLpMa0z94xMUOSrJQ1tnDUG3Zm8KfZ/IBiDNkAQIrIDsHUA6RAyjIDEXFfYRJPqjZyVKbQcck4AAGShPa+MXUeX4P4clJgXRyyU1b8YUSTEUjwLbpLWLg9GSAAAfSAuRAjxAALzYdw0mI1XAKKQAI/wzi/Ckz3CaOpGCpH0MUdozqajNFHBjK4HVTDhHzQEZw+QJM7oni/HrJ2cZXABD4BteRyAAAxOURxiOVtnVY8sjG+BYYE/h3jJARx/syIwWIYRIB/qcYeVx7iBwDqagZnCf7mFUP+psVoUHGGOdYa0dh/kqHcI0Lwv9DqFaKDfd6lEfrWrft/VAADdzRLemZR68DBbJ1hTsbaNEwSxAAG14PRMitlyYGBZAAF0SvKcglhDTySiE5n/LgXL0ZAkFYclGbLWWGt5aK6VsrJW7NYAPSZsz3aXG9o8V44c/XTOOeoQDVzVBF2GiYQuFhbDkQ+a4Tw2QgWoAGSFde/8YrYCKGi6ZwDGzdkqf2clpIqX0teEy/VxrE78tte62Vmreg6s5by61zr0YytTcG8gYbzjLN9om7ZwzDmVtOZofN+h7nYeee85wvzAWguvq9R+iLYK6Cndi57GJoyV1hCoea3AN2kyoY00mSZ/jZxsGq7oL7kPcAGYTvQxASYBAuzqpQWc8pUSkRXvRTuANTP5tySqy7+TVMGZNuSkH7LOUQkQHhgA3pR3KWF+FcAYA8K8R4jUAF8DPBD/SncSbrpAer/c0l9x3UnvvC/ESL+ICfnZas2hqrYCDQQYQAL2OJMPA+BoCSzIPAYPL2GrxNPIkscBn3eKX2r5e2daiANr/Xs46sG8Nx5PCmDQkJWza7l/s2DbHtP6dOO0pNHa4x21BxZsb1mTns+hzN5ztCFsMKWx5tbHDfNbcC4I/hyjSHd/h5CpqrRQgUDAF4dclApPOFoK0YDDDQP+b4QYcfwjLBiIkQg1JvMZGZveJNJRVZQXfp7rexD2Aj3URNvAsC8Z0DY0cJvQDXdrQtAZSTgACiQwkwNQvwkwNgAA4ksPItfvQEkC+PwgANQoEkCIbAiyAYh4ZdzBKJBoEYFyjx7F60QegOSB4v4coUJKB4awBJC/DL4AJr4UC0ANKvKEGYG4DYEkC4HYD4H8J6RQDQHAG5DAE2CTDQAjA+C5CQEwFwFX6KKIHIFoENiHi/CLJ4bkyzg6CEGzjXqFR4FCh4YdokDsGJCoEoFqH+AaGpBsRaEYg6GCG35470CvzcynY9ighPCXh3KUCVIHKQAAASZsnQhoCmVSTBq+a4rBwUpOtsRoQ6DabSSAGIi+UYsgGYNAYU0ydhxSlBp8QBYoEoFUK8dKF4/4R6dsysGQraoYi4y8iAx2Vo9AE0UgkQoQKwWBOB/CkwSwNgXyl61KN4LS4hjAcoBuS4gQzSCUaiMmcKdEIahQbRQoFAnhK2Ugu8Ks9BjBeIzBMR9AZB6QFBVByQsQRAxGi4OaRof+b4K85B0epxtAS8fsfoRw9sGQNBJAQhCEGxmqKsU6BhvwRhiQcKYYlxPwIS/BQooU+hXg9A7+pxjED+tx1oW8KxqQ7UT6Vgwq/4rQp6hiAY+6sQH+
/Y8BShhmT+zwLsYchm2Orup2oB4BchsBmYco2AZECiU0PxIhYhEhUhMhLJcBVJ5J3J66m6pJkYxoRyxJh6M4UonsBJ56N8fAV6N6d64gN+8U/YyIscop+OpmTJEBUBsBU2MOFCcOc2s+uA8+iAi+URFALBG+W+/ivwu+BmSBLApmlhRBSG3BOBIJkAnBxBReiSGgRx+RVB4kigvBOxDpTp/C5uiQf6PpXBPBfBAhO25cOpmAUIihU0XAf6vJ4hkh0hshJpSwZp0+VpNmNpC+S+ex0R6+m+fiO+AWz4XpKZqhI2thd4DhU6uh6BcJhh0J+AJhMYZhSZ3p3ZcSmh2h2giZVYFcm8KeAYrYtEQESp4E9JvqemXAzxBub8px2RUIh8x8ZoRovpz+M4TCeRJxKUyxaAWYQYJsfRdgraCAepKcdArQNB+qjkI0QUdyDyUIeaLuu5dqZccwmA+q0gUIcm6w8QRWcosgMeRgNQAIy8+ZBpKBTIrQYAFQBgwBGsbEA8SgWaQs9EwFhiXAxEwSjg++QiooU+sOs2LmV5VB+JT+1AqQ8+o5u+AiQiIix+iJ255+w0ci2Fj6VYfKVsAan5RoPEHgaRt4kAwBhorWDaEZ95n+US+cHgMYx48lIU+6qQXUvwkIvwZARAmlvw2l0elArYySJsdU6imi9AsQge2E0Qhima6lNlSAsA0A0aOGDxMezSXhsQPhPAlAYA4uWsd5jxKUoEdiTUlASVn+SR6Ab6GlvcYmjYIYwJdpeUORfu0azONRbxFKZExl9A/lml9xxxDlFAIkCc/EAYXxYAAgjUTkeM7s3y9lBR8YzSHFNARVfAO5n62FyhXpo1hhdprYuIBsV4TwSYsA2Bq8PEUM3OjVkZM40ZSgySwQc141i1fhK1JAa1G1e821zlC8Myoa1VNEgq5S8R41JVKyQYUmNiS1/hbwxsC861fYN1SAzSRKlA7Rg1mq0a9gs4zQx2cclqBA/I1RMNiuuI3oJ5W8LkyQAIc4Dgvy4gJM8ppKvhvqQQW1TyFN8QaEdxY0ACtSJ5DSJs/C5SmN0azhV83FY1C+pYoqUMi4/5Bpg1pxD1coZwsg6EYaD+o6r14mhVvNJ5Z1y1ARd1VU+V4y71StQNm10gSAyS8sWAUNVCupuYrBvwIsfAqA0AE5IkRy5Sz5qIFoMZX+qYb8IQNiqAbVmxEVfxBI0SEI8QUKciXgZaDA8gItKUSlaR4QC87auZyArYdliVweLaiS6g0kBAFtqQkoSYVlYUZYF48eMaJNkAomfxydHKuqogsAVlXgb8O1YBNQ4wvwuQPgNgcw4BIwHIpZgpcwPgYwwBSwzlAcTsDaRyNqhqDRR5KUcictBVZOitMNlVJUs0pRXxY6aVukVYJ19Rtob6gtBqKCft1k/I3w7sci0dSQVowU9oYU0grVJ9b5vwNQwBkwg9NgAAQj4FAWMMBBiAwP+aKqhoZkkE+UGHIhEGNELfQG+R7bACJORDaiUl1HA6+f0WAFYFiGcYscELPLIkbYKidQviGqub4Q4GiaspKM0qoqIv0bjLCA2kkThLvfdpdbeg5H+D6h1TGV1T1btTpYECJA9cEqtLgJsuMRLsQwtaer1A+JgBcpQEmKJfGPaiolgG+Zg9gzCONEmNBKpNIxQHzaGiLrzmvA5N6gGFlaOpHZ/gsScA7gCEGCeNGDamAPY6JEIKiLgA3Z8LQF45bNNM4BI3+M2IaCfpKZAJitikaIij8EmpFOwBI98m+a0Jo1g2PbRi6G5YA3omVHKYjIw8vCk/0S/W/R/d/b/TQz5f5F+nEH44JB2cwMkGNGINJpsCgDU1CAZdgEZdcu1MJRKVQTuiiaICScMyegXYSSqftuqdnJqdJcUYJPuXFhZU1FZRpYFcnWFY5Z9mpZs00cFTkaFU1THiNDSkUhZNHZQAbV0JAPpYZZ3Ncq2OkGFFDMHnhlSMkjhNWexdzfGFxeyTxRQHxfgbvlWDiQdgwCs57P
PZrUvaVSeczrVhRBC7idCzmHFoY8rX9U8LTlE9dVtUgMi19h1s9kmO9mi1CzC4/kCzzcVdwAwDixdfizrSDWiHlsbewheOvhbccFJjJgRjbRyaHCnVGV8RprJvwty2bWwXs2S11sViViNJNX6sBb1JINmo9bQJBs0VngIxlYEAdU8n7NEoEIHCEDVLvPtBrSGKXFAJCyKlwK2D2BQohnamXcbcEp0AFMCJQicDnYgGVTkSS6fVS06yNK6/xIpskvnYXd6C41QM3EG0i6fXs6iw6+i861G+67G97YVBueTiQDXXXQuNzMGzTBROm9GOG/+NmxsTG5AE3S3W3R3V3T3QKbIf3YPXAQRmXjK+vmxvwiCYUONEO3GY2Y6QcUOxuWYfa9YFm7Sx4NeYVHvV4K2BoJu8kg9ciWQ0u9eXY/Cm2nKPKIY5AK0PuyWLzYy0mN7e7BRAKL8FfVaPuHcvAPfVLSLibEcs/a/e/eIZUzYH/bqoA/aMA+WOhZhcgNNVwCgUSESPhVSOSBUAfsxcQuAregvJA2HhE11O4cgvHGgGguJZgn/DGYAngiAoQoYBh6/P4iUtJLiJRQ+KjtRyQlfAwAAOxoBMhJAABsnHpIeIJAVInH5IfHSQZQJA5IVIfHAAnCQCSDJ3x0yJx6p3J0SEyCSIUHJ2xxh2gCSMUEkAwCUQINp2gAp0yAZ1JwwESGZ3J0yJ3GUJx1SGgAZ85CSMWwIHp5fJAOSEyOSEkCSAp+SEZ3x3x8ULmJFyvg5zxwZ5xxJ8F3iHJySEkMUAwAFz5xAJAHx0SLJ0yHQAIJxwwMUAp0kLQJxySCSHx7QHJ3Jw8mUGgHx3ckSAwAJ/x+SAIPJ1l1fHV/B0SNxyQGZ1SLQNp8UJx2UDSLl256Vyp0yGUOSHJ2gOSCSBl7V7p+fAYLR4PMbjBkxw6CLN6NfD18FNbtjIVNcgwCsBMutlCJtxrkAZRogLYJ/auFd3QMubHFYPgIE3rskIppdY900fcE6K942CsLYH92A5sIDxOE9z4FICtCUkoBgFDwDwmI98ErQDYGNGMI2ETCjUQIgFiDxFd396YRj3D1jzjxgO4OIyQCT6ICsOTxOZT7UtT7j7yGfTOIz2T1wND0po99irLCMGEByYgAT394Imz/wsZrgLzysK4m+IgH99lkAROA9xOFr7UpdysLkKlVLzyEfNz5/gr9Rur1r9K8TKiCzyKxb3D6erED6vGFLwr/YGsN8AGNqUoDYLguoIAJgEyACAWeDZUg8o0iw0qAgJ8JGg5v2vcPXxUvXK/eRAcf8f/C6SYRcoCv+vbAUvZ5JvgQE+2vJubPGv9vOvpPevBvXA/CdPGYuvaf2vVvKIKvXAFPFf/CjvmAVBUvcAJ0W0RmeIrC8gJsJ4gD+anq4lDlusrk9CyQhsfqrlig7lhmRytgDTISWA0oM4c4ESKV46xoiycaxy+YTflviftfyfLY5/cPq0CSqYuIaPMPZflvmfeNHgOfNfjNW0xfWvpfFvTXun11658zCtfPHgwAfZ2wEoiPUiLf1qTsISYbfSIKz077d9neRfcAY2CgFGgmAsA0eMgBpAkgNAVXAAKR15/wDaRuNgHuT/gHQsceCP3yzTIRUI/xNCMdjWSoBigxAsgbH1f4J8YySfZwCn3gEZ9+QWfT/lX1AH59sB0YFsCrwt4m4gC5WIXqEFwC2Ajex8XaFLzM4LcZO5IXMESDKALdiu5IO5EyAED8cUuAnJkLQH85ydigdyZSJpwYBUg5OKnLrm5zKACAqQZQYoHZ2c58dO4qnEgJx045p9Zeag2wPXzAG1IlOAgOTmUHg5ydOOcnJDuJ1SFUgDYexXjqJypACBHOWQkrk1ySCycFOmQtAKkJIC5dOOSQSwZxwEAMA6h6XEkPkIiEHkcBMA5QKQFUQ6RBcRMOIH9yAFw9fmWwUagC1Gq8UQSu+IYZ3wIDIgPAcmVpvKT+7kh+BtSa+m8HlItIugEAnAcgPJCKCgCS
gpQVt187lIzupAY8FX2kjHcwEvne+JQjLD+gTaJYVjvd0iFRgrAXcI4LQEmCjZmOtACuOoAShjRcAeuckKcP05h4ioLwxAYVDuFEIgAA -->\n\n<!-- internal state end -->"},"request":{"retryCount":3,"retries":3,"retryAfter":16}},"response":{"url":"https://api.github.com/repos/NVIDIA-NeMo/Skills/issues/comments/3364204828","status":500,"headers":{"access-control-allow-origin":"*","access-control-expose-headers":"ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset","content-length":"0","content-security-policy":"default-src 'none'","content-type":"application/json; charset=utf-8","date":"Tue, 03 Mar 2026 18:59:11 GMT","referrer-policy":"origin-when-cross-origin, strict-origin-when-cross-origin","server":"github.com","strict-transport-security":"max-age=31536000; includeSubdomains; preload","vary":"Accept-Encoding, Accept, X-Requested-With","x-content-type-options":"nosniff","x-frame-options":"deny","x-github-media-type":"github.v3; format=json","x-github-request-id":"8C4D:3BEA78:B5890:30E65A:69A72F7E","x-ratelimit-limit":"60","x-ratelimit-remaining":"0","x-ratelimit-reset":"1772567328","x-ratelimit-resource":"core","x-ratelimit-used":"8304","x-xss-protection":"0"},"data":""}}

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/audio.py (1)

376-386: ⚠️ Potential issue | 🟡 Minor

Return a stable ASR-PC metric schema for missing_generation.

At Line 376 through Line 386, the ASR-PC early return only includes wer, while the normal ASR-PC path returns wer, wer_c, wer_pc, and per. This can break downstream consumers expecting ASR-PC keys.

Suggested fix
-        # ASR / ASR-PC
-        return {**base, "wer": 1.0}
+        if task_type == "ASR-PC":
+            return {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}
+        return {**base, "wer": 1.0}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/audio.py` around lines 376 - 386, The
early-return for missing_generation when task_type is "ASR-PC" only returns
"wer" but downstream expects the full ASR-PC metric schema; update the branch in
the block that checks task_type and generation (the if task_type in [...] and
not generation) so that when task_type == "ASR-PC" it returns the complete set
of keys used by the normal ASR-PC path (e.g., wer, wer_c, wer_pc, per) with
appropriate default values (e.g., 1.0 for error rates) by merging them into the
existing base dict instead of returning only wer.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 62164ed and e992227.

📒 Files selected for processing (2)
  • nemo_skills/dataset/asr-leaderboard/__init__.py
  • nemo_skills/evaluation/evaluator/audio.py

Comment on lines +166 to +167
ref_std = preprocess_asr_text(reference)
hyp_std = preprocess_asr_text(hypothesis)

⚠️ Potential issue | 🟠 Major

apply_whisper_normalization / normalization_mode are effectively ignored in the new flow.

At Line 166 and Line 167, ASR-PC standard WER always uses Whisper normalization when enabled, regardless of normalization_mode. At Line 399, ASR always calls evaluate_asr(...) without honoring config.apply_whisper_normalization. This silently ignores user-provided settings.

Suggested fix (make config effective and fail on unsupported modes)
 def evaluate_asr_pc(
     reference: str, hypothesis: str, normalize_standard_wer: bool = True, normalization_mode: str = "standard"
 ) -> dict[str, Any]:
@@
-    if normalize_standard_wer:
-        ref_std = preprocess_asr_text(reference)
-        hyp_std = preprocess_asr_text(hypothesis)
+    if normalize_standard_wer:
+        if normalization_mode == "standard":
+            ref_std = preprocess_asr_text(reference)
+            hyp_std = preprocess_asr_text(hypothesis)
+        elif normalization_mode == "none":
+            ref_std = normalize_whitespace(re.sub(r"[^\w\s]", "", reference.lower()))
+            hyp_std = normalize_whitespace(re.sub(r"[^\w\s]", "", hypothesis.lower()))
+        else:
+            raise ValueError(f"Unsupported normalization_mode: {normalization_mode}")
     else:
         ref_std = normalize_whitespace(re.sub(r"[^\w\s]", "", reference.lower()))
         hyp_std = normalize_whitespace(re.sub(r"[^\w\s]", "", hypothesis.lower()))
-def evaluate_asr(reference: str, hypothesis: str) -> dict[str, Any]:
+def evaluate_asr(reference: str, hypothesis: str, apply_whisper_normalization: bool = True) -> dict[str, Any]:
@@
-    ref = preprocess_asr_text(reference)
-    hyp = preprocess_asr_text(hypothesis)
+    if apply_whisper_normalization:
+        ref = preprocess_asr_text(reference)
+        hyp = preprocess_asr_text(hypothesis)
+    else:
+        ref = normalize_whitespace(re.sub(r"[^\w\s]", "", reference.lower()))
+        hyp = normalize_whitespace(re.sub(r"[^\w\s]", "", hypothesis.lower()))
-    elif task_type == "ASR":
-        metrics = evaluate_asr(expected_answer, generation)
+    elif task_type == "ASR":
+        metrics = evaluate_asr(
+            expected_answer,
+            generation,
+            apply_whisper_normalization=config.apply_whisper_normalization,
+        )
         updates.update(metrics)

As per coding guidelines "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

Also applies to: 399-399
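The cited guideline's dataclass approach can be sketched as follows. This is a hypothetical illustration, not the repository's actual config: the field names mirror the review's example, and `AsrEvalConfig`/`make_config` are assumed names.

```python
from dataclasses import dataclass


# Hypothetical config sketch: a dataclass rejects unknown user-passed keys
# with a TypeError at construction time instead of silently ignoring them.
@dataclass
class AsrEvalConfig:
    apply_whisper_normalization: bool = True
    normalization_mode: str = "standard"


def make_config(**overrides) -> AsrEvalConfig:
    # Unknown keywords (e.g. a typo like normalizaton_mode) fail here,
    # because the generated __init__ only accepts declared fields.
    return AsrEvalConfig(**overrides)
```

With a plain dict a misspelled option would simply be dropped; here `make_config(normalizaton_mode="none")` raises a `TypeError` immediately, which is the fail-fast behavior the guideline asks for.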

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/audio.py` around lines 166 - 167, The code
currently always applies Whisper normalization by calling
preprocess_asr_text(reference) and preprocess_asr_text(hypothesis) and calls
evaluate_asr(...) without honoring config.apply_whisper_normalization or
config.normalization_mode; change the flow so that before preprocessing (or
before calling evaluate_asr) you check config.apply_whisper_normalization and
config.normalization_mode: if apply_whisper_normalization is True and
normalization_mode == "whisper" call the Whisper-specific normalization routine
(or call preprocess_asr_text with a mode parameter), if
apply_whisper_normalization is False skip Whisper normalization and use the
standard text preprocessing, and if normalization_mode is set to an unsupported
value raise an explicit error; also update the evaluate_asr(...) call site to
pass or respect these config flags rather than ignoring them so user settings
are enforced (refer to preprocess_asr_text, evaluate_asr,
config.apply_whisper_normalization, and config.normalization_mode).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/dataset/asr-leaderboard/__init__.py`:
- Around line 17-18: The evaluator currently uses sample.get("task_type",
"unknown") which silently skips WER computation; update evaluate_sample in
nemo_skills/evaluation/evaluator/audio.py to access sample["task_type"] directly
and add explicit validation (raise a clear exception) for allowed values (e.g.,
"ASR", "ASR-PC", "ASR_LEADERBOARD") before any metric computation so
missing/invalid task_type fails fast instead of falling through to the else
branch that omits WER.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 62164ed and 8d6766d.

📒 Files selected for processing (1)
  • nemo_skills/dataset/asr-leaderboard/__init__.py

Comment on lines +17 to +18
# Uses the audio evaluator which computes WER with Whisper-style text normalization
# Data samples should have task_type="ASR" for proper WER calculation

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Locate task_type handling
echo "=== Search 1: task_type handling ==="
rg -n -C3 --type=py '\btask_type\b' 2>/dev/null | head -100

# 2) Locate audio evaluation / WER entry points
echo ""
echo "=== Search 2: Audio eval / WER entry points ==="
rg -n -C3 --type=py 'WER|word.?error|whisper|eval_type\s*=\s*audio|METRICS_TYPE\s*=\s*"audio"' 2>/dev/null | head -100

# 3) Look specifically for ASR guards/assertions
echo ""
echo "=== Search 3: ASR guards/assertions ==="
rg -n -C3 --type=py 'assert .*task_type|if .*task_type.*ASR|task_type\s*==\s*["'"'"']ASR["'"'"']' 2>/dev/null | head -100

Repository: NVIDIA-NeMo/Skills

Length of output: 18378


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get the complete evaluate_sample function
echo "=== Full evaluate_sample function ==="
rg -A 70 'def evaluate_sample\(sample: dict' nemo_skills/evaluation/evaluator/audio.py

# Check if there's any else/default handling for unmatched task_type
echo ""
echo "=== Check for else/default handling ==="
rg -A 80 'elif task_type == "PC-Rate"' nemo_skills/evaluation/evaluator/audio.py

Repository: NVIDIA-NeMo/Skills

Length of output: 4129


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where evaluate_sample is called
echo "=== Callers of evaluate_sample ==="
rg -B5 -A3 'evaluate_sample\(' nemo_skills/evaluation/evaluator/audio.py | head -60

# Check if there's any validation of task_type before calling evaluate_sample
echo ""
echo "=== Check for task_type validation before evaluate_sample call ==="
rg -B10 'evaluate_sample' nemo_skills/evaluation/evaluator/audio.py | grep -A 10 'def '

Repository: NVIDIA-NeMo/Skills

Length of output: 1613


Enforce task_type="ASR" validation in the evaluator, not only in comments.

The evaluate_sample function at nemo_skills/evaluation/evaluator/audio.py:472 uses .get("task_type", "unknown") with a silent default. If task_type is missing or doesn't match expected values (ASR, ASR-PC, ASR_LEADERBOARD, etc.), the code silently falls through to the else clause (lines 528–531), which skips WER computation and returns minimal fields. This contradicts the documented requirement in nemo_skills/dataset/asr-leaderboard/__init__.py:17-18 that data samples should have task_type="ASR" for proper WER calculation.

Use direct access sample["task_type"] instead of .get() and add explicit validation before metric computation to fail fast when task_type is missing or invalid, preventing silent metric loss.
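A minimal fail-fast check along these lines might look like the sketch below. The allowed-value set and field name are assumptions taken from the review comment, not the evaluator's actual code:

```python
# Hypothetical validation sketch; ALLOWED_TASK_TYPES is an assumption
# based on the task types mentioned in the review.
ALLOWED_TASK_TYPES = {"ASR", "ASR-PC", "ASR_LEADERBOARD"}


def validate_task_type(sample: dict) -> str:
    """Return sample['task_type'], raising instead of defaulting to 'unknown'."""
    if "task_type" not in sample:
        raise KeyError("sample is missing required field 'task_type'")
    task_type = sample["task_type"]
    if task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(
            f"Unsupported task_type {task_type!r}; expected one of {sorted(ALLOWED_TASK_TYPES)}"
        )
    return task_type
```

Calling this at the top of `evaluate_sample` would surface a missing or invalid `task_type` as an exception before any metrics are computed, rather than silently returning results without WER.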

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/asr-leaderboard/__init__.py` around lines 17 - 18, The
evaluator currently uses sample.get("task_type", "unknown") which silently skips
WER computation; update evaluate_sample in
nemo_skills/evaluation/evaluator/audio.py to access sample["task_type"] directly
and add explicit validation (raise a clear exception) for allowed values (e.g.,
"ASR", "ASR-PC", "ASR_LEADERBOARD") before any metric computation so
missing/invalid task_type fails fast instead of falling through to the else
branch that omits WER.

@Kipok
Collaborator

Kipok commented Mar 6, 2026

Probably won't merge this @wasiahmad, as we are trying to upstream everything into gym / nemo-rl. If you want to keep this functionality, please create a PR into nemo-rl directly (probably not using our prompt template, but you can introduce similar logic there). In the future we will try to always use a built-in script, and if we need changes, they'd need to go into nemo-rl directly to avoid divergence.

@Kipok Kipok added the reviewed label Mar 6, 2026