
Fix AA-LCR evaluation #952

Merged
hsiehjackson merged 10 commits into main from chsieh/fix-aalcr on Oct 24, 2025

Conversation

hsiehjackson (Collaborator) commented Oct 15, 2025

Summary by CodeRabbit

  • Bug Fixes

    • Evaluation now requires both an answer and its accompanying reasoning to be present before marking a judgment correct, avoiding false positives from empty outputs.
  • Behavior Change

    • Scoring is stricter for incomplete predictions; some items previously marked correct may now be labeled incorrect.
  • New Data

    • Source items now preserve the original question text used in prompts for clearer context.
  • Notes

    • Prompt references updated to use the preserved original question.

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
hsiehjackson requested a review from fayejf on October 15, 2025 20:31
coderabbitai (bot) commented Oct 15, 2025

Walkthrough

Updates AALCR evaluation to pass generation into is_aalcr_correct and gate judge_correct on a new reasoning_valid flag; dataset entries now include original_question; judge prompt replaced {question} with {original_question}.

Changes

  • Metrics validation & API: nemo_skills/evaluation/metrics/aalcr_metrics.py
    Changes the is_aalcr_correct signature to accept (judgement: str, generation: str) and to return False for an empty generation. _get_score_dict computes reasoning_valid from "_full_generation", calls is_aalcr_correct(prediction["judgement"], prediction["generation"]) to obtain judge_correct_raw, stores judge_correct_raw, and sets judge_correct = judge_correct_raw only if reasoning_valid is True, otherwise False.
  • Dataset preparation: nemo_skills/dataset/aalcr/prepare.py
    Adds a new field, original_question, to each output entry, populated from question_text.
  • Prompt configuration: nemo_skills/prompt/config/judge/aalcr.yaml
    Replaces the placeholder {question} with {original_question} in the judge prompt template.
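The gating described above can be sketched as follows. The judgement parsing inside is_aalcr_correct is a stand-in (the real parsing lives in aalcr_metrics.py); only the control flow mirrors the change:

```python
def is_aalcr_correct(judgement: str, generation: str) -> bool:
    # An empty (or whitespace-only) generation can never be judged correct,
    # regardless of what the judge model said.
    if not generation or not generation.strip():
        return False
    # Stand-in for the real judgement parsing in aalcr_metrics.py.
    return "correct" in judgement.lower() and "incorrect" not in judgement.lower()


def get_score_dict(prediction: dict) -> dict:
    # reasoning_valid: the full reasoning trace must be non-empty.
    reasoning_valid = len(prediction.get("_full_generation", "")) > 0
    judge_correct_raw = is_aalcr_correct(prediction["judgement"], prediction["generation"])
    return {
        "judge_correct_raw": judge_correct_raw,
        # Final verdict is gated: empty reasoning forces incorrect,
        # even if the judge marked the answer correct.
        "judge_correct": judge_correct_raw if reasoning_valid else False,
    }
```

This is why the behavior change is stricter: items with a correct-looking judgement but an empty generation or empty reasoning trace now score as incorrect.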

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant Metrics as AALCRMetrics._get_score_dict
  participant Pred as prediction
  participant Helper as is_aalcr_correct(judgement,generation)

  Caller->>Metrics: _get_score_dict(prediction)
  Metrics->>Pred: read "generation", "_full_generation", "judgement"
  Note over Metrics: reasoning_valid = non-empty "_full_generation"
  Metrics->>Helper: is_aalcr_correct(prediction["judgement"], prediction["generation"])
  Helper-->>Metrics: judge_correct_raw (bool) or False if generation empty
  alt reasoning_valid
    Metrics-->>Caller: judge_correct = judge_correct_raw
  else
    Metrics-->>Caller: judge_correct = False
  end
sequenceDiagram
  autonumber
  participant Prep as prepare.py
  participant File as dataset entry
  participant Prompt as aalcr.yaml

  Prep->>File: write entry with `question_text` and `original_question = question_text`
  File->>Prompt: fill template using `{original_question}` placeholder
  Prompt-->>Caller: prompt now references original_question

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibbled notes and rewrote clues,
I carried questions in my shoes,
A judgement paired with its full tale,
No empty hops along the trail—
I twitch, I check, then hop away,
The garden's tidy for today. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 60.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title "Fix AA-LCR evaluation" is directly related to the changeset, which addresses multiple aspects of the AA-LCR evaluation system. The PR modifies the correctness check function signature, updates the scoring logic, enhances dataset preparation, and adjusts prompt templates, all of which constitute fixes to the evaluation system. While the title is somewhat broad and does not specify the exact nature of the technical fixes (such as adding generation validation or updating the scoring logic), it clearly communicates the intent and refers to a real part of the change.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26a717e and 0a52c1a.

📒 Files selected for processing (3)
  • nemo_skills/dataset/aalcr/prepare.py (1 hunks)
  • nemo_skills/evaluation/metrics/aalcr_metrics.py (2 hunks)
  • nemo_skills/prompt/config/judge/aalcr.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (4)
nemo_skills/dataset/aalcr/prepare.py (1)

194-194: LGTM! Preserving original question text.

This change correctly stores the original question text before it's wrapped with document context, enabling the judge prompt to reference the unwrapped question.
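A minimal sketch of that idea; the function name and the document-wrapping format below are hypothetical, only the original_question field mirrors the actual change:

```python
def build_entry(question_text: str, document_context: str) -> dict:
    # Preserve the unwrapped question so the judge prompt can reference it,
    # then wrap the question with document context for the generation prompt.
    return {
        "original_question": question_text,
        "question": f"{document_context}\n\nQuestion: {question_text}",
    }
```

The judge then sees only the short original question rather than the full document-wrapped prompt.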

nemo_skills/prompt/config/judge/aalcr.yaml (1)

9-9: LGTM! Aligned with data schema change.

The prompt now correctly references {original_question} to match the new field added in the dataset preparation.
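For illustration, a hypothetical judge-config fragment showing how such a placeholder is referenced; the actual wording and fields of aalcr.yaml differ:

```yaml
# Hypothetical judge prompt fragment; everything besides the
# {original_question} placeholder is illustrative.
user: |-
  Question: {original_question}
  Expected answer: {expected_answer}
  Model answer: {generation}
```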

nemo_skills/evaluation/metrics/aalcr_metrics.py (2)

48-58: LGTM! Signature change adds necessary validation.

The updated signature now accepts generation to validate it's non-empty before checking the judgement. This prevents false positives when the model fails to generate an answer.


67-74: LGTM! Correct gating logic for judge_correct.

The evaluation correctly computes judge_correct_raw using the updated API, then gates the final judge_correct on whether reasoning is valid. This ensures predictions with empty reasoning are marked incorrect regardless of judgement.



coderabbitai (bot) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a3eba8 and 14a8699.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/metrics/aalcr_metrics.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests

fayejf (Collaborator) left a comment

I think generation_valid and reasoning_valid are not easy to understand without knowing the context. Can we come up with better names or add some explanation, in case the logic for processing generations changes in the future and breaks this?

hsiehjackson and others added 2 commits October 15, 2025 17:30
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
hsiehjackson requested a review from fayejf on October 16, 2025 00:33
coderabbitai (bot) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/evaluation/metrics/aalcr_metrics.py (1)

64-65: Guard against None values before calling strip() and len().

If prediction["generation"] is None or missing, line 64 raises AttributeError or KeyError. Similarly, if _full_generation exists but is None, line 65's prediction.get("_full_generation", "") returns None (not the default), causing TypeError when passed to len().

Apply this diff to safely handle None values:

-            correctness_dict["is_generation_empty"] = len(prediction["generation"].strip()) == 0
-            correctness_dict["is_reasoning_empty"] = len(prediction.get("_full_generation", "")) == 0
+            generation_text = prediction.get("generation") or ""
+            reasoning_text = prediction.get("_full_generation") or ""
+            correctness_dict["is_generation_empty"] = len(generation_text.strip()) == 0
+            correctness_dict["is_reasoning_empty"] = len(reasoning_text.strip()) == 0
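The suggested fix leans on Python's `x or ""` idiom, which handles two distinct failure modes: a missing key (where .get returns its default) and a key explicitly set to None (where .get returns None, bypassing the default). A standalone illustration with a hypothetical helper name:

```python
def safe_text(prediction: dict, key: str) -> str:
    # prediction.get(key, "") only covers a *missing* key; if the key exists
    # with a None value, .get returns None and .strip() would raise.
    # `or ""` normalizes both cases to the empty string.
    return prediction.get(key) or ""
```

With this normalization, .strip() and len() can be called unconditionally on the result.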
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14a8699 and 26a717e.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/metrics/aalcr_metrics.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Lint and Format
nemo_skills/evaluation/metrics/aalcr_metrics.py

[error] 1-1: Trailing whitespace detected and fixed by pre-commit hook. The hook failed with exit code 1.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (1)
nemo_skills/evaluation/metrics/aalcr_metrics.py (1)

66-66: LGTM! Logic correctly handles empty generations.

The logic properly separates the raw LLM judgement from the final correctness determination, ensuring that empty generation or reasoning content is marked as incorrect regardless of the LLM judgement. This prevents misleading accuracy metrics from empty responses.

Note: This logic assumes the None handling issue in lines 64-65 is addressed.

Also applies to: 70-73

fayejf and others added 6 commits October 17, 2025 13:29
Signed-off-by: fayejf <fayejf07@gmail.com>
Signed-off-by: fayejf <fayejf07@gmail.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
fayejf (Collaborator) left a comment

LGTM. thanks!

correctness_dict["judge_correct"] = (
    self.is_aalcr_correct(prediction["judgement"]) if is_valid_generation else False
)
# Invalid generation: reasoning in not finished or non-reasoning generation is empty
A collaborator commented on the last line:

typo # Invalid generation: reasoning is not finished or non-reasoning generation is empty

fayejf (Collaborator) commented Oct 24, 2025

This PR depends on #958.

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
hsiehjackson enabled auto-merge (squash) on October 24, 2025 22:18
hsiehjackson merged commit a7e6f71 into main on Oct 24, 2025; 4 of 5 checks passed
hsiehjackson deleted the chsieh/fix-aalcr branch on October 24, 2025 22:33
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: fayejf <fayejf07@gmail.com>
Co-authored-by: fayejf <fayejf07@gmail.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: fayejf <fayejf07@gmail.com>
Co-authored-by: fayejf <fayejf07@gmail.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Labels: none yet · Projects: none yet · 2 participants