
Fix AA-LCR evaluation #952

Merged
hsiehjackson merged 10 commits into main from chsieh/fix-aalcr on Oct 24, 2025

Conversation

hsiehjackson (Collaborator) commented Oct 15, 2025

Summary by CodeRabbit

  • Bug Fixes

    • Evaluation now requires both an answer and its accompanying reasoning to be present before marking a judgment correct, avoiding false positives from empty outputs.
  • Behavior Change

    • Scoring is stricter for incomplete predictions; some items previously marked correct may now be labeled incorrect.
  • New Data

    • Source items now preserve the original question text used in prompts for clearer context.
  • Notes

    • Prompt references updated to use the preserved original question.

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
hsiehjackson requested a review from fayejf on October 15, 2025 20:31
coderabbitai (bot) commented Oct 15, 2025

Walkthrough

Updates AALCR evaluation to pass generation into is_aalcr_correct and gate judge_correct on a new reasoning_valid flag; dataset entries now include original_question; judge prompt replaced {question} with {original_question}.

Changes

  • Metrics validation & API: nemo_skills/evaluation/metrics/aalcr_metrics.py
    Changes the is_aalcr_correct signature to accept (judgement: str, generation: str) and to return False for an empty generation. _get_score_dict computes reasoning_valid from "_full_generation", calls is_aalcr_correct(prediction["judgement"], prediction["generation"]) to obtain judge_correct_raw, stores judge_correct_raw, and sets judge_correct = judge_correct_raw only if reasoning_valid is True, otherwise False.
  • Dataset preparation: nemo_skills/dataset/aalcr/prepare.py
    Adds a new field, original_question, to each output entry, populated from question_text.
  • Prompt configuration: nemo_skills/prompt/config/judge/aalcr.yaml
    Replaces the placeholder {question} with {original_question} in the judge prompt template.
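The gating described above can be sketched as follows. The judgement parsing inside is_aalcr_correct is a stand-in (the real parsing lives in aalcr_metrics.py); only the control flow mirrors the change:

```python
def is_aalcr_correct(judgement: str, generation: str) -> bool:
    # An empty (or whitespace-only) generation can never be judged correct,
    # regardless of what the judge model said.
    if not generation or not generation.strip():
        return False
    # Stand-in for the real judgement parsing in aalcr_metrics.py.
    return "correct" in judgement.lower() and "incorrect" not in judgement.lower()


def get_score_dict(prediction: dict) -> dict:
    # reasoning_valid: the full reasoning trace must be non-empty.
    reasoning_valid = len(prediction.get("_full_generation", "")) > 0
    judge_correct_raw = is_aalcr_correct(prediction["judgement"], prediction["generation"])
    return {
        "judge_correct_raw": judge_correct_raw,
        # Final verdict is gated: empty reasoning forces incorrect,
        # even if the judge marked the answer correct.
        "judge_correct": judge_correct_raw if reasoning_valid else False,
    }
```

This is why the behavior change is stricter: items with a correct-looking judgement but an empty generation or empty reasoning trace now score as incorrect.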

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant Metrics as AALCRMetrics._get_score_dict
  participant Pred as prediction
  participant Helper as is_aalcr_correct(judgement,generation)

  Caller->>Metrics: _get_score_dict(prediction)
  Metrics->>Pred: read "generation", "_full_generation", "judgement"
  Note over Metrics: reasoning_valid = non-empty "_full_generation"
  Metrics->>Helper: is_aalcr_correct(prediction["judgement"], prediction["generation"])
  Helper-->>Metrics: judge_correct_raw (bool) or False if generation empty
  alt reasoning_valid
    Metrics-->>Caller: judge_correct = judge_correct_raw
  else
    Metrics-->>Caller: judge_correct = False
  end
sequenceDiagram
  autonumber
  participant Prep as prepare.py
  participant File as dataset entry
  participant Prompt as aalcr.yaml

  Prep->>File: write entry with `question_text` and `original_question = question_text`
  File->>Prompt: fill template using `{original_question}` placeholder
  Prompt-->>Caller: prompt now references original_question

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibbled notes and rewrote clues,
I carried questions in my shoes,
A judgement paired with its full tale,
No empty hops along the trail—
I twitch, I check, then hop away,
The garden's tidy for today. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 60.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title "Fix AA-LCR evaluation" is directly related to the changeset, which addresses multiple aspects of the AA-LCR evaluation system. The PR modifies the correctness check function signature, updates the scoring logic, enhances dataset preparation, and adjusts prompt templates, all of which constitute fixes to the evaluation system. While the title is somewhat broad and does not specify the exact nature of the technical fixes (such as adding generation validation or updating the scoring logic), it clearly communicates the intent and refers to a real part of the change.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26a717e and 0a52c1a.

📒 Files selected for processing (3)
  • nemo_skills/dataset/aalcr/prepare.py (1 hunks)
  • nemo_skills/evaluation/metrics/aalcr_metrics.py (2 hunks)
  • nemo_skills/prompt/config/judge/aalcr.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (4)
nemo_skills/dataset/aalcr/prepare.py (1)

194-194: LGTM! Preserving original question text.

This change correctly stores the original question text before it's wrapped with document context, enabling the judge prompt to reference the unwrapped question.
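A minimal sketch of that idea; the function name and the document-wrapping format below are hypothetical, only the original_question field mirrors the actual change:

```python
def build_entry(question_text: str, document_context: str) -> dict:
    # Preserve the unwrapped question so the judge prompt can reference it,
    # then wrap the question with document context for the generation prompt.
    return {
        "original_question": question_text,
        "question": f"{document_context}\n\nQuestion: {question_text}",
    }
```

The judge then sees only the short original question rather than the full document-wrapped prompt.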

nemo_skills/prompt/config/judge/aalcr.yaml (1)

9-9: LGTM! Aligned with data schema change.

The prompt now correctly references {original_question} to match the new field added in the dataset preparation.
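For illustration, a hypothetical judge-config fragment showing how such a placeholder is referenced; the actual wording and fields of aalcr.yaml differ:

```yaml
# Hypothetical judge prompt fragment; everything besides the
# {original_question} placeholder is illustrative.
user: |-
  Question: {original_question}
  Expected answer: {expected_answer}
  Model answer: {generation}
```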

nemo_skills/evaluation/metrics/aalcr_metrics.py (2)

48-58: LGTM! Signature change adds necessary validation.

The updated signature now accepts generation to validate it's non-empty before checking the judgement. This prevents false positives when the model fails to generate an answer.


67-74: LGTM! Correct gating logic for judge_correct.

The evaluation correctly computes judge_correct_raw using the updated API, then gates the final judge_correct on whether reasoning is valid. This ensures predictions with empty reasoning are marked incorrect regardless of judgement.



coderabbitai (bot) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a3eba8 and 14a8699.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/metrics/aalcr_metrics.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests

fayejf (Collaborator) left a comment

I think generation_valid and reasoning_valid are not easy to understand without knowing the context. Can we come up with better names or add some explanation, in case the logic for processing generations changes in the future and breaks this?

hsiehjackson and others added 2 commits October 15, 2025 17:30
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
hsiehjackson requested a review from fayejf on October 16, 2025 00:33
coderabbitai (bot) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/evaluation/metrics/aalcr_metrics.py (1)

64-65: Guard against None values before calling strip() and len().

If prediction["generation"] is None or missing, line 64 raises AttributeError or KeyError. Similarly, if _full_generation exists but is None, line 65's prediction.get("_full_generation", "") returns None (not the default), causing TypeError when passed to len().

Apply this diff to safely handle None values:

-            correctness_dict["is_generation_empty"] = len(prediction["generation"].strip()) == 0
-            correctness_dict["is_reasoning_empty"] = len(prediction.get("_full_generation", "")) == 0
+            generation_text = prediction.get("generation") or ""
+            reasoning_text = prediction.get("_full_generation") or ""
+            correctness_dict["is_generation_empty"] = len(generation_text.strip()) == 0
+            correctness_dict["is_reasoning_empty"] = len(reasoning_text.strip()) == 0
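The suggested fix leans on Python's `x or ""` idiom, which handles two distinct failure modes: a missing key (where .get returns its default) and a key explicitly set to None (where .get returns None, bypassing the default). A standalone illustration with a hypothetical helper name:

```python
def safe_text(prediction: dict, key: str) -> str:
    # prediction.get(key, "") only covers a *missing* key; if the key exists
    # with a None value, .get returns None and .strip() would raise.
    # `or ""` normalizes both cases to the empty string.
    return prediction.get(key) or ""
```

With this normalization, .strip() and len() can be called unconditionally on the result.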
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14a8699 and 26a717e.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/metrics/aalcr_metrics.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Lint and Format
nemo_skills/evaluation/metrics/aalcr_metrics.py

[error] 1-1: Trailing whitespace detected and fixed by pre-commit hook. The hook failed with exit code 1.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (1)
nemo_skills/evaluation/metrics/aalcr_metrics.py (1)

66-66: LGTM! Logic correctly handles empty generations.

The logic properly separates the raw LLM judgement from the final correctness determination, ensuring that empty generation or reasoning content is marked as incorrect regardless of the LLM judgement. This prevents misleading accuracy metrics from empty responses.

Note: This logic assumes the None handling issue in lines 64-65 is addressed.

Also applies to: 70-73

fayejf and others added 6 commits October 17, 2025 13:29
Signed-off-by: fayejf <fayejf07@gmail.com>
Signed-off-by: fayejf <fayejf07@gmail.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
fayejf (Collaborator) left a comment

LGTM. thanks!

correctness_dict["judge_correct"] = (
    self.is_aalcr_correct(prediction["judgement"]) if is_valid_generation else False
)
# Invalid generation: reasoning in not finished or non-reasoning generation is empty
A collaborator commented on the last line:

typo # Invalid generation: reasoning is not finished or non-reasoning generation is empty

fayejf (Collaborator) commented Oct 24, 2025

This PR depends on #958.

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
hsiehjackson enabled auto-merge (squash) on October 24, 2025 22:18
hsiehjackson merged commit a7e6f71 into main on Oct 24, 2025; 4 of 5 checks passed
hsiehjackson deleted the chsieh/fix-aalcr branch on October 24, 2025 22:33
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: fayejf <fayejf07@gmail.com>
Co-authored-by: fayejf <fayejf07@gmail.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: fayejf <fayejf07@gmail.com>
Co-authored-by: fayejf <fayejf07@gmail.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Labels: none yet · Projects: none yet · 2 participants