Conversation
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
**Walkthrough**

Updates AALCR evaluation to pass the generation into `is_aalcr_correct` and gate `judge_correct` on a new `reasoning_valid` flag; dataset entries now include an `original_question` field.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    autonumber
    actor Caller
    participant Metrics as AALCRMetrics._get_score_dict
    participant Pred as prediction
    participant Helper as is_aalcr_correct(judgement,generation)
    Caller->>Metrics: _get_score_dict(prediction)
    Metrics->>Pred: read "generation", "_full_generation", "judgement"
    Note over Metrics: reasoning_valid = non-empty "_full_generation"
    Metrics->>Helper: is_aalcr_correct(prediction["judgement"], prediction["generation"])
    Helper-->>Metrics: judge_correct_raw (bool) or False if generation empty
    alt reasoning_valid
        Metrics-->>Caller: judge_correct = judge_correct_raw
    else
        Metrics-->>Caller: judge_correct = False
    end
```
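The gating in the diagram can be sketched as follows. This is a minimal sketch, not the actual implementation: the real judgement parsing lives in `aalcr_metrics.py`, so the substring check inside `is_aalcr_correct` below is only a stand-in.

```python
def get_score_dict(prediction: dict) -> dict:
    """Minimal sketch of the judge_correct gating shown in the diagram."""
    generation = prediction.get("generation") or ""
    full_generation = prediction.get("_full_generation") or ""
    judgement = prediction.get("judgement") or ""

    # reasoning_valid = non-empty "_full_generation"
    reasoning_valid = len(full_generation.strip()) > 0

    def is_aalcr_correct(judgement: str, generation: str) -> bool:
        # Returns False outright when the generation is empty
        if len(generation.strip()) == 0:
            return False
        # Stand-in for the real judgement parsing (assumption, not the actual check)
        return "yes" in judgement.lower()

    judge_correct_raw = is_aalcr_correct(judgement, generation)
    # judge_correct is gated on reasoning_valid
    return {"judge_correct": judge_correct_raw if reasoning_valid else False}
```

With both fields populated and a positive judgement this yields `judge_correct=True`; an empty `_full_generation` forces it to `False` even when the judge said yes.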
```mermaid
sequenceDiagram
    autonumber
    participant Prep as prepare.py
    participant File as dataset entry
    participant Prompt as aalcr.yaml
    Prep->>File: write entry with `question_text` and `original_question = question_text`
    File->>Prompt: fill template using `{original_question}` placeholder
    Prompt-->>Caller: prompt now references original_question
```
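The second flow is equally small; a sketch with an illustrative template string (the real `aalcr.yaml` contents are not shown in this thread, so `PROMPT_TEMPLATE` here is an assumption):

```python
def make_entry(question_text: str) -> dict:
    # prepare.py writes both fields, with original_question mirroring question_text
    return {"question_text": question_text, "original_question": question_text}

# Illustrative stand-in for the aalcr.yaml prompt, which now uses {original_question}
PROMPT_TEMPLATE = "Answer the following question:\n{original_question}"

entry = make_entry("What is 2 + 2?")
prompt = PROMPT_TEMPLATE.format(original_question=entry["original_question"])
print(prompt)
```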
**Estimated code review effort**: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
📜 Review details
📒 Files selected for processing (1)
nemo_skills/evaluation/metrics/aalcr_metrics.py (1 hunks)
⏰ Checks skipped due to the 90000ms timeout (1):
- GitHub Check: unit-tests
**fayejf** left a comment:
I think `generation_valid` and `reasoning_valid` are not very easy to understand if you don't know the context.
Can we come up with better names or add some explanation, in case the logic for processing the generation changes in the future and breaks this?
Actionable comments posted: 1
♻️ Duplicate comments (1)
nemo_skills/evaluation/metrics/aalcr_metrics.py (1)
**64-65: Guard against `None` values before calling `strip()` and `len()`.**

If `prediction["generation"]` is `None` or missing, line 64 raises `AttributeError` or `KeyError`. Similarly, if `_full_generation` exists but is `None`, line 65's `prediction.get("_full_generation", "")` returns `None` (not the default), causing a `TypeError` when passed to `len()`.

Apply this diff to safely handle `None` values:

```diff
- correctness_dict["is_generation_empty"] = len(prediction["generation"].strip()) == 0
- correctness_dict["is_reasoning_empty"] = len(prediction.get("_full_generation", "")) == 0
+ generation_text = prediction.get("generation") or ""
+ reasoning_text = prediction.get("_full_generation") or ""
+ correctness_dict["is_generation_empty"] = len(generation_text.strip()) == 0
+ correctness_dict["is_reasoning_empty"] = len(reasoning_text.strip()) == 0
```
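The pitfall behind this fix is easy to reproduce: `dict.get(key, default)` only returns the default when the key is *absent*; a key present with value `None` comes back as `None`, so the `or ""` idiom is needed to cover both cases.

```python
prediction = {"generation": None, "_full_generation": None}

# Key present with value None: .get returns None, not the "" default
value = prediction.get("_full_generation", "")
print(value)  # None

# The `or ""` idiom normalizes both None values and missing keys
safe = prediction.get("_full_generation") or ""
missing = prediction.get("no_such_key") or ""
print(len(safe.strip()), len(missing.strip()))  # 0 0
```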
📜 Review details
📒 Files selected for processing (1)
nemo_skills/evaluation/metrics/aalcr_metrics.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Lint and Format
nemo_skills/evaluation/metrics/aalcr_metrics.py
[error] 1-1: Trailing whitespace detected and fixed by pre-commit hook. The hook failed with exit code 1.
⏰ Checks skipped due to the 90000ms timeout (1):
- GitHub Check: unit-tests
🔇 Additional comments (1)
nemo_skills/evaluation/metrics/aalcr_metrics.py (1)
**66-66: LGTM! Logic correctly handles empty generations.**

The logic properly separates the raw LLM judgement from the final correctness determination, ensuring that empty generation or reasoning content is marked as incorrect regardless of the LLM judgement. This prevents misleading accuracy metrics from empty responses.
Note: This logic assumes the `None` handling issue in lines 64-65 is addressed.

Also applies to: 70-73
Signed-off-by: fayejf <fayejf07@gmail.com>
```python
correctness_dict["judge_correct"] = (
    self.is_aalcr_correct(prediction["judgement"]) if is_valid_generation else False
)
# Invalid generation: reasoning in not finished or non-reasoning generation is empty
```
typo: should be `# Invalid generation: reasoning is not finished or non-reasoning generation is empty`
This PR is dependent on #958
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> Signed-off-by: fayejf <fayejf07@gmail.com> Co-authored-by: fayejf <fayejf07@gmail.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> Signed-off-by: fayejf <fayejf07@gmail.com> Co-authored-by: fayejf <fayejf07@gmail.com> Signed-off-by: dgitman <dgitman@nvidia.com>
**Summary by CodeRabbit**

- Bug Fixes
- Behavior Change
- New Data
- Notes