Skip to content

replace raise error with LOG.warning in AA LCR dataset prepare#1119

Merged
gwarmstrong merged 3 commits intomainfrom
anowaczynski/aalcr-fix
Dec 17, 2025
Merged

replace raise error with LOG.warning in AA LCR dataset prepare#1119
gwarmstrong merged 3 commits intomainfrom
anowaczynski/aalcr-fix

Conversation

@anowaczynski-nvidia
Copy link
Collaborator

@anowaczynski-nvidia anowaczynski-nvidia commented Dec 16, 2025

Summary by CodeRabbit

  • Bug Fixes

    • Token mismatch errors are now handled as warnings, allowing dataset processing to continue instead of halting with an error.
  • Tests

    • AALCR dataset is now included in the evaluation test suite.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 16, 2025

📝 Walkthrough

Walkthrough

Token mismatch handling in AALCR dataset preparation is modified to issue a warning instead of raising an error, allowing processing to continue. Additionally, the aalcr dataset is enabled for evaluation by removing it from the excluded datasets list.

Changes

Cohort / File(s) Summary
AALCR dataset preparation
nemo_skills/dataset/aalcr/prepare.py
Replaces ValueError on token count mismatch with a warning; processing continues and entries are preserved with added tokenizer fields
Test configuration
tests/gpu-tests/test_eval.py
Removes "aalcr" from EXCLUDED_DATASETS, enabling the dataset for evaluation candidate pool

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • nemo_skills/dataset/aalcr/prepare.py: Verify that demoting the error to a warning does not unintentionally mask legitimate token mismatches or compromise data integrity downstream
  • tests/gpu-tests/test_eval.py: Confirm aalcr is ready for evaluation and that removing it from the exclusion list is intentional and tested

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: replacing a hard error (ValueError) with a warning in the AA LCR dataset preparation code.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch anowaczynski/aalcr-fix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_skills/dataset/aalcr/prepare.py (1)

189-190: Consider enhancing the warning message with entry context.

The warning message could be more actionable by including the entry index or question_id to help track which specific entries have token mismatches. This would aid debugging if issues arise downstream.

Apply this diff to enhance the warning:

-            if n_tokens != entry["input_tokens"]:  # check if the n_tokens exactly match the input_tokens in the entry
-                LOG.warning(f"n_tokens: {n_tokens} != input_tokens: {entry['input_tokens']}")
+            if n_tokens != entry["input_tokens"]:  # check if the n_tokens exactly match the input_tokens in the entry
+                LOG.warning(
+                    f"Token mismatch for entry {entry['index']}: "
+                    f"computed {n_tokens} tokens with {tokenizer_name}, "
+                    f"but dataset reports {entry['input_tokens']} tokens"
+                )

Additionally, you might consider tracking the frequency and magnitude of mismatches to assess whether this is a widespread issue or limited to specific entries.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7116c4 and d1d894e.

📒 Files selected for processing (2)
  • nemo_skills/dataset/aalcr/prepare.py (1 hunks)
  • tests/gpu-tests/test_eval.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/gpu-tests/test_eval.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests

Signed-off-by: George Armstrong <georgea@nvidia.com>
Copy link
Collaborator

@gwarmstrong gwarmstrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @anowaczynski-nvidia !

@gwarmstrong gwarmstrong enabled auto-merge (squash) December 17, 2025 17:42
@gwarmstrong gwarmstrong merged commit f98d6ed into main Dec 17, 2025
5 checks passed
@gwarmstrong gwarmstrong deleted the anowaczynski/aalcr-fix branch December 17, 2025 17:57
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
blahblahasdf pushed a commit to blahblahasdf/Skills that referenced this pull request Jan 8, 2026
…A-NeMo#1119)

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dlord <dlord@nvidia.com>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants