replace raise error with LOG.warning in AA LCR dataset prepare by anowaczynski-nvidia · Pull Request #1119 · NVIDIA-NeMo/Skills

anowaczynski-nvidia · 2025-12-16T22:55:19Z

Summary by CodeRabbit

Bug Fixes
- Token mismatch errors are now handled as warnings, allowing dataset processing to continue instead of halting with an error.
Tests
- AALCR dataset is now included in the evaluation test suite.

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>

coderabbitai · 2025-12-16T22:57:50Z

📝 Walkthrough

Token mismatch handling in AALCR dataset preparation is modified to issue a warning instead of raising an error, allowing processing to continue. Additionally, the aalcr dataset is enabled for evaluation by removing it from the excluded datasets list.
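As a rough illustration of the behavior change described above, the following sketch logs a warning on a token count mismatch and still returns the entry. It is not the code in `nemo_skills/dataset/aalcr/prepare.py`: the function name `annotate_entry`, the logger name, and the added field names `n_tokens` and `tokenizer_name` are assumptions made for this example.

```python
import logging

LOG = logging.getLogger("aalcr_prepare_sketch")  # hypothetical logger name


def annotate_entry(entry: dict, n_tokens: int, tokenizer_name: str) -> dict:
    """Attach tokenizer fields to an entry; warn instead of raising on mismatch."""
    if n_tokens != entry["input_tokens"]:
        # Before this PR, a mismatch raised ValueError and halted preparation.
        LOG.warning("n_tokens: %s != input_tokens: %s", n_tokens, entry["input_tokens"])
    # The entry is kept either way; the added field names are assumed here.
    return {**entry, "n_tokens": n_tokens, "tokenizer_name": tokenizer_name}


# A mismatching entry is preserved rather than aborting the run.
result = annotate_entry({"input_tokens": 10}, 12, "example-tokenizer")
```

The trade-off is that a mismatch no longer stops the run, so downstream consumers have to decide which of the two counts they trust.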

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| AALCR dataset preparation `nemo_skills/dataset/aalcr/prepare.py` | Replaces the ValueError raised on a token count mismatch with a warning; processing continues and entries are preserved with added tokenizer fields. |
| Test configuration `tests/gpu-tests/test_eval.py` | Removes "aalcr" from EXCLUDED_DATASETS, adding the dataset to the evaluation candidate pool. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

- nemo_skills/dataset/aalcr/prepare.py: Verify that demoting the error to a warning does not unintentionally mask legitimate token mismatches or compromise data integrity downstream.
- tests/gpu-tests/test_eval.py: Confirm aalcr is ready for evaluation and that removing it from the exclusion list is intentional and tested.
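To make the second review point concrete, here is a minimal sketch of how an exclusion list can gate which datasets a test considers. The contents and type of the real `EXCLUDED_DATASETS` in `tests/gpu-tests/test_eval.py` are not shown on this page, so the set below and the helper `candidate_datasets` are hypothetical.

```python
# Hypothetical stand-in for the exclusion list in tests/gpu-tests/test_eval.py.
# The real list's contents and type are not shown in this PR page.
EXCLUDED_DATASETS = {"some-excluded-dataset"}  # "aalcr" is no longer listed


def candidate_datasets(all_datasets: list[str]) -> list[str]:
    """Return the datasets eligible for the evaluation test, in input order."""
    return [name for name in all_datasets if name not in EXCLUDED_DATASETS]


# With "aalcr" absent from the exclusion set, it passes the filter.
print(candidate_datasets(["aalcr", "some-excluded-dataset"]))
```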

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: replacing a hard error (ValueError) with a warning in the AA LCR dataset preparation code. |

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

nemo_skills/dataset/aalcr/prepare.py (1)
189-190: Consider enhancing the warning message with entry context.

The warning message could be more actionable by including the entry index or question_id to help track which specific entries have token mismatches. This would aid debugging if issues arise downstream.

Apply this diff to enhance the warning:
-            if n_tokens != entry["input_tokens"]:  # check if the n_tokens exactly match the input_tokens in the entry
-                LOG.warning(f"n_tokens: {n_tokens} != input_tokens: {entry['input_tokens']}")
+            if n_tokens != entry["input_tokens"]:  # check if the n_tokens exactly match the input_tokens in the entry
+                LOG.warning(
+                    f"Token mismatch for entry {entry['index']}: "
+                    f"computed {n_tokens} tokens with {tokenizer_name}, "
+                    f"but dataset reports {entry['input_tokens']} tokens"
+                )
Additionally, you might consider tracking the frequency and magnitude of mismatches to assess whether this is a widespread issue or limited to specific entries.
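
One way to track the frequency and magnitude of mismatches, as suggested above, is to accumulate the differences during preparation and log a single summary at the end. This is only a sketch: the `MismatchStats` class, its field names, and the sample counts are invented for illustration and do not come from the PR.

```python
import logging
from dataclasses import dataclass, field

LOG = logging.getLogger("aalcr_mismatch_sketch")  # hypothetical logger name


@dataclass
class MismatchStats:
    """Accumulates token count differences across dataset entries."""

    total: int = 0
    diffs: list = field(default_factory=list)

    def record(self, computed: int, reported: int) -> None:
        # Count every entry; store the signed difference only on mismatch.
        self.total += 1
        if computed != reported:
            self.diffs.append(computed - reported)

    def summary(self) -> dict:
        n = len(self.diffs)
        return {
            "entries": self.total,
            "mismatches": n,
            "mismatch_rate": n / self.total if self.total else 0.0,
            "max_abs_diff": max((abs(d) for d in self.diffs), default=0),
        }


stats = MismatchStats()
# Made-up (computed, reported) pairs standing in for real entries.
for computed, reported in [(100, 100), (105, 100), (98, 100)]:
    stats.record(computed, reported)
LOG.warning("token mismatch summary: %s", stats.summary())
```

A single summary line makes it easier to judge whether mismatches are isolated or systematic than scanning one warning per entry.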

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7116c4 and d1d894e.

📒 Files selected for processing (2)

nemo_skills/dataset/aalcr/prepare.py (1 hunks)
tests/gpu-tests/test_eval.py (0 hunks)

💤 Files with no reviewable changes (1)

tests/gpu-tests/test_eval.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: pre-commit
GitHub Check: unit-tests

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong

LGTM. Thank you @anowaczynski-nvidia !

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

…A-NeMo#1119) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

anowaczynski-nvidia added 2 commits December 16, 2025 23:48

replace raise error with LOG.warning in AA LCR dataset prepare

8380bbf

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>

remove aalcr from EXCLUDED_DATASETS in test_eval

d1d894e

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>

anowaczynski-nvidia requested review from fayejf and gwarmstrong December 16, 2025 22:55

coderabbitai bot reviewed Dec 16, 2025

anowaczynski-nvidia added the run GPU tests label Dec 17, 2025

Merge branch 'main' into anowaczynski/aalcr-fix

338de8b

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong approved these changes Dec 17, 2025

gwarmstrong enabled auto-merge (squash) December 17, 2025 17:42

gwarmstrong merged commit f98d6ed into main Dec 17, 2025
5 checks passed

gwarmstrong deleted the anowaczynski/aalcr-fix branch December 17, 2025 17:57

coderabbitai bot mentioned this pull request Dec 17, 2025

Revert "Use run.Script for generate pipeline (#1052)" #1125

Merged

gwarmstrong merged 3 commits into main from anowaczynski/aalcr-fix

2 participants
