replace raise error with LOG.warning in AA LCR dataset prepare #1119
gwarmstrong merged 3 commits into main
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
📝 Walkthrough
Token mismatch handling in AALCR dataset preparation is modified to issue a warning instead of raising an error, allowing processing to continue. Additionally, the aalcr dataset is enabled for evaluation by removing it from the excluded datasets list.
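To illustrate the behavior change described in the walkthrough, here is a minimal sketch of a token-count check that logs a warning instead of raising. The helper name `check_token_count` and its boolean return value are assumptions made for this example; they are not taken from `prepare.py`, which performs the check inline.

```python
import logging

LOG = logging.getLogger(__name__)


def check_token_count(entry: dict, n_tokens: int) -> bool:
    """Hypothetical helper: return True when the computed token count
    matches the count reported in the dataset entry."""
    if n_tokens != entry["input_tokens"]:
        # A mismatch is only logged, so dataset preparation keeps going
        # instead of stopping at the first entry that disagrees.
        LOG.warning(f"n_tokens: {n_tokens} != input_tokens: {entry['input_tokens']}")
        return False
    return True
```

With this shape, a caller can keep processing every entry and decide separately what to do with the ones that did not match.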
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_skills/dataset/aalcr/prepare.py (1)
189-190: Consider enhancing the warning message with entry context.
The warning message could be more actionable by including the entry index or question_id to help track which specific entries have token mismatches. This would aid debugging if issues arise downstream.
Apply this diff to enhance the warning:
```diff
-    if n_tokens != entry["input_tokens"]:  # check if the n_tokens exactly match the input_tokens in the entry
-        LOG.warning(f"n_tokens: {n_tokens} != input_tokens: {entry['input_tokens']}")
+    if n_tokens != entry["input_tokens"]:  # check if the n_tokens exactly match the input_tokens in the entry
+        LOG.warning(
+            f"Token mismatch for entry {entry['index']}: "
+            f"computed {n_tokens} tokens with {tokenizer_name}, "
+            f"but dataset reports {entry['input_tokens']} tokens"
+        )
```

Additionally, you might consider tracking the frequency and magnitude of mismatches to assess whether this is a widespread issue or limited to specific entries.
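One possible way to track the frequency and magnitude of mismatches is sketched below. The `MismatchStats` class and its field names are hypothetical and not part of `nemo_skills`; the sketch only assumes that each entry yields a computed and a reported token count.

```python
from dataclasses import dataclass, field


@dataclass
class MismatchStats:
    """Hypothetical accumulator for token-count mismatches."""

    total: int = 0
    diffs: list = field(default_factory=list)

    def record(self, computed: int, reported: int) -> None:
        # Count every entry, but store a signed difference only on mismatch.
        self.total += 1
        if computed != reported:
            self.diffs.append(computed - reported)

    def summary(self) -> dict:
        n = len(self.diffs)
        return {
            "entries": self.total,
            "mismatches": n,
            "mismatch_rate": n / self.total if self.total else 0.0,
            "max_abs_diff": max((abs(d) for d in self.diffs), default=0),
        }
```

Calling `record(n_tokens, entry["input_tokens"])` for each entry and logging `summary()` once at the end would show whether mismatches are widespread or limited to a few entries.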
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/dataset/aalcr/prepare.py (1 hunks)
- tests/gpu-tests/test_eval.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/gpu-tests/test_eval.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
Signed-off-by: George Armstrong <georgea@nvidia.com>
gwarmstrong left a comment
LGTM. Thank you @anowaczynski-nvidia !
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
…A-NeMo#1119) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
Bug Fixes
Tests