Conversation
📝 WalkthroughWalkthroughSingle file refactor of the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/code.py (1)
122-135:<think>stripping is reasonable; minor corner case worth notingThe
partition("</think>")approach is simple and avoids regex, and treating an unterminated<think>as invalid by returning an empty completion is consistent and predictable.One minor corner case: if the model emits multiple
<think>...</think>segments, only content up to the first</think>is removed; any subsequent<think>blocks remain incompletion. If that ever shows up in practice, consider looping until no<think>/</think>pairs remain, or explicitly stripping all such segments.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_skills/evaluation/evaluator/code.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/evaluation/evaluator/code.py (2)
118-121: Robust initialization and normalization look goodUsing
generation_dict.get("generation", "") or ""pluscompletion.replace("\r", "")avoids KeyErrors, handlesNone/empty values, and normalizes line endings without changing semantics. No issues here.
169-173: Whitespace‑stripping toggle aligns well with different evaluation pathsConditionally stripping only at the very end (
if strip_whitespace: completion = completion.strip()) is a clean consolidation of the previous behavior and plays nicely with callers that require exact formatting (e.g., infilling / partial‑solution concatenation) versus those that just need trimmed code. This looks good.
| specific_fence = f"```{language}" | ||
| generic_fence = "```" | ||
|
|
||
| # Find the *last* occurrence of the code block (handles CoT steps) | ||
| start_index = completion.rfind(specific_fence) | ||
| fence_len = len(specific_fence) | ||
|
|
||
| # Fallback to generic fence if specific language tag is missing | ||
| if start_index == -1: | ||
| start_index = completion.rfind(generic_fence) | ||
| fence_len = len(generic_fence) | ||
|
|
||
| if start_index != -1: | ||
| # Move past the opening fence | ||
| content_start = start_index + fence_len | ||
| completion = completion[content_start:] | ||
|
|
||
| # Check for closing fence | ||
| end_index = completion.find(generic_fence) | ||
| if end_index != -1: | ||
| # Valid block found | ||
| completion = completion[:end_index] | ||
| else: | ||
| # STRICT MODE: Opening fence found, but no closing fence. | ||
| # The generation is truncated/incomplete. Discard it. | ||
| completion = "" | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cd nemo_skills && find . -name "code.py" -path "*/evaluation/evaluator/*" | head -5Repository: NVIDIA-NeMo/Skills
Length of output: 92
🏁 Script executed:
cat -n nemo_skills/evaluation/evaluator/code.py | sed -n '135,170p'Repository: NVIDIA-NeMo/Skills
Length of output: 1670
🏁 Script executed:
cat -n nemo_skills/evaluation/evaluator/code.py | sed -n '1,150p' | tail -50Repository: NVIDIA-NeMo/Skills
Length of output: 2326
🏁 Script executed:
grep -n "def.*extract\|def.*fence\|def.*code" nemo_skills/evaluation/evaluator/code.py | head -20Repository: NVIDIA-NeMo/Skills
Length of output: 258
🏁 Script executed:
rg "preprocess_code" nemo_skills/evaluation/evaluator/code.py -B 2 -A 2Repository: NVIDIA-NeMo/Skills
Length of output: 1889
🏁 Script executed:
rg "preprocess_code" nemo_skills/ --type py | head -20Repository: NVIDIA-NeMo/Skills
Length of output: 1105
🏁 Script executed:
find nemo_skills -name "*test*.py" -path "*/evaluation/*" | head -10Repository: NVIDIA-NeMo/Skills
Length of output: 44
🏁 Script executed:
grep -B 5 -A 5 'preprocess_code.*language' nemo_skills/evaluation/evaluator/livecodebench.py | head -30Repository: NVIDIA-NeMo/Skills
Length of output: 533
🏁 Script executed:
cat -n nemo_skills/evaluation/evaluator/livecodebench.py | grep -B 10 "preprocess_code.*language"Repository: NVIDIA-NeMo/Skills
Length of output: 683
🏁 Script executed:
rg "_preprocess_and_validate_file" nemo_skills/evaluation/evaluator/livecodebench.py -B 2 -A 2Repository: NVIDIA-NeMo/Skills
Length of output: 820
🏁 Script executed:
python3 << 'EOF'
# Simulate the fence extraction logic with the issue and the fix
def test_current_logic(completion, language):
"""Current (broken) logic"""
specific_fence = f"```{language}"
generic_fence = "```"
start_index = completion.rfind(specific_fence)
fence_len = len(specific_fence)
if start_index == -1:
start_index = completion.rfind(generic_fence)
fence_len = len(generic_fence)
if start_index != -1:
content_start = start_index + fence_len
result = completion[content_start:]
return result
return ""
def test_proposed_fix(completion, language):
"""Proposed fix"""
specific_fence = f"```{language}"
generic_fence = "```"
start_index = completion.rfind(specific_fence)
fence_len = len(specific_fence)
if start_index == -1:
start_index = completion.rfind(generic_fence)
fence_len = len(generic_fence)
if start_index != -1:
# Skip the whole fence header line
newline_idx = completion.find("\n", start_index + fence_len)
if newline_idx == -1:
# Inline fences - fall back to original behavior
content_start = start_index + fence_len
else:
content_start = newline_idx + 1
result = completion[content_start:]
return result
return ""
# Test cases
test_cases = [
("```python3\nprint('hi')\n```", "python", "Issue case: python3 with language=python"),
("```python\nprint('hi')\n```", "python", "Normal case: matching language"),
("```cpp20\nint main() { return 0; }\n```", "cpp", "Issue case: cpp20 with language=cpp"),
("```javascript-es6\nconsole.log('hi');\n```", "javascript", "Issue case: javascript-es6 with language=javascript"),
("```\nplain code\n```", "python", "Generic fence fallback"),
]
print("=" * 80)
for completion, lang, desc in test_cases:
print(f"\nTest: {desc}")
print(f"Input: {repr(completion)}, language={repr(lang)}")
current = test_current_logic(completion, lang)
proposed = test_proposed_fix(completion, lang)
print(f"Current logic result: {repr(current[:30])}")
print(f"Proposed fix result: {repr(proposed[:30])}")
if current != proposed:
print("⚠️ DIFFERENCE FOUND")
print(f" Current starts with: {repr(current.split(chr(10))[0])}")
print(f" Proposed starts with: {repr(proposed.split(chr(10))[0])}")
EOFRepository: NVIDIA-NeMo/Skills
Length of output: 148
🏁 Script executed:
cat -n nemo_skills/evaluation/evaluator/code.py | sed -n '139,165p'Repository: NVIDIA-NeMo/Skills
Length of output: 1197
🏁 Script executed:
python3 << 'EOF'
# Quick manual verification of the fix logic
# Example 1: ```python3 header (the main issue)
completion = "```python3\nprint('hi')\n```"
language = "python"
# Current broken logic:
specific_fence = f"```{language}" # "```python"
start_index = completion.rfind(specific_fence) # 0
fence_len = len(specific_fence) # 10
content_start = start_index + fence_len # 10
result_broken = completion[content_start:] # "3\nprint('hi')\n```"
# Proposed fix:
newline_idx = completion.find("\n", start_index + fence_len) # finds \n at position 10
content_start = newline_idx + 1 # 11
result_fixed = completion[content_start:] # "print('hi')\n```"
print("Example 1: ```python3 with language='python'")
print(f"Broken result starts with: {repr(result_broken[:20])}")
print(f"Fixed result starts with: {repr(result_fixed[:20])}")
print(f"Issue found: {result_broken[0] == '3'}")
print()
# Example 2: Normal case (should work the same)
completion2 = "```python\nprint('hi')\n```"
language2 = "python"
specific_fence2 = f"```{language2}" # "```python"
start_index2 = completion2.rfind(specific_fence2) # 0
fence_len2 = len(specific_fence2) # 9
newline_idx2 = completion2.find("\n", start_index2 + fence_len2) # finds \n at position 9
content_start2 = newline_idx2 + 1 # 10
result_fixed2 = completion2[content_start2:] # "print('hi')\n```"
print("Example 2: ```python with language='python' (normal case)")
print(f"Fixed result starts with: {repr(result_fixed2[:20])}")
EOFRepository: NVIDIA-NeMo/Skills
Length of output: 325
Language‑specific fence handling mis‑slices headers like ```python3; skip the entire fence line
The current code combines rfind(specific_fence) with content_start = start_index + fence_len, which fails when model output includes version-tagged headers. For example, with language="python" and a generated header like \``python3, the logic finds ```python(10 chars) and advancescontent_startby 10, landing before the3and producing extracted code starting with3\nprint(...)`, breaking execution.
This occurs across any language with version suffixes: cpp20, javascript-es6, etc. The fix is to skip the entire opening-fence line (up to and including the newline):
- if start_index != -1:
- # Move past the opening fence
- content_start = start_index + fence_len
- completion = completion[content_start:]
+ if start_index != -1:
+ # Skip the whole fence header line so headers like ```python3 or ```cpp
+ # are not partially left in the extracted code.
+ newline_idx = completion.find("\n", start_index + fence_len)
+ if newline_idx == -1:
+ # Inline fences like ```python print("hi")``` — fall back to the
+ # original behavior and just move past the matched fence substring.
+ content_start = start_index + fence_len
+ else:
+ content_start = newline_idx + 1
+ completion = completion[content_start:]This preserves the "take the last fenced block" behavior while avoiding stray fence‑header fragments.
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/code.py around lines 139-165, the current
logic slices starting at start_index + len(specific_fence) which can leave
partial fence headers (e.g., ```python3) at the start of extracted code; change
the slicing to skip the entire opening fence line by locating the next newline
after the matched fence and set content_start to newline_idx + 1 (falling back
to start_index + fence_len only if no newline is found), then continue with the
existing closing-fence search so extracted code does not include trailing
version/tag fragments.
|
@gwarmstrong can you approve this PR? It is verified and working as expected. |
|
@wasiahmad please fix signoff with instructions here: https://github.com/NVIDIA-NeMo/Skills/pull/1086/checks?check_run_id=57834521539 |
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: i-vainn <imoshkov@nvidia.com> Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Co-authored-by: Nick Ludwig <nliudvig@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Ivan <imoshkov@nvidia.com> Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com> Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
This PR fixes code extraction login from LLM generation. While evaluating DeepSeek v3.2 models, we found that models may generate multiple code blocks. Therefore, we need to extract the last code block from the generation.
Summary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.