updating code extraction logic #1086

Merged
gwarmstrong merged 13 commits into main from lcb_fix
Dec 11, 2025

Conversation

@wasiahmad
Collaborator

@wasiahmad wasiahmad commented Dec 9, 2025

This PR fixes the code extraction logic for LLM generations. While evaluating DeepSeek v3.2 models, we found that a model may generate multiple code blocks, so we need to extract the last code block from the generation.
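
To make the "last code block" idea concrete, here is a minimal sketch. It is not the PR's implementation: it swaps in a regex-based approach, and the names `FENCE_RE` and `last_code_block` are hypothetical. It collects every complete fenced block and keeps only the final one.

```python
import re

# Three backticks, built indirectly so the example contains no literal fence.
FENCE = "`" * 3

# Hypothetical pattern: an opening fence with an optional tag, then the body,
# then a closing fence. DOTALL lets the body span several lines.
FENCE_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def last_code_block(generation: str) -> str:
    """Return the body of the last complete fenced block, or "" if none exists."""
    blocks = FENCE_RE.findall(generation)
    return blocks[-1] if blocks else ""


sample = (
    "First attempt:\n" + FENCE + "python\nprint('draft')\n" + FENCE + "\n"
    "Final answer:\n" + FENCE + "python\nprint('final')\n" + FENCE + "\n"
)
print(repr(last_code_block(sample)))
```

A model that revises its answer mid-generation emits the draft first, so taking the last block is what selects the final answer here.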

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved robustness of code extraction to gracefully handle edge cases and incomplete inputs.
    • Enhanced code block detection with language-aware parsing and fallback mechanisms for malformed code fences.
    • Fixed handling of incomplete or missing code block closures.


@coderabbitai
Contributor

coderabbitai bot commented Dec 9, 2025

Walkthrough

Single file refactor of the preprocess_code function to improve robustness. Changes include replacing regex-based thinking-trace extraction with a partition-based approach, introducing language-aware code fence detection with fallback logic, enforcing stricter fence closure validation, and consolidating parsing branches into a single cleanup step.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Code preprocessing robustness: nemo_skills/evaluation/evaluator/code.py | Reworked preprocess_code function with improved edge-case handling: replaced regex-based thinking-trace extraction with partition-based approach; introduced language-aware code fence extraction with fallback to generic fences; enforced strict fence closure validation; consolidated ad-hoc parsing branches into single cleanup step; added graceful handling of missing generation data with defaults and carriage return stripping |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Language-aware fence extraction logic: Verify that the fallback from the language-specific fence (```<language>) to the generic fence (```) works correctly and doesn't miss valid code blocks
  • Partition-based thinking block extraction: Confirm that the new approach correctly handles edge cases with unclosed or malformed <think> tags
  • Fence closure validation: Check that missing closing fences are handled consistently and result in expected empty-completion behavior
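
The fence closure point can be illustrated with a small sketch. The helper name `take_until_closing_fence` is hypothetical, and the sketch assumes the text passed in is whatever follows the opening fence. It returns the block body when a closing fence exists and an empty completion when the generation was cut off.

```python
# Three backticks, built indirectly so the example contains no literal fence.
FENCE = "`" * 3


def take_until_closing_fence(after_opening: str) -> str:
    """Return the text before the closing fence, or "" if the block is unclosed."""
    end_index = after_opening.find(FENCE)
    if end_index == -1:
        # No closing fence: treat the generation as truncated and discard it.
        return ""
    return after_opening[:end_index]


closed = "x = 1\n" + FENCE + "\ntrailing prose"
truncated = "x = 1\nprint(x"
print(repr(take_until_closing_fence(closed)))
print(repr(take_until_closing_fence(truncated)))
```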

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'updating code extraction logic' directly relates to the changeset, which exclusively modifies the code extraction logic in the evaluator. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/code.py (1)

122-135: <think> stripping is reasonable; minor corner case worth noting

The partition("</think>") approach is simple and avoids regex, and treating an unterminated <think> as invalid by returning an empty completion is consistent and predictable.

One minor corner case: if the model emits multiple <think>...</think> segments, only content up to the first </think> is removed; any subsequent <think> blocks remain in completion. If that ever shows up in practice, consider looping until no <think>/</think> pairs remain, or explicitly stripping all such segments.
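
The looping variant suggested above could look roughly like the sketch below. This is an assumption about one way to do it, not the code in the PR, and `strip_thinking` is a hypothetical name. Note the design choice: everything up to each closing tag is dropped, so text sitting between two thinking segments is discarded as well.

```python
def strip_thinking(completion: str) -> str:
    """Drop everything up to the last closing think tag.

    Returns "" when an opening tag is left without a matching close,
    mirroring the "unterminated trace is invalid" rule described above.
    """
    # Repeat the partition so several think segments are all removed.
    while "</think>" in completion:
        _, _, completion = completion.partition("</think>")
    # An opening tag that survives the loop was never closed.
    if "<think>" in completion:
        return ""
    return completion


print(repr(strip_thinking("<think>a</think>mid<think>b</think>final")))
print(repr(strip_thinking("<think>never closed")))
```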

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 774cef6 and 018bbf6.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/evaluator/code.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/evaluation/evaluator/code.py (2)

118-121: Robust initialization and normalization look good

Using generation_dict.get("generation", "") or "" plus completion.replace("\r", "") avoids KeyErrors, handles None/empty values, and normalizes line endings without changing semantics. No issues here.
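
As a small illustration of why the `or ""` matters, the sketch below wraps the two expressions quoted above in a hypothetical helper, `read_generation`. Only the `.get(...) or ""` and `.replace("\r", "")` calls come from the review text; the wrapper itself is an assumption.

```python
def read_generation(generation_dict: dict) -> str:
    """Return the generation text with carriage returns removed, defaulting to ""."""
    # .get covers a missing key; `or ""` additionally covers an explicit None.
    completion = generation_dict.get("generation", "") or ""
    return completion.replace("\r", "")


print(repr(read_generation({})))
print(repr(read_generation({"generation": None})))
print(repr(read_generation({"generation": "a = 1\r\nb = 2\r\n"})))
```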


169-173: Whitespace‑stripping toggle aligns well with different evaluation paths

Conditionally stripping only at the very end (if strip_whitespace: completion = completion.strip()) is a clean consolidation of the previous behavior and plays nicely with callers that require exact formatting (e.g., infilling / partial‑solution concatenation) versus those that just need trimmed code. This looks good.
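
A minimal sketch of the toggle, assuming a hypothetical helper named `finalize` (only the `strip_whitespace` flag and the `completion.strip()` call come from the review text). It shows why an infilling-style caller would want stripping turned off: leading indentation is part of the code.

```python
def finalize(completion: str, *, strip_whitespace: bool) -> str:
    """Apply the optional whitespace strip as the very last step."""
    if strip_whitespace:
        completion = completion.strip()
    return completion


prefix = "def f(x):\n"
body = "    return x + 1\n"

# Exact formatting preserved, so the concatenated function stays valid Python.
print(prefix + finalize(body, strip_whitespace=False))
# Trimmed form, suitable when the completion is evaluated on its own.
print(repr(finalize(body, strip_whitespace=True)))
```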

Comment on lines +139 to 165
    specific_fence = f"```{language}"
    generic_fence = "```"

    # Find the *last* occurrence of the code block (handles CoT steps)
    start_index = completion.rfind(specific_fence)
    fence_len = len(specific_fence)

    # Fallback to generic fence if specific language tag is missing
    if start_index == -1:
        start_index = completion.rfind(generic_fence)
        fence_len = len(generic_fence)

    if start_index != -1:
        # Move past the opening fence
        content_start = start_index + fence_len
        completion = completion[content_start:]

        # Check for closing fence
        end_index = completion.find(generic_fence)
        if end_index != -1:
            # Valid block found
            completion = completion[:end_index]
        else:
            # STRICT MODE: Opening fence found, but no closing fence.
            # The generation is truncated/incomplete. Discard it.
            completion = ""

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd nemo_skills && find . -name "code.py" -path "*/evaluation/evaluator/*" | head -5

Repository: NVIDIA-NeMo/Skills

Length of output: 92


🏁 Script executed:

cat -n nemo_skills/evaluation/evaluator/code.py | sed -n '135,170p'

Repository: NVIDIA-NeMo/Skills

Length of output: 1670


🏁 Script executed:

cat -n nemo_skills/evaluation/evaluator/code.py | sed -n '1,150p' | tail -50

Repository: NVIDIA-NeMo/Skills

Length of output: 2326


🏁 Script executed:

grep -n "def.*extract\|def.*fence\|def.*code" nemo_skills/evaluation/evaluator/code.py | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 258


🏁 Script executed:

rg "preprocess_code" nemo_skills/evaluation/evaluator/code.py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1889


🏁 Script executed:

rg "preprocess_code" nemo_skills/ --type py | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 1105


🏁 Script executed:

find nemo_skills -name "*test*.py" -path "*/evaluation/*" | head -10

Repository: NVIDIA-NeMo/Skills

Length of output: 44


🏁 Script executed:

grep -B 5 -A 5 'preprocess_code.*language' nemo_skills/evaluation/evaluator/livecodebench.py | head -30

Repository: NVIDIA-NeMo/Skills

Length of output: 533


🏁 Script executed:

cat -n nemo_skills/evaluation/evaluator/livecodebench.py | grep -B 10 "preprocess_code.*language"

Repository: NVIDIA-NeMo/Skills

Length of output: 683


🏁 Script executed:

rg "_preprocess_and_validate_file" nemo_skills/evaluation/evaluator/livecodebench.py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 820


🏁 Script executed:

python3 << 'EOF'
# Simulate the fence extraction logic with the issue and the fix

def test_current_logic(completion, language):
    """Current (broken) logic"""
    specific_fence = f"```{language}"
    generic_fence = "```"
    
    start_index = completion.rfind(specific_fence)
    fence_len = len(specific_fence)
    
    if start_index == -1:
        start_index = completion.rfind(generic_fence)
        fence_len = len(generic_fence)
    
    if start_index != -1:
        content_start = start_index + fence_len
        result = completion[content_start:]
        return result
    return ""

def test_proposed_fix(completion, language):
    """Proposed fix"""
    specific_fence = f"```{language}"
    generic_fence = "```"
    
    start_index = completion.rfind(specific_fence)
    fence_len = len(specific_fence)
    
    if start_index == -1:
        start_index = completion.rfind(generic_fence)
        fence_len = len(generic_fence)
    
    if start_index != -1:
        # Skip the whole fence header line
        newline_idx = completion.find("\n", start_index + fence_len)
        if newline_idx == -1:
            # Inline fences - fall back to original behavior
            content_start = start_index + fence_len
        else:
            content_start = newline_idx + 1
        result = completion[content_start:]
        return result
    return ""

# Test cases
test_cases = [
    ("```python3\nprint('hi')\n```", "python", "Issue case: python3 with language=python"),
    ("```python\nprint('hi')\n```", "python", "Normal case: matching language"),
    ("```cpp20\nint main() { return 0; }\n```", "cpp", "Issue case: cpp20 with language=cpp"),
    ("```javascript-es6\nconsole.log('hi');\n```", "javascript", "Issue case: javascript-es6 with language=javascript"),
    ("```\nplain code\n```", "python", "Generic fence fallback"),
]

print("=" * 80)
for completion, lang, desc in test_cases:
    print(f"\nTest: {desc}")
    print(f"Input: {repr(completion)}, language={repr(lang)}")
    
    current = test_current_logic(completion, lang)
    proposed = test_proposed_fix(completion, lang)
    
    print(f"Current logic result: {repr(current[:30])}")
    print(f"Proposed fix result:  {repr(proposed[:30])}")
    
    if current != proposed:
        print("⚠️  DIFFERENCE FOUND")
        print(f"   Current starts with: {repr(current.split(chr(10))[0])}")
        print(f"   Proposed starts with: {repr(proposed.split(chr(10))[0])}")

EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 148


🏁 Script executed:

cat -n nemo_skills/evaluation/evaluator/code.py | sed -n '139,165p'

Repository: NVIDIA-NeMo/Skills

Length of output: 1197


🏁 Script executed:

python3 << 'EOF'
# Quick manual verification of the fix logic

# Example 1: ```python3 header (the main issue)
completion = "```python3\nprint('hi')\n```"
language = "python"

# Current broken logic:
specific_fence = f"```{language}"  # "```python"
start_index = completion.rfind(specific_fence)  # 0
fence_len = len(specific_fence)  # 9
content_start = start_index + fence_len  # 9
result_broken = completion[content_start:]  # "3\nprint('hi')\n```"

# Proposed fix:
newline_idx = completion.find("\n", start_index + fence_len)  # finds \n at position 10
content_start = newline_idx + 1  # 11
result_fixed = completion[content_start:]  # "print('hi')\n```"

print("Example 1: ```python3 with language='python'")
print(f"Broken result starts with: {repr(result_broken[:20])}")
print(f"Fixed result starts with:  {repr(result_fixed[:20])}")
print(f"Issue found: {result_broken[0] == '3'}")
print()

# Example 2: Normal case (should work the same)
completion2 = "```python\nprint('hi')\n```"
language2 = "python"

specific_fence2 = f"```{language2}"  # "```python"
start_index2 = completion2.rfind(specific_fence2)  # 0
fence_len2 = len(specific_fence2)  # 9
newline_idx2 = completion2.find("\n", start_index2 + fence_len2)  # finds \n at position 9
content_start2 = newline_idx2 + 1  # 10
result_fixed2 = completion2[content_start2:]  # "print('hi')\n```"

print("Example 2: ```python with language='python' (normal case)")
print(f"Fixed result starts with: {repr(result_fixed2[:20])}")
EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 325


Language‑specific fence handling mis‑slices headers like ```python3; skip the entire fence line

The current code combines rfind(specific_fence) with content_start = start_index + fence_len, which fails when model output includes version-tagged headers. For example, with language="python" and a generated header like ```python3, the logic finds ```python (9 chars) and advances content_start by 9, landing just before the 3 and producing extracted code that starts with 3\nprint(...), which breaks execution.

This occurs across any language with version suffixes: cpp20, javascript-es6, etc. The fix is to skip the entire opening-fence line (up to and including the newline):

-    if start_index != -1:
-        # Move past the opening fence
-        content_start = start_index + fence_len
-        completion = completion[content_start:]
+    if start_index != -1:
+        # Skip the whole fence header line so headers like ```python3 or ```cpp
+        # are not partially left in the extracted code.
+        newline_idx = completion.find("\n", start_index + fence_len)
+        if newline_idx == -1:
+            # Inline fences like ```python print("hi")``` — fall back to the
+            # original behavior and just move past the matched fence substring.
+            content_start = start_index + fence_len
+        else:
+            content_start = newline_idx + 1
+        completion = completion[content_start:]

This preserves the "take the last fenced block" behavior while avoiding stray fence‑header fragments.
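
Put together, the suggested behaviour might look like the sketch below. It is a hedged illustration rather than the merged code: `extract_tagged_block` is a hypothetical name, and the generic-fence fallback is deliberately left out so the example covers only the language-tagged path discussed in this comment.

```python
# Three backticks, built indirectly so the example contains no literal fence.
FENCE = "`" * 3


def extract_tagged_block(completion: str, language: str) -> str:
    """Return the body of the last language-tagged block, or "" if absent or unclosed."""
    specific_fence = FENCE + language
    start_index = completion.rfind(specific_fence)
    if start_index == -1:
        return ""  # generic-fence fallback omitted from this sketch

    # Skip the whole header line so a suffix such as the "3" in python3 is dropped.
    newline_idx = completion.find("\n", start_index + len(specific_fence))
    if newline_idx == -1:
        # Inline fence: just move past the matched fence substring.
        content_start = start_index + len(specific_fence)
    else:
        content_start = newline_idx + 1

    body = completion[content_start:]
    end_index = body.find(FENCE)
    # Strict mode: an unclosed block yields an empty completion.
    return body[:end_index] if end_index != -1 else ""


text = (
    "Draft:\n" + FENCE + "python3\nprint('a')\n" + FENCE + "\n"
    "Final:\n" + FENCE + "python3\nprint('b')\n" + FENCE
)
print(repr(extract_tagged_block(text, "python")))
```

With `language="python"` and `python3` headers, the extracted body starts at `print('b')` instead of at a stray `3`.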

🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/code.py around lines 139-165, the current
logic slices starting at start_index + len(specific_fence) which can leave
partial fence headers (e.g., ```python3) at the start of extracted code; change
the slicing to skip the entire opening fence line by locating the next newline
after the matched fence and set content_start to newline_idx + 1 (falling back
to start_index + fence_len only if no newline is found), then continue with the
existing closing-fence search so extracted code does not include trailing
version/tag fragments.

@wasiahmad
Collaborator Author

@gwarmstrong can you approve this PR? It is verified and working as expected.

@gwarmstrong
Collaborator

@wasiahmad please fix signoff with instructions here: https://github.com/NVIDIA-NeMo/Skills/pull/1086/checks?check_run_id=57834521539
good to merge after that

wasiahmad and others added 12 commits December 11, 2025 14:21
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Ivan <imoshkov@nvidia.com>
Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
@gwarmstrong gwarmstrong enabled auto-merge (squash) December 11, 2025 22:24
@gwarmstrong gwarmstrong merged commit 699dfaa into main Dec 11, 2025
5 checks passed
@gwarmstrong gwarmstrong deleted the lcb_fix branch December 11, 2025 22:41
wasiahmad added a commit that referenced this pull request Dec 12, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
wasiahmad added a commit that referenced this pull request Dec 19, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
wasiahmad added a commit that referenced this pull request Dec 19, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
wasiahmad added a commit that referenced this pull request Feb 4, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
6 participants