updating code extraction logic by wasiahmad · Pull Request #1086 · NVIDIA-NeMo/Skills

wasiahmad · 2025-12-09T20:32:05Z

This PR fixes the code extraction logic for LLM generations. While evaluating DeepSeek v3.2 models, we found that a model may generate multiple code blocks, so we need to extract the last code block from the generation.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved robustness of code extraction to gracefully handle edge cases and incomplete inputs.
- Enhanced code block detection with language-aware parsing and fallback mechanisms for malformed code fences.
- Fixed handling of incomplete or missing code block closures.

coderabbitai · 2025-12-09T20:33:58Z

📝 Walkthrough

Single file refactor of the preprocess_code function to improve robustness. Changes include replacing regex-based thinking-trace extraction with a partition-based approach, introducing language-aware code fence detection with fallback logic, enforcing stricter fence closure validation, and consolidating parsing branches into a single cleanup step.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Code preprocessing robustness `nemo_skills/evaluation/evaluator/code.py` | Reworked `preprocess_code` function with improved edge-case handling: replaced regex-based thinking-trace extraction with partition-based approach; introduced language-aware code fence extraction with fallback to generic fences; enforced strict fence closure validation; consolidated ad-hoc parsing branches into single cleanup step; added graceful handling of missing generation data with defaults and carriage return stripping |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

- Language-aware fence extraction logic: Verify that the fallback from the language-specific fence (```<language>) to the generic fence (```) works correctly and doesn't miss valid code blocks
- Partition-based thinking block extraction: Confirm that the new approach correctly handles edge cases with unclosed or malformed <think> tags
- Fence closure validation: Check that missing closing fences are handled consistently and result in expected empty-completion behavior

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'updating code extraction logic' directly relates to the changeset, which exclusively modifies the code extraction logic in the evaluator. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

nemo_skills/evaluation/evaluator/code.py (1)

122-135: <think> stripping is reasonable; minor corner case worth noting

The partition("</think>") approach is simple and avoids regex, and treating an unterminated <think> as invalid by returning an empty completion is consistent and predictable.

One minor corner case: if the model emits multiple <think>...</think> segments, only content up to the first </think> is removed; any subsequent <think> blocks remain in completion. If that ever shows up in practice, consider looping until no <think>/</think> pairs remain, or explicitly stripping all such segments.
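The corner case can be sketched as follows. This is a minimal illustration, not the code from the PR: `strip_think_once` and `strip_think_all` are hypothetical helper names, and the check for an opening `<think>` tag is an assumption about how the single-pass variant decides whether to strip anything.

```python
def strip_think_once(completion: str) -> str:
    """Single-pass variant: drop everything up to the first </think>."""
    if "<think>" not in completion:
        return completion
    _, sep, after = completion.partition("</think>")
    # An unterminated <think> is treated as invalid: return an empty completion.
    return after if sep else ""


def strip_think_all(completion: str) -> str:
    """Loop variant: repeat until no <think> opener remains."""
    while "<think>" in completion:
        before, _, rest = completion.partition("<think>")
        _, sep, after = rest.partition("</think>")
        if not sep:
            return ""  # unterminated block, same policy as the single-pass variant
        completion = before + after
    return completion


text = "<think>plan A</think>draft <think>plan B</think>final"
print(repr(strip_think_once(text)))  # 'draft <think>plan B</think>final'
print(repr(strip_think_all(text)))   # 'draft final'
```

The loop variant keeps the "unterminated means empty" policy, so the two helpers differ only when more than one thinking segment is present.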

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 774cef6 and 018bbf6.

📒 Files selected for processing (1)

nemo_skills/evaluation/evaluator/code.py (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

🔇 Additional comments (2)

nemo_skills/evaluation/evaluator/code.py (2)

118-121: Robust initialization and normalization look good

Using `generation_dict.get("generation", "") or ""` plus `completion.replace("\r", "")` avoids KeyErrors, handles None/empty values, and normalizes line endings without changing semantics. No issues here.
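A small sketch of that initialization pattern, assuming a plain dict input. The wrapper name `read_generation` is hypothetical; only the two expressions inside it come from the comment above.

```python
def read_generation(generation_dict: dict) -> str:
    """Hypothetical helper mirroring the initialization described above."""
    # .get() avoids a KeyError for a missing key; `or ""` also maps None to "".
    completion = generation_dict.get("generation", "") or ""
    # Remove carriage returns so "\r\n" line endings become plain "\n".
    return completion.replace("\r", "")


for record in ({}, {"generation": None}, {"generation": "a = 1\r\nb = 2"}):
    print(repr(read_generation(record)))
```

All three records produce a string, so later string operations such as `rfind` never see `None`.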

169-173: Whitespace‑stripping toggle aligns well with different evaluation paths

Conditionally stripping only at the very end (`if strip_whitespace: completion = completion.strip()`) is a clean consolidation of the previous behavior and plays nicely with callers that require exact formatting (e.g., infilling / partial‑solution concatenation) versus those that just need trimmed code. This looks good.
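To illustrate why some callers need the toggle, here is a hedged sketch. The function name `finalize` and the partial-solution strings are made up for the example; only the `strip_whitespace` flag and the final `strip()` call come from the comment above.

```python
def finalize(completion: str, strip_whitespace: bool = True) -> str:
    """Hypothetical final cleanup step: strip only when asked to."""
    if strip_whitespace:
        completion = completion.strip()
    return completion


prefix = "def add(a, b):\n"   # partial solution supplied by the benchmark
body = "    return a + b\n"   # model output that must keep its indentation

# Exact formatting preserved: the concatenated source compiles.
compile(prefix + finalize(body, strip_whitespace=False), "<candidate>", "exec")

# The stripped body loses its indentation, so the same concatenation fails.
try:
    compile(prefix + finalize(body), "<candidate>", "exec")
except IndentationError as err:
    print("stripping broke the concatenation:", err.msg)
```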

coderabbitai · 2025-12-09T20:48:03Z

nemo_skills/evaluation/evaluator/code.py

+    specific_fence = f"```{language}"
+    generic_fence = "```"
+
+    # Find the *last* occurrence of the code block (handles CoT steps)
+    start_index = completion.rfind(specific_fence)
+    fence_len = len(specific_fence)
+
+    # Fallback to generic fence if specific language tag is missing
+    if start_index == -1:
+        start_index = completion.rfind(generic_fence)
+        fence_len = len(generic_fence)
+
+    if start_index != -1:
+        # Move past the opening fence
+        content_start = start_index + fence_len
+        completion = completion[content_start:]
+
+        # Check for closing fence
+        end_index = completion.find(generic_fence)
+        if end_index != -1:
+            # Valid block found
+            completion = completion[:end_index]
+        else:
+            # STRICT MODE: Opening fence found, but no closing fence.
+            # The generation is truncated/incomplete. Discard it.
+            completion = ""
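For experimentation, the quoted hunk can be wrapped in a standalone function. This is a sketch under assumptions: the name `extract_last_block` and the sample strings are hypothetical, the no-fence branch simply returns the input because the hunk does not show anything else, and `FENCE` is built programmatically only to keep literal fences out of the example.

```python
FENCE = "`" * 3  # three backticks, built programmatically


def extract_last_block(completion: str, language: str = "python") -> str:
    """Hypothetical standalone copy of the quoted hunk."""
    specific_fence = FENCE + language
    generic_fence = FENCE

    # rfind selects the last opening fence, so earlier draft blocks are ignored.
    start_index = completion.rfind(specific_fence)
    fence_len = len(specific_fence)
    if start_index == -1:
        start_index = completion.rfind(generic_fence)
        fence_len = len(generic_fence)

    if start_index == -1:
        return completion  # no fence found: the hunk leaves the text untouched

    completion = completion[start_index + fence_len:]
    end_index = completion.find(generic_fence)
    # Strict mode: an opening fence without a closing fence yields "".
    return completion[:end_index] if end_index != -1 else ""


two_blocks = f"Draft:\n{FENCE}python\nx = 1\n{FENCE}\nFinal:\n{FENCE}python\nx = 2\n{FENCE}\n"
truncated = f"{FENCE}python\nx = 3"
untagged = f"{FENCE}\nx = 4\n{FENCE}"

print(repr(extract_last_block(two_blocks)))  # '\nx = 2\n'
print(repr(extract_last_block(truncated)))   # ''
print(repr(extract_last_block(untagged)))    # ''
```

In this sketch the untagged input also returns an empty string, because `rfind` on the generic fence lands on the closing fence rather than the opening one. Whether the merged code treats that case differently is not visible in the hunk.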



⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd nemo_skills && find . -name "code.py" -path "*/evaluation/evaluator/*" | head -5

Repository: NVIDIA-NeMo/Skills

Length of output: 92

🏁 Script executed:

cat -n nemo_skills/evaluation/evaluator/code.py | sed -n '135,170p'

Repository: NVIDIA-NeMo/Skills

Length of output: 1670

🏁 Script executed:

cat -n nemo_skills/evaluation/evaluator/code.py | sed -n '1,150p' | tail -50

Repository: NVIDIA-NeMo/Skills

Length of output: 2326

🏁 Script executed:

grep -n "def.*extract\|def.*fence\|def.*code" nemo_skills/evaluation/evaluator/code.py | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 258

🏁 Script executed:

rg "preprocess_code" nemo_skills/evaluation/evaluator/code.py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1889

🏁 Script executed:

rg "preprocess_code" nemo_skills/ --type py | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 1105

🏁 Script executed:

find nemo_skills -name "*test*.py" -path "*/evaluation/*" | head -10

Repository: NVIDIA-NeMo/Skills

Length of output: 44

🏁 Script executed:

grep -B 5 -A 5 'preprocess_code.*language' nemo_skills/evaluation/evaluator/livecodebench.py | head -30

Repository: NVIDIA-NeMo/Skills

Length of output: 533

🏁 Script executed:

cat -n nemo_skills/evaluation/evaluator/livecodebench.py | grep -B 10 "preprocess_code.*language"

Repository: NVIDIA-NeMo/Skills

Length of output: 683

🏁 Script executed:

rg "_preprocess_and_validate_file" nemo_skills/evaluation/evaluator/livecodebench.py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 820

🏁 Script executed:

python3 << 'EOF'
# Simulate the fence extraction logic with the issue and the fix

def test_current_logic(completion, language):
    """Current (broken) logic"""
    specific_fence = f"```{language}"
    generic_fence = "```"

    start_index = completion.rfind(specific_fence)
    fence_len = len(specific_fence)

    if start_index == -1:
        start_index = completion.rfind(generic_fence)
        fence_len = len(generic_fence)

    if start_index != -1:
        content_start = start_index + fence_len
        result = completion[content_start:]
        return result
    return ""

def test_proposed_fix(completion, language):
    """Proposed fix"""
    specific_fence = f"```{language}"
    generic_fence = "```"

    start_index = completion.rfind(specific_fence)
    fence_len = len(specific_fence)

    if start_index == -1:
        start_index = completion.rfind(generic_fence)
        fence_len = len(generic_fence)

    if start_index != -1:
        # Skip the whole fence header line
        newline_idx = completion.find("\n", start_index + fence_len)
        if newline_idx == -1:
            # Inline fences - fall back to original behavior
            content_start = start_index + fence_len
        else:
            content_start = newline_idx + 1
        result = completion[content_start:]
        return result
    return ""

# Test cases
test_cases = [
    ("```python3\nprint('hi')\n```", "python", "Issue case: python3 with language=python"),
    ("```python\nprint('hi')\n```", "python", "Normal case: matching language"),
    ("```cpp20\nint main() { return 0; }\n```", "cpp", "Issue case: cpp20 with language=cpp"),
    ("```javascript-es6\nconsole.log('hi');\n```", "javascript", "Issue case: javascript-es6 with language=javascript"),
    ("```\nplain code\n```", "python", "Generic fence fallback"),
]

print("=" * 80)
for completion, lang, desc in test_cases:
    print(f"\nTest: {desc}")
    print(f"Input: {repr(completion)}, language={repr(lang)}")
    current = test_current_logic(completion, lang)
    proposed = test_proposed_fix(completion, lang)
    print(f"Current logic result: {repr(current[:30])}")
    print(f"Proposed fix result: {repr(proposed[:30])}")
    if current != proposed:
        print("⚠️ DIFFERENCE FOUND")
        print(f" Current starts with: {repr(current.split(chr(10))[0])}")
        print(f" Proposed starts with: {repr(proposed.split(chr(10))[0])}")
EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 148

🏁 Script executed:

cat -n nemo_skills/evaluation/evaluator/code.py | sed -n '139,165p'

Repository: NVIDIA-NeMo/Skills

Length of output: 1197

🏁 Script executed:

python3 << 'EOF'
# Quick manual verification of the fix logic

# Example 1: ```python3 header (the main issue)
completion = "```python3\nprint('hi')\n```"
language = "python"

# Current broken logic:
specific_fence = f"```{language}"  # "```python"
start_index = completion.rfind(specific_fence)  # 0
fence_len = len(specific_fence)  # 9
content_start = start_index + fence_len  # 9
result_broken = completion[content_start:]  # "3\nprint('hi')\n```"

# Proposed fix:
newline_idx = completion.find("\n", start_index + fence_len)  # finds \n at position 10
content_start = newline_idx + 1  # 11
result_fixed = completion[content_start:]  # "print('hi')\n```"

print("Example 1: ```python3 with language='python'")
print(f"Broken result starts with: {repr(result_broken[:20])}")
print(f"Fixed result starts with: {repr(result_fixed[:20])}")
print(f"Issue found: {result_broken[0] == '3'}")
print()

# Example 2: Normal case (should work the same)
completion2 = "```python\nprint('hi')\n```"
language2 = "python"

specific_fence2 = f"```{language2}"  # "```python"
start_index2 = completion2.rfind(specific_fence2)  # 0
fence_len2 = len(specific_fence2)  # 9
newline_idx2 = completion2.find("\n", start_index2 + fence_len2)  # finds \n at position 9
content_start2 = newline_idx2 + 1  # 10
result_fixed2 = completion2[content_start2:]  # "print('hi')\n```"

print("Example 2: ```python with language='python' (normal case)")
print(f"Fixed result starts with: {repr(result_fixed2[:20])}")
EOF

Repository: NVIDIA-NeMo/Skills

Length of output: 325

Language‑specific fence handling mis‑slices headers like ```python3; skip the entire fence line

The current code combines `rfind(specific_fence)` with `content_start = start_index + fence_len`, which fails when the model output includes version-tagged headers. For example, with `language="python"` and a generated header like ```python3, the logic matches ```python (9 characters) and advances `content_start` by 9, landing just before the `3`. The extracted code then starts with `3\nprint(...)`, which breaks execution.

This occurs across any language with version suffixes: cpp20, javascript-es6, etc. The fix is to skip the entire opening-fence line (up to and including the newline):

-    if start_index != -1:
-        # Move past the opening fence
-        content_start = start_index + fence_len
-        completion = completion[content_start:]
+    if start_index != -1:
+        # Skip the whole fence header line so headers like ```python3 or ```cpp
+        # are not partially left in the extracted code.
+        newline_idx = completion.find("\n", start_index + fence_len)
+        if newline_idx == -1:
+            # Inline fences like ```python print("hi")```: fall back to the
+            # original behavior and just move past the matched fence substring.
+            content_start = start_index + fence_len
+        else:
+            content_start = newline_idx + 1
+        completion = completion[content_start:]

This preserves the "take the last fenced block" behavior while avoiding stray fence‑header fragments.
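The difference between the two slicing strategies can be checked in isolation. The helper below is a hypothetical sketch, not code from the repository: `content_after_fence` returns both the naive slice and the header-skipping slice so they can be compared side by side, and `FENCE` is built programmatically only to keep literal fences out of the example.

```python
from typing import Tuple

FENCE = "`" * 3  # three backticks, built programmatically


def content_after_fence(completion: str, language: str) -> Tuple[str, str]:
    """Return (naive_slice, header_skipping_slice) after the last language fence."""
    specific_fence = FENCE + language
    start_index = completion.rfind(specific_fence)
    if start_index == -1:
        return "", ""

    after_match = start_index + len(specific_fence)
    naive = completion[after_match:]

    # Skip to the end of the fence header line; fall back to the naive
    # position when the fence is inline and no newline follows it.
    newline_idx = completion.find("\n", after_match)
    content_start = after_match if newline_idx == -1 else newline_idx + 1
    return naive, completion[content_start:]


sample = FENCE + "python3\nprint('hi')\n" + FENCE
naive, fixed = content_after_fence(sample, "python")
print(repr(naive.splitlines()[0]))  # '3'
print(repr(fixed.splitlines()[0]))  # "print('hi')"
```

With a header that matches the language exactly, both slices differ only by the leading newline, so the header-skipping variant does not change the normal case beyond removing that blank first line.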

🤖 Prompt for AI Agents

In nemo_skills/evaluation/evaluator/code.py around lines 139-165, the current logic slices starting at start_index + len(specific_fence) which can leave partial fence headers (e.g., ```python3) at the start of extracted code; change the slicing to skip the entire opening fence line by locating the next newline after the matched fence and set content_start to newline_idx + 1 (falling back to start_index + fence_len only if no newline is found), then continue with the existing closing-fence search so extracted code does not include trailing version/tag fragments.

wasiahmad · 2025-12-11T21:42:17Z

@gwarmstrong can you approve this PR? It is verified and working as expected.

gwarmstrong · 2025-12-11T21:44:26Z

@wasiahmad please fix signoff with instructions here: https://github.com/NVIDIA-NeMo/Skills/pull/1086/checks?check_run_id=57834521539
good to merge after that

wasiahmad requested a review from SeanNaren December 9, 2025 20:32

wasiahmad force-pushed the lcb_fix branch from 7455145 to 22daf11 Compare December 9, 2025 20:33

coderabbitai bot reviewed Dec 9, 2025

wasiahmad force-pushed the lcb_fix branch from 4cfd47f to 3879137 Compare December 11, 2025 21:39

wasiahmad and others added 12 commits December 11, 2025 14:21

updating code extraction logic

79dfe85

Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com>

removing mistakenly pushed code

b75c036

Signed-off-by: George Armstrong <georgea@nvidia.com>

MAINT pin nemo-evaluator (#1095)

164fd81

Signed-off-by: George Armstrong <georgea@nvidia.com>

Update issue templates

9281e22

Signed-off-by: George Armstrong <georgea@nvidia.com>

Delete .github/ISSUE_TEMPLATE directory

ed35495

Signed-off-by: George Armstrong <georgea@nvidia.com>

enable blank issues (#1096)

7f380c4

Signed-off-by: George Armstrong <georgea@nvidia.com>

Fix input_file path handling when executor is "none" (#1089)

16b5195

Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: George Armstrong <georgea@nvidia.com>

TST for #1089 (#1097)

89c5ffb

Signed-off-by: George Armstrong <georgea@nvidia.com>

Stepheng/prover cleanup (#1078)

cbc61bb

Signed-off-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>

FEAT Add Tavily Search (#1085)

827c77a

Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong force-pushed the lcb_fix branch from 3879137 to 827c77a Compare December 11, 2025 22:22

Merge branch 'main' into lcb_fix

b83cc18

gwarmstrong approved these changes Dec 11, 2025

gwarmstrong enabled auto-merge (squash) December 11, 2025 22:24

gwarmstrong merged commit 699dfaa into main Dec 11, 2025
5 checks passed

gwarmstrong deleted the lcb_fix branch December 11, 2025 22:41

wasiahmad added a commit that referenced this pull request Dec 19, 2025

updating code extraction logic (#1086)

7c529c0

Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

wasiahmad added a commit that referenced this pull request Feb 4, 2026

updating code extraction logic (#1086)

f7e5479

Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

updating code extraction logic (#1086)

2fb6baa

Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

gwarmstrong merged 13 commits into main from lcb_fix
