
Prompting fix to improve LCB score for GPTOSS#1169

Merged
Kipok merged 8 commits into main from gptoss_dsv32_lcb_fix
Jan 25, 2026

Conversation

Collaborator

@wasiahmad wasiahmad commented Jan 15, 2026

Objective of prompt change

When we evaluated GPTOSS on LCB using Nemo-Skills, we noticed that the following error appeared frequently.

AttributeError("'_io.StringIO' object has no attribute 'buffer'")

After inspecting the generated code, we found that the following line was causing the error.

data = sys.stdin.buffer.read().split()

So, we updated the prompt for GPTOSS to instruct the model not to generate code that leads to this AttributeError.
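For context, the failure is easy to reproduce outside the harness. The evaluation sandbox appears to feed input through an in-memory text stream (io.StringIO), which, unlike a real console stream, exposes no .buffer attribute. A minimal sketch — fake_stdin is a hypothetical stand-in for the harness's replaced sys.stdin, since the exact sandbox mechanism is an assumption:

```python
import io

# Assumption: the eval harness swaps sys.stdin for a text-mode io.StringIO.
# A real console stream would expose .buffer; StringIO does not.
fake_stdin = io.StringIO("1 2 3\n")

# The pattern the model was generating -- fails on a StringIO-backed stdin:
try:
    data = fake_stdin.buffer.read().split()
except AttributeError as e:
    print(e)  # '_io.StringIO' object has no attribute 'buffer'

# Text-mode reading works on both real stdin and StringIO:
data = fake_stdin.read().split()
print(data)  # ['1', '2', '3']
```

This is why the prompt steers the model toward text-mode input (input(), sys.stdin.read()) rather than binary reads.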

Impact of prompt change

The updated prompt gives the following scores on LCB v5 (2407-2412) [315 samples], also known as the AA split:

GPT-OSS-120B

------------------------------ livecodebench ------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 315         | 17531      | 1083        | 86.98% ± 0.92%
pass@10           | 315         | 17531      | 1083        | 93.02%        


---------------------------- livecodebench-hard ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 135         | 31470      | 1083        | 75.85% ± 1.27%
pass@10           | 135         | 31470      | 1083        | 87.41%        


--------------------------- livecodebench-medium --------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 102         | 10700      | 849         | 93.04% ± 1.49%
pass@10           | 102         | 10700      | 849         | 96.08%        


---------------------------- livecodebench-easy ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 78          | 2336       | 247         | 98.33% ± 1.22%
pass@10           | 78          | 2336       | 247         | 98.72%

NS Eval command:

ns eval \
    --cluster=$cluster \
    --model=$model_dir \
    --server_type=vllm \
    --server_args="--enable-log-requests --async-scheduling --tensor-parallel-size 8" \
    --server_nodes=1 \
    --server_gpus=8 \
    --benchmarks=livecodebench:10 \
    --split="test_v5_2407_2412" \
    --expname="${model_name}-lcb-ns-eval" \
    --output_dir=$output_dir \
    ++prompt_config="gpt-oss/livecodebench" \
    ++inference.endpoint_type=chat \
    ++inference.extra_body.reasoning_effort=high \
    ++inference.temperature=1.0 \
    ++inference.top_p=1.0 \
    ++inference.top_k=-1 \
    ++max_concurrent_requests=1024 \
    ++skip_filled=True \
    ++eval_config.timeout=10

GPT-OSS-20B

------------------------------ livecodebench ------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 315         | 32616      | 1090        | 82.79% ± 0.91%
pass@10           | 315         | 32616      | 1090        | 91.11%        


---------------------------- livecodebench-hard ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 135         | 57098      | 1090        | 66.59% ± 1.96%
pass@10           | 135         | 57098      | 1090        | 82.22%        


--------------------------- livecodebench-medium --------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 102         | 22206      | 1090        | 92.25% ± 1.34%
pass@10           | 102         | 22206      | 1090        | 97.06%        


---------------------------- livecodebench-easy ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 78          | 3859       | 734         | 98.46% ± 0.54%
pass@10           | 78          | 3859       | 734         | 98.72% 

Qwen3-32B

------------------------------ livecodebench ------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 315         | 12735      | 1169        | 66.98% ± 1.41%
pass@10           | 315         | 12735      | 1169        | 80.32%        


---------------------------- livecodebench-hard ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 135         | 18759      | 1149        | 39.41% ± 1.87%
pass@10           | 135         | 18759      | 1149        | 60.00%        


--------------------------- livecodebench-medium --------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 102         | 11311      | 1169        | 79.80% ± 1.80%
pass@10           | 102         | 11311      | 1169        | 93.14%        


---------------------------- livecodebench-easy ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 78          | 4171       | 837         | 97.95% ± 1.08%
pass@10           | 78          | 4171       | 837         | 98.72%

NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

------------------------------ livecodebench ------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 315         | 28483      | 1649        | 66.86% ± 2.30%
pass@10           | 315         | 28483      | 1649        | 80.95%        


---------------------------- livecodebench-hard ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 135         | 44079      | 1649        | 43.11% ± 3.40%
pass@10           | 135         | 44079      | 1649        | 62.96%        


--------------------------- livecodebench-medium --------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 102         | 21548      | 1649        | 77.06% ± 2.78%
pass@10           | 102         | 21548      | 1649        | 91.18%        


---------------------------- livecodebench-easy ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 78          | 10557      | 1649        | 94.62% ± 2.69%
pass@10           | 78          | 10557      | 1649        | 98.72%

@coderabbitai
Contributor

coderabbitai bot commented Jan 15, 2026


Walkthrough

Adds a new YAML configuration file that specifies prompt instructions for LiveCodeBench tasks, defining Python execution constraints, forbidden I/O methods (buffer-based operations), allowed I/O approaches, and standardized answer formatting with code block structures.

Changes

LiveCodeBench Prompt Configuration (nemo_skills/prompt/config/gpt-oss/livecodebench.yaml): New prompt template containing instructions for Python code task execution, including I/O method restrictions (forbids .buffer attributes), allowed input methods (input, sys.stdin.read/readline, open().read), step-by-step thinking directives, and a standardized answer format with code block wrapping.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • PR #1079: Adds complementary LiveCodeBench prompt configurations including prompt-set entries and templates for the same benchmark feature.

Suggested reviewers

  • titu1994
  • ekmb

@greptile-apps
Contributor

greptile-apps bot commented Jan 15, 2026

Greptile Overview

Greptile Summary

This PR adds a GPT-OSS-specific prompt configuration for LiveCodeBench that prevents the model from generating code using sys.stdin.buffer and threading, which were causing AttributeErrors during evaluation.

Key Changes:

  • Created new prompt file at nemo_skills/prompt/config/gpt-oss/livecodebench.yaml with targeted constraints
  • Unlike earlier iterations of this PR, the author correctly rolled back changes to nemo_skills/dataset/livecodebench/__init__.py and prepare.py
  • The default prompt config remains eval/livecodebench/python_codegen, so this change only affects GPT-OSS when explicitly specified with ++prompt_config=gpt-oss/livecodebench

Impact:
The new prompt is opt-in and model-specific, avoiding the concerns raised in previous review threads about changing defaults for all models. GPT-OSS users must explicitly use ++prompt_config=gpt-oss/livecodebench to get these constraints. The PR description shows significant improvements in LCB scores for GPT-OSS models (86.98% for 120B, 82.79% for 20B) while also testing other models like Qwen3 and Nemotron.

Confidence Score: 5/5

  • This PR is safe to merge - it only adds a new prompt configuration file without modifying any defaults or existing behavior
  • Score reflects that this is a minimal, non-breaking change: (1) only adds a new opt-in prompt file in the gpt-oss directory, (2) does not modify any existing files or defaults, (3) properly rolled back earlier changes that would have affected all models, (4) addresses a real issue (AttributeError) with targeted constraints, (5) PR description shows thorough testing with strong results across multiple models
  • No files require special attention

Important Files Changed

Filename Overview
nemo_skills/prompt/config/gpt-oss/livecodebench.yaml New GPT-OSS specific prompt that adds constraints to avoid sys.stdin.buffer and threading which were causing AttributeErrors

Sequence Diagram

sequenceDiagram
    participant User as User/Evaluator
    participant CLI as NS Eval CLI
    participant Dataset as LiveCodeBench Dataset
    participant PromptLoader as Prompt Loader
    participant Model as GPT-OSS Model
    participant Executor as Code Executor
    
    User->>CLI: ns eval with ++prompt_config=gpt-oss/livecodebench
    CLI->>Dataset: Load LiveCodeBench dataset
    Dataset-->>CLI: Return problems with {question}
    CLI->>PromptLoader: Load prompt config: gpt-oss/livecodebench
    PromptLoader-->>CLI: Return prompt with constraints
    Note over PromptLoader: Adds: "avoid sys.stdin.buffer<br/>avoid threading"
    CLI->>Model: Generate code with modified prompt
    Note over Model: Avoids problematic patterns<br/>due to prompt constraints
    Model-->>CLI: Generated code (without sys.stdin.buffer/threading)
    CLI->>Executor: Execute generated code
    Executor-->>CLI: Success (no AttributeError)
    CLI-->>User: Return evaluation results (86.98% pass@1)


@Kipok Kipok requested a review from titu1994 January 16, 2026 22:26
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@greptile-apps greptile-apps bot left a comment


3 files reviewed, 1 comment


user: |-
{question}

Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`.


style: The simplified prompt lacks explicit guidance on allowed input methods (like input(), sys.stdin.read()) that were in the previous version. Consider whether models might benefit from positive examples of what TO use, not just what to avoid.

Suggested change:
- Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`.
+ Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`. Instead use `input()` or `sys.stdin.read()`.



@greptile-apps greptile-apps bot left a comment


4 files reviewed, 2 comments


METRICS_TYPE = "livecodebench"
EVAL_SPLIT = "test_v6_2408_2505"
- GENERATION_ARGS = "++prompt_config=eval/livecodebench/python_codegen ++eval_type=livecodebench"
+ GENERATION_ARGS = "++prompt_config=gpt-oss/livecodebench ++eval_type=livecodebench"


style: changing the default GENERATION_ARGS from eval/livecodebench/python_codegen to gpt-oss/livecodebench affects all evaluations using this dataset module unless explicitly overridden with ++prompt_config

check that existing evaluation workflows (especially those not using GPTOSS) still function correctly with this new default, or document that users should specify ++prompt_config=eval/livecodebench/python_codegen for non-GPTOSS models


@Kipok Kipok Jan 24, 2026


it does look a bit strange that we are using gpt-oss prompt as default. Did you verify it also improves results for other models (e.g. nano-v3 or qwen3)? If not, then maybe we should keep default and only use the other prompt for gpt-oss? Or alternatively we can update here, but then would need to change all previously reported commands in the documentation to use original prompt for reproducibility @wasiahmad



You're absolutely right to be concerned. Looking at the prompt differences:

Original prompt (eval/livecodebench/python_codegen): Generic instruction to generate executable Python code

New GPT-OSS prompt (gpt-oss/livecodebench): Same as original but adds: "Please avoid using sys.stdin.buffer to process input, and avoid using threading."

This change affects all evaluations using the livecodebench dataset unless explicitly overridden, not just GPTOSS evaluations. The GPT-OSS specific constraints may actually hurt performance on other models that don't have the same issues.

I'd recommend keeping the original default and only using the GPT-OSS prompt when specifically needed. The constraints about sys.stdin.buffer and threading are:

  1. GPTOSS-specific: Other models might not generate these problematic patterns
  2. Potentially limiting: These constraints could prevent valid solutions that use threading or binary input handling
  3. Breaking change: This breaks reproducibility of all existing commands in documentation

Better approach: Keep eval/livecodebench/python_codegen as the default and explicitly use ++prompt_config=gpt-oss/livecodebench only for GPTOSS evaluations.



You're absolutely right to be concerned about this change. The PR is switching the default prompt config from eval/livecodebench/python_codegen to gpt-oss/livecodebench without testing it on other models.

Looking at the differences:

Original (eval/livecodebench/python_codegen):

user: |-
  Here is a problem for which you need to generate an executable code in python programming language.

  {question}

New default (gpt-oss/livecodebench):

user: |-
  {question}

  Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`.

The new prompt is specifically tailored for GPT-OSS to avoid the sys.stdin.buffer AttributeError, but it's quite different from the original - it removes the explicit instruction to generate Python code and adds restrictions that may not be necessary for other models.

Recommendations:

  1. Don't make this the default - This prompt is specifically optimized for GPT-OSS and could potentially hurt performance on other models like Nano-v3 or Qwen3

  2. Keep the original default and use model-specific overrides - Users should explicitly specify ++prompt_config=gpt-oss/livecodebench when evaluating GPT-OSS models

  3. Update documentation - If you want to keep this change, you'd need to update all existing evaluation commands in documentation to use ++prompt_config=eval/livecodebench/python_codegen for reproducibility

The cleanest approach is to revert this change to the default and let GPT-OSS evaluations explicitly use the specialized prompt.

Comment on lines +26 to +29
FORMATTING_MESSAGE_WITH_STARTER_CODE = (
"""Please place the solution code in the following format:\n```python\n# Your solution code here\n```"""
)
FORMATTING_WITHOUT_STARTER_CODE = """Write Python code to solve the problem. Please place the solution code in the following format:\n```python\n# Your solution code here\n```"""


style: the reference to Qwen's implementation was removed - consider keeping a comment linking to the original inspiration or documenting why these specific formatting instructions were chosen, especially since this is a significant departure from the previous approach


Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


METRICS_TYPE = "livecodebench"
EVAL_SPLIT = "test_v6_2408_2505"
- GENERATION_ARGS = "++prompt_config=eval/livecodebench/python_codegen ++eval_type=livecodebench"
+ GENERATION_ARGS = "++prompt_config=gpt-oss/livecodebench ++eval_type=livecodebench"


changing the default GENERATION_ARGS affects ALL models using this dataset (Nemotron, Qwen, etc.), not just GPT-OSS

the GPT-OSS specific constraints (sys.stdin.buffer, threading) may harm performance on other models that don't have these issues, or prevent valid solutions

per the previous thread, you should verify this prompt improves results for other models, or keep eval/livecodebench/python_codegen as default and only override for GPT-OSS evaluations using ++prompt_config=gpt-oss/livecodebench

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


@Kipok Kipok merged commit c7470d9 into main Jan 25, 2026
5 checks passed
@Kipok Kipok deleted the gptoss_dsv32_lcb_fix branch January 25, 2026 20:11
@coderabbitai coderabbitai bot mentioned this pull request Feb 5, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: dgitman <dgitman@nvidia.com>
