
Prompting fix to improve LCB score for GPTOSS#1169

Merged
Kipok merged 8 commits into main from gptoss_dsv32_lcb_fix
Jan 25, 2026

Conversation

Collaborator

@wasiahmad wasiahmad commented Jan 15, 2026

Objective of prompt change

When we evaluated GPTOSS on LCB using Nemo-Skills, we noticed that the following error appeared frequently.

AttributeError("'_io.StringIO' object has no attribute 'buffer'")

After inspecting the generated code, we found that the following line was causing the error.

data = sys.stdin.buffer.read().split()

So, we updated the prompt for GPTOSS to instruct the model not to generate code that leads to this AttributeError.
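For context, the failure is easy to reproduce outside the harness. The evaluation sandbox appears to feed input through an in-memory text stream (io.StringIO), which, unlike a real console stream, exposes no .buffer attribute. A minimal sketch — fake_stdin is a hypothetical stand-in for the harness's replaced sys.stdin, since the exact sandbox mechanism is an assumption:

```python
import io

# Assumption: the eval harness swaps sys.stdin for a text-mode io.StringIO.
# A real console stream would expose .buffer; StringIO does not.
fake_stdin = io.StringIO("1 2 3\n")

# The pattern the model was generating -- fails on a StringIO-backed stdin:
try:
    data = fake_stdin.buffer.read().split()
except AttributeError as e:
    print(e)  # '_io.StringIO' object has no attribute 'buffer'

# Text-mode reading works on both real stdin and StringIO:
data = fake_stdin.read().split()
print(data)  # ['1', '2', '3']
```

This is why the prompt steers the model toward text-mode input (input(), sys.stdin.read()) rather than binary reads.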

Impact of prompt change

The updated prompt gives the following scores on LCB v5 (2407-2412) [315 samples], also known as the AA split:

GPT-OSS-120B

------------------------------ livecodebench ------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 315         | 17531      | 1083        | 86.98% ± 0.92%
pass@10           | 315         | 17531      | 1083        | 93.02%        


---------------------------- livecodebench-hard ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 135         | 31470      | 1083        | 75.85% ± 1.27%
pass@10           | 135         | 31470      | 1083        | 87.41%        


--------------------------- livecodebench-medium --------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 102         | 10700      | 849         | 93.04% ± 1.49%
pass@10           | 102         | 10700      | 849         | 96.08%        


---------------------------- livecodebench-easy ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 78          | 2336       | 247         | 98.33% ± 1.22%
pass@10           | 78          | 2336       | 247         | 98.72%

NS Eval command:

ns eval \
    --cluster=$cluster \
    --model=$model_dir \
    --server_type=vllm \
    --server_args="--enable-log-requests --async-scheduling --tensor-parallel-size 8" \
    --server_nodes=1 \
    --server_gpus=8 \
    --benchmarks=livecodebench:10 \
    --split="test_v5_2407_2412" \
    --expname="${model_name}-lcb-ns-eval" \
    --output_dir=$output_dir \
    ++prompt_config="gpt-oss/livecodebench" \
    ++inference.endpoint_type=chat \
    ++inference.extra_body.reasoning_effort=high \
    ++inference.temperature=1.0 \
    ++inference.top_p=1.0 \
    ++inference.top_k=-1 \
    ++max_concurrent_requests=1024 \
    ++skip_filled=True \
    ++eval_config.timeout=10

GPT-OSS-20B

------------------------------ livecodebench ------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 315         | 32616      | 1090        | 82.79% ± 0.91%
pass@10           | 315         | 32616      | 1090        | 91.11%        


---------------------------- livecodebench-hard ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 135         | 57098      | 1090        | 66.59% ± 1.96%
pass@10           | 135         | 57098      | 1090        | 82.22%        


--------------------------- livecodebench-medium --------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 102         | 22206      | 1090        | 92.25% ± 1.34%
pass@10           | 102         | 22206      | 1090        | 97.06%        


---------------------------- livecodebench-easy ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 78          | 3859       | 734         | 98.46% ± 0.54%
pass@10           | 78          | 3859       | 734         | 98.72% 

Qwen3-32B

------------------------------ livecodebench ------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 315         | 12735      | 1169        | 66.98% ± 1.41%
pass@10           | 315         | 12735      | 1169        | 80.32%        


---------------------------- livecodebench-hard ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 135         | 18759      | 1149        | 39.41% ± 1.87%
pass@10           | 135         | 18759      | 1149        | 60.00%        


--------------------------- livecodebench-medium --------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 102         | 11311      | 1169        | 79.80% ± 1.80%
pass@10           | 102         | 11311      | 1169        | 93.14%        


---------------------------- livecodebench-easy ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 78          | 4171       | 837         | 97.95% ± 1.08%
pass@10           | 78          | 4171       | 837         | 98.72%

NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

------------------------------ livecodebench ------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 315         | 28483      | 1649        | 66.86% ± 2.30%
pass@10           | 315         | 28483      | 1649        | 80.95%        


---------------------------- livecodebench-hard ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 135         | 44079      | 1649        | 43.11% ± 3.40%
pass@10           | 135         | 44079      | 1649        | 62.96%        


--------------------------- livecodebench-medium --------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 102         | 21548      | 1649        | 77.06% ± 2.78%
pass@10           | 102         | 21548      | 1649        | 91.18%        


---------------------------- livecodebench-easy ---------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy      
pass@1[avg-of-10] | 78          | 10557      | 1649        | 94.62% ± 2.69%
pass@10           | 78          | 10557      | 1649        | 98.72%

@coderabbitai
Contributor

coderabbitai bot commented Jan 15, 2026


Walkthrough

Adds a new YAML configuration file that specifies prompt instructions for LiveCodeBench tasks, defining Python execution constraints, forbidden I/O methods (buffer-based operations), allowed I/O approaches, and standardized answer formatting with code block structures.

Changes

LiveCodeBench Prompt Configuration (nemo_skills/prompt/config/gpt-oss/livecodebench.yaml): New prompt template containing instructions for Python code task execution, including I/O method restrictions (forbids .buffer attributes), allowed input methods (input, sys.stdin.read/readline, open().read), step-by-step thinking directives, and a standardized answer format with code block wrapping.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • PR #1079: Adds complementary LiveCodeBench prompt configurations including prompt-set entries and templates for the same benchmark feature.

Suggested reviewers

  • titu1994
  • ekmb

@greptile-apps
Contributor

greptile-apps bot commented Jan 15, 2026

Greptile Overview

Greptile Summary

This PR adds a GPT-OSS-specific prompt configuration for LiveCodeBench that prevents the model from generating code using sys.stdin.buffer and threading, which were causing AttributeErrors during evaluation.

Key Changes:

  • Created new prompt file at nemo_skills/prompt/config/gpt-oss/livecodebench.yaml with targeted constraints
  • Unlike earlier iterations of this PR, the author correctly rolled back changes to nemo_skills/dataset/livecodebench/__init__.py and prepare.py
  • The default prompt config remains eval/livecodebench/python_codegen, so this change only affects GPT-OSS when explicitly specified with ++prompt_config=gpt-oss/livecodebench

Impact:
The new prompt is opt-in and model-specific, avoiding the concerns raised in previous review threads about changing defaults for all models. GPT-OSS users must explicitly use ++prompt_config=gpt-oss/livecodebench to get these constraints. The PR description shows significant improvements in LCB scores for GPT-OSS models (86.98% for 120B, 82.79% for 20B) while also testing other models like Qwen3 and Nemotron.

Confidence Score: 5/5

  • This PR is safe to merge - it only adds a new prompt configuration file without modifying any defaults or existing behavior
  • Score reflects that this is a minimal, non-breaking change: (1) only adds a new opt-in prompt file in the gpt-oss directory, (2) does not modify any existing files or defaults, (3) properly rolled back earlier changes that would have affected all models, (4) addresses a real issue (AttributeError) with targeted constraints, (5) PR description shows thorough testing with strong results across multiple models
  • No files require special attention

Important Files Changed

Filename Overview
nemo_skills/prompt/config/gpt-oss/livecodebench.yaml New GPT-OSS specific prompt that adds constraints to avoid sys.stdin.buffer and threading which were causing AttributeErrors

Sequence Diagram

sequenceDiagram
    participant User as User/Evaluator
    participant CLI as NS Eval CLI
    participant Dataset as LiveCodeBench Dataset
    participant PromptLoader as Prompt Loader
    participant Model as GPT-OSS Model
    participant Executor as Code Executor
    
    User->>CLI: ns eval with ++prompt_config=gpt-oss/livecodebench
    CLI->>Dataset: Load LiveCodeBench dataset
    Dataset-->>CLI: Return problems with {question}
    CLI->>PromptLoader: Load prompt config: gpt-oss/livecodebench
    PromptLoader-->>CLI: Return prompt with constraints
    Note over PromptLoader: Adds: "avoid sys.stdin.buffer<br/>avoid threading"
    CLI->>Model: Generate code with modified prompt
    Note over Model: Avoids problematic patterns<br/>due to prompt constraints
    Model-->>CLI: Generated code (without sys.stdin.buffer/threading)
    CLI->>Executor: Execute generated code
    Executor-->>CLI: Success (no AttributeError)
    CLI-->>User: Return evaluation results (86.98% pass@1)


@Kipok Kipok requested a review from titu1994 January 16, 2026 22:26
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@greptile-apps greptile-apps bot left a comment


3 files reviewed, 1 comment


user: |-
{question}

Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`.


style: The simplified prompt lacks explicit guidance on allowed input methods (like input(), sys.stdin.read()) that were in the previous version. Consider whether models might benefit from positive examples of what TO use, not just what to avoid.

Suggested change:
- Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`.
+ Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`. Instead use `input()` or `sys.stdin.read()`.



@greptile-apps greptile-apps bot left a comment


4 files reviewed, 2 comments


METRICS_TYPE = "livecodebench"
EVAL_SPLIT = "test_v6_2408_2505"
- GENERATION_ARGS = "++prompt_config=eval/livecodebench/python_codegen ++eval_type=livecodebench"
+ GENERATION_ARGS = "++prompt_config=gpt-oss/livecodebench ++eval_type=livecodebench"


style: changing the default GENERATION_ARGS from eval/livecodebench/python_codegen to gpt-oss/livecodebench affects all evaluations using this dataset module unless explicitly overridden with ++prompt_config

check that existing evaluation workflows (especially those not using GPTOSS) still function correctly with this new default, or document that users should specify ++prompt_config=eval/livecodebench/python_codegen for non-GPTOSS models


@Kipok Kipok Jan 24, 2026


it does look a bit strange that we are using gpt-oss prompt as default. Did you verify it also improves results for other models (e.g. nano-v3 or qwen3)? If not, then maybe we should keep default and only use the other prompt for gpt-oss? Or alternatively we can update here, but then would need to change all previously reported commands in the documentation to use original prompt for reproducibility @wasiahmad



You're absolutely right to be concerned. Looking at the prompt differences:

Original prompt (eval/livecodebench/python_codegen): Generic instruction to generate executable Python code

New GPT-OSS prompt (gpt-oss/livecodebench): Same as original but adds: "Please avoid using sys.stdin.buffer to process input, and avoid using threading."

This change affects all evaluations using the livecodebench dataset unless explicitly overridden, not just GPTOSS evaluations. The GPT-OSS specific constraints may actually hurt performance on other models that don't have the same issues.

I'd recommend keeping the original default and only using the GPT-OSS prompt when specifically needed. The constraints about sys.stdin.buffer and threading are:

  1. GPTOSS-specific: Other models might not generate these problematic patterns
  2. Potentially limiting: These constraints could prevent valid solutions that use threading or binary input handling
  3. Breaking change: This breaks reproducibility of all existing commands in documentation

Better approach: Keep eval/livecodebench/python_codegen as the default and explicitly use ++prompt_config=gpt-oss/livecodebench only for GPTOSS evaluations.



You're absolutely right to be concerned about this change. The PR is switching the default prompt config from eval/livecodebench/python_codegen to gpt-oss/livecodebench without testing it on other models.

Looking at the differences:

Original (eval/livecodebench/python_codegen):

user: |-
  Here is a problem for which you need to generate an executable code in python programming language.

  {question}

New default (gpt-oss/livecodebench):

user: |-
  {question}

  Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`.

The new prompt is specifically tailored for GPT-OSS to avoid the sys.stdin.buffer AttributeError, but it's quite different from the original - it removes the explicit instruction to generate Python code and adds restrictions that may not be necessary for other models.

Recommendations:

  1. Don't make this the default - This prompt is specifically optimized for GPT-OSS and could potentially hurt performance on other models like Nano-v3 or Qwen3

  2. Keep the original default and use model-specific overrides - Users should explicitly specify ++prompt_config=gpt-oss/livecodebench when evaluating GPT-OSS models

  3. Update documentation - If you want to keep this change, you'd need to update all existing evaluation commands in documentation to use ++prompt_config=eval/livecodebench/python_codegen for reproducibility

The cleanest approach is to revert this change to the default and let GPT-OSS evaluations explicitly use the specialized prompt.

Comment on lines +26 to +29
FORMATTING_MESSAGE_WITH_STARTER_CODE = (
"""Please place the solution code in the following format:\n```python\n# Your solution code here\n```"""
)
FORMATTING_WITHOUT_STARTER_CODE = """Write Python code to solve the problem. Please place the solution code in the following format:\n```python\n# Your solution code here\n```"""


style: the reference to Qwen's implementation was removed - consider keeping a comment linking to the original inspiration or documenting why these specific formatting instructions were chosen, especially since this is a significant departure from the previous approach


Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


METRICS_TYPE = "livecodebench"
EVAL_SPLIT = "test_v6_2408_2505"
- GENERATION_ARGS = "++prompt_config=eval/livecodebench/python_codegen ++eval_type=livecodebench"
+ GENERATION_ARGS = "++prompt_config=gpt-oss/livecodebench ++eval_type=livecodebench"


changing the default GENERATION_ARGS affects ALL models using this dataset (Nemotron, Qwen, etc.), not just GPT-OSS

the GPT-OSS specific constraints (sys.stdin.buffer, threading) may harm performance on other models that don't have these issues, or prevent valid solutions

per the previous thread, you should verify this prompt improves results for other models, or keep eval/livecodebench/python_codegen as default and only override for GPT-OSS evaluations using ++prompt_config=gpt-oss/livecodebench

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


@Kipok Kipok merged commit c7470d9 into main Jan 25, 2026
5 checks passed
@Kipok Kipok deleted the gptoss_dsv32_lcb_fix branch January 25, 2026 20:11
@coderabbitai coderabbitai bot mentioned this pull request Feb 5, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: dgitman <dgitman@nvidia.com>
