Prompting fix to improve LCB score for GPTOSS (#1169)
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a new YAML configuration file that specifies prompt instructions for LiveCodeBench tasks, defining Python execution constraints, forbidden I/O methods (buffer-based operations), allowed I/O approaches, and standardized answer formatting with code block structures.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 3 checks passed.
Greptile Summary

This PR adds a GPT-OSS-specific prompt configuration for LiveCodeBench that prevents the model from generating code using `sys.stdin.buffer` or `threading`.

Key Changes:
Impact: Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User/Evaluator
    participant CLI as NS Eval CLI
    participant Dataset as LiveCodeBench Dataset
    participant PromptLoader as Prompt Loader
    participant Model as GPT-OSS Model
    participant Executor as Code Executor
    User->>CLI: ns eval with ++prompt_config=gpt-oss/livecodebench
    CLI->>Dataset: Load LiveCodeBench dataset
    Dataset-->>CLI: Return problems with {question}
    CLI->>PromptLoader: Load prompt config: gpt-oss/livecodebench
    PromptLoader-->>CLI: Return prompt with constraints
    Note over PromptLoader: Adds: "avoid sys.stdin.buffer<br/>avoid threading"
    CLI->>Model: Generate code with modified prompt
    Note over Model: Avoids problematic patterns<br/>due to prompt constraints
    Model-->>CLI: Generated code (without sys.stdin.buffer/threading)
    CLI->>Executor: Execute generated code
    Executor-->>CLI: Success (no AttributeError)
    CLI-->>User: Return evaluation results (86.98% pass@1)
```
```yaml
user: |-
  {question}

  Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`.
```
style: The simplified prompt lacks explicit guidance on allowed input methods (like `input()` or `sys.stdin.read()`) that were in the previous version. Consider whether models might benefit from positive examples of what TO use, not just what to avoid.
```diff
-Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`.
+Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`. Instead use `input()` or `sys.stdin.read()`.
```
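As an illustration of how the config above is consumed, here is a minimal sketch of rendering the user prompt for one problem. The template text mirrors the YAML; the rendering logic and the sample problem are assumptions, not the actual Nemo-Skills prompt loader:

```python
# Illustrative rendering of the gpt-oss/livecodebench user prompt.
# The real Nemo-Skills prompt loader may differ; the problem text is made up.
USER_TEMPLATE = (
    "{question}\n\n"
    "Please avoid using `sys.stdin.buffer` to process input, "
    "and avoid using `threading`."
)

problem = {"question": "Read N integers from stdin and print their sum."}
prompt = USER_TEMPLATE.format(**problem)
print(prompt)
```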
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
```diff
  METRICS_TYPE = "livecodebench"
  EVAL_SPLIT = "test_v6_2408_2505"
- GENERATION_ARGS = "++prompt_config=eval/livecodebench/python_codegen ++eval_type=livecodebench"
+ GENERATION_ARGS = "++prompt_config=gpt-oss/livecodebench ++eval_type=livecodebench"
```
style: changing the default GENERATION_ARGS from `eval/livecodebench/python_codegen` to `gpt-oss/livecodebench` affects all evaluations using this dataset module unless explicitly overridden with `++prompt_config`.

Check that existing evaluation workflows (especially those not using GPT-OSS) still function correctly with this new default, or document that users should specify `++prompt_config=eval/livecodebench/python_codegen` for non-GPT-OSS models.
It does look a bit strange that we are using the gpt-oss prompt as the default. Did you verify it also improves results for other models (e.g. nano-v3 or qwen3)? If not, then maybe we should keep the default and only use the other prompt for gpt-oss. Alternatively, we can update it here, but then we would need to change all previously reported commands in the documentation to use the original prompt for reproducibility. @wasiahmad
You're absolutely right to be concerned. Looking at the prompt differences:

- Original prompt (`eval/livecodebench/python_codegen`): generic instruction to generate executable Python code
- New GPT-OSS prompt (`gpt-oss/livecodebench`): same as the original, but adds: "Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`."
This change affects all evaluations using the livecodebench dataset unless explicitly overridden, not just GPTOSS evaluations. The GPT-OSS specific constraints may actually hurt performance on other models that don't have the same issues.
I'd recommend keeping the original default and only using the GPT-OSS prompt when specifically needed. The constraints about sys.stdin.buffer and threading are:
- GPTOSS-specific: Other models might not generate these problematic patterns
- Potentially limiting: These constraints could prevent valid solutions that use threading or binary input handling
- Breaking change: This breaks reproducibility of all existing commands in documentation
Better approach: Keep eval/livecodebench/python_codegen as the default and explicitly use ++prompt_config=gpt-oss/livecodebench only for GPTOSS evaluations.
You're absolutely right to be concerned about this change. The PR is switching the default prompt config from eval/livecodebench/python_codegen to gpt-oss/livecodebench without testing it on other models.
Looking at the differences:
Original (`eval/livecodebench/python_codegen`):

```yaml
user: |-
  Here is a problem for which you need to generate an executable code in python programming language.

  {question}
```

New default (`gpt-oss/livecodebench`):

```yaml
user: |-
  {question}

  Please avoid using `sys.stdin.buffer` to process input, and avoid using `threading`.
```

The new prompt is specifically tailored for GPT-OSS to avoid the `sys.stdin.buffer` AttributeError, but it's quite different from the original: it removes the explicit instruction to generate Python code and adds restrictions that may not be necessary for other models.
Recommendations:

1. Don't make this the default: this prompt is specifically optimized for GPT-OSS and could potentially hurt performance on other models like Nano-v3 or Qwen3.
2. Keep the original default and use model-specific overrides: users should explicitly specify `++prompt_config=gpt-oss/livecodebench` when evaluating GPT-OSS models.
3. Update documentation: if you want to keep this change, you'd need to update all existing evaluation commands in the documentation to use `++prompt_config=eval/livecodebench/python_codegen` for reproducibility.
The cleanest approach is to revert this change to the default and let GPT-OSS evaluations explicitly use the specialized prompt.
````python
FORMATTING_MESSAGE_WITH_STARTER_CODE = (
    """Please place the solution code in the following format:\n```python\n# Your solution code here\n```"""
)
FORMATTING_WITHOUT_STARTER_CODE = """Write Python code to solve the problem. Please place the solution code in the following format:\n```python\n# Your solution code here\n```"""
````
style: the reference to Qwen's implementation was removed. Consider keeping a comment linking to the original inspiration, or documenting why these specific formatting instructions were chosen, especially since this is a significant departure from the previous approach.
```diff
  METRICS_TYPE = "livecodebench"
  EVAL_SPLIT = "test_v6_2408_2505"
- GENERATION_ARGS = "++prompt_config=eval/livecodebench/python_codegen ++eval_type=livecodebench"
+ GENERATION_ARGS = "++prompt_config=gpt-oss/livecodebench ++eval_type=livecodebench"
```
Changing the default GENERATION_ARGS affects ALL models using this dataset (Nemotron, Qwen, etc.), not just GPT-OSS.

The GPT-OSS-specific constraints (`sys.stdin.buffer`, `threading`) may harm performance on other models that don't have these issues, or prevent valid solutions.

Per the previous thread, you should verify this prompt improves results for other models, or keep `eval/livecodebench/python_codegen` as the default and only override for GPT-OSS evaluations using `++prompt_config=gpt-oss/livecodebench`.
Signed-off-by: dgitman <dgitman@nvidia.com>
Objective of prompt change
When we evaluated GPTOSS using Nemo-Skills on LCB, we noticed the following error appeared frequently.
After inspecting the generated code, we noticed the following line of code was resulting in the above error.
So, we updated the prompt for GPTOSS to instruct the model not to generate code that leads to an AttributeError.
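The failure mode is easy to reproduce in isolation. When a harness swaps `sys.stdin` for an `io.StringIO` (a common pattern in sandboxed executors; whether the Nemo-Skills executor does exactly this is an assumption here), the text stream has no `.buffer` attribute, while the text-mode APIs the new prompt steers toward keep working:

```python
import io
import sys

# Redirect stdin the way many test harnesses do (illustrative).
sys.stdin = io.StringIO("1 2 3\n")

try:
    sys.stdin.buffer.read()  # the pattern GPT-OSS tended to generate
except AttributeError as err:
    print(f"fails: {err}")  # StringIO has no .buffer attribute

# Text-mode input still works and parses the same data.
numbers = [int(tok) for tok in sys.stdin.read().split()]
print(numbers)
```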
Impact of prompt change
The updates give the following scores on LCB v5 (2407-2412) [315 samples], also known as AA split:
GPT-OSS-120B
NS Eval command:
GPT-OSS-20B
Qwen3-32B
NVIDIA-Nemotron-3-Nano-30B-A3B-BF16