-
Notifications
You must be signed in to change notification settings - Fork 2.4k
fix: improve smoke test prompt for reliable tool calling #6281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The previous prompt 'please list files in the current directory' was ambiguous and didn't explicitly require tool usage. Models like qwen/qwen3-coder and z-ai/glm-4.6 would sometimes respond with text describing what they would do instead of actually calling the tool. The new prompt explicitly instructs the model to immediately call the shell tool without asking for confirmation, which should improve reliability for models with weaker tool-calling capabilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR improves the smoke test prompt to explicitly require immediate tool usage, addressing flakiness issues with models that have weaker tool-calling capabilities (particularly qwen/qwen3-coder and z-ai/glm-4.6).
Key Changes
- Modified the test prompt from an ambiguous request to an explicit command requiring immediate tool execution
- Changed "please list files in the current directory" to "Immediately call the shell tool to run 'ls -la'. Do not ask for confirmation."
* main: fix: adding more open models (#6300) docs: add goose for vs code extension (#6262) feat(code-mode): use server names for MCP extensions (#6284) docs: agent skills compatibility note (#6299) docs: clarify GOOSE_TERMINAL requires ~/.zshenv for zsh users (#6297) feat: add OpenAI Codex CLI provider (#6263) docs: fix Resources menu (#6292) Remove Advent of AI announcement banner (#6291) Add blog post: How We Use goose to Maintain goose (#6289)
grok was not liking this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
michaelneale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
waiting for semgrep
Summary
Improves the smoke test prompt to be more explicit about requiring immediate tool usage, which should reduce flakiness for models like
qwen/qwen3-coderandz-ai/glm-4.6.Problem
The previous prompt "please list files in the current directory" was ambiguous. Models with weaker tool-calling capabilities would sometimes respond with text describing what they would do instead of actually calling the tool:
This caused ~50% failure rate for qwen and GLM models in CI.
Solution
Changed the prompt to explicitly instruct immediate tool usage:
Testing
The smoke tests will validate this change on the PR itself. Looking for improved pass rates on:
openrouter: qwen/qwen3-coderopenrouter: z-ai/glm-4.6Related
TSK-710