fix: improve smoke test prompt for reliable tool calling #6281

baxen · 2025-12-25T04:40:39Z

Summary

Improves the smoke test prompt to be more explicit about requiring immediate tool usage, which should reduce flakiness for models like qwen/qwen3-coder and z-ai/glm-4.6.

Problem

The previous prompt "please list files in the current directory" was ambiguous. Models with weaker tool-calling capabilities would sometimes respond with text describing what they would do instead of actually calling the tool:

I'll help you list the files in the current directory. Let me use the appropriate tool for this.
[session ends without tool call]

This caused a ~50% failure rate for qwen and GLM models in CI.
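
A minimal sketch of how this failure mode can be detected from a captured transcript. The marker string `shell | developer` is an assumption about what the CLI prints when the shell tool is invoked; it is not taken from this PR, so adjust `TOOL_MARKER` to whatever the real output contains. `has_tool_call` is a hypothetical helper name.

```shell
#!/bin/sh
# ASSUMPTION: the transcript contains this literal string when the shell tool
# is called. The real marker depends on the CLI's output format.
TOOL_MARKER="${TOOL_MARKER:-shell | developer}"

# Hypothetical helper: exits 0 if the transcript on stdin contains the marker.
# -F treats the marker as a fixed string, so the "|" is matched literally.
has_tool_call() {
  grep -qF -- "$TOOL_MARKER"
}

# The text-only reply quoted above: the model describes the action but the
# session ends without a tool call.
text_only="I'll help you list the files in the current directory. Let me use the appropriate tool for this."

if printf '%s\n' "$text_only" | has_tool_call; then
  echo "pass"
else
  echo "fail: no tool call found"
fi
```

A check like this fails on a text-only reply no matter how plausible the prose sounds, which is why an ambiguous prompt shows up as test flakiness and not as a hard error.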

Solution

Changed the prompt to explicitly instruct immediate tool usage:

Before: "please list files in the current directory"
After: "Immediately call the shell tool to run 'ls -la'. Do not ask for confirmation."

Testing

The smoke tests will validate this change on the PR itself. Looking for improved pass rates on:

openrouter: qwen/qwen3-coder
openrouter: z-ai/glm-4.6
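
To compare pass rates before and after the prompt change, something like the following could be run locally. This is a sketch only: `pass_rate` is a hypothetical helper, and `true` / `false` stand in for a real smoke test invocation for one provider and model, which this PR does not show.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print the percentage of runs
# that exit 0. N must be greater than zero.
# Usage: pass_rate N command [args...]
pass_rate() {
  pr_total="$1"
  shift
  pr_passed=0
  pr_i=0
  while [ "$pr_i" -lt "$pr_total" ]; do
    # Output is discarded; only the exit status counts as pass or fail.
    if "$@" >/dev/null 2>&1; then
      pr_passed=$((pr_passed + 1))
    fi
    pr_i=$((pr_i + 1))
  done
  echo $((pr_passed * 100 / pr_total))
}

# Stand-in commands; replace with the real smoke test for a given model.
pass_rate 4 true    # every run passes
pass_rate 4 false   # every run fails
```

With a ~50% failure rate a handful of runs is too few to tell an improvement from noise, so a larger N gives a more useful comparison.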

The previous prompt 'please list files in the current directory' was ambiguous and didn't explicitly require tool usage. Models like qwen/qwen3-coder and z-ai/glm-4.6 would sometimes respond with text describing what they would do instead of actually calling the tool. The new prompt explicitly instructs the model to immediately call the shell tool without asking for confirmation, which should improve reliability for models with weaker tool-calling capabilities.

Copilot

Pull request overview

This PR improves the smoke test prompt to explicitly require immediate tool usage, addressing flakiness issues with models that have weaker tool-calling capabilities (particularly qwen/qwen3-coder and z-ai/glm-4.6).

Key Changes

Modified the test prompt from an ambiguous request to an explicit command requiring immediate tool execution
Changed "please list files in the current directory" to "Immediately call the shell tool to run 'ls -la'. Do not ask for confirmation."

* main:
  fix: adding more open models (#6300)
  docs: add goose for vs code extension (#6262)
  feat(code-mode): use server names for MCP extensions (#6284)
  docs: agent skills compatibility note (#6299)
  docs: clarify GOOSE_TERMINAL requires ~/.zshenv for zsh users (#6297)
  feat: add OpenAI Codex CLI provider (#6263)
  docs: fix Resources menu (#6292)
  Remove Advent of AI announcement banner (#6291)
  Add blog post: How We Use goose to Maintain goose (#6289)

grok was not liking this

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

scripts/test_providers.sh

michaelneale

waiting for semgrep

Copilot AI review requested due to automatic review settings December 25, 2025 04:40

Copilot started reviewing on behalf of baxen December 25, 2025 04:41

Copilot AI reviewed Dec 25, 2025

michaelneale added 2 commits December 31, 2025 09:51

simplify again

818074b

grok was not liking this

Copilot AI review requested due to automatic review settings December 30, 2025 23:36

Copilot started reviewing on behalf of michaelneale December 30, 2025 23:37

Copilot AI reviewed Dec 30, 2025

scripts/test_providers.sh (resolved)

michaelneale approved these changes Dec 31, 2025

baxen merged commit 38f5f33 into main Dec 31, 2025
26 checks passed

baxen deleted the baxen/reliable-smoke branch December 31, 2025 23:52

github-actions bot mentioned this pull request Jan 6, 2026

chore(release): release version 1.19.0 (minor) #6344

Merged


3 participants
