@baxen baxen commented Dec 25, 2025

Summary

Improves the smoke test prompt to be more explicit about requiring immediate tool usage, which should reduce flakiness for models like qwen/qwen3-coder and z-ai/glm-4.6.

Problem

The previous prompt "please list files in the current directory" was ambiguous. Models with weaker tool-calling capabilities would sometimes respond with text describing what they would do instead of actually calling the tool:

I'll help you list the files in the current directory. Let me use the appropriate tool for this.
[session ends without tool call]

This caused a ~50% failure rate for the qwen and GLM models in CI.
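The failure mode above can be sketched as a check on the session transcript: the run passes only if a tool call was actually made, not merely described. This is a minimal illustration, not the project's smoke test. The `Event` type, its `kind` labels, and `session_called_tool` are hypothetical names introduced here for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One step of a session transcript (hypothetical shape, for illustration)."""
    kind: str            # assumed labels: "text" or "tool_call"
    content: str = ""
    tool: str = ""       # tool name, only meaningful for "tool_call" events


def session_called_tool(events, tool="shell"):
    """Return True if any event in the session is a call to the named tool."""
    return any(e.kind == "tool_call" and e.tool == tool for e in events)


# The failing case from the description: the model only talks about the tool.
text_only = [
    Event(kind="text", content="I'll help you list the files in the current directory."),
]

# The desired case: the model calls the shell tool right away.
with_call = [
    Event(kind="tool_call", tool="shell", content="ls -la"),
]

print(session_called_tool(text_only))
print(session_called_tool(with_call))
```

A check of this shape treats a polite text-only reply as a failure, which is why an ambiguous prompt shows up as flakiness rather than as an error.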

Solution

Changed the prompt to explicitly instruct immediate tool usage:

  • Before: "please list files in the current directory"
  • After: "Immediately call the shell tool to run 'ls -la'. Do not ask for confirmation."

Testing

The smoke tests will validate this change on the PR itself. We are looking for improved pass rates on:

  • openrouter: qwen/qwen3-coder
  • openrouter: z-ai/glm-4.6
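One way to compare pass rates across repeated smoke-test runs is to tally outcomes per model. The sketch below assumes each run is recorded as a `(model, passed)` pair. The `pass_rates` helper and the sample outcomes are hypothetical and are not real CI data.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute the fraction of passing runs per model.

    `results` is an iterable of (model_name, passed) pairs.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for model, passed in results:
        totals[model] += 1
        passes[model] += int(passed)
    return {model: passes[model] / totals[model] for model in totals}


# Made-up outcomes purely to show the calculation; not measured results.
example_runs = [
    ("qwen/qwen3-coder", True),
    ("qwen/qwen3-coder", False),
    ("z-ai/glm-4.6", True),
    ("z-ai/glm-4.6", True),
]

rates = pass_rates(example_runs)
for model, rate in rates.items():
    print(f"{model}: {rate:.0%}")
```

Running the same tally before and after the prompt change gives a direct comparison for the two models listed above.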

Related

TSK-710

The previous prompt 'please list files in the current directory' was
ambiguous and didn't explicitly require tool usage. Models like
qwen/qwen3-coder and z-ai/glm-4.6 would sometimes respond with text
describing what they would do instead of actually calling the tool.

The new prompt explicitly instructs the model to immediately call the
shell tool without asking for confirmation, which should improve
reliability for models with weaker tool-calling capabilities.
Copilot AI review requested due to automatic review settings December 25, 2025 04:40

Copilot AI left a comment

Pull request overview

This PR improves the smoke test prompt to explicitly require immediate tool usage, addressing flakiness issues with models that have weaker tool-calling capabilities (particularly qwen/qwen3-coder and z-ai/glm-4.6).

Key Changes

  • Modified the test prompt from an ambiguous request to an explicit command requiring immediate tool execution
  • Changed "please list files in the current directory" to "Immediately call the shell tool to run 'ls -la'. Do not ask for confirmation."

* main:
  fix: adding more open models (#6300)
  docs: add goose for vs code extension (#6262)
  feat(code-mode): use server names for MCP extensions (#6284)
  docs: agent skills compatibility note (#6299)
  docs: clarify GOOSE_TERMINAL requires ~/.zshenv for zsh users (#6297)
  feat: add OpenAI Codex CLI provider (#6263)
  docs: fix Resources menu (#6292)
  Remove Advent of AI announcement banner (#6291)
  Add blog post: How We Use goose to Maintain goose (#6289)
grok was not liking this
Copilot AI review requested due to automatic review settings December 30, 2025 23:36

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

@michaelneale michaelneale left a comment

waiting for semgrep

@baxen baxen merged commit 38f5f33 into main Dec 31, 2025
26 checks passed
@baxen baxen deleted the baxen/reliable-smoke branch December 31, 2025 23:52