@zanesq
Collaborator

@zanesq zanesq commented Jan 22, 2026

Summary

Goose generated this fix. Alternatively, we could just remove the experimental models from the smoke tests for now.

Investigation Results

I investigated the flaky smoke-tests-code-exec job using the GitHub CLI and found:

  1. Root Cause: Two models have inconsistent tool-calling behavior:

    • google:gemini-3-pro-preview - Most frequent offender (~80% of failures). Sometimes returns empty responses without making any tool calls.
    • openrouter:nvidia/nemotron-3-nano-30b-a3b - Occasional failures with similar behavior.
  2. Pattern: When these models fail, they return an empty response within ~5 seconds and make no tool calls. When they succeed, they take ~45 seconds and call tools properly. This is typical of preview/experimental models.

  3. Timeline:

    • gemini-3-pro-preview added Nov 19, 2025
    • nvidia/nemotron-3-nano-30b-a3b added Dec 31, 2025

Fix Applied

I modified scripts/test_providers.sh to add an "allowed failures" mechanism:

  1. Added an ALLOWED_FAILURES array listing the flaky models
  2. Added an is_allowed_failure() function to check if a model is in the list
  3. Modified the test logic to:
    • Mark flaky model failures with ⚠ FLAKY instead of ✗ FAILED
    • Track "hard failures" separately from allowed failures
    • Only exit with error code 1 if there are hard failures
    • Show a clear message when all required tests pass but some flaky tests failed
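
A minimal sketch of the mechanism described above, not the actual contents of scripts/test_providers.sh. Only the names ALLOWED_FAILURES and is_allowed_failure() come from this PR; the summarize_results helper, its "provider:model=pass|fail" input format, and the exact output lines are assumptions made for illustration.

```shell
#!/usr/bin/env bash

# Models whose failures should not fail the job (the two models named in this PR).
ALLOWED_FAILURES=(
  "google:gemini-3-pro-preview"
  "openrouter:nvidia/nemotron-3-nano-30b-a3b"
)

# Return 0 if the given "provider:model" string is listed in ALLOWED_FAILURES.
is_allowed_failure() {
  local candidate="$1" entry
  for entry in "${ALLOWED_FAILURES[@]}"; do
    [[ "$entry" == "$candidate" ]] && return 0
  done
  return 1
}

# Hypothetical helper: takes "provider:model=pass|fail" arguments, prints one
# line per result, and returns 1 only if a non-allowed model failed.
summarize_results() {
  local hard_failures=0 flaky_failures=0 result model status
  for result in "$@"; do
    model="${result%=*}"    # text before the last "="
    status="${result##*=}"  # text after the last "="
    if [[ "$status" == "pass" ]]; then
      echo "✓ $model"
    elif is_allowed_failure "$model"; then
      echo "⚠ FLAKY $model"
      flaky_failures=$((flaky_failures + 1))
    else
      echo "✗ FAILED $model"
      hard_failures=$((hard_failures + 1))
    fi
  done
  if (( hard_failures > 0 )); then
    echo "$hard_failures hard failure(s)"
    return 1
  fi
  if (( flaky_failures > 0 )); then
    echo "All required tests passed; $flaky_failures flaky test(s) failed"
  fi
  return 0
}
```

Keeping the allowed list as a plain array means a model can be promoted back to "required" by deleting one line, and the flaky models still run on every job so their pass/fail status stays visible.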

Expected Behavior After Fix

  • If gemini-3-pro-preview fails: Test shows ⚠ google: gemini-3-pro-preview (flaky) and the job passes
  • If a non-flaky model fails: Test shows ✗ provider: model and the job fails
  • Summary clearly shows which tests were flaky vs hard failures

This approach:

  • ✅ Keeps testing the flaky models (we still see if they pass/fail)
  • ✅ Doesn't block PRs due to known flaky preview models
  • ✅ Still fails on real regressions in stable models
  • ✅ Provides clear visibility into flaky test status

@zanesq zanesq marked this pull request as ready for review January 22, 2026 17:28
@zanesq zanesq requested a review from michaelneale January 22, 2026 18:08
@zanesq
Collaborator Author

zanesq commented Jan 22, 2026

From the conversation in Discord: maybe we close this and look into the Gemini issue, since that seems concerning (-preview is the latest), or maybe we add retries for flaky providers. Note it's definitely flakiness, because they all passed on this PR.
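
The retry alternative mentioned in this comment could be sketched roughly as follows. This is an illustration only: retry_command, its arguments, and the fixed delay are hypothetical and are not part of scripts/test_providers.sh or this PR.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command up to max_attempts times and stop at the
# first success.
# Usage: retry_command <max_attempts> <delay_seconds> <command> [args...]
retry_command() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    # Run the command; a zero exit status ends the loop immediately.
    "$@" && return 0
    if (( attempt >= max_attempts )); then
      echo "giving up after $attempt attempt(s): $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Unlike an allowed-failures list, a retry wrapper like this would still fail the job if a flaky model fails on every attempt, at the cost of longer runs when retries are needed.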

Collaborator

@michaelneale michaelneale left a comment

is ok for now - and will have a follow up to chase these

@zanesq zanesq merged commit e7bfdf8 into main Jan 23, 2026
18 checks passed
@zanesq zanesq deleted the zane/flaky-providers branch January 23, 2026 00:10
@rabi
Contributor

rabi commented Jan 23, 2026

From conversation in Discord maybe we close this and look into the gemini issue since that seems concerning (-preview is latest)

Hi Zane! If the gemini3 failures are with code_execution, I have #6555, which might fix the issue, as I don't see the empty-response problem with it locally.

fbalicchia pushed a commit to fbalicchia/goose that referenced this pull request Jan 23, 2026
Signed-off-by: fbalicchia <fbalicchia@cuebiq.com>
tlongwell-block added a commit that referenced this pull request Jan 23, 2026
* origin/main:
  Fix GCP Vertex AI global endpoint support for Gemini 3 models (#6187)
  fix: macOS keychain infinite prompt loop    (#6620)
  chore: reduce duplicate or unused cargo deps (#6630)
  feat: codex subscription support (#6600)
  smoke test allow pass for flaky providers (#6638)
  feat: Add built-in skill for goose documentation reference (#6534)
  Native images (#6619)
  docs: ml-based prompt injection detection (#6627)
  Strip the audience for compacting (#6646)
  chore(release): release version 1.21.0 (minor) (#6634)
  add collapsable chat nav (#6649)
  fix: capitalize Rust in CONTRIBUTING.md (#6640)
  chore(deps): bump lodash from 4.17.21 to 4.17.23 in /ui/desktop (#6623)
  Vibe mcp apps (#6569)
  Add session forking capability (#5882)
  chore(deps): bump lodash from 4.17.21 to 4.17.23 in /documentation (#6624)
  fix(docs): use named import for globby v13 (#6639)
  PR Code Review (#6043)
  fix(docs): use dynamic import for globby ESM module (#6636)

# Conflicts:
#	Cargo.lock
#	crates/goose-server/src/routes/session.rs
4 participants