smoke test allow pass for flaky providers #6638

zanesq · 2026-01-22T16:52:40Z

Summary

Goose created this fix. Alternatively, we could just remove the experimental models for now.

Investigation Results

I investigated the flaky smoke-tests-code-exec job using the GitHub CLI and found:

Root Cause: Two models have inconsistent tool-calling behavior:
- google:gemini-3-pro-preview - Most frequent offender (~80% of failures). Sometimes returns empty responses without making any tool calls.
- openrouter:nvidia/nemotron-3-nano-30b-a3b - Occasional failures with similar behavior.
Pattern: When these models fail, they return nothing within ~5 seconds. When they succeed, they take ~45 seconds and properly call tools. This is typical of preview/experimental models.
Timeline:
- gemini-3-pro-preview added Nov 19, 2025
- nvidia/nemotron-3-nano-30b-a3b added Dec 31, 2025

Fix Applied

I modified scripts/test_providers.sh to add an "allowed failures" mechanism:

Added an ALLOWED_FAILURES array listing the flaky models
Added an is_allowed_failure() function to check if a model is in the list
Modified the test logic to:
- Mark flaky model failures with ⚠ FLAKY instead of ✗ FAILED
- Track "hard failures" separately from allowed failures
- Only exit with error code 1 if there are hard failures
- Show a clear message when all required tests pass but some flaky tests failed
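
For readers without the diff open, the mechanism described above could look roughly like the sketch below. Only ALLOWED_FAILURES and is_allowed_failure() are named in this PR; record_result, summarize, HARD_FAILURES, FLAKY_FAILURES and the ✓ marker are hypothetical names chosen for illustration, and the real scripts/test_providers.sh may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: ALLOWED_FAILURES and is_allowed_failure come from the PR text;
# every other name below is an assumption made for illustration.

# Models whose failures should not fail the job, as "provider:model".
ALLOWED_FAILURES=(
  "google:gemini-3-pro-preview"
  "openrouter:nvidia/nemotron-3-nano-30b-a3b"
)

# Return 0 if the given "provider:model" is on the allowed-failure list.
is_allowed_failure() {
  local candidate="$1" entry
  for entry in "${ALLOWED_FAILURES[@]}"; do
    [[ "$entry" == "$candidate" ]] && return 0
  done
  return 1
}

HARD_FAILURES=()
FLAKY_FAILURES=()

# Record one result. $1 = "provider:model", $2 = exit status of its test run.
record_result() {
  local key="$1" status="$2"
  local provider="${key%%:*}" model="${key#*:}"
  if [[ "$status" -eq 0 ]]; then
    echo "✓ ${provider}: ${model}"
  elif is_allowed_failure "$key"; then
    echo "⚠ ${provider}: ${model} (flaky)"
    FLAKY_FAILURES+=("$key")
  else
    echo "✗ ${provider}: ${model}"
    HARD_FAILURES+=("$key")
  fi
}

# Print a summary; return 1 only if at least one hard failure was recorded.
summarize() {
  if (( ${#HARD_FAILURES[@]} > 0 )); then
    echo "Hard failures: ${HARD_FAILURES[*]}"
    return 1
  fi
  if (( ${#FLAKY_FAILURES[@]} > 0 )); then
    echo "All required tests passed; flaky failures: ${FLAKY_FAILURES[*]}"
  fi
  return 0
}
```

In this sketch the script would end with `summarize || exit 1`, so a failure recorded for gemini-3-pro-preview is reported but does not change the exit code, while a failure for any model outside the list does.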

Expected Behavior After Fix

If gemini-3-pro-preview fails: Test shows ⚠ google: gemini-3-pro-preview (flaky) and the job passes
If a non-flaky model fails: Test shows ✗ provider: model and the job fails
Summary clearly shows which tests were flaky vs hard failures

This approach:

✅ Keeps testing the flaky models (we still see if they pass/fail)
✅ Doesn't block PRs due to known flaky preview models
✅ Still fails on real regressions in stable models
✅ Provides clear visibility into flaky test status

zanesq · 2026-01-22T18:10:24Z

From a conversation in Discord: maybe we close this and look into the Gemini issue, since that seems concerning (-preview is the latest), or maybe we add retries for flaky providers. Note that it's definitely flakiness, because they all passed on this PR.
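
If we went the retry route instead, a minimal sketch could look like the following. The retry helper, its argument order, and the fixed delay are all assumptions for illustration; nothing like this exists in scripts/test_providers.sh as part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): run a command until it succeeds,
# up to max_attempts times, sleeping delay seconds between attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

Since the failing runs come back empty within ~5 seconds, a retry for only the flaky models would add little time to the job, but it would also hide how often the models actually fail, which the allowed-failure output keeps visible.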

michaelneale

This is OK for now, and we will have a follow-up to chase these.

rabi · 2026-01-23T02:22:44Z

From conversation in Discord maybe we close this and look into the gemini issue since that seems concerning (-preview is latest)

Hi Zane! If the gemini3 failures are with code_execution, I have #6555, which might fix the issue, as I don't see the empty-response issue with it locally.


pass flaky providers

44d75cb

zanesq marked this pull request as ready for review January 22, 2026 17:28

zanesq requested a review from michaelneale January 22, 2026 18:08

michaelneale approved these changes Jan 22, 2026


zanesq merged commit e7bfdf8 into main Jan 23, 2026
18 checks passed

zanesq deleted the zane/flaky-providers branch January 23, 2026 00:10

fbalicchia pushed a commit to fbalicchia/goose that referenced this pull request Jan 23, 2026

smoke test allow pass for flaky providers (block#6638)

2475760

Signed-off-by: fbalicchia <fbalicchia@cuebiq.com>

4 participants
