smoke test allow pass for flaky providers #6638
Merged
Summary
Goose created this fix. Alternatively, we could just remove the experimental models for now.
Investigation Results
I investigated the flaky smoke-tests-code-exec job using the GitHub CLI and found:

Root Cause: Two models have inconsistent tool-calling behavior:
- google:gemini-3-pro-preview - Most frequent offender (~80% of failures). Sometimes returns empty responses without making any tool calls.
- openrouter:nvidia/nemotron-3-nano-30b-a3b - Occasional failures with similar behavior.

Pattern: When these models fail, they return nothing within ~5 seconds. When they succeed, they take ~45 seconds and properly call tools. This is typical of preview/experimental models.
Timeline:
- gemini-3-pro-preview added Nov 19, 2025
- nvidia/nemotron-3-nano-30b-a3b added Dec 31, 2025

Fix Applied
I modified scripts/test_providers.sh to add an "allowed failures" mechanism:
- ALLOWED_FAILURES array listing the flaky models
- is_allowed_failure() function to check if a model is in the list
- Failures of listed models are reported as ⚠ FLAKY instead of ✗ FAILED

Expected Behavior After Fix
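The mechanism, and the behavior it is meant to produce, can be sketched roughly as follows. This is a minimal illustration, not the actual contents of scripts/test_providers.sh: the report_failure helper and the HARD_FAILURES counter are hypothetical names, and the provider:model key format is an assumption based on the model names quoted in the investigation.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an "allowed failures" check.
# Entries use the provider:model form quoted in the investigation (assumed format).
ALLOWED_FAILURES=(
  "google:gemini-3-pro-preview"
  "openrouter:nvidia/nemotron-3-nano-30b-a3b"
)

# Return 0 if the given provider and model are in ALLOWED_FAILURES, 1 otherwise.
is_allowed_failure() {
  local key="$1:$2" allowed
  for allowed in "${ALLOWED_FAILURES[@]}"; do
    [[ "$allowed" == "$key" ]] && return 0
  done
  return 1
}

# Count of failures that should fail the job (hypothetical name).
HARD_FAILURES=0

# Report a failed provider/model pair (hypothetical helper).
# Only failures that are not on the allowed list increment HARD_FAILURES.
report_failure() {
  local provider="$1" model="$2"
  if is_allowed_failure "$provider" "$model"; then
    echo "⚠ ${provider}: ${model} (flaky)"
  else
    echo "✗ ${provider}: ${model}"
    HARD_FAILURES=$((HARD_FAILURES + 1))
  fi
}
```

Under these assumptions, a calling script would exit non-zero only when HARD_FAILURES is greater than zero, so a failure from a listed model is still visible in the output but does not fail the job.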
- If gemini-3-pro-preview fails: Test shows ⚠ google: gemini-3-pro-preview (flaky) and the job passes
- If any other model fails: Test shows ✗ provider: model and the job fails

This approach: