
fix: increase Ollama retry config + add transient-only mode #7677

Merged
DOsinga merged 4 commits into aaif-goose:main from fresh3nough:fix/ollama-model-loading-timeout-7635
Mar 26, 2026

Conversation

@fresh3nough
Contributor

Problem

When using Ollama with large models (e.g. Qwen3.5 35b), goose gives up after ~7 seconds with a 500 'connection refused' error. Ollama returns HTTP 500 while loading a model into memory, which can take 10-120s for large models on consumer hardware.

The default retry config (3 retries with 1s/2s/4s backoff = ~7s total) is insufficient for this scenario.

Fix

Override retry_config() in OllamaProvider with values tuned for local model loading:

  • 10 retries (up from 3)
  • 2s initial interval (up from 1s)
  • 1.5x backoff multiplier (down from 2.0, more gradual ramp)
  • 15s max interval (down from 30s)

This provides ~100s of total retry wait time (even with worst-case jitter, >60s), which handles models that take up to ~2 minutes to load.
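
As a back-of-envelope check of that figure, here is a tiny standalone Rust sketch (not goose code) that sums the capped backoff intervals for the values above:

```rust
// Sums the capped exponential backoff intervals for the new Ollama
// config: 10 retries, 2s initial interval, 1.5x multiplier, 15s cap.
fn main() {
    let mut interval: f64 = 2.0; // seconds
    let mut total: f64 = 0.0;
    for attempt in 1..=10 {
        total += interval;
        println!("retry {attempt}: wait {interval:.2}s (cumulative {total:.2}s)");
        interval = (interval * 1.5).min(15.0);
    }
    // The cumulative total comes out to ~101s, matching the ~100s estimate.
}
```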

Testing

  • Added unit tests verifying the retry config values and that total wait time exceeds 60s
  • All existing Ollama tests continue to pass
  • cargo clippy clean

Closes #7635

@fresh3nough force-pushed the fix/ollama-model-loading-timeout-7635 branch from 616e2df to 0334f3a on March 5, 2026 15:24

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 616e2df1c5


Comment thread crates/goose/src/providers/ollama.rs
…7635)

When Ollama loads a large model (e.g. Qwen3.5 35b), it returns HTTP 500
errors while the model is loading into memory. The default retry config
(3 retries, ~7s total) was insufficient for models that take 10-120s to
load, causing 'connection refused' errors.

Override retry_config() in OllamaProvider with Ollama-specific values:
- 10 retries (up from 3)
- 2s initial interval (up from 1s)
- 1.5x backoff multiplier (down from 2.0, more gradual)
- 15s max interval (down from 30s)

This provides ~100s of total retry wait time, enough for large models
on slower hardware.

Closes aaif-goose#7635

Signed-off-by: fre <anonwurcod@proton.me>
@fresh3nough force-pushed the fix/ollama-model-loading-timeout-7635 branch from 0334f3a to 0b03c0a on March 5, 2026 15:32
Client errors (400/404), such as mistyped model names, now fail fast instead
of waiting 80-100s through the full retry backoff. Transient errors
(5xx during model loading, connection refused, rate limits) still use
the extended Ollama retry config.

- Add transient_only flag to RetryConfig
- Update should_retry predicate to accept config
- Set transient_only on Ollama retry config
- Add unit tests for retry predicate behavior

Signed-off-by: fre <anonwurcod@proton.me>
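
A minimal sketch of the gating this commit describes, assuming a plausible shape for RetryConfig and should_retry (the actual goose signatures may differ):

```rust
use reqwest::StatusCode;
use std::time::Duration;

// Assumed shape of the config; in the PR, transient_only is a private
// field that only the Ollama provider's config sets.
pub struct RetryConfig {
    pub max_retries: u32,
    pub initial_interval: Duration,
    pub backoff_multiplier: f64,
    pub max_interval: Duration,
    transient_only: bool,
}

// The predicate now takes the config so it can honor transient_only.
fn should_retry(config: &RetryConfig, status: Option<StatusCode>, is_connect_error: bool) -> bool {
    if !config.transient_only {
        // Other providers keep the previous retry behavior.
        return status.map_or(is_connect_error, |s| !s.is_success());
    }
    match status {
        // 5xx during model loading and 429 rate limits are transient.
        Some(s) if s.is_server_error() || s == StatusCode::TOO_MANY_REQUESTS => true,
        // No HTTP status at all usually means a connection-level failure.
        None => is_connect_error,
        // Client errors (400/404, e.g. a mistyped model name) fail fast.
        Some(_) => false,
    }
}
```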
@fresh3nough force-pushed the fix/ollama-model-loading-timeout-7635 branch from 65bffa9 to 31bddd6 on March 5, 2026 15:45
Collaborator

@DOsinga left a comment


this does seem to change a lot more than just the retry for ollama

Comment thread crates/goose/src/providers/ollama.rs Outdated
@fresh3nough
Contributor Author

in addition I'm updating the PR title to reflect the transient-only mode

@fresh3nough changed the title from "fix: increase Ollama retry config for slow model loading" to "fix: increase Ollama retry config + add transient-only mode" on Mar 11, 2026
Remove test_ollama_retry_config_values (asserts constants equal
themselves) and test_ollama_retry_config_provides_sufficient_wait_time
(recomputes backoff math). The transient_only behavior test is retained
as it exercises actual feature logic.

Signed-off-by: fre <anonwurcod@proton.me>
@fresh3nough force-pushed the fix/ollama-model-loading-timeout-7635 branch from bcc0d6b to d571860 on March 11, 2026 16:01
@DOsinga
Collaborator

DOsinga commented Mar 11, 2026

so I am thinking we shouldn't fix this using the retry mechanism - it would retry any Ollama call, even a genuinely faulty one, for over a minute. is there a different way of doing this? something more targeted to the issue

@fresh3nough
Contributor Author

fresh3nough commented Mar 11, 2026

gotcha, couple options:

Option 1: long first-byte timeout + normal chunk timeout (builds on your earlier Ollama work; sketch below)

  • Use tokio::time::timeout only until the first SSE line/chunk arrives (like the "defer stall timeout" pattern we discussed before).
  • After the first chunk, revert to the normal 30s per-chunk timeout.
  • This handles slow model loading without touching retries at all.
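
A minimal sketch of option 1, assuming a futures Stream of SSE chunks (illustrative only; not the goose streaming code):

```rust
use std::time::Duration;
use futures::{Stream, StreamExt};
use tokio::time::timeout;

// Wait generously for the first chunk (the model may still be loading),
// then enforce the normal 30s stall timeout between subsequent chunks.
async fn collect_chunks<S, T, E>(mut stream: S) -> Result<Vec<T>, String>
where
    S: Stream<Item = Result<T, E>> + Unpin,
    E: std::fmt::Display,
{
    let mut chunks = Vec::new();
    let mut budget = Duration::from_secs(120); // first-byte budget
    loop {
        match timeout(budget, stream.next()).await {
            Ok(Some(Ok(chunk))) => {
                chunks.push(chunk);
                budget = Duration::from_secs(30); // per-chunk stall timeout
            }
            Ok(Some(Err(e))) => return Err(format!("stream error: {e}")),
            Ok(None) => return Ok(chunks), // stream ended normally
            Err(_) => return Err(format!("no chunk within {budget:?}")),
        }
    }
}
```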

or

Option 2: smart error-pattern retry predicate

Only extend retries when the error matches known "model still loading" patterns. Common Ollama signatures during cold start (from real issues + community reports):

  • HTTP 503 Service Unavailable
  • HTTP 500 with a body containing "llama runner", "loading model", "server not yet available", "timed out waiting for llama runner to start", or "model is loading"

In code, add a new method to OllamaProvider:

```rust
fn is_model_loading_error(&self, err: &reqwest::Error) -> bool {
    if err.status() == Some(StatusCode::SERVICE_UNAVAILABLE) {
        return true;
    }
    let body = err.to_string().to_lowercase();
    body.contains("llama runner")
        || body.contains("loading model")
        || body.contains("server not yet available")
        || body.contains("timed out waiting for llama runner to start")
        || body.contains("model is loading")
}
```

Then update the should_retry predicate to use this only for Ollama (and only up to ~120s total). Real errors fail instantly. This is precise, has zero user impact, and is easy to maintain.

Option 2 is more what I was thinking tbh.

Collaborator

@DOsinga left a comment


Clean fix — the transient_only mode correctly handles the Codex concern about 4xx errors getting the long Ollama backoff, the tests are meaningful, and the mechanical RetryConfig::new() updates to bedrock/databricks are the right approach now that transient_only is a private field. LGTM.
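
For context on those mechanical updates, a hypothetical continuation of the RetryConfig sketch earlier in this thread (the real constructor signature may differ):

```rust
use std::time::Duration;

impl RetryConfig {
    // With transient_only private, providers such as bedrock and
    // databricks must build their configs through new(), which leaves
    // the flag false; only the Ollama provider opts in.
    pub fn new(
        max_retries: u32,
        initial_interval: Duration,
        backoff_multiplier: f64,
        max_interval: Duration,
    ) -> Self {
        Self {
            max_retries,
            initial_interval,
            backoff_multiplier,
            max_interval,
            transient_only: false,
        }
    }
}
```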

Resolve merge conflict in crates/goose/src/providers/retry.rs by combining:
- upstream auth credential refresh logic
- PR transient_only support for should_retry

Signed-off-by: fre <anonwurcod@proton.me>
@DOsinga added this pull request to the merge queue Mar 26, 2026
Merged via the queue into aaif-goose:main with commit dfbd2dd Mar 26, 2026
21 checks passed
hydrosquall pushed a commit to hydrosquall/goose that referenced this pull request Mar 31, 2026
…se#7677)

Signed-off-by: fre <anonwurcod@proton.me>
Signed-off-by: Cameron Yick <cameron.yick@datadoghq.com>


Development

Successfully merging this pull request may close these issues.

Stops waiting for ollama before specified timeout reached

3 participants