fix: lazy init local inference runtime in router instead of app start#8656
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bab91e8b2c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| summary[key] = | ||
| key.toLowerCase().includes('secret') || key.toLowerCase().includes('key') | ||
| ? '[REDACTED]' |
There was a problem hiding this comment.
Redact token/password env vars before persisting diagnostics
startGoosed persists a summarized copy of spawnEnv into startup diagnostics, but summarizeEnv only redacts keys containing secret or key. This leaves common credential variables such as *_TOKEN, *_PASSWORD, and auth/cookie-style names in plaintext in userData/logs/startup/*.json, which can leak sensitive data through local logs or support bundles. Please broaden redaction patterns (or avoid writing full env values) before recording this snapshot.
Useful? React with 👍 / 👎.
| const flush = () => { | ||
| fs.writeFileSync(diagnosticsPath, `${JSON.stringify(diagnostics, null, 2)}\n`); | ||
| }; |
There was a problem hiding this comment.
Guard diagnostics file writes from startup path failures
createStartupDiagnostics writes with fs.writeFileSync and no error handling, and record() calls this on every startup event (including stdout/stderr callbacks). On disk-full or permission-denied profiles (ENOSPC/EACCES), the throw propagates from these callbacks and can break startup even though diagnostics are non-critical. Wrap writes in try/catch and disable further diagnostics after the first write error so backend startup remains resilient.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 102558671e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| app.quit(); | ||
| } | ||
| recordStartupEvent('startup_ready'); |
There was a problem hiding this comment.
Record startup_ready only on successful startup
When serverReady is false, the code calls app.quit() but does not return, so execution falls through and still records startup_ready. Because app.quit() is asynchronous in Electron, this path is reachable on every startup failure (both local and external), and the diagnostics file will incorrectly mark a failed startup as ready, which undermines the new startup tracing and makes failure triage misleading.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2c9073e127
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| boot_marker("logging init start"); | ||
| crate::logging::setup_logging(Some("goosed"))?; | ||
| boot_marker("logging init done"); |
There was a problem hiding this comment.
Gate boot markers behind debug logging
This commit adds boot_marker(...) calls on the normal success path, so goosed now writes many GOOSED_BOOT lines to stderr even when nothing is wrong. That violates the logging rule in /workspace/goose/AGENTS.md (only add logs for errors/security), and it also degrades desktop triage because ui/desktop/src/goosed.ts collects stderr into startup error diagnostics. Please guard these markers behind an explicit debug flag or a trace-level logger so production stderr stays signal-only.
Useful? React with 👍 / 👎.
| const attemptId = `goosed-startup-${startedAt.toISOString().replace(/:/g, '-')}-${process.pid}.json`; | ||
| const diagnosticsPath = path.join(diagnosticsDir, attemptId); |
There was a problem hiding this comment.
Add retention for startup diagnostics logs
A new diagnostics file is created for every startup attempt, and main.ts now always enables this path (diagnosticsDir: STARTUP_LOGS_DIR), but there is no age/count cleanup anywhere. On long-lived installs this will grow userData/logs/startup without bound and can consume noticeable disk space, especially when stderr tails are captured. Add a retention step (for example, cap by file count and/or max age) before creating the next attempt file.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ed7e88951d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| axum::extract::State(_state): axum::extract::State<Arc<AppState>>, | ||
| ) -> Result<Json<Vec<LocalModelResponse>>, ErrorResponse> { | ||
| let recommended_id = recommend_local_model(&state.inference_runtime); | ||
| let recommended_id = recommend_local_model(&InferenceRuntime::get_or_init()); |
There was a problem hiding this comment.
Keep a strong InferenceRuntime reference alive
list_local_models now passes a temporary Arc from InferenceRuntime::get_or_init() directly into recommend_local_model, and this same pattern is repeated in get_repo_files. Since the global runtime cache is a Weak and AppState no longer stores a strong Arc, these route calls can initialize and then immediately drop the llama backend on each request when no local provider is active. That is a regression in latency and can surface the BackendAlreadyInitialized unreachable! panic on subsequent requests if backend teardown is incomplete. A long-lived strong reference (as before) is needed to keep runtime lifecycle stable.
Useful? React with 👍 / 👎.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 701435b3a4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const startupTrace = createStartupDiagnostics(diagnosticsDir, workingDir, null, null); | ||
|
|
||
| if (externalGoosed?.enabled && externalGoosed.url) { | ||
| const url = externalGoosed.url.replace(/\/$/, ''); |
There was a problem hiding this comment.
Populate diagnostics baseUrl on external backend path
startGoosed computes the external backend URL here but immediately returns without copying it into the startup diagnostics object, which was initialized with baseUrl: null. In external-backend startup failures, the generated goosed-startup-*.json then lacks the endpoint that was actually probed, making it harder to distinguish misconfiguration (wrong host/port) from connectivity issues. Set startupTrace.diagnostics.baseUrl before each external early return.
Useful? React with 👍 / 👎.
| --- | ||
| title: Debug Desktop Startup Failures | ||
| sidebar_label: Debug Desktop Startup Failures | ||
| description: Find the desktop startup diagnostics log, understand the key fields, and share the right artifacts when Goose fails to start. |
There was a problem hiding this comment.
Use lowercase goose branding in troubleshooting docs
documentation/AGENTS.md requires writing the product name as lowercase goose throughout the documentation tree, but this new page introduces Goose in user-facing text. This violates the documented docs standard and will keep branding inconsistent unless these occurrences are normalized.
Useful? React with 👍 / 👎.
jh-block
left a comment
There was a problem hiding this comment.
This is a good mitigation, allowing users for whom local inference fails to start to still use the rest of the app, but I think we should go further and find out why local inference isn't working for them and fix it -- one clue in #8459 is the MSVC version issue. I am happy to take a closer look at that!
| #[cfg(feature = "local-inference")] | ||
| pub fn get_inference_runtime(&self) -> Arc<InferenceRuntime> { | ||
| self.inference_runtime | ||
| .get_or_init(InferenceRuntime::get_or_init) |
There was a problem hiding this comment.
Maybe we can make this fallible (make InferenceRuntime::get_or_init return a Result rather than panicking, and return Result in this function) so that we can gracefully handle problems with starting the inference runtime
There was a problem hiding this comment.
That's a good idea!
| export const createStartupDiagnostics = ( | ||
| diagnosticsDir: string | undefined, | ||
| workingDir: string, | ||
| goosedPath: string | null, |
There was a problem hiding this comment.
The only call site of this function passes null for goosedPath and baseUrl and then mutates startupTrace.diagnostics.goosedPath / .baseUrl later. Either drop the parameters, or use them instead of the manual mutation of the StartupTrace
| flush: () => void; | ||
| } | ||
|
|
||
| const STARTUP_TAIL_LIMIT = 40; |
There was a problem hiding this comment.
I think this needs to be a little higher (80 or even 100), the diag in 8459 was truncated right before the actual crash.
| elapsedMs: Date.now() - monotonicStart, | ||
| ...(details ? { details } : {}), | ||
| }); | ||
| flush(); |
There was a problem hiding this comment.
flushing after every line might slow down startup a little, but it's probably not too bad
* main: (41 commits) removed the specific code owner for documentation change (#8749) fix(providers): handle missing delta field in streaming chunks (#8700) refactor(providers): extract http_status module and rename handle_status_openai_compat (#8620) fix(providers/openai): accept streaming chunks with both reasoning fields (#8715) feat: associate threads with projects (#8745) upgrade goose sdk and tui to be compatible with the latest agentclientprotocol/sdk package (#8667) feat: extend goose2 context window ux with auto-compaction (#8721) improve goose2 agent management flows (#8737) alexhancock/tui-improvements (#8736) fix: add strict:false to Responses API tools and gpt-5.4 to known models (#8636) persist and reliably apply chat model selection (#8734) merge goose-acp crate into goose (#8726) docs: AGENTS.md section on goose2 desktop backend architecture (#8732) feat: goose2 message bubble + action tray (#8720) consolidate provider ACP methods onto inventory (#8710) ci: declare and enforce MSRV of 1.91.1 (#8670) fix(ui): correct grammar in apps view description (#8668) (#8679) Stop load openai fast model for openapi compatible custom endpoint (#8644) feat(hooks): add Husky git hooks for ui/goose2 (#8577) fix: links in chat could not be opened (#8544) ...
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aed2b17bcb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| extension_loading_tasks: Arc::new(Mutex::new(HashMap::new())), | ||
| #[cfg(feature = "local-inference")] | ||
| inference_runtime: InferenceRuntime::get_or_init(), | ||
| inference_runtime: Arc::new(OnceLock::new()), |
There was a problem hiding this comment.
Keep a process-level strong local runtime reference
AppState::new now leaves the local-inference runtime unset until get_inference_runtime() is called, which means the server no longer guarantees a process-lifetime strong Arc<InferenceRuntime>. The normal local provider path (LocalInferenceProvider::from_env) still initializes via the global weak cache and does not populate this lock, so if local-provider sessions are removed/evicted before any local-inference route call, the runtime can be dropped and then re-initialized later; on backends that surface BackendAlreadyInitialized, that path still reaches the unreachable! panic. Keep a long-lived strong reference regardless of which path initializes first.
Useful? React with 👍 / 👎.
| .sort((a, b) => b.modifiedMs - a.modifiedMs); | ||
|
|
||
| for (const startupLog of startupLogs.slice(STARTUP_LOGS_TO_KEEP)) { | ||
| fs.unlinkSync(startupLog.filePath); |
There was a problem hiding this comment.
Guard startup-log retention deletes from hard failure
Startup diagnostics cleanup currently uses fs.unlinkSync without error handling, so a single locked/read-only/missing file in logs/startup can throw and abort createStartupDiagnostics before backend startup. Since diagnostics are non-critical, this makes app launch fragile under common filesystem issues (EACCES, EPERM, transient ENOENT). Wrap cleanup filesystem calls in try/catch and continue startup when retention pruning fails.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aed2b17bcb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (startupTrace) { | ||
| startupTrace.diagnostics.pid = goosedProcess.pid ?? null; | ||
| startupTrace.record('spawn_success', { pid: goosedProcess.pid ?? null }); | ||
| } |
There was a problem hiding this comment.
Record spawn_success only after process actually spawns
spawn_success is recorded immediately after spawn(...) returns, but Node can still emit an error event without ever spawning the child (for example EACCES, ENOENT from loader/runtime deps, or AV quarantine), in which case pid is null and the diagnostics timeline incorrectly claims success before failure. This makes startup traces misleading in exactly the failure cases this logging is meant to debug; emit this marker from the child spawn event (or gate it on a non-null PID) instead.
Useful? React with 👍 / 👎.
* main: fix: preprompt would show after loading session (#8744) commands to acp+ migration: extensions management (#8733) feat: desktop notification when goose finishes a task (#8647) harden code review skill for async state and default-resolution bugs (#8740) Feature/at agent mention (#8571) fix: removed hardcoded dependency of goose-acp-macro (#8753) perf: split agent setup into staged phases to reduce startup blocking (#8746) Add /skills command (#8600) Replace deprecated Claude ACP package links (#8625)
* main: (29 commits) chore(deps): bump winreg from 0.55.0 to 0.56.0 (#8829) Fix grammar issue (#8669) colorize context window indicator (#8851) Refresh canonical model metadata from models.dev (#8838) fix(ci): prevent flaky smoke test timeouts from failing the build (#8837) updates: release 0.19.0 of the tui/sdk/etc (#8806) add a goose2 signed release flow (#8728) Port provider tests to typescript (#8237) refactor: make ACP server smaller (#8787) Add NVIDIA provider, and improve declarative provider UX (#8798) fix: removed failed provider test for deprecated providers (#8801) fix: only call cleanup when the pr is from same repo (#8799) chore: check stale for draft pr (#8803) fix: use _meta instead of meta in newSession request (#8796) fix: add missing underscore prefix in updateWorkingDir method name (#8743) feat: migrate session metadata storage from frontend overlay to backend (#8769) Add more info to BUILDING_LINUX (#8789) feat(acp): Align to new request patterns of ACP Streamable HTTP/WS transport (#8605) Dedupe and organize skills/sources (#8731) docs: add skills slash command (#8783) ...
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4ce5a6573e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| }; | ||
|
|
||
| goosedProcess.on('exit', (code) => { | ||
| goosedProcess.on('exit', (code, signal) => { |
There was a problem hiding this comment.
Persist diagnostics after stderr stream fully drains
child_exit is recorded from the child process exit event, but in Node that event can fire before stderr is fully drained. Since stderr updates only append to diagnostics.stderrTail and do not flush to disk, lines that arrive after exit can be missing from goosed-startup-*.json (for example, final crash output emitted during teardown). Record the terminal event on close (or flush when appending stderr) so persisted startup diagnostics contain the complete tail.
Useful? React with 👍 / 👎.
…ture Conflict resolutions: - configuration.rs: kept both mcp_only (fork) and tls_cert_path/tls_key_path (upstream aaif-goose#8400) - commands/agent.rs: kept fork's AppState::new(tls, settings) signature + added boot_marker calls from upstream aaif-goose#8656 - state.rs: adopted upstream's OnceLock lazy-init for inference_runtime (aaif-goose#8656) + retained settings: Arc<Settings> from fork
Summary
Fix Issue: #8459
Changes
doctorbut the log file is generated on the client side and server side does not know unless we put hard coded directory, we can think about more of this when start up error occurs more often)Testing
Tested local inference feature