Skip to content

fix: lazy init local inference runtime in router instead of app start#8656

Merged
lifeizhou-ap merged 12 commits into
mainfrom
lifei/logging-whne-start-error
Apr 27, 2026
Merged

fix: lazy init local inference runtime in router instead of app start#8656
lifeizhou-ap merged 12 commits into
mainfrom
lifei/logging-whne-start-error

Conversation

@lifeizhou-ap
Copy link
Copy Markdown
Collaborator

@lifeizhou-ap lifeizhou-ap commented Apr 20, 2026

Summary

Fix Issue: #8459

Changes

  • Init the local inference runtime in local inference router instead of the goosed started
  • Added minimal startup logs for debug in the future and added diagnose doc (I was thinking about including this in the doctor but the log file is generated on the client side and server side does not know unless we put hard coded directory, we can think about more of this when start up error occurs more often)

Testing

Tested local inference feature

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bab91e8b2c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ui/desktop/src/goosed.ts Outdated
Comment on lines +224 to +226
summary[key] =
key.toLowerCase().includes('secret') || key.toLowerCase().includes('key')
? '[REDACTED]'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Redact token/password env vars before persisting diagnostics

startGoosed persists a summarized copy of spawnEnv into startup diagnostics, but summarizeEnv only redacts keys containing secret or key. This leaves common credential variables such as *_TOKEN, *_PASSWORD, and auth/cookie-style names in plaintext in userData/logs/startup/*.json, which can leak sensitive data through local logs or support bundles. Please broaden redaction patterns (or avoid writing full env values) before recording this snapshot.

Useful? React with 👍 / 👎.

Comment thread ui/desktop/src/goosed.ts Outdated
Comment on lines +265 to +267
const flush = () => {
fs.writeFileSync(diagnosticsPath, `${JSON.stringify(diagnostics, null, 2)}\n`);
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard diagnostics file writes from startup path failures

createStartupDiagnostics writes with fs.writeFileSync and no error handling, and record() calls this on every startup event (including stdout/stderr callbacks). On disk-full or permission-denied profiles (ENOSPC/EACCES), the throw propagates from these callbacks and can break startup even though diagnostics are non-critical. Wrap writes in try/catch and disable further diagnostics after the first write error so backend startup remains resilient.

Useful? React with 👍 / 👎.

@lifeizhou-ap lifeizhou-ap changed the title added more logging to debug added more logging to debug (WIP) Apr 20, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 102558671e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ui/desktop/src/main.ts Outdated
}
app.quit();
}
recordStartupEvent('startup_ready');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Record startup_ready only on successful startup

When serverReady is false, the code calls app.quit() but does not return, so execution falls through and still records startup_ready. Because app.quit() is asynchronous in Electron, this path is reachable on every startup failure (both local and external), and the diagnostics file will incorrectly mark a failed startup as ready, which undermines the new startup tracing and makes failure triage misleading.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2c9073e127

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +42 to +44
boot_marker("logging init start");
crate::logging::setup_logging(Some("goosed"))?;
boot_marker("logging init done");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate boot markers behind debug logging

This commit adds boot_marker(...) calls on the normal success path, so goosed now writes many GOOSED_BOOT lines to stderr even when nothing is wrong. That violates the logging rule in /workspace/goose/AGENTS.md (only add logs for errors/security), and it also degrades desktop triage because ui/desktop/src/goosed.ts collects stderr into startup error diagnostics. Please guard these markers behind an explicit debug flag or a trace-level logger so production stderr stays signal-only.

Useful? React with 👍 / 👎.

Comment thread ui/desktop/src/goosed.ts Outdated
Comment on lines +228 to +229
const attemptId = `goosed-startup-${startedAt.toISOString().replace(/:/g, '-')}-${process.pid}.json`;
const diagnosticsPath = path.join(diagnosticsDir, attemptId);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add retention for startup diagnostics logs

A new diagnostics file is created for every startup attempt, and main.ts now always enables this path (diagnosticsDir: STARTUP_LOGS_DIR), but there is no age/count cleanup anywhere. On long-lived installs this will grow userData/logs/startup without bound and can consume noticeable disk space, especially when stderr tails are captured. Add a retention step (for example, cap by file count and/or max age) before creating the next attempt file.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed7e88951d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

axum::extract::State(_state): axum::extract::State<Arc<AppState>>,
) -> Result<Json<Vec<LocalModelResponse>>, ErrorResponse> {
let recommended_id = recommend_local_model(&state.inference_runtime);
let recommended_id = recommend_local_model(&InferenceRuntime::get_or_init());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep a strong InferenceRuntime reference alive

list_local_models now passes a temporary Arc from InferenceRuntime::get_or_init() directly into recommend_local_model, and this same pattern is repeated in get_repo_files. Since the global runtime cache is a Weak and AppState no longer stores a strong Arc, these route calls can initialize and then immediately drop the llama backend on each request when no local provider is active. That is a regression in latency and can surface the BackendAlreadyInitialized unreachable! panic on subsequent requests if backend teardown is incomplete. A long-lived strong reference (as before) is needed to keep runtime lifecycle stable.

Useful? React with 👍 / 👎.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-27 05:02 UTC

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 701435b3a4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ui/desktop/src/goosed.ts
const startupTrace = createStartupDiagnostics(diagnosticsDir, workingDir, null, null);

if (externalGoosed?.enabled && externalGoosed.url) {
const url = externalGoosed.url.replace(/\/$/, '');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Populate diagnostics baseUrl on external backend path

startGoosed computes the external backend URL here but immediately returns without copying it into the startup diagnostics object, which was initialized with baseUrl: null. In external-backend startup failures, the generated goosed-startup-*.json then lacks the endpoint that was actually probed, making it harder to distinguish misconfiguration (wrong host/port) from connectivity issues. Set startupTrace.diagnostics.baseUrl before each external early return.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

---
title: Debug Desktop Startup Failures
sidebar_label: Debug Desktop Startup Failures
description: Find the desktop startup diagnostics log, understand the key fields, and share the right artifacts when Goose fails to start.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use lowercase goose branding in troubleshooting docs

documentation/AGENTS.md requires writing the product name as lowercase goose throughout the documentation tree, but this new page introduces Goose in user-facing text. This violates the documented docs standard and will keep branding inconsistent unless these occurrences are normalized.

Useful? React with 👍 / 👎.

@lifeizhou-ap lifeizhou-ap changed the title added more logging to debug (WIP) chore: lazy init local inference runtime in router instead of app start Apr 21, 2026
@lifeizhou-ap lifeizhou-ap requested a review from jh-block April 21, 2026 06:09
@lifeizhou-ap lifeizhou-ap changed the title chore: lazy init local inference runtime in router instead of app start fix: lazy init local inference runtime in router instead of app start Apr 21, 2026
Copy link
Copy Markdown
Collaborator

@jh-block jh-block left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good mitigation, allowing users for whom local inference fails to start to still use the rest of the app, but I think we should go further and find out why local inference isn't working for them and fix it -- one clue in #8459 is the MSVC version issue. I am happy to take a closer look at that!

Comment thread crates/goose-server/src/state.rs Outdated
#[cfg(feature = "local-inference")]
pub fn get_inference_runtime(&self) -> Arc<InferenceRuntime> {
self.inference_runtime
.get_or_init(InferenceRuntime::get_or_init)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make this fallible (make InferenceRuntime::get_or_init return a Result rather than panicking, and return Result in this function) so that we can gracefully handle problems with starting the inference runtime

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea!

Comment thread ui/desktop/src/startupDiagnostics.ts Outdated
export const createStartupDiagnostics = (
diagnosticsDir: string | undefined,
workingDir: string,
goosedPath: string | null,
Copy link
Copy Markdown
Collaborator

@jh-block jh-block Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only call site of this function passes null for goosedPath and baseUrl and then mutates startupTrace.diagnostics.goosedPath / .baseUrl later. Either drop the parameters, or use them instead of the manual mutation of the StartupTrace

Comment thread ui/desktop/src/startupDiagnostics.ts Outdated
flush: () => void;
}

const STARTUP_TAIL_LIMIT = 40;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be a little higher (80 or even 100), the diag in 8459 was truncated right before the actual crash.

elapsedMs: Date.now() - monotonicStart,
...(details ? { details } : {}),
});
flush();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushing after every line might slow down startup a little, but it's probably not too bad

@lifeizhou-ap lifeizhou-ap enabled auto-merge April 22, 2026 01:23
* main: (41 commits)
  removed the specific code owner for documentation change (#8749)
  fix(providers): handle missing delta field in streaming chunks (#8700)
  refactor(providers): extract http_status module and rename handle_status_openai_compat (#8620)
  fix(providers/openai): accept streaming chunks with both reasoning fields (#8715)
  feat: associate threads with projects (#8745)
  upgrade goose sdk and tui to be compatible with the latest agentclientprotocol/sdk package (#8667)
  feat: extend goose2 context window ux with auto-compaction (#8721)
  improve goose2 agent management flows (#8737)
  alexhancock/tui-improvements (#8736)
  fix: add strict:false to Responses API tools and gpt-5.4 to known models (#8636)
  persist and reliably apply chat model selection (#8734)
  merge goose-acp crate into goose (#8726)
  docs: AGENTS.md section on goose2 desktop backend architecture (#8732)
  feat: goose2 message bubble + action tray (#8720)
  consolidate provider ACP methods onto inventory (#8710)
  ci: declare and enforce MSRV of 1.91.1 (#8670)
  fix(ui): correct grammar in apps view description (#8668) (#8679)
  Stop load openai fast model for openapi compatible custom endpoint (#8644)
  feat(hooks): add Husky git hooks for ui/goose2 (#8577)
  fix: links in chat could not be opened (#8544)
  ...
@lifeizhou-ap lifeizhou-ap marked this pull request as draft April 22, 2026 11:48
auto-merge was automatically disabled April 22, 2026 11:48

Pull request was converted to draft

@lifeizhou-ap lifeizhou-ap reopened this Apr 22, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aed2b17bcb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

extension_loading_tasks: Arc::new(Mutex::new(HashMap::new())),
#[cfg(feature = "local-inference")]
inference_runtime: InferenceRuntime::get_or_init(),
inference_runtime: Arc::new(OnceLock::new()),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep a process-level strong local runtime reference

AppState::new now leaves the local-inference runtime unset until get_inference_runtime() is called, which means the server no longer guarantees a process-lifetime strong Arc<InferenceRuntime>. The normal local provider path (LocalInferenceProvider::from_env) still initializes via the global weak cache and does not populate this lock, so if local-provider sessions are removed/evicted before any local-inference route call, the runtime can be dropped and then re-initialized later; on backends that surface BackendAlreadyInitialized, that path still reaches the unreachable! panic. Keep a long-lived strong reference regardless of which path initializes first.

Useful? React with 👍 / 👎.

.sort((a, b) => b.modifiedMs - a.modifiedMs);

for (const startupLog of startupLogs.slice(STARTUP_LOGS_TO_KEEP)) {
fs.unlinkSync(startupLog.filePath);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard startup-log retention deletes from hard failure

Startup diagnostics cleanup currently uses fs.unlinkSync without error handling, so a single locked/read-only/missing file in logs/startup can throw and abort createStartupDiagnostics before backend startup. Since diagnostics are non-critical, this makes app launch fragile under common filesystem issues (EACCES, EPERM, transient ENOENT). Wrap cleanup filesystem calls in try/catch and continue startup when retention pruning fails.

Useful? React with 👍 / 👎.

@lifeizhou-ap lifeizhou-ap marked this pull request as ready for review April 22, 2026 11:51
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aed2b17bcb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ui/desktop/src/goosed.ts
Comment on lines +316 to +319
if (startupTrace) {
startupTrace.diagnostics.pid = goosedProcess.pid ?? null;
startupTrace.record('spawn_success', { pid: goosedProcess.pid ?? null });
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Record spawn_success only after process actually spawns

spawn_success is recorded immediately after spawn(...) returns, but Node can still emit an error event without ever spawning the child (for example EACCES, ENOENT from loader/runtime deps, or AV quarantine), in which case pid is null and the diagnostics timeline incorrectly claims success before failure. This makes startup traces misleading in exactly the failure cases this logging is meant to debug; emit this marker from the child spawn event (or gate it on a non-null PID) instead.

Useful? React with 👍 / 👎.

* main:
  fix: preprompt would show after loading session (#8744)
  commands to acp+ migration: extensions management (#8733)
  feat: desktop notification when goose finishes a task (#8647)
  harden code review skill for async state and default-resolution bugs (#8740)
  Feature/at agent mention (#8571)
  fix: removed hardcoded dependency of goose-acp-macro (#8753)
  perf: split agent setup into staged phases to reduce startup blocking (#8746)
  Add /skills command (#8600)
  Replace deprecated Claude ACP package links (#8625)
@lifeizhou-ap lifeizhou-ap added this pull request to the merge queue Apr 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to a conflict with the base branch Apr 23, 2026
* main: (29 commits)
  chore(deps): bump winreg from 0.55.0 to 0.56.0 (#8829)
  Fix grammar issue (#8669)
  colorize context window indicator (#8851)
  Refresh canonical model metadata from models.dev (#8838)
  fix(ci): prevent flaky smoke test timeouts from failing the build (#8837)
  updates: release 0.19.0 of the tui/sdk/etc (#8806)
  add a goose2 signed release flow (#8728)
  Port provider tests to typescript (#8237)
  refactor: make ACP server smaller (#8787)
  Add NVIDIA provider, and improve declarative provider UX (#8798)
  fix: removed failed provider test for deprecated providers (#8801)
  fix: only call cleanup when the pr is from same repo (#8799)
  chore: check stale for draft pr (#8803)
  fix: use _meta instead of meta in newSession request (#8796)
  fix: add missing underscore prefix in updateWorkingDir method name (#8743)
  feat: migrate session metadata storage from frontend overlay to backend (#8769)
  Add more info to BUILDING_LINUX (#8789)
  feat(acp): Align to new request patterns of ACP Streamable HTTP/WS transport (#8605)
  Dedupe and organize skills/sources (#8731)
  docs: add skills slash command (#8783)
  ...
@lifeizhou-ap lifeizhou-ap enabled auto-merge April 27, 2026 03:07
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ce5a6573e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ui/desktop/src/goosed.ts
};

goosedProcess.on('exit', (code) => {
goosedProcess.on('exit', (code, signal) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Persist diagnostics after stderr stream fully drains

child_exit is recorded from the child process exit event, but in Node that event can fire before stderr is fully drained. Since stderr updates only append to diagnostics.stderrTail and do not flush to disk, lines that arrive after exit can be missing from goosed-startup-*.json (for example, final crash output emitted during teardown). Record the terminal event on close (or flush when appending stderr) so persisted startup diagnostics contain the complete tail.

Useful? React with 👍 / 👎.

@lifeizhou-ap lifeizhou-ap added this pull request to the merge queue Apr 27, 2026
Merged via the queue into main with commit 8014adf Apr 27, 2026
24 of 25 checks passed
@lifeizhou-ap lifeizhou-ap deleted the lifei/logging-whne-start-error branch April 27, 2026 04:55
securifera added a commit to securifera/goose that referenced this pull request Apr 29, 2026
…ture

Conflict resolutions:
- configuration.rs: kept both mcp_only (fork) and tls_cert_path/tls_key_path (upstream aaif-goose#8400)
- commands/agent.rs: kept fork's AppState::new(tls, settings) signature + added boot_marker calls from upstream aaif-goose#8656
- state.rs: adopted upstream's OnceLock lazy-init for inference_runtime (aaif-goose#8656) + retained settings: Arc<Settings> from fork
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants