Skip to content

fix: bug-fix sweep#673

Merged
buremba merged 3 commits into
mainfrom
sweep/bugs
May 13, 2026
Merged

fix: bug-fix sweep#673
buremba merged 3 commits into
mainfrom
sweep/bugs

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 13, 2026

Bug-fix sweep from 15 parallel hunter reports. Security fixes first, then high-confidence contained correctness fixes. Not for auto-merge — needs careful human review of the auth / sandbox changes.

Security

Severity Fix File
CRITICAL sandbox action-key smuggling: createActionCaller spread caller input before forcing action, letting a read-only query_sdk script set action: "delete" and reach write/delete admin handlers. Now spreads input first, strips any caller action, forces the discriminator; plus a defensive read-mode access !== "read" reject in __sdk_dispatch. Regression test added. packages/server/src/sandbox/namespaces/action-call.ts, packages/server/src/sandbox/run-script.ts
HIGH MCP-proxy SSRF: assertSafeUrl only ran in probeMcpServer; discoverTools/callTool/sendRequest fetched config.upstream_url unchecked. Now validated in sendRequest on every outbound fetch. packages/server/src/mcp-proxy/client.ts
HIGH OAuth profile:read silent auto-approval now gated to redirect URIs on the canonical origin (getConfiguredPublicOrigin()) — DCR-registered third-party clients fall through to the consent page. DCR registration rate-limiter is on by default (RATE_LIMIT_ENABLED !== 'false') and fails closed (503) instead of fail-open. packages/server/src/auth/oauth/routes.ts
HIGH secret-proxy: reject (401) instead of warn when a request names an agent in the URL but carries no auth header — was forwarding upstream using the named agent's credential. packages/server/src/gateway/proxy/secret-proxy.ts
MEDIUM cross-org tool-list bleed: MCP tool-discovery cache keyed by bare connectorKey → org B could get org A's cached catalog. Now keyed ${orgId}:${connectorKey}. packages/server/src/mcp-proxy/client.ts
MEDIUM getClientIP trusted X-Forwarded-For unconditionally → IP rate-limit bypass. Now only trusts forwarded headers behind TRUSTED_PROXY, taking the rightmost trusted hop; otherwise 'unknown' (coarse bucket). packages/server/src/utils/rate-limiter.ts
MEDIUM getLobuServiceToken() with no org argument minted a token acting as a random tenant's admin. Now fails closed without an organizationId. packages/server/src/lobu/service-token.ts
LOW (defense-in-depth) AgentConnectionStore.deleteConnection / listChannelBindings / deleteAllChannelBindings now org-scoped. insertEvent on-conflict origin probe now filters on organization_id (prevents cross-tenant supersede on append-only events). packages/server/src/lobu/stores/postgres-stores.ts, packages/server/src/utils/insert-event.ts

Other fixes

core

  • sanitizeForLogging cycle guard (WeakSet"[Circular]") — was a stack overflow on any circular log object. + regression test.
  • verifyWorkerToken rejects far-future timestamps (was a one-directional skew check).
  • retryWithBackoff isolates throwing shouldRetry/onRetry callbacks so they can't mask the real error.

agent-worker

  • custom-command execFile gets a 120s timeout + killSignal: SIGKILL; signal-killed children now report a non-zero exit code (was reported as 0/success).
  • heartbeat-failure abort resolves the in-flight prompt turn before session.dispose() — was a permanent hang exactly when the gateway is already in trouble.
  • worker process throws/exits when SSE reconnects are exhausted instead of logging "started successfully" and running forever as a zombie.
  • 60s / 120s AbortSignal.timeout on gateway and MCP-tool fetch calls.

embeddings

  • getExtractor() no longer caches a rejected model-load promise (one transient failure bricked the backend until restart).

connectors

  • Google Calendar: paginate until nextPageToken is exhausted so the trailing nextSyncToken is captured — incremental sync now actually engages (was re-emitting the full ±1y window every run on busy calendars).
  • Gmail: after:<unix-seconds> instead of after:YYYY/MM/DD — no more re-emitting the whole current day on every sync.
  • Google Play parseDate computed numerically (s*1000 + ms) — string concat produced 1970 / year-7340 dates.

connector-sdk

  • close the launched Playwright browser / CDP WebSocket (+ Target.closeTarget) on a partial-acquire failure (newContext/addCookies/newPage/CDP-setup throw) — was leaking Chrome processes / sockets on long-lived worker hosts.
  • CdpPage.goto removes its message listener on navigation timeout (was accumulating handlers on long-lived sessions).

connector-worker

  • daemon no longer auto-completes a run_type='watcher' run as success — logs and skips instead, so a stray watcher run can't be stomped (the server-side poll allowlist already excludes it; this is the defensive backstop).

server

  • watcher next_run_at advances on terminal failure / materialize error (mirrors the feeds model) — stops a permanently-broken watcher re-dispatching a fresh agent run every 60s forever.
  • deliverToBotConnections called once per createNotificationForUsers instead of once per admin — was posting the same Slack/Telegram notification N times in a multi-admin org.
  • embedded scaleDeployment(name, 1) throws when the worker process is gone (instead of silent no-op) so MessageConsumer/createWorkerDeployment re-create it and drain the already-queued message.
  • insertEventupsertEmbedding uses the transaction-bound sql handle (was grabbing the singleton pool, breaking atomicity / FK for any transactional caller with an embedding).
  • dotenv.config() loaded at the top of ./instrument so Sentry isn't silently disabled when SENTRY_DSN lives in .env.
  • test-db.ts fixSchemaConstraints includes the 'task' run type (was re-narrowing runs_run_type_check and dropping it between tests).

openclaw-plugin

  • autoCapture moved from before_prompt_build (silently dropped by the worker's plugin-loader) to agent_end, and now pairs the answer with the question that actually prompted it (last assistant message → preceding user message) instead of a mismatched Q/A pair.

Deferred (higher-risk / low-confidence — needs a dedicated PR)

  • packages/core/src/utils/lock.tsAsyncLock.acquire releases the new lock on acquisition timeout → concurrent critical sections. Fix needs a careful restructure of the lock-chain wiring.
  • packages/server/src/server.ts + start-local.ts — graceful shutdown doesn't await httpServer.close() and tears down DB/gateway before the listener stops; no unhandledRejection/uncaughtException handlers. Shutdown rework, out of scope here.
  • packages/server/src/gateway/orchestration/impl/embedded-deployment.ts — crashed worker still silently drops the in-flight message (no re-queue / no user-facing error). The scaleDeployment fix here covers recovery on the next message, but immediate re-spawn / job-fail-on-unexpected-exit needs plumbing the deployment manager → message consumer.
  • packages/server/src/gateway/orchestration/base-deployment-manager.ts — reconcile can scale a worker to 0 right after a message was queued for it (updateDeploymentActivity bumps lastActivity last). Needs activity bump at enqueue time.
  • packages/server/src/connect/routes.ts/connect/:token/validate activates feeds (status='active', next_run_at=NOW()) before validation completes → cron can run an unvalidated connection in parallel. + concurrent /oauth/start clobbers the PKCE verifier; userinfo/actor failures fall through to a 'connect-flow' sentinel owner.
  • packages/server/src/watchers/automation.tsreconcileWatcherRuns marks a windowed run completed even if the agent turn is still in flight (complete_window mid-turn). + dispatchWatcherRun claim→HTTP→update with no transaction.
  • packages/server/src/lobu/stores/postgres-stores.ts + gateway/channels/binding-service.ts — Telegram-DM channel bindings collide across orgs (key is (platform, channel_id, team_id) with team_id NULL and channel_id=user id); /lobu try can clobber the user's own bound agent and resolveAgent doesn't check the binding's agent belongs to the connection's org. Needs a keyspace/schema change.
  • packages/server/src/db/embedded-schema-patches.ts — omits the device_worker_connection_binding backfill UPDATEs present in the dbmate migration; embedded installs with pre-existing device_workers rows diverge.
  • packages/connector-sdk/src/retry.tswithHttpRetry mis-classifies errors via substring matching ('500' matches ids, 'invalid' matches transient 5xx bodies). Wants structured status on the thrown error.
  • CLI SSE parsers (packages/cli/src/eval/client.ts, commands/chat.ts) — assume \n framing; break on CRLF / multi-line data:. Robustness, deferred.
  • connectors (google_gmail.ts, youtube.ts) — bare catch {} in per-item sync loops swallow 401/429 → "success" with items_found: 0. YouTube also persists an expiring search pageToken as its checkpoint.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 13, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 38.00000% with 31 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
packages/cli/src/commands/eval.ts 0.00% 15 Missing ⚠️
packages/core/src/utils/retry.ts 47.82% 12 Missing ⚠️
packages/cli/src/commands/chat.ts 0.00% 2 Missing ⚠️
packages/core/src/worker/auth.ts 50.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dfdd3e00e7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Must keep paginating until pageToken is exhausted, otherwise the sync
// token is never obtained and every subsequent sync re-runs the full
// window from scratch. maxResults is only a soft cap on stored events.
if (!pageToken) break;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce max_results while paginating calendar sync

For calendars with more than max_results events, this loop now keeps paginating until Google has no nextPageToken and still pushes every returned event, so the documented cap (“Maximum events to fetch per sync”) is no longer enforced. Once events.length >= maxResults, the request size drops to 1 and the connector can crawl the rest of a busy calendar one event at a time, ingesting far more rows than configured; keep paging for the sync token, but stop appending/returning events after the cap.

Useful? React with 👍 / 👎.

const xreal = request.headers.get('X-Real-IP');
if (xreal) return xreal.trim();
}
return 'unknown';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve per-client rate-limit keys when proxy trust is unset

When TRUSTED_PROXY is not explicitly set, this now always returns the literal unknown, so every caller of the unauthenticated rate-limited endpoints shares the same bucket (for example invitation preview, public-org join, and OAuth registration, which this commit also enables by default). In deployments that have not added this new env var, 10 OAuth client registrations or 5 invitation previews from any users globally can throttle everyone else; either retain a request-specific fallback or make the trusted-proxy requirement part of the deployment config before switching these endpoints on.

Useful? React with 👍 / 👎.

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 13, 2026

Addressed both Codex P2s in 404d31be:

  • google_calendar.ts — always request a full 250-event page; maxResults is now applied only to stored events (if (events.length >= maxResults) break inside the item loop) while pagination still continues to the last page so the sync token is obtained. No more 1-event-per-round-trip crawl past the cap.
  • rate-limiter.ts / getClientIP — without TRUSTED_PROXY we now key on the leftmost X-Forwarded-For entry as a best-effort fallback instead of a shared 'unknown' bucket. Spoofable, but that only lets an attacker evade their own limit (pre-existing behavior) rather than letting one client throttle everyone on the public endpoints. With TRUSTED_PROXY=true the abuse-resistant rightmost-hop behavior is unchanged.

make build-packages + bun run typecheck + bun test packages/connectors green.

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 13, 2026

Rebased onto current main (now includes #672), moved sandbox/namespaces/__tests__/action-call.test.ts__tests__/unit/sandbox/action-call.test.ts so it runs under the bun test ... __tests__/unit job instead of being picked up by vitest's glob (the bun:test import doesn't resolve under vitest). Also dropped the unused GatewayConfig import + 4 InvalidationEvent.resource: call sites left dangling on main by #672 so standalone server typecheck passes.

@buremba buremba force-pushed the sweep/bugs branch 2 times, most recently from 07cb728 to 5b97b5b Compare May 13, 2026 05:12
buremba added 3 commits May 13, 2026 06:16
…ak/timeout/race fixes

Security:
- CRITICAL: sandbox action-key smuggling — createActionCaller spread caller
  input before forcing `action`, so a read-only query_sdk script could set
  `action: "delete"` and reach write/delete admin handlers. Now spreads input
  first then forces the discriminator (and strips any caller `action`); added a
  defensive read-mode check in __sdk_dispatch and a regression test.
- HIGH: MCP-proxy SSRF — assertSafeUrl only ran in probeMcpServer; discoverTools
  /callTool/sendRequest fetched config.upstream_url unchecked. Now validated in
  sendRequest on every outbound fetch.
- HIGH: OAuth profile:read auto-approval gated to first-party (canonical-origin)
  redirect URIs only; DCR registration rate-limiter on by default + fails closed.
- secret-proxy: reject (don't warn) unauthenticated requests that name an agent.
- cross-org tool-list bleed — MCP tool-discovery cache keyed by orgId:connectorKey.
- getClientIP only trusts X-Forwarded-For / CF-Connecting-IP / X-Real-IP behind
  TRUSTED_PROXY; takes the rightmost trusted hop.
- getLobuServiceToken fails closed without an organizationId.
- AgentConnectionStore.deleteConnection / channel-binding list+deleteAll org-scoped.
- insertEvent onConflict probe now org-scoped (defense against cross-tenant supersede).

Other fixes:
- core: sanitizeForLogging cycle guard (WeakSet → "[Circular]"); verifyWorkerToken
  rejects far-future timestamps; retryWithBackoff isolates throwing shouldRetry/onRetry.
- agent-worker: custom-command execFile gets a 120s timeout + correct signal-kill
  exit code; heartbeat-failure abort resolves the in-flight turn before dispose();
  worker exits when SSE reconnects are exhausted (no zombie); 60–120s timeouts on
  gateway/MCP fetches.
- embeddings: getExtractor no longer caches a rejected model-load promise.
- connectors: Google Calendar paginates to the trailing nextSyncToken (incremental
  sync now engages); Gmail uses epoch-second `after:` (no per-day re-emit); Google
  Play parseDate computed numerically.
- connector-sdk: close the launched browser / CDP socket on partial-acquire
  failure; remove the goto timeout listener leak.
- connector-worker: never auto-complete a watcher run as success in the daemon.
- server: watcher next_run_at advances on terminal failure / materialize error
  (no per-minute retry storm); deliverToBotConnections called once per
  createNotificationForUsers (no per-admin duplicate sends); scaleDeployment(name,1)
  throws when the worker is gone so the consumer re-creates it; insert-event
  upsertEmbedding uses the txn handle; dotenv loaded before ./instrument so Sentry
  isn't silently disabled; test-db fixSchemaConstraints includes the 'task' run type.
- openclaw-plugin: autoCapture moved to the agent_end hook (the worker drops
  before_prompt_build) and pairs the answer with the question that prompted it.
… keep per-client rate-limit key without TRUSTED_PROXY
…lone typecheck (drop dead InvalidationEvent.resource sites + unused import left by #672)
@buremba buremba merged commit 7c1500b into main May 13, 2026
15 checks passed
@buremba buremba deleted the sweep/bugs branch May 13, 2026 05:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants