Skip to content

sec: hardening sweep — webhook sigs, SSRF, nix injection, secret-proxy, token revocation, egress timeout, sandbox, encryption-key#692

Merged
buremba merged 9 commits into
mainfrom
sec/hardening-sweep
May 13, 2026
Merged

sec: hardening sweep — webhook sigs, SSRF, nix injection, secret-proxy, token revocation, egress timeout, sandbox, encryption-key#692
buremba merged 9 commits into
mainfrom
sec/hardening-sweep

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 13, 2026

Security hardening sweep — landing 8 work-streams on main.

# PR Title
WS1 #678 test(connections): pin webhook signature verification contract
WS2 #681 fix(proxy): harden worker egress proxy against SSRF via IPv6 normalization (+ NAT64 byte-canonical fix)
WS3 #682 fix(orchestrator): block Nix-expression injection in nix-shell -p (+ underscore-allowance)
WS4 #679 fix(secret-proxy): 24h orphan TTL + per-source failed-resolution throttle
WS5 #683 feat(auth): jti-based token revocation kill switch (+ shared-helper enforcement)
WS6 #677 fix(egress-judge): hard per-call timeout that fails closed and feeds the breaker
WS7 #680 fix(sandbox): re-enforce SDK manifest, org-slug guard, allowCrossOrg on __sdk_dispatch host side
WS8 #676 fix(core): harden ENCRYPTION_KEY parsing + configurable worker-token clock skew

Each work-stream PR was reviewed (Codex P1/P2 findings addressed inline on #681/#682/#683 before merging) and has full test coverage. Using a merge commit instead of squash so the 8 work-streams stay as distinct commits on main.

🤖 Generated with Claude Code

buremba added 9 commits May 13, 2026 16:12
…Org on __sdk_dispatch host side (#680)

A malicious guest script can call the __sdk_dispatch global directly with an
un-manifested method, a poisoned orgPath ('__proto__'/'constructor'), or a
cross-org path the manifest never advertised. Re-check all three on the host:

- reject any path not in the manifest's dispatchable set, regardless of mode
  (dry-run only skips writes for modes that have it - it is not an auth gate)
- reject org slugs that are __proto__/constructor/prototype, non-string, or
  when 'org' isn't an own property of the target SDK, before target.org(slug)
- reject any non-empty orgPath when allowCrossOrg is false

Adds packages/server/src/__tests__/unit/sandbox/manifest-enforcement.test.ts.
Inbound chat-platform webhooks (/slack/events, /api/v1/webhooks/:id) are
authenticated by the Chat SDK adapter for each platform, not the HTTP
route. Verified the adapters genuinely perform the documented scheme:

- slack    → HMAC-SHA256 over `v0:{ts}:{rawBody}` vs `x-slack-signature`,
             401 on mismatch (@chat-adapter/slack `verifySignature`)
- telegram → constant-time compare of `x-telegram-bot-api-secret-token`
- discord  → Ed25519 verify of `x-signature-ed25519`/`-timestamp`
- whatsapp → HMAC-SHA256 over the body vs `x-hub-signature-256`
- teams    → Bot Framework bearer-JWT validation in `bridgeAdapter`

Adds a regression test that drives the real Slack adapter — forged
signature → 401, missing signature → 401, stale-but-signed replay → 401,
correctly-signed url_verification challenge → 200 — plus route-level
tests that the /slack/events edge keeps its freshness-window check and
delegates fresh requests to ChatInstanceManager.handleSlackAppWebhook.
Documents where verification happens in slack.ts and handleWebhook().
…the breaker (#677)

The LLM egress judge was awaited synchronously by the HTTP proxy with no
timeout independent of the model client's own transport timeout, so a hung
judge call could stall an outbound request indefinitely. Add a per-call
timeout (default 8s, configurable via EGRESS_JUDGE_TIMEOUT_MS or
EgressJudgeOptions.judgeTimeoutMs) that, on expiry, fails the verdict
closed (deny) and counts as a circuit-breaker failure — so a run of
timeouts opens the breaker and subsequent calls short-circuit without
invoking the model. Verdict cache untouched.
* fix(orchestrator): block Nix-expression injection in nix-shell -p

Skill-declared nixPackages were passed directly to `nix-shell -p <arg>`,
which evaluates each <arg> as a Nix expression. A skill could smuggle a
side-effecting expression — e.g. `pkgs.fetchurl; builtins.exec ...`,
`import ./evil.nix`, or `a && b` — past the previous charset-only
validator and run code at evaluation time, before the worker process
even started.

Replace the raw pass-through with strict validation: each entry must be
either a leaf identifier matching `^[a-z0-9][a-z0-9-]*$` or a
`<known-namespace>.<leaf>` attr path (only known per-language sets like
`python3Packages`, `nodePackages`, …; matches the shapes real skills
already use). Validated names are re-emitted as explicit `pkgs.<...>`
references inside a single `nix-shell -E` expression
(`let pkgs = import <nixpkgs> {}; in pkgs.mkShell { buildInputs = [ … ]; }`),
so nix-shell never parses caller-controlled text.

* fix(orchestrator): allow underscores in Nix package leaf names

NIX_LEAF_RE was /^[a-z0-9][a-z0-9-]*$/ which rejected valid nixpkgs
attributes like `poppler_utils` (used in examples/personal-finance/lobu.toml).
Underscore is a legal character in Nix attribute names and is safe here because
the injection-guard metacharacter regex already blocks ; | & etc., and the
attr is re-emitted as pkgs.<attr> — the raw string is never handed to nix.

Updated regex: /^[a-z0-9_][a-z0-9_-]*$/ (mirrors NIX_ATTR_LEAF_RE which was
already correct for namespace-qualified leaves).

Added positive test cases: poppler_utils, csvtk, cairo_2.
Added negative cases confirming semicolons are still blocked even when
combined with underscore-containing names (pkgs;builtins.exec, foo_bar;builtins.exec).

Fixes Codex P2 finding on PR #682 (sec/3-nix-injection).
…ation (#681)

* fix(proxy): harden worker egress proxy against SSRF via IPv6 normalization

Route every resolved IP and CONNECT/forward target literal through a
single normalizeIpLiteral() funnel before the blocklist check: handles
IPv4-mapped IPv6 (dotted and hex), NAT64 (64:ff9b::/96), compressed
forms, and strips zone IDs; normalized-to-IPv4 results are re-checked
against the IPv4 blocklist, and anything that looks like an IP literal
but won't parse fails closed. Validate the CONNECT port is an integer
in 1..65535. Pin connections to the already-validated address (no
re-resolve) and document that 3xx redirects re-enter the proxy and are
independently re-validated. Adds a vitest regression suite.

* fix(proxy): canonicalize IPv6 before NAT64 prefix check

The previous startsWith("64:ff9b::") check only matched compressed IPv6
spellings, so a fully-expanded form such as `64:ff9b:0:0:0:0:a9fe:a9fe`
(= 169.254.169.254 — AWS metadata endpoint) passed net.isIP() === 6,
missed the NAT64 decode, and fell through to the bare IPv6 blocklist
which does not know about embedded IPv4 addresses.

Fix: add expandIpv6ToHextets() which normalises any valid IPv6 string
into 8 unsigned 16-bit hextets (handling :: compression and dotted-quad
suffixes per RFC 4291 §2.2). The NAT64 check now verifies hextets
[0..5] === [0x0064, 0xff9b, 0, 0, 0, 0] then extracts the embedded IPv4
from hextets [6..7], so both compressed and expanded spellings produce
the same decoded IPv4 and the existing private-IP blocklist catches it.

Identified in Codex review of PR #681.

Tests: adds three new cases covering compressed-form loopback (already
blocked), expanded-form link-local (the regression — now blocked), and
expanded-form public address (still allowed).
* feat(auth): jti-based token revocation kill switch

Worker tokens and settings session cookies are pure cryptographic
material — a leaked token used to be valid until expiry with no way
to invalidate it. Both token types now carry a random `jti` minted at
issue time, and a new `revoked_tokens` Postgres table (mirroring
`grant-store.ts`) maps revoked jtis to their original expiry.

- `generateWorkerToken` mints `jti: randomUUID()`; `WorkerTokenData`
  gains `jti?: string`.
- `setSettingsSessionCookie` mints `jti` when absent. A new local
  `SettingsSession` type widens the decoded payload so the field
  surfaces through verification without changing the shared payload
  interface.
- New `RevokedTokenStore` (Postgres-backed, in-memory TTL cache,
  lazy GC sweep, lazy `CREATE TABLE IF NOT EXISTS`).
- `createApiAuthMiddleware` calls `isRevoked(jti)` after both the
  settings-cookie path and the worker-token path; revoked tokens
  return 401.
- New `lobu token revoke <jti>` CLI subcommand writes the row via
  `DATABASE_URL`. Effective within one cache TTL (~60s).
- Vitest/bun regression test covers store revoke/isRevoked/sweep plus
  the middleware accept/reject paths.

* fix(auth): enforce jti revocation inside verifySettingsSession, not just middleware

Codex P1 finding (PR #683 sec/5-token-revocation): RevokedTokenStore.isRevoked(jti)
was only checked inside createApiAuthMiddleware. Routes that call verifySettingsSession
or verifySettingsSessionOrToken directly bypass that middleware and therefore accepted
revoked cookies until token expiry.

Previously-bypassed routes:
- packages/server/src/gateway/routes/public/agent-history.ts:275
- packages/server/src/gateway/routes/public/agent-config.ts:354,375,409
- packages/server/src/gateway/routes/public/connections.ts:473,508

Fix:
- verifySettingsSession, verifySettingsToken, and verifySettingsSessionOrToken are
  now async and perform the isRevoked check before returning the session payload.
- A setRevokedTokenStore() injector (mirroring setAuthProvider()) allows tests to
  supply a custom store without touching the process-wide singleton.
- The redundant post-check in createApiAuthMiddleware is removed; the middleware
  now delegates to the shared helper for settings session revocation.
- All call sites updated to await the now-async functions.
- Three new tests for verifySettingsSession: live jti passes, revoked jti returns
  null, injected store takes precedence over singleton.
…ttle (#679)

Reduce the placeholder->secret mapping default TTL from ~7 days to 24h
(configurable via SECRET_PLACEHOLDER_TTL_MS) so an orphaned mapping left
behind by a worker crash / mid-day agent deletion stops being live within
a day; teardown cascade and lazy GC are unchanged. Add an in-memory
per-source (bound agentId, else client IP) cap on *failed* placeholder
resolutions: after 20 misses in 5min the source hard-fails and we log the
throttle once instead of per-attempt -- a successful lookup clears the
bucket, so legitimate high-throughput valid traffic is never throttled.
The cross-agent placeholder-binding 403 is untouched.
Resolves conflict in packages/core/src/utils/encryption.ts: keep both the
strict canonical-base64/hex round-trip validation (from #676) and the
cached-key perf optimization (from main).
@buremba buremba merged commit a2db047 into main May 13, 2026
6 of 14 checks passed
@buremba buremba deleted the sec/hardening-sweep branch May 13, 2026 13:39
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 506362482d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +52 to +53
if (workerData.jti && (await revokedTokens.isRevoked(workerData.jti))) {
return c.json({ success: false, error: "Unauthorized" }, 401);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Apply revocation to every worker-token verifier

This revocation check only protects routes that go through createApiAuthMiddleware; the commit also adds jti to all worker tokens, but several production worker-token entry points still call verifyWorkerToken() directly and never consult RevokedTokenStore (e.g. gateway/proxy/http-proxy.ts for outbound proxy access and routes/internal/middleware.ts for internal worker APIs). In those contexts a leaked token whose jti has been inserted into revoked_tokens remains usable until its normal TTL, so the new kill switch is incomplete for the exact worker tokens it claims to revoke. Please move/wrap the revocation check in shared worker-token authentication or update all direct verifier call sites.

Useful? React with 👍 / 👎.

buremba added a commit that referenced this pull request May 13, 2026
…694)

#672 deleted getClientIp from rate-limiter.ts as dead code; #692
(hardening sweep) then added an import + call to it in
secret-proxy.ts. The gateway fails to boot from source with
"SyntaxError: ...does not provide an export named 'getClientIp'",
and the prod bundle would silently call undefined() on the legacy
header-swap path. Restore the helper.

Also fix biome noEmptyBlockStatements lint errors in
command-registry.test.ts (10 mock(async () => {}) bodies, all from
#685) — these blocked the pre-commit hook on main.
buremba added a commit that referenced this pull request May 15, 2026
PR #692 ("sec: hardening sweep") tightened `getEncryptionKey()` to require
strict canonical base64 (`[A-Za-z0-9+/]`). Historical keys generated as
`openssl rand -base64 32 | tr +/ -_` use the URL-safe alphabet (`-_`) and
the validator started rejecting them — even though they decode to the same
32 bytes.

Effect: every code path that calls `encrypt()` / `decrypt()` 500s on prod,
including agent session creation. `lobu chat` against any agent in prod
returns "Internal server error" with this in app-pod logs:

  [worker-auth] Error verifying token: ENCRYPTION_KEY must be a canonical
  base64 or hex encoded 32-byte key.
    at getEncryptionKey
    at encrypt
    at generateWorkerToken
    at /lobu/api/v1/agents

Fix: add a parallel branch that accepts URL-safe base64 (alphabet
`[A-Za-z0-9_-]`, no padding), with the same round-trip + length check so
typo'd keys are still rejected. No key rotation needed.

Test added: 32-byte URL-safe base64 key round-trips encrypt/decrypt.
buremba added a commit that referenced this pull request May 18, 2026
…fix Slack (#881)

Two Slack connections in prod (`slack-lobu-preview`, `1b91933131464c95`) were
stuck in `status=error` since 2026-05-13 14:22 UTC with `Failed to resolve
secret ref for connection X field "botToken"`. The underlying secret rows
were intact in the right org; the failure was a transient boot-time issue
caused by the encryption-key parser regression in #692 (a 43-char unpadded
URL-safe base64 key was rejected for ~2 days until #735 added the urlsafe
branch). But once #735 deployed, the connections did not self-heal — the
boot loop only retried `status=active` rows and there was no other path
that would flip them back.

Three changes to the boot path:

1. `initialize()` now retries `status=error` rows alongside `active` ones.
   Transient deploy-time failures self-heal on the next boot; on success
   the error marker is cleared (under the connection's own org context).
   `stopped` stays operator-driven.

2. The catch block wraps `connectionStore.updateConnection` in
   `orgContext.run(connection.organizationId, ...)`. Boot has no ALS org
   id and the postgres store's `saveConnection()` calls `getOrgId()`
   strict — without the wrap the error-marker write itself throws and the
   row stays `active`, hiding the failure.

3. `startInstance()` always rebinds to the connection's `organizationId`
   instead of only when the caller has no org bound. A caller's org that
   happens to differ from the connection's used to silently win and the
   secret lookup would miss the right tenant bucket. Webhooks resolved
   by team_id reach this path with whatever ALS context the public route
   leaves in place; rebinding unconditionally is the safe default.

Tests: `chat-instance-manager-boot.test.ts` drives three red→green cases
against PGlite — recovery from a previous boot's `error` state, the
error-marking branch persisting the status under the right org, and the
cross-tenant rebind in `startInstance()`. All three failed before the fix
and pass after.

Followup manual SQL flips the existing errored rows back to `active` on
prod once the new image rolls (no migration — this is a data fix tied
to an already-fixed transient regression, not a schema change).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant