fix: four small confirmed findings (token timing, apply provider keys, worker probes, dead AsyncLock)#1066
Conversation
…r path The trusted-worker auth path on /api/workers/* grants full cross-org access but compared the bearer token with 'provided === expected', which short-circuits on the first mismatching byte and leaks the secret's length/prefix via timing. Extract a compareWorkerToken helper that does a length-equality pre-check then crypto.timingSafeEqual, mirroring the smoke route's existing pattern. Add unit tests (valid accepted, wrong same-length rejected, length-mismatch rejected without throwing, missing/ unconfigured token rejected).
Provider keys are pushed via setProviderApiKey but were only reached inside executePlan, which runs after the create/update/delete==0 (and no pending auth) early-return. Provider keys are never represented as plan rows, so a key-only .env change produced an all-noop plan and was a silent no-op. Extract pushProviderApiKeys and call it on every confirmed apply, including the all-noop short-circuit and the pending-auth-only path, without double-pushing. Add apply-cmd tests asserting setProviderApiKey is called for declared providerKeys on otherwise-noop agents.
… PVCs The worker Deployment had no liveness/readiness probe (only a wait-for-app init container) while the app and embeddings deployments have probes. The worker daemon is a poll loop with no HTTP server, so add an exec livenessProbe that confirms PID 1's cmdline is still the daemon (kubelet restarts the container if the loop crashes hard), reusing the chart's healthCheck.livenessProbe timing. Also guard the multi-replica gap: the worker and embeddings cache PVCs are ReadWriteOnce, so replicaCount>1 with cache.enabled would hit a Multi-Attach error. Add a Helm fail() guard on both so the misconfig is caught at template time. Validated via helm template (default renders; replicaCount=2+cache fails; replicaCount=2 with cache disabled renders).
AsyncLock (utils/lock.ts) has a mutual-exclusion race on timeout and has no production callers — the only 'new AsyncLock' reference was a JSDoc @example. Delete it, its test, and the barrel re-export from core's index.ts. Confirmed no non-test imports across the workspace; typecheck stays green.
📝 WalkthroughWalkthroughThis PR introduces Helm configuration validation to prevent multi-attach PVC violations, refactors CLI apply logic to handle provider-key-only changes via a dedicated helper, adds constant-time token comparison to server authentication, and modernizes the core package exports by removing AsyncLock and adding utilities for encryption, networking, session parsing, and worker communication. ChangesHelm Deployment Constraints & Health Monitoring
CLI Provider Key Extraction & Control Flow Refactoring
Server Authentication Hardening & Core Package Modernization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
bug_free 88, simplicity 84, slop 5, bugs 0, 0 blockers Typecheck/unit/integration exit 0. Explored Helm: default template renders worker exec liveness; worker/embeddings replicaCount=2 with cache enabled fail as intended. [env] integration log also contains an embedded Postgres shared-memory failure from another worktree despite exit=0. Full verdict JSON{
"bug_free_confidence": 88,
"bugs": 0,
"slop": 5,
"simplicity": 84,
"blockers": [],
"change_type": "fix",
"behavior_change_risk": "medium",
"tests_adequate": true,
"suggested_fixes": [],
"notes": "Typecheck/unit/integration exit 0. Explored Helm: default template renders worker exec liveness; worker/embeddings replicaCount=2 with cache enabled fail as intended. [env] integration log also contains an embedded Postgres shared-memory failure from another worktree despite exit=0.",
"categories": {
"src": 214,
"tests": 295,
"docs": 0,
"config": 22,
"deps": 0,
"migrations": 0,
"ci": 0,
"generated": 0
}
}Local review gate — branch protection can require the |
The prior commit pushed provider keys before executePlan, but
setProviderApiKey targets /agents/<id>/providers/... — on a first apply the
agent isn't created until executePlan's upsertAgent step, so the key push
404'd ('Agent not found'). Reorder: run executePlan first (creates agents),
then push keys; the all-noop/key-only short-circuit still pushes directly
(no creates there, so agents already exist remotely). Add an ordering
regression test (executePlan-then-keys succeeds; keys-first reproduces the
404 via a recording client that mirrors the server constraint).
Address pi review nits: the JSDoc still said the helper runs before the short-circuit (stale after the executePlan-first reorder); rewrite it to match the actual call sites. Drop the unused third param from the test double's setProviderApiKey (per the no-underscore-prefix rule).
Final review summaryFinal Progression across iterations: bug_free 58 → 86 → 88.
Validation per finding:
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
charts/lobu/templates/worker-deployment.yaml (1)
127-142: 🏗️ Heavy liftConsider whether the probe sufficiently detects unhealthy states.
The exec probe checks if PID 1's cmdline contains "daemon", but this only confirms the process still exists—it won't detect if the daemon is hung, deadlocked, or stuck in a bad state. Since PID 1 exiting stops the container automatically, this probe adds limited value beyond Kubernetes' built-in lifecycle management.
For a daemon poll loop, a more effective liveness check might verify that the daemon is making progress (e.g., periodically updating a timestamp file that the probe validates for freshness, or exposing a Unix socket health endpoint).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@charts/lobu/templates/worker-deployment.yaml` around lines 127 - 142, The current livenessProbe simply greps /proc/1/cmdline ("grep -qa daemon /proc/1/cmdline") which only verifies PID 1 exists; change the probe to verify actual daemon progress by implementing a lightweight liveliness indicator the daemon updates (e.g., touch/mtime on a timestamp file or a small Unix socket/health endpoint) and update the livenessProbe to exec a freshness check against that indicator instead of grepping cmdline; look for livenessProbe, exec, and the existing command entry in this template and replace the command to validate the timestamp/socket freshness (and ensure the daemon process (daemon startup code) writes the timestamp/serves the socket).packages/cli/src/commands/_lib/apply/apply-cmd.ts (1)
1297-1301: 💤 Low valueConsider clarifying the comment to emphasize the first-apply scenario.
The comment correctly explains the ordering constraint, but could be more explicit about when the 404 would occur: specifically on a first apply when creating an agent, since that's when the agent doesn't exist yet. On subsequent applies with noop agents, the short-circuit path at line 1251 handles key pushing directly.
Example revision:
- // Resources FIRST: executePlan does `upsertAgent` for created agents, and - // `setProviderApiKey` targets `/agents/<id>/providers/...` — pushing keys - // before the agent exists 404s on a first apply. So run the plan, then push - // keys. (The all-noop / key-only short-circuit above pushes keys directly: - // there are no agent creates there, so the agents already exist remotely.) + // Resources FIRST: on a first apply with agent creation, executePlan runs + // `upsertAgent`, and the subsequent `setProviderApiKey` call targets + // `/agents/<id>/providers/...` — pushing keys before the agent exists would + // 404. So run the plan first, then push keys. (The all-noop / key-only + // short-circuit above pushes keys directly: those agents already exist + // remotely since there's no create in the plan.)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/cli/src/commands/_lib/apply/apply-cmd.ts` around lines 1297 - 1301, Update the comment near executePlan in apply-cmd.ts to explicitly state that the 404 occurs on a first apply when agents are being created remotely: mention that executePlan does upsertAgent and setProviderApiKey targets /agents/<id>/providers/..., so pushing keys before agent creation will 404 on first apply because the agent does not yet exist, while subsequent applies (or the all-noop / key-only short-circuit at the earlier branch) avoid this by pushing keys after agents already exist; reference executePlan, upsertAgent, setProviderApiKey and the all-noop / key-only short-circuit to make the scenario clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@charts/lobu/templates/worker-deployment.yaml`:
- Around line 127-142: The current livenessProbe simply greps /proc/1/cmdline
("grep -qa daemon /proc/1/cmdline") which only verifies PID 1 exists; change the
probe to verify actual daemon progress by implementing a lightweight liveliness
indicator the daemon updates (e.g., touch/mtime on a timestamp file or a small
Unix socket/health endpoint) and update the livenessProbe to exec a freshness
check against that indicator instead of grepping cmdline; look for
livenessProbe, exec, and the existing command entry in this template and replace
the command to validate the timestamp/socket freshness (and ensure the daemon
process (daemon startup code) writes the timestamp/serves the socket).
In `@packages/cli/src/commands/_lib/apply/apply-cmd.ts`:
- Around line 1297-1301: Update the comment near executePlan in apply-cmd.ts to
explicitly state that the 404 occurs on a first apply when agents are being
created remotely: mention that executePlan does upsertAgent and
setProviderApiKey targets /agents/<id>/providers/..., so pushing keys before
agent creation will 404 on first apply because the agent does not yet exist,
while subsequent applies (or the all-noop / key-only short-circuit at the
earlier branch) avoid this by pushing keys after agents already exist; reference
executePlan, upsertAgent, setProviderApiKey and the all-noop / key-only
short-circuit to make the scenario clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: f1a67df1-a152-4364-a134-01be782d68e9
📒 Files selected for processing (10)
charts/lobu/templates/embeddings-deployment.yamlcharts/lobu/templates/worker-deployment.yamlpackages/cli/src/commands/_lib/apply/__tests__/apply-cmd.test.tspackages/cli/src/commands/_lib/apply/apply-cmd.tspackages/core/src/__tests__/utils-lock.test.tspackages/core/src/index.tspackages/core/src/utils/lock.tspackages/server/src/auth/__tests__/worker-token.test.tspackages/server/src/auth/worker-token.tspackages/server/src/index.ts
💤 Files with no reviewable changes (3)
- packages/core/src/tests/utils-lock.test.ts
- packages/core/src/utils/lock.ts
- packages/core/src/index.ts
Final status — ready for reviewAll 4 findings fixed; suites green (typecheck/unit/integration = 0); chart validated with
pi verdict: bug_free 88, 0 bugs, 0 blockers, no suggested fixes. A fresh re-roll toward >90 is blocked by the shared Codex review quota (resets ~4.2h); 88 is model confidence on a clean medium-risk change, not a defect. Diff: 10 files, +307/-224. Not merged — left for you to review and merge. |
Four small, well-scoped confirmed findings, one logical commit each. All validated locally; draft pending
make reviewpi verdict.#5 (MED, security) — constant-time
WORKER_API_TOKENcomparepackages/server/src/index.tstrusted-worker path (/api/workers/*, full cross-org access) compared the bearer token withprovided === expected— early-exits on first mismatching byte, leaking secret length/prefix via timing. The sibling smoke route already usescrypto.timingSafeEqual.Fix: extract
compareWorkerToken(newpackages/server/src/auth/worker-token.ts) doing a length-equality pre-check thentimingSafeEqual(try/catch so a length mismatch never throws), and call it from the middleware.Evidence: new
worker-token.test.ts— 5 tests pass (valid accepted, wrong same-length rejected, length-mismatch rejected without throwing, missing/unconfigured token rejected).bunx tsc --noEmitclean.#11 (MED) — provider API keys silently dropped on a noop-only apply
apply-cmd.tsearly-returned ("Nothing to apply.") when create/update/delete counts were all 0 and no pending auth — BEFOREexecutePlan, which is the only place provider keys were pushed (setProviderApiKey). Provider keys are never plan rows, so a key-only.envchange was a silent no-op.Fix: extract
pushProviderApiKeys(client, agents)and call it on every confirmed apply — in the all-noop short-circuit, the pending-auth-only path, and the resource-work path — without double-pushing (removed the in-executePlanloop). Dry-run still skips (returns above).Evidence: extended
apply-cmd.test.ts— assertssetProviderApiKeycalled once per declared key for otherwise-noop agents, and not called when none declared. 13/13 pass.bunx tsc --noEmitclean.#13 (MED, ops) — worker Deployment had no probes + unguarded multi-replica RWO cache
charts/lobu/templates/worker-deployment.yamlhad only await-for-appinit container, no liveness/readiness probe (app + embeddings have probes). Worker cache + embeddings cache PVCs are ReadWriteOnce with no guard againstreplicaCount>1.Fix:
livenessProbeconfirming PID 1's cmdline is still the daemon (grep -qa daemon /proc/1/cmdline), reusinghealthCheck.livenessProbetiming. (No readiness probe — no readiness signal exists; the wait-for-app init container already gates startup.)fail()guard on worker AND embeddings forreplicaCount>1 && cache.enabled(Multi-Attach on RWO PVC), caught at template time.Evidence (
helm template):worker.replicaCount=2→Error: ... worker.replicaCount > 1 requires worker.cache.enabled=false ...worker.replicaCount=2 --set worker.cache.enabled=false→ renders, replicas:2 + livenessProbehelm lintpasses. Prod opts intoapp.replicaCount:2(no cache PVC) — worker/embeddings default to 1, unaffected.#16 (LOW, dead code) — delete
AsyncLockpackages/core/src/utils/lock.tsAsyncLockhas a mutual-exclusion race on timeout and no production callers (onlynew AsyncLockref was a JSDoc@example).Fix: delete
lock.ts, its test, and the barrel re-export incore/src/index.ts. Confirmed no non-test imports across the workspace.Evidence: core
bun test592 pass / 0 fail;bunx tsc --noEmitclean.Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Improvements