fix(backend): close concurrent-apply races on agent + connection upsert#466
Conversation
|
Triage decision: Reasons:
Next: Manual merge required after final review due to size threshold. PR is otherwise well-formed with good test coverage for the concurrent race fix. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 32c0b340e4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // We lost the race — someone else inserted the row between our read and | ||
| // our INSERT. Re-read and fall through to the update path so concurrent | ||
| // PUTs with the same config converge on a single noop response. | ||
| const reread = await connectionStore.getConnection(stableId); | ||
| if (!reread) { |
There was a problem hiding this comment.
Block loser-path updates until create outcome is known
When a concurrent PUT loses the ON CONFLICT DO NOTHING claim, this branch immediately re-reads and falls into the normal update path against the placeholder row. In that path, chatManager.updateConnection() can return 200 without running create-time startup/validation (because the placeholder is stopped), so a malformed or non-startable config can be reported as success while the winning request later fails and deletes the row. This creates false-positive convergence for concurrent applies on new stable IDs.
Useful? React with 👍 / 👎.
POST /agents was check-then-write: two concurrent applies in the same org could both pass the existence check, both run saveMetadata, and both run the Owletto MCP auto-injection — clobbering operator-set mcpServers. Use INSERT ... ON CONFLICT (id) DO NOTHING RETURNING so only the actual inserter runs the auto-injection; losers fall through to the idempotent 200 path. PUT /:agentId/connections/by-stable-id/:stableId had the same race in its create branch: two concurrent PUTs both observed `existing===null`, both called chatManager.addConnection (or the fallback persist) with the same stableId, and the second clobbered the first via ON CONFLICT DO UPDATE plus spawned a duplicate platform instance. Atomically claim the row with INSERT ... ON CONFLICT DO NOTHING; on conflict, re-read and fall through to the existing update path (which keeps the hardening branch's settingsChanged + allowGroups defaults intact) so concurrent identical PUTs converge on a single noop or update response.
32c0b34 to
de85b27
Compare
|
Triage decision: Reasons:
Next: Human review required for P1-flagged concurrency issue. Bot identified potential false-positive convergence risk in loser-path updates that needs design review. |
Pi flagged two follow-ups on PR #466: 1. POST /agents — INSERT and saveSettings(mcpServers) ran as two writes, so a concurrent loser returning 200 in the idempotent branch could PATCH operator config in between, only for the winner's deferred saveSettings to clobber it. Fold mcp_servers into the same INSERT so row + auto-injection land atomically. 2. PUT /:agentId/platforms/by-stable-id/:stableId — the PR-466 atomic-claim works for the no-manager fallback, but with the real ChatInstanceManager a "loser" could re-read the just-created row mid-addConnection and call updateConnection against half-initialized state (potentially double-spawning the chat instance). Add an in-process per-stableId Promise chain (primary serialization for embedded-only deployment) plus a short-lived pg_advisory_xact_lock keyed on (NAMESPACE, hashStableId(stableId)) for cross-process defense-in-depth and pg_locks visibility. Tests: 20-iteration concurrent POST asserts mcpServers.owletto.url is populated regardless of which handler ran the INSERT; concurrent PUT with a counting mock manager asserts addConnection is called exactly once.
Pi flagged two follow-ups on PR #466: 1. POST /agents — INSERT and saveSettings(mcpServers) ran as two writes, so a concurrent loser returning 200 in the idempotent branch could PATCH operator config in between, only for the winner's deferred saveSettings to clobber it. Fold mcp_servers into the same INSERT so row + auto-injection land atomically. 2. PUT /:agentId/platforms/by-stable-id/:stableId — the PR-466 atomic-claim works for the no-manager fallback, but with the real ChatInstanceManager a "loser" could re-read the just-created row mid-addConnection and call updateConnection against half-initialized state (potentially double-spawning the chat instance). Add an in-process per-stableId Promise chain (primary serialization for embedded-only deployment) plus a short-lived pg_advisory_xact_lock keyed on (NAMESPACE, hashStableId(stableId)) for cross-process defense-in-depth and pg_locks visibility. Tests: 20-iteration concurrent POST asserts mcpServers.owletto.url is populated regardless of which handler ran the INSERT; concurrent PUT with a counting mock manager asserts addConnection is called exactly once.
* refactor(cli): merge owletto bin into lobu memory namespace
Consolidates the standalone `owletto` CLI into `@lobu/cli` as the
`lobu memory` namespace and deletes the entire `packages/owletto-cli`
package. The merge:
- Drops the `owletto` binary entirely (no compat wrapper).
- Drops `owletto dev` (docker-compose, dead — there's no compose file).
- Drops `owletto start` and `owletto version` (lobu run + lobu --version
cover them; one boot path: the embedded backend bundle).
- Folds 13 commands into `lobu memory <verb>`: seed, run, login, token,
health, configure, browser-auth, skills list/add, init, org current/set.
- Adds top-level `lobu doctor` (system deps + memory MCP reachability;
folds owletto's `doctor` and `health` checks).
- `lobu init` now wires the memory plugin config inline at the end of
scaffold when memory is enabled (no separate `owletto init` step).
Code moves:
- packages/owletto-cli/src/lib/* → packages/cli/src/commands/memory/_lib/
- packages/owletto-cli/src/commands/{openclaw,seed,browser-auth}.ts →
_lib/*-cmd.ts (citty wrappers stripped, citty defineCommand exports
replaced by plain async functions consumed by thin commander handlers)
- skills/owletto and skills/owletto-openclaw now bundle through cli's
build.cjs alongside skills/lobu (one bundled-skills directory).
Build / CI / release-please cleanup:
- Drop packages/owletto-cli from workspace globs, tsconfig excludes,
biome ignores, release-please-config.json, .release-please-manifest,
publish-packages.mjs, bump-version.mjs, knip workspace overrides.
- ci.yml: stop running `bun test packages/owletto-cli` and stop
typechecking it in the matrix.
- packages/cli/package.json: pull in @clack/prompts, @lobu/owletto-sdk,
playwright (formerly transitive via owletto-cli).
- README + landing site (rename reference/owletto-cli.md →
reference/lobu-memory.md, update astro nav, sidebar links, internal
anchors) + skills/owletto/references: replace `npx owletto@latest`
with `npx @lobu/cli@latest memory`.
- docs/RELEASING.md: drop the unscoped `owletto` package; six → five.
- src/index.ts (lobu admin) test that scans owletto-cli for callTool
refs now points at packages/cli/src/commands/memory/.
Validation:
- make build-packages clean.
- bun run typecheck clean.
- bun run check (biome) clean across 281 files.
- bun test packages/owletto-backend/src/auth/__tests__/tool-access.test.ts:
47 pass (the cli-source walker correctly resolves the new memory dir).
- lobu --help, lobu memory --help, lobu memory token --help,
lobu memory init --help all render the new shape.
* fix(cli): consolidate memory auth and skills into lobu
* refactor(cli): drop dead memory-token helpers and restore doctor server ping
- Remove unused `printMemoryToken`, `TokenOptions`, and
`resolveConfiguredOrgUrl` from memory/_lib/openclaw-cmd.ts left
behind after the auth consolidation moved memory token retrieval to
top-level `lobu token`.
- Restore the server-URL connectivity check in `lobu doctor` (default
mode) that used to ship as part of `owletto doctor`. `--memory-only`
still shortcuts to authenticated MCP validation.
- Update plan doc references from `lobu memory token` to the actual
unified surface `lobu token --raw`.
* feat(backend): persist egressConfig + preApprovedTools + guardrails on agents (#460)
* docs(plans): lobu apply — file → cloud converger plan
Three-PR rollout for lobu apply, the file-first → cloud converger.
Reuse-first design: existing /agents and /manage_entity_schema routes
are mostly idempotent already; only one upsert route is added. CLI
loops over them in dependency order. Re-running converges on partial
failure.
Defers: --prune, --force, lobu pull, watchers, raw SKILL.md round-trip,
org-scoped agent IDs, secret upload, memory data — all v2/v3.
* feat(backend): persist egressConfig + preApprovedTools + guardrails on agents
The file-loader produces these three settings fields from lobu.toml but
the agents table had no columns for them, so postgres-stores.ts silently
dropped them on every saveSettings(). This blocks `lobu apply`: cloud
would never accept the values local agents already have, leaving every
re-run as drift.
Adds the three jsonb columns (with empty defaults) and round-trips them
through rowToSettings / saveSettings / deleteSettings. New unit tests
cover populated, absent, and post-delete shapes against the test DB.
PR-1 of docs/plans/lobu-apply.md.
* feat(backend): idempotent agent create + stable-id connection upsert (#461)
PR-2 of `docs/plans/lobu-apply.md`. Two narrow extensions to
`lobu/agent-routes.ts` so `lobu apply` can converge by re-running:
1. `POST /` with an agent that already exists in the same org now
returns 200 with the existing payload instead of 409. Cross-org
collision still returns 409. The Owletto-MCP auto-injection only
runs on the actual create path, never on the idempotent return —
so re-applying never clobbers operator-edited `mcpServers`.
2. New `PUT /:agentId/connections/by-stable-id/:stableId` for
deterministic-ID upsert. Reuses `ChatInstanceManager.addConnection`
(which already accepts a caller-supplied `stableId`) for create,
`ChatInstanceManager.updateConnection` for change. Returns
`{ noop: true, connection }` when submitted config matches existing,
`{ updated: true, willRestart: true, connection }` when it changes,
201 with the connection on first create.
Tests cover the 200/409 shapes, the no-clobber MCP guarantee, the
noop/updated/create branches of the upsert, and the unknown-agent 404.
Test harness mocks `mcpAuth` + `getChatInstanceManager` so the routes
exercise the no-manager fallback path against a pglite Postgres.
* feat(cli): lobu apply — files → org converger (#462)
* feat(cli): lobu apply — files → org converger
PR-3 of docs/plans/lobu-apply.md. Implements `lobu apply`:
- Parses lobu.toml via existing cli/src/config/loader.ts:loadConfig
- Walks $VAR refs to enforce required-secrets check before any mutation
- Computes per-resource diff client-side (create/update/noop/drift)
- Loops over existing CRUD endpoints in dependency order:
agents → settings → connections → entity types → relationship types
- Re-runs converge on partial failure (every endpoint is idempotent)
CLI surface:
lobu apply [--dry-run] [--yes] [--only agents|memory] [--org <slug>]
Reuses _lib/openclaw-auth.ts (PR #459) for token + URL resolution.
Inlines buildStableConnectionId from owletto-backend's file-loader.ts:56
with a sync-comment — keep matching when either side changes.
Snapshot tests cover diff rendering for create/update/noop/drift and the
will-restart warning. End-to-end test runs once PR-1 + PR-2 land.
* fix(cli): apply — substitute $VAR refs and align connection wire shape
Two showstopper bugs caught by pi review before integration test:
1. Connection upsert was sending `type` but PR-2's `PUT /by-stable-id`
route expects `platform`. Server would 400 on every connection apply.
Aligns the client and apply-cmd payloads to `{ platform, name?, config }`.
2. `$VAR` references in connection configs were collected for the
secrets check but never substituted. Apply would send literal
`"botToken": "$TELEGRAM_BOT_TOKEN"` to the server. Now resolves
against `process.env` (or an injected env for tests) and fails
loudly with the var name when missing or empty.
Adds `env` option to `loadDesiredState` for testability. Test coverage
extended: existing test asserts the resolved value is in the desired
config; new test asserts unset $VAR throws a clear error.
Validation: typecheck, biome, 27 cli tests — all clean.
* docs(plans): lobu secrets push — security-conscious design (#465)
* fix(backend): align connection-config encrypt/decrypt prefix (#464)
encryptConfig() in postgres-stores.ts wrote raw `iv:tag:ciphertext`
output from @lobu/core's `encrypt()`, but decryptConfig() only
decrypts strings prefixed with `enc:v1:`. So any secret-named field
that ran through encryptConfig was stored as prefixless ciphertext
and round-tripped as that ciphertext literal on read.
Production was unaffected because ChatInstanceManager normalizes
secrets to `secret://` refs before persisting to chat_connections.
The agent_connections.config rows written by persistConnectionSnapshot
on the older lobu agent-routes paths (and the no-manager fallback added
in PR #461) are the affected surface.
Fix:
- encryptConfig prepends `enc:v1:` to the AES-GCM output.
- decryptConfig strips the prefix before delegating to `decrypt()`.
- Migration 20260430022231 backfills agent_connections.config rows
whose values match the prefixless `iv:tag:ciphertext` shape (24 hex
IV + 32 hex tag + hex ciphertext) by re-prefixing them. Idempotent —
re-running is a noop because the regex skips already-prefixed values.
Adds round-trip + boundary tests in
packages/owletto-backend/src/lobu/stores/__tests__/postgres-stores.test.ts.
* feat: lobu apply hardening + e2e harness (#468)
Three things on top of the merged lobu apply v1:
Bug A — PUT /agents/:id/connections/by-stable-id/:stableId now folds
`settings` (allowFrom, allowGroups, …) into the noop comparison so
settings-only edits trip willRestart instead of being silently
noop'd. Symmetric default `allowGroups: true` on the create-fallback
path keeps follow-up PUTs round-tripping as noop.
Bug B — `diff.canonical` no longer collapses `[]` and `{}` to null;
clearing a remote allowlist by setting it to `[]` now produces an
`update` action. The `platform` key the route injects into stored
connection config is stripped before comparing so an unchanged
connection round-trips as `=`.
Bootstrap PAT — `start-local.ts` mints a default user / org `dev`
/ owner-scoped PAT on first boot under `LOBU_LOCAL_BOOTSTRAP=true`,
prints it once with the org slug + URL, and persists it under
`${OWLETTO_DATA_DIR}/bootstrap-pat.txt`. Production never sets the
flag, so the path is dead in cloud. Required collateral:
`multi-tenant.ts` now hydrates `c.var.user` for PAT bearers so
REST routes that read `c.get('user')` (e.g. POST /agents) work the
same as for cli-tokens; the asymmetry was previously masked by the
test mock setting `user` directly. CLI client.ts dropped the
trailing slash on POST /api/:org/agents — Hono `app.route()` does
not match `path/` against `routes.post('/')`.
`scripts/e2e-lobu-apply.sh` exercises create → noop → update →
drift detect against PGlite end-to-end, asserting REST rows landed.
Self-cleans server, data dir, and project dir; tears down on any
failure with a tail of the server log.
Out of scope: concurrent-apply lock, connection encryption asymmetry,
`lobu pull` (v2), secrets push (v3).
* fix(backend): close concurrent-apply races on agent + connection upsert (#466)
POST /agents was check-then-write: two concurrent applies in the same org
could both pass the existence check, both run saveMetadata, and both run
the Owletto MCP auto-injection — clobbering operator-set mcpServers. Use
INSERT ... ON CONFLICT (id) DO NOTHING RETURNING so only the actual
inserter runs the auto-injection; losers fall through to the idempotent
200 path.
PUT /:agentId/connections/by-stable-id/:stableId had the same race in its
create branch: two concurrent PUTs both observed `existing===null`, both
called chatManager.addConnection (or the fallback persist) with the same
stableId, and the second clobbered the first via ON CONFLICT DO UPDATE
plus spawned a duplicate platform instance. Atomically claim the row
with INSERT ... ON CONFLICT DO NOTHING; on conflict, re-read and fall
through to the existing update path (which keeps the hardening branch's
settingsChanged + allowGroups defaults intact) so concurrent identical
PUTs converge on a single noop or update response.
* docs(plans): lobu pull — cloud → files round-trip design (#463)
* refactor: rename connections to platforms (CLI + TOML + server route) (#470)
User-facing rename across the chat-platform CRUD surface:
- `lobu connections (list|add)` → `lobu platforms (list|add)`; description
reads "Manage chat platforms".
- `packages/cli/src/commands/connections/` → `packages/cli/src/commands/platforms/`,
with `platforms.ts` → `platform-prompts.ts` to avoid name clash with the
command file.
- `lobu.toml` schema: `[[agents.<id>.connections]]` → `[[agents.<id>.platforms]]`.
The file-loader emits a clear migration error if the legacy key is encountered,
pointing the user at the new block. No backwards-compat alias.
- Server routes under `/api/:org/agents/:id/` rename `connections` → `platforms`
for list / create / get / delete / start / stop / by-stable-id upsert. Body
shape (`platform` field) and DB internals (`agent_connections` table,
`ChatInstanceManager`) stay as-is — internal naming.
- `lobu apply` plumbing: `DesiredConnection` → `DesiredPlatform`,
`buildStableConnectionId` → `buildStablePlatformId`, `rowsByKind('connection')`
→ `rowsByKind('platform')`, plan renderer labels, snapshots, e2e script.
- Docs sweep: `packages/landing/src/content/docs/{reference/lobu-toml.md,
reference/lobu-apply.md, platforms/*.mdx}`, `docs/plans/lobu-apply.md`,
`lobu init` wizard scaffolding + generated `lobu.toml`.
Owletto-web submodule rename is a separate PR (the `useAgentConnections` hook
collides with an existing `useAgentPlatforms` schema-catalog hook there).
Tests: existing tests updated to use new field/route names + a new
`file-loader-platforms.test.ts` pinning the migration-error path.
* fix(backend): bootstrap PAT mode 0600 + PAT auth source disambiguation (#472)
* fix(backend): also gate platform start/stop routes with requireSessionOrAdminPat (#473)
Pi follow-up to #472: POST /:agentId/platforms/:platformId/start and
.../stop were the last admin-tier mutations on the bare mcpAuth gate. A
weak PAT (mcp:read or mcp:read+mcp:write but no mcp:admin) could bounce
a connection's lifecycle. Add the helper + 2 admission tests.
Also fixes a stale path in the existing PUT /:agentId/connections/by-stable-id
admission test — #470 renamed the route to /platforms/by-stable-id but the
admission test added in #472 still used the old path and only passed by
landing in the 404 branch (now correctly 403 via the renamed path).
* fix(backend): close residual races in agent + connection upsert (#474)
Pi flagged two follow-ups on PR #466:
1. POST /agents — INSERT and saveSettings(mcpServers) ran as two writes,
so a concurrent loser returning 200 in the idempotent branch could
PATCH operator config in between, only for the winner's deferred
saveSettings to clobber it. Fold mcp_servers into the same INSERT so
row + auto-injection land atomically.
2. PUT /:agentId/platforms/by-stable-id/:stableId — the PR-466
atomic-claim works for the no-manager fallback, but with the real
ChatInstanceManager a "loser" could re-read the just-created row
mid-addConnection and call updateConnection against half-initialized
state (potentially double-spawning the chat instance). Add an
in-process per-stableId Promise chain (primary serialization for
embedded-only deployment) plus a short-lived pg_advisory_xact_lock
keyed on (NAMESPACE, hashStableId(stableId)) for cross-process
defense-in-depth and pg_locks visibility.
Tests: 20-iteration concurrent POST asserts mcpServers.owletto.url is
populated regardless of which handler ran the INSERT; concurrent PUT
with a counting mock manager asserts addConnection is called exactly
once.
* fix(cli): address apply review gaps
* fix(cli): satisfy CodeQL token and regex checks
* chore(cli): remove unused memory helpers
* chore(submodule): bump owletto-web to current main
* fix(backend): align platform-update fallback settings to manager semantics
Fallback path was merging current.settings with the request, so apply could
not drop fields like allowFrom — diverging from both the manager path
(line 1079) and the change-detection at line 1068 which model replace
semantics. Use { allowGroups: true, ...settings } to match.
Summary
Fixes the check-then-write races pi flagged in the merged
lobu applyv1:saveMetadata, overwriting metadata + resetting the auto-injected Owletto MCP server. Now uses atomic ON CONFLICT DO NOTHING; only the actual insert path runs the MCP injection.Test plan
Out of scope
settingsfield). No overlap.Known follow-up