Skip to content

fix(backend): close concurrent-apply races on agent + connection upsert#466

Merged
buremba merged 1 commit into
feat/owletto-cli-mergefrom
feat/concurrent-apply-race-fix
Apr 30, 2026
Merged

fix(backend): close concurrent-apply races on agent + connection upsert#466
buremba merged 1 commit into
feat/owletto-cli-mergefrom
feat/concurrent-apply-race-fix

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented Apr 30, 2026

Summary

Fixes the check-then-write races pi flagged in the merged lobu apply v1:

  • POST /agents: same-org duplicate path was check-then-write. Two concurrent POSTs could both pass the existence check then both run saveMetadata, overwriting metadata + resetting the auto-injected Owletto MCP server. Now uses atomic ON CONFLICT DO NOTHING; only the actual insert path runs the MCP injection.
  • PUT /connections/by-stable-id/:stableId: create branch had the same race. Now atomic; concurrent creates serialize cleanly into one create + one noop/update.

Test plan

  • New concurrent-POST test: 2x parallel calls -> exactly 1x201, 1x200, 1 DB row
  • New concurrent-PUT test: 2x parallel calls (identical config) -> 1xcreated, 1xnoop/updated, 1 DB row
  • All existing tests still pass

Out of scope

  • The hardening branch is fixing the noop comparison (settings field). No overlap.
  • No org-level apply lock (deferred — needed if multi-resource transactional batch is ever introduced).

Known follow-up

  • Pi self-review summarized the patch correctly with no substantive bugs flagged but rate-limited before issuing an explicit SHIP/FIX verdict. Re-run pi when quota refreshes if a formal verdict is required.

@github-actions github-actions Bot added the triage:needs-human Triage agent escalated for human review label Apr 30, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Triage decision: needs-human

Reasons:

  • PR size exceeds auto-merge threshold (416 lines, 2 files; limits 300 lines, 10 files)

Next: Manual merge required after final review due to size threshold. PR is otherwise well-formed with good test coverage for the concurrent race fix.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32c0b340e4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +858 to +862
// We lost the race — someone else inserted the row between our read and
// our INSERT. Re-read and fall through to the update path so concurrent
// PUTs with the same config converge on a single noop response.
const reread = await connectionStore.getConnection(stableId);
if (!reread) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Block loser-path updates until create outcome is known

When a concurrent PUT loses the ON CONFLICT DO NOTHING claim, this branch immediately re-reads and falls into the normal update path against the placeholder row. In that path, chatManager.updateConnection() can return 200 without running create-time startup/validation (because the placeholder is stopped), so a malformed or non-startable config can be reported as success while the winning request later fails and deletes the row. This creates false-positive convergence for concurrent applies on new stable IDs.

Useful? React with 👍 / 👎.

POST /agents was check-then-write: two concurrent applies in the same org
could both pass the existence check, both run saveMetadata, and both run
the Owletto MCP auto-injection — clobbering operator-set mcpServers. Use
INSERT ... ON CONFLICT (id) DO NOTHING RETURNING so only the actual
inserter runs the auto-injection; losers fall through to the idempotent
200 path.

PUT /:agentId/connections/by-stable-id/:stableId had the same race in its
create branch: two concurrent PUTs both observed `existing===null`, both
called chatManager.addConnection (or the fallback persist) with the same
stableId, and the second clobbered the first via ON CONFLICT DO UPDATE
plus spawned a duplicate platform instance. Atomically claim the row
with INSERT ... ON CONFLICT DO NOTHING; on conflict, re-read and fall
through to the existing update path (which keeps the hardening branch's
settingsChanged + allowGroups defaults intact) so concurrent identical
PUTs converge on a single noop or update response.
@buremba buremba force-pushed the feat/concurrent-apply-race-fix branch from 32c0b34 to de85b27 Compare April 30, 2026 03:02
@github-actions
Copy link
Copy Markdown
Contributor

Triage decision: needs-human

Reasons:

  • Review comment contains escalation keyword "P1" from trusted bot reviewer
  • PR size exceeds auto-merge threshold (435 lines; limit 300 lines)
  • Concurrent race condition issue flagged for human judgment on update path safety

Next: Human review required for P1-flagged concurrency issue. Bot identified potential false-positive convergence risk in loser-path updates that needs design review.

@buremba buremba merged commit 3d5993c into feat/owletto-cli-merge Apr 30, 2026
5 checks passed
@buremba buremba deleted the feat/concurrent-apply-race-fix branch April 30, 2026 03:06
buremba added a commit that referenced this pull request Apr 30, 2026
Pi flagged two follow-ups on PR #466:

1. POST /agents — INSERT and saveSettings(mcpServers) ran as two writes,
   so a concurrent loser returning 200 in the idempotent branch could
   PATCH operator config in between, only for the winner's deferred
   saveSettings to clobber it. Fold mcp_servers into the same INSERT so
   row + auto-injection land atomically.

2. PUT /:agentId/platforms/by-stable-id/:stableId — the PR-466
   atomic-claim works for the no-manager fallback, but with the real
   ChatInstanceManager a "loser" could re-read the just-created row
   mid-addConnection and call updateConnection against half-initialized
   state (potentially double-spawning the chat instance). Add an
   in-process per-stableId Promise chain (primary serialization for
   embedded-only deployment) plus a short-lived pg_advisory_xact_lock
   keyed on (NAMESPACE, hashStableId(stableId)) for cross-process
   defense-in-depth and pg_locks visibility.

Tests: 20-iteration concurrent POST asserts mcpServers.owletto.url is
populated regardless of which handler ran the INSERT; concurrent PUT
with a counting mock manager asserts addConnection is called exactly
once.
buremba added a commit that referenced this pull request Apr 30, 2026
Pi flagged two follow-ups on PR #466:

1. POST /agents — INSERT and saveSettings(mcpServers) ran as two writes,
   so a concurrent loser returning 200 in the idempotent branch could
   PATCH operator config in between, only for the winner's deferred
   saveSettings to clobber it. Fold mcp_servers into the same INSERT so
   row + auto-injection land atomically.

2. PUT /:agentId/platforms/by-stable-id/:stableId — the PR-466
   atomic-claim works for the no-manager fallback, but with the real
   ChatInstanceManager a "loser" could re-read the just-created row
   mid-addConnection and call updateConnection against half-initialized
   state (potentially double-spawning the chat instance). Add an
   in-process per-stableId Promise chain (primary serialization for
   embedded-only deployment) plus a short-lived pg_advisory_xact_lock
   keyed on (NAMESPACE, hashStableId(stableId)) for cross-process
   defense-in-depth and pg_locks visibility.

Tests: 20-iteration concurrent POST asserts mcpServers.owletto.url is
populated regardless of which handler ran the INSERT; concurrent PUT
with a counting mock manager asserts addConnection is called exactly
once.
buremba added a commit that referenced this pull request Apr 30, 2026
* refactor(cli): merge owletto bin into lobu memory namespace

Consolidates the standalone `owletto` CLI into `@lobu/cli` as the
`lobu memory` namespace and deletes the entire `packages/owletto-cli`
package. The merge:

- Drops the `owletto` binary entirely (no compat wrapper).
- Drops `owletto dev` (docker-compose, dead — there's no compose file).
- Drops `owletto start` and `owletto version` (lobu run + lobu --version
  cover them; one boot path: the embedded backend bundle).
- Folds 13 commands into `lobu memory <verb>`: seed, run, login, token,
  health, configure, browser-auth, skills list/add, init, org current/set.
- Adds top-level `lobu doctor` (system deps + memory MCP reachability;
  folds owletto's `doctor` and `health` checks).
- `lobu init` now wires the memory plugin config inline at the end of
  scaffold when memory is enabled (no separate `owletto init` step).

Code moves:
- packages/owletto-cli/src/lib/* → packages/cli/src/commands/memory/_lib/
- packages/owletto-cli/src/commands/{openclaw,seed,browser-auth}.ts →
  _lib/*-cmd.ts (citty wrappers stripped, citty defineCommand exports
  replaced by plain async functions consumed by thin commander handlers)
- skills/owletto and skills/owletto-openclaw now bundle through cli's
  build.cjs alongside skills/lobu (one bundled-skills directory).

Build / CI / release-please cleanup:
- Drop packages/owletto-cli from workspace globs, tsconfig excludes,
  biome ignores, release-please-config.json, .release-please-manifest,
  publish-packages.mjs, bump-version.mjs, knip workspace overrides.
- ci.yml: stop running `bun test packages/owletto-cli` and stop
  typechecking it in the matrix.
- packages/cli/package.json: pull in @clack/prompts, @lobu/owletto-sdk,
  playwright (formerly transitive via owletto-cli).
- README + landing site (rename reference/owletto-cli.md →
  reference/lobu-memory.md, update astro nav, sidebar links, internal
  anchors) + skills/owletto/references: replace `npx owletto@latest`
  with `npx @lobu/cli@latest memory`.
- docs/RELEASING.md: drop the unscoped `owletto` package; six → five.
- src/index.ts (lobu admin) test that scans owletto-cli for callTool
  refs now points at packages/cli/src/commands/memory/.

Validation:
- make build-packages clean.
- bun run typecheck clean.
- bun run check (biome) clean across 281 files.
- bun test packages/owletto-backend/src/auth/__tests__/tool-access.test.ts:
  47 pass (the cli-source walker correctly resolves the new memory dir).
- lobu --help, lobu memory --help, lobu memory token --help,
  lobu memory init --help all render the new shape.

* fix(cli): consolidate memory auth and skills into lobu

* refactor(cli): drop dead memory-token helpers and restore doctor server ping

- Remove unused `printMemoryToken`, `TokenOptions`, and
  `resolveConfiguredOrgUrl` from memory/_lib/openclaw-cmd.ts left
  behind after the auth consolidation moved memory token retrieval to
  top-level `lobu token`.
- Restore the server-URL connectivity check in `lobu doctor` (default
  mode) that used to ship as part of `owletto doctor`. `--memory-only`
  still shortcuts to authenticated MCP validation.
- Update plan doc references from `lobu memory token` to the actual
  unified surface `lobu token --raw`.

* feat(backend): persist egressConfig + preApprovedTools + guardrails on agents (#460)

* docs(plans): lobu apply — file → cloud converger plan

Three-PR rollout for lobu apply, the file-first → cloud converger.
Reuse-first design: existing /agents and /manage_entity_schema routes
are mostly idempotent already; only one upsert route is added. CLI
loops over them in dependency order. Re-running converges on partial
failure.

Defers: --prune, --force, lobu pull, watchers, raw SKILL.md round-trip,
org-scoped agent IDs, secret upload, memory data — all v2/v3.

* feat(backend): persist egressConfig + preApprovedTools + guardrails on agents

The file-loader produces these three settings fields from lobu.toml but
the agents table had no columns for them, so postgres-stores.ts silently
dropped them on every saveSettings(). This blocks `lobu apply`: cloud
would never accept the values local agents already have, leaving every
re-run as drift.

Adds the three jsonb columns (with empty defaults) and round-trips them
through rowToSettings / saveSettings / deleteSettings. New unit tests
cover populated, absent, and post-delete shapes against the test DB.

PR-1 of docs/plans/lobu-apply.md.

* feat(backend): idempotent agent create + stable-id connection upsert (#461)

PR-2 of `docs/plans/lobu-apply.md`. Two narrow extensions to
`lobu/agent-routes.ts` so `lobu apply` can converge by re-running:

1. `POST /` with an agent that already exists in the same org now
   returns 200 with the existing payload instead of 409. Cross-org
   collision still returns 409. The Owletto-MCP auto-injection only
   runs on the actual create path, never on the idempotent return —
   so re-applying never clobbers operator-edited `mcpServers`.

2. New `PUT /:agentId/connections/by-stable-id/:stableId` for
   deterministic-ID upsert. Reuses `ChatInstanceManager.addConnection`
   (which already accepts a caller-supplied `stableId`) for create,
   `ChatInstanceManager.updateConnection` for change. Returns
   `{ noop: true, connection }` when submitted config matches existing,
   `{ updated: true, willRestart: true, connection }` when it changes,
   201 with the connection on first create.

Tests cover the 200/409 shapes, the no-clobber MCP guarantee, the
noop/updated/create branches of the upsert, and the unknown-agent 404.
Test harness mocks `mcpAuth` + `getChatInstanceManager` so the routes
exercise the no-manager fallback path against a pglite Postgres.

* feat(cli): lobu apply — files → org converger (#462)

* feat(cli): lobu apply — files → org converger

PR-3 of docs/plans/lobu-apply.md. Implements `lobu apply`:

- Parses lobu.toml via existing cli/src/config/loader.ts:loadConfig
- Walks $VAR refs to enforce required-secrets check before any mutation
- Computes per-resource diff client-side (create/update/noop/drift)
- Loops over existing CRUD endpoints in dependency order:
  agents → settings → connections → entity types → relationship types
- Re-runs converge on partial failure (every endpoint is idempotent)

CLI surface:
  lobu apply [--dry-run] [--yes] [--only agents|memory] [--org <slug>]

Reuses _lib/openclaw-auth.ts (PR #459) for token + URL resolution.
Inlines buildStableConnectionId from owletto-backend's file-loader.ts:56
with a sync-comment — keep matching when either side changes.

Snapshot tests cover diff rendering for create/update/noop/drift and the
will-restart warning. End-to-end test runs once PR-1 + PR-2 land.

* fix(cli): apply — substitute $VAR refs and align connection wire shape

Two showstopper bugs caught by pi review before integration test:

1. Connection upsert was sending `type` but PR-2's `PUT /by-stable-id`
   route expects `platform`. Server would 400 on every connection apply.
   Aligns the client and apply-cmd payloads to `{ platform, name?, config }`.

2. `$VAR` references in connection configs were collected for the
   secrets check but never substituted. Apply would send literal
   `"botToken": "$TELEGRAM_BOT_TOKEN"` to the server. Now resolves
   against `process.env` (or an injected env for tests) and fails
   loudly with the var name when missing or empty.

Adds `env` option to `loadDesiredState` for testability. Test coverage
extended: existing test asserts the resolved value is in the desired
config; new test asserts unset $VAR throws a clear error.

Validation: typecheck, biome, 27 cli tests — all clean.

* docs(plans): lobu secrets push — security-conscious design (#465)

* fix(backend): align connection-config encrypt/decrypt prefix (#464)

encryptConfig() in postgres-stores.ts wrote raw `iv:tag:ciphertext`
output from @lobu/core's `encrypt()`, but decryptConfig() only
decrypts strings prefixed with `enc:v1:`. So any secret-named field
that ran through encryptConfig was stored as prefixless ciphertext
and round-tripped as that ciphertext literal on read.

Production was unaffected because ChatInstanceManager normalizes
secrets to `secret://` refs before persisting to chat_connections.
The agent_connections.config rows written by persistConnectionSnapshot
on the older lobu agent-routes paths (and the no-manager fallback added
in PR #461) are the affected surface.

Fix:
- encryptConfig prepends `enc:v1:` to the AES-GCM output.
- decryptConfig strips the prefix before delegating to `decrypt()`.
- Migration 20260430022231 backfills agent_connections.config rows
  whose values match the prefixless `iv:tag:ciphertext` shape (24 hex
  IV + 32 hex tag + hex ciphertext) by re-prefixing them. Idempotent —
  re-running is a noop because the regex skips already-prefixed values.

Adds round-trip + boundary tests in
packages/owletto-backend/src/lobu/stores/__tests__/postgres-stores.test.ts.

* feat: lobu apply hardening + e2e harness (#468)

Three things on top of the merged lobu apply v1:

Bug A — PUT /agents/:id/connections/by-stable-id/:stableId now folds
  `settings` (allowFrom, allowGroups, …) into the noop comparison so
  settings-only edits trip willRestart instead of being silently
  noop'd. Symmetric default `allowGroups: true` on the create-fallback
  path keeps follow-up PUTs round-tripping as noop.

Bug B — `diff.canonical` no longer collapses `[]` and `{}` to null;
  clearing a remote allowlist by setting it to `[]` now produces an
  `update` action. The `platform` key the route injects into stored
  connection config is stripped before comparing so an unchanged
  connection round-trips as `=`.

Bootstrap PAT — `start-local.ts` mints a default user / org `dev`
  / owner-scoped PAT on first boot under `LOBU_LOCAL_BOOTSTRAP=true`,
  prints it once with the org slug + URL, and persists it under
  `${OWLETTO_DATA_DIR}/bootstrap-pat.txt`. Production never sets the
  flag, so the path is dead in cloud. Required collateral:
  `multi-tenant.ts` now hydrates `c.var.user` for PAT bearers so
  REST routes that read `c.get('user')` (e.g. POST /agents) work the
  same as for cli-tokens; the asymmetry was previously masked by the
  test mock setting `user` directly. CLI client.ts dropped the
  trailing slash on POST /api/:org/agents — Hono `app.route()` does
  not match `path/` against `routes.post('/')`.

`scripts/e2e-lobu-apply.sh` exercises create → noop → update →
  drift detect against PGlite end-to-end, asserting REST rows landed.
  Self-cleans server, data dir, and project dir; tears down on any
  failure with a tail of the server log.

Out of scope: concurrent-apply lock, connection encryption asymmetry,
`lobu pull` (v2), secrets push (v3).

* fix(backend): close concurrent-apply races on agent + connection upsert (#466)

POST /agents was check-then-write: two concurrent applies in the same org
could both pass the existence check, both run saveMetadata, and both run
the Owletto MCP auto-injection — clobbering operator-set mcpServers. Use
INSERT ... ON CONFLICT (id) DO NOTHING RETURNING so only the actual
inserter runs the auto-injection; losers fall through to the idempotent
200 path.

PUT /:agentId/connections/by-stable-id/:stableId had the same race in its
create branch: two concurrent PUTs both observed `existing===null`, both
called chatManager.addConnection (or the fallback persist) with the same
stableId, and the second clobbered the first via ON CONFLICT DO UPDATE
plus spawned a duplicate platform instance. Atomically claim the row
with INSERT ... ON CONFLICT DO NOTHING; on conflict, re-read and fall
through to the existing update path (which keeps the hardening branch's
settingsChanged + allowGroups defaults intact) so concurrent identical
PUTs converge on a single noop or update response.

* docs(plans): lobu pull — cloud → files round-trip design (#463)

* refactor: rename connections to platforms (CLI + TOML + server route) (#470)

User-facing rename across the chat-platform CRUD surface:

- `lobu connections (list|add)` → `lobu platforms (list|add)`; description
  reads "Manage chat platforms".
- `packages/cli/src/commands/connections/` → `packages/cli/src/commands/platforms/`,
  with `platforms.ts` → `platform-prompts.ts` to avoid name clash with the
  command file.
- `lobu.toml` schema: `[[agents.<id>.connections]]` → `[[agents.<id>.platforms]]`.
  The file-loader emits a clear migration error if the legacy key is encountered,
  pointing the user at the new block. No backwards-compat alias.
- Server routes under `/api/:org/agents/:id/` rename `connections` → `platforms`
  for list / create / get / delete / start / stop / by-stable-id upsert. Body
  shape (`platform` field) and DB internals (`agent_connections` table,
  `ChatInstanceManager`) stay as-is — internal naming.
- `lobu apply` plumbing: `DesiredConnection` → `DesiredPlatform`,
  `buildStableConnectionId` → `buildStablePlatformId`, `rowsByKind('connection')`
  → `rowsByKind('platform')`, plan renderer labels, snapshots, e2e script.
- Docs sweep: `packages/landing/src/content/docs/{reference/lobu-toml.md,
  reference/lobu-apply.md, platforms/*.mdx}`, `docs/plans/lobu-apply.md`,
  `lobu init` wizard scaffolding + generated `lobu.toml`.

Owletto-web submodule rename is a separate PR (the `useAgentConnections` hook
collides with an existing `useAgentPlatforms` schema-catalog hook there).

Tests: existing tests updated to use new field/route names + a new
`file-loader-platforms.test.ts` pinning the migration-error path.

* fix(backend): bootstrap PAT mode 0600 + PAT auth source disambiguation (#472)

* fix(backend): also gate platform start/stop routes with requireSessionOrAdminPat (#473)

Pi follow-up to #472: POST /:agentId/platforms/:platformId/start and
.../stop were the last admin-tier mutations on the bare mcpAuth gate. A
weak PAT (mcp:read or mcp:read+mcp:write but no mcp:admin) could bounce
a connection's lifecycle. Add the helper + 2 admission tests.

Also fixes a stale path in the existing PUT /:agentId/connections/by-stable-id
admission test — #470 renamed the route to /platforms/by-stable-id but the
admission test added in #472 still used the old path and only passed by
landing in the 404 branch (now correctly 403 via the renamed path).

* fix(backend): close residual races in agent + connection upsert (#474)

Pi flagged two follow-ups on PR #466:

1. POST /agents — INSERT and saveSettings(mcpServers) ran as two writes,
   so a concurrent loser returning 200 in the idempotent branch could
   PATCH operator config in between, only for the winner's deferred
   saveSettings to clobber it. Fold mcp_servers into the same INSERT so
   row + auto-injection land atomically.

2. PUT /:agentId/platforms/by-stable-id/:stableId — the PR-466
   atomic-claim works for the no-manager fallback, but with the real
   ChatInstanceManager a "loser" could re-read the just-created row
   mid-addConnection and call updateConnection against half-initialized
   state (potentially double-spawning the chat instance). Add an
   in-process per-stableId Promise chain (primary serialization for
   embedded-only deployment) plus a short-lived pg_advisory_xact_lock
   keyed on (NAMESPACE, hashStableId(stableId)) for cross-process
   defense-in-depth and pg_locks visibility.

Tests: 20-iteration concurrent POST asserts mcpServers.owletto.url is
populated regardless of which handler ran the INSERT; concurrent PUT
with a counting mock manager asserts addConnection is called exactly
once.

* fix(cli): address apply review gaps

* fix(cli): satisfy CodeQL token and regex checks

* chore(cli): remove unused memory helpers

* chore(submodule): bump owletto-web to current main

* fix(backend): align platform-update fallback settings to manager semantics

Fallback path was merging current.settings with the request, so apply could
not drop fields like allowFrom — diverging from both the manager path
(line 1079) and the change-detection at line 1068 which model replace
semantics. Use { allowGroups: true, ...settings } to match.
@buremba buremba restored the feat/concurrent-apply-race-fix branch May 12, 2026 00:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

triage:needs-human Triage agent escalated for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant