feat: app-control bundled skill (per-app screenshot + raw input) #29343

Merged: siddseethepalli merged 29 commits into main from siddseethepalli/app-control-skill on May 3, 2026

@siddseethepalli siddseethepalli commented May 3, 2026

Summary

Adds a new app-control bundled skill that lets the assistant observe and send raw input (keyboard, mouse) to a specific named macOS application. Complements computer-use for cases where the AX tree is unhelpful: emulators, games, OpenGL canvases, custom-rendered Electron apps. Bypasses the AX tree by capturing per-window screenshots and posting input events scoped to a single process via CGEventPostToPid.

Architecture: refactor HostCuProxy onto a shared HostProxyBase (PR 1), then a parallel proxy/tool/skill stack for app-control, then macOS client primitives (window screenshot, keyboard, mouse, executor) and connection wiring.

Self-review result

PASS on integration correctness (round 2). 7 production-breaking issues caught and fixed in 5 follow-up PRs. 6 stylistic/slop items remain as follow-ups (see below — none affect correctness).

PRs merged into feature branch

Implementation (16 PRs)

Self-review fixes (5 PRs)

Rollout

The feature flag app-control is registered with defaultEnabled: false in meta/feature-flags/feature-flag-registry.json. Not provisioned in LaunchDarkly Terraform — registry default is the gate. Trade-offs: no remote kill switch, no graduated rollout, but local override via ~/.vellum/protected/feature-flags.json works for testing.

The flag is read in exactly one place: the skill-projection layer's SKILL.md frontmatter check. The proxy is instantiated unconditionally when the macOS client supports host_app_control; gating lives only in the SKILL.md frontmatter. To GA later: remove the frontmatter line and the registry entry — no code reads the flag, so orphan overrides in protected/feature-flags.json are inert.

Remaining slop (follow-up candidates, not blockers)

Surfaced during round-2 self-review. None are correctness bugs:

  1. Four-site duplication of host-proxy preactivation logic (conversation-routes.ts, process-message.ts, two paths in conversation-process.ts). An applyHostProxyPreactivations helper would consolidate.
  2. HostProxyBase.cancel() is test-only public API; could be deleted or moved to HostTransferProxy-style.
  3. HostAppControlProxy.activeApp field is set but only read by tests.
  4. HostAppControlRequest.toolName (Swift) is decoded but unread; the discriminator now lives in input.tool.
  5. HostAppControlCancel.conversationId (Swift) is decoded but unread.
  6. HOST_PREFIX_KEYS_BY_LENGTH longest-prefix match is over-engineered for current state; direct lookup would suffice.
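
The helper suggested in item 1 could look roughly like the sketch below. Only the name applyHostProxyPreactivations and the skill IDs computer-use and app-control come from this PR; the ConversationLike shape and the use of a Set for the preactivated skill IDs are assumptions made for illustration.

```typescript
// Assumed minimal shape; the real Conversation class has many more members.
interface ConversationLike {
  hostCuProxy?: object;
  hostAppControlProxy?: object;
}

// Re-adds the host-proxy skills after the preactivated set has been reset,
// so every call site applies the same rule.
function applyHostProxyPreactivations(
  conversation: ConversationLike,
  preactivatedSkillIds: Set<string>,
): Set<string> {
  if (conversation.hostCuProxy) preactivatedSkillIds.add("computer-use");
  if (conversation.hostAppControlProxy) preactivatedSkillIds.add("app-control");
  return preactivatedSkillIds;
}
```

Each of the four sites would then call this one function instead of repeating the two conditionals.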

Part of plan: app-control-skill.md


siddseethepalli and others added 21 commits May 2, 2026 23:10
Add the host_app_control capability to the HostProxyCapability union (macOS only) and declare the wire types (HostAppControlRequest, HostAppControlInput discriminated union, HostAppControlCancel, HostAppControlState, HostAppControlResultPayload). No consumers yet — this is type-only scaffolding for the proxy class in PR 4.

Part of plan: app-control-skill.md (PR 2 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add Swift types (HostAppControlRequest, HostAppControlInput discriminated enum, HostAppControlCancel, HostAppControlState, HostAppControlResultPayload, WindowBounds) mirroring the TypeScript wire shapes added in PR 2. Codable round-trip matches the JSON conventions used by HostCuRequest.

Part of plan: app-control-skill.md (PR 3 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Extract the structurally-shared lifecycle (pending map, timeout, abort SSE, dispose, isAvailable) from HostCuProxy into a new abstract HostProxyBase class. HostCuProxy now extends the base and retains only CU-specific state (step counter, AX-tree diff, loop detector).

Part of plan: app-control-skill.md (PR 1 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Define the 8 app-control proxy tools (start, observe, press, combo, type, click, drag, stop) with executionMode: 'proxy' and stub execute() that throws. Add forwardAppControlProxyTool() bridge helper. Mirrors the computer-use tool-definition pattern.

Part of plan: app-control-skill.md (PR 5 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add HostAppControlProxy extending the shared HostProxyBase. Owns app-control-specific state: per-instance active-app, PNG-hash loop guard (5 identical observations -> stuck warning), and a module-level singleton lock so only one conversation holds an active session at a time. Disposing the proxy releases the lock.

Part of plan: app-control-skill.md (PR 4 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add AppKeyboard helper that posts synthetic keyboard events to a target process via CGEventPostToPid (NOT CGEventPost) so input is scoped to the target app and never leaks to other foregrounded windows. Supports press (with optional hold duration), combo (simultaneous multi-key hold), and type (Unicode-aware string typing). On cancellation, all held keys are released before re-throwing.

Part of plan: app-control-skill.md (PR 7 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add AppMouse helper that posts synthetic mouse clicks and drags to a target process via CGEventPostToPid (NOT CGEventPost). Coordinates are window-relative and translated to global at post time. Click supports left/right/middle and an optional double-click flag (sets mouseEventClickState=2). Drag posts mouseDown -> 10 interpolated mouseDragged events -> mouseUp.

Part of plan: app-control-skill.md (PR 8 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add AppWindowCapture for capturing the frontmost normal window of a target process by PID. Returns CaptureResult with state (running/missing/minimized) and PNG base64 + window bounds when available. Distinguishes a missing process from a minimized one. PNG encoding via CGImageDestination.

Part of plan: app-control-skill.md (PR 6 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add the result-pickup HTTP endpoint that the macOS client POSTs to after executing an app-control action. Mirrors the host-cu-result route. Forwards the payload to conversation.hostAppControlProxy.resolve(requestId, payload). Adds the field declaration on Conversation; full lifecycle wiring lands in PR 10.

Part of plan: app-control-skill.md (PR 9 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Register the new app-control bundled skill (SKILL.md + TOOLS.json + 8 tool stubs forwarding through skill-proxy-bridge). Add the app-control feature flag (defaultEnabled: false, scope: assistant). The skill is gated by the flag via SKILL.md frontmatter; no in-code flag checks needed since the projection layer handles gating.

Part of plan: app-control-skill.md (PR 12 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Implement AppControlExecutor that switches on HostAppControlRequest.input and dispatches to AppWindowCapture (async, ScreenCaptureKit-backed since macOS 15 deprecated CGWindowListCreateImage), AppKeyboard, and AppMouse. Resolves the target app to a pid_t via bundle ID first then localized name. Click/drag fetch current window bounds before posting events.

Part of plan: app-control-skill.md (PR 13 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…29329)

Mirror the four hostCuProxy attachment points in Conversation: declare the field, add setHostAppControlProxy, dispose the proxy in Conversation.dispose, and parallel any teardown/availability checks. PR 9 added the field declaration; this PR completes the lifecycle wiring.

Part of plan: app-control-skill.md (PR 10 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…29331)

Add a sibling branch to the computer_use_* dispatch in surfaceProxyResolver. app_control_stop is handled locally (calls proxy.dispose, returns a stopped summary, no client round-trip), matching CU's _done/_respond pattern. All other app_control_* tools forward to ctx.hostAppControlProxy.request. Returns an isError unavailability result when no proxy or no client connected.

Part of plan: app-control-skill.md (PR 11 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add hostAppControlRequest and hostAppControlCancel handlers in the SSE message dispatch, mirroring the existing hostCu* handlers. Each request launches a cancellable Task that calls AppControlExecutor.perform(_:) and POSTs the result to /v1/host-app-control-result. Capability advertisement now includes both host_cu and host_app_control.

Part of plan: app-control-skill.md (PR 15 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…29333)

When a connecting client supports the host_app_control capability, unconditionally instantiate HostAppControlProxy and attach it to the Conversation, plus preactivate the app-control skill. The feature flag is read only by the skill-projection layer via SKILL.md frontmatter — no in-code flag check is needed since unreached tools are harmless.

Part of plan: app-control-skill.md (PR 14 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…9335)

Add an end-to-end app-control flow test driving a fake conversation through start -> observe -> stop with mocked SSE broadcasts and POSTs to /v1/host-app-control-result, plus singleton-lock coverage. Add a static-analysis guard that fails if any AppControl swift file uses the deprecated global CGEventPost (CGEventPostToPid / CGEvent.postToPid are required).

Part of plan: app-control-skill.md (PR 16 of 16)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
The 400-line tools/app-control/definitions.ts was referenced only by app-control-tool-schemas.test.ts. The production bundled-skill path uses TOOLS.json + bundled-tool-registry.ts. The hand-duplicated schemas in definitions.ts had no sync enforcement against TOOLS.json. Rewrite the schema test to validate TOOLS.json directly.

The skill-proxy-bridge.ts helper is preserved (the bundled-skill stubs still use it).

Part of plan: app-control-skill.md (fix round 1)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…e capability correctly (#29339)

Two production-breaking fixes for app-control:
1. registerPendingInteraction now handles host_app_control_request by registering with kind: 'host_app_control'. Without this, every result POST from the macOS client fell through the route handler's early-return and the proxy's promise never resolved.
2. capabilityForMessageType now matches the longest prefix before the trailing _request/_cancel suffix. Previously it sliced to the second underscore, mapping host_app_control_request to undefined and broadcasting to all subscribers instead of routing only to host_app_control-capable clients.
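
The routing bug in fix 2 can be illustrated with a small sketch. The function name capabilityForMessageType and the two capability strings come from this PR; the constant name, the pre-fix helper, and the exact matching rules are assumptions reconstructed from the description above, not the daemon's actual code.

```typescript
type Capability = "host_cu" | "host_app_control";

// Longest keys first, so a longer capability wins over any shorter one that
// happens to be its prefix.
const CAPABILITIES_BY_LENGTH: Capability[] = (
  ["host_cu", "host_app_control"] as Capability[]
).sort((a, b) => b.length - a.length);

// Pre-fix behavior as described: keep only the first two
// underscore-separated segments of the message type.
function capabilityBySecondUnderscore(messageType: string): Capability | undefined {
  const candidate = messageType.split("_").slice(0, 2).join("_");
  return CAPABILITIES_BY_LENGTH.find((c) => c === candidate);
}

// Post-fix behavior: strip the trailing _request/_cancel suffix, then take
// the longest capability that matches the remaining stem.
function capabilityForMessageType(messageType: string): Capability | undefined {
  const stem = messageType.replace(/_(request|cancel)$/, "");
  return CAPABILITIES_BY_LENGTH.find(
    (c) => stem === c || stem.startsWith(`${c}_`),
  );
}
```

With the pre-fix logic, host_app_control_request is cut down to host_app, which matches no capability, so the message is broadcast to every subscriber.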

Part of plan: app-control-skill.md (fix round 1)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…ad state (#29340)

Four entangled correctness fixes:
1. surfaceProxyResolver injects 'tool' (e.g. 'start', 'observe') derived from toolName before forwarding to HostAppControlProxy. Without this, the Swift client could not decode requests and the singleton-lock guard never fired.
2. app_control_stop now clears the Conversation's hostAppControlProxy reference after dispose so subsequent tool calls cleanly fail with 'unavailable' instead of dispatching to a disposed proxy.
3. Delete the write-only _actionHistory ring buffer, recordActionFingerprint method, and actionHistory getter; nothing in production read them.
4. PNG-hash STUCK_REPEAT_THRESHOLD lowered from 5 to 4 so the warning fires after 5 total identical observations as the plan specified, not 6.
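
The off-by-one in fix 4 is easier to see in code. The sketch below assumes the guard counts repeats after the first observation, which is what makes a threshold of 4 correspond to 5 identical observations in total. The constant name comes from the commit; the class and method names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Number of repeats AFTER the first observation, so 4 repeats means
// 5 identical observations in total.
const STUCK_REPEAT_THRESHOLD = 4;

// Hypothetical helper; in the PR this state lives on HostAppControlProxy.
class ObservationLoopGuard {
  private lastHash?: string;
  private repeats = 0;

  // Returns true when this observation should carry a "stuck" warning.
  record(pngBase64: string): boolean {
    const hash = createHash("sha256").update(pngBase64).digest("hex");
    if (hash === this.lastHash) {
      this.repeats += 1;
    } else {
      this.lastHash = hash;
      this.repeats = 0;
    }
    return this.repeats >= STUCK_REPEAT_THRESHOLD;
  }
}
```

With the old threshold of 5, the same counter would only trip on the sixth identical observation.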

Part of plan: app-control-skill.md (fix round 1)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Both dequeue paths in conversation-process.ts reset preactivatedSkillIds and only re-added computer-use. Add the parallel re-add for app-control so the skill remains projected for queued messages 2+, mirroring the CU branch.

Part of plan: app-control-skill.md (fix round 1)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…ccluded state (#29341)

Two wire-type coherence fixes:
1. HostAppControlCancel (TS + Swift) was missing conversationId, but host-proxy-base.ts has always sent it on the wire. Schema now matches the actual envelope, matching HostCuCancelRequest's shape.
2. Drop the HostAppControlState.occluded variant from TS, Swift, the route Zod schema, TOOLS.json, and definitions.ts. AppWindowCapture only emits running/minimized/missing; nothing produces occluded. Re-add when a producer exists.

Part of plan: app-control-skill.md (fix round 1)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
@siddseethepalli siddseethepalli requested a review from awlevin as a code owner May 3, 2026 04:46
@siddseethepalli
Contributor Author

Plan papertrail: `app-control` bundled skill

app-control bundled skill

Overview

Add a new bundled skill app-control that lets the assistant observe and send raw input (keyboard, mouse) to a specific named macOS application. This complements the existing computer-use skill, which targets the macOS Accessibility tree and is unsuitable for any app where the AX tree is unhelpful — emulators, games, OpenGL canvases, custom-rendered Electron windows. The new skill bypasses the AX tree by capturing per-window screenshots and posting input events scoped to a single process via CGEventPostToPid.

The work is structured as a refactor of the existing HostCuProxy onto a shared HostProxyBase, then a parallel proxy/tool/skill stack for app-control, then macOS client primitives and wiring. PRs are intentionally narrow so the dependency graph admits multiple parallel waves.

Why a separate skill (not extending computer-use)

  • HostCuProxy enforces a 50-step session cap and an AX-tree-diff loop detector. Both are correct for ad-hoc UI tasks and wrong for sustained interactive control of a single app. Extending CU would require either tool-aware conditional guards (smell — same proxy, two policies) or guards that the new tools bypass (smell — new tools that ignore CU's safeguards).
  • CU's tool surface is built around "find element ID, click element ID". A separate skill is an unambiguous signal to the model that it's operating in a paradigm without an AX tree.
  • Independent feature flag and rollout posture: app-control posts raw keyboard events to specific PIDs; CU's posture is different and they should iterate independently.
  • HostProxyBase is extracted as part of this work so the structural mirror between proxies is paid down once.

PR 1: Extract HostProxyBase from HostCuProxy

Depends on

None

Branch

app-control/pr-1-extract-host-proxy-base

Title

refactor(daemon): extract HostProxyBase from HostCuProxy

Files

  • assistant/src/daemon/host-proxy-base.ts (new)
  • assistant/src/daemon/host-cu-proxy.ts (modified)
  • assistant/src/__tests__/host-proxy-base.test.ts (new)
  • assistant/src/__tests__/host-cu-proxy.test.ts (existing — must continue passing)

Implementation steps

  1. Create assistant/src/daemon/host-proxy-base.ts exporting abstract class HostProxyBase<TRequest, TResultPayload>. Constructor options: { capabilityName: HostProxyCapability, requestEventName: string, cancelEventName: string, resultPendingKind: string, timeoutMs?: number } (default timeoutMs 60_000). Owns:
    • protected pending: Map<string, { resolve: (p: TResultPayload) => void; reject: (err: Error) => void; timer: NodeJS.Timeout }>
    • protected request(toolName: string, input: unknown, conversationId: string, signal: AbortSignal): Promise<TResultPayload> — generates requestId, broadcasts requestEventName via assistantEventHub, registers a timeout that rejects with "timeout", registers an abort handler that broadcasts cancelEventName and rejects with "aborted", returns the deferred promise.
    • resolve(requestId: string, payload: TResultPayload): void — looks up and resolves; no-op if absent.
    • cancel(requestId: string, reason: string): void — broadcasts cancel and rejects.
    • dispose(): void — rejects all pending with "disposed", clears timers and the map.
    • isAvailable(): boolean — returns assistantEventHub.getMostRecentClientByCapability(this.capabilityName) != null.
  2. Modify assistant/src/daemon/host-cu-proxy.ts: change class HostCuProxy to extends HostProxyBase<HostCuRequest, HostCuResultPayload>. Pass the existing constants (capability "host_cu", request event "host_cu_request", cancel event "host_cu_cancel", pending kind "host_cu", timeout 60_000) to super(). Delete the local pending-map / timeout / abort / dispose / isAvailable fields and methods. Keep CU-specific state in the subclass: stepCount, axTreeHistory, consecutiveUnchangedSteps, the 3-step repeated-action guard. CU's public method (request(toolName, input, conversationId, signal)) wraps super.request(...) and applies the CU-specific step cap and loop detection on top.
  3. Create assistant/src/__tests__/host-proxy-base.test.ts. Build a minimal test subclass class TestProxy extends HostProxyBase<{ input: string }, { result: string }> with mock event names. Cover: (a) request() resolves when resolve() is called with the right id, (b) timeout rejects after timeoutMs, (c) abort signal triggers cancel broadcast and rejection, (d) dispose() rejects all pending, (e) isAvailable() reflects assistantEventHub capability presence.
  4. Run the existing host-cu-proxy.test.ts — must pass byte-for-byte. This is the regression gate. If any test fails, the refactor preserved less behavior than intended; fix before merging.
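
A minimal sketch of the lifecycle described in step 1, under stated assumptions: the broadcast callback stands in for assistantEventHub (whose API is not shown in this plan), the constructor takes positional arguments instead of the options object, and isAvailable() is omitted. The class name HostProxyBaseSketch is hypothetical.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical stand-in for the event the hub would broadcast.
interface ProxyEvent {
  kind: string;
  requestId: string;
  conversationId: string;
  toolName?: string;
  input?: unknown;
}
type Broadcast = (event: ProxyEvent) => void;

interface PendingEntry<T> {
  resolve: (payload: T) => void;
  reject: (err: Error) => void;
  timer: ReturnType<typeof setTimeout>;
}

abstract class HostProxyBaseSketch<TResultPayload> {
  protected pending = new Map<string, PendingEntry<TResultPayload>>();
  private readonly requestEventName: string;
  private readonly cancelEventName: string;
  private readonly broadcast: Broadcast;
  private readonly timeoutMs: number;

  constructor(requestEventName: string, cancelEventName: string, broadcast: Broadcast, timeoutMs = 60_000) {
    this.requestEventName = requestEventName;
    this.cancelEventName = cancelEventName;
    this.broadcast = broadcast;
    this.timeoutMs = timeoutMs;
  }

  protected request(toolName: string, input: unknown, conversationId: string, signal: AbortSignal): Promise<TResultPayload> {
    const requestId = randomUUID();
    return new Promise<TResultPayload>((resolve, reject) => {
      const timer = setTimeout(() => this.fail(requestId, "timeout"), this.timeoutMs);
      this.pending.set(requestId, { resolve, reject, timer });
      signal.addEventListener(
        "abort",
        () => {
          if (!this.pending.has(requestId)) return; // already settled
          this.broadcast({ kind: this.cancelEventName, requestId, conversationId });
          this.fail(requestId, "aborted");
        },
        { once: true },
      );
      this.broadcast({ kind: this.requestEventName, requestId, conversationId, toolName, input });
    });
  }

  // Resolves the matching pending request; a no-op if the id is unknown.
  resolve(requestId: string, payload: TResultPayload): void {
    this.take(requestId)?.resolve(payload);
  }

  // Rejects everything still pending and clears the timers.
  dispose(): void {
    for (const id of [...this.pending.keys()]) this.fail(id, "disposed");
  }

  private take(requestId: string): PendingEntry<TResultPayload> | undefined {
    const entry = this.pending.get(requestId);
    if (!entry) return undefined;
    clearTimeout(entry.timer);
    this.pending.delete(requestId);
    return entry;
  }

  private fail(requestId: string, reason: string): void {
    this.take(requestId)?.reject(new Error(reason));
  }
}

// Minimal test subclass in the spirit of step 3.
class TestProxy extends HostProxyBaseSketch<{ result: string }> {
  send(signal: AbortSignal): Promise<{ result: string }> {
    return this.request("test_tool", {}, "conv-1", signal);
  }
  get pendingCount(): number {
    return this.pending.size;
  }
}
```

Keeping all settlement paths (resolve, timeout, abort, dispose) behind one take() call is one way to guarantee the timer is cleared exactly once per request.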

Acceptance criteria

  • bun test src/__tests__/host-proxy-base.test.ts passes.
  • bun test src/__tests__/host-cu-proxy.test.ts passes unchanged (no test edits).
  • bunx tsc --noEmit passes for the assistant package.
  • host-cu-proxy.ts no longer contains the pending-map / timeout / abort lifecycle code (it's all in the base).
  • HostCuProxy continues to enforce its 50-step cap and AX-diff loop detection (subclass-specific, not in the base).

PR 2: Add host_app_control capability + message types

Depends on

None

Branch

app-control/pr-2-host-app-control-types

Title

feat(daemon): add host_app_control capability and message types

Files

  • assistant/src/channels/types.ts (modified)
  • assistant/src/daemon/message-types/host-app-control.ts (new)
  • assistant/src/daemon/message-types/index.ts (modified)
  • assistant/src/daemon/message-protocol.ts (modified)
  • assistant/src/runtime/pending-interactions.ts (modified)

Implementation steps

  1. Modify assistant/src/channels/types.ts: add "host_app_control" to the HostProxyCapability union. In the supportsHostProxy capability matrix, add "host_app_control" to the macOS row only.
  2. Create assistant/src/daemon/message-types/host-app-control.ts exporting:
    • interface HostAppControlRequest { kind: "host_app_control_request"; requestId: string; conversationId: string; toolName: string; input: HostAppControlInput }
    • type HostAppControlInput — discriminated union over the 8 tool inputs (start, observe, press, combo, type, click, drag, stop). Each variant includes app: string (bundle ID or process name); start adds optional args?: string[]; press adds key, modifiers?, duration_ms?; etc.
    • interface HostAppControlCancel { kind: "host_app_control_cancel"; requestId: string }
    • type HostAppControlState = "running" | "missing" | "minimized" | "occluded"
    • interface HostAppControlResultPayload { requestId: string; state: HostAppControlState; pngBase64?: string; windowBounds?: { x: number; y: number; width: number; height: number }; executionResult?: string; executionError?: string }
  3. Modify assistant/src/daemon/message-types/index.ts barrel to re-export the new types.
  4. Modify assistant/src/daemon/message-protocol.ts to extend the broadcast envelope union with the new request/cancel kinds.
  5. Modify assistant/src/runtime/pending-interactions.ts to add "host_app_control" to the kind union.
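
One possible shape for the HostAppControlInput union from step 2, written out in full. The per-tool fields follow the tool signatures listed under PR 5; the discriminant field name tool is an assumption based on the note elsewhere in this PR that the discriminator lives in input.tool, and summarizeInput is a hypothetical helper that only demonstrates exhaustive narrowing.

```typescript
type MouseButton = "left" | "right" | "middle";

type HostAppControlInput =
  | { tool: "start"; app: string; args?: string[] }
  | { tool: "observe"; app: string }
  | { tool: "press"; app: string; key: string; modifiers?: string[]; duration_ms?: number }
  | { tool: "combo"; app: string; keys: string[]; duration_ms?: number }
  | { tool: "type"; app: string; text: string }
  | { tool: "click"; app: string; x: number; y: number; button?: MouseButton; double?: boolean }
  | { tool: "drag"; app: string; from_x: number; from_y: number; to_x: number; to_y: number; button?: MouseButton }
  // app is optional here, following the app_control_stop signature in PR 5.
  | { tool: "stop"; app?: string; reason?: string };

// Exhaustive switch: adding a ninth variant without a case fails to compile.
function summarizeInput(input: HostAppControlInput): string {
  switch (input.tool) {
    case "start":
      return `start ${input.app}`;
    case "observe":
      return `observe ${input.app}`;
    case "press":
      return `press ${input.key} in ${input.app}`;
    case "combo":
      return `combo ${input.keys.join("+")} in ${input.app}`;
    case "type":
      return `type ${input.text.length} chars in ${input.app}`;
    case "click":
      return `click ${input.app} at (${input.x}, ${input.y})`;
    case "drag":
      return `drag in ${input.app} from (${input.from_x}, ${input.from_y}) to (${input.to_x}, ${input.to_y})`;
    case "stop":
      return `stop ${input.app ?? "(current app)"}`;
    default: {
      const unreachable: never = input;
      throw new Error(`unhandled input: ${JSON.stringify(unreachable)}`);
    }
  }
}
```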

Acceptance criteria

  • bunx tsc --noEmit passes.
  • Existing tests pass without modification.
  • HostProxyCapability now includes "host_app_control".
  • Importing HostAppControlRequest / HostAppControlResultPayload from the message-types barrel resolves.

PR 3: macOS shared HostAppControl* types

Depends on

None

Branch

app-control/pr-3-macos-shared-types

Title

feat(macos): add HostAppControl request and result types

Files

  • The macOS shared types file currently declaring HostCuRequest / HostCuResultPayload (locate via grep -r "struct HostCuRequest" clients/macos).

Implementation steps

  1. In the same Swift module that declares HostCuRequest, add HostAppControlRequest, HostAppControlInput (Swift enum mirroring the TypeScript discriminated union — one case per tool with associated values), HostAppControlState enum, and HostAppControlResultPayload. Match the TypeScript shapes exactly so Codable round-trips through SSE/HTTP without translation.
  2. Add Codable, Equatable conformances. Match the JSON key names used on the wire (camelCase or snake_case — match whatever HostCuRequest already uses).

Acceptance criteria

  • macOS app builds (cd clients/macos && ./build.sh test or equivalent).
  • A round-trip JSONEncoder().encode(HostAppControlRequest(...)) / JSONDecoder().decode(HostAppControlRequest.self, ...) test passes for at least one variant of each tool.

PR 4: HostAppControlProxy class

Depends on

PR 1, PR 2

Branch

app-control/pr-4-host-app-control-proxy

Title

feat(daemon): add HostAppControlProxy over HostProxyBase

Files

  • assistant/src/daemon/host-app-control-proxy.ts (new)
  • assistant/src/__tests__/host-app-control-proxy.test.ts (new)

Implementation steps

  1. Create assistant/src/daemon/host-app-control-proxy.ts with:
    • Module-level let activeAppControlConversationId: string | undefined — singleton lock that ensures only one conversation can hold an app-control session at a time.
    • class HostAppControlProxy extends HostProxyBase<HostAppControlRequest, HostAppControlResultPayload>. Constructor passes capability "host_app_control", request event "host_app_control_request", cancel event "host_app_control_cancel", pending kind "host_app_control", default timeout 60_000.
    • Subclass state: private activeApp?: { bundleId?: string; pid?: number; name: string }, private lastObservationHash?: string, private actionHistory: string[] /* ring buffer of last 5 tool+input fingerprints */.
    • Public async request(toolName: string, input: HostAppControlInput, conversationId: string, signal: AbortSignal): Promise<ToolExecutionResult>:
      • For app_control_start: if activeAppControlConversationId is set and != conversationId, reject with a clear error including the holding conversation id.
      • Forward to super.request(...) to get a HostAppControlResultPayload.
      • Update lastObservationHash (sha256 of pngBase64) when state === "running" and PNG present; if 5 successive observations produce an identical hash, attach a "stuck" warning to the result.
      • Format the payload into ToolExecutionResult content blocks: PNG → image block with mime image/png; error → text block; otherwise text summary.
      • On app_control_start success, set activeAppControlConversationId = conversationId and store activeApp.
    • Override dispose() to call super.dispose() then release the singleton lock if held by this conversation.
  2. Create assistant/src/__tests__/host-app-control-proxy.test.ts. Cover: (a) start round-trip, (b) singleton lock — second conversation's start rejects, (c) PNG-hash loop guard fires after 5 identical observations, (d) dispose() releases the lock, (e) abort propagates to super.request and triggers cancel SSE frame.
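
The singleton lock from step 1 can be sketched as three small functions. The variable name activeAppControlConversationId comes from the plan; the function names are hypothetical, and in the plan this logic sits inside HostAppControlProxy.request() and dispose() rather than in free functions.

```typescript
// Module-level lock: at most one conversation holds an app-control session.
let activeAppControlConversationId: string | undefined;

// Called before forwarding app_control_start to the client.
function assertLockAvailable(conversationId: string): void {
  if (
    activeAppControlConversationId !== undefined &&
    activeAppControlConversationId !== conversationId
  ) {
    throw new Error(
      `app-control session is already held by conversation ${activeAppControlConversationId}`,
    );
  }
}

// Called once app_control_start has succeeded.
function holdLock(conversationId: string): void {
  activeAppControlConversationId = conversationId;
}

// Called from dispose(); only the holder can release.
function releaseLock(conversationId: string): void {
  if (activeAppControlConversationId === conversationId) {
    activeAppControlConversationId = undefined;
  }
}
```

Separating the check from the hold mirrors the plan's ordering: the lock is only taken after the client reports a successful start, so a failed start does not leave a stale holder.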

Acceptance criteria

  • bun test src/__tests__/host-app-control-proxy.test.ts passes.
  • HostCuProxy tests still pass (no regression from base sharing).
  • The 50-step cap from CU is not present in HostAppControlProxy (different policy).

PR 5: app-control proxy-tool definitions

Depends on

PR 2

Branch

app-control/pr-5-app-control-tool-defs

Title

feat(tools): add app-control proxy-tool definitions

Files

  • assistant/src/tools/app-control/definitions.ts (new)
  • assistant/src/tools/app-control/skill-proxy-bridge.ts (new)
  • assistant/src/__tests__/app-control-tool-schemas.test.ts (new)

Implementation steps

  1. Create assistant/src/tools/app-control/definitions.ts exporting appControlTools: Tool[]. Mirror the structure of assistant/src/tools/computer-use/definitions.ts — each tool:
    • Has executionMode: "proxy".
    • execute() is a stub that throws "app-control tool must be forwarded to the connected client" (matches CU's pattern at definitions.ts:18-22).
    • Includes reasoning: string and activity: string fields like CU.
    • Eight tools:
      • app_control_start({ app: string, args?: string[] }) — risk Medium.
      • app_control_observe({ app: string }) — risk Low.
      • app_control_press({ app: string, key: string, modifiers?: string[], duration_ms?: number }) — risk Low. duration_ms default 50.
      • app_control_combo({ app: string, keys: string[], duration_ms?: number }) — risk Low.
      • app_control_type({ app: string, text: string }) — risk Low.
      • app_control_click({ app: string, x: number, y: number, button?: "left" | "right" | "middle", double?: boolean }) — risk Low.
      • app_control_drag({ app: string, from_x: number, from_y: number, to_x: number, to_y: number, button?: "left" | "right" | "middle" }) — risk Low.
      • app_control_stop({ app?: string, reason?: string }) — risk Low, terminal.
  2. Create assistant/src/tools/app-control/skill-proxy-bridge.ts exporting forwardAppControlProxyTool() (mirror assistant/src/tools/computer-use/skill-proxy-bridge.ts:16-28). This is the helper the bundled skill stubs will import in PR 12.
  3. Create assistant/src/__tests__/app-control-tool-schemas.test.ts. Cover: each tool's input schema validates well-formed inputs, rejects malformed ones (missing app, wrong button enum, etc.). Eight tools total.
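
As a rough illustration of step 1, the eight stubs could be generated from one factory. The tool names, risk levels, executionMode value, and error message are taken from the list above; the ProxyToolSketch interface and defineProxyTool helper are hypothetical, and input schemas plus the reasoning/activity fields are left out for brevity.

```typescript
type RiskLevel = "Low" | "Medium";

// Hypothetical, reduced tool shape for illustration only.
interface ProxyToolSketch {
  name: string;
  executionMode: "proxy";
  risk: RiskLevel;
  execute(): never;
}

function defineProxyTool(name: string, risk: RiskLevel): ProxyToolSketch {
  return {
    name,
    executionMode: "proxy",
    risk,
    // Proxy tools never run in the daemon; the client executes them.
    execute(): never {
      throw new Error("app-control tool must be forwarded to the connected client");
    },
  };
}

const TOOL_SUFFIXES = ["start", "observe", "press", "combo", "type", "click", "drag", "stop"];

const appControlTools: ProxyToolSketch[] = TOOL_SUFFIXES.map((suffix) =>
  // Only start is Medium risk in the plan; every other tool is Low.
  defineProxyTool(`app_control_${suffix}`, suffix === "start" ? "Medium" : "Low"),
);
```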

Acceptance criteria

  • bun test src/__tests__/app-control-tool-schemas.test.ts passes.
  • bunx tsc --noEmit passes.
  • All eight tools have executionMode: "proxy" and stub execute.

PR 6: macOS AppWindowCapture.swift

Depends on

PR 3

Branch

app-control/pr-6-app-window-capture

Title

feat(macos): add per-app window screenshot helper

Files

  • clients/macos/vellum-assistant/AppControl/AppWindowCapture.swift (new)
  • clients/macos/vellum-assistantTests/AppWindowCaptureTests.swift (new)

Implementation steps

  1. Create clients/macos/vellum-assistant/AppControl/AppWindowCapture.swift:
    • enum AppWindowCapture namespace with static func capture(forPid pid: pid_t) -> CaptureResult.
    • struct CaptureResult { let state: HostAppControlState; let pngBase64: String?; let bounds: CGRect? }.
    • Implementation: query CGWindowListCopyWindowInfo([.optionOnScreenOnly, .excludeDesktopElements], kCGNullWindowID). Filter to entries where kCGWindowOwnerPID == pid AND kCGWindowLayer == 0. Pick the first match (frontmost normal window). If none, return .missing or .minimized (distinguish: if the process is alive but has no on-screen normal window, .minimized; if the process isn't running, .missing — caller resolves running state separately).
    • Capture the window with CGWindowListCreateImage(.null, .optionIncludingWindow, windowID, [.bestResolution, .boundsIgnoreFraming]). Encode the image as PNG via CGImageDestinationFinalize. Return the base64 string.
  2. Create clients/macos/vellum-assistantTests/AppWindowCaptureTests.swift:
    • Resolve a known-running app (e.g., Finder via NSRunningApplication.runningApplications(withBundleIdentifier: "com.apple.finder").first?.processIdentifier). Capture and assert non-nil bounds and PNG data starting with the PNG magic header \x89PNG.
    • With an obviously-fake PID (e.g., pid_t(999_999)), assert .missing.

Acceptance criteria

  • macOS test target builds.
  • AppWindowCaptureTests passes locally on a developer machine with Finder running.
  • The helper returns .missing for unknown PIDs without crashing.

PR 7: macOS AppKeyboard.swift

Depends on

PR 3

Branch

app-control/pr-7-app-keyboard

Title

feat(macos): add per-process keyboard input helper

Files

  • clients/macos/vellum-assistant/AppControl/AppKeyboard.swift (new)
  • clients/macos/vellum-assistantTests/AppKeyboardMapTests.swift (new)

Implementation steps

  1. Create clients/macos/vellum-assistant/AppControl/AppKeyboard.swift:
    • enum AppKeyboard namespace.
    • static let keyMap: [String: CGKeyCode] — friendly names to virtual key codes. Cover "a"-"z", "0"-"9", "enter", "return", "tab", "escape", "space", "backspace", "delete", "up", "down", "left", "right", function keys "f1"-"f12". Standard Carbon kVK_* constants.
    • static func modifierFlags(_ mods: [String]) -> CGEventFlags — translates ["cmd","shift","option","control","fn"] to CGEventFlags.
    • static func press(pid: pid_t, key: String, modifiers: [String], durationMs: Int) async throws — synthesizes CGEvent(keyboardEventSource: nil, virtualKey: code, keyDown: true), applies flags, posts via CGEventPostToPid(pid, event). Sleeps durationMs (default 50). Posts the matching keyUp. Releases on Task.isCancelled.
    • static func combo(pid: pid_t, keys: [String], durationMs: Int) async throws — keyDown all, sleep, keyUp all (in reverse order). Releases all on cancellation.
    • static func type(pid: pid_t, text: String) async throws — for each character, decompose into key + modifier flags, post press/release. Use CGEvent(keyboardEventSource:nil, virtualKey:0, keyDown:true) plus event.keyboardSetUnicodeString(stringLength:..., unicodeString:...) for non-ASCII.
  2. Create clients/macos/vellum-assistantTests/AppKeyboardMapTests.swift:
    • Assert keyMap["enter"] == kVK_Return, keyMap["a"] == kVK_ANSI_A, keyMap["up"] == kVK_UpArrow, etc. Spot-check ~10 entries.
    • Assert modifierFlags(["cmd","shift"]) equals [.maskCommand, .maskShift].
    • Real CGEventPostToPid cannot be unit-tested headlessly — note in the test file's header comment that input behavior is verified manually.

Acceptance criteria

  • macOS test target builds.
  • AppKeyboardMapTests passes.
  • The helper uses CGEventPostToPid (process-scoped), not CGEventPost (global). Grep test in PR 16 enforces this.

PR 8: macOS AppMouse.swift

Depends on

PR 3

Branch

app-control/pr-8-app-mouse

Title

feat(macos): add per-process mouse input helper

Files

  • clients/macos/vellum-assistant/AppControl/AppMouse.swift (new)
  • clients/macos/vellum-assistantTests/AppMouseTests.swift (new)

Implementation steps

  1. Create clients/macos/vellum-assistant/AppControl/AppMouse.swift:
    • enum AppMouse namespace.
    • static func click(pid: pid_t, windowBounds: CGRect, x: Double, y: Double, button: MouseButton, double: Bool) throws — converts window-relative (x, y) to global screen coordinates using windowBounds, synthesizes CGEvent(mouseEventSource: nil, mouseType: .leftMouseDown, mouseCursorPosition: globalPoint, mouseButton: .left) (or right/center), posts via CGEventPostToPid(pid, event). For double: true, set event.setIntegerValueField(.mouseEventClickState, value: 2).
    • static func drag(pid: pid_t, windowBounds: CGRect, fromX: Double, fromY: Double, toX: Double, toY: Double, button: MouseButton) throws — mouseDown at from, mouseDragged sequence (e.g., 10 interpolated points), mouseUp at to.
    • enum MouseButton: String { case left, right, middle } mapped to CGMouseButton.
  2. Create clients/macos/vellum-assistantTests/AppMouseTests.swift:
    • Test the window-relative-to-global coordinate translation logic (factor it out into a pure helper if possible). For window at (100, 200, 800, 600) and click at (10, 20), global should be (110, 220).
    • Test the drag interpolation produces N intermediate points strictly between from and to.
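
The two pure helpers that these tests target can be sketched as follows. This is a minimal sketch, written in TypeScript only to keep the arithmetic easy to check; the real helper would live in AppMouse.swift. The names Rect, Point, windowToGlobal and interpolateDrag are hypothetical, and the sketch assumes windowBounds is already expressed in the same global coordinate space as the event position.

```typescript
// Hypothetical shapes standing in for CGRect / CGPoint.
interface Rect { x: number; y: number; width: number; height: number }
interface Point { x: number; y: number }

// Translate a window-relative point into global screen coordinates
// by offsetting it with the window origin.
function windowToGlobal(bounds: Rect, p: Point): Point {
  return { x: bounds.x + p.x, y: bounds.y + p.y };
}

// Return `steps` evenly spaced points strictly between `from` and `to`.
// Both endpoints are excluded; mouseDown and mouseUp cover them.
function interpolateDrag(from: Point, to: Point, steps: number): Point[] {
  const points: Point[] = [];
  for (let i = 1; i <= steps; i++) {
    const t = i / (steps + 1);
    points.push({
      x: from.x + (to.x - from.x) * t,
      y: from.y + (to.y - from.y) * t,
    });
  }
  return points;
}

const globalPoint = windowToGlobal(
  { x: 100, y: 200, width: 800, height: 600 },
  { x: 10, y: 20 },
);
console.log(globalPoint); // { x: 110, y: 220 }
```

Dividing by steps + 1 rather than steps is what keeps every generated point strictly inside the segment.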

Acceptance criteria

  • macOS test target builds.
  • AppMouseTests passes.
  • Helper uses CGEventPostToPid (process-scoped).

PR 9: POST /v1/host-app-control-result route

Depends on

PR 4

Branch

app-control/pr-9-host-app-control-route

Title

feat(runtime): add /v1/host-app-control-result route

Files

  • assistant/src/runtime/routes/host-app-control-routes.ts (new)
  • assistant/src/runtime/routes/index.ts (modified)
  • assistant/src/__tests__/host-app-control-routes.test.ts (new)

Implementation steps

  1. Create assistant/src/runtime/routes/host-app-control-routes.ts exporting registerHostAppControlRoutes(app). Mirror the shape of assistant/src/runtime/routes/host-cu-routes.ts:22-127. The handler:
    • Parses body as HostAppControlResultPayload.
    • Looks up the conversation via requestId (or via conversationId in the payload — match CU's lookup pattern).
    • Calls conversation.hostAppControlProxy?.resolve(requestId, payload). Returns 200 even if the proxy is gone (late delivery).
  2. Modify assistant/src/runtime/routes/index.ts to call registerHostAppControlRoutes(app) alongside the existing CU registration.
  3. Create assistant/src/__tests__/host-app-control-routes.test.ts:
    • POST a result with no matching pending request → 200, no crash.
    • POST a result that resolves a pending request from a fake conversation → assert the awaiting promise resolves with the payload.
    • POST malformed body → 400.
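
The three test cases above pin down the handler's status-code behavior, which can be sketched as a pure function. This is an illustration under assumptions, not the actual route: ResultPayload, pending and handleResult are hypothetical names, and the payload is reduced to two fields for brevity.

```typescript
// Hypothetical, reduced stand-in for HostAppControlResultPayload.
interface ResultPayload { requestId: string; state: string; executionError?: string }

type Resolver = (payload: ResultPayload) => void;

// Pending requests keyed by requestId (stand-in for the proxy's bookkeeping).
const pending = new Map<string, Resolver>();

// Runtime shape check, since the HTTP body arrives untyped.
function isResultPayload(body: unknown): body is ResultPayload {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  return typeof b.requestId === "string" && typeof b.state === "string";
}

// Returns the HTTP status the route would send.
function handleResult(body: unknown): number {
  if (!isResultPayload(body)) return 400; // malformed body
  const resolve = pending.get(body.requestId);
  if (resolve) {
    pending.delete(body.requestId);
    resolve(body);
  }
  // A missing entry means late delivery, which is not an error.
  return 200;
}
```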

Acceptance criteria

  • bun test src/__tests__/host-app-control-routes.test.ts passes.
  • bunx tsc --noEmit passes.

PR 10: Conversation lifecycle wiring

Depends on

PR 4

Branch

app-control/pr-10-conversation-lifecycle

Title

feat(daemon): wire HostAppControlProxy into Conversation lifecycle

Files

  • assistant/src/daemon/conversation.ts (modified)
  • assistant/src/__tests__/conversation-app-control-lifecycle.test.ts (new)

Implementation steps

  1. Modify assistant/src/daemon/conversation.ts. Mirror the four hostCuProxy references (around lines 120, 206, 760, 932 — verify exact lines at edit time):
    • Add field hostAppControlProxy?: HostAppControlProxy.
    • Add setHostAppControlProxy(proxy: HostAppControlProxy | undefined): void.
    • In dispose(), call this.hostAppControlProxy?.dispose() then null it out.
    • Anywhere CU is checked for availability/teardown alongside conversation events, add the parallel app-control check.
  2. Create assistant/src/__tests__/conversation-app-control-lifecycle.test.ts:
    • Construct a Conversation, attach a fake HostAppControlProxy, dispose the conversation, assert dispose() was called on the proxy.
    • Setting a new proxy after dispose throws or is a no-op (match CU's behavior).
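
The lifecycle contract in these steps can be sketched in isolation. ConversationSketch and DisposableProxy are hypothetical stand-ins for Conversation and HostAppControlProxy; disposing the previous proxy on replacement is an assumption here, and the behavior after dispose should follow whatever CU actually does.

```typescript
// Hypothetical stand-in for HostAppControlProxy: only dispose() matters here.
interface DisposableProxy { dispose(): void }

class ConversationSketch {
  hostAppControlProxy: DisposableProxy | undefined = undefined;

  // Assumption: replacing the proxy disposes the previous one first.
  setHostAppControlProxy(proxy: DisposableProxy | undefined): void {
    if (this.hostAppControlProxy && this.hostAppControlProxy !== proxy) {
      this.hostAppControlProxy.dispose();
    }
    this.hostAppControlProxy = proxy;
  }

  // Tear down the proxy and clear the field so a second dispose() is a no-op.
  dispose(): void {
    this.hostAppControlProxy?.dispose();
    this.hostAppControlProxy = undefined;
  }
}
```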

Acceptance criteria

  • bun test src/__tests__/conversation-app-control-lifecycle.test.ts passes.
  • Existing conversation.ts tests pass unchanged.

PR 11: surfaceProxyResolver extension

Depends on

PR 4, PR 5

Branch

app-control/pr-11-surface-proxy-resolver

Title

feat(daemon): route app_control_* tools through HostAppControlProxy

Files

  • assistant/src/daemon/conversation-surfaces.ts (modified)
  • assistant/src/__tests__/conversation-surfaces-app-control.test.ts (new)

Implementation steps

  1. Modify assistant/src/daemon/conversation-surfaces.ts. Find surfaceProxyResolver (around line 1739; verify at edit time). Add a sibling block to the computer_use_* branch:
    • When toolName.startsWith("app_control_"):
      • If ctx.hostAppControlProxy == null || !ctx.hostAppControlProxy.isAvailable(), return a ToolExecutionResult indicating unavailability with a hint to enable the app-control flag and connect a macOS client.
      • If toolName === "app_control_stop", call ctx.hostAppControlProxy.dispose() and return a local "stopped" summary without a client round-trip (matches the CU _done/_respond pattern at conversation-surfaces.ts:1755-1766).
      • Otherwise call ctx.hostAppControlProxy.request(toolName, input, conversationId, signal) and return its result.
  2. Create assistant/src/__tests__/conversation-surfaces-app-control.test.ts:
    • Resolver returns "unavailable" when no proxy is attached.
    • Resolver routes app_control_observe to the proxy (mock) and returns its result.
    • Resolver short-circuits app_control_stop locally without invoking the proxy's request.

Acceptance criteria

  • bun test src/__tests__/conversation-surfaces-app-control.test.ts passes.
  • CU surface resolution tests still pass.

PR 12: Bundled skill assets + feature flag

Depends on

PR 5

Branch

app-control/pr-12-bundled-skill-assets

Title

feat(skills): add app-control bundled skill and feature flag

Files

  • meta/feature-flags/feature-flag-registry.json (modified)
  • assistant/src/config/bundled-skills/app-control/SKILL.md (new)
  • assistant/src/config/bundled-skills/app-control/TOOLS.json (new)
  • assistant/src/config/bundled-skills/app-control/tools/app-control-start.ts (new)
  • assistant/src/config/bundled-skills/app-control/tools/app-control-observe.ts (new)
  • assistant/src/config/bundled-skills/app-control/tools/app-control-press.ts (new)
  • assistant/src/config/bundled-skills/app-control/tools/app-control-combo.ts (new)
  • assistant/src/config/bundled-skills/app-control/tools/app-control-type.ts (new)
  • assistant/src/config/bundled-skills/app-control/tools/app-control-click.ts (new)
  • assistant/src/config/bundled-skills/app-control/tools/app-control-drag.ts (new)
  • assistant/src/config/bundled-skills/app-control/tools/app-control-stop.ts (new)
  • The bundled-skills loader index file (locate by grep for bundled-skills/computer-use registration; modify)

Implementation steps

  1. Modify meta/feature-flags/feature-flag-registry.json. Append a new entry (preserve existing key order):
    { "id": "app-control", "scope": "assistant", "key": "app-control", "label": "App Control", "description": "Enable the app-control skill (per-app screenshot + raw input bypassing AX tree)", "defaultEnabled": false }
  2. Create SKILL.md with frontmatter name: app-control, feature-flag: app-control. Body content:
    • When to use: only when explicitly directed to drive a specific named app via raw input (e.g., emulators, games, OpenGL canvases) where the macOS Accessibility tree is unhelpful. Prefer computer-use for general macOS UI navigation.
    • Cadence: take 2-3 actions per turn, then yield with a short narration so the user can interject.
    • Always app_control_observe before acting if the screen state matters.
    • Prefer app_control_combo over rapid sequential app_control_press for simultaneous inputs.
    • Coordinate caveat: click/drag are window-relative; the window may move between observation and click — re-observe if uncertain.
    • Use bundle IDs (e.g. io.example.app) when possible; fall back to localized process names.
    • Stop the session with app_control_stop when done; do not auto-quit the controlled app.
  3. Create TOOLS.json mirroring the schema of computer-use/TOOLS.json. For each of the 8 tools, declare execution_target: "host", the proxy stub script path, and the JSON schema.
  4. Create the 8 tool stub scripts under tools/. Each is one line: import { forwardAppControlProxyTool } from "../../../../tools/app-control/skill-proxy-bridge.js"; export default forwardAppControlProxyTool("app_control_<name>");
  5. Modify the bundled-skills loader index file to register "app-control" alongside "computer-use".

Acceptance criteria

  • bun test passes for any existing bundled-skills registration / SKILL.md frontmatter / feature-flag-registry guard tests.
  • The skill appears in the bundled-skills catalog when the flag is enabled.
  • bunx tsc --noEmit passes.

PR 13: macOS AppControlExecutor.swift

Depends on

PR 6, PR 7, PR 8

Branch

app-control/pr-13-app-control-executor

Title

feat(macos): add AppControlExecutor dispatching tool actions

Files

  • clients/macos/vellum-assistant/AppControl/AppControlExecutor.swift (new)
  • clients/macos/vellum-assistantTests/AppControlExecutorTests.swift (new)

Implementation steps

  1. Create clients/macos/vellum-assistant/AppControl/AppControlExecutor.swift:
    • enum AppControlExecutor { static func perform(_ request: HostAppControlRequest) async -> HostAppControlResultPayload }.
    • Switch on request.input enum cases:
      • .start(app, args): resolve to pid_t via NSRunningApplication.runningApplications(withBundleIdentifier:) (preferred) or by localized name fallback. If not running, launch via NSWorkspace.shared.openApplication(at:configuration:). On launch failure return .missing with executionError.
      • .observe(app): resolve PID, call AppWindowCapture.capture(forPid:), return .running/.minimized/.missing with PNG.
      • .press(...): resolve PID, call AppKeyboard.press(...) with cancellation propagated from the request task.
      • .combo(...): same with AppKeyboard.combo(...).
      • .type(...): same with AppKeyboard.type(...).
      • .click(...): resolve PID, fetch current window bounds via AppWindowCapture, call AppMouse.click(...). If window not visible, return .minimized.
      • .drag(...): same with AppMouse.drag(...).
      • .stop(...): return a "stopped" summary; do NOT terminate the app.
    • private static func resolvePid(forApp app: String) -> pid_t? — central resolver. Return nil if no match, or the first pid if multiple match (and include the count in executionResult for transparency).
  2. Create clients/macos/vellum-assistantTests/AppControlExecutorTests.swift:
    • With a non-existent bundle ID, .start returns .missing with a meaningful executionError.
    • .stop always succeeds and never resolves a PID.
    • Mock the underlying helpers if practical; otherwise restrict tests to dispatch logic only and rely on PR 16 for end-to-end coverage.

Acceptance criteria

  • macOS test target builds.
  • AppControlExecutorTests passes.

PR 14: Proxy instantiation in conversation routes

Depends on

PR 10, PR 11

Branch

app-control/pr-14-proxy-instantiation

Title

feat(runtime): instantiate HostAppControlProxy for capable clients

Files

  • assistant/src/runtime/routes/conversation-routes.ts (modified)
  • assistant/src/daemon/process-message.ts (modified)
  • assistant/src/__tests__/conversation-app-control-instantiation.test.ts (new)

Implementation steps

  1. Modify assistant/src/runtime/routes/conversation-routes.ts around line 1394-1406 (the existing CU instantiation block — verify lines at edit time). Add a parallel block:
    • When supportsHostProxy(sourceInterface, "host_app_control"), unconditionally instantiate new HostAppControlProxy(...), call conversation.setHostAppControlProxy(proxy), and call conversation.addPreactivatedSkillId("app-control"). Do not gate on the feature flag here. The skill-projection layer (which reads feature-flag: app-control from SKILL.md frontmatter at skill-state.ts:35-37) is the single source of gating — it filters the skill out of the model's tool list when the flag resolves to false, regardless of whether the proxy is preactivated. The proxy itself is harmless when unused (small per-conversation object, no model can reach its tools when the skill is filtered).
  2. Modify assistant/src/daemon/process-message.ts around line 155-162 (the existing CU block — verify at edit time) with the same parallel unconditional logic.
  3. Create assistant/src/__tests__/conversation-app-control-instantiation.test.ts:
    • With a macOS client connection, the proxy is attached and app-control is preactivated regardless of the feature flag value.
    • With a non-macOS client, no proxy is attached.
    • With the flag off, the skill does NOT appear in the projected tool list (assert via the skill-projection layer, not the proxy).

Acceptance criteria

  • bun test src/__tests__/conversation-app-control-instantiation.test.ts passes.
  • CU instantiation continues to work unchanged.
  • The flag is read in exactly one place: the skill-projection layer's existing frontmatter check. Grep assistant/src for the literal string "app-control" outside SKILL.md / TOOLS.json / registry / tests — there should be zero hits.

PR 15: macOS connection setup wiring + capability advertisement

Depends on

PR 13

Branch

app-control/pr-15-macos-wiring

Title

feat(macos): wire AppControlExecutor into connection setup

Files

  • clients/macos/vellum-assistant/App/AppDelegate+ConnectionSetup.swift (modified)
  • The macOS file that advertises host_cu on SSE connect (locate via grep for "host_cu" under clients/macos; modify)
  • clients/macos/vellum-assistantTests/AppControlConnectionTests.swift (new)

Implementation steps

  1. Modify clients/macos/vellum-assistant/App/AppDelegate+ConnectionSetup.swift around line 410-480 (the existing hostCu* cases — verify at edit time):
    • Add case .hostAppControlRequest(let req): call Task { let result = await AppControlExecutor.perform(req); await postResult(to: "/v1/host-app-control-result", payload: result) }.
    • Add case .hostAppControlCancel(let req): call into the same cancellation mechanism CU uses (locate the matching hostCuCancel handler and mirror it).
  2. Modify the SSE-connect capability advertisement to include "host_app_control" alongside "host_cu".
  3. Create clients/macos/vellum-assistantTests/AppControlConnectionTests.swift:
    • Decode a sample HostAppControlRequest JSON and assert it dispatches to AppControlExecutor (mock the executor).
    • Assert capability advertisement includes both host_cu and host_app_control.

Acceptance criteria

  • macOS app builds.
  • Connection-setup tests pass.

PR 16: End-to-end integration flow test

Depends on

PR 9, PR 12, PR 14, PR 15

Branch

app-control/pr-16-integration-flow

Title

test(app-control): end-to-end mocked SSE flow

Files

  • assistant/src/__tests__/app-control-flow.test.ts (new)
  • assistant/src/__tests__/app-control-no-global-cgevent.test.ts (new — guard test)

Implementation steps

  1. Create assistant/src/__tests__/app-control-flow.test.ts. Drive a fake conversation through:
    • app_control_start({ app: "com.example.app" }) → assert SSE broadcast captured with correct requestId and toolName.
    • POST a fake result to /v1/host-app-control-result → assert the start tool call resolves with the expected ToolExecutionResult.
    • app_control_observe({ app: "com.example.app" }) → assert the screenshot PNG bytes appear as an image content block in the result.
    • app_control_stop() → assert no SSE broadcast (resolved locally) and dispose() is called on the proxy.
    • Mirror cu-unified-flow.test.ts patterns for fixture setup. Use generic test data — bundle ID com.example.app, no real product names.
  2. Create assistant/src/__tests__/app-control-no-global-cgevent.test.ts. This is a static-analysis guard:
    • Read all Swift files under clients/macos/vellum-assistant/AppControl/ via the file system.
    • Assert none contain CGEventPost( (the global posting function — must use CGEventPostToPid for process-scoping).
    • Allow the literal string in test files (AppControlExecutorTests.swift etc.) only if explicitly suppressed via a // allow: CGEventPost comment on the same line.
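
The core of the guard can be sketched as a pure function over file contents, leaving the directory walk to the test. findGlobalCGEventPost is a hypothetical name, and the sketch applies the allow comment to every file for simplicity. A plain substring match is enough because "CGEventPostToPid(" does not contain "CGEventPost(".

```typescript
// Scan Swift sources (path -> contents) for calls to the global CGEventPost.
// Returns "path:line" for each offending line.
function findGlobalCGEventPost(files: Record<string, string>): string[] {
  const offenders: string[] = [];
  for (const [path, source] of Object.entries(files)) {
    source.split("\n").forEach((line, index) => {
      const usesGlobalPost = line.includes("CGEventPost(");
      const suppressed = line.includes("// allow: CGEventPost");
      if (usesGlobalPost && !suppressed) {
        offenders.push(`${path}:${index + 1}`);
      }
    });
  }
  return offenders;
}
```

The test would read the files under clients/macos/vellum-assistant/AppControl/ into the record and assert that the returned list is empty.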

Acceptance criteria

  • bun test src/__tests__/app-control-flow.test.ts passes.
  • bun test src/__tests__/app-control-no-global-cgevent.test.ts passes.
  • bunx tsc --noEmit passes for the assistant package.
  • macOS app builds without warnings.

Rollout strategy: registry-only flag, no LaunchDarkly provisioning

The feature flag added in PR 12 is intentionally not provisioned in vellum-assistant-platform Terraform. The registry default (defaultEnabled: false) is the gate. Trade-offs:

  • No remote kill switch. If a critical bug ships, mitigation is a code revert + redeploy, not an LD flag flip.
  • No graduated rollout. Binary on/off via registry default; no per-user/org targeting.
  • Local override still works. Drop {"app-control": true} into ~/.vellum/protected/feature-flags.json to enable the flag for a single workspace during testing (see the override behavior documented in meta/feature-flags/AGENTS.md).

This is appropriate because the flag controls only the availability of an experimental skill, and the resolver biases toward open (unknown keys → true, see assistant-feature-flags.ts:328), so removing the flag cleanly defaults to GA.
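
The resolution order described in this section can be sketched as follows. This is a hedged illustration, not the code in assistant-feature-flags.ts: resolveFlag and RegistryEntry are hypothetical names, and the precedence (local override, then registry default, then true for unknown keys) is inferred from the text above.

```typescript
// Hypothetical, reduced shape of a registry entry.
interface RegistryEntry { key: string; defaultEnabled: boolean }

function resolveFlag(
  key: string,
  registry: RegistryEntry[],
  overrides: Record<string, unknown>,
): boolean {
  // 1. A local override (protected/feature-flags.json) wins when present.
  const override = overrides[key];
  if (typeof override === "boolean") return override;

  // 2. Otherwise fall back to the registry default.
  const entry = registry.find((e) => e.key === key);
  if (entry) return entry.defaultEnabled;

  // 3. Unknown keys resolve to true, so deleting the registry entry means GA.
  return true;
}

const registry: RegistryEntry[] = [{ key: "app-control", defaultEnabled: false }];
console.log(resolveFlag("app-control", registry, {}));                      // false
console.log(resolveFlag("app-control", registry, { "app-control": true })); // true
console.log(resolveFlag("app-control", [], {}));                            // true
```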

GA-removal procedure (post-MVP, not a planned PR)

When ready to graduate app-control to general availability, ship a single small PR:

  1. Remove feature-flag: app-control from assistant/src/config/bundled-skills/app-control/SKILL.md frontmatter.
  2. Remove the app-control entry from meta/feature-flags/feature-flag-registry.json.

Two-file PR. No code reads the flag (PR 14 enforces this — the proxy is unconditional, gating lives only in the SKILL.md frontmatter), so any orphan entries in ~/.vellum/protected/feature-flags.json from prior overrides are inert. No workspace migration needed.

Risks and edge cases

  • App not installed / not running: start returns state: missing with a clear executionError. SKILL.md instructs the model to ask the user rather than retry.
  • Window minimized or fullscreen-hidden: observe returns state: minimized. Don't auto-unhide.
  • App name ambiguity: prefer bundle ID; fall back to localized name. If multiple processes match, use the first and surface the count in executionResult.
  • Stuck keys on cancel: AppKeyboard releases all held keys on Task.isCancelled. PR 7 tests cover this.
  • Loop guard: compare PNG hashes of consecutive observations; 5 successive identical observations attach a "stuck" warning. (No AX tree to diff.)
  • Concurrent control: module-level lock — only one conversation can hold the active session.
  • macOS Accessibility / Screen Recording permissions: synthetic key events require Accessibility permission; window screenshots require Screen Recording permission. CU already prompts for these; verify the first failure path nudges the user clearly.
  • Coordinate space: click/drag window-relative coords are translated using current window bounds at click time. Window may move between observation and click — re-observe if uncertain.
  • Daemon restart mid-session: proxy state lost; the controlled app's own state persists.
  • Coexistence with computer-use: both can be active. SKILL.md tells the model which paradigm to use when.
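
The loop guard in the list above can be sketched as a small counter over screenshot hashes. StuckDetector and fnv1a are hypothetical names, and FNV-1a is only a dependency-free stand-in for whatever digest the proxy actually uses.

```typescript
// FNV-1a 32-bit hash over raw bytes (illustrative stand-in for a real digest).
function fnv1a(bytes: Uint8Array): number {
  let hash = 0x811c9dc5;
  bytes.forEach((b) => {
    hash ^= b;
    hash = Math.imul(hash, 0x01000193) >>> 0;
  });
  return hash;
}

class StuckDetector {
  private lastHash: number | undefined = undefined;
  private repeats = 0;
  private readonly threshold: number;

  constructor(threshold: number = 5) {
    this.threshold = threshold;
  }

  // Record one observation; returns true once `threshold` consecutive
  // observations have hashed identically.
  observe(png: Uint8Array): boolean {
    const hash = fnv1a(png);
    if (hash === this.lastHash) {
      this.repeats += 1;
    } else {
      this.lastHash = hash;
      this.repeats = 1;
    }
    return this.repeats >= this.threshold;
  }
}
```

Any differing observation resets the count, so the warning only fires on an unbroken run of identical screenshots.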

Out of scope for MVP

  • Auto-installing the controlled app.
  • Targeting non-frontmost windows of the same app.
  • Configurable per-app keymaps.
  • Save state / app-specific helper tools.
  • Audio capture from the controlled app.
  • A passive watch-and-react loop without explicit invocation.
  • Cross-platform clients (macOS only).
  • OCR over the screenshot.
  • Multi-app simultaneous control (one app per session lock).
  • Mouse hover, scroll, gesture events (clicks and drags only).


@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

Comment on lines +1795 to +1802
if (toolName.startsWith("app_control_")) {
  if (!ctx.hostAppControlProxy || !ctx.hostAppControlProxy.isAvailable()) {
    return {
      content:
        "App control is not available — enable the `app-control` feature flag and connect a macOS client.",
      isError: true,
    };
  }
🔴 app_control_stop blocked by isAvailable() check, leaking the module-level singleton lock

The isAvailable() check at assistant/src/daemon/conversation-surfaces.ts:1796 gates ALL app_control_* tools, including app_control_stop, which is designed to be a local short-circuit that does not need a client round-trip. If the macOS client disconnects while a conversation holds the singleton app-control lock, the model cannot call app_control_stop to release it — the function returns an error before reaching the stop short-circuit at line 1812. This leaks the module-level activeAppControlConversationId lock (host-app-control-proxy.ts:74), preventing any other conversation from starting an app-control session until the locking conversation is disposed or the client reconnects.

Triggering scenario
  1. macOS client connects → proxy created, isAvailable() true
  2. Model calls app_control_start → singleton lock acquired
  3. macOS client disconnects (network issue, app restart, etc.)
  4. Model calls app_control_stop → line 1796: !ctx.hostAppControlProxy.isAvailable() is true → returns error "not available"
  5. Singleton lock is never released
  6. Another conversation calls app_control_start → rejected with "conversation X currently holds the session"
Prompt for agents
The app_control_stop check at line 1812 must execute BEFORE the isAvailable() guard at line 1796, because stop is a local short-circuit that tears down the proxy and releases the singleton lock without needing a client round-trip. The current ordering means a client disconnect prevents stop from ever running, leaking the module-level singleton lock.

Approach: Move the app_control_stop short-circuit block (lines 1812-1819) to before the isAvailable() check. The stop path only needs the proxy to exist (not be available), so the check should be: if proxy exists AND toolName is app_control_stop, execute the local teardown. The isAvailable() check should only gate tools that actually need a client round-trip (start, observe, press, click, etc.).

Alternatively, split the availability check: check for proxy existence first (needed for all tools including stop), then check isAvailable() only for non-stop tools. Something like:

  if (!ctx.hostAppControlProxy) return unavailable error;
  if (toolName === 'app_control_stop') { ...dispose and return... }
  if (!ctx.hostAppControlProxy.isAvailable()) return unavailable error;
  // ...rest of tools...
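
The split check suggested above can be sketched as a pure routing decision, which makes the ordering easy to test. routeAppControlTool and the Route labels are hypothetical; the real resolver returns a ToolExecutionResult and calls dispose() or request() on the proxy.

```typescript
// Hypothetical decision labels for the three outcomes of the resolver.
type Route = "unavailable" | "stop_locally" | "forward_to_client";

// Only isAvailable() matters for the routing decision.
interface AvailabilityProbe { isAvailable(): boolean }

function routeAppControlTool(
  proxy: AvailabilityProbe | undefined,
  toolName: string,
): Route {
  // Every tool, including stop, needs a proxy object to exist.
  if (!proxy) return "unavailable";

  // Stop is local teardown, so it must run even when the client is gone.
  if (toolName === "app_control_stop") return "stop_locally";

  // Everything else needs a client round-trip.
  if (!proxy.isAvailable()) return "unavailable";
  return "forward_to_client";
}
```

With this ordering, a disconnected client no longer prevents app_control_stop from releasing the session lock.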
Comment on lines +1414 to +1425
if (supportsHostProxy(sourceInterface, "host_app_control")) {
  if (!conversation.isProcessing() || !conversation.hostAppControlProxy) {
    conversation.setHostAppControlProxy(
      new HostAppControlProxy(mapping.conversationId),
    );
  }
  if (!conversation.isProcessing()) {
    conversation.addPreactivatedSkillId("app-control");
  }
} else if (!conversation.isProcessing()) {
  conversation.setHostAppControlProxy(undefined);
}
@devin-ai-integration devin-ai-integration Bot May 3, 2026

🚩 Proxy recreation on each idle turn releases the singleton lock between messages

Both conversation-routes.ts:1414-1425 and process-message.ts:168-177 create a new HostAppControlProxy whenever !conversation.isProcessing() — even if one already exists with an active singleton lock. setHostAppControlProxy(newProxy) disposes the old proxy (conversation.ts:951-956), which releases the lock (host-app-control-proxy.ts:314-318). This means between user turns, the lock is briefly released and re-acquirable by another conversation. This mirrors the existing CU proxy pattern (which also recreates per-turn), so it's intentional design rather than a bug. The model would need to call app_control_start again to re-acquire the lock in a new turn. The SKILL.md cadence instruction ("Take 2-3 actions per turn, then yield") implies sessions are per-turn.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb72e80543

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1781 to +1784
let fromX = try container.decode(Double.self, forKey: .fromX)
let fromY = try container.decode(Double.self, forKey: .fromY)
let toX = try container.decode(Double.self, forKey: .toX)
let toY = try container.decode(Double.self, forKey: .toY)
P1: Decode drag coordinates from snake_case keys

The drag variant currently decodes required coordinates from fromX/fromY/toX/toY, but the daemon’s app-control contract sends from_x/from_y/to_x/to_y (as defined in the TS tool input schema). Because these fields are required here, host_app_control_request drag messages fail to decode and the client drops the event instead of executing the action.

let app = try container.decode(String.self, forKey: .app)
let key = try container.decode(String.self, forKey: .key)
let modifiers = try container.decodeIfPresent([String].self, forKey: .modifiers)
let durationMs = try container.decodeIfPresent(Int.self, forKey: .durationMs)
P2: Decode press/combo hold duration from snake_case key

The decoder reads hold duration from durationMs, but the wire contract uses duration_ms; as a result, valid duration inputs from the assistant are silently dropped and both press/combo fall back to the default 50ms hold. This changes tool behavior for any flow that relies on longer key holds.

siddseethepalli and others added 4 commits May 3, 2026 02:31
…#29350)

The longest-prefix matcher with HOST_PREFIX_KEYS_BY_LENGTH was over-engineered for current state — every registered key matches a stripped stem exactly. Replace with a direct table lookup keyed on the stem (after stripping _request/_cancel). Behaviorally identical for all currently-defined message types; existing tests still pass.

Part of plan: app-control-skill.md (slop cleanup)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Two pieces of dead public API surface caught by self-review:
1. HostProxyBase.cancel() was only invoked by its own test file; the production cancel path runs via AbortSignal handling inside dispatchRequest.
2. HostAppControlProxy.activeApp / ActiveApp / currentApp are written in the start-success branch but only read by tests; the actual singleton mechanism is activeAppControlConversationId.

Delete both with their tests.

Part of plan: app-control-skill.md (slop cleanup)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…29352)

Two Swift fields decoded but never consumed:
1. HostAppControlRequest.toolName — AppControlExecutor switches on input only; the discriminator lives in input.tool.
2. HostAppControlCancel.conversationId — AppDelegate's cancel handler invokes cancelHostAppControlRequest(msg.requestId) and never reads conversationId. The sibling HostCuCancelRequest doesn't carry it either, so the 'wire-shape parity' rationale was inconsistent.

The wire envelope still includes both fields (daemon-side TS types unchanged); Swift's Codable silently ignores them on decode.

Part of plan: app-control-skill.md (slop cleanup)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
The same supportsHostProxy(sourceInterface, capability) gate plus addPreactivatedSkillId(skillId) pattern appeared in four places (conversation-routes.ts, process-message.ts, two paths in conversation-process.ts) — one entry per host-proxy capability per call site. Consolidate into a single source of truth: HOST_PROXY_SKILL_PREACTIVATIONS and preactivateHostProxySkills(). Adding a new host-proxy capability now means updating one list, not four call sites.

Part of plan: app-control-skill.md (slop cleanup)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
@siddseethepalli
Contributor Author

Slop cleanup follow-ups merged

All 6 stylistic items from the round-2 self-review are now addressed on this branch:

| # | Issue | PR |
|---|-------|----|
| 1 | Four-site duplication of host-proxy preactivation logic | #29353 |
| 2 | HostProxyBase.cancel() test-only public API removed | #29351 |
| 3 | HostAppControlProxy.activeApp / currentApp (test-only readback) removed | #29351 |
| 4 | HostAppControlRequest.toolName (Swift, unread) removed | #29352 |
| 5 | HostAppControlCancel.conversationId (Swift, unread) removed | #29352 |
| 6 | HOST_PREFIX_KEYS_BY_LENGTH over-engineered prefix matcher → direct lookup | #29350 |

Branch now has 25 commits (16 implementation + 5 round-1 fix + 4 slop cleanup). Ready for manual review.

siddseethepalli and others added 2 commits May 3, 2026 04:40
…yload (#29357)

ScreenCaptureKit failures (most commonly: Screen Recording permission not granted) silently returned nil from captureWindowPNG, and AppWindowCapture.capture(forPid:) still reported state: running with no PNG. Daemon and LLM saw a 'successful' payload with no error and no image — confusing for the user, who has no signal that the macOS app is missing a permission.

Wire the underlying error string through CaptureResult.captureError into HostAppControlResultPayload.executionError. The window state remains correctly classified (running/minimized/missing); the new error field is an orthogonal signal that capture itself failed even though the window exists.

For click/drag tools, the executor only surfaces the capture error when window bounds are also missing — we only need the bounds for those tools, so a missing PNG is non-fatal there.

Part of plan: app-control-skill.md (post-merge UX fix)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…ence + observe settle delay (#29363)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
devin-ai-integration[bot]

This comment was marked as resolved.

- Register host-app-control-result route policy (approval.write scope)
- Regenerate bundled-tool-registry.ts to include app-control-sequence
- Regenerate openapi.yaml for /v1/host-app-control-result endpoint

Fixes failing CI: Test (bundled-tool-registry-guard, guard-tests),
OpenAPI Spec Check, and Lint (knip unused-files) on #29343.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

Comment on lines +148 to +161
if (toolName === TOOL_START) {
  if (
    activeAppControlConversationId != null &&
    activeAppControlConversationId !== conversationId
  ) {
    return {
      content:
        `Another conversation (${activeAppControlConversationId}) currently holds the ` +
        `app-control session. Wait for it to finish, or call app_control_stop ` +
        `from that conversation first.`,
      isError: true,
    };
  }
}
🚩 Singleton lock uses this.conversationId for acquisition but method parameter for guard check

In HostAppControlProxy.request(), the lock guard at line 150 compares activeAppControlConversationId !== conversationId (method parameter), but handleSuccess() at line 217 sets activeAppControlConversationId = this.conversationId (instance field). If these ever diverge, the lock could be acquired under one identity but guarded under another. All current call sites pass the same value for both (the proxy is constructed with the conversation ID and called with the same), so this doesn't manifest in practice. However, the API signature accepts conversationId as a parameter, implying it could differ. A future refactor or misuse could trigger inconsistent lock behavior. Consider using this.conversationId consistently for both paths, or removing the conversationId parameter in favor of always using the instance field.
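
A minimal sketch of the suggested direction, in which both the guard and the acquisition read the instance field. ProxySketch and tryStart are hypothetical names, and the sketch collapses guard and acquisition into one step, whereas the real proxy acquires the lock in handleSuccess() after the client reports success.

```typescript
// Module-level singleton lock, mirroring activeAppControlConversationId.
let activeConversationId: string | null = null;

class ProxySketch {
  private readonly conversationId: string;

  constructor(conversationId: string) {
    this.conversationId = conversationId;
  }

  // Guard and acquire under the same identity (the instance field),
  // so the two can never diverge.
  tryStart(): boolean {
    if (
      activeConversationId !== null &&
      activeConversationId !== this.conversationId
    ) {
      return false; // another conversation holds the session
    }
    activeConversationId = this.conversationId;
    return true;
  }

  // Release the lock only if this proxy's conversation holds it.
  dispose(): void {
    if (activeConversationId === this.conversationId) {
      activeConversationId = null;
    }
  }
}
```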

…eys (#29372)

Co-authored-by: Vellum Assistant <assistant@vellum.ai>
@siddseethepalli siddseethepalli merged commit 92ef6dd into main May 3, 2026
14 checks passed
@siddseethepalli siddseethepalli deleted the siddseethepalli/app-control-skill branch May 3, 2026 09:25