feat: app-control bundled skill (per-app screenshot + raw input) #29343
Add the host_app_control capability to the HostProxyCapability union (macOS only) and declare the wire types (HostAppControlRequest, HostAppControlInput discriminated union, HostAppControlCancel, HostAppControlState, HostAppControlResultPayload). No consumers yet — this is type-only scaffolding for the proxy class in PR 4. Part of plan: app-control-skill.md (PR 2 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add Swift types (HostAppControlRequest, HostAppControlInput discriminated enum, HostAppControlCancel, HostAppControlState, HostAppControlResultPayload, WindowBounds) mirroring the TypeScript wire shapes added in PR 2. Codable round-trip matches the JSON conventions used by HostCuRequest. Part of plan: app-control-skill.md (PR 3 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Extract the structurally-shared lifecycle (pending map, timeout, abort SSE, dispose, isAvailable) from HostCuProxy into a new abstract HostProxyBase class. HostCuProxy now extends the base and retains only CU-specific state (step counter, AX-tree diff, loop detector). Part of plan: app-control-skill.md (PR 1 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
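The shared lifecycle described in this commit (pending map, timeout, dispose, isAvailable) might look roughly like the sketch below. It is a simplified illustration, not the actual `HostProxyBase`: the `send` hook, the `pendingCount` getter, the request-ID format, and the 30-second default are all assumptions, and the abort-SSE path is omitted.

```typescript
// Hypothetical sketch of a shared host-proxy lifecycle. Only the class name and
// the resolve/dispose/isAvailable method names come from the PR text.
type Pending<T> = {
  resolve: (payload: T) => void;
  reject: (err: Error) => void;
  timer: ReturnType<typeof setTimeout>;
};

abstract class HostProxyBase<TRequest, TResult> {
  private pending = new Map<string, Pending<TResult>>();
  private nextId = 0;
  private disposed = false;
  private timeoutMs: number;

  constructor(timeoutMs: number = 30000) {
    this.timeoutMs = timeoutMs;
  }

  // Subclasses decide how a request reaches the client (for example an SSE broadcast).
  protected abstract send(requestId: string, request: TRequest): void;

  // Assumption: the real check probably also considers client connectivity.
  isAvailable(): boolean {
    return !this.disposed;
  }

  get pendingCount(): number {
    return this.pending.size;
  }

  request(request: TRequest): Promise<TResult> {
    if (this.disposed) return Promise.reject(new Error("proxy disposed"));
    const requestId = `req-${this.nextId++}`;
    return new Promise<TResult>((resolve, reject) => {
      // Reject and forget the request if no result arrives in time.
      const timer = setTimeout(() => {
        this.pending.delete(requestId);
        reject(new Error(`request ${requestId} timed out`));
      }, this.timeoutMs);
      this.pending.set(requestId, { resolve, reject, timer });
      this.send(requestId, request);
    });
  }

  // Called when the client posts a result back; returns false for unknown IDs.
  resolve(requestId: string, payload: TResult): boolean {
    const entry = this.pending.get(requestId);
    if (!entry) return false;
    clearTimeout(entry.timer);
    this.pending.delete(requestId);
    entry.resolve(payload);
    return true;
  }

  // Rejects everything still in flight and marks the proxy unusable.
  dispose(): void {
    this.disposed = true;
    this.pending.forEach((entry, id) => {
      clearTimeout(entry.timer);
      entry.reject(new Error(`request ${id} aborted: proxy disposed`));
    });
    this.pending.clear();
  }
}
```

A subclass would then keep only its own state (step counter, loop detector, and so on) and implement `send`.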
Define the 8 app-control proxy tools (start, observe, press, combo, type, click, drag, stop) with executionMode: 'proxy' and stub execute() that throws. Add forwardAppControlProxyTool() bridge helper. Mirrors the computer-use tool-definition pattern. Part of plan: app-control-skill.md (PR 5 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add HostAppControlProxy extending the shared HostProxyBase. Owns app-control-specific state: per-instance active-app, PNG-hash loop guard (5 identical observations -> stuck warning), and a module-level singleton lock so only one conversation holds an active session at a time. Calling dispose() releases the lock. Part of plan: app-control-skill.md (PR 4 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
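The module-level singleton lock could be sketched as below. Only the variable name `activeAppControlConversationId` appears in this PR; the `AppControlSessionLock` class and its method names are hypothetical and exist purely to isolate the locking rule from the rest of the proxy.

```typescript
// Module-level state: at most one conversation holds the app-control session.
let activeAppControlConversationId: string | null = null;

// Hypothetical helper; in the PR this logic lives inside HostAppControlProxy.
class AppControlSessionLock {
  private readonly conversationId: string;

  constructor(conversationId: string) {
    this.conversationId = conversationId;
  }

  // Returns the holder's ID when a different conversation owns the session.
  blockedBy(): string | null {
    const holder = activeAppControlConversationId;
    return holder !== null && holder !== this.conversationId ? holder : null;
  }

  // Intended to run after a successful start; re-acquiring by the holder is a no-op.
  acquire(): boolean {
    if (this.blockedBy() !== null) return false;
    activeAppControlConversationId = this.conversationId;
    return true;
  }

  // Intended to run from dispose(); a non-holder cannot release someone else's lock.
  release(): void {
    if (activeAppControlConversationId === this.conversationId) {
      activeAppControlConversationId = null;
    }
  }
}
```

Reading the conversation ID from the instance in both the guard and the acquisition keeps the two checks tied to the same identity.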
Add AppKeyboard helper that posts synthetic keyboard events to a target process via CGEventPostToPid (NOT CGEventPost) so input is scoped to the target app and never leaks to other foregrounded windows. Supports press (with optional hold duration), combo (simultaneous multi-key hold), and type (Unicode-aware string typing). On cancellation, all held keys are released before re-throwing. Part of plan: app-control-skill.md (PR 7 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add AppMouse helper that posts synthetic mouse clicks and drags to a target process via CGEventPostToPid (NOT CGEventPost). Coordinates are window-relative and translated to global at post time. Click supports left/right/middle and an optional double-click flag (sets mouseEventClickState=2). Drag posts mouseDown -> 10 interpolated mouseDragged events -> mouseUp. Part of plan: app-control-skill.md (PR 8 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
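The coordinate handling for a drag can be illustrated with the sketch below. The real helper is Swift and posts events via `CGEventPostToPid`; this TypeScript version only shows the arithmetic. It assumes that window bounds give the window's origin in global coordinates, that interpolation is linear, and that the last interpolated point coincides with the drop position. None of those details are confirmed by the PR text, and the type and function names are illustrative.

```typescript
interface Point {
  x: number;
  y: number;
}

interface WindowBounds {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Translate a window-relative point into global coordinates.
function toGlobal(p: Point, bounds: WindowBounds): Point {
  return { x: bounds.x + p.x, y: bounds.y + p.y };
}

// Global positions for the mouseDragged events between mouseDown and mouseUp.
function dragPath(from: Point, to: Point, bounds: WindowBounds, steps: number = 10): Point[] {
  const start = toGlobal(from, bounds);
  const end = toGlobal(to, bounds);
  const path: Point[] = [];
  for (let i = 1; i <= steps; i++) {
    const t = i / steps; // fraction of the way from start to end
    path.push({
      x: start.x + (end.x - start.x) * t,
      y: start.y + (end.y - start.y) * t,
    });
  }
  return path;
}
```

Translating at post time, rather than when the tool call is issued, means the path still lands inside the window if it moved between the observation and the action.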
Add AppWindowCapture for capturing the frontmost normal window of a target process by PID. Returns CaptureResult with state (running/missing/minimized) and PNG base64 + window bounds when available. Distinguishes a missing process from a minimized one. PNG encoding via CGImageDestination. Part of plan: app-control-skill.md (PR 6 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add the result-pickup HTTP endpoint that the macOS client POSTs to after executing an app-control action. Mirrors the host-cu-result route. Forwards the payload to conversation.hostAppControlProxy.resolve(requestId, payload). Adds the field declaration on Conversation; full lifecycle wiring lands in PR 10. Part of plan: app-control-skill.md (PR 9 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Register the new app-control bundled skill (SKILL.md + TOOLS.json + 8 tool stubs forwarding through skill-proxy-bridge). Add the app-control feature flag (defaultEnabled: false, scope: assistant). The skill is gated by the flag via SKILL.md frontmatter; no in-code flag checks needed since the projection layer handles gating. Part of plan: app-control-skill.md (PR 12 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Implement AppControlExecutor that switches on HostAppControlRequest.input and dispatches to AppWindowCapture (async, ScreenCaptureKit-backed since macOS 15 deprecated CGWindowListCreateImage), AppKeyboard, and AppMouse. Resolves the target app to a pid_t via bundle ID first then localized name. Click/drag fetch current window bounds before posting events. Part of plan: app-control-skill.md (PR 13 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…29329) Mirror the four hostCuProxy attachment points in Conversation: declare the field, add setHostAppControlProxy, dispose the proxy in Conversation.dispose, and parallel any teardown/availability checks. PR 9 added the field declaration; this PR completes the lifecycle wiring. Part of plan: app-control-skill.md (PR 10 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…29331) Add a sibling branch to the computer_use_* dispatch in surfaceProxyResolver. app_control_stop is handled locally (calls proxy.dispose, returns a stopped summary, no client round-trip), matching CU's _done/_respond pattern. All other app_control_* tools forward to ctx.hostAppControlProxy.request. Returns an isError unavailability result when no proxy or no client connected. Part of plan: app-control-skill.md (PR 11 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Add hostAppControlRequest and hostAppControlCancel handlers in the SSE message dispatch, mirroring the existing hostCu* handlers. Each request launches a cancellable Task that calls AppControlExecutor.perform(_:) and POSTs the result to /v1/host-app-control-result. Capability advertisement now includes both host_cu and host_app_control. Part of plan: app-control-skill.md (PR 15 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…29333) When a connecting client supports the host_app_control capability, unconditionally instantiate HostAppControlProxy and attach it to the Conversation, plus preactivate the app-control skill. The feature flag is read only by the skill-projection layer via SKILL.md frontmatter — no in-code flag check is needed since unreached tools are harmless. Part of plan: app-control-skill.md (PR 14 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…9335) Add an end-to-end app-control flow test driving a fake conversation through start -> observe -> stop with mocked SSE broadcasts and POSTs to /v1/host-app-control-result, plus singleton-lock coverage. Add a static-analysis guard that fails if any AppControl swift file uses the deprecated global CGEventPost (CGEventPostToPid / CGEvent.postToPid are required). Part of plan: app-control-skill.md (PR 16 of 16) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
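A static-analysis guard of the kind described here might scan Swift source text for the global posting call, roughly as in the sketch below. The function name, the naive `//` comment stripping, and the extra check for the Swift method form `post(tap:)` are assumptions for illustration; the actual guard in this PR may work differently.

```typescript
// Returns 1-based line numbers that appear to use the global event-posting API.
// CGEventPostToPid(...) and event.postToPid(...) are deliberately not matched.
function findGlobalCGEventPost(swiftSource: string): number[] {
  const offending: number[] = [];
  swiftSource.split("\n").forEach((line, index) => {
    // Naive comment stripping: good enough for a sketch, not a Swift parser.
    const code = line.replace(/\/\/.*$/, "");
    const usesCFunction = /\bCGEventPost\s*\(/.test(code);
    const usesSwiftMethod = /\.post\s*\(\s*tap\s*:/.test(code);
    if (usesCFunction || usesSwiftMethod) offending.push(index + 1);
  });
  return offending;
}
```

A test could read every AppControl Swift file, run this function, and fail with the file name and line numbers when the returned array is non-empty.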
The 400-line tools/app-control/definitions.ts was referenced only by app-control-tool-schemas.test.ts. The production bundled-skill path uses TOOLS.json + bundled-tool-registry.ts. The hand-duplicated schemas in definitions.ts had no sync enforcement against TOOLS.json. Rewrite the schema test to validate TOOLS.json directly. The skill-proxy-bridge.ts helper is preserved (the bundled-skill stubs still use it). Part of plan: app-control-skill.md (fix round 1) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…e capability correctly (#29339)
Two production-breaking fixes for app-control:
1. registerPendingInteraction now handles host_app_control_request by registering with kind: 'host_app_control'. Without this, every result POST from the macOS client fell through the route handler's early-return and the proxy's promise never resolved.
2. capabilityForMessageType now matches the longest prefix before the trailing _request/_cancel suffix. Previously it sliced to the second underscore, mapping host_app_control_request to undefined and broadcasting to all subscribers instead of routing only to host_app_control-capable clients.
Part of plan: app-control-skill.md (fix round 1)
Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…ad state (#29340)
Four entangled correctness fixes:
1. surfaceProxyResolver injects 'tool' (e.g. 'start', 'observe') derived from toolName before forwarding to HostAppControlProxy. Without this, the Swift client could not decode requests and the singleton-lock guard never fired.
2. app_control_stop now clears the Conversation's hostAppControlProxy reference after dispose so subsequent tool calls cleanly fail with 'unavailable' instead of dispatching to a disposed proxy.
3. Delete the write-only _actionHistory ring buffer, recordActionFingerprint method, and actionHistory getter; nothing in production read them.
4. PNG-hash STUCK_REPEAT_THRESHOLD lowered from 5 to 4 so the warning fires after 5 total identical observations as the plan specified, not 6.
Part of plan: app-control-skill.md (fix round 1)
Co-authored-by: Vellum Assistant <assistant@vellum.ai>
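The off-by-one in fix 4 is easier to see in code. In the sketch below the threshold counts repeats of the previous observation, so a threshold of 4 fires on the 5th identical observation in a row. Only `STUCK_REPEAT_THRESHOLD` is named in the PR; the `ObservationLoopGuard` class, its `record` method, and the choice to count consecutive repeats are assumptions.

```typescript
// A repeat is an observation whose hash equals the previous one, so
// 4 repeats means 5 identical observations in total.
const STUCK_REPEAT_THRESHOLD = 4;

// Hypothetical helper; the hash is assumed to be computed elsewhere from the PNG bytes.
class ObservationLoopGuard {
  private lastHash: string | null = null;
  private repeats = 0;

  // Records one observation and reports whether the stuck warning should fire.
  record(pngHash: string): boolean {
    if (pngHash === this.lastHash) {
      this.repeats += 1;
    } else {
      this.lastHash = pngHash;
      this.repeats = 0;
    }
    return this.repeats >= STUCK_REPEAT_THRESHOLD;
  }
}
```

With the previous threshold of 5, the same loop would need a 6th identical observation before returning true.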
Both dequeue paths in conversation-process.ts reset preactivatedSkillIds and only re-added computer-use. Add the parallel re-add for app-control so the skill remains projected for queued messages 2+, mirroring the CU branch. Part of plan: app-control-skill.md (fix round 1) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…ccluded state (#29341)
Two wire-type coherence fixes:
1. HostAppControlCancel (TS + Swift) was missing conversationId, but host-proxy-base.ts has always sent it on the wire. Schema now matches the actual envelope, matching HostCuCancelRequest's shape.
2. Drop the HostAppControlState.occluded variant from TS, Swift, the route Zod schema, TOOLS.json, and definitions.ts. AppWindowCapture only emits running/minimized/missing; nothing produces occluded. Re-add when a producer exists.
Part of plan: app-control-skill.md (fix round 1)
Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Plan papertrail: `app-control` bundled skill
    if (toolName.startsWith("app_control_")) {
      if (!ctx.hostAppControlProxy || !ctx.hostAppControlProxy.isAvailable()) {
        return {
          content:
            "App control is not available — enable the `app-control` feature flag and connect a macOS client.",
          isError: true,
        };
      }
🔴 app_control_stop blocked by isAvailable() check, leaking the module-level singleton lock
The isAvailable() check at assistant/src/daemon/conversation-surfaces.ts:1796 gates ALL app_control_* tools, including app_control_stop, which is designed to be a local short-circuit that does not need a client round-trip. If the macOS client disconnects while a conversation holds the singleton app-control lock, the model cannot call app_control_stop to release it — the function returns an error before reaching the stop short-circuit at line 1812. This leaks the module-level activeAppControlConversationId lock (host-app-control-proxy.ts:74), preventing any other conversation from starting an app-control session until the locking conversation is disposed or the client reconnects.
Triggering scenario
- macOS client connects → proxy created, isAvailable() true
- Model calls app_control_start → singleton lock acquired
- macOS client disconnects (network issue, app restart, etc.)
- Model calls app_control_stop → line 1796: !ctx.hostAppControlProxy.isAvailable() is true → returns error "not available"
- Singleton lock is never released
- Another conversation calls app_control_start → rejected with "conversation X currently holds the session"
Prompt for agents
The app_control_stop check at line 1812 must execute BEFORE the isAvailable() guard at line 1796, because stop is a local short-circuit that tears down the proxy and releases the singleton lock without needing a client round-trip. The current ordering means a client disconnect prevents stop from ever running, leaking the module-level singleton lock.
Approach: Move the app_control_stop short-circuit block (lines 1812-1819) to before the isAvailable() check. The stop path only needs the proxy to exist (not be available), so the check should be: if proxy exists AND toolName is app_control_stop, execute the local teardown. The isAvailable() check should only gate tools that actually need a client round-trip (start, observe, press, click, etc.).
Alternatively, split the availability check: check for proxy existence first (needed for all tools including stop), then check isAvailable() only for non-stop tools. Something like:
if (!ctx.hostAppControlProxy) return unavailable error;
if (toolName === 'app_control_stop') { ...dispose and return... }
if (!ctx.hostAppControlProxy.isAvailable()) return unavailable error;
// ...rest of tools...
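The ordering suggested above can be written as a small runnable sketch. The interfaces, the `routeAppControlTool` name, the `"forward"` sentinel, and the stopped-summary wording are hypothetical; only `isAvailable`, `dispose`, `hostAppControlProxy`, and the `app_control_stop` tool name come from the PR. Clearing the proxy reference after dispose follows the behaviour described in the fix-round commits.

```typescript
interface ToolResult {
  content: string;
  isError: boolean;
}

// Minimal stand-in for the parts of the proxy this guard needs.
interface AppControlProxyLike {
  isAvailable(): boolean;
  dispose(): void;
}

interface ResolverContext {
  hostAppControlProxy: AppControlProxyLike | undefined;
}

const UNAVAILABLE: ToolResult = {
  content: "App control is not available.",
  isError: true,
};

// Returns a local result, or "forward" when the call needs a client round-trip.
function routeAppControlTool(ctx: ResolverContext, toolName: string): ToolResult | "forward" {
  const proxy = ctx.hostAppControlProxy;
  // 1. Every tool, including stop, needs the proxy to exist.
  if (!proxy) return UNAVAILABLE;
  // 2. Stop is local: tear down and release the lock even if the client is gone.
  if (toolName === "app_control_stop") {
    proxy.dispose();
    ctx.hostAppControlProxy = undefined;
    return { content: "App control session stopped.", isError: false };
  }
  // 3. Only tools that need the client are gated on availability.
  if (!proxy.isAvailable()) return UNAVAILABLE;
  return "forward";
}
```

With this ordering, a client disconnect no longer prevents the holding conversation from releasing the session.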
    if (supportsHostProxy(sourceInterface, "host_app_control")) {
      if (!conversation.isProcessing() || !conversation.hostAppControlProxy) {
        conversation.setHostAppControlProxy(
          new HostAppControlProxy(mapping.conversationId),
        );
      }
      if (!conversation.isProcessing()) {
        conversation.addPreactivatedSkillId("app-control");
      }
    } else if (!conversation.isProcessing()) {
      conversation.setHostAppControlProxy(undefined);
    }
🚩 Proxy recreation on each idle turn releases the singleton lock between messages
Both conversation-routes.ts:1414-1425 and process-message.ts:168-177 create a new HostAppControlProxy whenever !conversation.isProcessing() — even if one already exists with an active singleton lock. setHostAppControlProxy(newProxy) disposes the old proxy (conversation.ts:951-956), which releases the lock (host-app-control-proxy.ts:314-318). This means between user turns, the lock is briefly released and re-acquirable by another conversation. This mirrors the existing CU proxy pattern (which also recreates per-turn), so it's intentional design rather than a bug. The model would need to call app_control_start again to re-acquire the lock in a new turn. The SKILL.md cadence instruction ("Take 2-3 actions per turn, then yield") implies sessions are per-turn.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb72e80543
    let fromX = try container.decode(Double.self, forKey: .fromX)
    let fromY = try container.decode(Double.self, forKey: .fromY)
    let toX = try container.decode(Double.self, forKey: .toX)
    let toY = try container.decode(Double.self, forKey: .toY)
Decode drag coordinates from snake_case keys
The drag variant currently decodes required coordinates from fromX/fromY/toX/toY, but the daemon’s app-control contract sends from_x/from_y/to_x/to_y (as defined in the TS tool input schema). Because these fields are required here, host_app_control_request drag messages fail to decode and the client drops the event instead of executing the action.
    let app = try container.decode(String.self, forKey: .app)
    let key = try container.decode(String.self, forKey: .key)
    let modifiers = try container.decodeIfPresent([String].self, forKey: .modifiers)
    let durationMs = try container.decodeIfPresent(Int.self, forKey: .durationMs)
Decode press/combo hold duration from snake_case key
The decoder reads hold duration from durationMs, but the wire contract uses duration_ms; as a result, valid duration inputs from the assistant are silently dropped and both press/combo fall back to the default 50ms hold. This changes tool behavior for any flow that relies on longer key holds.
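Both decoding comments come down to the same rule: read the snake_case wire keys and map them to the client's own field names. The sketch below shows that mapping on the TypeScript side for illustration; the real fix belongs in the Swift `CodingKeys`. The interface names, helper functions, and camelCase field names are hypothetical. The wire keys (`from_x`, `from_y`, `to_x`, `to_y`, `duration_ms`) and the 50 ms default are taken from the review comments.

```typescript
// Default hold applied when duration_ms is absent, per the review comment.
const DEFAULT_HOLD_MS = 50;

interface PressInput {
  tool: "press";
  app: string;
  key: string;
  modifiers: string[];
  durationMs: number;
}

interface DragInput {
  tool: "drag";
  app: string;
  fromX: number;
  fromY: number;
  toX: number;
  toY: number;
}

type WireObject = Record<string, unknown>;

function requireString(wire: WireObject, key: string): string {
  const value = wire[key];
  if (typeof value !== "string") throw new Error(`missing string field "${key}"`);
  return value;
}

function requireNumber(wire: WireObject, key: string): number {
  const value = wire[key];
  if (typeof value !== "number") throw new Error(`missing numeric field "${key}"`);
  return value;
}

// Optional fields fall back to defaults instead of failing the decode.
function decodePress(wire: WireObject): PressInput {
  const duration = wire["duration_ms"];
  const modifiers = wire["modifiers"];
  return {
    tool: "press",
    app: requireString(wire, "app"),
    key: requireString(wire, "key"),
    modifiers: Array.isArray(modifiers) ? modifiers.map(String) : [],
    durationMs: typeof duration === "number" ? duration : DEFAULT_HOLD_MS,
  };
}

// Required coordinates are read from snake_case keys; camelCase keys are rejected.
function decodeDrag(wire: WireObject): DragInput {
  return {
    tool: "drag",
    app: requireString(wire, "app"),
    fromX: requireNumber(wire, "from_x"),
    fromY: requireNumber(wire, "from_y"),
    toX: requireNumber(wire, "to_x"),
    toY: requireNumber(wire, "to_y"),
  };
}
```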
…#29350) The longest-prefix matcher with HOST_PREFIX_KEYS_BY_LENGTH was over-engineered for current state — every registered key matches a stripped stem exactly. Replace with a direct table lookup keyed on the stem (after stripping _request/_cancel). Behaviorally identical for all currently-defined message types; existing tests still pass. Part of plan: app-control-skill.md (slop cleanup) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
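A direct table lookup of the kind described might look like the sketch below. The table name and its contents are assumptions: only `host_cu` and `host_app_control` are mentioned in this PR, and the real table may hold other capabilities. Returning `undefined` for unknown or suffix-less types is also an assumed behaviour.

```typescript
// Hypothetical table keyed on the message-type stem (type minus _request/_cancel).
const CAPABILITY_BY_STEM = new Map<string, string>([
  ["host_cu", "host_cu"],
  ["host_app_control", "host_app_control"],
]);

function capabilityForMessageType(messageType: string): string | undefined {
  // Strip exactly one trailing _request or _cancel suffix.
  const stem = messageType.replace(/_(request|cancel)$/, "");
  // No suffix removed means this is not a host-proxy request or cancel message.
  if (stem === messageType) return undefined;
  return CAPABILITY_BY_STEM.get(stem);
}
```

Stripping the suffix rather than slicing at a fixed underscore is what lets a multi-word stem such as `host_app_control` resolve to its own capability.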
Two pieces of dead public API surface caught by self-review:
1. HostProxyBase.cancel() was only invoked by its own test file; the production cancel path runs via AbortSignal handling inside dispatchRequest.
2. HostAppControlProxy.activeApp / ActiveApp / currentApp are written in the start-success branch but only read by tests; the actual singleton mechanism is activeAppControlConversationId.
Delete both with their tests.
Part of plan: app-control-skill.md (slop cleanup)
Co-authored-by: Vellum Assistant <assistant@vellum.ai>
…29352)
Two Swift fields decoded but never consumed:
1. HostAppControlRequest.toolName — AppControlExecutor switches on input only; the discriminator lives in input.tool.
2. HostAppControlCancel.conversationId — AppDelegate's cancel handler invokes cancelHostAppControlRequest(msg.requestId) and never reads conversationId. The sibling HostCuCancelRequest doesn't carry it either, so the 'wire-shape parity' rationale was inconsistent.
The wire envelope still includes both fields (daemon-side TS types unchanged); Swift's Codable silently ignores them on decode.
Part of plan: app-control-skill.md (slop cleanup)
Co-authored-by: Vellum Assistant <assistant@vellum.ai>
The same supportsHostProxy(sourceInterface, capability) gate plus addPreactivatedSkillId(skillId) pattern appeared in four places (conversation-routes.ts, process-message.ts, two paths in conversation-process.ts) — one entry per host-proxy capability per call site. Consolidate into a single source of truth: HOST_PROXY_SKILL_PREACTIVATIONS and preactivateHostProxySkills(). Adding a new host-proxy capability now means updating one list, not four call sites. Part of plan: app-control-skill.md (slop cleanup) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
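The consolidation could take roughly the following shape. `HOST_PROXY_SKILL_PREACTIVATIONS` and `preactivateHostProxySkills` are named in the commit, but the entry shape and the function signature below are guesses: the callbacks stand in for `supportsHostProxy(sourceInterface, capability)` and `conversation.addPreactivatedSkillId(skillId)`.

```typescript
// One entry per host-proxy capability; entry shape is an assumption.
const HOST_PROXY_SKILL_PREACTIVATIONS = [
  { capability: "host_cu", skillId: "computer-use" },
  { capability: "host_app_control", skillId: "app-control" },
] as const;

// Hypothetical signature: callers pass in the capability check and the skill sink.
function preactivateHostProxySkills(
  supports: (capability: string) => boolean,
  addPreactivatedSkillId: (skillId: string) => void,
): void {
  for (const { capability, skillId } of HOST_PROXY_SKILL_PREACTIVATIONS) {
    if (supports(capability)) addPreactivatedSkillId(skillId);
  }
}
```

Each of the four call sites would then make one call, and a new capability would only need a new list entry.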
Slop cleanup follow-ups merged
All 6 stylistic items from the round-2 self-review are now addressed on this branch:
Branch now has 25 commits (16 implementation + 5 round-1 fix + 4 slop cleanup). Ready for manual review.
…yload (#29357) ScreenCaptureKit failures (most commonly: Screen Recording permission not granted) silently returned nil from captureWindowPNG, and AppWindowCapture.capture(forPid:) still reported state: running with no PNG. Daemon and LLM saw a 'successful' payload with no error and no image — confusing for the user, who has no signal that the macOS app is missing a permission. Wire the underlying error string through CaptureResult.captureError into HostAppControlResultPayload.executionError. The window state remains correctly classified (running/minimized/missing); the new error field is an orthogonal signal that capture itself failed even though the window exists. For click/drag tools, the executor only surfaces the capture error when window bounds are also missing — we only need the bounds for those tools, so a missing PNG is non-fatal there. Part of plan: app-control-skill.md (post-merge UX fix) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
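The rule for when a capture failure surfaces as an execution error can be sketched as a small pure function. `captureError` and `executionError` are the field names from the commit; the other field names, the `executionErrorFor` helper, and the example error string are hypothetical.

```typescript
type WindowState = "running" | "minimized" | "missing";

interface WindowBounds {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Assumed shape: window state is classified independently of capture success.
interface CaptureResult {
  state: WindowState;
  pngBase64?: string;
  bounds?: WindowBounds;
  captureError?: string;
}

// Decides what, if anything, goes into the result payload's executionError.
function executionErrorFor(tool: string, capture: CaptureResult): string | undefined {
  if (capture.captureError === undefined) return undefined;
  // Click and drag only need window bounds, so a missing PNG alone is non-fatal.
  const needsOnlyBounds = tool === "click" || tool === "drag";
  if (needsOnlyBounds && capture.bounds !== undefined) return undefined;
  return capture.captureError;
}
```

For example, an observe call with a denied Screen Recording permission would carry the error, while a click that still obtained bounds would proceed without one.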
…ence + observe settle delay (#29363) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
- Register host-app-control-result route policy (approval.write scope)
- Regenerate bundled-tool-registry.ts to include app-control-sequence
- Regenerate openapi.yaml for /v1/host-app-control-result endpoint
Fixes failing CI: Test (bundled-tool-registry-guard, guard-tests), OpenAPI Spec Check, and Lint (knip unused-files) on #29343.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    if (toolName === TOOL_START) {
      if (
        activeAppControlConversationId != null &&
        activeAppControlConversationId !== conversationId
      ) {
        return {
          content:
            `Another conversation (${activeAppControlConversationId}) currently holds the ` +
            `app-control session. Wait for it to finish, or call app_control_stop ` +
            `from that conversation first.`,
          isError: true,
        };
      }
    }
🚩 Singleton lock uses this.conversationId for acquisition but method parameter for guard check
In HostAppControlProxy.request(), the lock guard at line 150 compares activeAppControlConversationId !== conversationId (method parameter), but handleSuccess() at line 217 sets activeAppControlConversationId = this.conversationId (instance field). If these ever diverge, the lock could be acquired under one identity but guarded under another. All current call sites pass the same value for both (the proxy is constructed with the conversation ID and called with the same), so this doesn't manifest in practice. However, the API signature accepts conversationId as a parameter, implying it could differ. A future refactor or misuse could trigger inconsistent lock behavior. Consider using this.conversationId consistently for both paths, or removing the conversationId parameter in favor of always using the instance field.
…eys (#29372) Co-authored-by: Vellum Assistant <assistant@vellum.ai>
Summary
Adds a new `app-control` bundled skill that lets the assistant observe and send raw input (keyboard, mouse) to a specific named macOS application. Complements `computer-use` for cases where the AX tree is unhelpful: emulators, games, OpenGL canvases, custom-rendered Electron apps. Bypasses the AX tree by capturing per-window screenshots and posting input events scoped to a single process via `CGEventPostToPid`.
Architecture: refactor `HostCuProxy` onto a shared `HostProxyBase` (PR 1), then a parallel proxy/tool/skill stack for app-control, then macOS client primitives (window screenshot, keyboard, mouse, executor) and connection wiring.
Self-review result
PASS on integration correctness (round 2). 7 production-breaking issues caught and fixed in 5 follow-up PRs. 6 stylistic/slop items remain as follow-ups (see below; none affect correctness).
PRs merged into feature branch
Implementation (16 PRs)
Self-review fixes (5 PRs)
Rollout
The feature flag `app-control` is registered with `defaultEnabled: false` in `meta/feature-flags/feature-flag-registry.json`. Not provisioned in LaunchDarkly Terraform; the registry default is the gate. Trade-offs: no remote kill switch, no graduated rollout, but local override via `~/.vellum/protected/feature-flags.json` works for testing.
The flag is read in exactly one place: the skill-projection layer's SKILL.md frontmatter check. The proxy is instantiated unconditionally when the macOS client supports `host_app_control`; gating lives only in the SKILL.md frontmatter. To GA later: remove the frontmatter line and the registry entry. No code reads the flag, so orphan overrides in `protected/feature-flags.json` are inert.
Remaining slop (follow-up candidates, not blockers)
Surfaced during round-2 self-review. None are correctness bugs:
- … (`conversation-routes.ts`, `process-message.ts`, two paths in `conversation-process.ts`). An `applyHostProxyPreactivations` helper would consolidate.
- `HostProxyBase.cancel()` is test-only public API; could be deleted or moved to `HostTransferProxy`-style.
- `HostAppControlProxy.activeApp` field is set but only read by tests.
- `HostAppControlRequest.toolName` (Swift) is decoded but unread; the discriminator now lives in `input.tool`.
- `HostAppControlCancel.conversationId` (Swift) is decoded but unread.
- `HOST_PREFIX_KEYS_BY_LENGTH` longest-prefix match is over-engineered for current state; direct lookup would suffice.
Part of plan: app-control-skill.md