feat: voice-driven permission handling, info panel, and voice mode improvements#6095
Conversation
…ions, and info panel - Voice mode panel with OpenAI Whisper STT, ElevenLabs Flash v2.5 TTS, and waveform visualization - Voice-driven permission handling: speaks tool permission requests, listens for yes/no approval - Barge-in support: interrupt TTS by speaking - Auto-listening loop: TTS finishes → automatically starts listening for next turn - Info panel showing STT/LLM/TTS infrastructure details with BETA badge - OpenAI API key configuration in Settings - Keeps user's selected model (no downgrade) for full tool-use capability - Explicit osascript preference for macOS app interactions in voice mode Co-Authored-By: Claude <noreply@anthropic.com>
| let text = isVoiceModeActive | ||
| ? "[Voice conversation — keep spoken responses brief (2-3 sentences) but fully complete the task using any tools needed. Do not give up early. When interacting with macOS apps (Messages, Contacts, Calendar, Reminders, Notes, Mail, etc.), always use osascript with AppleScript — never query databases directly or use sqlite3.]\n\n\(rawText)" | ||
| : rawText |
There was a problem hiding this comment.
🔴 Voice mode system-instruction prefix is stored in user chat messages and used for thread title generation
When isVoiceModeActive is true, sendMessage() prepends a long system instruction ([Voice conversation — keep spoken responses brief...]) to the user's text at line 242-243. This combined text variable is then used everywhere downstream:
Root Cause & Impact
-
Chat UI pollution: The prefixed text is stored in
ChatMessage(role: .user, text: text, ...)atChatViewModel.swift:307(and line 275 for the bootstrap path). The user sees the full system instruction prefix in their chat bubble for every voice message. -
Thread title corruption: The prefixed text is passed to
onFirstUserMessageatChatViewModel.swift:258→callback(text), which feeds intoThreadManager.generateTitle(for:userMessage:)atThreadManager.swift:117. The fallback title viaderiveTitlewould be"[Voice conversation — keep spoken responses bri..."instead of a title derived from the user's actual message.
The fix should use rawText for the chat message display and onFirstUserMessage callback, and only use the prefixed text for the IPC message sent to the daemon.
Prompt for agents
In clients/shared/Features/Chat/ChatViewModel.swift, the sendMessage() function (starting around line 240) uses the voice-prefixed `text` variable for everything: chat message display, thread title generation, and the IPC message to the daemon. Only the IPC message should contain the prefix. Changes needed:
1. At line 256, change `callback(text)` to `callback(rawText)` so thread title generation uses the user's actual message, not the prefixed one.
2. At line 275 and line 307, change the ChatMessage construction to use `rawText` instead of `text` for the message text, so the user sees their original message in the chat UI.
3. The prefixed `text` should only be used for the actual IPC send to the daemon (bootstrapSession and sendUserMessage calls further below).
4. Similarly at line 289 (`currentTurnUserText = text`) and line 344, use `rawText` so turn tracking is based on the user's actual input.
Was this helpful? React with 👍 or 👎 to provide feedback.
| let isApproved = affirmative.contains(where: { lower.contains($0) }) | ||
| let isDenied = negative.contains(where: { lower.contains($0) }) |
There was a problem hiding this comment.
🔴 Voice permission approval takes priority over denial when response contains both affirmative and negative substrings
In handlePermissionResponse, substring matching with lower.contains($0) is used to classify the user's spoken response, and the isApproved branch is checked before isDenied. This means any denial that incidentally contains an affirmative substring will be incorrectly approved.
Security Impact — Tool Permission Bypass
The affirmative list includes short substrings like "ok", "do it", "sure", and "proceed". Common denial phrases contain these as substrings:
- "No, don't do it" →
contains("do it")→isApproved = true→ approved despite clear denial - "I'm not okay with that" →
contains("ok")→isApproved = true→ approved - "No, please don't proceed" →
contains("proceed")→isApproved = true→ approved - "I'm not sure about that" →
contains("sure")→isApproved = true→ approved
Since if isApproved at line 435 is checked before else if isDenied at line 443, the tool permission is granted when the user explicitly said no. This controls whether the assistant can execute shell commands, write files, and perform other privileged operations on the user's Mac.
The fix should either: (a) check isDenied first and treat ambiguous (both true) as denied for safety, or (b) use word-boundary matching instead of substring contains.
| let isApproved = affirmative.contains(where: { lower.contains($0) }) | |
| let isDenied = negative.contains(where: { lower.contains($0) }) | |
| let isApproved = affirmative.contains(where: { lower.contains($0) }) | |
| let isDenied = negative.contains(where: { lower.contains($0) }) | |
| // If both affirmative and negative substrings match (e.g. "no, don't do it" | |
| // contains "do it" + "no"/"don't"), treat as denial for safety. | |
| let effectiveApproved = isApproved && !isDenied | |
| let effectiveDenied = isDenied |
Was this helpful? React with 👍 or 👎 to provide feedback.
| } catch { | ||
| log.error("Failed to start audio engine: \(error.localizedDescription)") | ||
| } |
There was a problem hiding this comment.
🔴 Audio tap not removed when audioEngine.start() fails, causing crash on next recording attempt
In OpenAIVoiceService.startRecording(), installTap(onBus: 0, ...) is called at line 103 before audioEngine.start() at line 145. If audioEngine.start() throws, the catch block at line 150-152 logs the error but does not remove the tap or clean up state.
Crash Scenario
After the failure:
isRecordingremainsfalse(it was never set totrue)- The tap on bus 0 remains installed
recordingFormatis set but not cleared
On the next call to startRecording(), the guard !isRecording check passes, and installTap(onBus: 0, ...) is called again at line 103. AVAudioEngine crashes with an exception if a tap is already installed on bus 0 — you must remove the existing tap first.
This is the same pattern that the existing VoiceInputManager.beginRecording() has (and had to fix), where the catch block at VoiceInputManager.swift:267-273 properly cleans up recognitionRequest and recognitionTask but also would benefit from removing the tap.
| } catch { | |
| log.error("Failed to start audio engine: \(error.localizedDescription)") | |
| } | |
| } catch { | |
| log.error("Failed to start audio engine: \(error.localizedDescription)") | |
| audioEngine.inputNode.removeTap(onBus: 0) | |
| recordingFormat = nil | |
| } |
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 573355fbe4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| } | ||
|
|
||
| // Start monitoring mic for barge-in (interrupt by speaking) | ||
| voiceService.startBargeInMonitor() |
There was a problem hiding this comment.
Skip barge-in monitor when TTS completes immediately
When ElevenLabs is not configured (which is currently allowed), finishTextStream returns via its fast path and invokes the completion synchronously (OpenAIVoiceService.finishTextStream guard at lines 266-269). That completion calls startListening(), so by the time this line runs, recording has already started and installed an input tap; starting the barge-in monitor then tries to install another tap on the same bus, which can trigger AVAudioEngine tap conflicts/crashes for STT-only voice mode users.
Useful? React with 👍 / 👎.
| guard state != .off else { return } | ||
|
|
||
| // Fully shut down audio engine to release the microphone | ||
| voiceService.shutdown() |
There was a problem hiding this comment.
Prevent shutdown from re-entering listening while deactivating
If the user exits voice mode during playback, shutdown() calls stopSpeaking(), which synchronously executes ttsOnComplete (OpenAIVoiceService.stopSpeaking line 319) while VoiceModeManager.state is still .speaking. The stored completion from handleResponseComplete can therefore call startListening() during teardown, leaving the service in an inconsistent recording state (notably isRecording can stay true) and causing subsequent startRecording() calls to early-return.
Useful? React with 👍 / 👎.
1. Permission bypass: 'No, don't do it' was approved because 'do it' matched before 'no'/'don't'. Now treats ambiguous (both match) as denial for safety. 2. Shutdown race: stopSpeaking() during deactivate() fired ttsOnComplete which called startListening(), leaving service in inconsistent state. Now sets state=.off before shutdown. 3. Audio tap leak: if audioEngine.start() failed, the installed tap was never removed, causing a crash on the next startRecording() call. 4. Voice prefix pollution: the voice system instruction prefix was stored in chat messages and used for thread title generation. Now only sent via IPC. 5. Barge-in monitor conflict: startBargeInMonitor() ran after finishTextStream() which could complete synchronously (no ElevenLabs), causing double tap install. Reordered to start barge-in monitor before finishTextStream. Co-Authored-By: Claude <noreply@anthropic.com>
1. Permission bypass: 'No, don't do it' was approved because 'do it' matched before 'no'/'don't'. Now treats ambiguous (both match) as denial for safety. 2. Shutdown race: stopSpeaking() during deactivate() fired ttsOnComplete which called startListening(), leaving service in inconsistent state. Now sets state=.off before shutdown. 3. Audio tap leak: if audioEngine.start() failed, the installed tap was never removed, causing a crash on the next startRecording() call. 4. Voice prefix pollution: the voice system instruction prefix was stored in chat messages and used for thread title generation. Now only sent via IPC. 5. Barge-in monitor conflict: startBargeInMonitor() ran after finishTextStream() which could complete synchronously (no ElevenLabs), causing double tap install. Reordered to start barge-in monitor before finishTextStream. Co-authored-by: Claude <noreply@anthropic.com>
|
Addressed in #6099 |
Remove 4 unused files (~297 lines) from clients/shared/: - NetworkInterfaceResolver.swift (58 lines) — zero references outside file. Added in PR #6656 (QR code pairing) but never used. - HoverEffect.swift (41 lines) — zero references outside file. Added in PR #1868 (Extract design system) but never used. - InlineWidgetCardModifier.swift (114 lines) — zero references outside file. Last touched in PR #3473 (platform guard fix). - VWaveformView.swift (84 lines) — zero references outside file. Added in PR #6095 (two-way voice mode) but never used. Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
* chore: remove dead shared library files Remove 4 unused files (~297 lines) from clients/shared/: - NetworkInterfaceResolver.swift (58 lines) — zero references outside file. Added in PR #6656 (QR code pairing) but never used. - HoverEffect.swift (41 lines) — zero references outside file. Added in PR #1868 (Extract design system) but never used. - InlineWidgetCardModifier.swift (114 lines) — zero references outside file. Last touched in PR #3473 (platform guard fix). - VWaveformView.swift (84 lines) — zero references outside file. Added in PR #6095 (two-way voice mode) but never used. Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai> * fix: restore InlineWidgetCardModifier and HoverEffect — actively used InlineWidgetCardModifier is used by ConfirmationSurfaceView.swift:48 (.inlineWidgetCard() call). HoverEffect is used by ModifiersGallerySection.swift:86 (.vHover() call in DEBUG gallery). Only NetworkInterfaceResolver and VWaveformView are confirmed dead in this PR. Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai> --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: ashlee@vellum.ai <ashlee@vellum.ai>
Summary
Test plan
🤖 Generated with Claude Code