feat: voice-driven permission handling, info panel, and voice mode improvements by AnitaKirkovska · Pull Request #6095 · vellum-ai/vellum-assistant

AnitaKirkovska · 2026-02-21T01:35:33Z

Summary

Adds a two-way voice conversation panel (Voice Mode) with OpenAI Whisper for STT, ElevenLabs Flash v2.5 for streaming TTS, and a reactive waveform visualization
Implements voice-driven permission handling: when tool confirmations are needed, the assistant speaks a natural-language summary and listens for yes/no voice approval
Adds barge-in support (interrupt TTS by speaking), auto-listening loop, BETA badge with info panel, and OpenAI API key settings

Test plan

Open voice mode panel → verify waveform and avatar appear
Speak → verify transcription, assistant response, and TTS playback
Trigger a tool permission → verify voice asks for approval and responds to "yes"/"no"
Speak while assistant is talking → verify barge-in interrupts TTS
Close voice mode → verify mic indicator disappears
Click info button → verify STT/LLM/TTS details shown

🤖 Generated with Claude Code

…ions, and info panel - Voice mode panel with OpenAI Whisper STT, ElevenLabs Flash v2.5 TTS, and waveform visualization - Voice-driven permission handling: speaks tool permission requests, listens for yes/no approval - Barge-in support: interrupt TTS by speaking - Auto-listening loop: TTS finishes → automatically starts listening for next turn - Info panel showing STT/LLM/TTS infrastructure details with BETA badge - OpenAI API key configuration in Settings - Keeps user's selected model (no downgrade) for full tool-use capability - Explicit osascript preference for macOS app interactions in voice mode Co-Authored-By: Claude <noreply@anthropic.com>

devin-ai-integration

Devin Review found 3 potential issues.

View 6 additional findings in Devin Review.

devin-ai-integration · 2026-02-21T01:39:25Z

+        let text = isVoiceModeActive
+            ? "[Voice conversation — keep spoken responses brief (2-3 sentences) but fully complete the task using any tools needed. Do not give up early. When interacting with macOS apps (Messages, Contacts, Calendar, Reminders, Notes, Mail, etc.), always use osascript with AppleScript — never query databases directly or use sqlite3.]\n\n\(rawText)"
+            : rawText


🔴 Voice mode system-instruction prefix is stored in user chat messages and used for thread title generation

When isVoiceModeActive is true, sendMessage() prepends a long system instruction ([Voice conversation — keep spoken responses brief...]) to the user's text at line 242-243. This combined text variable is then used everywhere downstream:

Root Cause & Impact

Chat UI pollution: The prefixed text is stored in ChatMessage(role: .user, text: text, ...) at ChatViewModel.swift:307 (and line 275 for the bootstrap path). The user sees the full system instruction prefix in their chat bubble for every voice message.

Thread title corruption: The prefixed text is passed to onFirstUserMessage at ChatViewModel.swift:258 → callback(text), which feeds into ThreadManager.generateTitle(for:userMessage:) at ThreadManager.swift:117. The fallback title via deriveTitle would be "[Voice conversation — keep spoken responses bri..." instead of a title derived from the user's actual message.

The fix should use rawText for the chat message display and onFirstUserMessage callback, and only use the prefixed text for the IPC message sent to the daemon.

Prompt for agents

In clients/shared/Features/Chat/ChatViewModel.swift, the sendMessage() function (starting around line 240) uses the voice-prefixed `text` variable for everything: chat message display, thread title generation, and the IPC message to the daemon. Only the IPC message should contain the prefix. Changes needed: 1. At line 256, change `callback(text)` to `callback(rawText)` so thread title generation uses the user's actual message, not the prefixed one. 2. At line 275 and line 307, change the ChatMessage construction to use `rawText` instead of `text` for the message text, so the user sees their original message in the chat UI. 3. The prefixed `text` should only be used for the actual IPC send to the daemon (bootstrapSession and sendUserMessage calls further below). 4. Similarly at line 289 (`currentTurnUserText = text`) and line 344, use `rawText` so turn tracking is based on the user's actual input.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-02-21T01:39:26Z

+        let isApproved = affirmative.contains(where: { lower.contains($0) })
+        let isDenied = negative.contains(where: { lower.contains($0) })


🔴 Voice permission approval takes priority over denial when response contains both affirmative and negative substrings

In handlePermissionResponse, substring matching with lower.contains($0) is used to classify the user's spoken response, and the isApproved branch is checked before isDenied. This means any denial that incidentally contains an affirmative substring will be incorrectly approved.

Security Impact — Tool Permission Bypass

The affirmative list includes short substrings like "ok", "do it", "sure", and "proceed". Common denial phrases contain these as substrings:

"No, don't do it" → contains("do it") → isApproved = true → approved despite clear denial

"I'm not okay with that" → contains("ok") → isApproved = true → approved

"No, please don't proceed" → contains("proceed") → isApproved = true → approved

"I'm not sure about that" → contains("sure") → isApproved = true → approved

Since if isApproved at line 435 is checked before else if isDenied at line 443, the tool permission is granted when the user explicitly said no. This controls whether the assistant can execute shell commands, write files, and perform other privileged operations on the user's Mac.

The fix should either: (a) check isDenied first and treat ambiguous (both true) as denied for safety, or (b) use word-boundary matching instead of substring contains.

Suggested change

let isApproved = affirmative.contains(where: { lower.contains($0) })

let isDenied = negative.contains(where: { lower.contains($0) })

let isApproved = affirmative.contains(where: { lower.contains($0) })

let isDenied = negative.contains(where: { lower.contains($0) })

// If both affirmative and negative substrings match (e.g. "no, don't do it"

// contains "do it" + "no"/"don't"), treat as denial for safety.

let effectiveApproved = isApproved && !isDenied

let effectiveDenied = isDenied

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-02-21T01:39:27Z

+        } catch {
+            log.error("Failed to start audio engine: \(error.localizedDescription)")
+        }


🔴 Audio tap not removed when audioEngine.start() fails, causing crash on next recording attempt

In OpenAIVoiceService.startRecording(), installTap(onBus: 0, ...) is called at line 103 before audioEngine.start() at line 145. If audioEngine.start() throws, the catch block at line 150-152 logs the error but does not remove the tap or clean up state.

Crash Scenario

After the failure:

isRecording remains false (it was never set to true)

The tap on bus 0 remains installed

recordingFormat is set but not cleared

On the next call to startRecording(), the guard !isRecording check passes, and installTap(onBus: 0, ...) is called again at line 103. AVAudioEngine crashes with an exception if a tap is already installed on bus 0 — you must remove the existing tap first.

This is the same pattern that the existing VoiceInputManager.beginRecording() has (and had to fix), where the catch block at VoiceInputManager.swift:267-273 properly cleans up recognitionRequest and recognitionTask but also would benefit from removing the tap.

Suggested change

} catch {

log.error("Failed to start audio engine: \(error.localizedDescription)")

}

} catch {

log.error("Failed to start audio engine: \(error.localizedDescription)")

audioEngine.inputNode.removeTap(onBus: 0)

recordingFormat = nil

}

Was this helpful? React with 👍 or 👎 to provide feedback.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 573355fbe4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-02-21T01:40:53Z

+        }
+
+        // Start monitoring mic for barge-in (interrupt by speaking)
+        voiceService.startBargeInMonitor()


Skip barge-in monitor when TTS completes immediately

When ElevenLabs is not configured (which is currently allowed), finishTextStream returns via its fast path and invokes the completion synchronously (OpenAIVoiceService.finishTextStream guard at lines 266-269). That completion calls startListening(), so by the time this line runs, recording has already started and installed an input tap; starting the barge-in monitor then tries to install another tap on the same bus, which can trigger AVAudioEngine tap conflicts/crashes for STT-only voice mode users.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-02-21T01:40:53Z

+        guard state != .off else { return }
+
+        // Fully shut down audio engine to release the microphone
+        voiceService.shutdown()


Prevent shutdown from re-entering listening while deactivating

If the user exits voice mode during playback, shutdown() calls stopSpeaking(), which synchronously executes ttsOnComplete (OpenAIVoiceService.stopSpeaking line 319) while VoiceModeManager.state is still .speaking. The stored completion from handleResponseComplete can therefore call startListening() during teardown, leaving the service in an inconsistent recording state (notably isRecording can stay true) and causing subsequent startRecording() calls to early-return.

Useful? React with 👍 / 👎.

1. Permission bypass: 'No, don't do it' was approved because 'do it' matched before 'no'/'don't'. Now treats ambiguous (both match) as denial for safety. 2. Shutdown race: stopSpeaking() during deactivate() fired ttsOnComplete which called startListening(), leaving service in inconsistent state. Now sets state=.off before shutdown. 3. Audio tap leak: if audioEngine.start() failed, the installed tap was never removed, causing a crash on the next startRecording() call. 4. Voice prefix pollution: the voice system instruction prefix was stored in chat messages and used for thread title generation. Now only sent via IPC. 5. Barge-in monitor conflict: startBargeInMonitor() ran after finishTextStream() which could complete synchronously (no ElevenLabs), causing double tap install. Reordered to start barge-in monitor before finishTextStream. Co-Authored-By: Claude <noreply@anthropic.com>

1. Permission bypass: 'No, don't do it' was approved because 'do it' matched before 'no'/'don't'. Now treats ambiguous (both match) as denial for safety. 2. Shutdown race: stopSpeaking() during deactivate() fired ttsOnComplete which called startListening(), leaving service in inconsistent state. Now sets state=.off before shutdown. 3. Audio tap leak: if audioEngine.start() failed, the installed tap was never removed, causing a crash on the next startRecording() call. 4. Voice prefix pollution: the voice system instruction prefix was stored in chat messages and used for thread title generation. Now only sent via IPC. 5. Barge-in monitor conflict: startBargeInMonitor() ran after finishTextStream() which could complete synchronously (no ElevenLabs), causing double tap install. Reordered to start barge-in monitor before finishTextStream. Co-authored-by: Claude <noreply@anthropic.com>

AnitaKirkovska · 2026-02-21T01:46:31Z

Addressed in #6099

Remove 4 unused files (~297 lines) from clients/shared/: - NetworkInterfaceResolver.swift (58 lines) — zero references outside file. Added in PR #6656 (QR code pairing) but never used. - HoverEffect.swift (41 lines) — zero references outside file. Added in PR #1868 (Extract design system) but never used. - InlineWidgetCardModifier.swift (114 lines) — zero references outside file. Last touched in PR #3473 (platform guard fix). - VWaveformView.swift (84 lines) — zero references outside file. Added in PR #6095 (two-way voice mode) but never used. Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>

* chore: remove dead shared library files Remove 4 unused files (~297 lines) from clients/shared/: - NetworkInterfaceResolver.swift (58 lines) — zero references outside file. Added in PR #6656 (QR code pairing) but never used. - HoverEffect.swift (41 lines) — zero references outside file. Added in PR #1868 (Extract design system) but never used. - InlineWidgetCardModifier.swift (114 lines) — zero references outside file. Last touched in PR #3473 (platform guard fix). - VWaveformView.swift (84 lines) — zero references outside file. Added in PR #6095 (two-way voice mode) but never used. Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai> * fix: restore InlineWidgetCardModifier and HoverEffect — actively used InlineWidgetCardModifier is used by ConfirmationSurfaceView.swift:48 (.inlineWidgetCard() call). HoverEffect is used by ModifiersGallerySection.swift:86 (.vHover() call in DEBUG gallery). Only NetworkInterfaceResolver and VWaveformView are confirmed dead in this PR. Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai> --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: ashlee@vellum.ai <ashlee@vellum.ai>

AnitaKirkovska merged commit 5233ba3 into main Feb 21, 2026

AnitaKirkovska deleted the anita/voice-mode-improvements branch February 21, 2026 01:35

devin-ai-integration Bot reviewed Feb 21, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Feb 21, 2026

View reviewed changes

devin-ai-integration Bot mentioned this pull request Mar 3, 2026

chore: remove dead WakeWord and Voice files #11763

Merged

4 tasks

devin-ai-integration Bot mentioned this pull request Mar 3, 2026

chore: remove dead shared library files #11767

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: voice-driven permission handling, info panel, and voice mode improvements#6095

feat: voice-driven permission handling, info panel, and voice mode improvements#6095
AnitaKirkovska merged 1 commit into
mainfrom
anita/voice-mode-improvements

AnitaKirkovska commented Feb 21, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot Feb 21, 2026

Uh oh!

devin-ai-integration Bot Feb 21, 2026

Uh oh!

devin-ai-integration Bot Feb 21, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Feb 21, 2026

Uh oh!

chatgpt-codex-connector Bot Feb 21, 2026

Uh oh!

AnitaKirkovska commented Feb 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		let isApproved = affirmative.contains(where: { lower.contains($0) })
		let isDenied = negative.contains(where: { lower.contains($0) })

Conversation

AnitaKirkovska commented Feb 21, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Feb 21, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Feb 21, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Feb 21, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Feb 21, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Feb 21, 2026

Choose a reason for hiding this comment

Uh oh!

AnitaKirkovska commented Feb 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

AnitaKirkovska commented Feb 21, 2026 •

edited by devin-ai-integration Bot

Loading