Skip to content

feat: voice-driven permission handling, info panel, and voice mode improvements#6095

Merged
AnitaKirkovska merged 1 commit into
mainfrom
anita/voice-mode-improvements
Feb 21, 2026
Merged

feat: voice-driven permission handling, info panel, and voice mode improvements#6095
AnitaKirkovska merged 1 commit into
mainfrom
anita/voice-mode-improvements

Conversation

@AnitaKirkovska
Copy link
Copy Markdown
Contributor

@AnitaKirkovska AnitaKirkovska commented Feb 21, 2026

Summary

  • Adds a two-way voice conversation panel (Voice Mode) with OpenAI Whisper for STT, ElevenLabs Flash v2.5 for streaming TTS, and a reactive waveform visualization
  • Implements voice-driven permission handling: when tool confirmations are needed, the assistant speaks a natural-language summary and listens for yes/no voice approval
  • Adds barge-in support (interrupt TTS by speaking), auto-listening loop, BETA badge with info panel, and OpenAI API key settings

Test plan

  • Open voice mode panel → verify waveform and avatar appear
  • Speak → verify transcription, assistant response, and TTS playback
  • Trigger a tool permission → verify voice asks for approval and responds to "yes"/"no"
  • Speak while assistant is talking → verify barge-in interrupts TTS
  • Close voice mode → verify mic indicator disappears
  • Click info button → verify STT/LLM/TTS details shown

🤖 Generated with Claude Code


Open with Devin

…ions, and info panel

- Voice mode panel with OpenAI Whisper STT, ElevenLabs Flash v2.5 TTS, and waveform visualization
- Voice-driven permission handling: speaks tool permission requests, listens for yes/no approval
- Barge-in support: interrupt TTS by speaking
- Auto-listening loop: TTS finishes → automatically starts listening for next turn
- Info panel showing STT/LLM/TTS infrastructure details with BETA badge
- OpenAI API key configuration in Settings
- Keeps user's selected model (no downgrade) for full tool-use capability
- Explicit osascript preference for macOS app interactions in voice mode

Co-Authored-By: Claude <noreply@anthropic.com>
@AnitaKirkovska AnitaKirkovska merged commit 5233ba3 into main Feb 21, 2026
@AnitaKirkovska AnitaKirkovska deleted the anita/voice-mode-improvements branch February 21, 2026 01:35
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

View 6 additional findings in Devin Review.

Open in Devin Review

Comment on lines +242 to +244
let text = isVoiceModeActive
? "[Voice conversation — keep spoken responses brief (2-3 sentences) but fully complete the task using any tools needed. Do not give up early. When interacting with macOS apps (Messages, Contacts, Calendar, Reminders, Notes, Mail, etc.), always use osascript with AppleScript — never query databases directly or use sqlite3.]\n\n\(rawText)"
: rawText
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Voice mode system-instruction prefix is stored in user chat messages and used for thread title generation

When isVoiceModeActive is true, sendMessage() prepends a long system instruction ([Voice conversation — keep spoken responses brief...]) to the user's text at line 242-243. This combined text variable is then used everywhere downstream:

Root Cause & Impact
  1. Chat UI pollution: The prefixed text is stored in ChatMessage(role: .user, text: text, ...) at ChatViewModel.swift:307 (and line 275 for the bootstrap path). The user sees the full system instruction prefix in their chat bubble for every voice message.

  2. Thread title corruption: The prefixed text is passed to onFirstUserMessage at ChatViewModel.swift:258callback(text), which feeds into ThreadManager.generateTitle(for:userMessage:) at ThreadManager.swift:117. The fallback title via deriveTitle would be "[Voice conversation — keep spoken responses bri..." instead of a title derived from the user's actual message.

The fix should use rawText for the chat message display and onFirstUserMessage callback, and only use the prefixed text for the IPC message sent to the daemon.

Prompt for agents
In clients/shared/Features/Chat/ChatViewModel.swift, the sendMessage() function (starting around line 240) uses the voice-prefixed `text` variable for everything: chat message display, thread title generation, and the IPC message to the daemon. Only the IPC message should contain the prefix. Changes needed:

1. At line 256, change `callback(text)` to `callback(rawText)` so thread title generation uses the user's actual message, not the prefixed one.
2. At line 275 and line 307, change the ChatMessage construction to use `rawText` instead of `text` for the message text, so the user sees their original message in the chat UI.
3. The prefixed `text` should only be used for the actual IPC send to the daemon (bootstrapSession and sendUserMessage calls further below).
4. Similarly at line 289 (`currentTurnUserText = text`) and line 344, use `rawText` so turn tracking is based on the user's actual input.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +426 to +427
let isApproved = affirmative.contains(where: { lower.contains($0) })
let isDenied = negative.contains(where: { lower.contains($0) })
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Voice permission approval takes priority over denial when response contains both affirmative and negative substrings

In handlePermissionResponse, substring matching with lower.contains($0) is used to classify the user's spoken response, and the isApproved branch is checked before isDenied. This means any denial that incidentally contains an affirmative substring will be incorrectly approved.

Security Impact — Tool Permission Bypass

The affirmative list includes short substrings like "ok", "do it", "sure", and "proceed". Common denial phrases contain these as substrings:

  • "No, don't do it"contains("do it")isApproved = true → approved despite clear denial
  • "I'm not okay with that"contains("ok")isApproved = true → approved
  • "No, please don't proceed"contains("proceed")isApproved = true → approved
  • "I'm not sure about that"contains("sure")isApproved = true → approved

Since if isApproved at line 435 is checked before else if isDenied at line 443, the tool permission is granted when the user explicitly said no. This controls whether the assistant can execute shell commands, write files, and perform other privileged operations on the user's Mac.

The fix should either: (a) check isDenied first and treat ambiguous (both true) as denied for safety, or (b) use word-boundary matching instead of substring contains.

Suggested change
let isApproved = affirmative.contains(where: { lower.contains($0) })
let isDenied = negative.contains(where: { lower.contains($0) })
let isApproved = affirmative.contains(where: { lower.contains($0) })
let isDenied = negative.contains(where: { lower.contains($0) })
// If both affirmative and negative substrings match (e.g. "no, don't do it"
// contains "do it" + "no"/"don't"), treat as denial for safety.
let effectiveApproved = isApproved && !isDenied
let effectiveDenied = isDenied
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +150 to +152
} catch {
log.error("Failed to start audio engine: \(error.localizedDescription)")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Audio tap not removed when audioEngine.start() fails, causing crash on next recording attempt

In OpenAIVoiceService.startRecording(), installTap(onBus: 0, ...) is called at line 103 before audioEngine.start() at line 145. If audioEngine.start() throws, the catch block at line 150-152 logs the error but does not remove the tap or clean up state.

Crash Scenario

After the failure:

  • isRecording remains false (it was never set to true)
  • The tap on bus 0 remains installed
  • recordingFormat is set but not cleared

On the next call to startRecording(), the guard !isRecording check passes, and installTap(onBus: 0, ...) is called again at line 103. AVAudioEngine crashes with an exception if a tap is already installed on bus 0 — you must remove the existing tap first.

This is the same pattern that the existing VoiceInputManager.beginRecording() has (and had to fix), where the catch block at VoiceInputManager.swift:267-273 properly cleans up recognitionRequest and recognitionTask but also would benefit from removing the tap.

Suggested change
} catch {
log.error("Failed to start audio engine: \(error.localizedDescription)")
}
} catch {
log.error("Failed to start audio engine: \(error.localizedDescription)")
audioEngine.inputNode.removeTap(onBus: 0)
recordingFormat = nil
}
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 573355fbe4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}

// Start monitoring mic for barge-in (interrupt by speaking)
voiceService.startBargeInMonitor()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Skip barge-in monitor when TTS completes immediately

When ElevenLabs is not configured (which is currently allowed), finishTextStream returns via its fast path and invokes the completion synchronously (OpenAIVoiceService.finishTextStream guard at lines 266-269). That completion calls startListening(), so by the time this line runs, recording has already started and installed an input tap; starting the barge-in monitor then tries to install another tap on the same bus, which can trigger AVAudioEngine tap conflicts/crashes for STT-only voice mode users.

Useful? React with 👍 / 👎.

guard state != .off else { return }

// Fully shut down audio engine to release the microphone
voiceService.shutdown()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent shutdown from re-entering listening while deactivating

If the user exits voice mode during playback, shutdown() calls stopSpeaking(), which synchronously executes ttsOnComplete (OpenAIVoiceService.stopSpeaking line 319) while VoiceModeManager.state is still .speaking. The stored completion from handleResponseComplete can therefore call startListening() during teardown, leaving the service in an inconsistent recording state (notably isRecording can stay true) and causing subsequent startRecording() calls to early-return.

Useful? React with 👍 / 👎.

AnitaKirkovska added a commit that referenced this pull request Feb 21, 2026
1. Permission bypass: 'No, don't do it' was approved because 'do it' matched
   before 'no'/'don't'. Now treats ambiguous (both match) as denial for safety.
2. Shutdown race: stopSpeaking() during deactivate() fired ttsOnComplete which
   called startListening(), leaving service in inconsistent state. Now sets
   state=.off before shutdown.
3. Audio tap leak: if audioEngine.start() failed, the installed tap was never
   removed, causing a crash on the next startRecording() call.
4. Voice prefix pollution: the voice system instruction prefix was stored in
   chat messages and used for thread title generation. Now only sent via IPC.
5. Barge-in monitor conflict: startBargeInMonitor() ran after finishTextStream()
   which could complete synchronously (no ElevenLabs), causing double tap install.
   Reordered to start barge-in monitor before finishTextStream.

Co-Authored-By: Claude <noreply@anthropic.com>
AnitaKirkovska added a commit that referenced this pull request Feb 21, 2026
1. Permission bypass: 'No, don't do it' was approved because 'do it' matched
   before 'no'/'don't'. Now treats ambiguous (both match) as denial for safety.
2. Shutdown race: stopSpeaking() during deactivate() fired ttsOnComplete which
   called startListening(), leaving service in inconsistent state. Now sets
   state=.off before shutdown.
3. Audio tap leak: if audioEngine.start() failed, the installed tap was never
   removed, causing a crash on the next startRecording() call.
4. Voice prefix pollution: the voice system instruction prefix was stored in
   chat messages and used for thread title generation. Now only sent via IPC.
5. Barge-in monitor conflict: startBargeInMonitor() ran after finishTextStream()
   which could complete synchronously (no ElevenLabs), causing double tap install.
   Reordered to start barge-in monitor before finishTextStream.

Co-authored-by: Claude <noreply@anthropic.com>
@AnitaKirkovska
Copy link
Copy Markdown
Contributor Author

Addressed in #6099

devin-ai-integration Bot added a commit that referenced this pull request Mar 3, 2026
Remove 4 unused files (~297 lines) from clients/shared/:

- NetworkInterfaceResolver.swift (58 lines) — zero references outside file.
  Added in PR #6656 (QR code pairing) but never used.
- HoverEffect.swift (41 lines) — zero references outside file.
  Added in PR #1868 (Extract design system) but never used.
- InlineWidgetCardModifier.swift (114 lines) — zero references outside file.
  Last touched in PR #3473 (platform guard fix).
- VWaveformView.swift (84 lines) — zero references outside file.
  Added in PR #6095 (two-way voice mode) but never used.

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
ashleeradka added a commit that referenced this pull request Mar 3, 2026
* chore: remove dead shared library files

Remove 4 unused files (~297 lines) from clients/shared/:

- NetworkInterfaceResolver.swift (58 lines) — zero references outside file.
  Added in PR #6656 (QR code pairing) but never used.
- HoverEffect.swift (41 lines) — zero references outside file.
  Added in PR #1868 (Extract design system) but never used.
- InlineWidgetCardModifier.swift (114 lines) — zero references outside file.
  Last touched in PR #3473 (platform guard fix).
- VWaveformView.swift (84 lines) — zero references outside file.
  Added in PR #6095 (two-way voice mode) but never used.

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>

* fix: restore InlineWidgetCardModifier and HoverEffect — actively used

InlineWidgetCardModifier is used by ConfirmationSurfaceView.swift:48
(.inlineWidgetCard() call). HoverEffect is used by
ModifiersGallerySection.swift:86 (.vHover() call in DEBUG gallery).

Only NetworkInterfaceResolver and VWaveformView are confirmed dead
in this PR.

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: ashlee@vellum.ai <ashlee@vellum.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant