Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
230 changes: 154 additions & 76 deletions clients/macos/vellum-assistant/App/VoiceInputManager.swift

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 stopContinuousRecording() missing recognitionTask == nil guard — can leave isRecording stuck and violate endAudio contract

stopRecordingForDictation() was correctly updated (lines 894-906) with a guard for recognitionTask == nil to handle the new async engine start flow where the recognition task hasn't been created yet. However, the structurally identical stopContinuousRecording() was not updated with the same guard.

If stopContinuousRecording() is called before the async installTapAndStartAsync completes: (1) recognitionTask is nil so no callback will deliver isFinal, (2) isRecording stays true permanently, (3) endAudio() is called on the request. When the async task eventually completes, it passes the generation check (same session), creates a recognition task with a request that already had endAudio() called, and the tap starts appending buffers after endAudio() — violating SFSpeechAudioBufferRecognitionRequest's contract.

Comparison with the fix in stopRecordingForDictation

stopRecordingForDictation() handles this at lines 894-906:

guard recognitionTask != nil else {
    recognitionRequest = nil
    isRecording = false
    // ... full cleanup
    return
}

stopContinuousRecording() at lines 260-277 has no such guard.

(Refers to lines 260-277)

Prompt for agents
The stopContinuousRecording() method at line 260 needs the same recognitionTask == nil guard that was added to stopRecordingForDictation() at lines 894-906. Without it, if stopContinuousRecording() is called while the async engine start is still in progress, isRecording remains true forever and endAudio() is called on a request before its recognition task exists.

After the hasInstalledTap = false line (line 272), add a guard checking recognitionTask != nil. If nil, perform direct cleanup: set recognitionRequest = nil, isRecording = false, reset amplitude state, and return early. The cleanup should mirror what stopRecordingForDictation does in its equivalent guard block (lines 894-906), but adapted for the continuous recording context (no overlay dismiss, no dictation context reset).
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,12 @@ final class VoiceInputManager {
/// Guards against double-start/double-stop from rapid key events.
private var isActivatorHeld = false

/// Monotonically increasing counter identifying the current recording
/// session. The async engine-start Task captures this value and checks
/// it after `await` — if it no longer matches, the completion belongs
/// to a stale session and is discarded.
private var recordingGeneration: UInt64 = 0

/// Whether `start()` has been called (monitors are active).
/// Used to guard against duplicate registration from deferred startup.
private(set) var hasStarted = false
Expand Down Expand Up @@ -110,7 +116,7 @@ final class VoiceInputManager {
PTTActivator.cached
}

private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
private var speechRecognizer: SFSpeechRecognizer? = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
private var recognitionTask: SFSpeechRecognitionTask?
private let engineController = AudioEngineController(label: "com.vellum.audioEngine.voiceInput")
Expand Down Expand Up @@ -549,18 +555,32 @@ final class VoiceInputManager {
holdTask = nil
}

/// Capture frontmost app context (for dictation) and begin recording.
/// Start recording immediately for instant UI feedback, then capture
/// frontmost app context off the main actor. The engine starts
/// asynchronously on its audio queue while context capture runs on a
/// detached Task — both happen concurrently without blocking the main
/// actor, so key-up events are processed immediately.
///
/// When Vellum itself is the frontmost app, skip context capture so the
/// transcription falls through to the conversation path (auto-submit to chat)
/// instead of going through DictationTextInserter which would double-insert.
private func captureContextAndBeginRecording() {
beginRecording()
Comment thread
devin-ai-integration[bot] marked this conversation as resolved.
guard isRecording else { return }
if currentMode == .dictation {
let isVellumFrontmost = NSWorkspace.shared.frontmostApplication?.bundleIdentifier == Bundle.main.bundleIdentifier
if !isVellumFrontmost {
Comment thread
devin-ai-integration[bot] marked this conversation as resolved.
currentDictationContext = DictationContextCapture.capture()
let generation = recordingGeneration
Task.detached { [weak self] in
let context = DictationContextCapture.capture()
await MainActor.run { [weak self] in
guard let self else { return }
guard self.isRecording, self.recordingGeneration == generation else { return }
self.currentDictationContext = context
}
}
}
}
beginRecording()
}

/// Stop recording using the appropriate method for the current mode.
Expand Down Expand Up @@ -592,6 +612,12 @@ final class VoiceInputManager {
// MARK: - Recording

private func beginRecording() {
// Recreate speech recognizer if transiently unavailable (e.g. after
// sleep/wake, heavy use, or audio route changes).
if speechRecognizer?.isAvailable != true {
log.warning("Speech recognizer unavailable — recreating")
speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
}
guard let speechRecognizer = speechRecognizer, speechRecognizer.isAvailable else {
log.error("Speech recognizer not available")
currentDictationContext = nil
Expand Down Expand Up @@ -637,6 +663,11 @@ final class VoiceInputManager {
return
}

// Show recording state and play chime immediately for instant feedback.
// The audio engine starts asynchronously below — the user hears/sees the
// activation before the engine is ready, hiding hardware latency.
recordingGeneration &+= 1
let generation = recordingGeneration
isRecording = true
onRecordingStateChanged?(true)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Activation chime now plays before engine is ready — verify no audio conflict

The activation chime (VoiceFeedback.playActivationChime() at line 672) now plays immediately when isRecording is set, BEFORE the audio engine starts asynchronously. Previously it played AFTER the engine was running (old line 738). This achieves the stated goal of instant feedback. However, if VoiceFeedback.playActivationChime() uses AVAudioPlayer or a system sound API that interacts with the audio session, it could potentially conflict with the audio engine starting shortly after. This is worth verifying but likely fine since system sounds typically use a separate audio path from AVAudioEngine's input node.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

if currentMode == .dictation {
Expand All @@ -647,6 +678,7 @@ final class VoiceInputManager {
}
}
log.info("Voice recording started")
VoiceFeedback.playActivationChime()

let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true
Expand All @@ -655,87 +687,107 @@ final class VoiceInputManager {
let ampState = amplitudeState
ampState.reset()

// Atomically read the hardware format, install the tap, and start the
// engine in a single dispatch to the audio queue. This prevents the
// TOCTOU race where a format read via `inputNodeFormat()` becomes stale
// before a separate `installTap()` async block executes — which crashes
// with NSInternalInconsistencyException on first use after permission grant.
guard engineController.installTapAndStart(
bufferSize: 1024,
block: { [weak self] buffer, _ in
request.append(buffer)

guard let channelData = buffer.floatChannelData else { return }
let frameLength = Int(buffer.frameLength)
guard frameLength > 0 else { return }

let channelDataArray = Array(UnsafeBufferPointer(start: channelData[0], count: frameLength))
let rawRMS = vDSP.rootMeanSquare(channelDataArray)

let smoothed = 0.5 * rawRMS + 0.5 * ampState.previousSmoothed
ampState.previousSmoothed = smoothed

// Scale amplitude to 0-1 range for waveform visualization.
// Speech RMS is typically 0.01-0.1; multiply to fill the visual range.
let scaled = min(smoothed * 14.0, 1.0)

let now = CFAbsoluteTimeGetCurrent()
guard now - ampState.lastEmissionTime >= 0.033 else { return }
ampState.lastEmissionTime = now

VoiceInputManager.amplitudeSubject.send(scaled)
DispatchQueue.main.async { [weak self] in
self?.onAmplitudeChanged?(scaled)
}
let tapBlock: AVAudioNodeTapBlock = { [weak self] buffer, _ in
request.append(buffer)

guard let channelData = buffer.floatChannelData else { return }
let frameLength = Int(buffer.frameLength)
guard frameLength > 0 else { return }

let channelDataArray = Array(UnsafeBufferPointer(start: channelData[0], count: frameLength))
let rawRMS = vDSP.rootMeanSquare(channelDataArray)

let smoothed = 0.5 * rawRMS + 0.5 * ampState.previousSmoothed
ampState.previousSmoothed = smoothed

// Scale amplitude to 0-1 range for waveform visualization.
// Speech RMS is typically 0.01-0.1; multiply to fill the visual range.
let scaled = min(smoothed * 14.0, 1.0)

let now = CFAbsoluteTimeGetCurrent()
guard now - ampState.lastEmissionTime >= 0.033 else { return }
ampState.lastEmissionTime = now

VoiceInputManager.amplitudeSubject.send(scaled)
DispatchQueue.main.async { [weak self] in
self?.onAmplitudeChanged?(scaled)
}
) else {
log.error("Audio engine failed to start — invalid format or engine error")
isRecording = false
onRecordingStateChanged?(false)
currentDictationContext = nil
recognitionRequest = nil
overlayWindow.dismiss()
resetAudioEngine()
return
}
hasInstalledTap = true

recognitionTask = speechRecognizer.recognitionTask(with: request) { [weak self] result, error in
Task { @MainActor in
guard let self = self else { return }
// Ignore late callbacks delivered after recording was stopped
// (e.g. endAudio() triggering a delayed isFinal via Task dispatch).
guard self.isRecording else { return }

if let result = result {
let text = result.bestTranscription.formattedString
if result.isFinal {
log.info("Transcription: \(text, privacy: .public)")
if !text.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty {
self.handleFinalTranscription(text)
// Start the audio engine asynchronously to avoid blocking the main
// thread during Bluetooth negotiation or hardware initialization.
// The recognition task is started in the completion after the engine
// is running. This eliminates the 2+ second main-thread stall that
// occurs with queue.sync when coreaudiod is contended.
Task { [weak self] in
guard let self else { return }
let success = await self.engineController.installTapAndStartAsync(
bufferSize: 1024,
block: tapBlock
)
// Verify this completion belongs to the current recording session.
// A quick release/retry can cause session A's completion to arrive
// while session B is active — using the stale request would
// desynchronize recognitionTask/recognitionRequest ownership.
guard self.isRecording, self.recordingGeneration == generation else {
// Only tear down if no session is currently active. When a newer
// session is running (isRecording true, generation mismatch),
// it owns the engine — tearing down here would remove its tap.
if success, !self.isRecording {
self.engineController.stopAndRemoveTap()
log.info("Engine started for stale generation \(generation) — tore down (no active session)")
} else if success {
log.info("Stale generation \(generation) completed — skipping teardown, session \(self.recordingGeneration) owns engine")
}
return
Comment on lines +732 to +742

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Stale session's tap appends audio to request after endAudio() in rapid stop/start

In a rapid stop→start scenario, a stale session's audio tap can feed buffers to a SFSpeechAudioBufferRecognitionRequest after endAudio() has been called — violating the API contract explicitly noted at AudioEngineController.swift:188-189.

Detailed sequence
  1. Session A (gen=1): beginRecording() sets isRecording=true, creates request_A, launches async engine start
  2. User stops: stopRecording()tearDownAudioState() calls request_A.endAudio(), sets recognitionRequest=nil. But hasInstalledTap is false so the engine is NOT stopped.
  3. Session B (gen=2): beginRecording() creates request_B, launches another async engine start
  4. Audio queue executes A's installTapAndStartImpl: installs tap_A (closure captures request_A), starts engine
  5. tap_A fires on the audio thread, calling request_A.append(buffer) — after endAudio() was already called
  6. A's continuation resumes: stale check (gen=1≠2) → skips teardown because isRecording=true ("session B owns engine")
  7. Audio queue executes B's impl: removes tap_A, installs tap_B — but audio was already appended to the ended request

The old synchronous installTapAndStart maintained the invariant (remove tap → then endAudio) because by the time beginRecording() returned, the tap was installed and hasInstalledTap=true, so tearDownAudioState() always removed the tap first. The new async path breaks this invariant. The window is narrow (between engine start and the next serial queue item removing the tap), but with the 2+ second Bluetooth latency the PR is designed to handle, the user can easily trigger a stop/start within that window.

Prompt for agents
The stale generation handler at VoiceInputManager.swift:732-742 correctly detects stale sessions but doesn't prevent the stale tap from feeding audio to a request that had endAudio() called. The root cause is that tearDownAudioState() (line 237-246) calls recognitionRequest?.endAudio() when hasInstalledTap is false (async start pending), but the stale session's tap will later install and start feeding audio to the captured request.

Possible approaches:
1. In tearDownAudioState(), when hasInstalledTap is false but a request exists, avoid calling endAudio() — let the stale completion handler clean up instead. The stale handler would need to call endAudio on the captured request after stopAndRemoveTap.
2. Track whether endAudio has been called on the current request (e.g. a flag) and have the tap block check it before appending.
3. In the stale generation handler, when success is true and isRecording is true (newer session active), still call stopAndRemoveTap() to remove the stale tap immediately, rather than relying on the newer session's installTapAndStartAsync to eventually replace it. The newer session's impl will re-install its own tap when it runs on the serial queue.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

}
guard success else {
log.error("Audio engine failed to start — invalid format or engine error")
self.isRecording = false
self.onRecordingStateChanged?(false)
self.currentDictationContext = nil
self.recognitionRequest = nil
self.overlayWindow.dismiss()
self.resetAudioEngine()
return
Comment on lines +744 to +752

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Engine start failure path doesn't play deactivation chime after activation chime was already played

The activation chime is now played immediately at VoiceInputManager.swift:672 (before the async engine start) for instant user feedback. But when the engine subsequently fails to start, the failure handler at lines 735-743 dismisses the overlay and resets state without playing the deactivation chime. The user hears the activation chime indicating recording has started, but gets no audible signal that it failed. In the pre-PR code, the chime was played after the engine was confirmed started, so engine failures never had an orphaned activation chime.

Suggested change
guard success else {
log.error("Audio engine failed to start — invalid format or engine error")
self.isRecording = false
self.onRecordingStateChanged?(false)
self.currentDictationContext = nil
self.recognitionRequest = nil
self.overlayWindow.dismiss()
self.resetAudioEngine()
return
guard success else {
log.error("Audio engine failed to start — invalid format or engine error")
self.isRecording = false
self.onRecordingStateChanged?(false)
self.currentDictationContext = nil
self.recognitionRequest = nil
self.overlayWindow.dismiss()
self.resetAudioEngine()
VoiceFeedback.playDeactivationChime()
return
}
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

}
self.hasInstalledTap = true

self.recognitionTask = speechRecognizer.recognitionTask(with: request) { [weak self] result, error in
Task { @MainActor in
guard let self = self else { return }
// Ignore late callbacks delivered after recording was stopped
// (e.g. endAudio() triggering a delayed isFinal via Task dispatch).
guard self.isRecording else { return }

if let result = result {
let text = result.bestTranscription.formattedString
if result.isFinal {
log.info("Transcription: \(text, privacy: .public)")
if !text.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty {
self.handleFinalTranscription(text)
} else {
VoiceFeedback.playDeactivationChime()
}
self.recognitionTask = nil
self.stopRecording()
} else {
VoiceFeedback.playDeactivationChime()
self.onPartialTranscription?(text)
if self.currentMode == .dictation {
self.overlayWindow.updatePartialTranscription(text)
}
}
}

if let error = error {
log.error("Recognition error: \(error.localizedDescription)")
self.recognitionTask = nil
VoiceFeedback.playDeactivationChime()
self.stopRecording()
} else {
self.onPartialTranscription?(text)
if self.currentMode == .dictation {
self.overlayWindow.updatePartialTranscription(text)
}
}
}

if let error = error {
log.error("Recognition error: \(error.localizedDescription)")
self.recognitionTask = nil
VoiceFeedback.playDeactivationChime()
self.stopRecording()
}
}
}

VoiceFeedback.playActivationChime()
}

// MARK: - Permission Prompt
Expand Down Expand Up @@ -765,10 +817,19 @@ final class VoiceInputManager {

log.info("Permissions granted — starting recording")
prewarmEngine()
self.beginRecording()
guard self.isRecording else { return }
if self.currentMode == .dictation {
self.currentDictationContext = DictationContextCapture.capture()
let generation = self.recordingGeneration
Task.detached { [weak self] in
let context = DictationContextCapture.capture()
await MainActor.run { [weak self] in
guard let self else { return }
guard self.isRecording, self.recordingGeneration == generation else { return }
self.currentDictationContext = context
}
}
}
self.beginRecording()
}


Expand Down Expand Up @@ -844,6 +905,23 @@ final class VoiceInputManager {
}
hasInstalledTap = false

// If the recognition task hasn't been started yet (async engine start
// still in progress), there's no callback to deliver isFinal.
// Clean up directly instead of waiting for a callback that won't come.
guard recognitionTask != nil else {
log.info("Recognition task not yet started — cleaning up directly")
recognitionRequest = nil
isRecording = false
currentDictationContext = nil
activeOrigin = .hotkey
amplitudeState.reset()
Self.amplitudeSubject.send(0)
onAmplitudeChanged?(0)
overlayWindow.dismiss()
VoiceFeedback.playDeactivationChime()
return
}

// Signal end of audio — the recognizer will process remaining audio
// and fire the callback with isFinal = true.
recognitionRequest?.endAudio()
Expand Down
Loading