
Conversation

@jackjackbits (Contributor) commented Jun 25, 2025

Voice Dictation Feature - PR Summary

Overview

This PR adds voice dictation functionality to Goose Desktop, allowing users to input messages using their microphone with support for both OpenAI Whisper and ElevenLabs speech-to-text services.

Key Features

1. Voice Input UI

  • Microphone button in chat input area (next to send button)
  • Recording indicator with duration and file size monitoring
  • Real-time waveform visualization during recording (sketched below)
  • Visual feedback for recording/transcribing states
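
The waveform visualization in the list above isn't spelled out in this summary, so here is a minimal sketch of how a component like WaveformVisualizer could render a live waveform from the microphone stream using the Web Audio API. The function name and structure are illustrative, not taken from the PR.

```typescript
// Minimal sketch: draw a live waveform from a MediaStream onto a canvas.
// drawWaveform is an illustrative name, not the PR's actual component API.
function drawWaveform(stream: MediaStream, canvas: HTMLCanvasElement): () => void {
  const audioCtx = new AudioContext();
  const analyser = audioCtx.createAnalyser();
  analyser.fftSize = 2048; // size of the time-domain snapshot read each frame
  audioCtx.createMediaStreamSource(stream).connect(analyser);

  const samples = new Uint8Array(analyser.fftSize);
  const ctx = canvas.getContext('2d')!;
  let rafId = 0;

  const render = () => {
    analyser.getByteTimeDomainData(samples); // 0..255 values centered at 128 (silence)
    ctx.clearRect(0, 0, canvas.width, canvas.height);
    ctx.beginPath();
    for (let i = 0; i < samples.length; i++) {
      const x = (i / samples.length) * canvas.width;
      const y = (samples[i] / 255) * canvas.height;
      if (i === 0) ctx.moveTo(x, y);
      else ctx.lineTo(x, y);
    }
    ctx.stroke();
    rafId = requestAnimationFrame(render);
  };
  render();

  // The caller invokes the returned function to stop drawing and release audio resources.
  return () => {
    cancelAnimationFrame(rafId);
    void audioCtx.close();
  };
}
```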

2. Dual Provider Support

  • OpenAI Whisper: Uses existing OpenAI API key, no additional configuration needed
  • ElevenLabs Speech-to-Text: Alternative provider with advanced features
  • Smart provider switching: Automatically available based on configured API keys

3. Settings & Configuration

  • New Voice Dictation section in Settings
  • Toggle to enable/disable the feature
  • Provider selection dropdown
  • ElevenLabs API key configuration with secure storage
  • Provider-specific information and features

4. Technical Implementation

Backend (Rust)

  • New /audio/transcribe endpoint for OpenAI Whisper
  • New /audio/transcribe/elevenlabs endpoint for ElevenLabs
  • /audio/config endpoint to check provider availability (see the sketch after this list)
  • 25MB file size limit for both providers
  • Support for multiple audio formats (webm, mp3, mp4, m4a, wav)
  • Automatic API key migration to secure storage for ElevenLabs
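
To illustrate the "smart provider switching" and the /audio/config check described above, here is a client-side sketch of how the UI might decide which provider to offer. The response field names and the use of the X-Secret-Key header on this endpoint are assumptions for the sketch, not the actual contract in audio.rs.

```typescript
// Sketch only: openaiConfigured / elevenlabsConfigured are assumed field names,
// and requiring X-Secret-Key on /audio/config is also an assumption.
type DictationProvider = 'openai' | 'elevenlabs';

interface AudioConfigResponse {
  openaiConfigured: boolean;
  elevenlabsConfigured: boolean;
}

async function pickDictationProvider(
  baseUrl: string,
  secretKey: string,
  preferred: DictationProvider,
): Promise<DictationProvider | null> {
  const res = await fetch(`${baseUrl}/audio/config`, {
    headers: { 'X-Secret-Key': secretKey },
  });
  if (!res.ok) return null;

  const config: AudioConfigResponse = await res.json();
  const available: DictationProvider[] = [];
  if (config.openaiConfigured) available.push('openai');
  if (config.elevenlabsConfigured) available.push('elevenlabs');

  // Honor the user's configured choice when its key is present; otherwise fall
  // back to whichever provider is available, or none.
  return available.includes(preferred) ? preferred : available[0] ?? null;
}
```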

Frontend (TypeScript)

  • useWhisper hook for recording management (see the sketch after this list)
  • useDictationSettings hook for settings persistence
  • WaveformVisualizer component for audio feedback
  • Microphone permission handling
  • Real-time size and duration monitoring
  • Automatic recording stop at 10 minutes or 25MB
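
The recording lifecycle the useWhisper hook manages (lazy permission request, chunked recording, and the 10-minute / 25MB auto-stop) can be sketched with plain MediaRecorder as below. startDictation and its callback are illustrative names; the real hook's internals may differ.

```typescript
// Sketch of the recording lifecycle described above. startDictation and its
// options are illustrative names; the real useWhisper hook may differ.
const MAX_BYTES = 25 * 1024 * 1024; // 25MB cap shared with the backend
const MAX_MS = 10 * 60 * 1000;      // 10-minute cap

async function startDictation(onStop: (audio: Blob) => void): Promise<() => void> {
  // Permission is requested lazily, only when the user taps the mic button.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });

  const chunks: Blob[] = [];
  let recordedBytes = 0;

  recorder.ondataavailable = (event) => {
    chunks.push(event.data);
    recordedBytes += event.data.size;
    if (recordedBytes >= MAX_BYTES) recorder.stop(); // auto-stop at the size limit
  };

  recorder.onstop = () => {
    stream.getTracks().forEach((track) => track.stop()); // release the microphone
    onStop(new Blob(chunks, { type: 'audio/webm' }));
  };

  recorder.start(1000); // emit a chunk every second for size/duration monitoring
  const timer = setTimeout(() => {
    if (recorder.state === 'recording') recorder.stop(); // auto-stop at the duration limit
  }, MAX_MS);

  // The returned function lets the UI stop the recording manually.
  return () => {
    clearTimeout(timer);
    if (recorder.state === 'recording') recorder.stop();
  };
}
```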

5. Security & Privacy

  • All API keys stored securely
  • Audio data transmitted as base64 over HTTPS (illustrated below)
  • No audio stored locally after transcription
  • Microphone permissions requested only when needed
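
As a concrete illustration of "base64 over HTTPS" with no local copy kept, a client-side call to /audio/transcribe might look like the sketch below. The JSON field names (audio, mime_type, text) are assumptions, not the endpoint's documented contract.

```typescript
// Sketch only: the request/response field names (audio, mime_type, text) are
// assumptions about the /audio/transcribe contract, not taken from audio.rs.
async function transcribe(blob: Blob, baseUrl: string, secretKey: string): Promise<string> {
  // Convert the recording to base64 in memory; nothing is written to disk.
  const bytes = new Uint8Array(await blob.arrayBuffer());
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  const audioBase64 = btoa(binary);

  const res = await fetch(`${baseUrl}/audio/transcribe`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Secret-Key': secretKey, // backend auth; the provider API key never reaches the client
    },
    body: JSON.stringify({ audio: audioBase64, mime_type: blob.type }),
  });

  if (res.status === 413) throw new Error('Recording exceeds the 25MB limit');
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);

  const { text } = await res.json();
  return text; // the blob goes out of scope here; no local copy is retained
}
```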

File Changes

New Files

  • crates/goose-server/src/routes/audio.rs - Audio transcription endpoints
  • ui/desktop/src/hooks/useWhisper.ts - Recording and transcription logic
  • ui/desktop/src/hooks/useDictationSettings.ts - Settings management
  • ui/desktop/src/components/settings/dictation/DictationSection.tsx - Settings UI
  • ui/desktop/src/components/WaveformVisualizer.tsx - Audio visualization

Modified Files

  • ui/desktop/src/components/ChatInput.tsx - Added microphone button
  • ui/desktop/src/components/settings/SettingsView.tsx - Added dictation section
  • ui/desktop/src/main.ts - Added microphone permission handling
  • ui/desktop/src/preload.ts - Exposed permission APIs
  • Various server files to register new routes

Testing

  • All Rust tests passing
  • TypeScript compilation successful
  • ESLint and formatting checks passed
  • Manual testing completed with both providers

Future Enhancements

  • Real-time streaming transcription
  • Custom vocabulary support
  • Local Whisper model support
  • Voice activity detection

Breaking Changes

None - Feature is disabled by default and requires user opt-in.

Screenshots

Two screenshots attached: 2025-06-25 at 22:08:56 and 2025-06-26 at 15:06:48.

- Add microphone button that appears when OpenAI is configured
- Implement real-time waveform visualization during recording
- Add backend /audio/transcribe endpoint with security measures:
  - 25MB file size limit with 413 status code
  - 30-second timeout for API calls
  - Proper authentication via X-Secret-Key
- Add visual feedback during transcription
- Show recording duration and estimated file size
- Warn users when approaching 25MB limit
- Auto-stop recording at 10 minutes or 25MB
- Add comprehensive integration tests
- Fix ESLint configuration and MessageCopyLink warning

Security: API keys remain backend-only, no frontend exposure
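
A note on the size monitoring described in the summary above: the estimated size and the "approaching the limit" warning can be derived from the recorder's accumulated chunks, as in this sketch. The 90% warning threshold is an assumed value; the PR does not state the actual one.

```typescript
// Illustrative sketch: estimate recording size/duration and flag when the
// 25MB limit is close. The 90% warning threshold is an assumed value.
const LIMIT_BYTES = 25 * 1024 * 1024;
const WARN_RATIO = 0.9;

interface RecordingStatus {
  seconds: number;
  bytes: number;
  nearLimit: boolean; // true once the estimate passes 90% of the 25MB cap
}

function recordingStatus(chunks: Blob[], startedAt: number, now = Date.now()): RecordingStatus {
  const bytes = chunks.reduce((sum, chunk) => sum + chunk.size, 0);
  const seconds = Math.floor((now - startedAt) / 1000);
  return { seconds, bytes, nearLimit: bytes >= LIMIT_BYTES * WARN_RATIO };
}
```

A recording indicator could call this once per second alongside the chunk handler and surface the warning before the hard 25MB stop kicks in.
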
@Kvadratni (Contributor) commented:

We all wanted this for a long time and attempted to implement it in different ways. But I have one reservation, and that is tying us to one provider.

I wonder if it's worth bringing it into the settings page, where you can enable this feature and then select from a list of providers, which will then ask you for a token or reuse one. That way you're not tied into OpenAI; you have 11Labs and others in there too.

@jackjackbits (Contributor, PR author) replied:

> We all wanted this for a long time and attempted to implement it in different ways. But I have one reservation, and that is tying us to one provider.
>
> I wonder if it's worth bringing it into the settings page, where you can enable this feature and then select from a list of providers, which will then ask you for a token or reuse one. That way you're not tied into OpenAI; you have 11Labs and others in there too.

hear you. i did it this way out of ease, and piggybacking off something already configured was a bonus. i'm not sure it's really that much better than local os dictation in the end. but putting it in settings and maybe allowing local or other providers (i'm not sure who else there is besides 11labs) would be a good approach. if local, it's effectively just a shortcut.

On the diff in the OpenAI transcription code:

})?;

let response = client
    .post("https://api.openai.com/v1/audio/transcriptions")

A Contributor commented:

Should we check/use the OPENAI_HOST from the environment? Similar to:

let host: String = config
    .get_param("OPENAI_HOST")
    .unwrap_or_else(|_| "https://api.openai.com".to_string());

@hugomd (Member) commented Jun 26, 2025

Have we considered pulling in https://github.com/openai/whisper (or https://github.com/m-bain/whisperX) directly?

It runs exceedingly well locally ✨

@jackjackbits (Contributor, PR author) replied:

> Have we considered pulling in https://github.com/openai/whisper (or https://github.com/m-bain/whisperX) directly?
>
> It runs exceedingly well locally ✨

local macOS dictation is pretty good. don't know if it warrants increasing the project/binary size?

- Add microphone button to chat input with recording visualization
- Support both OpenAI Whisper and ElevenLabs speech-to-text
- Add Voice Dictation settings section with provider selection
- Implement secure API key storage for ElevenLabs
- Add real-time waveform visualization during recording
- Handle microphone permissions properly
- Add 25MB file size limit and 10-minute duration limit
- Support multiple audio formats (webm, mp3, mp4, m4a, wav)
- Feature is opt-in and disabled by default
@jackjackbits changed the title from "feat: add voice dictation with OpenAI Whisper" to "feat: add voice dictation using OpenAI Whisper & ElevenLabs" on Jun 26, 2025
@jackjackbits (Contributor, PR author) replied:

> We all wanted this for a long time and attempted to implement it in different ways. But I have one reservation, and that is tying us to one provider.
>
> I wonder if it's worth bringing it into the settings page, where you can enable this feature and then select from a list of providers, which will then ask you for a token or reuse one. That way you're not tied into OpenAI; you have 11Labs and others in there too.

done

@baxen (Collaborator) left a comment:

Looks great!

On the new file in the diff:

@@ -0,0 +1,80 @@
# Voice Dictation Feature - PR Summary

A Collaborator commented:

nit: going to exclude this from the commit just to keep the top level of the code clean here - this is pretty well covered by inline comments

@jackjackbits (Contributor, PR author) replied:
ah yes, forgot to remove that. sorry.

On another part of the diff:

}
});

// Handle macOS dictation

A Collaborator commented:

nit: also planning to remove this; it looks like a leftover from a different attempt and appears unreachable to me at the moment. please let me know if i got that wrong!

@jackjackbits (Contributor, PR author) replied:
oops! again you're right. left over. thank you.

@baxen merged commit 6ad95fe into block:main on Jun 27, 2025 (5 of 6 checks passed).
@QBlockQ commented Jun 27, 2025

@jackjackbits @Kvadratni

Stunning! We just built something similar in our Meeting app:

Multi-provider STT (Whisper + ElevenLabs + Web Speech fallback) with all processing moved to secure backend routes. Currently testing, but early results show: no vendor lock-in, zero client-side API exposure, smart caching should cut redundant calls significantly.

Your settings-based provider selection is exactly right - it gives users choice while keeping tokens secure. The "11Labs and others" flexibility is gold for production apps.

One insight from building: 3-layer fallbacks seem essential since STT services can be unreliable during peak times.

Solid work! 🔥
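
As a rough illustration of the layered fallback QBlockQ describes (not something implemented in this PR), a client could try providers in order and only fail when every layer has failed. The TranscriptionProvider interface below is entirely hypothetical.

```typescript
// Hypothetical sketch of the multi-layer fallback idea described above; this is
// not part of the PR. TranscriptionProvider is an invented interface.
interface TranscriptionProvider {
  name: string;
  transcribe(audio: Blob): Promise<string>;
}

async function transcribeWithFallback(
  audio: Blob,
  providers: TranscriptionProvider[], // e.g. [whisper, elevenLabs, webSpeech]
): Promise<string> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      return await provider.transcribe(audio); // first provider that succeeds wins
    } catch (err) {
      errors.push(`${provider.name}: ${String(err)}`);
    }
  }
  throw new Error(`All speech-to-text providers failed:\n${errors.join('\n')}`);
}
```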

katzdave added a commit to katzdave/goose that referenced this pull request Jun 27, 2025
* upstream/main:
  Add a reference for recipes (block#3099)
  feat: add voice dictation using OpenAI Whisper & ElevenLabs (block#3079)
  feat: new cli provider for claude code and gemini (block#3083)
  you forgot the important ones! (block#3105)
  hotfix: fix build (block#3102)
  Richer tool call ui messages (block#3104)
  Update linux instructions (block#3087)
zanesq added a commit that referenced this pull request Jun 27, 2025
* 'main' of github.com:block/goose:
  Fix clippy + test errors (#3120)
  Update goose help to include cli (#3095)
  add scheduler type  setting (#3119)
  Add a reference for recipes (#3099)
  feat: add voice dictation using OpenAI Whisper & ElevenLabs (#3079)
  feat: new cli provider for claude code and gemini (#3083)
  you forgot the important ones! (#3105)
  hotfix: fix build (#3102)
  Richer tool call ui messages (#3104)
  Update linux instructions (#3087)
ahau-square pushed a commit that referenced this pull request Jun 27, 2025
* origin/main:
  Added announcement modal (#3098)
  build: Add `just` to Hermit, correct ui/desktop's README (#3116)
  fix: Make the entire toolcall argument row clickable to expand (#3118)
  Fix clippy + test errors (#3120)
  Update goose help to include cli (#3095)
  add scheduler type  setting (#3119)
  Add a reference for recipes (#3099)
  feat: add voice dictation using OpenAI Whisper & ElevenLabs (#3079)
  feat: new cli provider for claude code and gemini (#3083)
  you forgot the important ones! (#3105)
  hotfix: fix build (#3102)
  Richer tool call ui messages (#3104)
  Update linux instructions (#3087)
  Add flag for showing cost tracking (#3090)
  Improve config file editing and recovery fallback mechanisms (#3082)
s-soroosh pushed a commit to s-soroosh/goose that referenced this pull request Jul 18, 2025
cbruyndoncx pushed a commit to cbruyndoncx/goose that referenced this pull request Jul 20, 2025
@jamadeo mentioned this pull request Aug 14, 2025