feat: add voice dictation using OpenAI Whisper & ElevenLabs #3079
- Add microphone button that appears when OpenAI is configured
- Implement real-time waveform visualization during recording
- Add backend /audio/transcribe endpoint with security measures:
  - 25MB file size limit with 413 status code
  - 30-second timeout for API calls
  - Proper authentication via X-Secret-Key
- Add visual feedback during transcription
- Show recording duration and estimated file size
- Warn users when approaching 25MB limit
- Auto-stop recording at 10 minutes or 25MB
- Add comprehensive integration tests
- Fix ESLint configuration and MessageCopyLink warning

Security: API keys remain backend-only, no frontend exposure
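As a rough illustration of the endpoint guards listed in the description, here is a minimal sketch of the checks a handler might run before contacting the transcription API. The function name `preflight_status`, the 401 response for a bad key, and reading "25MB" as 25 * 1024 * 1024 bytes are all assumptions made for this example; the actual route in `audio.rs` may be structured differently.

```rust
/// Upper bound on uploaded audio. Interpreting "25MB" as 25 MiB is an assumption.
const MAX_AUDIO_BYTES: usize = 25 * 1024 * 1024;

const STATUS_OK: u16 = 200;
const STATUS_UNAUTHORIZED: u16 = 401; // assumed status for a missing or wrong key
const STATUS_PAYLOAD_TOO_LARGE: u16 = 413;

/// Hypothetical pre-flight check: returns the HTTP status the endpoint would
/// answer with before any upstream request is made.
///
/// `provided_key` stands in for the value of the X-Secret-Key header.
fn preflight_status(provided_key: Option<&str>, expected_key: &str, audio_len: usize) -> u16 {
    // Reject requests without the expected secret first.
    // (A production handler would likely prefer a constant-time comparison.)
    match provided_key {
        Some(key) if key == expected_key => {}
        _ => return STATUS_UNAUTHORIZED,
    }
    // Refuse oversized uploads so they are never forwarded upstream.
    if audio_len > MAX_AUDIO_BYTES {
        return STATUS_PAYLOAD_TOO_LARGE;
    }
    STATUS_OK
}
```

Checking the key before the size keeps unauthenticated callers from learning anything about the limits; the 30-second timeout would sit on the HTTP client and is not modelled here.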
We all wanted this for a long time and attempted to implement it in different ways. I wonder if it's worth bringing it into the settings page, where you can enable the feature and then select from a list of providers, each of which will ask you for a token or reuse an existing one. That way you're not tied to OpenAI; you have 11 Labs and others in there too.
hear you. i did it this way out of ease and piggybacking off something already configured as a bonus. i'm not sure it's really that much better than local os dictation in the end. but putting it in settings and maybe allowing local or other providers (i'm not sure who there is besides 11labs) would be a good approach. if local, it's effectively just a shortcut.
    })?;

    let response = client
        .post("https://api.openai.com/v1/audio/transcriptions")
Should we check/use the OPENAI_HOST from the environment? Similar to goose/crates/goose/src/providers/openai.rs, lines 54 to 56 in c3acddc:

    let host: String = config
        .get_param("OPENAI_HOST")
        .unwrap_or_else(|_| "https://api.openai.com".to_string());
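One way the suggestion could be applied to the transcription route is sketched below. This is only an illustration: it reads `OPENAI_HOST` straight from the process environment rather than through the project's `config.get_param`, and the helper names `transcription_url` and `transcription_url_from_env` are hypothetical.

```rust
/// Default host, matching the fallback used in the provider snippet above.
const DEFAULT_OPENAI_HOST: &str = "https://api.openai.com";

/// Joins a host with the transcription path, tolerating a trailing slash
/// on the configured host.
fn transcription_url(host: &str) -> String {
    format!("{}/v1/audio/transcriptions", host.trim_end_matches('/'))
}

/// Hypothetical helper: resolves the host from the OPENAI_HOST environment
/// variable and falls back to the public API when it is unset.
fn transcription_url_from_env() -> String {
    let host = std::env::var("OPENAI_HOST")
        .unwrap_or_else(|_| DEFAULT_OPENAI_HOST.to_string());
    transcription_url(&host)
}
```

With this shape, the hard-coded `.post("https://api.openai.com/v1/audio/transcriptions")` call would take the computed URL instead, so proxies and self-hosted compatible endpoints keep working.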
Have we considered pulling in https://github.com/openai/whisper (or https://github.com/m-bain/whisperX) directly? It runs exceedingly well locally ✨
local macos is pretty good. don't know if it warrants increasing project/binary size?
- Add microphone button to chat input with recording visualization
- Support both OpenAI Whisper and ElevenLabs speech-to-text
- Add Voice Dictation settings section with provider selection
- Implement secure API key storage for ElevenLabs
- Add real-time waveform visualization during recording
- Handle microphone permissions properly
- Add 25MB file size limit and 10-minute duration limit
- Support multiple audio formats (webm, mp3, mp4, m4a, wav)
- Feature is opt-in and disabled by default
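To make the "multiple audio formats" item concrete, here is a small sketch of how a backend might map an uploaded recording's MIME type to one of the five listed extensions, for example when naming the file part forwarded to the provider. The helper `extension_for_mime` and the exact set of MIME aliases it accepts are assumptions for illustration, not taken from the PR.

```rust
/// Hypothetical helper: maps a MIME type to one of the supported file
/// extensions (webm, mp3, mp4, m4a, wav), or `None` if it is unsupported.
fn extension_for_mime(mime: &str) -> Option<&'static str> {
    // Drop parameters such as ";codecs=opus" and normalise the case.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    match essence.as_str() {
        "audio/webm" => Some("webm"),
        "audio/mpeg" | "audio/mp3" => Some("mp3"),
        "audio/mp4" => Some("mp4"),
        "audio/m4a" | "audio/x-m4a" => Some("m4a"),
        "audio/wav" | "audio/x-wav" | "audio/wave" => Some("wav"),
        _ => None,
    }
}
```

Browser recorders often report a type with codec parameters (for instance `audio/webm;codecs=opus`), which is why the parameters are stripped before matching.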
done
baxen
left a comment
Looks great!
VOICE_DICTATION_PR.md
Outdated
    @@ -0,0 +1,80 @@
    # Voice Dictation Feature - PR Summary
nit: going to exclude this from the commit just to keep the top level of the code clean here - this is pretty well covered by inline comments
ah yes, forgot to remove that. sorry.
ui/desktop/src/main.ts
Outdated
      }
    });

    // Handle macOS dictation
nit: also planning to remove this, this looks like a leftover from a different attempt maybe, it appears unreachable to me at the moment. please let me know if i got that wrong!
oops! again you're right. left over. thank you.
Stunning! We just built something similar in our Meeting app: Multi-provider STT (Whisper + ElevenLabs + Web Speech fallback) with all processing moved to secure backend routes. Currently testing, but early results show: no vendor lock-in, zero client-side API exposure, smart caching should cut redundant calls significantly. Your settings-based provider selection is exactly right - gives users choice while keeping tokens secure. The "11 labs and others" flexibility is gold for production apps. One insight from building: 3-layer fallbacks seem essential since STT services can be unreliable during peak times. Solid work! 🔥
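The layered fallback described in this comment can be sketched as an ordered list of providers that are tried until one succeeds. Everything here is hypothetical: `transcribe_with_fallback` and the provider closures are stand-ins, the real calls would be asynchronous HTTP requests, and this PR itself does not implement fallback.

```rust
/// Tries each provider in order and returns the name of the first one that
/// succeeds together with its transcript. If every provider fails, the
/// collected error messages are returned instead.
fn transcribe_with_fallback<'a>(
    providers: &[(&'a str, &dyn Fn(&[u8]) -> Result<String, String>)],
    audio: &[u8],
) -> Result<(&'a str, String), Vec<String>> {
    let mut errors = Vec::new();
    for (name, transcribe) in providers {
        match transcribe(audio) {
            // First success wins; later providers are never called.
            Ok(text) => return Ok((*name, text)),
            // Remember why this layer failed, then move on to the next one.
            Err(err) => errors.push(format!("{}: {}", name, err)),
        }
    }
    Err(errors)
}
```

Returning the provider name alongside the transcript makes it easy to log which layer actually served a request, which helps when judging how often the fallbacks are needed.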
* upstream/main:
  Add a reference for recipes (block#3099)
  feat: add voice dictation using OpenAI Whisper & ElevenLabs (block#3079)
  feat: new cli provider for claude code and gemini (block#3083)
  you forgot the important ones! (block#3105)
  hotfix: fix build (block#3102)
  Richer tool call ui messages (block#3104)
  Update linux instructions (block#3087)
* 'main' of github.com:block/goose:
  Fix clippy + test errors (#3120)
  Update goose help to include cli (#3095)
  add scheduler type setting (#3119)
  Add a reference for recipes (#3099)
  feat: add voice dictation using OpenAI Whisper & ElevenLabs (#3079)
  feat: new cli provider for claude code and gemini (#3083)
  you forgot the important ones! (#3105)
  hotfix: fix build (#3102)
  Richer tool call ui messages (#3104)
  Update linux instructions (#3087)
* origin/main:
  Added announcement modal (#3098)
  build: Add `just` to Hermit, correct ui/desktop's README (#3116)
  fix: Make the entire toolcall argument row clickable to expand (#3118)
  Fix clippy + test errors (#3120)
  Update goose help to include cli (#3095)
  add scheduler type setting (#3119)
  Add a reference for recipes (#3099)
  feat: add voice dictation using OpenAI Whisper & ElevenLabs (#3079)
  feat: new cli provider for claude code and gemini (#3083)
  you forgot the important ones! (#3105)
  hotfix: fix build (#3102)
  Richer tool call ui messages (#3104)
  Update linux instructions (#3087)
  Add flag for showing cost tracking (#3090)
  Improve config file editing and recovery fallback mechanisms (#3082)
Co-authored-by: jack <> Signed-off-by: Soroosh <[email protected]>
Co-authored-by: jack <>
Voice Dictation Feature - PR Summary
Overview
This PR adds voice dictation functionality to Goose Desktop, allowing users to input messages using their microphone with support for both OpenAI Whisper and ElevenLabs speech-to-text services.
Key Features
1. Voice Input UI
2. Dual Provider Support
3. Settings & Configuration
4. Technical Implementation
Backend (Rust)
- /audio/transcribe endpoint for OpenAI Whisper
- /audio/transcribe/elevenlabs endpoint for ElevenLabs
- /audio/config endpoint to check provider availability
Frontend (TypeScript)
- useWhisper hook for recording management
- useDictationSettings hook for settings persistence
- WaveformVisualizer component for audio feedback
5. Security & Privacy
File Changes
New Files
- crates/goose-server/src/routes/audio.rs - Audio transcription endpoints
- ui/desktop/src/hooks/useWhisper.ts - Recording and transcription logic
- ui/desktop/src/hooks/useDictationSettings.ts - Settings management
- ui/desktop/src/components/settings/dictation/DictationSection.tsx - Settings UI
- ui/desktop/src/components/WaveformVisualizer.tsx - Audio visualization
Modified Files
- ui/desktop/src/components/ChatInput.tsx - Added microphone button
- ui/desktop/src/components/settings/SettingsView.tsx - Added dictation section
- ui/desktop/src/main.ts - Added microphone permission handling
- ui/desktop/src/preload.ts - Exposed permission APIs
Testing
Future Enhancements
Breaking Changes
None - Feature is disabled by default and requires user opt-in.
Screenshots