Add Support for Audio File Transcription #745
base: main
Conversation
```ts
const response = await openai.audio.transcriptions.create({
  file,
  model: "whisper-1",
```
I know @humphd said to get this working with OpenAI. Don't do it this way; follow the same logic as the other code that uses getSpeechToTextModel.
We should not be hardcoding model names in code; instead, we should use utility functions to determine model capabilities.
The way @Amnish04 restructured the code, even if you transcribe an audio file while using some OpenRouter model like Claude, it will go and find the free Groq or paid OpenAI Whisper model and use that. This code should reuse the same logic.
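For illustration, a minimal sketch of what the dynamic selection could look like. The getSpeechToTextClient/getSpeechToTextModel signatures below are assumptions made for this example; the actual utilities in the codebase may be shaped differently.

```ts
import OpenAI from "openai";

// Assumed shapes for this sketch: the real utilities may differ. The idea is
// that they resolve an OpenAI-compatible client and the speech-to-text model id
// that is actually available (e.g. free Groq Whisper or paid OpenAI Whisper).
declare function getSpeechToTextClient(): Promise<OpenAI | undefined>;
declare function getSpeechToTextModel(): Promise<string>;

// Transcribe an uploaded audio file using whichever STT provider is configured,
// instead of hardcoding "whisper-1".
export async function transcribeAudioFile(file: File): Promise<string> {
  const client = await getSpeechToTextClient();
  if (!client) {
    throw new Error("No speech-to-text provider is configured");
  }

  const model = await getSpeechToTextModel();
  const response = await client.audio.transcriptions.create({ file, model });
  return response.text;
}
```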
@tarasglek So basically, the only change required is that we select the model dynamically instead of hardcoding it, right?
yes
and try to reuse the same codepath that the existing voice-input transcription already uses
Yep. Shouldn't that be a separate issue, though? I think refactoring that would make the scope of this one way larger than it should be.
Sure, do it in a separate PR before this.
Alright @humphd, I'll send in a PR for the refactoring as soon as I get some time.
Refactoring in this case would involve some careful design choices, as we'd need to extract the logic from the hook. Since hooks can only be used within React components, and the external function we're calling is not a React component, we'd need to restructure the code to accommodate this limitation, as I mentioned earlier.
@AryanK1511 Why don't you move the speechToText function inside a hook in that case? It would be pretty similar to how we do text to speech. Currently, isSpeechToTextSupported and getSpeechToTextClient are packed inside useModels, but we could move them to a use-speech-to-text hook following a similar pattern. This would hide all the logic of getting a speech-to-text client from the places where we use it.
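If it helps, here is a rough sketch of what that hook could look like. The import paths and the speechToText(client, audio) signature are assumptions for illustration only.

```ts
import { useCallback } from "react";
import { useModels } from "./use-models"; // assumed path of the existing hook
import { speechToText } from "../lib/speech-recognition"; // assumed path and signature

// Sketch of a use-speech-to-text hook mirroring the text-to-speech pattern:
// callers get a transcribe() function and a support flag, and never deal with
// clients or model names directly.
export function useSpeechToText() {
  const { isSpeechToTextSupported, getSpeechToTextClient } = useModels();

  const transcribe = useCallback(
    async (audio: File): Promise<string> => {
      const client = await getSpeechToTextClient();
      if (!client) {
        throw new Error("No speech-to-text provider is configured");
      }
      // Delegate to the existing speechToText() helper so that voice input and
      // audio file uploads share the same codepath.
      return speechToText(client, audio);
    },
    [getSpeechToTextClient]
  );

  return { isSpeechToTextSupported, transcribe };
}
```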
@Amnish04 Well, coming back to what I said earlier, putting the logic in hooks doesn't solve the problem, since then you won't be able to use these functions outside React components.
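One common way to reconcile both points (a sketch, not the PR's actual code) is to keep the core logic in a plain, framework-free function that non-React code like ai.ts can import directly, with the hook acting as a thin wrapper around it. All names and paths below are hypothetical.

```ts
// lib/speech-to-text.ts — plain module, safe to import from ai.ts or anywhere else
import OpenAI from "openai";

export async function transcribeWithClient(
  client: OpenAI,
  model: string,
  file: File
): Promise<string> {
  const { text } = await client.audio.transcriptions.create({ file, model });
  return text;
}
```

A use-speech-to-text hook would then just resolve the client and model from settings and call transcribeWithClient(), so React components and plain functions share one implementation.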
This PR fixes #722
This is done by introducing support for uploading audio files. Once an audio file (e.g., MP3) is uploaded, the chatbot transcribes it into text using OpenAI's Whisper API and displays the transcription in the chat. Users can then interact with the chatbot to discuss the content of the audio file, just like the existing feature for converting PDF files to Markdown.
Example Usage
For instance, if you upload an MP3 file of a podcast where someone says "Hello, tell me about the universe," the bot will transcribe the audio into text (e.g., "Hello, tell me about the universe") and add it to the chat. Users can then ask follow-up questions or have a conversation about the transcribed content.
Here's an example video demonstrating the feature. I uploaded an MP3 file where I recorded myself saying "Hello, tell me about the universe." The bot successfully transcribed the content, displayed it in the chat, and allowed further interaction.
Screen.Recording.2024-11-18.at.9.53.31.PM.mov
Implementation Details
As discussed earlier with @humphd in this issue comment, I modularized the implementation to ensure clarity and reusability:
An audioToText() function was created in ai.ts to handle transcription using OpenAI's Whisper API (API Reference).
Tasks Completed
Add return type to transcribe() function
Ensured type safety and better maintainability (commit).
Add support for audio file uploads
Updated input handling to accept MP3 and other audio file types (commit).
Update file import hooks
Modified the file import hook to handle audio files in the same way as PDF files, as per discussions with @humphd (commit); see the sketch after this list.
Add audioToText() function
Implemented the transcription logic using OpenAI Whisper (commit).
Test multiple file uploads
Verified that the system works correctly with both audio and PDF files.
Ensure backward compatibility
Tested the application to confirm that existing features remain unaffected by the changes.
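To illustrate the "Update file import hooks" task, here is a rough sketch of how audio files might be routed to audioToText() alongside the existing PDF path. Apart from audioToText() itself, the names and paths here are assumptions, not the PR's actual code.

```ts
import { audioToText } from "../lib/ai"; // audioToText() added in this PR; path assumed

// Hypothetical stand-in for the existing PDF-to-Markdown conversion.
declare function pdfToMarkdown(file: File): Promise<string>;

// Route an imported file to the right converter and return text for the chat.
async function importFileAsText(file: File): Promise<string> {
  if (file.type.startsWith("audio/")) {
    // MP3, WAV, etc.: transcribe the audio and return the transcription
    return audioToText(file);
  }
  if (file.type === "application/pdf") {
    // Existing behaviour: convert the PDF to Markdown
    return pdfToMarkdown(file);
  }
  // Fallback: treat everything else as plain text
  return file.text();
}
```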