Add Support for Audio File Transcription #745

Open
wants to merge 5 commits into
base: main

Conversation

AryanK1511

This PR fixes #722

This is done by introducing support for uploading audio files. Once an audio file (e.g., MP3) is uploaded, the chatbot transcribes it into text using OpenAI's Whisper API and displays the transcription in the chat. Users can then interact with the chatbot to discuss the content of the audio file, just like the existing feature for converting PDF files to Markdown.

Example Usage

For instance, if you upload an MP3 file of a podcast where someone says "Hello, tell me about the universe," the bot will transcribe the audio into text (e.g., "Hello, tell me about the universe") and add it to the chat. Users can then ask follow-up questions or have a conversation about the transcribed content.

Here's an example video demonstrating the feature. I uploaded an MP3 file where I recorded myself saying "Hello, tell me about the universe." The bot successfully transcribed the content, displayed it in the chat, and allowed further interaction.

Screen.Recording.2024-11-18.at.9.53.31.PM.mov

Implementation Details

As discussed earlier with @humphd in this issue comment, I modularized the implementation to ensure clarity and reusability:

  • A new audioToText() function was created in ai.ts to handle transcription using OpenAI's Whisper API (API Reference).
  • The code structure is now consistent with the way PDF files are processed, ensuring uniformity across the project.
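As a rough illustration of the first bullet, here is a minimal sketch of what an audioToText()-style helper could look like. The signature, the `TranscriptionClient` and `UploadedAudio` types, and the choice to pass the client and model in as parameters are assumptions made for this example, not the PR's actual code; the client shape is only modelled on the `audio.transcriptions.create` call from the OpenAI SDK.

```typescript
// Hypothetical minimal view of an uploaded file (a browser File has both fields).
interface UploadedAudio {
  name: string;
  type: string; // MIME type, e.g. "audio/mpeg" for MP3
}

// Assumed client shape, modelled on `openai.audio.transcriptions.create`.
interface TranscriptionClient {
  audio: {
    transcriptions: {
      create(params: { file: UploadedAudio; model: string }): Promise<{ text: string }>;
    };
  };
}

// Sketch only: the real audioToText() in ai.ts may take different arguments.
async function audioToText(
  client: TranscriptionClient,
  file: UploadedAudio,
  model: string
): Promise<string> {
  // Reject anything that is not audio before making a network call.
  if (!file.type.startsWith("audio/")) {
    throw new Error(`Expected an audio file, got "${file.type}"`);
  }
  const response = await client.audio.transcriptions.create({ file, model });
  return response.text;
}
```

Passing the client and model as arguments keeps the helper free of any hardcoded provider or model name, and makes it easy to exercise with a fake client.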

Tasks Completed

  • Add return type to transcribe() function
    Ensured type safety and better maintainability (commit).

  • Add support for audio file uploads
    Updated input handling to accept MP3 and other audio file types (commit).

  • Update file import hooks
    Modified the file import hook to handle audio files in the same way as PDF files, as per discussions with @humphd (commit).

  • Add audioToText() function
    Implemented the transcription logic using OpenAI Whisper (commit).

  • Test multiple file uploads
    Verified that the system works correctly with both audio and PDF files.

  • Ensure backward compatibility
    Tested the application to confirm that existing features remain unaffected by the changes.
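The "handle audio files in the same way as PDF files" task can be pictured as a dispatch on the file's MIME type. The sketch below is an assumption about the general shape, not the project's import hook: `fileToText`, `ImportedFile`, and the injected `pdfToMarkdown` / `audioToText` converters are hypothetical names used for illustration.

```typescript
// Hypothetical minimal view of an imported file.
interface ImportedFile {
  name: string;
  type: string; // MIME type reported by the browser
}

type Converter = (file: ImportedFile) => Promise<string>;

// Sketch: route a file to the matching converter based on its MIME type.
// The converters are injected, so this function has no dependency on any API.
async function fileToText(
  file: ImportedFile,
  converters: { pdfToMarkdown: Converter; audioToText: Converter }
): Promise<string> {
  if (file.type === "application/pdf") {
    return converters.pdfToMarkdown(file);
  }
  if (file.type.startsWith("audio/")) {
    return converters.audioToText(file);
  }
  throw new Error(`Unsupported file type "${file.type}" for ${file.name}`);
}
```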


const response = await openai.audio.transcriptions.create({
  file,
  model: "whisper-1",
Owner

@tarasglek tarasglek Nov 19, 2024

I know @humphd said to get this working with OpenAI, but don't do it this way. Just follow the same logic as the other code that uses getSpeechToTextModel.

We should not hardcode model names in code; instead, use utility functions to determine model capabilities.

Owner

@tarasglek tarasglek Nov 19, 2024

The way @Amnish04 restructured the code, even if you transcribe an audio file while using some OpenRouter model like Claude, it will go and find the free Groq or paid OpenAI Whisper model and use that. This code should reuse the same logic.
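To make the idea of capability-based selection concrete, here is a minimal sketch of picking a speech-to-text model across providers, preferring free ones. Everything in it is an assumption for illustration: `Provider`, `isSpeechToTextModel`, and `pickSpeechToTextModel` are hypothetical names, and matching on "whisper" in the model id is a simplification. The project's real getSpeechToTextModel may work differently.

```typescript
// Hypothetical description of a configured provider and the models it exposes.
interface Provider {
  name: string;
  isFree: boolean;
  models: string[];
}

// Assumption: a speech-to-text model is recognised by "whisper" in its id.
function isSpeechToTextModel(modelId: string): boolean {
  return modelId.toLowerCase().includes("whisper");
}

// Sketch: return the first speech-to-text model found, trying free providers
// before paid ones, or null when no configured provider offers one.
function pickSpeechToTextModel(
  providers: Provider[]
): { provider: string; model: string } | null {
  // Copy before sorting so the caller's array is left untouched.
  const ordered = [...providers].sort(
    (a, b) => Number(b.isFree) - Number(a.isFree)
  );
  for (const provider of ordered) {
    const model = provider.models.find(isSpeechToTextModel);
    if (model !== undefined) {
      return { provider: provider.name, model };
    }
  }
  return null;
}
```

With this shape, the caller never names a model directly: the chat can be using any provider, and transcription falls back to whichever configured provider actually supports it.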

Author

@tarasglek So basically, the only change required is that we somehow select the model dynamically instead of hardcoding it, right?

Owner

yes

Owner

and try to reuse the same code path that already does transcription for voice input

Author

Yep. So shouldn’t that be a separate issue? I think refactoring that would make the scope of this issue way larger than it should be.

Collaborator

Sure, do it in a separate PR before this.

Author

Alright @humphd, I'll send in a PR for the refactoring when I get some time.

Collaborator

Refactoring in this case would involve careful consideration and style choices, as we’d need to extract the logic from the hook. Since hooks can only be used within React components, and the external function we’re calling is not a React component, we’d need to restructure the code to accommodate this limitation as I talked about earlier.

@AryanK1511 Why don't you move the speechToText function inside a hook in that case? It would be pretty similar to how we do text to speech.

Currently, isSpeechToTextSupported and getSpeechToTextClient are packed inside useModels. But we could move them to a use-speech-to-text hook following a similar pattern. This would hide all the logic of getting a speech to text client from the places where we use it.

Author

@Amnish04 Well, coming back to what I said earlier, putting the logic in hooks doesn't solve the problem, since then you won't be able to use these functions outside React components.

Development

Successfully merging this pull request may close these issues.

Support importing/pasting/dropping audio files
4 participants