Text-to-speech (TTS) makes technology more accessible and inclusive for people around the world. This project fine-tunes Microsoft's SpeechT5 model to generate natural-sounding speech from Hindi text. Building on a transformer-based architecture, the fine-tuned model is intended to articulate words accurately, including words with complex Hindi pronunciations.
Such models are needed because Hindi-specific speech models remain scarce despite the language's widespread use. Fine-tuning a pre-trained model adapts it to Hindi and makes it usable in real-world applications such as voice assistants, audiobook generation, and interactive educational tools. 🎯
This repository provides what you need to replicate the project, evaluate the model, and use it for your own use cases: detailed instructions, code examples, usage guidelines, and sample audio outputs that showcase the model's capabilities.
This project fine-tunes Microsoft’s SpeechT5 model to generate natural-sounding speech in Hindi, one of the most widely spoken languages in the world. The model was trained on a dataset of Hindi text-audio pairs to convert input Hindi text into realistic, expressive speech.

The work covers the following steps:

- Dataset Preparation: Processing Hindi text and audio files to create training pairs.
- Preprocessing: Tokenizing text and extracting audio features (mel spectrograms).
- Fine-Tuning: Training SpeechT5 on the prepared dataset for Hindi-specific TTS.
- Inference & Testing: Generating speech from Hindi text inputs and evaluating outputs.
- Optimization: Implementing inference optimization techniques for faster speech generation.
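
The preprocessing step depends on the tokenizer being able to represent Devanagari characters. The sketch below shows one possible check before training: it collects every character used in a set of transcripts and reports those missing from a tokenizer vocabulary. The function name `find_unsupported_chars` and the toy vocabulary are illustrative assumptions and are not part of this repository; with the real processor, the vocabulary could be taken from `processor.tokenizer.get_vocab()`.

```python
def find_unsupported_chars(transcripts, vocab):
    """Return the sorted characters used in `transcripts` that `vocab` lacks.

    Spaces are ignored because tokenizers usually encode them separately.
    """
    dataset_chars = set()
    for text in transcripts:
        dataset_chars.update(text)
    dataset_chars.discard(" ")
    return sorted(dataset_chars - set(vocab))


# Toy vocabulary for illustration only (hypothetical, not the SpeechT5 vocabulary):
# it contains just the characters of one word.
toy_vocab = set("नमस्ते")

missing = find_unsupported_chars(["नमस्ते", "नमस्ते भारत"], toy_vocab)
print("Characters missing from the vocabulary:", missing)
```

Characters reported as missing would need to be handled before fine-tuning, for example by mapping them to supported characters or by extending the tokenizer.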

The repository includes:

- 📂 Source Code for Fine-Tuning
- 🎧 Audio Samples Generated by the Model
- 🛠️ Usage Instructions with Examples
- 📊 Evaluation Results & Insights

Key features:

- Accurate Pronunciation: Fine-tuned to handle complex phonetics and Hindi-specific nuances.
- Natural Speech: Produces clear and lifelike speech outputs.
- Flexible Usage: Easily integrated into applications for real-time TTS.
- Customizable: You can further fine-tune or optimize the model for specific tasks.

Possible applications:

- Voice Assistants 🤖
  - Creating Hindi-speaking AI assistants like Alexa, Google Assistant, etc.
- Educational Tools 📖
  - Developing learning tools for regional students and visually impaired individuals.
- Audiobooks & Podcasts 🎧
  - Generating Hindi audiobooks and content for entertainment or education.
- Content Localization 🌏
  - Localizing advertisements, videos, and digital platforms for Hindi-speaking audiences.
- Accessibility Tools ♿
  - Providing speech solutions for text accessibility.

Follow these steps to use the model:
git lfs install
git clone https://huggingface.co/Saurabh1207/Hindi_SpeechT5_finetuned
Make sure you have Python and the necessary libraries installed.
Run the following:
pip install git+https://github.com/huggingface/transformers.git accelerate datasets soundfile speechbrain torch
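
To confirm that the installation succeeded before running the example, a short check like the following can help. It is only a sketch: the helper name `missing_modules` is hypothetical, and the module list simply mirrors the packages named in the install command above.

```python
import importlib.util

# Import names of the packages listed in the pip command.
REQUIRED_MODULES = ["transformers", "accelerate", "datasets", "soundfile", "speechbrain", "torch"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


not_found = missing_modules()
print("Missing modules:", ", ".join(not_found) if not_found else "none")
```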
Here’s an example of how you can use the model to generate speech from Hindi text:
import os

import soundfile as sf
import torch
from datasets import Audio, load_dataset
from IPython.display import Audio as AudioPlayer
# Newer SpeechBrain releases also expose this class under speechbrain.inference
from speechbrain.pretrained import EncoderClassifier
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Load the processor, the fine-tuned model, and the HiFi-GAN vocoder
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("Saurabh1207/Hindi_SpeechT5_finetuned")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load the x-vector speaker encoder used to compute speaker embeddings
spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
device = "cuda" if torch.cuda.is_available() else "cpu"
speaker_model = EncoderClassifier.from_hparams(
    source=spk_model_name,
    run_opts={"device": device},
    savedir=os.path.join("/tmp", spk_model_name),
)


def create_speaker_embedding(waveform):
    """Return a normalized speaker embedding of shape (1, 512) as a CPU tensor."""
    with torch.no_grad():
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform, dtype=torch.float32))
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
    # Add a leading batch dimension, which generate_speech expects
    return speaker_embeddings.squeeze().cpu().unsqueeze(0)


# Load a sample from the dataset for the speaker embedding
try:
    dataset = load_dataset("mozilla-foundation/common_voice_17_0", "hi", split="validated", trust_remote_code=True)
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    sample = dataset[0]
    speaker_embeddings = create_speaker_embedding(sample["audio"]["array"])
except Exception as e:
    print(f"Error loading dataset: {e}")
    # Use a random speaker embedding as a fallback (the voice will sound less natural)
    speaker_embeddings = torch.randn(1, 512)

# Define input text in Hindi
input_text = "नमस्ते, यह हिंदी टेक्स्ट टू स्पीच मॉडल का परीक्षण है।"

# Preprocess text
inputs = processor(text=input_text, return_tensors="pt")

# Generate speech
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# Save audio file
sf.write("hindi_output.wav", speech.cpu().numpy(), 16000)
print("Hindi speech generated and saved as 'hindi_output.wav'")

# Play the audio inline when running in a notebook
AudioPlayer(speech.cpu().numpy(), rate=16000)
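
SpeechT5 accepts only a limited number of input tokens per call, so longer passages are usually synthesized in pieces. The sketch below shows one way to split Hindi text at sentence boundaries (the danda `।`, `?`, and `!`) and pack the sentences into chunks. The helper name `split_hindi_sentences` and the default `max_chars=200` are assumptions chosen for illustration, not values taken from this project.

```python
import re


def split_hindi_sentences(text, max_chars=200):
    """Split text after a danda, '?' or '!' and pack sentences into chunks.

    A chunk exceeds `max_chars` only when a single sentence is longer than that.
    """
    # Split right after sentence-ending punctuation, dropping the whitespace that follows.
    parts = re.split(r"(?<=[।?!])(?![।?!])\s*", text)
    sentences = [part.strip() for part in parts if part.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_text = "नमस्ते। आप कैसे हैं? मैं ठीक हूँ।"
for chunk in split_hindi_sentences(long_text, max_chars=20):
    print(chunk)
```

Each chunk can then be passed through `processor` and `model.generate_speech` as in the example above, and the resulting waveforms joined with `torch.cat` before saving.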
Here are some audio samples generated by the Hindi TTS model:
- Sample 1: Play or Download
- Sample 2: Play or Download
- Sample 3: Play or Download
For advanced users, fine-tuning the model on a custom dataset is simple:
- Prepare a dataset with Hindi text and corresponding audio files.
- Use the Hugging Face SpeechT5ForTextToSpeech API to fine-tune the model further.
- Save and test the optimized model for improved results.
Refer to Hugging Face's documentation for details: SpeechT5 Fine-Tuning Guide.
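
As a starting point for the dataset preparation step, the sketch below pairs audio files with their transcripts. It assumes a hypothetical folder layout in which each `<name>.wav` sits next to a UTF-8 `<name>.txt` file; neither the layout nor the helper name `pair_text_and_audio` comes from this repository.

```python
from pathlib import Path


def pair_text_and_audio(data_dir):
    """Pair each `<name>.wav` in `data_dir` with its `<name>.txt` transcript.

    Returns a list of {"audio": path, "text": transcript} records and skips
    audio files whose transcript is missing or empty.
    """
    records = []
    for wav_path in sorted(Path(data_dir).glob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if not txt_path.exists():
            continue
        text = txt_path.read_text(encoding="utf-8").strip()
        if text:
            records.append({"audio": str(wav_path), "text": text})
    return records
```

The resulting records could then be loaded with `datasets.Dataset.from_list(records)` and the `audio` column cast to `Audio(sampling_rate=16000)`, matching the sampling rate used in the inference example.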
Explore the official resources for models and libraries used in this project:
- SpeechT5 Overview: Microsoft SpeechT5 on Hugging Face
- Transformers Library: Hugging Face Transformers
- PyTorch: PyTorch Documentation
The fine-tuned Hindi TTS model achieves high performance in subjective and objective evaluations, focusing on:
- Pronunciation accuracy
- Speech naturalness
- Inference speed
Results demonstrate significant improvements over pre-trained models on Hindi datasets.
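
One common way to quantify inference speed is the real-time factor (RTF): synthesis time divided by the duration of the generated audio. The sketch below is an assumption about a reasonable metric, not necessarily the one used for the results above; `real_time_factor` and the stand-in `fake_synthesize` function are hypothetical names used only for illustration.

```python
import time


def real_time_factor(synthesize, text, sampling_rate=16000):
    """Return (rtf, audio_seconds) for one call of `synthesize(text)`.

    `synthesize` must return a 1-D sequence of audio samples.
    An RTF below 1.0 means speech is generated faster than it plays back.
    """
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sampling_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio samples")
    return elapsed / audio_seconds, audio_seconds


# Stand-in synthesizer for illustration: returns one second of silence.
def fake_synthesize(text):
    return [0.0] * 16000


rtf, seconds = real_time_factor(fake_synthesize, "नमस्ते")
print(f"RTF = {rtf:.4f} for {seconds:.1f} s of audio")
```

With the real model, `synthesize` could wrap the `processor` and `model.generate_speech` calls from the usage example, since `len()` also works on the 1-D tensor they return.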
- Expand the dataset for better coverage of accents and dialects.
- Integrate quantization for faster real-time inference.
- Optimize the model for deployment on low-resource devices.
Contributions are welcome!
If you find any issues or have suggestions, feel free to open an issue or pull request.
For queries or support, please reach out:
- Email: [email protected]
- LinkedIn: Saurabh's LinkedIn Profile