Saurabh-Kumar-0/Text_To_Speech_Model_For_Regional_Language

🎙️ Hindi Text-to-Speech (TTS) Fine-Tuned Model 🇮🇳

Text-to-speech makes technology more accessible and inclusive for people across the globe. This project fine-tunes Microsoft's SpeechT5, a transformer-based model, to generate natural-sounding speech from Hindi text, with particular attention to words that are difficult to pronounce correctly.

Hindi-specific speech models remain scarce despite the language's widespread use. Fine-tuning a pre-trained model adapts it to Hindi and makes it usable in real-world applications such as voice assistants, audiobook generation, and interactive educational tools. 🎯

This repository provides what you need to replicate the project, evaluate the model, and use it in your own applications: detailed instructions, code examples, usage guidelines, and sample audio outputs that showcase the model's capabilities.


📋 Project Overview

This project fine-tunes Microsoft’s SpeechT5 model to generate natural-sounding speech in Hindi, one of the most spoken languages in the world. The model was trained on a dataset of Hindi text-audio pairs to convert input Hindi text into realistic and expressive speech.

Implementation Steps:

  1. Dataset Preparation: Processing Hindi text and audio files to create training pairs.
  2. Preprocessing: Tokenizing text and extracting audio features (mel spectrograms).
  3. Fine-Tuning: Training SpeechT5 on the prepared dataset for Hindi-specific TTS.
  4. Inference & Testing: Generating speech from Hindi text inputs and evaluating outputs.
  5. Optimization: Implementing inference optimization techniques for faster speech generation.
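
Step 2 depends on the text being in a consistent form before it reaches the tokenizer. The cleanup rules actually used in this project are not shown here, so the following is only a minimal sketch of one plausible normalization pass. The helper name `normalize_hindi_text` and the specific rules (Unicode NFC normalization, mapping Devanagari digits to ASCII, collapsing whitespace) are assumptions made for illustration.

```python
import re
import unicodedata

# Devanagari digits mapped to ASCII digits.
# This mapping is an illustrative assumption, not a documented rule of the project.
_DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")


def normalize_hindi_text(text: str) -> str:
    """Return a cleaned copy of `text` for tokenization (hypothetical helper)."""
    # NFC puts canonically equivalent character sequences into one standard form,
    # so the same word is always encoded the same way.
    text = unicodedata.normalize("NFC", text)
    # Use a single digit script throughout.
    text = text.translate(_DEVANAGARI_DIGITS)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Whether digits should instead be spelled out as Hindi words, or other characters replaced, depends on the tokenizer's vocabulary and the training data, so these rules would need adjusting for a real pipeline.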

What You’ll Find in This Repository:

  • 📂 Source Code for Fine-Tuning
  • 🎧 Audio Samples Generated by the Model
  • 🛠️ Usage Instructions with Examples
  • 📊 Evaluation Results & Insights

🛠️ Features

  • Accurate Pronunciation: Fine-tuned to handle complex phonetics and Hindi-specific nuances.
  • Natural Speech: Produces clear and lifelike speech outputs.
  • Flexible Usage: Easily integrated into applications for real-time TTS.
  • Customizable: You can further fine-tune or optimize the model for specific tasks.

📚 Applications 🌟

  1. Voice Assistants 🤖
    • Building Hindi-speaking AI assistants in the style of Alexa or Google Assistant.
  2. Educational Tools 📖
    • Developing learning tools for regional students and visually impaired individuals.
  3. Audiobooks & Podcasts 🎧
    • Generating Hindi audiobooks and content for entertainment or education.
  4. Content Localization 🌏
    • Localizing advertisements, videos, and digital platforms for Hindi-speaking audiences.
  5. Accessibility Tools
    • Providing speech output to make text accessible.

🔧 Setup & Installation

Follow these steps to use the model:

1. Clone the Repository

git lfs install
git clone https://huggingface.co/Saurabh1207/Hindi_SpeechT5_finetuned

2. Install Dependencies

Make sure Python is installed, then install the required libraries:

pip install git+https://github.com/huggingface/transformers.git accelerate datasets soundfile speechbrain torch

3. Inference Code: Generate Hindi Speech

Here’s an example of how you can use the model to generate speech from Hindi text:

import os
import torch
import soundfile as sf
from datasets import Audio, load_dataset
from IPython.display import Audio as PlayAudio
from speechbrain.pretrained import EncoderClassifier
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# Load the processor, the fine-tuned model, and the vocoder
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("Saurabh1207/Hindi_SpeechT5_finetuned")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load the speaker-embedding model
spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
device = "cuda" if torch.cuda.is_available() else "cpu"
speaker_model = EncoderClassifier.from_hparams(source=spk_model_name, run_opts={"device": device}, savedir=os.path.join("/tmp", spk_model_name))

def create_speaker_embedding(waveform):
    # Encode a waveform into a normalized speaker embedding of shape (1, 512)
    with torch.no_grad():
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
    return speaker_embeddings.squeeze().cpu().unsqueeze(0)

# Load a sample from the dataset for speaker embedding
try:
    dataset = load_dataset("mozilla-foundation/common_voice_17_0", "hi", split="validated", trust_remote_code=True)
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    sample = dataset[0]
    speaker_embeddings = create_speaker_embedding(sample["audio"]["array"])
except Exception as e:
    print(f"Error loading dataset: {e}")
    # Use a random speaker embedding as fallback
    speaker_embeddings = torch.randn(1, 512)

# Define input text in Hindi
input_text = "नमस्ते, यह हिंदी टेक्स्ट टू स्पीच मॉडल का परीक्षण है।"

# Preprocess text
inputs = processor(text=input_text, return_tensors="pt")

# Generate speech
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# Save audio file
sf.write("hindi_output.wav", speech.cpu().numpy(), 16000)
print("Hindi speech generated and saved as 'hindi_output.wav'")

# Play the audio (works when run as the last expression of a notebook cell)
PlayAudio(speech.cpu().numpy(), rate=16000)
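
The example above synthesizes a single short sentence. For longer passages, one common approach is to split the text into sentences and synthesize each one separately. This repository does not document that step, so the following is only a sketch of the idea. The helper name `split_hindi_sentences` is hypothetical, and the choice of sentence delimiters (the danda `।`, `?`, and `!`) is an assumption.

```python
import re

# Split at whitespace that follows a danda (।), question mark, or exclamation mark.
# The lookbehind keeps the punctuation attached to the sentence it ends.
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_hindi_sentences(text: str) -> list[str]:
    """Split `text` into sentences, keeping closing punctuation (hypothetical helper)."""
    parts = _SENTENCE_END.split(text.strip())
    # Drop empty strings, which appear when the input itself is empty.
    return [part for part in parts if part]
```

Each returned sentence could then be passed through `processor` and `model.generate_speech` as in the example above, with the resulting waveforms concatenated before saving.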

🎧 Sample Outputs

Here are some audio samples generated by the Hindi TTS model:

  1. Sample 1: Play or Download
  2. Sample 2: Play or Download
  3. Sample 3: Play or Download

🚀 How to Fine-Tune Further?

For advanced users, fine-tuning the model on a custom dataset is simple:

  1. Prepare a dataset with Hindi text and corresponding audio files.
  2. Use the Hugging Face SpeechT5ForTextToSpeech API to fine-tune the model further.
  3. Save and test the optimized model for improved results.

Refer to Hugging Face's documentation for details: SpeechT5 Fine-Tuning Guide.
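
Step 1 above amounts to building a list of (audio file, transcript) pairs. As a rough sketch only: the code below assumes a hypothetical tab-separated metadata listing with one `filename<TAB>transcript` row per clip, which is not a layout this repository specifies. The names `TrainingPair` and `load_pairs` are made up for illustration.

```python
import csv
from pathlib import Path
from typing import Iterable, NamedTuple


class TrainingPair(NamedTuple):
    """One training example: where the audio lives and what is spoken in it."""
    audio_path: Path
    text: str


def load_pairs(lines: Iterable[str], audio_dir: Path) -> list[TrainingPair]:
    """Parse `filename<TAB>transcript` rows into pairs (hypothetical helper).

    Rows with fewer than two fields, an empty filename, or an empty
    transcript are skipped. File existence is not checked here.
    """
    pairs = []
    # QUOTE_NONE treats quotation marks inside transcripts as ordinary text.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue
        filename, text = row[0].strip(), row[1].strip()
        if filename and text:
            pairs.append(TrainingPair(audio_dir / filename, text))
    return pairs
```

Because `csv.reader` accepts any iterable of strings, `lines` can be an open file handle (for example `open("metadata.tsv", encoding="utf-8", newline="")`, where the filename is again an assumption) or a plain list of strings.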


🌐 Documentation Links

Explore the official resources for models and libraries used in this project:


📊 Performance Metrics

The fine-tuned Hindi TTS model was assessed through subjective and objective evaluations, focusing on:

  • Pronunciation accuracy
  • Speech naturalness
  • Inference speed

Results indicate improvements over the pre-trained SpeechT5 model on Hindi data.
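
Of the three criteria, inference speed is the most straightforward to quantify. One widely used measure is the real-time factor (RTF): synthesis time divided by the duration of the generated audio, so values below 1 mean the model runs faster than real time. The measure used for this project is not stated, so treat the following as an illustrative sketch. `real_time_factor` is a hypothetical helper, and its 16 kHz default matches the sampling rate used in the inference example.

```python
def real_time_factor(num_samples: int, elapsed_seconds: float, sample_rate: int = 16000) -> float:
    """Return synthesis time divided by audio duration (hypothetical helper).

    num_samples: length of the generated waveform in samples.
    elapsed_seconds: wall-clock time spent generating it.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds
```

In practice, `elapsed_seconds` could come from wrapping the `model.generate_speech` call with `time.perf_counter()`, and `num_samples` from the length of the returned waveform.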


💡 Future Improvements

  1. Expand the dataset for better coverage of accents and dialects.
  2. Integrate quantization for faster real-time inference.
  3. Optimize the model for deployment on low-resource devices.

🤝 Contributing

Contributions are welcome!
If you find any issues or have suggestions, feel free to open an issue or pull request.


🧑‍💻 Contact & Support

For queries or support, please reach out:


If you like this project, don’t forget to give it a star!


About

Fine Tuned TTS model on different Languages
