---
layout: default
title: Emotion-Aware TTS
---

## Authors

Suparna De
Email: [email protected]

Ionut Bostan
Email: [email protected]

Nishanth Sastry
Email: [email protected]


## Abstract

Recent studies have outlined the accessibility challenges that blind and visually impaired people face in interacting with social networks, where monotone text-to-speech (TTS) screen readers must also narrate visual elements such as emojis. Emotional speech generation has traditionally relied on human input of the expected emotion alongside the text to be synthesised, with additional challenges around data simplification (causing information loss) and duration inaccuracy, leading to a lack of expressive emotional rendering. In real-life communication, phoneme durations vary, since the same sentence may be spoken in many ways depending on the speaker's emotional state or accent (referred to as the one-to-many problem of text-to-speech generation); an advanced voice synthesis system must therefore account for this unpredictability. We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system that derives the conveyed emotion from the text input and synthesises audio focused on emotion and speaker characteristics for natural and expressive speech, integrating advanced natural language processing (NLP) and speech synthesis techniques for real-time applications. The proposed system has two core components: an emotion classifier and a speech synthesiser. The emotion classifier uses a classification model to extract sentiment information from the input text. The speech synthesiser, a non-autoregressive neural TTS model, generates Mel-spectrograms by incorporating speaker and emotion embeddings derived from the classifier's output. A Generative Adversarial Network (GAN)-based vocoder then converts the Mel-spectrograms into audible waveforms. One of our key contributions lies in effectively incorporating emotional characteristics into TTS synthesis. The system also shows competitive inference-time performance when benchmarked against state-of-the-art TTS models, making it suitable for real-time accessibility applications.
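To make the two-stage pipeline concrete, the sketch below wires a toy emotion classifier to a toy non-autoregressive synthesiser: the classifier predicts an emotion label from the text, and that label conditions Mel-spectrogram generation through an additive emotion embedding alongside a speaker embedding. Every name, dimension, and the five-label emotion set here are illustrative assumptions rather than our actual model, and the GAN vocoder stage (Mel frames to waveform) is omitted.

```python
# Minimal sketch of the pipeline described above. Module names, dimensions,
# and the emotion label set are hypothetical stand-ins, not the released code.
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprise"]  # assumed label set

class EmotionClassifier(nn.Module):
    """Maps token IDs to logits over emotion labels."""
    def __init__(self, vocab_size=256, dim=64, n_emotions=len(EMOTIONS)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, n_emotions)

    def forward(self, token_ids):
        h = self.embed(token_ids).mean(dim=1)  # crude pooled sentence vector
        return self.head(h)                    # emotion logits

class Synthesiser(nn.Module):
    """Non-autoregressive backbone: text + speaker/emotion embeddings -> Mel frames."""
    def __init__(self, vocab_size=256, dim=64, n_mels=80,
                 n_speakers=4, n_emotions=len(EMOTIONS)):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.spk_embed = nn.Embedding(n_speakers, dim)
        self.emo_embed = nn.Embedding(n_emotions, dim)
        self.decoder = nn.Linear(dim, n_mels)

    def forward(self, token_ids, speaker_id, emotion_id):
        h = self.text_embed(token_ids)                  # (B, T, dim)
        h = h + self.spk_embed(speaker_id)[:, None, :]  # add speaker embedding
        h = h + self.emo_embed(emotion_id)[:, None, :]  # add emotion embedding
        return self.decoder(h)                          # (B, T, n_mels) Mel frames

# End-to-end inference: classify the emotion, then condition synthesis on it.
tokens = torch.randint(0, 256, (1, 12))       # stand-in for a phoneme sequence
emotion_id = EmotionClassifier()(tokens).argmax(-1)
mel = Synthesiser()(tokens, torch.tensor([0]), emotion_id)
print(EMOTIONS[emotion_id.item()], mel.shape)  # e.g. "sad" torch.Size([1, 12, 80])
```

The design point this illustrates is that the classifier's output conditions the synthesiser rather than being supplied by a human, which is what lets the same input text map to different expressive renderings.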

This work has been accepted for presentation at the 16th International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2024), which will be held from September 2-5, 2024, in Calabria, Italy.


## Demo

Welcome to the demonstration page for our emotion-aware text-to-speech models. Below, you can listen to audio samples from the different TTS models.

| Description | FastSpeech 2 [1] | TEMOTTS [2] | Our Model |
| --- | --- | --- | --- |
| Bikes are fun to ride | *(audio sample)* | *(audio sample)* | *(audio sample)* |
| Dreams can come true | *(audio sample)* | *(audio sample)* | *(audio sample)* |
| Friends make life more fun | *(audio sample)* | *(audio sample)* | *(audio sample)* |

### Emotion-Aware Samples

| Description | FastSpeech 2 [1] | TEMOTTS [2] | Our Model |
| --- | --- | --- | --- |
| Blowing out birthday candles makes me feel special! | *(audio sample)* | *(audio sample)* | *(audio sample)* |
| Her heart felt heavy with sorrow | *(audio sample)* | *(audio sample)* | *(audio sample)* |
| I am feeling sad | *(audio sample)* | *(audio sample)* | *(audio sample)* |

## References


1. Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech," in International Conference on Learning Representations (ICLR), 2021.
2. Shreeram Suresh Chandra, Zongyang Du, and Berrak Sisman, "TEMOTTS: Text-aware Emotional Text-to-Speech with No Labels," Speech & Machine Learning Lab, The University of Texas at Dallas, TX, USA, 2024.