A Survey on Speech Generation Research.
Style Voice Conversion by Natural Language Prompts.
- (2023) Towards General-Purpose Text-Instruction-Guided Voice Conversion [paper] [demo]
- (2023) PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts
Style Voice Conversion by Reference Speech.
- (2023) HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation by Hierarchical Variational Inference for Zero-shot Speech Synthesis
- (2023) PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts
Style Specified by a Text Prompt.
- (2023) PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions
- (2023) PromptTTS 2: Describing and Generating Voices with Text Prompt
- (2023) PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions
- (2023) TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models
- (2023) InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt
- (2022) PromptTTS: Controllable Text-to-Speech with Text Descriptions
Style Specified by an Audio (Speech) Prompt
- (2023) HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation by Hierarchical Variational Inference for Zero-shot Speech Synthesis
- (2023) Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
- (2023) Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
- (2023) SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
- (2023) VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation [paper]
- (2023) SoundStorm: Efficient Parallel Audio Generation
- (2023) Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)
- (2023) Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (VALL-E X)
- (2023) Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
Style Specified by a Text Prompt
- (2023) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
- (2023) Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
- (2023) AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
- (2023) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
- (2022) AudioGen: Textually Guided Audio Generation
Style Specified by a Text Prompt
- (2023) DAC: High-Fidelity Audio Compression with Improved RVQGAN
- (2023) AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec
- (2023) From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
- (2022) High Fidelity Neural Audio Compression
- (2021) SoundStream: An End-to-End Neural Audio Codec [paper]
- (2023) Prompting Large Language Models with Speech Recognition Abilities
- (2023) On decoder-only architecture for speech-to-text and large language model integration
- (2023) LMs with a Voice: Spoken Language Modeling beyond Speech Tokens
- (2023) AudioPaLM: A Large Language Model That Can Speak and Listen
- (2023) Listen, Think, and Understand
- (2023) SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
- SALMONN: Speech Audio Language Music Open Neural Network
- (2023) EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
- (2022) SpeechLMScore: Evaluating speech generation using speech language model