kuan2jiu99/Awesome-Speech-Generation

Awesome-Speech-Generation

A survey of speech generation research.

Text-Guided Voice Conversion

Style Voice Conversion by Natural Language Prompts.

  • (2023) Towards General-Purpose Text-Instruction-Guided Voice Conversion

    [paper] [demo]

  • (2023) PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

    [paper] [demo]

Audio-Guided Voice Conversion

Style Voice Conversion by Reference Speech.

  • (2023) HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation by Hierarchical Variational Inference for Zero-shot Speech Synthesis

    [paper] [demo] [code]

  • (2023) PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

    [paper] [demo]

Text-Guided Text-to-Speech

Style Specified by a Text Prompt.

  • (2023) PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

    [paper] [demo]

  • (2023) PromptTTS 2: Describing and Generating Voices with Text Prompt

    [paper] [demo]

  • (2023) PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions

    [paper] [demo]

  • (2023) TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

    [paper] [demo]

  • (2023) InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

    [paper] [demo]

  • (2022) PromptTTS: controllable text-to-speech with text descriptions

    [paper] [demo]

Audio-Guided Text-to-Speech

Style Specified by an Audio (Speech) Prompt.

  • (2023) HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation by Hierarchical Variational Inference for Zero-shot Speech Synthesis

    [paper] [demo] [code]

  • (2023) Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

    [paper] [demo]

  • (2023) Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    [paper] [demo]

  • (2023) SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

    [paper] [demo]

  • (2023) VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

    [paper]

  • (2023) SoundStorm: Efficient Parallel Audio Generation

    [paper] [demo]

  • (2023) Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)

    [paper] [demo]

  • (2023) Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (VALL-E X)

    [paper] [demo]

  • (2023) Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

    [paper] [demo]

Text-Guided Text-to-Audio

Style Specified by a Text Prompt.

  • (2023) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

    [paper] [demo]

  • (2023) Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

    [paper] [demo] [github]

  • (2023) AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

    [paper] [demo] [github]

  • (2023) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

    [paper] [demo] [github]

  • (2022) AudioGen: Textually Guided Audio Generation

    [paper] [demo]

Text-Guided Text-to-Music

Style Specified by a Text Prompt.

Audio Codec

  • (2023) DAC: High-Fidelity Audio Compression with Improved RVQGAN

    [paper] [github]

  • (2023) AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec

    [paper] [github]

  • (2023) From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

    [paper] [github]

  • (2022) High Fidelity Neural Audio Compression

    [paper] [github]

  • (2021) SoundStream: An End-to-End Neural Audio Codec

    [paper]

Multi-modality Large Language Model

  • (2023) Prompting Large Language Models with Speech Recognition Abilities

    [paper] [demo]

  • (2023) On decoder-only architecture for speech-to-text and large language model integration

    [paper] [demo]

  • (2023) LMs with a Voice: Spoken Language Modeling beyond Speech Tokens

    [paper] [demo]

  • (2023) AudioPaLM: A Large Language Model That Can Speak and Listen

    [paper] [demo]

  • (2023) Listen, Think, and Understand

    [paper] [demo]

  • (2023) SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

    [paper] [demo]

Speech Continuation

  • (2023) Textually Pretrained Speech Language Models

    [paper] [demo] [github]

Speech-to-Speech Translation

  • (2023) SeamlessM4T—Massively Multilingual & Multimodal Machine Translation

    [paper] [demo] [github]

Speech-to-Speech

  • (2023) SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

Code-to-Speech

Audio-to-Text

  • (2023) SALMONN: Speech Audio Language Music Open Neural Network

Speech-to-Text

  • (2023) SALMONN: Speech Audio Language Music Open Neural Network

Evaluation

  • (2023) EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
  • (2022) SpeechLMScore: Evaluating speech generation using speech language model

Voice Conversion

  • (2023) HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation by Hierarchical Variational Inference for Zero-shot Speech Synthesis

    [paper] [demo] [code]

  • (2023) Towards General-Purpose Text-Instruction-Guided Voice Conversion

    [paper] [demo]

  • (2023) PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

    [paper] [demo]

Text-to-Audio

  • (2023) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

    [paper] [demo]

  • (2023) Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

    [paper] [demo] [github]

  • (2023) AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

    [paper] [demo] [github]

  • (2023) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

    [paper] [demo] [github]

  • (2022) AudioGen: Textually Guided Audio Generation

    [paper] [demo]

Text-to-Music

Text-to-Speech

  • (2023) HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation by Hierarchical Variational Inference for Zero-shot Speech Synthesis

    [paper] [demo] [code]

  • (2023) Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

    [paper] [demo]

  • (2023) Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    [paper] [demo]

  • (2023) PromptTTS 2: Describing and Generating Voices with Text Prompt

    [paper] [demo]

  • (2023) PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions

    [paper] [demo]

  • (2023) TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

    [paper] [demo]

  • (2023) SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

    [paper] [demo]

  • (2023) VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

    [paper]

  • (2023) SoundStorm: Efficient Parallel Audio Generation

    [paper] [demo]

  • (2023) Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)

    [paper] [demo]

  • (2023) Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (VALL-E X)

    [paper] [demo]

  • (2023) Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

    [paper] [demo]

  • (2023) InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

    [paper] [demo]

  • (2022) PromptTTS: controllable text-to-speech with text descriptions

    [paper] [demo]
