
MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance


Read the Chinese version.


📖 Introduction

MOSS-Speech introduces true end-to-end speech interaction. Unlike cascaded pipelines or text-guided models, it generates speech directly, without first producing text. This design removes the text bottleneck that constrains generated speech, while still inheriting the knowledge of the pretrained text language model, enabling more natural and efficient speech-to-speech dialogue.

Architecture Comparison

We add modality-based layer-splitting to a pretrained text LLM and follow a frozen pre-training strategy that preserves the LLM's capabilities while extending it to the speech modality.
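The layer-splitting idea can be sketched in a few lines of framework-free Python. This is an illustrative toy, not the repository's actual code: all class and parameter names are made up, and each "layer" just tags the hidden state so the routing is visible. The point is the topology: a shared, frozen backbone inherited from the text LLM, branching into modality-specific top layers, so the speech path never produces intermediate text.

```python
# Toy sketch of modality-based layer-splitting (names are hypothetical,
# not the MOSS-Speech API). Each Layer just appends its name so the
# routing through the network is visible in the output.

class Layer:
    """Stand-in for a transformer block."""
    def __init__(self, name, frozen=False):
        self.name = name
        self.frozen = frozen  # frozen layers keep their pretrained weights

    def forward(self, h):
        return h + [self.name]

class LayerSplitModel:
    def __init__(self, n_shared=2):
        # Shared backbone inherited (and kept frozen) from the text LLM.
        self.shared = [Layer(f"shared{i}", frozen=True) for i in range(n_shared)]
        # Modality-specific top layers; only these are new for speech.
        self.heads = {
            "text":   [Layer("text_top")],
            "speech": [Layer("speech_top")],
        }

    def forward(self, tokens, modality):
        h = list(tokens)
        for layer in self.shared + self.heads[modality]:
            h = layer.forward(h)
        return h

model = LayerSplitModel()
print(model.forward(["<audio>"], "speech"))
# ['<audio>', 'shared0', 'shared1', 'speech_top']
```

A speech token thus reuses the same frozen shared layers as text before branching into its own head, which is how the pretrained knowledge is shared across modalities.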

Check out our video demo and live demo.

The technical report is available at arXiv:2510.00499.


🔑 Key Features

  • True Speech-to-Speech Modeling: No text guidance required.
  • Layer-Splitting Architecture: Integrates modality-specific layers on top of pretrained text LLM backbones.
  • Frozen Pre-Training Strategy: Preserves LLM abilities while extending to speech modality.
  • State-of-the-Art Performance: Excels in spoken question answering and speech-to-speech tasks.
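The frozen pre-training strategy listed above can be illustrated with a minimal sketch, again under stated assumptions: the parameter names below are invented for illustration and do not come from the MOSS-Speech codebase. The idea is simply that backbone parameters are marked frozen and excluded from updates, while the new speech-specific parameters remain trainable.

```python
# Hedged sketch of the frozen pre-training strategy. Parameter names
# are hypothetical; only the selection logic is the point.
params = {
    "backbone.layer0.weight": {"frozen": True},   # pretrained text LLM: frozen
    "backbone.layer1.weight": {"frozen": True},
    "speech_layers.weight":   {"frozen": False},  # speech-specific: trained
    "speech_embed.weight":    {"frozen": False},
}

def trainable(params):
    """Names of parameters that would receive gradient updates."""
    return sorted(name for name, p in params.items() if not p["frozen"])

print(trainable(params))  # ['speech_embed.weight', 'speech_layers.weight']
```

In a real training framework this corresponds to disabling gradients on the backbone and passing only the speech-modality parameters to the optimizer.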

🛠️ Installation

# Clone the repository
git clone https://github.com/OpenMOSS/MOSS-Speech
cd MOSS-Speech

# Install dependencies
pip install -r requirements.txt 
git submodule update --init --recursive

🚀 Usage

Launch the web demo

python3 gradio_demo.py



Next Steps

  • Open source base model: Release the MOSS-Speech-Base model for community use
  • Support streaming output in Gradio: Implement streaming output for lower response latency in the web demo

License

  • The code in this repository is released under the Apache 2.0 license.

Acknowledgements

  • Qwen: We use Qwen3-8B as the base model.
  • We thank an anonymous colleague for Character Voice!

📜 Citation

If you use this repository or model in your research, please cite:

@misc{zhao2025mossspeechtruespeechtospeechmodels,
      title={MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance}, 
      author={Xingjian Zhao and Zhe Xu and Luozhijie Jin and Yang Wang and Hanfu Chen and Yaozhou Jiang and Ke Chen and Ruixiao Li and Mingshu Chen and Ruiming Wang and Wenbo Zhang and Yiyang Zhang and Donghua Yu and Yang Gao and Xiaogui Yang and Yitian Gong and Yuanfan Xu and Qinyuan Cheng and Zhaoye Fei and Shimin Li and Yaqian Zhou and Xuanjing Huang and Xipeng Qiu},
      year={2025},
      eprint={2510.00499},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.00499}, 
}
