OpenAudioAPI is a speech API that supports both Text-to-Speech (TTS) and Speech-to-Text (STT). It integrates multiple TTS architectures (F5-TTS-MLX, XTTS, and Piper) for speech synthesis and lightning-whisper-mlx for speech recognition, with configurable voice settings and optional audio enhancements.
- 🎯 Multiple TTS Architecture Support:
  - F5-TTS-MLX for high-quality neural TTS
  - XTTS for multilingual support
  - Piper for fast and efficient TTS
- 🎤 Speech-to-Text Support:
  - Lightning Whisper MLX for fast and accurate transcription
  - Multiple model sizes from tiny to large-v3
  - Optional quantization for improved performance
- 🔊 Voice Cloning Capabilities
- 🌐 Multi-language Support
- 🎨 Customizable Voice Settings
- 🎵 Audio Quality Enhancements
- 🚀 FastAPI-based REST API
- 🚀 Automatic model download
- Python 3.9+
- FFmpeg installed on your system
- CUDA-capable GPU (optional, for improved performance)
- Required Python packages (see `requirements.txt`)
1. Clone the repository:
   git clone https://github.com/Goekdeniz-Guelmez/OpenAudioAPI.git
   cd OpenAudioAPI
2. Install the required dependencies:
   pip install -r requirements.txt
3. Install FFmpeg (if not already installed):
   - On macOS: `brew install ffmpeg`
   - On Ubuntu: `sudo apt-get install ffmpeg`
   - On Windows: download it from the FFmpeg website
Start the server:
python start_server.py --host 0.0.0.0 --port 8100 --reload
The API uses a `config.json` file to manage both TTS and STT settings. The configuration is divided into two main sections: "TTS" for Text-to-Speech and "STT" for Speech-to-Text. The sections below describe how to configure each architecture.
The configuration file is structured as follows:
{
"TTS": {
"tts-1-hd": {
"voice_name": {
// TTS voice configurations
}
},
"tts-1": {
// Additional TTS configurations
}
},
"STT": {
"model_name": {
// STT model configurations
}
}
}
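As a rough illustration of how this two-section layout can be traversed, the sketch below parses a small configuration and lists the TTS voices and STT models it defines. The voice and model names are placeholders, and `list_entries` is a hypothetical helper, not part of the project. Note that Python's `json` module rejects `//` comments, so the annotated examples in this README would need their comments removed before parsing.

```python
import json

# Hypothetical config following the TTS/STT structure described above.
# The voice names are placeholders, not defaults shipped with the project.
EXAMPLE_CONFIG = """
{
  "TTS": {
    "tts-1-hd": {"my_voice": {"architecture": "f5-mlx"}},
    "tts-1": {"fast_voice": {"architecture": "piper"}}
  },
  "STT": {
    "whisper-tiny": {"architecture": "whisper-mlx", "model": "tiny"}
  }
}
"""


def list_entries(config):
    """Return (tts_voices, stt_models) as sorted lists of names."""
    tts_voices = sorted(
        f"{model}/{voice}"
        for model, voices in config.get("TTS", {}).items()
        for voice in voices
    )
    stt_models = sorted(config.get("STT", {}))
    return tts_voices, stt_models


config = json.loads(EXAMPLE_CONFIG)
tts_voices, stt_models = list_entries(config)
print("TTS voices:", tts_voices)
print("STT models:", stt_models)
```

Each TTS voice is identified by the pair of model name and voice name, which matches the `model` and `voice` fields sent in a speech request.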
These settings can be applied to any voice configuration:
{
"enhance_quality": true/false, // Enable audio enhancements
"sample_rate": 44100, // Audio sample rate
"normalization_level": -3.0, // Target dB for normalization
"high_pass_filter": true/false, // Remove low frequencies
"noise_reduction": true/false // Apply noise reduction
}
F5-MLX is optimized for high-quality speech synthesis with voice cloning capabilities.
{
"tts-1-hd": {
"voice_name": {
"architecture": "f5-mlx",
"model": "path/to/model",
"steps": 8, // Number of generation steps (higher = better quality)
"cfg_strength": 2.0, // Classifier-free guidance strength
"sway_sampling_coef": -1.0, // Sampling coefficient
"speed": 1.0, // Speech speed (1.0 = normal)
"method": "rk4", // Generation method: "euler", "midpoint", or "rk4"
"duration": null, // Optional fixed duration in seconds
"seed": null, // Random seed for reproducibility
"quantization_bits": null, // Audio quantization bits
"ref_audio_path": "path/to/reference.wav", // Required for voice cloning
"ref_audio_text": "Reference text" // Required for voice cloning
}
}
}
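Because several of these fields constrain each other, it can help to check a voice entry before starting the server. The following is a minimal sketch, not the project's own validation: `check_f5_voice` is a hypothetical helper, and the rule that both reference fields must be set together is inferred from the "Required for voice cloning" comments above.

```python
VALID_METHODS = {"euler", "midpoint", "rk4"}


def check_f5_voice(voice):
    """Return a list of problems found in an f5-mlx voice entry (empty if none)."""
    problems = []

    if voice.get("architecture") != "f5-mlx":
        problems.append("architecture must be 'f5-mlx'")

    # Only validate the method when it is present; defaults are up to the server.
    method = voice.get("method")
    if method is not None and method not in VALID_METHODS:
        problems.append(f"method must be one of {sorted(VALID_METHODS)}")

    steps = voice.get("steps")
    if steps is not None and (not isinstance(steps, int) or steps < 1):
        problems.append("steps must be a positive integer")

    # Assumption: voice cloning needs the reference audio and its transcript together.
    has_audio = bool(voice.get("ref_audio_path"))
    has_text = bool(voice.get("ref_audio_text"))
    if has_audio != has_text:
        problems.append("ref_audio_path and ref_audio_text must be set together")

    return problems


example = {
    "architecture": "f5-mlx",
    "method": "rk4",
    "steps": 8,
    "ref_audio_path": "path/to/reference.wav",
    "ref_audio_text": "Reference text",
}
print(check_f5_voice(example))
```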
XTTS excels at multilingual speech synthesis with natural-sounding results.
{
"tts-1-hd": {
"voice_name": {
"architecture": "xtts",
"model": "path/to/model",
"language": "en", // Language code or "auto" for detection
"ref_audio_path": "path/to/reference.wav", // Required
"emotion": "neutral", // Optional emotion parameter
"speed": 1.0, // Speech speed
"split_sentences": true // Enable sentence splitting
}
}
}
Piper provides fast and efficient TTS with good quality output.
{
"tts-1": {
"voice_name": {
"architecture": "piper",
"model": "path/to/model",
"length_scale": 1.0, // Controls speech duration
"noise_scale": 0.667, // Affects voice variation
"noise_w": 0.8, // Affects voice consistency
"volume": 1.0, // Output volume
"use_cuda": false, // Enable GPU acceleration
"speaker": null // Optional speaker ID
}
}
}
Lightning Whisper MLX provides fast and accurate speech recognition with various model sizes and quantization options.
Available Models:
- tiny
- small
- distil-small.en
- base
- medium
- distil-medium.en
- large
- large-v2
- distil-large-v2
- large-v3
- distil-large-v3
Quantization Options:
- None (default)
- "4bit"
- "8bit"
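The model and quantization lists above can be turned into a small check when generating STT entries programmatically. This is a sketch under assumptions: `make_stt_entry` is a hypothetical helper, and the `"quant"` key name is a guess, since this README does not show which configuration key carries the quantization setting.

```python
AVAILABLE_MODELS = {
    "tiny", "small", "distil-small.en", "base", "medium", "distil-medium.en",
    "large", "large-v2", "distil-large-v2", "large-v3", "distil-large-v3",
}
QUANT_OPTIONS = {None, "4bit", "8bit"}


def make_stt_entry(model, quant=None):
    """Build an STT config entry, rejecting unknown models or quantization values."""
    if model not in AVAILABLE_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if quant not in QUANT_OPTIONS:
        raise ValueError(f"unknown quantization: {quant!r}")

    entry = {"architecture": "whisper-mlx", "model": model}
    if quant is not None:
        entry["quant"] = quant  # hypothetical key name, not confirmed by this README
    return entry


print(make_stt_entry("tiny"))
print(make_stt_entry("large-v3", quant="4bit"))
```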
{
"STT": {
"whisper-tiny": {
"architecture": "whisper-mlx",
"model": "tiny"
},
"whisper-large-v3": {
"architecture": "whisper-mlx",
"model": "large-v3",
### Generate Speech
```http
POST /v1/audio/speech
```
Request body parameters:
{
"model": "tts-1-hd", // Model type
"input": "Text to synthesize", // Input text
"voice": "voice_name", // Voice configuration to use
"response_format": "wav", // Output format: "wav" or "mp3"
// ... additional parameters matching config options
}
GET /v1/available_voices
Returns the complete configuration with available voices and their settings.
GET /health
Returns the API health status.
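A client can poll the health endpoint before sending work. The sketch below uses only the standard library and treats any HTTP 200 as healthy; the response body format is not documented here, so it is ignored. The base URL assumes the server was started on port 8100, and `health_url` and `is_healthy` are illustrative names.

```python
import urllib.request


def health_url(base_url):
    """Build the health endpoint URL from a base URL."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url, timeout=2.0):
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and non-2xx HTTP errors.
        return False


if __name__ == "__main__":
    print(is_healthy("http://localhost:8100"))
```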
When using the `tts-1-hd` model with `enhance_quality: true`, the following enhancements are applied:
- Sample rate conversion to specified rate (default: 44100Hz)
- High-pass filtering to remove low-frequency noise (optional)
- Noise reduction using FFT-based denoising (optional)
- Audio normalization to target level
- High-quality MP3 encoding for MP3 output format
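To make the normalization step concrete, here is a minimal sketch of peak normalization to a target level in dB. It is not the project's implementation: the README does not say whether normalization is peak-based or loudness-based, so peak normalization on float samples in the range [-1.0, 1.0] is assumed here.

```python
import math


def normalize_peak(samples, target_db=-3.0):
    """Scale float samples so the loudest one sits at target_db dBFS."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    target_amplitude = 10 ** (target_db / 20)  # convert dB to linear amplitude
    gain = target_amplitude / peak
    return [s * gain for s in samples]


def peak_db(samples):
    """Return the peak level of non-silent samples in dBFS."""
    return 20 * math.log10(max(abs(s) for s in samples))


# One second of a quiet 440 Hz tone at 44100 Hz.
tone = [0.25 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(44100)]
normalized = normalize_peak(tone, target_db=-3.0)
print(f"peak before: {peak_db(tone):.2f} dB, after: {peak_db(normalized):.2f} dB")
```

A target of -3.0 dB corresponds to the `normalization_level` value used in the enhancement settings and leaves some headroom below full scale.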
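import requests

# Basic text-to-speech request; the port matches the --port 8100 start command.
response = requests.post(
    "http://localhost:8100/v1/audio/speech",
    json={
        "model": "tts-1-hd",
        "input": "Hello, world!",
        "voice": "default_voice",
        "response_format": "wav",
    },
)
response.raise_for_status()  # fail early instead of writing an error body to disk

with open("output.wav", "wb") as f:
    f.write(response.content)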
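import requests

# Voice cloning request; assumes a "cloned_voice" entry with reference audio in config.json.
response = requests.post(
    "http://localhost:8100/v1/audio/speech",
    json={
        "model": "tts-1-hd",
        "input": "This is a cloned voice speaking.",
        "voice": "cloned_voice",
        "response_format": "mp3",
        "enhance_quality": True,  # Python boolean, serialized as JSON true
    },
)
response.raise_for_status()

with open("cloned_voice.mp3", "wb") as f:
    f.write(response.content)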
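import requests

# Speech-to-text request. The endpoint path is an assumption (OpenAI-style);
# this README does not state it, so check the server's routes before relying on it.
url = "http://localhost:8100/v1/audio/transcriptions"

data = {"model": "whisper-tiny"}
with open("path/to/audio.wav", "rb") as audio_file:
    response = requests.post(url, files={"file": audio_file}, data=data)

print(f"Response: {response.json()}")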
The API returns standard HTTP status codes with error messages:
- 200: Success
- 400: Bad Request (invalid parameters)
- 500: Internal Server Error (generation or processing failed)
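On the client side, these status codes can be mapped to explicit errors before the response body is written to a file. The helper below is a hypothetical sketch: `SpeechAPIError` and `check_response` are illustrative names, and the error body is treated as plain text because its exact format is not documented here.

```python
class SpeechAPIError(Exception):
    """Raised when the API answers with a non-success status code."""


def check_response(status_code, body_text=""):
    """Return None on 200; raise SpeechAPIError with a readable message otherwise."""
    if status_code == 200:
        return
    if status_code == 400:
        raise SpeechAPIError(f"400 Bad Request (invalid parameters): {body_text}")
    if status_code == 500:
        raise SpeechAPIError(f"500 Internal Server Error (processing failed): {body_text}")
    raise SpeechAPIError(f"Unexpected status {status_code}: {body_text}")


check_response(200)  # passes silently
try:
    check_response(400, "unknown voice")
except SpeechAPIError as err:
    print(err)
```

With the `requests` library, this would be called as `check_response(response.status_code, response.text)` right after the POST.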
Contributions are welcome! Please feel free to submit pull requests.
- f5-tts-mlx by Lucas Newman
- TTS by Coqui
- Piper by Rhasspy
- lightning-whisper-mlx by Mustafa Aljadery
- FastAPI
- FFmpeg
- API endpoint to clone a voice via request.
- Supporting CUDA and PyTorch Whisper.
- Supporting Bark models.
- Supporting Piper TTS for 'tts-1'.
- Adding 'keep_alive' parameter to keep models loaded in RAM.
- Adding support for parler-tts.
The OpenAudioAPI software suite was developed by Gökdeniz Gülmez. If you find OpenAudioAPI useful in your research and wish to cite it, please use the following BibTeX entry:
@software{
OpenAudioAPI,
author = {Gökdeniz Gülmez},
title = {{OpenAudioAPI}: A versatile Speech API that supports both Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities.},
url = {https://github.com/Goekdeniz-Guelmez/OpenAudioAPI.git},
version = {0.0.1},
year = {2024},
}