A Flask-based REST API that detects human voice activity in audio files. Perfect for those moments when you need to know if someone's actually speaking or if it's just your neighbour's cat knocking things over.
- Audio classification into three categories:
  - Blank (no sound)
  - Background noise only
  - Human voice with background noise
- Continuous learning through user feedback
- JWT-based authentication
- Support for WAV, MP3, and AAC audio formats
- Noise reduction preprocessing
- Feature extraction using MFCCs, spectral centroids, and chroma features
The system follows a modular architecture with these core components:
- Feature Extraction: Uses librosa to extract meaningful audio features (MFCCs, spectral centroids, chroma)
- Model: RandomForest classifier trained on the extracted features
- API: Flask-based REST endpoints for training, classification, and feedback
- Authentication: JWT-based token system for secure access
- Python 3.8+
- FFmpeg (for audio processing)
- Clone the repository
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Set your JWT secret key in create_jwt_token.py:
SECRET_KEY = "your_secure_secret_key"
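A minimal sketch of how a token could be generated with PyJWT (the payload claims and expiry shown are assumptions for illustration, not necessarily what create_jwt_token.py actually does):

```python
import datetime
import jwt  # PyJWT

SECRET_KEY = "your_secure_secret_key"  # replace with your own secret

def create_jwt_token(user_id, hours_valid=24):
    # Illustrative payload; the repository's actual claims may differ.
    payload = {
        "sub": user_id,
        "exp": datetime.datetime.now(datetime.timezone.utc)
               + datetime.timedelta(hours=hours_valid),
    }
    return jwt.encode(payload, SECRET_KEY, algorithm="HS256")

token = create_jwt_token("demo-user")
print(token)
```

The resulting token is what goes into the `Authorization: Bearer ...` header in the examples below.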
POST /train
- Train the model with new audio files and labels
- Can also use accumulated feedback data if no new files are provided
POST /upload-audio
- Upload an audio file for classification
- Returns classification result and confidence score
POST /feedback
- Submit feedback for improving model accuracy
- Stores audio file and correct label for future retraining
Training with new files:
curl -X POST \
http://localhost:5000/train \
-H 'Authorization: Bearer YOUR_JWT_TOKEN' \
-F '[email protected]' \
-F '[email protected]' \
-F 'labels=1' \
-F 'labels=2'
Classifying an audio file:
curl -X POST \
http://localhost:5000/upload-audio \
-H 'Authorization: Bearer YOUR_JWT_TOKEN' \
-F '[email protected]'
The system uses a Random Forest Classifier with:
- 100 estimators
- Feature set combining MFCCs, spectral centroids, and chroma features
- Noise reduction preprocessing using the noisereduce library
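The training step can be sketched roughly as follows. Only the 100-estimator Random Forest and the feature set are stated by this project; the data shown here is synthetic placeholder data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: one 26-dimensional feature vector per clip
# (13 MFCCs + 1 spectral centroid + 12 chroma bins), with labels
# 0 = blank, 1 = background noise only, 2 = voice with background.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 26))
y = rng.integers(0, 3, size=60)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# predict_proba yields per-class probabilities, which can serve as the
# confidence score the API returns alongside the classification
probs = model.predict_proba(X[:1])
print(probs.shape)
```

Taking the maximum of `predict_proba` is a common way to derive a single confidence value for the predicted class.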
├── main.py # Flask application and API endpoints
├── voice_activity_detector.py # Core ML functionality
├── create_jwt_token.py # JWT token generation
├── requirements.txt # Project dependencies
└── feedback_audio/ # Directory for feedback audio files
The API implements comprehensive error handling for:
- Invalid file types
- Missing JWT tokens
- Model not found scenarios
- Invalid feedback formats
- File processing errors
Errors are logged to errors.log for debugging and monitoring.
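A sketch of how logging to errors.log might be wired up with Python's standard logging module (the handler configuration and logger name are assumptions, not the repository's exact setup):

```python
import logging

# Log errors to errors.log; in the Flask app this logger would be used
# inside except blocks and @app.errorhandler functions.
logger = logging.getLogger("vad_api")
handler = logging.FileHandler("errors.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.ERROR)

try:
    raise ValueError("unsupported file type: .ogg")
except ValueError as exc:
    logger.error("File processing error: %s", exc)
```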
- Always validate audio files before processing
- Use error handling for robust production deployment
- Regularly retrain the model with feedback data
- Monitor the feedback queue size
- Back up the trained model regularly
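Backing up the trained model can be done with joblib, which scikit-learn recommends for persisting estimators. The file name and toy training data below are illustrative:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier

# Toy data just so the model is fitted; in practice this would be the
# feature vectors and labels from the training endpoint.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]], [0, 0, 0, 1, 1, 1])

# Persist the fitted model; keeping dated copies makes rollback easy
joblib.dump(model, "vad_model.joblib")

restored = joblib.load("vad_model.joblib")
print(restored.predict([[1.0]]))
```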
- Add model versioning
- Implement batch processing for large audio files
- Add confidence scores to classifications
- Implement real-time streaming classification
- Add more granular voice activity categories
- Change the default SECRET_KEY
- Implement rate limiting
- Add request size limitations
- Implement proper file cleanup
- Consider adding API key rotation
Feel free to open issues and pull requests. Just make sure your code doesn't classify heavy metal as "background noise" - we've been there, it wasn't pretty.
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This software depends on other packages that may be licensed under different terms. The MIT license above applies only to the original code in this repository. Notably, this project uses FFmpeg which is licensed under the LGPL/GPL license.