Unofficial PyTorch implementation of VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Final project for SCC5830 - Image Processing @ ICMC/USP.
For this task we initially use the LibriSpeech dataset. However, to use it here, we need to generate audio samples with overlapping voices (as sketched below).
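As a rough illustration of how such mixtures can be built (the file paths below are hypothetical; the actual preprocessing lives in this repository's scripts), two utterances from different speakers can be loaded, trimmed to a common length, and summed:

```python
import librosa
import numpy as np
import soundfile as sf

# Hypothetical paths to two LibriSpeech utterances from different speakers.
target_path = "LibriSpeech/train-clean-100/19/198/19-198-0001.flac"
interference_path = "LibriSpeech/train-clean-100/26/495/26-495-0002.flac"

sr = 16000  # LibriSpeech sampling rate
target, _ = librosa.load(target_path, sr=sr)
interference, _ = librosa.load(interference_path, sr=sr)

# Trim both signals to the shorter length so they fully overlap.
length = min(len(target), len(interference))
target, interference = target[:length], interference[:length]

# Sum the two voices to create the overlapped input; the clean target
# utterance is kept as the training ground truth.
mixed = target + interference
mixed /= np.abs(mixed).max()  # normalize to avoid clipping

sf.write("mixed.wav", mixed, sr)
```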
We use SI-SNR with permutation-invariant training (PIT) instead of the power-law compressed loss, because it achieves better results (a comparison is available at https://github.com/Edresson/VoiceSplit).
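For reference, here is a minimal sketch of an SI-SNR loss with PIT over the speaker permutations, assuming `estimates` and `targets` are tensors of shape `(batch, n_src, time)`; the loss actually used for training is the one in this repository's code:

```python
import itertools
import torch

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB, per batch item; inputs are (batch, time)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to get the scaled target component.
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

def pit_si_snr_loss(estimates, targets):
    """Negative SI-SNR, minimized over all source permutations (PIT).

    estimates, targets: (batch, n_src, time); returns a scalar loss.
    """
    n_src = estimates.size(1)
    losses = []
    for perm in itertools.permutations(range(n_src)):
        snr = torch.stack(
            [si_snr(estimates[:, i], targets[:, j]) for i, j in enumerate(perm)],
            dim=1,
        ).mean(dim=1)  # average over sources, per batch item
        losses.append(-snr)
    # Pick the best permutation for each batch item, then average the batch.
    return torch.stack(losses, dim=1).min(dim=1).values.mean()
```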
We used the Mish activation function instead of ReLU, which improved the results.
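Mish is defined as `x * tanh(softplus(x))`; a minimal PyTorch module is shown below (recent PyTorch versions also ship it as `torch.nn.Mish`):

```python
import torch
import torch.nn.functional as F

class Mish(torch.nn.Module):
    """Mish activation: x * tanh(softplus(x)) (Misra, 2019)."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```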
You can see a report of what was done in this repository here.
Colab notebook demos:
Exp 1: https://shorturl.at/eBX18
Exp 2: https://shorturl.at/oyEJN
Exp 3: https://shorturl.at/blnEW
Exp 4: https://shorturl.at/qFJN8
Exp 5 (best): https://shorturl.at/kvAQ8
Demo site for the experiment with the best results (Exp 5): https://edresson.github.io/VoiceSplit/
Future work:
- Create documentation for the repository and remove unused code
- Train the VoiceSplit model with GE2E3k and the Mean Squared Error loss function
This repository contains code from other contributors; due credit is given in the functions used:
Preprocessing: Eren Gölge @erogol
VoiceFilter Model: Seungwon Park @seungwonpark