This repository contains implementations of two deep learning models for Audio-Visual Source Separation: TASnet and RTFS-Net.
- (Optional) Create and activate a new environment using `conda` or `venv` (+ `pyenv`).

  a. `conda` version:

  ```bash
  # create env
  conda create -n project_env python=PYTHON_VERSION

  # activate env
  conda activate project_env
  ```

  b. `venv` (+ `pyenv`) version:

  ```bash
  # create env
  ~/.pyenv/versions/PYTHON_VERSION/bin/python3 -m venv project_env

  # alternatively, using default python version
  python3 -m venv project_env

  # activate env
  source project_env
  ```

- Install all required packages:

  ```bash
  pip install -r requirements.txt
  ```
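A quick way to confirm the environment is ready is an import check. This is a minimal sketch, assuming `torch` and `torchaudio` are among the pinned requirements:

```python
import torch
import torchaudio

# print the installed versions and whether a GPU is visible
print("torch", torch.__version__, "| torchaudio", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```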
- First of all, you need to prepare your dataset with the following structure:
```
NameOfTheDirectoryWithUtterances
├── audio
│   ├── mix
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   ├── s1 # ground truth for the speaker s1, may not be given
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   └── s2 # ground truth for the speaker s2, may not be given
│       ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│       ├── FirstSpeakerID2_SecondSpeakerID2.wav
│       .
│       .
│       .
│       └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── mouths # contains video information for all speakers
    ├── FirstOrSecondSpeakerID1.npz # npz mouth-crop
    ├── FirstOrSecondSpeakerID2.npz
    .
    .
    .
    └── FirstOrSecondSpeakerIDn.npz
```
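Before training or inference, it can help to verify that every mixture has matching mouth crops and (optionally) ground-truth files. The helper below is a hypothetical sketch based only on the layout above; `check_dataset` is not part of this repo:

```python
from pathlib import Path


def check_dataset(root: str) -> None:
    """Sanity-check the layout above: audio/mix, optional audio/s1 and audio/s2, and mouths."""
    root = Path(root)
    audio_exts = {".wav", ".flac", ".mp3"}
    for mix_path in sorted((root / "audio" / "mix").iterdir()):
        if mix_path.suffix.lower() not in audio_exts:
            continue
        # file names are FirstSpeakerID_SecondSpeakerID.<ext>; assumes IDs contain no extra underscores
        for speaker_id in mix_path.stem.split("_"):
            if not (root / "mouths" / f"{speaker_id}.npz").exists():
                print(f"missing mouth crop {speaker_id}.npz for {mix_path.name}")
        # ground-truth folders may be absent (e.g. for inference-only data)
        for gt in ("s1", "s2"):
            gt_dir = root / "audio" / gt
            if gt_dir.exists() and not (gt_dir / mix_path.name).exists():
                print(f"missing ground truth {gt}/{mix_path.name}")


check_dataset("data/NameOfTheDirectoryWithUtterances")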
- Video embedding generation

  We used an open-source project to generate the video embeddings: repo. The original repo contained some problems, so we forked it and fixed them: forked repo.
Preparation:

```bash
git clone https://github.com/dikirillov/Lipreading_using_Temporal_Convolutional_Networks/
pip install -r Lipreading_using_Temporal_Convolutional_Networks/requirements.txt
```
Extraction:

Download the embedding extraction model (model url), then run:
```bash
python3 Lipreading_using_Temporal_Convolutional_Networks/main.py --modality video \
    --extract-feats \
    --config-path 'Lipreading_using_Temporal_Convolutional_Networks/configs/lrw_resnet18_dctcn_boundary.json' \
    --model-path <PATH-TO-DOWNLOADED-MODEL> \
    --mouth-patch-path <MOUTH-PATCH-PATH>
```
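Once extraction finishes, you can inspect the resulting `.npz` archives with a short sketch like the one below. The file name is illustrative, and the stored array keys depend on the extractor, so they are listed rather than assumed:

```python
import numpy as np

# hypothetical file name; point this at one of your mouths/*.npz files
archive = np.load("mouths/FirstOrSecondSpeakerID1.npz")
print(archive.files)  # names of the stored arrays (the exact keys depend on the extractor)
for key in archive.files:
    print(key, archive[key].shape, archive[key].dtype)
```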
If you want to retrain a model, you can use the train script with your own config:

```bash
python3 train.py -cn=CONFIG_NAME HYDRA_CONFIG_ARGUMENTS
```
For example, if you want to retrain the RTFS model on your dataset, change `src/configs/rtfs.yaml` (set the correct dataset dir) and then run:

```bash
python3 train.py dataset
```
You can use `inference.py` to generate separated audio from your dataset:

```bash
HYDRA_FULL_ERROR=1 python inference.py \
    datasets.inference.part=null +datasets.inference.dataset_dir='PATH_TO_YOUR_DATASET' \
    inferencer.save_path='PATH_TO_SAVE'
```
- `PATH_TO_YOUR_DATASET` - path to your dataset folder; it should be located in the `data` folder and contain the `audio`, `mouth`, and `mouth_embeds` folders.
- `PATH_TO_SAVE` - path where you want to save the separated files (they will be located in `data/saved/PATH_TO_SAVE`).
If you also want to calculate metrics, add `s1` and `s2` folders to your `audio` folder and then run:

```bash
HYDRA_FULL_ERROR=1 python inference.py \
    datasets.inference.part=null +datasets.inference.dataset_dir='PATH_TO_YOUR_DATASET' \
    inferencer.save_path='PATH_TO_SAVE' \
    metrics=inference_metrics
```
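For an offline spot check of the saved separations, you can also compute SI-SNR directly with `torchmetrics` (assuming `torchaudio` and `torchmetrics` are available). This is a rough sketch under assumed file locations, not the project's metric pipeline:

```python
import torchaudio
from torchmetrics.audio import ScaleInvariantSignalNoiseRatio

si_snr = ScaleInvariantSignalNoiseRatio()

# hypothetical file locations; adjust to wherever the estimates and references actually live
est, sr_est = torchaudio.load("data/saved/PATH_TO_SAVE/s1/FirstSpeakerID1_SecondSpeakerID1.wav")
ref, sr_ref = torchaudio.load("data/PATH_TO_YOUR_DATASET/audio/s1/FirstSpeakerID1_SecondSpeakerID1.wav")
assert sr_est == sr_ref, "sample rates must match"

# trim to a common length in case the estimate and reference differ by a few samples
n = min(est.shape[-1], ref.shape[-1])
print("SI-SNR:", si_snr(est[..., :n], ref[..., :n]).item(), "dB")
```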