
Speech-to-Face

Generate an image of a human face based on that person's speech

The general aim of this project is to recreate and improve the Speech-to-Face pipeline presented in the paper Speech2Face: Learning the Face Behind a Voice [1].

The whole implementation is based on the PyTorch framework.

Overview

In this project you will find the implementation of three models: VoiceEncoder, FaceEncoder, and FaceDecoder.

Sections

To read more about the project, go to the page you are interested in:

Used datasets

In this project we used three different datasets:

  1. VoxCeleb1 - for human speech audio [6]
  2. VoxCeleb2 - for human speech audio [7]
  3. HQ-VoxCeleb - for normalized facial images [8]

The HQ-VoxCeleb dataset was used to train the FaceDecoder model. To train the VoiceEncoder model, we filtered the VoxCeleb1 and VoxCeleb2 datasets to keep only the audio files of identities present in HQ-VoxCeleb (HQ-VoxCeleb does not contain normalized face images for every identity in VoxCeleb1 or VoxCeleb2).
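The filtering step above can be sketched as follows. This is a minimal illustration, not the repository's actual script: the directory layout (one subdirectory per speaker identity) and the identity-list file are assumptions.

```python
import os
import shutil


def filter_voxceleb(voxceleb_dir: str, hq_ids_file: str, output_dir: str) -> int:
    """Copy only those speaker directories from a VoxCeleb root whose
    identity also appears in the HQ-VoxCeleb identity list.

    Assumes voxceleb_dir contains one subdirectory per identity and
    hq_ids_file lists one identity per line. Returns the number of
    identities kept.
    """
    with open(hq_ids_file) as f:
        hq_identities = {line.strip() for line in f if line.strip()}

    os.makedirs(output_dir, exist_ok=True)
    kept = 0
    for identity in sorted(os.listdir(voxceleb_dir)):
        if identity in hq_identities:
            shutil.copytree(
                os.path.join(voxceleb_dir, identity),
                os.path.join(output_dir, identity),
                dirs_exist_ok=True,
            )
            kept += 1
    return kept
```

The same intersection would be applied to both VoxCeleb1 and VoxCeleb2 roots, yielding audio only for identities that have normalized face images.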

Results (cherry-picked)

We achieved the best results using a fine-tuned AST as the VoiceEncoder model. Moreover, we used VGGFace_serengil as the FaceEncoder when training the VoiceEncoder and FaceDecoder models. The results obtained with our VE_conv model, trained from scratch, were much worse. The image below summarizes our work. The left column shows the original image of the person from the HQ-VoxCeleb dataset. The middle column shows the reconstruction of the face from the Face-to-Face pipeline (i.e., convert the image to a face embedding and reconstruct the image; the voice is not used in this pipeline). Finally, the right column shows the results of the Speech-to-Face pipeline (i.e., convert the speech to a spectrogram, compute a face embedding from that spectrogram, and reconstruct the face).

[Results image: original / Face-to-Face reconstruction / Speech-to-Face reconstruction]
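The Speech-to-Face pipeline described above (speech → spectrogram → face embedding → reconstructed face) can be sketched in PyTorch. The two toy modules below are hypothetical stand-ins for the repository's trained VoiceEncoder and FaceDecoder; the STFT parameters and embedding size are illustrative assumptions, not the values used in the project.

```python
import torch
import torch.nn as nn


class VoiceEncoder(nn.Module):
    """Toy stand-in: maps a speech spectrogram to a face embedding."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # LazyLinear infers the flattened spectrogram size on first call.
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.net(spec)


class FaceDecoder(nn.Module):
    """Toy stand-in: reconstructs a face image from a face embedding."""

    def __init__(self, embed_dim: int = 256, img_size: int = 32):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Linear(embed_dim, 3 * img_size * img_size)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        out = self.net(emb)
        return out.view(-1, 3, self.img_size, self.img_size)


def speech_to_face(waveform: torch.Tensor,
                   encoder: nn.Module,
                   decoder: nn.Module) -> torch.Tensor:
    """Run the three pipeline stages on a mono waveform (1-D tensor)."""
    # 1. Speech -> magnitude spectrogram via STFT.
    spec = torch.stft(waveform, n_fft=512, hop_length=160,
                      window=torch.hann_window(512),
                      return_complex=True).abs()
    # 2. Spectrogram -> face embedding.
    emb = encoder(spec.unsqueeze(0))  # add batch dimension
    # 3. Embedding -> reconstructed face image (B, 3, H, W).
    return decoder(emb)
```

In the Face-to-Face pipeline, stage 2 is replaced by a pretrained FaceEncoder applied to the original image, so the same FaceDecoder can be evaluated with and without the voice.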

References

[1] Oh, Tae-Hyun, et al. "Speech2face: Learning the face behind a voice." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

[2] Parkhi, Omkar, Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." BMVC 2015-Proceedings of the British Machine Vision Conference 2015. British Machine Vision Association, 2015.

[3] Cole, Forrester, et al. "Synthesizing normalized faces from facial identity features." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[4] github.com/serengil/deepface

[5] github.com/rcmalli/keras-vggface

[6] robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html

[7] robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html

[8] Bai, Yeqi, et al. "Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging." Proceedings of the 30th ACM International Conference on Multimedia. 2022.