Skip to content

A Real-world dataset and a new method for voice activity detection

Notifications You must be signed in to change notification settings

IIT-PAVIS/Voice-Activity-Detection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 

Repository files navigation

Voice-Activity-Detection

Table of contents

RealVAD: A Real-world Dataset for Voice Activity Detection

The task of automatically detecting “Who is Speaking and When” is broadly named as Voice Activity Detection (VAD). Automatic VAD is a very important task and also the foundation of several domains, e.g., human-human, human-computer/ robot/ virtual-agent interaction analyses, and industrial applications.

RealVAD dataset is constructed from a YouTube video composed of a panel discussion lasting approx. 83 minutes. The audio is available from a single channel. There is one static camera capturing all panelists, the moderator and audiences. The field-of-the-view of the camera while the panelists are shown with the assigned identification numbers are as follow.

Alt text Image was adapted from Privacy camp 2018: Round-table, government hacking in different national contexts and strategies https://www.youtube.com/watch?v=51pRTOIso4U

Particular aspects of RealVAD dataset are:

  • It is composed of panelists with different nationalities (British, Dutch, French, German, Italian, American, Mexican, Columbian, Thai). This aspect allows studying the effect of ethnic origin variety to the automatic VAD.
  • There is a gender balance such that there are four female and five male panelists.
  • The panelists are sitting in two rows and they can be gazing audience, other panelists, their laptop, the moderator or anywhere in the room while speaking or not-speaking. Therefore, they were captured not only from frontal-view but also from side-view varying based on their instant posture and head orientation.
  • The panelists are moving freely and are doing various spontaneous actions (e.g., drinking water, checking their cell phone, using their laptop, etc.), resulting in different postures.
  • The panelists’ body parts are sometimes partially occluded by their/other's body part or belongings (e.g., laptop).
  • There are also natural changes of illumination and shadow rising on the wall behind the panelists in the back row.
  • Especially, for the panelists sitting in the front row, there is sometimes background motion occurring when the person(s) behind them moves.

You can reach the annotations from HERE that includes:

  • The upper body detection of nine panelists in bounding box form
  • Associated VAD ground-truth (speaking, not-speaking) for nine panelists
  • Acoustic features extracted from the video: MFCC and raw filterbank energies
  • The corresponding video can be accessed from HERE

When using this dataset for your research, please cite the related papers in your publication:

C. Beyan, M. Shahid and V. Murino, 
"RealVAD: A Real-world Dataset and A Method for Voice Activity Detection by Body Motion Analysis," 
in IEEE Transactions on Multimedia, doi: 10.1109/TMM.2020.3007350.
@ARTICLE{Beyan2020TMM,
  author={C. {Beyan} and M. {Shahid} and V. {Murino}},
  journal={IEEE Transactions on Multimedia}, 
  title={RealVAD: A Real-world Dataset and A Method for Voice Activity Detection by Body Motion Analysis}, 
  year={2020},
  volume={},
  number={},
  pages={1-1},}

Releases

No releases published

Packages

No packages published