Samuel Albanie*, GĂĽl Varol*, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox and Andrew Zisserman, BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues, ECCV 2020.
Requires: Python 3.6 (some non-essential pre-processing scripts require Python 3.7).
```bash
# Clone this repository
git clone https://github.com/gulvarol/bsl1k.git
cd bsl1k/
# Set up symbolic links (point these to folders where you would like data and checkpoints to be stored)
ln -s <replace_with_data_path> data
ln -s <replace_with_log_path> checkpoint
# Create the bsl1k_env environment with dependencies
conda env create -f environment.yml
conda activate bsl1k_env
pip install -r requirements.txt
```
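Once the environment is active, a quick sanity check along these lines can confirm that the interpreter and PyTorch are picked up correctly (an illustrative snippet, not part of the repository):

```python
# Illustrative sanity check for the bsl1k_env environment.
import sys
import torch

print("Python:", sys.version.split()[0])      # expected: 3.6.x
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```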
The `demo` folder contains a sample script that applies sign language recognition to an input video. Usage:

```bash
python demo.py
```

By default, the demo will download: (1) a model that has been pretrained on BSL-1K and then fine-tuned on WLASL, and (2) a video from handspeak.com (this particular video is part of the WLASL test set). You can change to other inputs. The original video source can be found here. Copyright Jolanta Lapiak.
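For orientation, the core of such a demo amounts to reading video frames into a clip tensor and passing them through the trained I3D classifier. The sketch below is not the repository's `demo.py`: the clip length, resolution and normalisation are placeholder choices, and the model-building step is left as a comment:

```python
# Minimal sketch of demo-style inference (not the repository's demo.py).
# Assumes a PyTorch I3D-style model taking clips shaped (batch, 3, time, height, width);
# clip length, resolution and normalisation here are placeholder choices.
import cv2
import numpy as np
import torch

def load_clip(video_path, num_frames=64, size=224):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(cv2.resize(frame, (size, size)), cv2.COLOR_BGR2RGB)
        frames.append(frame)
    cap.release()
    clip = np.stack(frames).astype(np.float32) / 255.0              # (T, H, W, 3)
    return torch.from_numpy(clip).permute(3, 0, 1, 2).unsqueeze(0)  # (1, 3, T, H, W)

# model = ...  # build the I3D network and load the WLASL-finetuned checkpoint here
# clip = load_clip("sample_video.mp4")
# with torch.no_grad():
#     probs = torch.softmax(model(clip), dim=1)
# print(torch.topk(probs, k=5))
```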
- This code supports I3D classification training for the following sign language video datasets (an illustrative launch command using these flag values is given after the table):

Dataset | --datasetname | Path | --num-classes | --ram_data | info/ |
---|---|---|---|---|---|
BSL-1K (coming soon) | bsl1k | data/bsl1k/ | 1064 | 0 | [COMING SOON] |
WLASL | wlasl | data/wlasl/ | 2000 | 1 | (3.7GB) |
MSASL | msasl | data/msasl/ | 1000 | 1 | (6.6GB) |
Phoenix2014T | phoenix2014 | data/PHOENIX-2014-T-release-v3/PHOENIX-2014-T/ | 1233 | 0 | (3MB) |
BSL-Corpus | bslcp | data/BSLCP/ | 966 | 0 | (1MB) |
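For example, a WLASL training run could be launched roughly as follows, using the flag values from the table (a minimal sketch; the remaining `main.py` arguments, such as checkpoint paths and pretraining options, are omitted here and can be taken from the "args" links in the Experiments section):

```python
# Illustrative only: launch main.py for WLASL with the flag values from the table above.
# The full argument list (checkpoint directory, pretrained weights, etc.) is omitted;
# see the "args" links in the Experiments section for complete configurations.
import subprocess

subprocess.run(
    [
        "python", "main.py",
        "--datasetname", "wlasl",
        "--num-classes", "2000",
        "--ram_data", "1",
    ],
    check=True,
)
```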
- Please cite the original papers if you use the WLASL, MSASL, Phoenix2014T or BSL-Corpus datasets. Here, we only provide pre-processed metadata, not the videos; the videos can be obtained via the metadata provided by the dataset authors, as described next:
WLASL: First head to the WLASL authors' GitHub page here and download the `.json` file of links. This file evolves over time; the current version is v3 and is called `WLASL_v0.3.json`. Place the downloaded file at `data/wlasl/info/WLASL_v0.3.json`. After this step, the video files can be downloaded by running:

```bash
python misc/wlasl/download_wlasl.py
```

Notes: some videos may no longer be accessible; you can contact the WLASL authors to address this issue (they provide an email address on the GitHub page linked above). Also note that the v3 json may produce slightly different results from the `WLASL_v0.1.json` we used for our experiments.
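For reference, the core of such a download step looks roughly like the sketch below. This is not the repository's `download_wlasl.py`; it assumes the json is a list of glosses whose `instances` carry `url` and `video_id` fields (check the actual schema), and the output directory is a placeholder:

```python
# Minimal sketch of a WLASL downloader (not the repository's misc/wlasl/download_wlasl.py).
# Assumes each gloss entry has an "instances" list with "url" and "video_id" fields;
# the output directory below is a placeholder.
import json
import urllib.request
from pathlib import Path

info_path = Path("data/wlasl/info/WLASL_v0.3.json")
out_dir = Path("data/wlasl/videos_original")
out_dir.mkdir(parents=True, exist_ok=True)

with open(info_path) as f:
    glosses = json.load(f)

for entry in glosses:
    for inst in entry.get("instances", []):
        out_file = out_dir / f"{inst['video_id']}.mp4"
        if out_file.exists():
            continue
        try:
            urllib.request.urlretrieve(inst["url"], str(out_file))
        except Exception as exc:  # some links are no longer accessible
            print(f"Skipping {inst['url']}: {exc}")
```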
MSASL: As for the dataset above, first download the json files of video links from the authors here and place them into `data/msasl/info/`. This will give the following directory structure:

```
data/
  msasl/
    info/
      MSASL_train.json
      MSASL_val.json
      MSASL_test.json
```

The videos may then be downloaded via:

```bash
python misc/msasl/download_msasl.py
```
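Before downloading, a quick check that the metadata files are in place can look like this (an illustrative snippet, not part of the repository):

```python
# Illustrative: verify the MSASL metadata files exist and report how many entries each split has.
import json
from pathlib import Path

info_dir = Path("data/msasl/info")
for split_file in ("MSASL_train.json", "MSASL_val.json", "MSASL_test.json"):
    path = info_dir / split_file
    if not path.exists():
        print(f"Missing {path}")
        continue
    with open(path) as f:
        entries = json.load(f)
    print(f"{split_file}: {len(entries)} entries")
```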
Phoenix2014T: the video files can be downloaded from here (this archive should be unpacked to the location `data/PHOENIX-2014-T-release-v3`). You can then run the following script to create `.mp4` video files from the provided `.png` frames:

```bash
python misc/phoenix2014/gather_frames.py
```
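The conversion itself amounts to packing each clip's frame directory into a video. The sketch below illustrates the idea with OpenCV; it is not the repository's `gather_frames.py`, and the paths and frame rate are placeholders:

```python
# Minimal sketch: turn a directory of .png frames into an .mp4 clip with OpenCV.
# The paths and frame rate are placeholders; misc/phoenix2014/gather_frames.py
# handles the actual Phoenix2014T directory layout.
import glob
import cv2

def frames_to_mp4(frame_dir, out_path, fps=25):
    frame_paths = sorted(glob.glob(f"{frame_dir}/*.png"))
    height, width = cv2.imread(frame_paths[0]).shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    for p in frame_paths:
        writer.write(cv2.imread(p))
    writer.release()

frames_to_mp4("path/to/one_clip_frames", "one_clip.mp4")
```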
BSL-Corpus: can be downloaded from here upon request to the owners.
- In our folder organization, each dataset has a subfolder `info/` in which most pre-extracted annotations are kept:
  - `info/info.pkl`
  - `info/pose.pkl`
- OpenPose is extracted for:
  - `bsl1k`: a subset of the videos
  - `msasl` and `wlasl`: all videos (we provide these within the `.tar` files)
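A quick, illustrative way to inspect these annotation files (they are standard Python pickles; the exact contents vary per dataset, so this only prints the top-level structure):

```python
# Illustrative: peek at the pre-extracted annotation pickles for one dataset.
# The contents are dataset-specific; this only reports the top-level structure.
import pickle

for name in ("data/wlasl/info/info.pkl", "data/wlasl/info/pose.pkl"):
    with open(name, "rb") as f:
        data = pickle.load(f)
    if isinstance(data, dict):
        print(name, "->", list(data.keys()))
    else:
        print(name, "->", type(data).__name__, "with", len(data), "items")
```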
- We have pre-processed all of the video datasets to be at 256x256 spatial resolution. The pre-processing scripts can be found under the `misc` folder for each dataset. Using the original videos is possible, but slower.
- We have pre-processed WLASL and MSASL such that the video frames are stored in a pkl file, which we then load entirely into RAM. Setting `--ram_data` to 0 removes the need for this preprocessing step and uses the video files instead. The results are similar with and without this step.
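For reference, resizing a single video to 256x256 can be done along these lines (a minimal sketch that shells out to `ffmpeg`; it is not one of the repository's `misc/` scripts and the paths are placeholders):

```python
# Minimal sketch: resize one video to 256x256 by calling ffmpeg.
# Paths are placeholders; the misc/ scripts handle each dataset's actual layout.
import subprocess

def resize_video(src, dst, size=256):
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", f"scale={size}:{size}", dst],
        check=True,
    )

resize_video("input_video.mp4", "input_video_256.mp4")
```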
You can download some of the pretrained models used in the experiments by running the following in the project root directory:

```bash
bash misc/pretrained_models/download.sh
```

All the other pretrained models from the experiments are provided in the Experiments section. The best BSL-1K model reported for the final experiments is the first model.

Note (2021.09.14): you might want to check the improved model here from our follow-up CVPR'21 work.
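Once downloaded, a checkpoint can be inspected roughly as follows (an illustrative snippet; the filename is a placeholder for whichever file `download.sh` fetches, and the stored dictionary keys may differ):

```python
# Illustrative: open a downloaded PyTorch checkpoint and list what it contains.
# The filename is a placeholder; point it at the file fetched by download.sh.
import torch

checkpoint = torch.load("checkpoint/bsl1k_i3d.pth.tar", map_location="cpu")
if isinstance(checkpoint, dict):
    print("Checkpoint keys:", list(checkpoint.keys()))
    state_dict = checkpoint.get("state_dict", checkpoint)
    print("Number of parameter tensors:", len(state_dict))
```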
The training launch for each experiment can be found in the Experiments section by clicking the "run" links. Training can be run by directly typing `python main.py <args>` in a terminal with the desired arguments. We also provide the `exp/create_exp.py` script that we used when launching experiments. You can use it as follows.

Training:

```bash
cd exp/
# Change config.json contents
python train.py
```

Testing:

```bash
cd exp/
# Change config.json contents
python test.py
```
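If you prefer launching `main.py` directly, the argument list behind a "run"/"args" link can be replayed with a small helper like the one below (illustrative only; it assumes you have saved the arguments to a local whitespace-separated text file, whose name here is a placeholder):

```python
# Illustrative helper: replay an experiment's argument list through main.py.
# Assumes the arguments from an "args" link were saved to a local text file of
# whitespace-separated tokens; the filename below is a placeholder.
import shlex
import subprocess

with open("experiment_args.txt") as f:
    args = shlex.split(f.read())

subprocess.run(["python", "main.py"] + args, check=True)
```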
- Best model BSL-1K(m.5), last 20 frames, video pose pretrained
Model | ins. top-1 | ins. top-5 | cls. top-1 | cls. top-5 | Links |
---|---|---|---|---|---|
BSL-1K | 75.51 | 88.83 | 52.76 | 72.14 | run, args, model, logs |
- Trade-off between training noise and size
Model | ins. top-1 | ins. top-5 | cls. top-1 | cls. top-5 | Links |
---|---|---|---|---|---|
BSL-1K(m.5) | 70.61 | 85.26 | 47.47 | 68.13 | run, args, model, logs |
BSL-1K(m.6) | 71.33 | 85.92 | 48.83 | 68.82 | run, args, model, logs |
BSL-1K(m.7) | 70.95 | 85.73 | 48.13 | 67.81 | run, args, model, logs |
BSL-1K(m.8) | 69.00 | 83.79 | 45.86 | 64.42 | run, args, model, logs |
BSL-1K(m.9) | 60.53 | 77.51 | 35.09 | 54.26 | run, args, model, logs |
- Contribution of individual cues (pose subset of the data)
Model | ins. top-1 | ins. top-5 | cls. top-1 | cls. top-5 | Links |
---|---|---|---|---|---|
Pose2Sign (70p face) | 24.41 | 47.59 | 9.74 | 25.99 | run, args, model, logs |
Pose2Sign (60p body,hands) | 40.47 | 59.45 | 20.24 | 39.27 | run, args, model, logs |
Pose2Sign (130p all) | 49.66 | 68.02 | 29.91 | 49.21 | run, args, model, logs |
I3D (face-crop) | 42.23 | 69.70 | 21.66 | 50.51 | run, args, model, logs |
I3D (mouth-masked) | 46.75 | 66.34 | 25.85 | 48.02 | run, args, model, logs |
I3D (full-frame) | 65.57 | 81.33 | 44.90 | 64.91 | run, args, model, logs |
- Effect of pretraining
Model | ins. top-1 | ins. top-5 | cls. top-1 | cls. top-5 | Links |
---|---|---|---|---|---|
Random init. | 39.80 | 61.01 | 15.76 | 29.87 | run, args, model, logs |
Gesture recognition | 46.93 | 65.95 | 19.59 | 36.44 | run, args, model, logs |
Sign recognition | 69.90 | 83.45 | 44.97 | 62.73 | run, args, model, logs |
Action recognition | 69.00 | 83.79 | 45.86 | 64.42 | run, args, model, logs |
Video pose distillation | 70.38 | 84.50 | 46.24 | 65.31 | run, args, model, logs |
- The effect of the temporal window for keyword spotting (KWS)
Model | ins. top-1 | ins. top-5 | cls. top-1 | cls. top-5 | Links |
---|---|---|---|---|---|
1 sec | 60.10 | 75.42 | 36.62 | 53.83 | run, args, model, logs |
2 sec | 64.91 | 80.98 | 40.29 | 59.63 | run, args, model, logs |
4 sec | 68.09 | 82.79 | 45.35 | 63.64 | run, args, model, logs |
8 sec | 69.00 | 83.79 | 45.86 | 64.42 | run, args, model, logs |
16 sec | 65.91 | 81.84 | 39.51 | 59.03 | run, args, model, logs |
- The effect of the number of frames before the mouthing peak
Model | ins. top-1 | ins. top-5 | cls. top-1 | cls. top-5 | Links |
---|---|---|---|---|---|
16 frames | 59.53 | 77.08 | 36.16 | 58.43 | run, args, model, logs |
20 frames | 71.71 | 85.73 | 49.64 | 69.23 | run, args, model, logs |
24 frames | 69.00 | 83.79 | 45.86 | 64.42 | run, args, model, logs |
- WLASL dataset (isolated) - 64 frames input
Model | ins. top-1 | ins. top-5 | cls. top-1 | cls. top-5 | Links |
---|---|---|---|---|---|
Kinetics pretraining | 40.85 | 74.10 | 39.06 | 73.33 | run, args, model, logs |
BSL-1K pretraining | 46.82 | 79.36 | 44.72 | 78.47 | run, args, model, logs |
- MSASL dataset (isolated) - 64 frames input
Model | ins. top-1 | ins. top-5 | cls. top-1 | cls. top-5 | Links |
---|---|---|---|---|---|
Kinetics pretraining | 60.45 | 82.05 | 57.17 | 80.02 | run, args, model, logs |
BSL-1K pretraining | 64.71 | 85.59 | 61.55 | 84.43 | run, args, model, logs |
- Phoenix2014T dataset (co-articulated) - 16 frames input
Model | wer | del_rate | ins_rate | sub_rate | Links |
---|---|---|---|---|---|
Kinetics pretraining | 45.07 | 22.05 | 6.52 | 16.50 | run, args, model, logs |
BSL-1K pretraining | 39.49 | 22.54 | 5.03 | 11.92 | run, args, model, logs |
- BSL-Corpus dataset subset (co-articulated) - 16 frames input
Model | ins. top-1 | ins. top-5 | cls. top-1 | cls. top-5 | Links |
---|---|---|---|---|---|
Kinetics pretraining | 12.79 | 23.11 | 7.76 | 15.76 | run, args, model, logs |
BSL-1K pretraining | 24.35 | 39.14 | 16.00 | 28.54 | run, args, model, logs |
We are in the process of finalising legal confirmation from our broadcasting partners before we release data.
We would like to emphasise that this research represents work in progress towards achieving automatic sign language recognition and, as such, has a number of limitations that we are aware of (and likely many that we are not aware of). Key limitations include:
- The data collected with our technique is long-tailed (this can be seen in Fig. 2 of our paper, referenced below). This reflects the nature of how signs are used in reality, but it also makes it challenging to train existing vision models (which prefer balanced data).
- All data collected here is interpreted. Interpreted data differs from conversations between native signers (see e.g. this paper for a discussion on this point).
- Our approach naturally biases the annotated data towards mouthings (signs that are not frequently mouthed, or signers who do not mouth, are less represented).
If you use this code, please cite the following:
```bibtex
@INPROCEEDINGS{albanie20_bsl1k,
  title     = {{BSL-1K}: {S}caling up co-articulated sign language recognition using mouthing cues},
  author    = {Albanie, Samuel and Varol, G{\"u}l and Momeni, Liliane and Afouras, Triantafyllos and Chung, Joon Son and Fox, Neil and Zisserman, Andrew},
  booktitle = {ECCV},
  year      = {2020}
}
```