This is the official repo for ECCV2020 paper "Whole-Body Human Pose Estimation in the Wild". The slides of this work can be found in Slides. The repo contains COCO-WholeBody annotations proposed in this paper.
[2020.09.19] We have supported COCO-WholeBody in MMPose. A much stronger baseline model dark_pose_hrnet_w48+ with WholeBody AP 66.1% is provided for research purpose.
[2020.09.15] The evaluation tool is integrated into Extended COCO API for version>=1.5
To install from pip:
pip install xtcocotools
COCO-WholeBody dataset is the first large-scale benchmark for whole-body pose estimation. It is an extension of COCO 2017 dataset with the same train/val split as COCO.
Here is an example of one annotated image.
For each person, we annotate 4 types of bounding boxes (person box, face box, left-hand box, and right-hand box) and 133 keypoints (17 for body, 6 for feet, 68 for face and 42 for hands). The face/hand box is defined as the minimal bounding rectangle of the keypoints. The keypoint annotations are illustrated as follows.
Images can be downloaded from COCO 2017 website.
COCO-WholeBody annotations for Train / Validation (Google Drive).
The data format is defined in DATA_FORMAT.
We provide evaluation tools for COCO-WholeBody dataset. Our evaluation tools is developed based on @cocodataset/cocoapi.
We also provide an example groundtruth file (example_gt.json) and an example pred file (example_pred.json).
Evaluate on COCO-WholeBody by running the following line:
python evaluation/evaluation_wholebody.py --gt_file evaluation/example_gt.json --res_file evaluation/example_pred.json
-
COCO-WholeBody dataset is ONLY for research and non-commercial use.
-
The annotations of COCO-WholeBody dataset belong to SenseTime Research, and are licensed under a Creative Commons Attribution 4.0 License.
-
We do not own the copyright of the images. Use of the images must abide by the Flickr Terms of Use. The users of the images accept full responsibility for the use of the dataset, including but not limited to the use of any copies of copyrighted images that they may create from the dataset.
Overview of some popular public datasets for 2D keypoint estimation in RGB images. Kpt stands for keypoints, and #Kpt means the annotated number. ``Wild'' denotes whether the dataset is collected in-the-wild. * means head box.
DataSet | Images | #Kpt | Wild | Body Box | Hand Box | Face Box | Body Kpt | Hand Kpt | Face Kpt | Total |
---|---|---|---|---|---|---|---|---|---|---|
MPII [1] | 25K | 16 | ✔️ | ✔️ | * | ✔️ | 40K | |||
MPII-TRB [2] | 25K | 40 | ✔️ | ✔️ | * | ✔️ | 40K | |||
CrowdPose [3] | 20K | 14 | ✔️ | ✔️ | ✔️ | 80K | ||||
PoseTrack [4] | 23K | 15 | ✔️ | ✔️ | ✔️ | 150K | ||||
AI Challenger [5] | 300K | 14 | ✔️ | ✔️ | ✔️ | 700K | ||||
COCO [6] | 200K | 17 | ✔️ | ✔️ | * | ✔️ | 250K | |||
OneHand10K [7] | 10K | 21 | ✔️ | ✔️ | ✔️ | - | ||||
SynthHand [8] | 63K | 21 | ✔️ | ✔️ | - | |||||
RHD [9] | 41K | 21 | ✔️ | ✔️ | - | |||||
FreiHand [10] | 130K | 21 | ✔️ | - | ||||||
MHP [11] | 80K | 21 | ✔️ | ✔️ | - | |||||
GANerated [12] | 330K | 21 | ✔️ | - | ||||||
Panoptic [13] | 15K | 21 | ✔️ | ✔️ | - | |||||
WFLW [14] | 10K | 98 | ✔️ | ✔️ | ✔️ | - | ||||
AFLW [15] | 25K | 19 | ✔️ | ✔️ | ✔️ | - | ||||
COFW [16] | 1852 | 29 | ✔️ | ✔️ | ✔️ | - | ||||
300W [17] | 3837 | 68 | ✔️ | ✔️ | ✔️ | - | ||||
Halpe[27] | 43K | 136 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 50K |
COCO-WholeBody | 200K | 133 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 250K |
Whole-body pose estimation results on our WholeBody benchmark.
Method | body | foot | face | hand | whole | |||||
---|---|---|---|---|---|---|---|---|---|---|
AP | AR | AP | AR | AP | AR | AP | AR | AP | AR | |
OpenPose [18] | 0.563 | 0.612 | 0.532 | 0.645 | 0.482 | 0.626 | 0.198 | 0.342 | 0.338 | 0.449 |
SN [19] | 0.280 | 0.336 | 0.121 | 0.277 | 0.382 | 0.440 | 0.138 | 0.336 | 0.161 | 0.209 |
PAF [20] | 0.266 | 0.328 | 0.100 | 0.257 | 0.309 | 0.362 | 0.133 | 0.321 | 0.141 | 0.185 |
PAF-body [20] | 0.409 | 0.470 | - | - | - | - | - | - | - | - |
AE [21] | 0.405 | 0.464 | 0.077 | 0.160 | 0.477 | 0.580 | 0.341 | 0.435 | 0.274 | 0.350 |
AE-body [21] | 0.582 | 0.634 | - | - | - | - | - | - | - | - |
HRNet [22] | 0.659 | 0.709 | 0.314 | 0.424 | 0.523 | 0.582 | 0.300 | 0.363 | 0.432 | 0.520 |
HRNet-body [22] | 0.758 | 0.809 | - | - | - | - | - | - | - | - |
ZoomNet | 0.743 | 0.802 | 0.798 | 0.869 | 0.623 | 0.701 | 0.401 | 0.498 | 0.541 | 0.658 |
WholeBody-Face (WBF) & WholeBody-Hand (WBH) are subsets of COCO-WholeBody.
We build WBF & WBH by extracting cropped face & hand images and annotations from COCO-WholeBody.
Method | extra. | comm.↓ | chall.↓ | full ↓ | test ↓ |
---|---|---|---|---|---|
RCN [23] | - | 4.67 | 8.44 | 5.41 | - |
DAN [24] | - | 3.19 | 5.24 | 3.59 | 4.30 |
DCFE [25] | w/3D | 2.76 | 5.22 | 3.24 | 3.88 |
LAB [14] | w/Boundary | 2.98 | 5.19 | 3.49 | - |
HRNet [26] | - | 2.87 | 5.15 | 3.32 | 3.85 |
HRNet-Ours | - | 2.89 | 5.15 | 3.33 | 3.91 |
HRNet-Ours | WBF | 2.84 | 4.73 | 3.21 | 3.68 |
Train-set | Test-set | EPE ↓ | NME ↓ |
---|---|---|---|
CMU Panoptic [13] | CMU Panoptic [13] | 7.49 | 0.68 |
WBH → CMU Panoptic [13] | CMU Panoptic [13] | 7.00 | 0.63 |
WBH | WBH | 2.76 | 6.66 |
CMU Panoptic [13] → WBH | WBH | 2.70 | 6.49 |
If you use this dataset in your project, please cite this paper.
@inproceedings{jin2020whole,
title={Whole-Body Human Pose Estimation in the Wild},
author={Jin, Sheng and Xu, Lumin and Xu, Jin and Wang, Can and Liu, Wentao and Qian, Chen and Ouyang, Wanli and Luo, Ping},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2020}
}
[1] Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
[2] Duan, H., Lin, K.Y., Jin, S., Liu, W., Qian, C., Ouyang, W.: Trb: A novel triplet representation for understanding 2d human body. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9479–9488 (2019)
[3] Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.S., Lu, C.: Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10863–10872 (2019)
[4] Andriluka, M., Iqbal, U., Insafutdinov, E., Pishchulin, L., Milan, A., Gall, J., Schiele, B.: Posetrack: A benchmark for human pose estimation and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
[5] Wu, J., Zheng, H., Zhao, B., Li, Y., Yan, B., Liang, R., Wang, W., Zhou, S., Lin, G., Fu, Y., et al.: Ai challenger: a large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475 (2017)
[6] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dolla ́r, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision (ECCV) (2014)
[7] Wang, Y., Peng, C., Liu, Y.: Mask-pose cascaded cnn for 2d hand pose estimation from single color image. IEEE Transactions on Circuits and Systems for Video Technology (2018)
[8] Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, D., Theobalt, C.: Real-time hand tracking under occlusion from an egocentric rgb-d sensor. In: Proceedings of International Conference on Computer Vision (ICCV) (2017)
[9] Zimmermann, C., Brox, T.: Learning to estimate 3d hand pose from single rgb images. arXiv preprint arXiv: 1705.01389 (2017)
[10] Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: Freihand: A dataset for markerless capture of hand pose and shape from single rgb images. In: Proceedings of International Conference on Computer Vision (ICCV) (2019)
[11] Gomez-Donoso, F., Orts-Escolano, S., Cazorla, M.: Large-scale multiview 3d hand pose dataset. arXiv preprint arXiv:1707.03742 (2017)
[12] Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Sridhar, S., Casas, D., Theobalt, C.: Ganerated hands for real-time 3d hand tracking from monocular rgb. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
[13] Simon, T., Joo, H., Matthews, I., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
[14] Wu, W., Qian, C., Yang, S., Wang, Q., Cai, Y., Zhou, Q.: Look at boundary: A boundary-aware face alignment algorithm. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
[15] Koestinger, M., Wohlhart, P., Roth, P.M., Bischof, H.: Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In: IEEE International Conference on Computer Vision Workshop (2011)
[16] Burgos-Artizzu, X.P., Perona, P., Dolla ́r, P.: Robust face landmark estimation under occlusion. In: Proceedings of the 2013 IEEE International Conference on Computer Vision (2013)
[17] Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: The first facial landmark localization challenge. In: IEEE International Conference on Computer Vision Workshop (2013)
[18] Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: Openpose: real- time multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008 (2018)
[19] Hidalgo, G., Raaj, Y., Idrees, H., Xiang, D., Joo, H., Simon, T., Sheikh, Y.: Single-network whole-body pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
[20] Cao,Z.,Simon,T.,Wei,S.E.,Sheikh,Y.:Realtimemulti-person2dposeestimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
[21] Newell, A., Huang, Z., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems (2017)
[22] Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212 (2019)
[23] Honari, S., Yosinski, J., Vincent, P., Pal, C.: Recombinator networks: Learning coarse-to-fine feature aggregation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
[24] Kowalski,M.,Naruniec,J.,Trzcinski,T.:Deepalignmentnetwork:Aconvolutional neural network for robust face alignment. In: IEEE Conference on Computer Vision and Pattern Recognition Workshop (2017)
[25] Valle, R., Buenaposada, J.M., Valdes, A., Baumela, L.: A deeply-initialized coarse- to-fine ensemble of regression trees for face alignment. In: Proceedings of the Eu- ropean Conference on Computer Vision (ECCV) (2018)
[26] Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W., Wang, J.: High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514 (2019)
[27] Fang, H.S., Li, J., Tang, H., Xu, C., Zhu, H., Li, Y., Xiu, Y., Lu, C.: https://github.com/Fang-Haoshu/Halpe-FullBody (2020)