This is the original PyTorch implementation of SRNet: Improving Generalization in 3D Human Pose Estimation with a Split-and-Recombine Approach (ECCV 2020).
Recently, our method has also been validated in 3D human mesh recovery, where it serves as a decoder that achieves both per-frame accuracy and motion smoothness (ICCV 2021)!
Beyond this task, you can exploit prior knowledge of your own task to design the grouping strategy. The proposed Split-and-Recombine approach is an efficient and effective replacement for a fully connected layer: it uses about 1/G of the parameters, where G is the number of groups (G = 5 in this task), while achieving better performance.
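The parameter saving can be checked with simple arithmetic. The sketch below assumes a hypothetical split-and-recombine layer in which each of the G groups maps its own share of the input, plus a small context vector summarizing the other groups, to its share of the output. The sizes (1000 features, context_dim = 8) are illustrative, not the repository's settings.

```python
def fc_params(d_in: int, d_out: int) -> int:
    """Weights + biases of a single fully connected layer."""
    return d_in * d_out + d_out


def split_recombine_params(d_in: int, d_out: int, groups: int, context_dim: int) -> int:
    """Parameter count of a hypothetical split-and-recombine layer.

    Each group maps its d_in/groups local features plus a context_dim-sized
    summary of the other groups to d_out/groups outputs.
    """
    g_in, g_out = d_in // groups, d_out // groups
    local = groups * fc_params(g_in + context_dim, g_out)      # per-group branches
    context = groups * fc_params(d_in - g_in, context_dim)     # rest-of-body -> context
    return local + context


full = fc_params(1000, 1000)
sr = split_recombine_params(1000, 1000, groups=5, context_dim=8)
print(full, sr, round(sr / full, 3))  # 1001000 241040 0.241
```

Without the context term (context_dim = 0) the count is exactly G small layers of size d/G by d/G, i.e. about 1/G of the dense layer; the low-dimensional context adds only a small overhead on top.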
- Support single-frame setting (e.g., -arc 1,1,1)
- Support multi-frame setting (e.g., -arc 3,3,3,3,3 for 243 frames)
- Support four normalization options (--norm {base,proj,weak_proj,lcn})
- Support cross-subject, cross-action, and cross-camera settings
- Support VideoPose3D and SimpleBaseline as baselines
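The -arc values determine how many input frames a model sees. Assuming, as in VideoPose3D, that they are the filter widths of a stack of dilated temporal convolutions whose dilation equals the product of the preceding widths, the receptive field can be computed as follows (receptive_field is a hypothetical helper):

```python
def receptive_field(arc: str) -> int:
    """Input frames covered by dilated temporal convolutions with the given
    comma-separated filter widths, e.g. "3,3,3,3,3"."""
    widths = [int(w) for w in arc.split(",")]
    field, dilation = 1, 1
    for w in widths:
        field += (w - 1) * dilation  # each layer widens the window by (w - 1) dilated steps
        dilation *= w                # the next layer's dilation is the product of widths so far
    return field


print(receptive_field("3,3,3,3,3"))  # 243
print(receptive_field("1,1,1"))      # 1
```

With this dilation scheme the receptive field equals the product of the widths, which is why five layers of width 3 give 3^5 = 243 frames.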
In this setting, monocular 3D human pose estimation takes 2D poses as input and lifts them to root-relative 3D poses. By default, the root joint (index 0) is taken as the origin in camera coordinates.
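Converting a 3D pose to root-relative form amounts to subtracting the root joint from every joint. A small NumPy sketch, assuming poses are stored as arrays of shape (..., num_joints, 3) in camera coordinates; to_root_relative is a hypothetical helper, not a function from this repository:

```python
import numpy as np


def to_root_relative(pose_3d: np.ndarray, root: int = 0) -> np.ndarray:
    """Subtract the root joint from every joint.

    pose_3d: array of shape (..., num_joints, 3); leading axes (e.g. frames)
    are handled by broadcasting.
    """
    return pose_3d - pose_3d[..., root:root + 1, :]


pose = np.array([[1.0, 2.0, 3.0],   # root joint
                 [2.0, 2.0, 3.0],
                 [1.0, 5.0, 3.0]])
print(to_root_relative(pose))  # root row becomes [0, 0, 0]
```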
Human poses that are rare or unseen in a training set are challenging for a network to predict. Similar to the long-tailed distribution problem in visual recognition, the small number of examples for such poses limits the ability of networks to model them. Interestingly, local pose distributions suffer less from the long-tail problem, i.e., local joint configurations within a rare pose may appear within other poses in the training set, making them less rare.
We propose to take advantage of this fact for better generalization to rare and unseen poses. To be specific, our method splits the body into local regions and processes them in separate network branches, utilizing the property that a joint's position depends mainly on the joints within its local body region. Global coherence is maintained by recombining the global context from the rest of the body into each branch as a low-dimensional vector. With the reduced dimensionality of less relevant body areas, the training set distribution within network branches more closely reflects the statistics of local poses instead of global body poses, without sacrificing information important for joint inference. The proposed split-and-recombine approach, called SRNet, can be easily adapted to both single-image and temporal models, and it leads to appreciable improvements in the prediction of rare and unseen poses.
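To make the split-and-recombine idea concrete, here is a framework-agnostic NumPy sketch of a single layer: each body region is processed by its own branch, and the rest of the body is compressed into a low-dimensional context vector that is concatenated to the branch input. The joint grouping (which assumes the 17-joint VideoPose3D ordering), layer sizes, and function names are illustrative assumptions, and biases and nonlinearities are omitted; this is not the SRNet code.

```python
import numpy as np

# Hypothetical 5-region grouping of a 17-joint skeleton (VideoPose3D joint
# order assumed): torso/head, right leg, left leg, left arm, right arm.
BODY_GROUPS = [[0, 7, 8, 9, 10], [1, 2, 3], [4, 5, 6], [11, 12, 13], [14, 15, 16]]


def make_sr_layer(groups, d_out_per_group, context_dim, d_in_per_joint=2, seed=0):
    """Random weights for one split-and-recombine layer (biases omitted)."""
    rng = np.random.default_rng(seed)
    num_joints = sum(len(g) for g in groups)
    params = []
    for g in groups:
        d_local = len(g) * d_in_per_joint
        d_rest = (num_joints - len(g)) * d_in_per_joint
        params.append({
            # rest of the body -> low-dimensional context vector
            "W_ctx": rng.standard_normal((d_rest, context_dim)) * 0.1,
            # [local joints, context] -> this branch's output
            "W_loc": rng.standard_normal((d_local + context_dim, d_out_per_group)) * 0.1,
        })
    return params


def sr_forward(x, groups, params):
    """x: (batch, num_joints, d_in_per_joint) -> (batch, len(groups) * d_out_per_group)."""
    outs = []
    for g, p in zip(groups, params):
        mask = np.zeros(x.shape[1], dtype=bool)
        mask[g] = True
        local = x[:, mask].reshape(len(x), -1)   # split: joints of this region
        rest = x[:, ~mask].reshape(len(x), -1)   # all other joints
        context = rest @ p["W_ctx"]              # recombine: low-dim global context
        outs.append(np.concatenate([local, context], axis=1) @ p["W_loc"])
    return np.concatenate(outs, axis=1)
```

If W_ctx were zero, each branch would see only its own region; a small nonzero context_dim keeps global coherence while most of each branch's input stays local.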
Comparison of different network structures used for 2D-to-3D pose estimation.
To get started as quickly as possible, follow the instructions in this section. It allows you to train a model from scratch, test our pretrained models, and produce basic visualizations. For more detailed instructions, please refer to DOCUMENTATION.md.
Make sure you have the following dependencies installed before proceeding:
- Python 3+ distribution
- PyTorch >= 0.4.0
- pip install matplotlib==3.1.1
First, create directories to store the models:
mkdir checkpoint
mkdir best_checkpoint
The ${ROOT} directory is organized as follows.
|-- data/
|-- checkpoint/
|-- best_checkpoint/
|-- common/
|-- config/
|-- run.py
Please follow the instructions from VideoPose3D to process the data from the official Human3.6M website. You can also download the processed skeleton-based Human3.6M data from the link. Put the data into the directory data/.
mkdir data
cd data
The data directory structure is shown as follows.
./
└── data/
    ├── data_2d_h36m_gt.npz
    └── data_3d_h36m.npz
- data_2d_h36m_gt.npz contains the 2D ground-truth poses of the Human3.6M dataset.
- data_3d_h36m.npz contains the 3D ground-truth poses of the Human3.6M dataset.
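A quick way to check the downloaded files is to load them with NumPy. This sketch assumes the archive layout produced by VideoPose3D's preprocessing: a "positions_3d" dict (subject -> action -> array) in the 3D file, and "positions_2d" (subject -> action -> list of per-camera arrays) plus "metadata" in the 2D file. load_h36m is a hypothetical helper.

```python
import numpy as np


def load_h36m(path_3d, path_2d):
    """Load VideoPose3D-style Human3.6M archives (key names assumed).

    Each key stores a pickled Python dict, so allow_pickle=True is required.
    Returns (poses_3d, poses_2d, metadata).
    """
    with np.load(path_3d, allow_pickle=True) as archive:
        poses_3d = archive["positions_3d"].item()   # subject -> action -> array
    with np.load(path_2d, allow_pickle=True) as archive:
        poses_2d = archive["positions_2d"].item()   # subject -> action -> per-camera list
        metadata = archive["metadata"].item()
    return poses_3d, poses_2d, metadata


# Example (run from ${ROOT}):
# poses_3d, poses_2d, meta = load_h36m("data/data_3d_h36m.npz", "data/data_2d_h36m_gt.npz")
# print(sorted(poses_3d))
```

Only load archives you trust with allow_pickle=True, since unpickling can execute arbitrary code.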
We provide pretrained single-frame and multi-frame models in the link for inference and fine-tuning.
Put the pretrained models in checkpoint/. For example:
- latest_epoch_fc.bin is the last checkpoint of the fully connected network.
- latest_epoch_sr.bin is the last checkpoint of the split-and-recombine network with 1 input frame.
- srnet_gp5_t243_mul.bin is the last checkpoint of the split-and-recombine network with 243 input frames.
Use --evaluate {model_name} to test a model.
Use --resume {model_name} to resume from a checkpoint and fine-tune the model.
For example:
python run.py -arc 3,3,3,3,3 --model srnet --evaluate srnet_gp5_t243_mul.bin
python run.py -arc 1,1,1 --model srnet --evaluate latest_epoch_sr.bin
python run.py -arc 1,1,1 --model fc --evaluate latest_epoch_fc.bin
There are three training and test settings; the most commonly used one, and the default, is cross-subject. We train on five subjects with all four cameras and all fifteen actions, and test on the other two subjects with all cameras and all actions.
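For reference, the standard Human3.6M cross-subject protocol trains on S1, S5, S6, S7 and S8 and tests on S9 and S11. A minimal sketch of that split over a subject-keyed dict; the dict layout and function name are assumptions, not this repository's code:

```python
TRAIN_SUBJECTS = ["S1", "S5", "S6", "S7", "S8"]
TEST_SUBJECTS = ["S9", "S11"]


def split_by_subject(poses, train_subjects=TRAIN_SUBJECTS, test_subjects=TEST_SUBJECTS):
    """poses: subject -> action -> data. Returns (train, test) dicts keyed by subject."""
    train = {s: poses[s] for s in train_subjects if s in poses}
    test = {s: poses[s] for s in test_subjects if s in poses}
    return train, test


# Example: train, test = split_by_subject(poses_3d)
```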
To train the split-and-recombine network from scratch with 243 input frames and 1 output frame, run:
python run.py -arc 3,3,3,3,3 --model srnet -mn {given_model_name}
-mn sets the name under which the trained model is saved.
To train VideoPose3D from scratch with 243 input frames and 1 output frame, run:
python run.py -arc 3,3,3,3,3 --model fc -mn {given_model_name}
In the cross-action setting, we train on a single action with all subjects and all cameras, and test on the other fourteen actions with all subjects and all cameras.
To enable it, add arguments such as: --use-action-split True --train-action Discussion
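The cross-action split can be sketched as follows. It assumes a subject -> action dict layout in which repeated recordings carry a numeric suffix (e.g. "Discussion 1"), so only the first word of the action name is compared. split_by_action is hypothetical and may differ from how --use-action-split is implemented.

```python
def split_by_action(poses, train_action="Discussion"):
    """poses: subject -> action -> data. Returns (train, test) dicts of the same shape."""
    train, test = {}, {}
    for subject, actions in poses.items():
        for action, data in actions.items():
            base = action.split(" ")[0]  # "Discussion 1" -> "Discussion"
            target = train if base == train_action else test
            target.setdefault(subject, {})[action] = data
    return train, test
```

Comparing the exact base name (rather than a prefix) keeps, for example, "SittingDown" out of a "Sitting" training split.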
In the cross-camera setting, we train on a single camera with all subjects and all actions, and test on the other three cameras with all subjects and all actions.
To enable it, add arguments such as: --cam-train [0] --cam-test [1,2,3]
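A cross-camera split of the 2D data might look like this, assuming each action maps to a list of four per-camera arrays indexed 0 to 3; split_by_camera is a hypothetical helper.

```python
def split_by_camera(poses_2d, cam_train=(0,), cam_test=(1, 2, 3)):
    """poses_2d: subject -> action -> list of per-camera arrays.

    Returns two flat lists of sequences: training views and test views.
    """
    train, test = [], []
    for actions in poses_2d.values():
        for per_camera in actions.values():
            train.extend(per_camera[c] for c in cam_train)
            test.extend(per_camera[c] for c in cam_test)
    return train, test
```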
To run experiments with several hyper-parameter settings in one go, you can edit the script run_os.py.
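As an alternative to editing run_os.py, a small launcher can enumerate settings and call run.py for each. This hypothetical sketch uses only the -arc, --model and -mn flags documented above; the model-naming scheme is an assumption.

```python
import itertools
import shlex
import subprocess


def build_commands(archs, models):
    """One run.py command per (architecture, model) combination."""
    cmds = []
    for arc, model in itertools.product(archs, models):
        name = f"{model}_arc{arc.replace(',', '')}"  # hypothetical naming scheme
        cmds.append(["python", "run.py", "-arc", arc, "--model", model, "-mn", name])
    return cmds


def run_all(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Example: run_all(build_commands(["1,1,1", "3,3,3,3,3"], ["srnet", "fc"]), dry_run=False)
```

The default dry run only prints the commands, which makes it easy to review a sweep before launching it.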
We also provide some configuration examples in the directory config/. To facilitate reproduction, we provide the training logs for the single-frame and multi-frame settings here. These logs also record the hyperparameters, training loss, and test results for each epoch.
If you find this repository useful for your work, please consider citing it as follows:
@inproceedings{Zeng2020SRNet,
title={SRNet: Improving Generalization in 3D Human Pose Estimation with a Split-and-Recombine Approach},
author={Ailing Zeng and Xiao Sun and Fuyang Huang and Minhao Liu and Qiang Xu and Stephen Ching-Feng Lin},
booktitle={ECCV},
year={2020}
}