As a sanity check, fine-tune from our Kinetics pre-trained checkpoints and verify the expected accuracy below:
|  | ViT-Large | ViT-Huge |
| --- | --- | --- |
| pre-trained checkpoint on Kinetics-400 | download | download |
| md5 | `edf3a5` | `3d7f64` |

|  | ViT-Large | ViT-Huge |
| --- | --- | --- |
| pre-trained checkpoint on Kinetics-600 | download | download |
| md5 | `9a9645` | `27495e` |

|  | ViT-Large | ViT-Huge |
| --- | --- | --- |
| pre-trained checkpoint on Kinetics-700 | download | download |
| md5 | `cdbada` | `4c4e3c` |
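The md5 values above appear to be six-character digest prefixes. A minimal sketch for checking a download against them (the checkpoint filename below is a placeholder, not the actual file name):

```python
import hashlib

def md5_prefix(path, length=6, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, truncated to `length` characters."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large checkpoints don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()[:length]

# Placeholder filename -- substitute the path of your downloaded checkpoint.
assert md5_prefix("mae_pretrain_vit_large_k400.pth") == "edf3a5"
```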
Fine-tune ViT-Large (`${KINETICS_DIR}` is a directory containing the `{train, val}` sets of Kinetics):
```
python run_finetune.py \
    --path_to_data_dir ${KINETICS_DIR} \
    --rand_aug --epochs 50 --repeat_aug 2 \
    --model vit_large_patch16 \
    --batch_size 2 \
    --distributed --dist_eval \
    --smoothing 0.1 --mixup 0.8 --cutmix 1.0 --mixup_prob 1.0 \
    --blr 0.0024 \
    --num_frames 16 --sampling_rate 4 \
    --dropout 0.3 --warmup_epochs 5 \
    --layer_decay 0.75 --drop_path_rate 0.2 \
    --aa rand-m7-mstd0.5-inc1 \
    --clip_grad 5.0 --fp32
```
This should give:
* Acc@1 84.35
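Note that `--blr` is a base learning rate, not the absolute one. In MAE-style codebases the absolute rate follows the linear scaling rule `lr = blr * eff_batch_size / 256`; whether `--repeat_aug` enters the effective batch size here is an assumption worth verifying against `run_finetune.py`. A sketch:

```python
def absolute_lr(blr, batch_size_per_gpu, num_gpus, repeat_aug=1, accum_iter=1):
    # Linear scaling rule used by MAE-style codebases: lr = blr * eff_bs / 256.
    eff_batch_size = batch_size_per_gpu * repeat_aug * num_gpus * accum_iter
    return blr * eff_batch_size / 256

# Hypothetical 128-GPU run with the flags above (--batch_size 2, --repeat_aug 2):
print(absolute_lr(0.0024, batch_size_per_gpu=2, num_gpus=128, repeat_aug=2))  # 0.0048
```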
- The pre-trained models we provide are trained with normalized pixels `--norm_pix_loss` (1600 effective epochs). The models were pre-trained with the PySlowFast codebase.
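With `--norm_pix_loss`, the reconstruction target for each patch is normalized by that patch's own pixel mean and variance before the loss is computed, following the public MAE implementation; the `(N, L, D)` tensor layout below is an assumption. A minimal sketch of the target transform:

```python
import torch

def normalized_pixel_target(patches, eps=1e-6):
    # patches: (N, L, D) patchified pixels -- N clips, L patches,
    # D pixel values per patch (assumed layout).
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    # Normalize each patch by its own statistics, as --norm_pix_loss does.
    return (patches - mean) / (var + eps) ** 0.5
```

Since fine-tuning uses only the encoder, this choice affects the pre-training targets rather than the fine-tuning recipe above.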