We directly use existing pre-trained models. For video, we use the Kinetics-400 model pre-trained by VideoMAE for 1600 epochs.
We made two minor modifications to the ViT implementation to allow for more flexible experiments:
- We unbind the fused query, key, value projection layer into three separate linear layers (a forward-pass sketch using both modifications follows this list):
# From:
self.qkv = nn.Linear(dim, all_head_dim * 3, bias=False)
# To:
self.q_proj = nn.Linear(dim, all_head_dim, bias=False)
self.k_proj = nn.Linear(dim, all_head_dim, bias=False)
self.v_proj = nn.Linear(dim, all_head_dim, bias=False)
- We decompose the encapsulated `Mlp` block into its individual layers:
# From:
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
# To:
self.fc1 = nn.Linear(dim, mlp_hidden_dim)
self.fc2 = nn.Linear(mlp_hidden_dim, dim)
self.act = act_layer()
self.mlp_drop = nn.Dropout(drop)
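For orientation, the sketch below shows how a transformer block's forward pass could use the unbound q/k/v projections together with the decomposed MLP layers. The surrounding block structure (pre-norm layout, residual connections, the `attn_out` output projection, and the head reshaping) is an assumption made for illustration, not the exact code of this repository.

```python
import torch
import torch.nn as nn

class DecomposedBlock(nn.Module):
    """Illustrative block with separate q/k/v projections and an unpacked MLP (layout assumed)."""

    def __init__(self, dim, num_heads, mlp_hidden_dim, act_layer=nn.GELU, drop=0.0):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        # Separate projections instead of a fused qkv layer.
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.attn_out = nn.Linear(dim, dim)
        # Decomposed MLP instead of an encapsulated Mlp module.
        self.fc1 = nn.Linear(dim, mlp_hidden_dim)
        self.fc2 = nn.Linear(mlp_hidden_dim, dim)
        self.act = act_layer()
        self.mlp_drop = nn.Dropout(drop)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        B, N, C = x.shape
        h = self.norm1(x)
        # Project tokens to queries, keys, and values with the three separate layers.
        q = self.q_proj(h).reshape(B, N, self.num_heads, -1).transpose(1, 2)
        k = self.k_proj(h).reshape(B, N, self.num_heads, -1).transpose(1, 2)
        v = self.v_proj(h).reshape(B, N, self.num_heads, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        x = x + self.attn_out((attn @ v).transpose(1, 2).reshape(B, N, C))
        # MLP with the layers exposed individually.
        h = self.norm2(x)
        h = self.mlp_drop(self.act(self.fc1(h)))
        x = x + self.mlp_drop(self.fc2(h))
        return x

# Quick shape check.
block = DecomposedBlock(dim=768, num_heads=12, mlp_hidden_dim=3072)
print(block(torch.randn(2, 8, 768)).shape)  # torch.Size([2, 8, 768])
```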
Therefore, the checkpoint provided by VideoMAE needs to be converted with `convert.py`.
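`convert.py` performs the actual remapping; the function below is only a rough sketch of the idea, splitting each fused qkv weight into three chunks and dropping the `Mlp` wrapper from the MLP key names. The key patterns (`attn.qkv.weight`, `.mlp.fc1`/`.mlp.fc2`), the `"model"` nesting, and the file paths used in the example call are assumptions based on the common timm/VideoMAE layout, so defer to `convert.py` for the exact mapping.

```python
import torch

def convert_checkpoint(src_path, dst_path):
    """Rough sketch of the key remapping; see convert.py for the real mapping."""
    ckpt = torch.load(src_path, map_location="cpu")
    state = ckpt.get("model", ckpt)  # assumed: weights nested under a "model" key
    new_state = {}
    for name, tensor in state.items():
        if "attn.qkv.weight" in name:
            # Split the fused projection into the three separate layers.
            q, k, v = tensor.chunk(3, dim=0)
            new_state[name.replace("qkv", "q_proj")] = q
            new_state[name.replace("qkv", "k_proj")] = k
            new_state[name.replace("qkv", "v_proj")] = v
        elif ".mlp.fc1." in name or ".mlp.fc2." in name:
            # The decomposed MLP keeps fc1/fc2 directly on the block, without the Mlp wrapper.
            new_state[name.replace(".mlp.fc", ".fc")] = tensor
        else:
            new_state[name] = tensor  # everything else (biases, norms, patch embed) passes through
    torch.save({"model": new_state}, dst_path)

# Illustrative paths only.
convert_checkpoint("videomae_kinetics400_vit_b_1600.pth", "videomae_pretrain_vit_b_1600.pth")
```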
Alternatively, you can directly use our preprocessed checkpoint `videomae_pretrain_vit_b_1600.pth`.
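Whichever checkpoint you use, a quick sanity check is to list a few keys and confirm they match the unbound layer names; the nesting of weights under a `"model"` key is again an assumption about the checkpoint layout.

```python
import torch

# List a few keys to confirm the checkpoint matches the modified module names.
ckpt = torch.load("videomae_pretrain_vit_b_1600.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # adjust if the file stores weights flat
print([k for k in state if "q_proj" in k or ".fc1." in k][:8])
```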