Can't Reproduce Zero-Shot Performance on MSRVTT and LSMDC with InternVid-10M-FLT Checkpoint #139

Open
fmthoker opened this issue Jun 17, 2024 · 2 comments


fmthoker commented Jun 17, 2024

Dear Authors,
I am trying to reproduce the zero-shot performance with the ViCLIP-L-14 InternVid-10M-FLT checkpoint.
However, the results differ from the numbers reported in the paper. Here is what I obtain:

MSRVTT:
                     txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
msrvtt_1k_test/        38.9    62.2     74.0       58.37    39.4    61.9     73.0       58.10   58.23
msrvtt_1k_test_emb/    39.0    62.2     73.3       58.17    39.1    63.2     73.9       58.73   58.45

LSMDC:
           txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
test/        15.2    29.0     35.6        26.6    17.8    32.1     40.1       30.00   28.30
test_emb/    15.8    29.1     36.7        27.2    18.5    32.7     40.8       30.67   28.93
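
For readers comparing these rows with the paper: the mean columns appear to be plain averages, with txt_r_mean and img_r_mean each averaging R@1, R@5 and R@10 in one direction, and r_mean averaging those two. This is inferred from the numbers themselves, not from the evaluation code. A minimal Python sketch that checks this against the first MSRVTT row (`recall_means` is a hypothetical helper, not a function from the repository):

```python
def recall_means(txt_recalls, img_recalls):
    """Return (txt_r_mean, img_r_mean, r_mean) from per-direction R@1/R@5/R@10 lists.

    Assumption: each mean column is an unweighted average, which is what the
    numbers in the tables are consistent with.
    """
    txt_r_mean = sum(txt_recalls) / len(txt_recalls)
    img_r_mean = sum(img_recalls) / len(img_recalls)
    r_mean = (txt_r_mean + img_r_mean) / 2
    return txt_r_mean, img_r_mean, r_mean


# Values copied from the msrvtt_1k_test/ row.
txt_mean, img_mean, overall = recall_means([38.9, 62.2, 74.0], [39.4, 61.9, 73.0])
print(f"{txt_mean:.2f} {img_mean:.2f} {overall:.2f}")  # prints "58.37 58.10 58.23"
```

The same check reproduces the mean columns of the LSMDC rows.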

Here is the script I run to obtain these results:

source /ibex/user/thokerfm/anaconda3/bin/activate viclip
export PYTHONPATH=.

# Use the first allocated node as the rendezvous host and pick a random port >= 1024
MASTER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=$((RANDOM % (65535 - 1024 + 1) + 1024))

echo "$MASTER_NODE"
echo "$MASTER_PORT"

OUTPUT_DIR='expirements_zero_shot/ViClip-InternVid-10M-FLT/lsmdc/'

export OMP_NUM_THREADS=1
torchrun --rdzv_endpoint=${MASTER_NODE}:${MASTER_PORT} \
    --nnodes=1 \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    tasks/retrieval.py \
    "$(dirname "$0")/config.py" \
    wandb.enable False \
    train_corpus viclip \
    evaluate True \
    output_dir ${OUTPUT_DIR} \
    model.vision_encoder.pretrained 'CLIP-ViT-L/14' \
    model.text_encoder.pretrained 'CLIP-ViT-L/14' \
    pretrained_path pretrained_viclip_models/ViClip-InternVid-10M-FLT.pth

@leexinhao
Collaborator

I guess you didn't turn on wise-ft. At test time, we average the weights trained on InternVid-10M-FLT (the filtered subset) with the original CLIP weights.
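
The averaging described in this reply can be illustrated as a linear interpolation between two sets of weights. The following is a minimal Python sketch under stated assumptions, not the repository's implementation: `interpolate_weights` is a hypothetical helper, the equal mix (`alpha=0.5`) is assumed from the word "average", and both checkpoints are assumed to share exactly the same parameter names. The values can be anything that supports scalar multiplication and addition (for example tensors); plain floats are used here so the example runs without extra dependencies.

```python
def interpolate_weights(original_state, finetuned_state, alpha=0.5):
    """Linearly interpolate two weight dictionaries with identical keys.

    alpha=0.0 returns the original weights, alpha=1.0 the fine-tuned ones,
    and alpha=0.5 (an assumed default) is a plain average of the two.
    """
    if original_state.keys() != finetuned_state.keys():
        raise ValueError("both weight dictionaries must have identical keys")
    return {
        name: (1.0 - alpha) * original_state[name] + alpha * finetuned_state[name]
        for name in original_state
    }


# Toy values standing in for real parameter tensors (names are illustrative).
clip_weights = {"proj.weight": 1.0, "proj.bias": 0.0}
viclip_weights = {"proj.weight": 3.0, "proj.bias": 2.0}
merged = interpolate_weights(clip_weights, viclip_weights, alpha=0.5)
print(merged)
```

With real checkpoints, parameters that exist in only one of the two models would need separate handling, which this sketch deliberately rejects.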

@fmthoker
Author

@leexinhao thanks for the reply. After evaluating with wise ft = True, the results are indeed better:

MSRVTT:
                     txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
msrvtt_1k_test/        42.0    65.7     75.3        61.0    41.9    66.5     75.6       61.33   61.17
msrvtt_1k_test_emb/    42.8    66.8     75.5        61.7    42.8    67.2     75.5       61.83   61.77

LSMDC:
           txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
test/        16.4    32.0     39.3       29.23    18.8    36.1     43.5       32.80   31.02
test_emb/    17.9    33.2     40.8       30.63    18.7    36.9     44.5       33.37   32.00

Can you please confirm which numbers are reported in the paper (test or test_emb)?
