Changing model in text-to-video retrieval tutorial leads to poor performance #2493
yiling-chen asked this question in Q&A (Unanswered)
Hi all,
Has anyone tried this tutorial with the CLIP4clip model replaced by other models supported by Towhee?
https://codelabs.towhee.io/how-to-build-a-text-video-retrieval-engine/index
By following the tutorial, I can reproduce the metrics on MSR-VTT by using CLIP4clip.
However, when I tried the FrozenInTime and BridgeFormer models, I got only 0.007 and 0.003 Recall@1, respectively.
Compared to 0.421 for CLIP4clip, these are obviously not the right numbers.
I didn't make many changes to the code. For both FrozenInTime and BridgeFormer, I had to change the embedding dimension to 256. I also modified the video decoding by referring to the example code of each operator. The rest of the code remained the same.
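One thing I suspect could matter here: if the Milvus collection is created with the inner-product (IP) metric, as in the tutorial, retrieval quality depends on the embeddings being L2-normalized, and different models may or may not normalize their outputs. A plain-Python sketch of the kind of check I mean (the example vector is made up; real vectors here are 256-d):

```python
import math

def l2_normalize(vec):
    """Return vec scaled to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return vec if norm == 0 else [v / norm for v in vec]

# Made-up 4-d embedding just to illustrate the check.
emb = [3.0, 4.0, 0.0, 0.0]
unit = l2_normalize(emb)
print(unit)  # [0.6, 0.8, 0.0, 0.0]
print(math.sqrt(sum(v * v for v in unit)))  # 1.0
```

If the raw embedding norms from a model are far from 1.0, normalizing them before `to_milvus` (or switching the collection to the L2 metric) would be worth trying.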
For your reference,
FrozenInTime

```python
dc = (
    towhee.read_csv(test_sample_csv_path)
    .runas_op['video_id', 'id'](func=lambda x: int(x[-4:]))
    .video_decode.ffmpeg['video_path', 'frames'](
        sample_type='uniform_temporal_subsample', args={'num_samples': 4})
    .runas_op['frames', 'frames'](func=lambda x: [y for y in x])
    .video_text_embedding.frozen_in_time['frames', 'vec'](
        model_name='frozen_in_time_base_16_244', modality='video', device=device)
    .to_milvus['id', 'vec'](collection=collection, batch=30)
)
```
BridgeFormer

```python
dc = (
    towhee.read_csv(test_sample_csv_path)
    .runas_op['video_id', 'id'](func=lambda x: int(x[-4:]))
    .video_decode.ffmpeg['video_path', 'frames']()
    .runas_op['frames', 'frames'](func=lambda x: [y for y in x])
    .video_text_embedding.bridge_former['frames', 'vec'](
        model_name='frozen_model', modality='video')
    .to_milvus['id', 'vec'](collection=collection, batch=30)
)
```
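As a debugging idea (not part of the tutorial): comparing one text embedding against one video embedding with cosine similarity can show whether the two modalities land in a shared space at all; similarities near zero for matching pairs would point to a model-side mismatch rather than an indexing problem. A plain-Python sketch with made-up toy vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy stand-ins for a text embedding and a video embedding.
text_vec = [1.0, 0.0, 1.0]
video_vec = [1.0, 1.0, 0.0]
print(cosine(text_vec, video_vec))  # ~0.5, i.e. the spaces at least overlap
```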
Could anyone with experience with these operators provide some insight?
Thanks.