Changing model in text-to-video retrieval tutorial leads to poor performance #2493
yiling-chen asked this question in Q&A (Unanswered)
Hi all,
Has anyone tried this tutorial with the CLIP4clip model replaced by other models supported by Towhee?
https://codelabs.towhee.io/how-to-build-a-text-video-retrieval-engine/index
By following the tutorial, I can reproduce the metrics on MSR-VTT by using CLIP4clip.
However, when I tried the FrozenInTime and BridgeFormer models, I got only 0.007 and 0.003 Recall@1, respectively.
Compared to 0.421 for CLIP4clip, these are obviously not the right numbers.
I didn't make many changes to the code. For both FrozenInTime and BridgeFormer, I had to change the embedding dimension to 256. I also modified the video decoding by referring to the example code of each operator. The rest of the code remained the same.
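One thing I suspect could matter here: if the Milvus collection is created with the inner-product (IP) metric, as in the tutorial, retrieval quality depends on the embeddings being L2-normalized, and different models may or may not normalize their outputs. A plain-Python sketch of the kind of check I mean (the example vector is made up; real vectors here are 256-d):

```python
import math

def l2_normalize(vec):
    """Return vec scaled to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return vec if norm == 0 else [v / norm for v in vec]

# Made-up 4-d embedding just to illustrate the check.
emb = [3.0, 4.0, 0.0, 0.0]
unit = l2_normalize(emb)
print(unit)  # [0.6, 0.8, 0.0, 0.0]
print(math.sqrt(sum(v * v for v in unit)))  # 1.0
```

If the raw embedding norms from a model are far from 1.0, normalizing them before `to_milvus` (or switching the collection to the L2 metric) would be worth trying.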
For your reference,
FrozenInTime

```python
dc = (
    towhee.read_csv(test_sample_csv_path)
    .runas_op['video_id', 'id'](func=lambda x: int(x[-4:]))
    .video_decode.ffmpeg['video_path', 'frames'](
        sample_type='uniform_temporal_subsample', args={'num_samples': 4})
    .runas_op['frames', 'frames'](func=lambda x: [y for y in x])
    .video_text_embedding.frozen_in_time['frames', 'vec'](
        model_name='frozen_in_time_base_16_244', modality='video', device=device)
    .to_milvus['id', 'vec'](collection=collection, batch=30)
)
```
BridgeFormer

```python
dc = (
    towhee.read_csv(test_sample_csv_path)
    .runas_op['video_id', 'id'](func=lambda x: int(x[-4:]))
    .video_decode.ffmpeg['video_path', 'frames']()
    .runas_op['frames', 'frames'](func=lambda x: [y for y in x])
    .video_text_embedding.bridge_former['frames', 'vec'](
        model_name='frozen_model', modality='video')
    .to_milvus['id', 'vec'](collection=collection, batch=30)
)
```
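As a debugging idea (not part of the tutorial): comparing one text embedding against one video embedding with cosine similarity can show whether the two modalities land in a shared space at all; similarities near zero for matching pairs would point to a model-side mismatch rather than an indexing problem. A plain-Python sketch with made-up toy vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy stand-ins for a text embedding and a video embedding.
text_vec = [1.0, 0.0, 1.0]
video_vec = [1.0, 1.0, 0.0]
print(cosine(text_vec, video_vec))  # ~0.5, i.e. the spaces at least overlap
```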
Could anyone with experience with these operators provide some insight?
Thanks.