Do we need to crop the HiREST videos? #10

yeliudev · 2024-02-07T03:58:35Z

Hi @RenShuhuai-Andy, thanks for sharing this great work! For some videos in HiREST dataset, the filenames are "xxxx_35_79.mp4". Do we need to crop the original videos according to the timestamps in the filename (e.g., cropping the 35s to 79s in this case)?

RenShuhuai-Andy · 2024-02-07T07:51:16Z

Hi, thanks for your interest.

Yes, for HiREST_step task in TimeIT (instruct_action_0.5k_hirest.json), we need to crop the videos into clips. Details can be found in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#process-hirest.

yeliudev · 2024-02-09T16:16:54Z

Hi, thanks for your interest.

Yes, for HiREST_step task in TimeIT (instruct_action_0.5k_hirest.json), we need to crop the videos into clips. Details can be found in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#process-hirest.

Many thanks for your reply! For the VATEX videos from Valley, are the videos cropped according to the filenames as well?

yeliudev · 2024-02-09T17:01:26Z

Also, using the provided checkpoint, the evaluation results on Charades-STA are different from those reported in Table 2 of the paper. Below are my reproduced results.

# pred video timestamps 3720; # gt video timestamps 3720
IOU 0.3: 47.33870967741935
IOU 0.5: 28.091397849462364
IOU 0.7: 12.82258064516129

Is it because the released model differs from the one reported in the paper (it seems that the paper version is trained on TimeIT only, but the released one is also trained on Valley), or the results in the paper were obtained with ASR captions on Charades-STA? If so, would it be possible to share the code for obtaining ASR captions (the file in whisper_outputs_with_time/tiny.en.cleaned/)? Thank you!

RenShuhuai-Andy · 2024-02-16T13:41:33Z

Hi, thanks for your interest.
Yes, for HiREST_step task in TimeIT (instruct_action_0.5k_hirest.json), we need to crop the videos into clips. Details can be found in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#process-hirest.

Many thanks for your reply! For the VATEX videos from Valley, are the videos cropped according to the filenames as well?

We follow the instructions from Valley (https://github.com/RupertLuo/Valley/blob/main/Crawler/README.md#vatex) to download the VATEX videos and do not conduct cropping. The ann_file can be found in https://eric-xw.github.io/vatex-website/data/vatex_training_v1.0.json

RenShuhuai-Andy · 2024-02-16T14:35:29Z

Also, using the provided checkpoint, the evaluation results on Charades-STA are different from those reported in Table 2 of the paper. Below are my reproduced results.
# pred video timestamps 3720; # gt video timestamps 3720
IOU 0.3: 47.33870967741935
IOU 0.5: 28.091397849462364
IOU 0.7: 12.82258064516129
Is it because the released model differs from the one reported in the paper (it seems that the paper version is trained on TimeIT only, but the released one is also trained on Valley), or the results in the paper were obtained with ASR captions on Charades-STA? If so, would it be possible to share the code for obtaining ASR captions (the file in whisper_outputs_with_time/tiny.en.cleaned/)? Thank you!

The results reported in the paper were obtained using the TimeIT + Valley dataset (we will note this more clearly in our paper update), and we don't use asr in our evaluation. For your convenience, you can find the code for asr in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#automatic-speech-transcription
Our released ckpt is different from the version used in the paper. The released ckpt was trained after cleaning the code and fixing a minor bug in QuerYD instructions data (some videos have the same start and end timestamps in the raw annotations file, so we only use one timestamp in the revision).
In our evaluation, the performance of the released ckpt on YouCook2 is higher than that in the paper, while the performance on Charades-STS & QVHighlight is lower. We also note that the output generated by LLM is different each time, which may cause fluctuations in the evaluation results. Please that we know if you want the ckpt of the paper version, we can also upload it.

yeliudev · 2024-02-19T08:46:56Z

Hi, thanks for your interest.
Yes, for HiREST_step task in TimeIT (instruct_action_0.5k_hirest.json), we need to crop the videos into clips. Details can be found in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#process-hirest.

Many thanks for your reply! For the VATEX videos from Valley, are the videos cropped according to the filenames as well?

We follow the instructions from Valley (https://github.com/RupertLuo/Valley/blob/main/Crawler/README.md#vatex) to download the VATEX videos and do not conduct cropping. The ann_file can be found in https://eric-xw.github.io/vatex-website/data/vatex_training_v1.0.json

Thanks for your detailed reply! But it seems that Valley cropped the VATEX videos according to the filenames (see here)...

yeliudev · 2024-02-19T08:49:33Z

Also, using the provided checkpoint, the evaluation results on Charades-STA are different from those reported in Table 2 of the paper. Below are my reproduced results.
# pred video timestamps 3720; # gt video timestamps 3720
IOU 0.3: 47.33870967741935
IOU 0.5: 28.091397849462364
IOU 0.7: 12.82258064516129
Is it because the released model differs from the one reported in the paper (it seems that the paper version is trained on TimeIT only, but the released one is also trained on Valley), or the results in the paper were obtained with ASR captions on Charades-STA? If so, would it be possible to share the code for obtaining ASR captions (the file in whisper_outputs_with_time/tiny.en.cleaned/)? Thank you!
The results reported in the paper were obtained using the TimeIT + Valley dataset (we will note this more clearly in our paper update), and we don't use asr in our evaluation. For your convenience, you can find the code for asr in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#automatic-speech-transcription

Our released ckpt is different from the version used in the paper. The released ckpt was trained after cleaning the code and fixing a minor bug in QuerYD instructions data (some videos have the same start and end timestamps in the raw annotations file, so we only use one timestamp in the revision).
In our evaluation, the performance of the released ckpt on YouCook2 is higher than that in the paper, while the performance on Charades-STS & QVHighlight is lower. We also note that the output generated by LLM is different each time, which may cause fluctuations in the evaluation results. Please that we know if you want the ckpt of the paper version, we can also upload it.

I see. Many thanks for your explanation! Does it mean that TimeChat was trained with ASR, but not using it during evaluation for fair comparison with existing methods?

yeliudev · 2024-02-19T10:05:07Z

Hi, thanks for your interest.
Yes, for HiREST_step task in TimeIT (instruct_action_0.5k_hirest.json), we need to crop the videos into clips. Details can be found in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#process-hirest.

Many thanks for your reply! For the VATEX videos from Valley, are the videos cropped according to the filenames as well?

We follow the instructions from Valley (https://github.com/RupertLuo/Valley/blob/main/Crawler/README.md#vatex) to download the VATEX videos and do not conduct cropping. The ann_file can be found in https://eric-xw.github.io/vatex-website/data/vatex_training_v1.0.json

Thanks for your detailed reply! But it seems that Valley cropped the VATEX videos according to the filenames (see here)...

Also, some QA pairs are missing in instruct_valley_72k.json (from TimeIT) compared with Valley_instruct_73k.json (from Valley), e.g., video VATEX_--33Lscn6sk_000004_000014.

# instruct_valley_72k.json
{
    "video": "vatex/videos/--33Lscn6sk_000004_000014.mp4",
    "QA": [
        {
            "q": "Can you describe what is happening in the video?",
            "a": "In the video, a girl is flipping an omelette while a woman watches her. Then, the woman shakes a skillet and flips a pancake out of it and back into it before flipping it again. The girl is also seen flipping a pancake multiple times. Another scene shows a girl standing with a frying pan and flipping a pancake twice. Lastly, a female films herself flipping food on a pan twice while someone is sitting at a table in the background."
        },
        {
            "q": "What is the girl doing with the omelette?",
            "a": "The girl is flipping the omelette."
        },
        {
            "q": "Who is watching the girl?",
            "a": "A woman is watching the girl."
        }
    ]
}

# Valley_instruct_73k.json
{
    "id": "VATEX_--33Lscn6sk_000004_000014",
    "v_id": "--33Lscn6sk_000004_000014",
    "video": "v_--33Lscn6sk.mp4",
    "source": "VATEX",
    "conversations": [
        {
            "from": "human",
            "value": "Can you describe what is happening in the video?\n<video>"
        },
        {
            "from": "gpt",
            "value": "In the video, a girl is flipping an omelette while a woman watches her. Then, the woman shakes a skillet and flips a pancake out of it and back into it before flipping it again. The girl is also seen flipping a pancake multiple times. Another scene shows a girl standing with a frying pan and flipping a pancake twice. Lastly, a female films herself flipping food on a pan twice while someone is sitting at a table in the background."
        },
        {
            "from": "human",
            "value": "What is the girl doing with the omelette?"
        },
        {
            "from": "gpt",
            "value": "The girl is flipping the omelette."
        },
        {
            "from": "human",
            "value": "Who is watching the girl?"
        },
        {
            "from": "gpt",
            "value": "A woman is watching the girl."
        },
        {
            "from": "human",
            "value": "What does the woman do with the skillet?"
        },
        {
            "from": "gpt",
            "value": "The woman shakes the skillet."
        },
        {
            "from": "human",
            "value": "What does the woman flip with the skillet?"
        },
        {
            "from": "gpt",
            "value": "The woman flips a pancake with the skillet."
        },
        {
            "from": "human",
            "value": "How many times does the girl flip the pancake?"
        },
        {
            "from": "gpt",
            "value": "The girl flips the pancake twice."
        }
    ]
}

RenShuhuai-Andy · 2024-02-22T09:20:38Z

Thanks for your detailed reply! But it seems that Valley cropped the VATEX videos according to the filenames (see RupertLuo/Valley#20)...

Yes you are right, sorry about that...

We have cropped the VATEX videos before training (which was done by my teammate), thus these is no problem with the released ckpt (the video filename in https://huggingface.co/datasets/ShuhuaiRen/TimeIT/blob/main/data/valley/instruct_valley_72k.json is also for the cropped version).

We have updated the code for processing the Valley dataset, please refer to https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#process-valley. We notice that the Valley dataset has been updated (from 73K to 65K), you can reprocess the instruction json if you want to use the new dataset :)

RenShuhuai-Andy · 2024-02-22T09:20:55Z

I see. Many thanks for your explanation! Does it mean that TimeChat was trained with ASR, but not using it during evaluation for fair comparison with existing methods?

Yes

RenShuhuai-Andy · 2024-02-22T09:22:50Z

Also, some QA pairs are missing in instruct_valley_72k.json (from TimeIT) compared with Valley_instruct_73k.json (from Valley), e.g., video VATEX_--33Lscn6sk_000004_000014.

Yes, we use half of the QA pairs for accelerating training. To use full of QA pairs, you can reprocess the Valley instruction json using https://github.com/RenShuhuai-Andy/TimeChat/blob/master/utils/process_valley.py

yeliudev · 2024-02-22T09:50:01Z

I see... Data preprocessing is always tricky 🤣 Thank you so much!

I have a final question regarding the batch size during instruction tuning and fine-tuning (sorry for asking so much...I'm trying my best to understand your method). According to the training config stage2_finetune_time104k_valley72k.yaml, during instruction tuning, we are using 8 GPUs, while each GPU have batch_size_train = 1 & accum_grad_iters = 4, such that the equivalent batch size shall be 8(GPUs) * 1 (per-device batch size) * 4 (accumulate iters) = 32, which is well-aligned with the paper. However, iters_per_epoch is set to 1/8 of the dataset size (rather than 1/32). Does it mean that the instruction tuning actually went through the dataset 12 times (i.e., 12 epochs) instead of 3?

Also, I have tried to find the config (number of GPUs, per device batch size, accumulate iters, and how to set iters_per_epoch) for fine-tuning on YouCook2, Charades-STA, and QVHighlights. But I found that different settings are used in stage2_finetune_{youcook2,charades,qvhighlights}.yaml, which are listed below:

# youcook2

# number of GPUs: unknown
iters_per_epoch: 1192 # 1192 / 1
batch_size_train: 2
accum_grad_iters: 4

# charades

# number of GPUs: unknown
iters_per_epoch: 3102 # 12408 / 4
batch_size_train: 1
accum_grad_iters: 8


# qvhighlights

# number of GPUs: unknown
iters_per_epoch: 1714 # 6858 / 4
batch_size_train: 1
accum_grad_iters: 8

I was wondering whether you could kindly clarify the settings for fine-tuning. Thank you!

RenShuhuai-Andy · 2024-02-22T11:52:19Z

However, iters_per_epoch is set to 1/8 of the dataset size (rather than 1/32). Does it mean that the instruction tuning actually went through the dataset 12 times (i.e., 12 epochs) instead of 3?

no. At each epoch, we conduct next(data_loader) (yielding 8 samples for 1x8 gpus) iters_per_epoch times (1/8 of the dataset), thus it spans for the whole dataset (see https://github.com/RenShuhuai-Andy/TimeChat/blob/master/timechat/tasks/base_task.py#L205).

accum_grad_iters is only used to control the frequency of parameters updating (see https://github.com/RenShuhuai-Andy/TimeChat/blob/master/timechat/tasks/base_task.py#L230), instead of the number of samples per iter.

Accordingly, the iters_per_epoch should be set to len(dataset)/num_of_gpus, and the actual bsz is batch_size_train * num_of_gpus * accum_grad_iters. For downstream dataset fine-tuning, you can try

# youcook2

# number of GPUs: 8
iters_per_epoch: 149 # 1192 / 8
batch_size_train: 1
accum_grad_iters: 4

# charades

# number of GPUs: 8
iters_per_epoch: 1551 # 12408 / 8
batch_size_train: 1
accum_grad_iters: 4


# qvhighlights

# number of GPUs: 8
iters_per_epoch: 858 # 6858 / 8
batch_size_train: 1
accum_grad_iters: 4

You can also increase the training epoch for better performance.

yeliudev · 2024-02-22T13:54:04Z

Thank you so much for your detailed reply!

RenShuhuai-Andy referenced this issue Feb 7, 2024

process HiREST dataset

1f731e6

RenShuhuai-Andy referenced this issue Feb 16, 2024

add asr

3e9cf3f

RenShuhuai-Andy closed this as completed Mar 4, 2024

RenShuhuai-Andy mentioned this issue Apr 16, 2024

Question about batch size #23

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do we need to crop the HiREST videos? #10

Do we need to crop the HiREST videos? #10

yeliudev commented Feb 7, 2024

RenShuhuai-Andy commented Feb 7, 2024

yeliudev commented Feb 9, 2024

yeliudev commented Feb 9, 2024 •

edited

Loading

RenShuhuai-Andy commented Feb 16, 2024

RenShuhuai-Andy commented Feb 16, 2024

yeliudev commented Feb 19, 2024

yeliudev commented Feb 19, 2024

yeliudev commented Feb 19, 2024

RenShuhuai-Andy commented Feb 22, 2024

RenShuhuai-Andy commented Feb 22, 2024

RenShuhuai-Andy commented Feb 22, 2024

yeliudev commented Feb 22, 2024

RenShuhuai-Andy commented Feb 22, 2024

yeliudev commented Feb 22, 2024

Do we need to crop the HiREST videos? #10

Do we need to crop the HiREST videos? #10

Comments

yeliudev commented Feb 7, 2024

RenShuhuai-Andy commented Feb 7, 2024

yeliudev commented Feb 9, 2024

yeliudev commented Feb 9, 2024 • edited Loading

RenShuhuai-Andy commented Feb 16, 2024

RenShuhuai-Andy commented Feb 16, 2024

yeliudev commented Feb 19, 2024

yeliudev commented Feb 19, 2024

yeliudev commented Feb 19, 2024

RenShuhuai-Andy commented Feb 22, 2024

RenShuhuai-Andy commented Feb 22, 2024

RenShuhuai-Andy commented Feb 22, 2024

yeliudev commented Feb 22, 2024

RenShuhuai-Andy commented Feb 22, 2024

yeliudev commented Feb 22, 2024

yeliudev commented Feb 9, 2024 •

edited

Loading