
working inference code for video model? #7

Closed
Namzakku opened this issue Jul 24, 2024 · 8 comments

@Namzakku

Namzakku commented Jul 24, 2024

Hi!
I tried to combine the inference instructions you provided with the inference code from the HF tutorial at
https://colab.research.google.com/drive/1dTdro-k7NFqRgGq5-TlGHM-6k2sYQhXp#scrollTo=4ccbd183-f15a-4f94-a526-9ceeec3f61e0
but got meaningless results.

I also tried to use your collator, but got "CUDA error: device-side assert triggered" in generate().

Can you provide working code or give some hints?
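For reference, the inference I tried roughly follows the tutorial's pattern (a simplified sketch, not the exact notebook code; the checkpoint id, prompt format, and the load_video helper here are placeholders):

```python
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # placeholder checkpoint id
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# video: numpy array of sampled frames with shape (num_frames, height, width, 3)
video = load_video("example.mp4", num_frames=8)  # hypothetical helper

prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(processor.decode(out[0], skip_special_tokens=True))
```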

@zjysteven
Owner

I will take a look ASAP. By "meaningless results", do you mean the generated text goes completely off the rails, or that it just isn't as good as you'd like?

@Namzakku
Author

The generated text I got was a mixture of symbols, or a mix of text from multiple languages, and it didn't answer any of the questions at all.

One more thing I'm concerned about: the training loss declined significantly after only a few steps, as you can see in the graph below.
[training loss curve screenshot]
I tried many times but got the same behavior. I used the same dataset that worked with the llava video HF tutorial, where the loss looked more reasonable.

@zjysteven
Owner

I see. I'm running experiments now. As for the loss, there is indeed a difference between the implementation here and the tutorial colab you shared: their labels include both the questions and the answers (so the model also learns to predict the user's questions), while ours sets labels only for the answers (the model only learns to predict the responses). Not sure whether this fully explains the loss difference, though.
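To make that difference concrete, here is a minimal sketch of answer-only label masking (illustrative only; build_labels and prompt_len are placeholder names, not the exact collator code):

```python
import torch

IGNORE_INDEX = -100  # positions with this label are ignored by the cross-entropy loss

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Answer-only supervision: start from a copy of input_ids, then mask out the
    prompt portion (system text, question, video tokens) so only the response
    contributes to the loss. A setup without this masking also trains on the
    question tokens, which changes the loss values you see."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX
    return labels
```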

@zjysteven
Owner

zjysteven commented Jul 24, 2024

Hi @Namzakku, I've put up a full example notebook here https://colab.research.google.com/drive/1ejXG58cpMXvkcsx2qqTFK2BqWBVBEr7Y?usp=sharing that showcases a training run and the subsequent inference of llava-next-video-7b, using data from ShareGPT4Video. The output of the finetuned model makes sense to me and indicates that the training worked (there is a noticeable difference between the output of the original and the finetuned model). The exact run script is updated in example_scripts/example_video.sh in the repo. Please pull the latest code and try again.

For the loss scale, note that the colab tutorial you shared uses gradient_accumulation_steps=8, while my example_video.sh uses 1. This could well explain the discrepancy: without accumulation you would initially see a loss of 10+, whereas with gradient_accumulation_steps=8 you would observe a loss of ~1 or ~2.
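For illustration, this is the standard gradient-accumulation pattern (a generic sketch, not the Trainer's exact internals): each micro-batch loss is scaled by 1/accum_steps before backward, so if that scaled value is what ends up in the logs, the curve appears roughly accum_steps times lower.

```python
def train_with_accumulation(model, optimizer, loader, accum_steps: int = 8):
    # Generic sketch: simulate a large batch by accumulating gradients
    # over accum_steps micro-batches before each optimizer step.
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = model(**batch).loss / accum_steps  # scale so summed grads match one big batch
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```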

@Namzakku
Author

Thanks for the great notebook!
I ran a bunch of tests today and it works like a charm!
Regarding the loss scale, there is indeed a difference in gradient accumulation, as you mentioned.

Also, I found a tiny problem during training.
A few videos in my dataset only have 5-6 frames, fewer than the default 8 frames, which made training crash. After I reduced the default number of frames to 4, it worked correctly.
I didn't experience this with the HF notebook; maybe they have additional padding somewhere?
I haven't tried raising the default number of frames to 16 or 32 yet, so further testing is needed.

@zjysteven
Owner

zjysteven commented Jul 25, 2024

Glad to know it worked! The number-of-frames issue is strange, since I do have frame padding implemented, so it should work with varying numbers of frames (I also confirmed that for a video with fewer total frames than num_sampled_frames, the loading function will still successfully sample num_sampled_frames from it, with some frames sampled multiple times). In the HF tutorial there isn't any padding, since every video is always sampled at 8 frames.
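For illustration, the short-video case roughly behaves like this (a simplified sketch, not the repo's exact loader):

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_sampled_frames: int) -> np.ndarray:
    """Uniformly pick num_sampled_frames indices from [0, total_frames).
    If the video has fewer frames than requested, some indices repeat, so the
    loader still returns a fixed-length clip."""
    return np.linspace(0, total_frames - 1, num_sampled_frames).round().astype(int)

print(sample_frame_indices(5, 8))  # -> [0 1 1 2 2 3 3 4]
```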

Anyway, do you by any chance have the error message available, so that I can confirm why training crashed? Meanwhile I will do some testing locally to see if I can reproduce it.

@Namzakku
Author

Sure! This is the error I got:

Traceback (most recent call last):
  File "/workspace/lmms-finetune/train.py", line 156, in <module>
    train()
  File "/workspace/lmms-finetune/train.py", line 142, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2236, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 464, in __iter__
    next_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 694, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/workspace/lmms-finetune/collators/llava_next_video.py", line 37, in __call__
    videos[i] = np.concatenate([video, pad], axis=0)
  File "<__array_function__ internals>", line 200, in concatenate
ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 398 and the array at index 1 has size 7

@zjysteven
Owner

Thanks. I can reproduce it; it turns out there is a dimension issue with the padding. Will push a fix soon.
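For reference, the pad array needs to match the video in every dimension except the frame axis; a minimal sketch of the intended behavior (illustrative only, not the actual committed fix):

```python
import numpy as np

def pad_video_frames(video: np.ndarray, num_frames: int) -> np.ndarray:
    """Pad a clip of shape (t, H, W, C) along axis 0 up to num_frames.
    The pad must share all dimensions with the video except axis 0; otherwise
    np.concatenate raises the ValueError shown in the traceback above."""
    t = video.shape[0]
    if t >= num_frames:
        return video[:num_frames]
    pad = np.zeros((num_frames - t, *video.shape[1:]), dtype=video.dtype)
    return np.concatenate([video, pad], axis=0)
```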

zjysteven added a commit that referenced this issue Jul 25, 2024