
working inference code for video model? #7

Closed
Namzakku opened this issue Jul 24, 2024 · 8 comments

@Namzakku

Namzakku commented Jul 24, 2024

Hi!
I tried to combine the inference instructions you provided with the inference code from the HF tutorial at
https://colab.research.google.com/drive/1dTdro-k7NFqRgGq5-TlGHM-6k2sYQhXp#scrollTo=4ccbd183-f15a-4f94-a526-9ceeec3f61e0
but got meaningless results.

I also tried to use your collator, but got "CUDA error: device-side assert triggered" in generate().

Can you provide working code or give some hints?
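For reference, the inference I tried roughly follows the tutorial's pattern (a simplified sketch, not the exact notebook code; the checkpoint id, prompt format, and the load_video helper here are placeholders):

```python
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # placeholder checkpoint id
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# video: numpy array of sampled frames with shape (num_frames, height, width, 3)
video = load_video("example.mp4", num_frames=8)  # hypothetical helper

prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(processor.decode(out[0], skip_special_tokens=True))
```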

@zjysteven
Owner

I will take a look ASAP. By "meaningless results", do you mean the generated text goes completely off the rails, or that it just isn't as good as you'd like?

@Namzakku
Author

The generated text I got was a mixture of symbols, or a mix of text from multiple languages, and it didn't answer any of the questions at all.

One more thing I'm concerned about: the training loss declined significantly after only a few steps, as you can see in the graph below.
[training loss curve screenshot]
I tried many times but got the same behavior. I used the same dataset that worked with the llava video HF tutorial, where the loss looked more reasonable.

@zjysteven
Owner

I see. I'm running experiments now. As for the loss, there is indeed a difference between the implementation here and the tutorial colab you shared: their labels include both the questions and the answers (so the model also learns to predict the user's questions), while ours sets labels only for the answers (the model only learns to predict the responses). Not sure whether this fully explains the loss difference, though.
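To make that difference concrete, here is a minimal sketch of answer-only label masking (illustrative only; build_labels and prompt_len are placeholder names, not the exact collator code):

```python
import torch

IGNORE_INDEX = -100  # positions with this label are ignored by the cross-entropy loss

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Answer-only supervision: start from a copy of input_ids, then mask out the
    prompt portion (system text, question, video tokens) so only the response
    contributes to the loss. A setup without this masking also trains on the
    question tokens, which changes the loss values you see."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX
    return labels
```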

@zjysteven
Owner

zjysteven commented Jul 24, 2024

Hi @Namzakku, I've put up a full example notebook here https://colab.research.google.com/drive/1ejXG58cpMXvkcsx2qqTFK2BqWBVBEr7Y?usp=sharing that showcases a training run and the subsequent inference of llava-next-video-7b, using data from ShareGPT4Video. The output of the finetuned model makes sense to me and indicates that the training worked (there is a noticeable difference between the output of the original and the finetuned model). The exact run script is updated in example_scripts/example_video.sh in the repo. Please pull the latest code and try again.

For the loss scale, note that the colab tutorial you shared uses gradient_accumulation_steps=8, while my example_video.sh uses 1. This could well explain the discrepancy: without accumulation you would initially see a loss of 10+, whereas with gradient_accumulation_steps=8 you would observe a loss of ~1 or ~2.
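For illustration, this is the standard gradient-accumulation pattern (a generic sketch, not the Trainer's exact internals): each micro-batch loss is scaled by 1/accum_steps before backward, so if that scaled value is what ends up in the logs, the curve appears roughly accum_steps times lower.

```python
def train_with_accumulation(model, optimizer, loader, accum_steps: int = 8):
    # Generic sketch: simulate a large batch by accumulating gradients
    # over accum_steps micro-batches before each optimizer step.
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = model(**batch).loss / accum_steps  # scale so summed grads match one big batch
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```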

@Namzakku
Author

Thanks for the great notebook!
I ran a bunch of tests today and it works like a charm!
Regarding the loss scale, there is indeed a difference in gradient accumulation, as you mentioned.

Also, I found a tiny problem during training.
A few videos in my dataset only have 5-6 frames, fewer than the default 8 frames, which made training crash. After I reduced the default number of frames to 4, it worked correctly.
I didn't experience this with the HF notebook; maybe they have additional padding somewhere?
I haven't tried raising the default number of frames to 16 or 32 yet, so further testing is needed.

@zjysteven
Owner

zjysteven commented Jul 25, 2024

Glad to know it worked! The number-of-frames issue is strange, since I do have frame padding implemented, so it should work with varying numbers of frames (I also confirmed that for a video with fewer total frames than num_sampled_frames, the loading function will still successfully sample num_sampled_frames from it, with some frames sampled multiple times). In the HF tutorial there isn't any padding, since every video is always sampled at 8 frames.
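For illustration, the short-video case roughly behaves like this (a simplified sketch, not the repo's exact loader):

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_sampled_frames: int) -> np.ndarray:
    """Uniformly pick num_sampled_frames indices from [0, total_frames).
    If the video has fewer frames than requested, some indices repeat, so the
    loader still returns a fixed-length clip."""
    return np.linspace(0, total_frames - 1, num_sampled_frames).round().astype(int)

print(sample_frame_indices(5, 8))  # -> [0 1 1 2 2 3 3 4]
```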

Anyway, do you by any chance have the error message available, so that I can confirm why training crashed? Meanwhile I will do some testing locally to see if I can reproduce it.

@Namzakku
Author

Sure! This is the error I got:

Traceback (most recent call last):
  File "/workspace/lmms-finetune/train.py", line 156, in <module>
    train()
  File "/workspace/lmms-finetune/train.py", line 142, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2236, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 464, in __iter__
    next_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 694, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/workspace/lmms-finetune/collators/llava_next_video.py", line 37, in __call__
    videos[i] = np.concatenate([video, pad], axis=0)
  File "<__array_function__ internals>", line 200, in concatenate
ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 398 and the array at index 1 has size 7

@zjysteven
Owner

Thanks. I can reproduce it; it turns out there is a dimension issue with the padding. Will push a fix soon.
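For reference, the pad array needs to match the video in every dimension except the frame axis; a minimal sketch of the intended behavior (illustrative only, not the actual committed fix):

```python
import numpy as np

def pad_video_frames(video: np.ndarray, num_frames: int) -> np.ndarray:
    """Pad a clip of shape (t, H, W, C) along axis 0 up to num_frames.
    The pad must share all dimensions with the video except axis 0; otherwise
    np.concatenate raises the ValueError shown in the traceback above."""
    t = video.shape[0]
    if t >= num_frames:
        return video[:num_frames]
    pad = np.zeros((num_frames - t, *video.shape[1:]), dtype=video.dtype)
    return np.concatenate([video, pad], axis=0)
```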

zjysteven added a commit that referenced this issue Jul 25, 2024