[WIP] Add ImageBind Model Implementation by dg845 · Pull Request #26310 · huggingface/transformers

dg845 · 2023-09-21T07:21:21Z

What does this PR do?

This PR adds the ImageBind model (paper, code), a multimodal model which can map six different modalities to the same shared representation space.

As stated in their blog post,

"[ImageBind is] the first AI model capable of binding information from six modalities. The model learns a single embedding, or shared representation space, not just for text, image/video, and audio, but also for sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units (IMU), which calculate motion and position."
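As a rough illustration of what a shared representation space allows, the sketch below compares embeddings from two modalities by cosine similarity. The vectors and the helper names (`l2_normalize`, `cosine_similarity`, `best_match`) are made up for this example and are not part of the PR or of the ImageBind code; a real model would produce much higher-dimensional embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of the same dimension."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def best_match(query, candidates):
    """Return the name of the candidate embedding closest to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Toy (hypothetical) embeddings: one image query and two audio candidates.
image_embedding = [0.9, 0.1, 0.0]
audio_embeddings = {
    "dog_bark": [0.8, 0.2, 0.1],
    "car_engine": [0.0, 0.1, 0.9],
}

print(best_match(image_embedding, audio_embeddings))
```

Because every modality lands in the same space, the same comparison works for any pair of modalities, which is what the PR's later commit "Update ImageBindModel.forward to compare images against any other modality" relies on.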

Fixes #23240. Based on a previous PR #23284.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@amyeroberts
@ArthurZucker
@shehanmunasinghe

LysandreJik · 2023-09-25T10:32:00Z

Awesome @dg845! Let us know when you'd like for us to review this PR

github-actions · 2023-11-22T08:05:53Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker · 2023-11-22T10:57:37Z

Hey! Do you need some help with this integration? 🤗

dg845 · 2023-11-23T10:20:45Z

Hi @ArthurZucker, unfortunately I haven't been able to find time to work on this PR recently, but should be able to work on it more in the near future. I don't think I've hit any blockers yet.

amyeroberts · 2023-12-19T21:08:43Z

Hi @dg845 - any update on progress with adding the model? Do you think you'll be able to finish the PR soon? It's an impactful model and we'd like to have it in the library as soon as possible. If it's not something you'll have time for, would you be open to someone helping to finish the PR - making sure of course that you still get credit for the contribution, as you've already done a large part?

dg845 · 2023-12-21T16:34:25Z

Hi @amyeroberts, I'm not sure if I will be able to finish it soon. I'm open to having someone else help finish the PR - I will also try to work on it/help out as much as I can.

isaac-chung · 2024-01-23T11:46:18Z

@dg845 just curious, what is left on your TO-DO list for this PR? Would be helpful to whoever is assisting.

dg845 · 2024-01-24T02:09:15Z

I believe the current TODOs are as follows:

1. Test the checkpoint conversion script convert_imagebind_original_pytorch_to_hf.py to make sure there aren't any errors
2. Use the checkpoint conversion script to create a small random test model (it looks like there might already be one at dg845/imagebind-test-dev but not sure if it's error-free)
3. Use the checkpoint conversion script to convert the full ImageBind checkpoint
4. Fix the imports for the preprocessing code (e.g. feature_extraction_imagebind.py, image_processing_imagebind.py, processing_imagebind.py, tokenization_imagebind.py, etc.) if necessary
5. Test the preprocessing code against the reference implementation (e.g. make sure the tests in test_image_processing_imagebind.py, test_processor_imagebind.py, test_tokenization_imagebind.py are passing)
6. Test the modeling code against the reference implementation (e.g. make sure the tests in test_modeling_imagebind.py are passing, using the test checkpoint from (2))
7. Write integration tests (combining preprocessing code and modeling code) and make sure they pass (using the full checkpoint created in (3))
8. Finish writing the docstrings and other documentation in the code itself
9. Finish the documentation in /docs/source/en/model_doc/imagebind.md
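Several of the items above come down to checking that ported code reproduces the reference implementation's outputs within a numerical tolerance. A minimal sketch of that kind of check, assuming the outputs have been flattened to lists of floats; the helper names and the default tolerance are illustrative assumptions, not values taken from this PR:

```python
def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two flat float lists."""
    if len(actual) != len(expected):
        raise ValueError(
            f"length mismatch: {len(actual)} vs {len(expected)}"
        )
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def outputs_match(actual, expected, atol=1e-4):
    """True if every element of `actual` is within `atol` of `expected`.

    The default `atol` is an assumed placeholder; pick a tolerance that
    suits the dtype and model size being compared.
    """
    return max_abs_diff(actual, expected) <= atol


# Hypothetical values standing in for ported and reference model outputs.
ported = [0.12345, -0.98765, 0.50000]
reference = [0.12346, -0.98764, 0.50001]

print(outputs_match(ported, reference))
print(max_abs_diff(ported, reference))
```

In an actual test suite the same idea would usually be expressed with the tensor library's own comparison utilities rather than hand-written loops.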

As a note, I believe the official ImageBind repo doesn't explicitly specify how to preprocess IMU data (e.g. in imagebind/data.py), and I'm not sure if there is extra preprocessing needed for depth and thermal data that's not in load_and_transform_vision_data.

For IMU data preprocessing, I referred to the IMU2Clip repo, also from Facebook/Meta Research, as well as this issue in the ImageBind repo: facebookresearch/ImageBind#66.

For depth and thermal data preprocessing, I referred to the Omnivore repo (which I believe is previous work by the same authors as ImageBind).

It's not obvious that either of these things is the right thing to do - might make sense to confirm with the authors that doing so is reasonable. I guess another possible path would be to only implement the text/image/audio portion of the model, but in my opinion this is less than ideal.
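To make the open question about IMU preprocessing concrete, here is one possible step such preprocessing might include: forcing every channel of a recording to a fixed number of timesteps by truncating or padding. This is purely a sketch under assumptions; the function name, the zero-padding choice, and the target length are hypothetical and are not confirmed by the ImageBind or IMU2Clip repos.

```python
def pad_or_truncate(channels, target_len, pad_value=0.0):
    """Return a copy of `channels` where each channel has exactly `target_len` samples.

    `channels` is a list of per-channel sample lists (e.g. accelerometer and
    gyroscope axes). Longer channels are cut at `target_len`; shorter ones are
    right-padded with `pad_value`. Both choices are assumptions for illustration.
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    fixed = []
    for samples in channels:
        if len(samples) >= target_len:
            fixed.append(list(samples[:target_len]))
        else:
            padding = [pad_value] * (target_len - len(samples))
            fixed.append(list(samples) + padding)
    return fixed


# Hypothetical two-channel recording with unequal lengths.
recording = [[0.1, 0.2, 0.3], [0.4]]
print(pad_or_truncate(recording, target_len=2))
```

Whether padding, resampling, or windowing is the right choice is exactly the kind of detail that would need confirming with the authors.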

github-actions · 2024-02-17T08:05:57Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

dg845 added 2 commits September 20, 2023 23:52

initial commit for ImageBind model

d72c9a3

add initial testing code for ImageBind model

6be5464

dg845 mentioned this pull request Sep 21, 2023

[New model] ImageBind: One Embedding Space To Bind Them All #23240

Open

2 tasks

dg845 added 9 commits September 21, 2023 19:18

Add config classes for remaining modalities (audio, depth, thermal, I…

190e727

…MU) and update config classes for text and image modalities.

Update ImageBindOutput with remaining modalities (audio, depth, therm…

3692190

…al, IMU).

Add embedding classes for image-like modalities (vision, audio, depth…

4037f6a

…, thermal).

Implement IMU embedding class.

970dc5d

Add module to convert still images into video frames.

ffd1460

Add implementation for shared model encoder blocks.

ee74943

Add key and value biases to ImageBindAttention.

93ce319

Add ImageBind heads and postprocessors.

c7968d6

Update ImageBindModel.forward to compare images against any other mod…

0000bbc

…ality.

dg845 added 17 commits September 25, 2023 17:51

Separate normalized embeddings into their own output field.

a1bdbf7

Add initial tester/test classes for remaining modalities (audio, dept…

69fa517

…h, thermal, imu).

Create initial audio feature extractor based on ASTFeatureExtractor (…

a8341e4

…ImageBind follows Audio Spectrogram Transformer audio processing).

Add image processing classes for remaining image-like modalities excl…

ac926ad

…uding audio (depth, thermal).

Add IMU feature extractor class declaration and add feature extractor…

e151140

…s/image processors to ImageBind's __init__.py file.

Update ImageBindAudioFeatureExtractor to use ImageBind-specific audio…

789559a

… processing.

Add final dropout layer to ImageBindImuTransformer.

84851a5

Fix typo

43016df

Change model test parameters to be closer to ImageBind defaults.

93d7749

Update audio feature extractor to output batched and clipped audio.

1b4bb43

Add modeling support for batched and clipped vision and audio inputs.

d9a0a80

Update ImageBind image processor to always output video (batched and …

b5d46cd

…clipped images) following VideoMAE.

Merge branch 'main' into imagebind-model

029d424

Implement ImageBindDepthImageProcessor.

a9d432c

Implement ImageBindImuFeatureExtractor.

90543ce

Fix some modeling code bugs.

8ce499b

Move Image2Video logic into RGBDTPatchEmbedding.

484cd3f

dg845 added 7 commits October 17, 2023 01:38

Fix attention kv bias initialization bug.

284ffe5

Implement ImageBind conversion script.

c5d1e3b

Fix bugs in ImageBind conversion script.

4a8aaf5

Fix conversion script test configs.

06f9536

Fix ImageBindAudioEmbeddings.

f691396

Fix num_patches calculation.

ba64517

Fix audio num_patches calculation in conversion script.

78e537d

Merge branch 'main' into imagebind-model

befcc26

huggingface deleted a comment from github-actions bot Dec 18, 2023

huggingface deleted a comment from github-actions bot Jan 15, 2024

github-actions bot closed this Feb 25, 2024

EduardoPach mentioned this pull request May 7, 2024

Adding imagebind #30690

Open

dg845 wants to merge 36 commits into huggingface:main from dg845:imagebind-model


5 participants
