
[WIP] Add ImageBind Model Implementation #26310

Closed
dg845 wants to merge 36 commits into huggingface:main from dg845:imagebind-model

Conversation

@dg845
Contributor

@dg845 dg845 commented Sep 21, 2023

What does this PR do?

This PR adds the ImageBind model (paper, code), a multimodal model which can map six different modalities to the same shared representation space.

As stated in their blog post,

"[ImageBind is] the first AI model capable of binding information from six modalities. The model learns a single embedding, or shared representation space, not just for text, image/video, and audio, but also for sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units (IMU), which calculate motion and position."
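To make the idea of a shared representation space concrete, here is a minimal sketch that is not taken from this PR or from the ImageBind code. Embeddings from two modalities are L2-normalized and compared with a dot product, so the closest cross-modal match can be looked up. The embeddings are hand-made toy vectors and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def cross_modal_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the (len(a), len(b)) cosine-similarity matrix between two embedding sets."""
    return l2_normalize(a) @ l2_normalize(b).T


# Toy stand-ins for encoder outputs (assumption: real embeddings would come from the model).
image_embeds = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]])
audio_embeds = np.array([[0.1, 0.9, 0.1], [0.9, 0.0, 0.1]])

sim = cross_modal_similarity(image_embeds, audio_embeds)
# For each image, the index of the most similar audio embedding.
best_audio_per_image = sim.argmax(axis=1)
```

Because both sets of vectors live in the same space, the same comparison would apply to any pair of modalities, which is the property the quote describes.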

[Figure: imagebind_figure_2]

Fixes #23240. Based on a previous PR #23284.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts
@ArthurZucker
@shehanmunasinghe

@LysandreJik
Member

Awesome @dg845! Let us know when you'd like for us to review this PR

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker
Collaborator

Hey! Do you need some help with this integration? 🤗

@dg845
Contributor Author

dg845 commented Nov 23, 2023

Hi @ArthurZucker, unfortunately I haven't been able to find time to work on this PR recently, but should be able to work on it more in the near future. I don't think I've hit any blockers yet.

@huggingface huggingface deleted a comment from github-actions bot Dec 18, 2023
@amyeroberts
Contributor

Hi @dg845 - any update on progress with adding the model? Do you think you'll be able to finish the PR soon? It's an impactful model and we'd like to have it in the library as soon as possible. If it's not something you'll have time for, would you be open to someone helping to finish the PR? We would of course make sure you still get credit for the contribution, as you've already done a large part.

@dg845
Contributor Author

dg845 commented Dec 21, 2023

Hi @amyeroberts, I'm not sure if I will be able to finish it soon. I'm open to having someone else help finish the PR - I will also try to work on it/help out as much as I can.

@huggingface huggingface deleted a comment from github-actions bot Jan 15, 2024
@isaac-chung
Contributor

@dg845 just curious, what is left on your TO-DO list for this PR? Would be helpful to whoever is assisting.

@dg845
Contributor Author

dg845 commented Jan 24, 2024

I believe the current TODOs are as follows:

  1. Test the checkpoint conversion script convert_imagebind_original_pytorch_to_hf.py to make sure there aren't any errors
  2. Use the checkpoint conversion script to create a small random test model (it looks like there might already be one at dg845/imagebind-test-dev but not sure if it's error-free)
  3. Use the checkpoint conversion script to convert the full ImageBind checkpoint
  4. Fix the imports for the preprocessing code (e.g. feature_extraction_imagebind.py, image_processing_imagebind.py, processing_imagebind.py, tokenization_imagebind.py, etc.) if necessary
  5. Test the preprocessing code against the reference implementation (e.g. make sure the tests in test_image_processing_imagebind.py, test_processor_imagebind.py, test_tokenization_imagebind.py are passing)
  6. Test the modeling code against the reference implementation (e.g. make sure the tests in test_modeling_imagebind.py are passing, using the test checkpoint from (2))
  7. Write integration tests (combining preprocessing code and modeling code) and make sure they pass (using the full checkpoint created in (3))
  8. Finish writing the docstrings and other documentation in the code itself
  9. Finish the documentation in /docs/source/en/model_doc/imagebind.md
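
Steps 1 and 6 above amount to renaming weights and then checking that the port reproduces the reference outputs. The following is a minimal sketch of that pattern, not the actual convert_imagebind_original_pytorch_to_hf.py. The key prefixes, the helper names, and the tolerance are hypothetical, and plain NumPy arrays stand in for tensors.

```python
import numpy as np

# Hypothetical prefix mapping (assumption: the real script's key names differ).
KEY_PREFIX_MAP = {
    "trunk.vision.": "vision_model.encoder.",
    "head.vision.": "vision_projection.",
}


def rename_keys(state_dict: dict) -> dict:
    """Return a new dict with each key's matching prefix replaced; unmatched keys are kept."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in KEY_PREFIX_MAP.items():
            if key.startswith(old_prefix):
                new_key = new_prefix + key[len(old_prefix):]
                break
        renamed[new_key] = value
    return renamed


def max_abs_diff(reference: np.ndarray, ported: np.ndarray) -> float:
    """Largest element-wise absolute difference between two outputs of the same shape."""
    if reference.shape != ported.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {ported.shape}")
    return float(np.max(np.abs(reference - ported)))


original = {"trunk.vision.blocks.0.weight": np.ones((2, 2)), "logit_scale": np.array(1.0)}
converted = rename_keys(original)

# Stand-in outputs (assumption: in practice both models are run on the same input).
reference_out = np.array([0.25, -1.5, 3.0])
ported_out = reference_out + 1e-6
outputs_match = max_abs_diff(reference_out, ported_out) < 1e-4
```

Reporting the maximum absolute difference, rather than only a pass/fail flag, tends to make it easier to tell a numerical-precision gap from a wrongly mapped weight.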

As a note, I believe the official ImageBind repo doesn't explicitly specify how to preprocess IMU data (e.g. in imagebind/data.py), and I'm not sure if there is extra preprocessing needed for depth and thermal data that's not in load_and_transform_vision_data.

For IMU data preprocessing, I referred to the IMU2Clip repo, also from Facebook/Meta Research, as well as this issue in the ImageBind repo: facebookresearch/ImageBind#66.

For depth and thermal data preprocessing, I referred to the Omnivore repo (which I believe is previous work by the same authors as ImageBind).

It's not obvious that either of these choices is correct, so it might make sense to confirm with the authors that they are reasonable. Another possible path would be to implement only the text/image/audio portion of the model, but in my opinion this is less than ideal.
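
For concreteness, the kind of IMU preprocessing under discussion could look like the sketch below, which crops or zero-pads a multi-channel signal to a fixed number of timesteps. This is an illustration only, not the method used by ImageBind or IMU2Clip. The function name, the (channels, time) layout, and the target lengths are assumptions.

```python
import numpy as np


def fix_imu_length(signal: np.ndarray, target_len: int) -> np.ndarray:
    """Crop, or right-pad with zeros, a (channels, time) array to exactly target_len timesteps."""
    if signal.ndim != 2:
        raise ValueError("expected a 2D array shaped (channels, time)")
    channels, length = signal.shape
    if length >= target_len:
        return signal[:, :target_len].copy()
    padded = np.zeros((channels, target_len), dtype=signal.dtype)
    padded[:, :length] = signal
    return padded


# Hypothetical example: 6 channels (3-axis accelerometer + 3-axis gyroscope), 5 timesteps.
imu = np.arange(30, dtype=np.float32).reshape(6, 5)
short = fix_imu_length(imu, target_len=3)  # cropped to 3 timesteps
long = fix_imu_length(imu, target_len=8)   # zero-padded to 8 timesteps
```

Whether cropping, padding, or resampling is appropriate is exactly the open question above, so any such choice would need to be checked against the authors' intent.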

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Feb 25, 2024
@EduardoPach EduardoPach mentioned this pull request May 7, 2024

Development

Successfully merging this pull request may close these issues.

[New model] ImageBind: One Embedding Space To Bind Them All

5 participants