[WIP] Add ImageBind Model Implementation #26310
dg845 wants to merge 36 commits into huggingface:main
Conversation
…MU) and update config classes for text and image modalities.
|
Awesome @dg845! Let us know when you'd like for us to review this PR |
…h, thermal, imu).
…ImageBind follows Audio Spectrogram Transformer audio processing).
…uding audio (depth, thermal).
…s/image processors to ImageBind's __init__.py file.
…clipped images) following VideoMAE.
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
|
Hey! Do you need some help with this integration? 🤗 |
|
Hi @ArthurZucker, unfortunately I haven't been able to find time to work on this PR recently, but should be able to work on it more in the near future. I don't think I've hit any blockers yet. |
|
Hi @dg845 - any update on progress with adding the model? Do you think you'll be able to finish the PR soon? It's an impactful model and we'd like to have it in the library as soon as possible. If it's not something you'll have time for, would you be open to someone helping to finish the PR - making sure, of course, that you still get credit for the contribution, as you've already done a large part? |
|
Hi @amyeroberts, I'm not sure if I will be able to finish it soon. I'm open to having someone else help finish the PR - I will also try to work on it/help out as much as I can. |
|
@dg845 just curious, what is left on your TO-DO list for this PR? Would be helpful to whoever is assisting. |
|
I believe the current TODOs are as follows:
As a note, I believe the official ImageBind repo doesn't explicitly specify how to preprocess IMU data (e.g. in …). For IMU data preprocessing, I referred to the IMU2Clip repo, also from Facebook/Meta Research, as well as this issue in the ImageBind repo: facebookresearch/ImageBind#66. For depth and thermal data preprocessing, I referred to the Omnivore repo (which I believe is previous work by the same authors as ImageBind). It's not obvious that either of these choices is the right thing to do, so it might make sense to confirm with the authors that doing so is reasonable. I guess another possible path would be to only implement the text/image/audio portion of the model, but in my opinion this is less than ideal. |
What does this PR do?
This PR adds the ImageBind model (paper, code), a multimodal model that maps six different modalities into a single shared representation space.
As stated in their blog post,
Fixes #23240. Based on a previous PR #23284.
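To illustrate what a shared representation space makes possible, here is a minimal sketch of cross-modal retrieval by cosine similarity. It does not use the ImageBind implementation from this PR: the embeddings are random stand-ins, and `embed_dim`, `l2_normalize`, and `cross_modal_similarity` are hypothetical names chosen for this example only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def cross_modal_similarity(emb_a, emb_b):
    """Return the (n_a, n_b) cosine-similarity matrix between two sets of embeddings."""
    return l2_normalize(emb_a) @ l2_normalize(emb_b).T


# Stand-in embeddings (assumption): a real model would produce these from
# images and audio clips. The "audio" rows are noisy copies of the "image"
# rows, which mimics paired inputs landing close together in the shared space.
rng = np.random.default_rng(0)
embed_dim = 8  # hypothetical size, not the model's actual embedding dimension
image_embeds = rng.normal(size=(3, embed_dim))
audio_embeds = image_embeds + 0.01 * rng.normal(size=(3, embed_dim))

sim = cross_modal_similarity(image_embeds, audio_embeds)
best_match = sim.argmax(axis=1)  # index of the closest audio clip for each image
```

Because every modality is embedded into the same space, the same similarity function would apply to any pair of modalities (for example text and depth), which is the property that makes retrieval across modalities possible without pairwise training data for every combination.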
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@amyeroberts
@ArthurZucker
@shehanmunasinghe