Add Bagel #38569
Conversation
Hey @Shakib-IO, added you as a collaborator, and this is the draft PR. I have started working on the modelling code, and if you don't mind you can start working on the Processor class (also to avoid merge conflicts 😅). Later, once a rough processor version is ready, we can complete the modelling file to support all the tasks.
Thanks @yaswanth19.
I think it might be beneficial to implement an ImageTextToText model first due to the MLLM nature of the BAGEL model, then add generative capability to it with an extra library like diffusers.
Hi @zucchini-nlp, I've made some progress here overall and managed to match the text generation logits. Further, I have a few queries which will help me implement the remaining parts.
@yaswanth19 nicee!
You mean in the image model, I guess, since for the text part packed FA2 doesn't yet work 100% well. I am adding packed attention for all implementations in #39121 for the Qwen-VL models. Not sure if Bagel also uses 3D positions; you can take a look at the dummy tests to see how I packed sequences with "image+text". Quite a few workarounds to account for 3D vision positions 😅
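To make the "packed" part concrete: varlen-style FA2 kernels take cumulative sequence lengths instead of a padded 2D mask. This is an illustrative sketch only (the `cu_seqlens` name follows the FlashAttention varlen convention; this is not the transformers implementation):

```python
def build_cu_seqlens(seq_lens):
    """Cumulative sequence lengths [0, l0, l0+l1, ...] as consumed by
    varlen flash-attention kernels in place of a padded 2D mask."""
    cu = [0]
    for length in seq_lens:
        cu.append(cu[-1] + length)
    return cu

# Two samples packed into one row: an 8-token "image+text" sample
# followed by a 4-token text-only sample.
print(build_cu_seqlens([8, 4]))  # [0, 8, 12]
```

The kernel then attends within each `[cu[i], cu[i+1])` slice, so no cross-sample attention leaks even though everything lives in one packed row.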
For the text model itself. If you have some time, then please refer
Sounds good - at least with the current state in mind this will create some more steps on the user side, but it shouldn't be anything major.
AFAIK, I don't think so, but there is a nuance with position_ids which makes using position_ids to compute the FA kwargs incorrect. So at least for now, I am passing them manually.
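A toy illustration of that nuance, under the assumption that the kwargs would otherwise be inferred from position-id resets (hypothetical helper, not actual library code):

```python
def seq_starts_from_position_ids(position_ids):
    """Naive heuristic: treat every reset to position 0 as the start
    of a new packed sequence."""
    return [i for i, p in enumerate(position_ids) if p == 0]

# Plain 1D text positions: the heuristic recovers the real packing boundaries.
print(seq_starts_from_position_ids([0, 1, 2, 3, 0, 1, 2]))  # [0, 4]

# Vision-style positions that restart *inside* one sample (e.g. per patch
# row in a 3D grid) produce spurious "boundaries" - hence passing the
# sequence lengths manually instead of deriving them from position_ids.
print(seq_starts_from_position_ids([0, 1, 0, 1, 0, 1, 2]))  # [0, 2, 4]
```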
Okay, having a look today. Btw, for full attention within an image, we can use
Ah, you are trying to apply packed FA2 when autoregressively decoding. Supposing that the attention mask is a usual 2D tensor and we don't pass
(`transformers/src/transformers/modeling_flash_attention_utils.py`, lines 280 to 284 in f4d0765)
So as long as we can take the FA2 with
Oh, and btw, I just remembered that we already have another model which uses different image processor configs depending on the model. In case you want to look at how it was implemented, it is the Hybrid version here :)
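Conceptually, the utility referenced above has to recover variable-length kwargs from that usual padded 2D mask. A rough sketch in plain Python of what such a conversion computes (illustrative only, assuming right padding; not the actual transformers code):

```python
def fa2_kwargs_from_2d_mask(mask):
    """Recover per-row valid lengths, cumulative offsets, and the max
    length from a padded 0/1 attention mask (right padding assumed)."""
    seq_lens = [sum(row) for row in mask]
    cu_seqlens = [0]
    for length in seq_lens:
        cu_seqlens.append(cu_seqlens[-1] + length)
    return {
        "seq_lens": seq_lens,
        "cu_seqlens": cu_seqlens,
        "max_seqlen": max(seq_lens),
    }

# Batch of 2: first row has 3 valid tokens plus 1 pad, second has 4.
mask = [[1, 1, 1, 0], [1, 1, 1, 1]]
print(fa2_kwargs_from_2d_mask(mask))
# {'seq_lens': [3, 4], 'cu_seqlens': [0, 3, 7], 'max_seqlen': 4}
```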
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, bagel
Hi @zucchini-nlp, unfortunately I won't be able to continue with this PR due to limited bandwidth of late and slow iteration on my side, primarily because the original codebase relies heavily on FA2 and some unconventional but super optimized code flow, which my setup doesn't support 😿. Please feel free to pick this up based on your bandwidth, or open it up for community contributions; the PR already contains 99% of the independent building blocks, though the code structure is still brittle/hacky. Mainly refer to the modelling file, which can later be distilled into modular. In its current state, the main TODOs are:
I would be happy to contribute any other model if you have some good VLM in mind for the library 🤗 😅
CC: @zucchini-nlp for visibility
@yaswanth19 no problem. I don't think there are many requests to support Bagel currently, so let's leave the PR open for community contributions if anyone wants to pick it up. Thanks a lot for your efforts on the model; unfortunately, image generation transformer models are a bit tricky and aren't standardized. I understand that it can take a lot of time.
Fixes #38267
Just a draft PR right now to build upon, iterate on, and discuss during the integration.