
Conversation

@yaswanth19 (Contributor) commented Jun 3, 2025

Fixes #38267

Just a draft PR right now to build upon and iterate on; we can discuss during the integration.

@yaswanth19 yaswanth19 marked this pull request as draft June 3, 2025 19:26
@yaswanth19 (Contributor, Author) commented Jun 14, 2025

Hey @Shakib-IO, I've added you as a collaborator, and this is the draft PR. I have started working on the modelling code, and if you don't mind, you can start working on the Processor class (also to avoid merge conflicts 😅). Later, once a rough processor version is ready, we can complete the modelling file to support all the tasks.

@Shakib-IO (Contributor)

Thanks @yaswanth19.

@Seas0 commented Jun 30, 2025

I think it might be beneficial to implement an ImageTextToText model first, given the MLLM nature of the BAGEL model, and then add generative capability to it with an extra library like diffusers.

@yaswanth19 (Contributor, Author) commented Jul 18, 2025

Hi @zucchini-nlp, I've made some progress here overall and managed to match the text-generation logits. I have a few queries which will help me implement the remaining parts:

  • In the original implementation, the image preprocessor is initialized twice depending on the generation mode (text or image). It is essentially the same set of ops, but with different values for stride and for the max and min image size. Is there a way we can have nested values in the preprocessor config so that both sets of values are loaded on from_pretrained, or do we need to define separate sets of variables like vit_stride and vae_stride?
  • Also, the original implementation uses FA and unpacking extensively, and this might not be completely compatible with the FA2 interface in transformers. I will get a concrete picture once the end-to-end code is ready. Right now I am manually passing FA kwargs as in qwen_omni, and here I have a query: we always return the same set of cu_seq_lens for cu_seq_q and cu_seq_k in the Qwen modeling code and even in modeling_flash_attention_utils. But shouldn't cu_seq_k >= cu_seq_q, since we have to add the cached KV values before computing attention? Is there a gap in my understanding here, or is it happening but I missed that code 🙏

@zucchini-nlp (Member)

@yaswanth19 nice!

  1. Yes, we should keep everything in the image preprocessor, and if the model requires it, we can have dict-like values (maybe stride={"image_generation": 2, "text_generation": 4}, or anything that you think fits well for the model). It is not super transformers-style, though we don't limit what values users can put in the processing config.
  2. Let's check first whether performance is affected a lot by using the default SDPA. If the model uses packed FA2, we might need to correctly adjust the masks.
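One possible shape for such a mode-nested config, sketched below. All keys, values, and the helper name are placeholders for illustration, not actual BAGEL settings or existing transformers APIs:

```python
# Hypothetical sketch: mode-dependent values stored in a single
# preprocessor config, with scalar values left un-nested.
preprocessor_config = {
    "image_mean": [0.5, 0.5, 0.5],
    "stride": {"image_generation": 2, "text_generation": 4},
    "max_image_size": {"image_generation": 1024, "text_generation": 980},
}

def get_mode_value(config, key, mode):
    """Resolve a value that may be nested per generation mode."""
    value = config[key]
    if isinstance(value, dict):
        return value[mode]
    return value

print(get_mode_value(preprocessor_config, "stride", "image_generation"))  # 2
print(get_mode_value(preprocessor_config, "image_mean", "text_generation"))  # [0.5, 0.5, 0.5]
```

With this layout, from_pretrained could load the whole dict unchanged, and the processor would resolve per-mode values at call time.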

> we always return the same set of cu_seq_lens for cu_seq_q and cu_seq_k in the Qwen modeling code and even in modeling_flash_attention_utils. But shouldn't cu_seq_k >= cu_seq_q?

You mean in the image model, I guess, since packed FA2 doesn't yet work 100% for the text part. I am adding packed attention for all implementations in #39121 for the Qwen-VL models. I'm not sure if Bagel also uses 3D positions; you can take a look at the dummy tests to see how I packed sequences with "image+text". Quite a few workarounds to account for 3D vision positions 😅

@yaswanth19 (Contributor, Author) commented Jul 18, 2025

> You mean in the image model, I guess

For the text model itself. If you have some time, then please refer to BagelTextAttention and BagelModel for how the flow looks for the language model in the modelling file. I am trying to first match the logits using the FA2 implementation; then, for eager and SDPA, we might need to create a custom attention mask (full attention between image tokens and only causal attention for text tokens). It's a bit hacky right now 😅

> we can have dict-like values ... anything that you think fits well for the model

Sounds good. At least with the current state in mind, this will create some more steps on the user side, but it shouldn't be anything major.

> Bagel also uses 3D positions

AFAIK, I don't think so, but there is a nuance with position_ids which makes using position_ids to compute the FA kwargs incorrect. So, at least right now, I am passing them manually.

@zucchini-nlp (Member)

> For the text model itself. If you have some time, then please refer to BagelTextAttention and BagelForConditionalGeneration for how the flow looks for the language model in the modelling file. I am trying to first match the logits using the FA2 implementation; then, for eager and SDPA, we might need to create a custom attention mask (full attention between image tokens and only causal attention for text tokens). It's a bit hacky right now 😅

Okay, having a look today. Btw, for full attention within an image, we can use token_type_ids if it is the same as in the Gemma3 models. In Gemma3, each image attends fully to itself. Although it doesn't attend to all future images, it will be easy to implement with this fn.
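A toy sketch of the mask semantics under discussion: causal attention everywhere, plus full bidirectional attention inside each contiguous image span. The token_type_ids convention here (1 = image token, 0 = text token) is an assumption for illustration, not the actual Bagel implementation:

```python
# Sketch: boolean "allowed to attend" matrix combining a causal mask with
# full attention within each contiguous run of image tokens.
def build_mask(token_type_ids):
    n = len(token_type_ids)
    # Start from a plain causal mask: position i may attend to j <= i.
    allowed = [[j <= i for j in range(n)] for i in range(n)]
    # Find contiguous image spans and make them fully bidirectional.
    i = 0
    while i < n:
        if token_type_ids[i] == 1:
            j = i
            while j < n and token_type_ids[j] == 1:
                j += 1
            for a in range(i, j):
                for b in range(i, j):
                    allowed[a][b] = True
            i = j
        else:
            i += 1
    return allowed

mask = build_mask([0, 1, 1, 0])
assert mask[1][2]      # image token attends to a later token in its own image
assert not mask[0][1]  # text token stays strictly causal
```

In practice this would be expressed as an additive float mask (0 / -inf) for eager and SDPA paths, but the allowed/blocked pattern is the same.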

@zucchini-nlp (Member) commented Jul 18, 2025

Ah, you are trying to apply packed FA2 when autoregressively decoding. Assuming the attention mask is a usual 2D tensor and we don't pass position_ids to the mask-creation function, it should work well out of the box. The cumulative length will be different in the decoding phase for q vs. k; it's handled here:

```python
elif query_length == 1:
    max_seqlen_in_batch_q = 1
    cu_seqlens_q = torch.arange(
        batch_size + 1, dtype=torch.int32, device=query_layer.device
    )  # There is a memcpy here, that is very bad.
```

So as long as we can take the FA2-with-attention_mask path and no positions are passed, it will work. We can see how to deal with position_ids later, since Bagel also has a non-standard format for it.

Oh, and btw, I just remembered that we already have another model which uses different image-processor configs depending on the mode. In case you want to see how it was implemented, it is the Hybrid version here :)

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, bagel

@yaswanth19 (Contributor, Author) commented Aug 14, 2025

Hi @zucchini-nlp, unfortunately I won't be able to continue with this PR due to limited bandwidth of late and slow iteration on my side, primarily because the original codebase relies heavily on FA2 and some unconventional but super-optimized code flow, which my setup doesn't support 😿. I've been working with some hacks to get around this, but it's been too slow to make progress and is consuming too much of my time.

Please feel free to pick this up based on your bandwidth, or open it up for community contributions; the PR already contains 99% of the independent building blocks, though the code structure is still brittle/hacky. Mainly refer to the modelling file, which can later be distilled into a modular file.

In its current state, the main TODOs are:

  • Utilize CFG for image generation. There are two CFG scales, namely for the text and image inputs.
  • One block is missing which is used for image editing: it does image processing via the VAE model (along with the ViT model). The corresponding functions in the codebase are prepare_vae_images and forward_cache_update_vae.
  • Update the processor to also include processing for the VAE transforms. The image-processing flow is the same, just with different values; we can add vae_image_mean, vae_image_std, etc., and create a _preprocess func to run on the different inputs.
  • Refactoring to bring it up to transformers standards, plus a test suite.
  • It supports three tasks primarily: T2I, image editing, and text generation. Thinking can be added to all three tasks here.
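The shared-processing TODO above could be sketched roughly as follows; the attribute names (vae_image_mean, vae_image_std) are the proposed additions from the list, and all concrete values here are placeholders:

```python
# Hypothetical sketch: one _preprocess helper reused for the ViT path and
# the VAE path, differing only in the normalization stats passed in.
def _preprocess(pixels, mean, std):
    """Normalize a flat list of pixel values with the given stats."""
    return [(p - mean) / std for p in pixels]

image_mean, image_std = 0.5, 0.5          # ViT-path stats (placeholder)
vae_image_mean, vae_image_std = 0.0, 1.0  # proposed VAE-path stats (placeholder)

pixels = [0.0, 0.5, 1.0]
vit_inputs = _preprocess(pixels, image_mean, image_std)
vae_inputs = _preprocess(pixels, vae_image_mean, vae_image_std)
print(vit_inputs)  # [-1.0, 0.0, 1.0]
print(vae_inputs)  # [0.0, 0.5, 1.0]
```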
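On the dual-CFG TODO, one common way to combine two guidance scales is the InstructPix2Pix-style cascade shown below. This is a hedged sketch of that generic technique, not necessarily BAGEL's exact scheme; the predictions are shown as plain floats for simplicity, where in practice they would be latent tensors:

```python
# Generic dual classifier-free guidance: start from the unconditional
# prediction, add back the image condition scaled by s_image, then the
# text condition scaled by s_text.
def dual_cfg(pred_uncond, pred_image_only, pred_full, s_image=1.5, s_text=4.0):
    return (pred_uncond
            + s_image * (pred_image_only - pred_uncond)
            + s_text * (pred_full - pred_image_only))

# With both scales at 1.0, the guidance reduces to the fully conditioned output.
assert dual_cfg(0.0, 0.5, 1.0, s_image=1.0, s_text=1.0) == 1.0
```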

I would be happy to contribute to any other model if you have some good VLM in mind for the library 🤗 😅

@yaswanth19 (Contributor, Author)

CC @zucchini-nlp for visibility

@zucchini-nlp (Member)

@yaswanth19 No problem. I don't think there are many requests to support Bagel currently, so let's leave the PR open for community contributions if anyone wants to pick it up. Thanks a lot for your efforts on the model; unfortunately, image-generation transformer models are a bit tricky and aren't standardized. I understand that it can take a lot of time.



Development

Successfully merging this pull request may close these issues.

Add support for BAGEL from ByteDance

4 participants