
Conversation

@yaswanth19 (Contributor) commented Jun 3, 2025

Fixes #38267

Just a draft PR right now to build upon and iterate on; we can discuss during the integration.

@yaswanth19 yaswanth19 marked this pull request as draft June 3, 2025 19:26
@yaswanth19 (Contributor, Author) commented Jun 14, 2025

Hey @Shakib-IO, I've added you as a collaborator, and this is the draft PR. I have started working on the modelling code, and if you don't mind, you can start working on the Processor class (also to avoid merge conflicts 😅). Later, once a rough processor version is ready, we can complete the modelling file to support all the tasks.

@Shakib-IO (Contributor)

Thanks @yaswanth19.

@Seas0 commented Jun 30, 2025

I think it might be beneficial to implement an ImageTextToText model first, given the MLLM nature of the BAGEL model, and then add generative capability to it with an extra library like diffusers.

@yaswanth19 (Contributor, Author) commented Jul 18, 2025

Hi @zucchini-nlp, I've made some progress here overall and managed to match the text-generation logits. I have a few queries which will help me implement the remaining parts:

  • In the original implementation, the image preprocessor is initialized twice depending on the generation mode (text or image). It is essentially the same set of ops, but with different values for stride and for the max and min image size. Is there a way we can have nested values in the preprocessor config so that both sets of values are loaded on from_pretrained, or do we need to define separate sets of variables like vit_stride and vae_stride?
  • Also, the original implementation uses FA and unpacking extensively, and this might not be completely compatible with the FA2 interface in transformers. I will get a concrete picture once the end-to-end code is ready. Right now I am manually passing FA kwargs as in qwen_omni, and here I have a query: we always return the same set of cu_seq_lens for cu_seq_q and cu_seq_k in the Qwen modeling code and even in modeling_flash_attention_utils. But shouldn't cu_seq_k >= cu_seq_q, since we have to add the cached KV values before computing attention? Is there a gap in my understanding here, or is it happening but I missed that code 🙏

@zucchini-nlp (Member)

@yaswanth19 nice!

  1. Yes, we should keep everything in the image preprocessor, and if the model requires it, we can have dict-like values (maybe stride={"image_generation": 2, "text_generation": 4}, or anything that you think fits well for the model). It is not super transformers-style, though we don't limit what values users can put in the processing config.
  2. Let's check first whether performance is affected a lot by using the default SDPA. If the model uses packed FA2, we might need to correctly adjust the masks.
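One possible shape for such a mode-nested config, sketched below. All keys, values, and the helper name are placeholders for illustration, not actual BAGEL settings or existing transformers APIs:

```python
# Hypothetical sketch: mode-dependent values stored in a single
# preprocessor config, with scalar values left un-nested.
preprocessor_config = {
    "image_mean": [0.5, 0.5, 0.5],
    "stride": {"image_generation": 2, "text_generation": 4},
    "max_image_size": {"image_generation": 1024, "text_generation": 980},
}

def get_mode_value(config, key, mode):
    """Resolve a value that may be nested per generation mode."""
    value = config[key]
    if isinstance(value, dict):
        return value[mode]
    return value

print(get_mode_value(preprocessor_config, "stride", "image_generation"))  # 2
print(get_mode_value(preprocessor_config, "image_mean", "text_generation"))  # [0.5, 0.5, 0.5]
```

With this layout, from_pretrained could load the whole dict unchanged, and the processor would resolve per-mode values at call time.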

> we always return the same set of cu_seq_lens for cu_seq_q and cu_seq_k in the Qwen modeling code and even in modeling_flash_attention_utils. But shouldn't cu_seq_k >= cu_seq_q?

You mean in the image model, I guess, since packed FA2 doesn't yet work 100% for the text part. I am adding packed attention for all implementations in #39121 for the Qwen-VL models. I'm not sure if Bagel also uses 3D positions; you can take a look at the dummy tests to see how I packed sequences with "image+text". Quite a few workarounds to account for 3D vision positions 😅

@yaswanth19 (Contributor, Author) commented Jul 18, 2025

> You mean in the image model, I guess

For the text model itself. If you have some time, then please refer to BagelTextAttention and BagelModel for how the flow looks for the language model in the modelling file. I am trying to first match the logits using the FA2 implementation; then, for eager and SDPA, we might need to create a custom attention mask (full attention between image tokens and only causal attention for text tokens). It's a bit hacky right now 😅

> we can have dict-like values ... anything that you think fits well for the model

Sounds good. At least with the current state in mind, this will create some more steps on the user side, but it shouldn't be anything major.

> Bagel also uses 3D positions

AFAIK, I don't think so, but there is a nuance with position_ids which makes using position_ids to compute the FA kwargs incorrect. So, at least right now, I am passing them manually.

@zucchini-nlp (Member)

> For the text model itself. If you have some time, then please refer to BagelTextAttention and BagelForConditionalGeneration for how the flow looks for the language model in the modelling file. I am trying to first match the logits using the FA2 implementation; then, for eager and SDPA, we might need to create a custom attention mask (full attention between image tokens and only causal attention for text tokens). It's a bit hacky right now 😅

Okay, having a look today. Btw, for full attention within an image, we can use token_type_ids if it is the same as in the Gemma3 models. In Gemma3, each image attends fully to itself. Although it doesn't attend to all future images, it will be easy to implement with this fn.
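A toy sketch of the mask semantics under discussion: causal attention everywhere, plus full bidirectional attention inside each contiguous image span. The token_type_ids convention here (1 = image token, 0 = text token) is an assumption for illustration, not the actual Bagel implementation:

```python
# Sketch: boolean "allowed to attend" matrix combining a causal mask with
# full attention within each contiguous run of image tokens.
def build_mask(token_type_ids):
    n = len(token_type_ids)
    # Start from a plain causal mask: position i may attend to j <= i.
    allowed = [[j <= i for j in range(n)] for i in range(n)]
    # Find contiguous image spans and make them fully bidirectional.
    i = 0
    while i < n:
        if token_type_ids[i] == 1:
            j = i
            while j < n and token_type_ids[j] == 1:
                j += 1
            for a in range(i, j):
                for b in range(i, j):
                    allowed[a][b] = True
            i = j
        else:
            i += 1
    return allowed

mask = build_mask([0, 1, 1, 0])
assert mask[1][2]      # image token attends to a later token in its own image
assert not mask[0][1]  # text token stays strictly causal
```

In practice this would be expressed as an additive float mask (0 / -inf) for eager and SDPA paths, but the allowed/blocked pattern is the same.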

@zucchini-nlp (Member) commented Jul 18, 2025

Ah, you are trying to apply packed FA2 when autoregressively decoding. Assuming the attention mask is a usual 2D tensor and we don't pass position_ids to the mask-creation function, it should work well out of the box. The cumulative length will be different in the decoding phase for q vs. k; it's handled here:

```python
elif query_length == 1:
    max_seqlen_in_batch_q = 1
    cu_seqlens_q = torch.arange(
        batch_size + 1, dtype=torch.int32, device=query_layer.device
    )  # There is a memcpy here, that is very bad.
```

So as long as we can take the FA2-with-attention_mask path and no positions are passed, it will work. We can see how to deal with position_ids later, since Bagel also has a non-standard format for it.

Oh, and btw, I just remembered that we already have another model which uses different image-processor configs depending on the mode. In case you want to see how it was implemented, it is the Hybrid version here :)

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, bagel

@yaswanth19 (Contributor, Author) commented Aug 14, 2025

Hi @zucchini-nlp, unfortunately I won't be able to continue with this PR due to limited bandwidth of late and slow iteration on my side, primarily because the original codebase relies heavily on FA2 and some unconventional but super-optimized code flow, which my setup doesn't support 😿. I've been working with some hacks to get around this, but it's been too slow to make progress and is consuming too much of my time.

Please feel free to pick this up based on your bandwidth, or open it up for community contributions; the PR already contains 99% of the independent building blocks, though the code structure is still brittle/hacky. Mainly refer to the modelling file, which can later be distilled into a modular file.

In its current state, the main TODOs are:

  • Utilize CFG for image generation. There are two CFG scales, namely for the text and image inputs.
  • One block is missing which is used for image editing: it does image processing via the VAE model (along with the ViT model). The corresponding functions in the codebase are prepare_vae_images and forward_cache_update_vae.
  • Update the processor to also include processing for the VAE transforms. The image-processing flow is the same, just with different values; we can add vae_image_mean, vae_image_std, etc., and create a _preprocess func to run on the different inputs.
  • Refactoring to bring it up to transformers standards, plus a test suite.
  • It supports three tasks primarily: T2I, image editing, and text generation. Thinking can be added to all three tasks here.
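The shared-processing TODO above could be sketched roughly as follows; the attribute names (vae_image_mean, vae_image_std) are the proposed additions from the list, and all concrete values here are placeholders:

```python
# Hypothetical sketch: one _preprocess helper reused for the ViT path and
# the VAE path, differing only in the normalization stats passed in.
def _preprocess(pixels, mean, std):
    """Normalize a flat list of pixel values with the given stats."""
    return [(p - mean) / std for p in pixels]

image_mean, image_std = 0.5, 0.5          # ViT-path stats (placeholder)
vae_image_mean, vae_image_std = 0.0, 1.0  # proposed VAE-path stats (placeholder)

pixels = [0.0, 0.5, 1.0]
vit_inputs = _preprocess(pixels, image_mean, image_std)
vae_inputs = _preprocess(pixels, vae_image_mean, vae_image_std)
print(vit_inputs)  # [-1.0, 0.0, 1.0]
print(vae_inputs)  # [0.0, 0.5, 1.0]
```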
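On the dual-CFG TODO, one common way to combine two guidance scales is the InstructPix2Pix-style cascade shown below. This is a hedged sketch of that generic technique, not necessarily BAGEL's exact scheme; the predictions are shown as plain floats for simplicity, where in practice they would be latent tensors:

```python
# Generic dual classifier-free guidance: start from the unconditional
# prediction, add back the image condition scaled by s_image, then the
# text condition scaled by s_text.
def dual_cfg(pred_uncond, pred_image_only, pred_full, s_image=1.5, s_text=4.0):
    return (pred_uncond
            + s_image * (pred_image_only - pred_uncond)
            + s_text * (pred_full - pred_image_only))

# With both scales at 1.0, the guidance reduces to the fully conditioned output.
assert dual_cfg(0.0, 0.5, 1.0, s_image=1.0, s_text=1.0) == 1.0
```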

I would be happy to contribute to any other model if you have some good VLM in mind for the library 🤗 😅

@yaswanth19 (Contributor, Author)

CC @zucchini-nlp for visibility

@zucchini-nlp (Member)

@yaswanth19 No problem. I don't think there are many requests to support Bagel currently, so let's leave the PR open for community contributions if anyone wants to pick it up. Thanks a lot for your efforts on the model; unfortunately, image-generation transformer models are a bit tricky and aren't standardized. I understand that it can take a lot of time.



Development

Successfully merging this pull request may close these issues.

Add support for BAGEL from ByteDance

4 participants