[WIP]: Base multimodal model for VLLM's transformers backend
#36367
What does this PR do?
This is a draft PR to get your feedback on the ideas I have for enabling the `transformers` backend for vision LLMs in vLLM. A few things had to be changed, which are a bit breaking.

- We need a base model without an LM head for all models. We cannot make assumptions about a model's architecture and load the vision/LM parts separately, since some models have a multimodal projector, some have pooling, etc. We could maybe use `AutoModelForImageTextToText`, but that means we would have to infer the modality from the config and also map every future multimodal model through its own auto-class (`AutoVideoTextToText`?). So I decided a base class is the most generic approach (see the first sketch after this list).
- For processors, one new helper is added so that vLLM can get prompt replacements without having to add `_get_prompt_replacements` for each processor. On the vLLM side we'll set `return_mm_token_type_ids` to `True` by default and then infer placeholder positions by splitting `mm_token_type_ids` into chunks matching each item's multimodal token length. Each chunk will then be a separate placeholder.

Pseudocode for the placeholder inference is in the second sketch below; I'll add more in the draft PR on the vLLM repo.
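First, a rough sketch of the headless base-model idea: a vision encoder plus projector feeding a decoder that returns hidden states, with no LM head on top. The class and attribute names here (`BaseMultimodalModel`, `vision_tower`, `multi_modal_projector`, `language_model`) are illustrative assumptions, not the final transformers API:

```python
import torch
from torch import nn
from typing import Optional


class BaseMultimodalModel(nn.Module):
    """Illustrative headless multimodal base: no LM head, so the caller
    (a *ForConditionalGeneration wrapper, vLLM, a pooling model, ...)
    decides what goes on top of the hidden states."""

    def __init__(self, vision_tower: nn.Module, multi_modal_projector: nn.Module,
                 language_model: nn.Module, image_token_id: int):
        super().__init__()
        self.vision_tower = vision_tower                    # image encoder
        self.multi_modal_projector = multi_modal_projector  # vision features -> LM hidden size
        self.language_model = language_model                # decoder WITHOUT lm_head (e.g. a transformers *Model)
        self.image_token_id = image_token_id                # id of the <image> placeholder token

    def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Encode images and project them into the LM embedding space.
        return self.multi_modal_projector(self.vision_tower(pixel_values))

    def forward(self, input_ids: torch.LongTensor,
                pixel_values: Optional[torch.Tensor] = None,
                attention_mask: Optional[torch.Tensor] = None):
        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
        if pixel_values is not None:
            image_features = self.get_image_features(pixel_values)
            # Scatter projected image features into the <image> placeholder positions.
            image_mask = (input_ids == self.image_token_id).unsqueeze(-1)
            inputs_embeds = inputs_embeds.masked_scatter(
                image_mask, image_features.to(inputs_embeds.dtype))
        # Return hidden states only; no logits are produced here.
        return self.language_model(inputs_embeds=inputs_embeds,
                                   attention_mask=attention_mask)
```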
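And a minimal sketch of the placeholder inference on the vLLM side, assuming a single sequence and that we already know how many multimodal tokens each item expands to. `infer_mm_placeholders`, `mm_token_lengths`, and the returned dict layout are made-up names for illustration, not vLLM's actual interface:

```python
import torch


def infer_mm_placeholders(mm_token_type_ids: torch.Tensor, mm_token_lengths: list[int]):
    """Split the positions marked in `mm_token_type_ids` (1 = multimodal token,
    0 = text token) into per-item (offset, length) placeholders, consuming
    `mm_token_lengths` in order."""
    mm_positions = (mm_token_type_ids[0] == 1).nonzero(as_tuple=True)[0].tolist()
    placeholders, cursor = [], 0
    for length in mm_token_lengths:
        chunk = mm_positions[cursor:cursor + length]
        placeholders.append({"offset": chunk[0], "length": length})
        cursor += length
    return placeholders


# Example: two images occupying 3 and 2 placeholder tokens respectively.
token_type_ids = torch.tensor([[0, 1, 1, 1, 0, 0, 1, 1, 0]])
print(infer_mm_placeholders(token_type_ids, [3, 2]))
# [{'offset': 1, 'length': 3}, {'offset': 6, 'length': 2}]
```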