@zucchini-nlp (Member)
What does this PR do?

This is a draft PR to gather feedback on my ideas for enabling the transformers backend for vision LLMs in vLLM. A few changes were needed, some of which are slightly breaking.

  • We need a base model without an LM head for all models. We cannot make assumptions about a model's architecture and load the vision and LM parts separately, since some models have a multimodal projector, some have pooling, etc. We could use AutoModelForImageTextToText, but that means we would have to infer the modality from the config and also map every future multimodal model through its own auto-class (AutoVideoTextToText?). So I decided a base class is the most generic approach

    • The current state is a bit of a mess to support BC; I hope it will get better once we allow mapping checkpoint keys in a fine-grained way
  • For processors, one new helper was added so vLLM can get prompt replacements without having to add _get_prompt_replacements to each processor. On the vLLM side we'll set return_mm_token_type_ids to True by default and then infer positions by splitting mm_token_type_ids into the respective per-modality token lengths. Thus each chunk will be a separate placeholder

Pseudocode; more details will follow in a draft PR on the vLLM repo:

mm_positions = torch.where(mm_token_type_ids == 1)[1]
mm_tokens_image = hf_processor._get_num_mm_tokens(image_inputs)

chunked_mm_positions = torch.split(mm_positions, mm_tokens_image["image"])
ranges = [
    PlaceholderRange(offset=positions[0].item(), length=positions.shape[0])
    for positions in chunked_mm_positions
]
mm_placeholders = {"image": ranges}
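To make the splitting idea concrete, here is a minimal, self-contained sketch of the same logic with dummy tensors. The token counts and the `(offset, length)` tuples stand in for the processor helper and vLLM's `PlaceholderRange`, which are assumed from the description above rather than imported:

```python
import torch

# Toy inputs: token_type_ids marks image placeholder tokens with 1, text with 0.
# Here the prompt contains two images occupying 3 and 2 tokens respectively.
mm_token_type_ids = torch.tensor([[0, 0, 1, 1, 1, 0, 1, 1, 0]])

# Positions of all multimodal tokens in the sequence.
mm_positions = torch.where(mm_token_type_ids == 1)[1]

# Per-image token counts; in the PR these would come from the processor helper.
num_mm_tokens_per_image = [3, 2]

# Split the flat position list into one chunk per image.
chunked_mm_positions = torch.split(mm_positions, num_mm_tokens_per_image)

# Each chunk becomes a separate placeholder range (offset, length).
ranges = [(chunk[0].item(), chunk.shape[0]) for chunk in chunked_mm_positions]
# ranges == [(2, 3), (6, 2)]
```

Because `torch.split` accepts a list of chunk sizes, images with different placeholder lengths fall out naturally as separate ranges.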

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp changed the title [WIP]: LLaVa for VLLM's transformers backend [WIP]: Base multimodal model for VLLM's transformers backend Mar 25, 2025