
model support: Sarashina2VisionForCausalLM #10632

Merged: zhyncs merged 2 commits into main from chang/sarashina on Sep 19, 2025

Conversation

CatherineSue (Collaborator) commented Sep 19, 2025

Motivation

This PR focuses on adding support for the Sarashina2VisionForCausalLM architecture, used by models such as sbintuitions/sarashina2-vision-8b.

The model type is built upon Qwen2VL and LlamaForCausalLM.

Caveats

  1. sbintuitions/sarashina2-vision-8b has outdated imports from a previous version of transformers. See: https://huggingface.co/sbintuitions/sarashina2-vision-8b/blob/main/processing_sarashina2_vision.py#L41
  2. The model's chat template doesn't follow the standard Jinja multimodal convention: it uses message['content'] directly without checking its type. We added a customized chat template.
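To illustrate the second caveat, here is a minimal sketch (with a hypothetical helper name) of the distinction the custom template has to handle: OpenAI-style multimodal messages carry content as a list of typed parts, while the model's stock template reads message['content'] as if it were a plain string.

```python
# Hypothetical helper illustrating the caveat above; not part of the PR.
def extract_text(content) -> str:
    """Collect the text parts from either a raw string or a typed-part list."""
    if isinstance(content, str):
        # Stock template's assumption: content is already a plain string.
        return content
    # Multimodal convention: content is a list of {"type": ..., ...} parts.
    return "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )
```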

Modifications

  • Add processor for sarashina2_vision.
  • Add model file for sarashina2_vision.
  • Add get_input_embeddings to LlamaModel, as it is required by Qwen2VLProcessor.
  • Add a customized chat template in examples/chat_template that checks message['content']['type'].
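The get_input_embeddings addition can be sketched as follows. This is a minimal illustration, not the actual SGLang implementation; the embed_tokens attribute name follows common Llama implementations.

```python
import torch
from torch import nn


class LlamaModelSketch(nn.Module):
    """Minimal stand-in for LlamaModel showing the added accessor."""

    def __init__(self, vocab_size: int = 32, hidden_size: int = 8):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Multimodal code paths call this to fetch text embeddings that are
        # later merged with image features before the forward pass.
        return self.embed_tokens(input_ids)
```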

Accuracy Tests

Simple Example

Server start up command:

python3 -m sglang.launch_server --model /models/sbintuitions/sarashina2-vision-8b --port=8080 --tp-size=8 --chat-template=examples/chat_template/vision_template_sarashina.jinja --trust-remote-code --disable-fast-image-processor
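Once the server is up, it can be queried through the OpenAI-compatible chat endpoint. The sketch below uses only the standard library; the endpoint path and payload shape follow the usual OpenAI convention, and the model name is illustrative.

```python
import json
import urllib.request

payload = {
    "model": "sarashina2-vision-8b",  # illustrative; the served model name may differ
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://images.pexels.com/photos/210186/pexels-photo-210186.jpeg"},
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment when the server from the command above is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```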

Image: "https://images.pexels.com/photos/210186/pexels-photo-210186.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=4096&w=4096"
[Screenshot: model's response to the image prompt]

Text-only: "What is 1+1? What is the capital of France?"
[Screenshot: model's response to the text-only prompt]

Benchmarking and Profiling

Checklist

gemini-code-assist (Contributor) commented
Summary of Changes

Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for the Sarashina2VisionForCausalLM model, a multimodal architecture that combines a Llama-based language model with a Qwen2VL vision component. The changes enable the system to process and generate responses based on both text and image inputs for this specific model, addressing its unique processing requirements and chat template format.

Highlights

  • New Model Support: Added comprehensive support for the Sarashina2VisionForCausalLM multimodal model, which integrates a Llama text backbone with a Qwen2VL vision encoder.
  • Processor Integration: Implemented a dedicated processor for Sarashina2Vision to handle multimodal data, including image processing and tokenization.
  • Llama Model Enhancement: Extended the LlamaModel with a get_input_embeddings method, required for compatibility with the new vision model's processing.
  • Dynamic Image Sizing: Updated the get_processor utility to automatically inject default image size parameters for Sarashina2Vision models, similar to Qwen2-VL.
  • Custom Chat Template: A customized chat template was added to correctly handle the model's specific multimodal message content structure, addressing a caveat in its default behavior.
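The dynamic image sizing highlight can be sketched like this. The min_pixels/max_pixels keys and their defaults are assumptions borrowed from the Qwen2-VL convention the text compares against, not the exact SGLang code.

```python
def inject_default_image_sizes(kwargs: dict, model_type: str) -> dict:
    """Add default image size bounds for Sarashina2Vision, Qwen2-VL style."""
    if model_type == "sarashina2_vision":
        # setdefault keeps any values the caller passed explicitly.
        kwargs.setdefault("min_pixels", 256 * 28 * 28)
        kwargs.setdefault("max_pixels", 1280 * 28 * 28)
    return kwargs
```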

gemini-code-assist bot left a comment
Code Review

This pull request adds support for the Sarashina2VisionForCausalLM model. The changes are well-structured, including the model definition, a dedicated processor, and necessary modifications to existing utilities.

My review focuses on the new model implementation and processor. I've identified a bug in the model's __init__ method related to config handling, an unused variable, and a maintainability concern in the processor's monkey-patching logic. The suggestions aim to fix the bug, improve code clarity, and make the code more robust to upstream changes.

Overall, this is a solid contribution. Addressing the feedback will further improve the quality and maintainability of the new model support.

Comment on lines +78 to +96
        if hasattr(text_config, "model_type") and text_config.model_type == "llama":
            llama_config = LlamaConfig(**text_config.__dict__)
            # Set vocab_size from main config if available
            if hasattr(config, "vocab_size"):
                llama_config.vocab_size = config.vocab_size
            self.llm = LlamaForCausalLM(
                llama_config,
                quant_config=quant_config,
                prefix=add_prefix("llm", prefix),
            )
        else:
            # Set vocab_size from main config if available
            if hasattr(config, "vocab_size"):
                config.vocab_size = config.vocab_size
            self.llm = LlamaForCausalLM(
                config,
                quant_config=quant_config,
                prefix=add_prefix("llm", prefix),
            )
Severity: high

This block for initializing the Llama text model can be simplified to remove redundant code and fix a no-op assignment. The line config.vocab_size = config.vocab_size is a bug and has no effect. The logic can be consolidated for better readability and robustness.

        if hasattr(text_config, "model_type") and text_config.model_type == "llama":
            llama_config = LlamaConfig(**text_config.__dict__)
        else:
            llama_config = config

        # Set vocab_size from main config if available
        if hasattr(config, "vocab_size"):
            llama_config.vocab_size = config.vocab_size

        self.llm = LlamaForCausalLM(
            llama_config,
            quant_config=quant_config,
            prefix=add_prefix("llm", prefix),
        )

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        """Load model weights."""
        params_dict = dict(self.named_parameters())
        loaded_params = set()
Severity: medium

The loaded_params variable is initialized but never used. It should be removed to improve code clarity.

Comment on lines +36 to +47
allowed_params = {
    "do_resize",
    "resample",
    "do_rescale",
    "rescale_factor",
    "do_normalize",
    "image_mean",
    "image_std",
    "do_convert_rgb",
    "data_format",
    "input_data_format",
}
Severity: medium

This hardcoded set of allowed_params is brittle and may break if the signature of the upstream Sarashina2VisionImageProcessor._preprocess method changes. While this monkey-patching is a necessary workaround, consider adding a comment with a direct link to the source of this signature on Hugging Face to make future maintenance easier.

@zhyncs zhyncs merged commit c1815a9 into main Sep 19, 2025
8 of 55 checks passed
@zhyncs zhyncs deleted the chang/sarashina branch September 19, 2025 00:30
chenxu140 added a commit to ping1jing2/sglang that referenced this pull request Sep 20, 2025
* origin/qwen3: (30 commits)
  chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
  feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
  model support: Sarashina2VisionForCausalLM (sgl-project#10632)
  [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
  [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
  [Feature] Speculative decoding support lookahead (sgl-project#9873)
  refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
  [router] refactor worker to builder pattern 1/n (sgl-project#10628)
  Garbage collector regression in the online server (sgl-project#10621)
  feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
  Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
  [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
  Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
  support qwen3-next-fp8 deepep (sgl-project#10622)
  update deepep version for qwen3-next deepep moe (sgl-project#10624)
  Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
  [RL] Add destroy process group api (sgl-project#9979)
  fix deepep assert when PD disaggregation == null (sgl-project#8274)
  Scale kkt after reduction (sgl-project#10604)
  [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
  ...
HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025

3 participants