
model support: Sarashina2VisionForCausalLM #10632

Merged: zhyncs merged 2 commits into main from chang/sarashina on Sep 19, 2025

Conversation

CatherineSue (Collaborator) commented Sep 19, 2025

Motivation

This PR focuses on adding support for the Sarashina2VisionForCausalLM architecture, used by models such as sbintuitions/sarashina2-vision-8b.

The model type is built upon Qwen2VL and LlamaForCausalLM.

Caveats

  1. sbintuitions/sarashina2-vision-8b has outdated imports from a previous version of transformers. See: https://huggingface.co/sbintuitions/sarashina2-vision-8b/blob/main/processing_sarashina2_vision.py#L41
  2. The model's chat template doesn't follow the standard Jinja multimodal convention: it uses message['content'] directly without checking its type. We added a customized chat template.
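To illustrate the second caveat, here is a minimal sketch (with a hypothetical helper name) of the distinction the custom template has to handle: OpenAI-style multimodal messages carry content as a list of typed parts, while the model's stock template reads message['content'] as if it were a plain string.

```python
# Hypothetical helper illustrating the caveat above; not part of the PR.
def extract_text(content) -> str:
    """Collect the text parts from either a raw string or a typed-part list."""
    if isinstance(content, str):
        # Stock template's assumption: content is already a plain string.
        return content
    # Multimodal convention: content is a list of {"type": ..., ...} parts.
    return "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )
```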

Modifications

  • Add processor for sarashina2_vision.
  • Add model file for sarashina2_vision.
  • Add get_input_embeddings to LlamaModel, as it is required by Qwen2VLProcessor.
  • Add a customized chat template in examples/chat_template that checks message['content']['type'].
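The get_input_embeddings addition can be sketched as follows. This is a minimal illustration, not the actual SGLang implementation; the embed_tokens attribute name follows common Llama implementations.

```python
import torch
from torch import nn


class LlamaModelSketch(nn.Module):
    """Minimal stand-in for LlamaModel showing the added accessor."""

    def __init__(self, vocab_size: int = 32, hidden_size: int = 8):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Multimodal code paths call this to fetch text embeddings that are
        # later merged with image features before the forward pass.
        return self.embed_tokens(input_ids)
```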

Accuracy Tests

Simple Example

Server start up command:

python3 -m sglang.launch_server --model /models/sbintuitions/sarashina2-vision-8b --port=8080 --tp-size=8 --chat-template=examples/chat_template/vision_template_sarashina.jinja --trust-remote-code --disable-fast-image-processor
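Once the server is up, it can be queried through the OpenAI-compatible chat endpoint. The sketch below uses only the standard library; the endpoint path and payload shape follow the usual OpenAI convention, and the model name is illustrative.

```python
import json
import urllib.request

payload = {
    "model": "sarashina2-vision-8b",  # illustrative; the served model name may differ
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://images.pexels.com/photos/210186/pexels-photo-210186.jpeg"},
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment when the server from the command above is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```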

Image: "https://images.pexels.com/photos/210186/pexels-photo-210186.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=4096&w=4096"
[Screenshot: model's response to the image prompt]

Text-only: "What is 1+1? What is the capital of France?"
[Screenshot: model's response to the text-only prompt]

Benchmarking and Profiling

Checklist

gemini-code-assist (Contributor) commented
Summary of Changes

Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for the Sarashina2VisionForCausalLM model, a multimodal architecture that combines a Llama-based language model with a Qwen2VL vision component. The changes enable the system to process and generate responses based on both text and image inputs for this specific model, addressing its unique processing requirements and chat template format.

Highlights

  • New Model Support: Added comprehensive support for the Sarashina2VisionForCausalLM multimodal model, which integrates a Llama text backbone with a Qwen2VL vision encoder.
  • Processor Integration: Implemented a dedicated processor for Sarashina2Vision to handle multimodal data, including image processing and tokenization.
  • Llama Model Enhancement: Extended the LlamaModel with a get_input_embeddings method, required for compatibility with the new vision model's processing.
  • Dynamic Image Sizing: Updated the get_processor utility to automatically inject default image size parameters for Sarashina2Vision models, similar to Qwen2-VL.
  • Custom Chat Template: A customized chat template was added to correctly handle the model's specific multimodal message content structure, addressing a caveat in its default behavior.
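The dynamic image sizing highlight can be sketched like this. The min_pixels/max_pixels keys and their defaults are assumptions borrowed from the Qwen2-VL convention the text compares against, not the exact SGLang code.

```python
def inject_default_image_sizes(kwargs: dict, model_type: str) -> dict:
    """Add default image size bounds for Sarashina2Vision, Qwen2-VL style."""
    if model_type == "sarashina2_vision":
        # setdefault keeps any values the caller passed explicitly.
        kwargs.setdefault("min_pixels", 256 * 28 * 28)
        kwargs.setdefault("max_pixels", 1280 * 28 * 28)
    return kwargs
```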

gemini-code-assist bot left a comment
Code Review

This pull request adds support for the Sarashina2VisionForCausalLM model. The changes are well-structured, including the model definition, a dedicated processor, and necessary modifications to existing utilities.

My review focuses on the new model implementation and processor. I've identified a bug in the model's __init__ method related to config handling, an unused variable, and a maintainability concern in the processor's monkey-patching logic. The suggestions aim to fix the bug, improve code clarity, and make the code more robust to upstream changes.

Overall, this is a solid contribution. Addressing the feedback will further improve the quality and maintainability of the new model support.

Comment on lines +78 to +96
        if hasattr(text_config, "model_type") and text_config.model_type == "llama":
            llama_config = LlamaConfig(**text_config.__dict__)
            # Set vocab_size from main config if available
            if hasattr(config, "vocab_size"):
                llama_config.vocab_size = config.vocab_size
            self.llm = LlamaForCausalLM(
                llama_config,
                quant_config=quant_config,
                prefix=add_prefix("llm", prefix),
            )
        else:
            # Set vocab_size from main config if available
            if hasattr(config, "vocab_size"):
                config.vocab_size = config.vocab_size
            self.llm = LlamaForCausalLM(
                config,
                quant_config=quant_config,
                prefix=add_prefix("llm", prefix),
            )
Severity: high

This block for initializing the Llama text model can be simplified to remove redundant code and fix a no-op assignment. The line config.vocab_size = config.vocab_size is a bug and has no effect. The logic can be consolidated for better readability and robustness.

        if hasattr(text_config, "model_type") and text_config.model_type == "llama":
            llama_config = LlamaConfig(**text_config.__dict__)
        else:
            llama_config = config

        # Set vocab_size from main config if available
        if hasattr(config, "vocab_size"):
            llama_config.vocab_size = config.vocab_size

        self.llm = LlamaForCausalLM(
            llama_config,
            quant_config=quant_config,
            prefix=add_prefix("llm", prefix),
        )

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        """Load model weights."""
        params_dict = dict(self.named_parameters())
        loaded_params = set()
Severity: medium

The loaded_params variable is initialized but never used. It should be removed to improve code clarity.

Comment on lines +36 to +47
allowed_params = {
    "do_resize",
    "resample",
    "do_rescale",
    "rescale_factor",
    "do_normalize",
    "image_mean",
    "image_std",
    "do_convert_rgb",
    "data_format",
    "input_data_format",
}
Severity: medium

This hardcoded set of allowed_params is brittle and may break if the signature of the upstream Sarashina2VisionImageProcessor._preprocess method changes. While this monkey-patching is a necessary workaround, consider adding a comment with a direct link to the source of this signature on Hugging Face to make future maintenance easier.

@zhyncs zhyncs merged commit c1815a9 into main Sep 19, 2025
8 of 55 checks passed
@zhyncs zhyncs deleted the chang/sarashina branch September 19, 2025 00:30
chenxu140 added a commit to ping1jing2/sglang that referenced this pull request Sep 20, 2025
* origin/qwen3: (30 commits)
  chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
  feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
  model support: Sarashina2VisionForCausalLM (sgl-project#10632)
  [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
  [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
  [Feature] Speculative decoding support lookahead (sgl-project#9873)
  refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
  [router] refactor worker to builder pattern 1/n (sgl-project#10628)
  Garbage collector regression in the online server (sgl-project#10621)
  feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
  Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
  [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
  Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
  support qwen3-next-fp8 deepep (sgl-project#10622)
  update deepep version for qwen3-next deepep moe (sgl-project#10624)
  Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
  [RL] Add destroy process group api (sgl-project#9979)
  fix deepep assert when PD disaggregation == null (sgl-project#8274)
  Scale kkt after reduction (sgl-project#10604)
  [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
  ...
HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025

3 participants