model support: Sarashina2VisionForCausalLM #10632
Conversation
Summary of Changes

Hello @CatherineSue, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces support for the Sarashina2VisionForCausalLM model.
Code Review
This pull request adds support for the Sarashina2VisionForCausalLM model. The changes are well-structured, including the model definition, a dedicated processor, and necessary modifications to existing utilities.
My review focuses on the new model implementation and processor. I've identified a bug in the model's __init__ method related to config handling, an unused variable, and a maintainability concern in the processor's monkey-patching logic. The suggestions aim to fix the bug, improve code clarity, and make the code more robust to upstream changes.
Overall, this is a solid contribution. Addressing the feedback will further improve the quality and maintainability of the new model support.
```python
if hasattr(text_config, "model_type") and text_config.model_type == "llama":
    llama_config = LlamaConfig(**text_config.__dict__)
    # Set vocab_size from main config if available
    if hasattr(config, "vocab_size"):
        llama_config.vocab_size = config.vocab_size
    self.llm = LlamaForCausalLM(
        llama_config,
        quant_config=quant_config,
        prefix=add_prefix("llm", prefix),
    )
else:
    # Set vocab_size from main config if available
    if hasattr(config, "vocab_size"):
        config.vocab_size = config.vocab_size
    self.llm = LlamaForCausalLM(
        config,
        quant_config=quant_config,
        prefix=add_prefix("llm", prefix),
    )
```
This block for initializing the Llama text model can be simplified to remove redundant code and fix a no-op assignment. The line `config.vocab_size = config.vocab_size` is a bug and has no effect. The logic can be consolidated for better readability and robustness.
```python
if hasattr(text_config, "model_type") and text_config.model_type == "llama":
    llama_config = LlamaConfig(**text_config.__dict__)
else:
    llama_config = config

# Set vocab_size from main config if available
if hasattr(config, "vocab_size"):
    llama_config.vocab_size = config.vocab_size

self.llm = LlamaForCausalLM(
    llama_config,
    quant_config=quant_config,
    prefix=add_prefix("llm", prefix),
)
```

```python
def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
    """Load model weights."""
    params_dict = dict(self.named_parameters())
    loaded_params = set()
```
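For reviewers who want to try the consolidated config branch in isolation, here is a minimal sketch of the suggested logic, using `types.SimpleNamespace` stand-ins instead of the real `LlamaConfig`/`LlamaForCausalLM` classes (the stand-in objects and values are illustrative, not taken from the PR):

```python
from types import SimpleNamespace

def resolve_llama_config(config, text_config):
    """Consolidated branch from the review suggestion, with SimpleNamespace
    standing in for the Hugging Face config classes."""
    if getattr(text_config, "model_type", None) == "llama":
        # Stand-in for LlamaConfig(**text_config.__dict__)
        llama_config = SimpleNamespace(**vars(text_config))
    else:
        llama_config = config
    # Set vocab_size from the main config if available
    if hasattr(config, "vocab_size"):
        llama_config.vocab_size = config.vocab_size
    return llama_config

config = SimpleNamespace(vocab_size=102400)
text_config = SimpleNamespace(model_type="llama", vocab_size=32000)
print(resolve_llama_config(config, text_config).vocab_size)  # 102400
```

Both branches now funnel into a single `LlamaForCausalLM(...)` call, and the `vocab_size` propagation happens exactly once.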
```python
allowed_params = {
    "do_resize",
    "resample",
    "do_rescale",
    "rescale_factor",
    "do_normalize",
    "image_mean",
    "image_std",
    "do_convert_rgb",
    "data_format",
    "input_data_format",
}
```
This hardcoded set of `allowed_params` is brittle and may break if the signature of the upstream `Sarashina2VisionImageProcessor._preprocess` method changes. While this monkey-patching is a necessary workaround, consider adding a comment with a direct link to the source of this signature on Hugging Face to make future maintenance easier.
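Another way to reduce that brittleness, sketched here with a hypothetical stand-in for the upstream method (the signature below mirrors the `allowed_params` set in the diff, not verified upstream source), is to derive the allow-list from the callable itself with `inspect.signature`, so it tracks signature changes automatically:

```python
import inspect

def _preprocess_stub(images, do_resize=True, resample=None, do_rescale=True,
                     rescale_factor=1 / 255, do_normalize=True, image_mean=None,
                     image_std=None, do_convert_rgb=True, data_format=None,
                     input_data_format=None):
    """Hypothetical stand-in for Sarashina2VisionImageProcessor._preprocess."""
    return images

# Build the allow-list from the signature, excluding the positional input.
allowed_params = {
    name
    for name in inspect.signature(_preprocess_stub).parameters
    if name != "images"
}
print(sorted(allowed_params))
```

With the real method in place of `_preprocess_stub`, an upstream signature change would update the set without any manual edit.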
```
* origin/qwen3: (30 commits)
  chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
  feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
  model support: Sarashina2VisionForCausalLM (sgl-project#10632)
  [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
  [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
  [Feature] Speculative decoding support lookahead (sgl-project#9873)
  refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
  [router] refactor worker to builder pattern 1/n (sgl-project#10628)
  Garbage collector regression in the online server (sgl-project#10621)
  feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
  Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
  [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
  Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
  support qwen3-next-fp8 deepep (sgl-project#10622)
  update deepep version for qwen3-next deepep moe (sgl-project#10624)
  Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
  [RL] Add destroy process group api (sgl-project#9979)
  fix deepep assert when PD disaggregation == null (sgl-project#8274)
  Scale kkt after reduction (sgl-project#10604)
  [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
  ...
```
Motivation
This PR focuses on adding support for the `Sarashina2VisionForCausalLM` architecture, such as `sbintuitions/sarashina2-vision-8b`. The model type is built upon Qwen2VL and LlamaForCausalLM.
Caveats

- The official chat template accesses `message['content']` without checking the type. We added a customized chat template.

Modifications

- Added `get_input_embeddings` to `LlamaModel`, as it is required in Qwen2VL.
- Added a dedicated processor.
- Added a customized chat template under `examples/chat_template` that is able to check `message['content']['type']`.

Accuracy Tests
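The examples below send multimodal messages whose `content` is a list of typed parts, which is the shape the customized chat template inspects via `message['content']['type']`. A minimal sketch of such a payload (the prompt text is illustrative; the image URL is the one used in the example below):

```python
# Sketch of an OpenAI-style multimodal message; a chat template can branch
# on part["type"] instead of assuming content is a plain string.
image_url = (
    "https://images.pexels.com/photos/210186/pexels-photo-210186.jpeg"
    "?auto=compress&cs=tinysrgb&dpr=2&h=4096&w=4096"
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

part_types = [part["type"] for part in messages[0]["content"]]
print(part_types)  # ['image_url', 'text']
```

A template that only ever reads `message['content']` as a string would fail on this structure, which is the caveat the custom template addresses.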
Simple Example
Server start up command:
Image:

(screenshot of the model response omitted)

"https://images.pexels.com/photos/210186/pexels-photo-210186.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=4096&w=4096"

Text-only:

(screenshot of the model response omitted)

"What is 1+1? What is the capital of France?"

Benchmarking and Profiling
Checklist