support interns1 #1060
Conversation
Summary of Changes
Hello @xhx1022, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates the InternS1 multimodal model into the LightLLM framework, expanding its capabilities to process both text and image inputs. The changes include the implementation of the model's vision processing architecture, a specialized tokenizer designed to handle multimodal prompts, and necessary adjustments to the weight loading and API infrastructure to ensure seamless operation. This enhancement allows LightLLM to leverage InternS1 for tasks requiring a comprehensive understanding of diverse data modalities.
Highlights
- **InternS1 Model Integration**: Full support for the InternS1 multimodal model has been added, enabling its use within the LightLLM framework.
- **Vision Transformer Implementation**: A new `InternS1VisionTransformer` and associated components (configurations, embeddings, attention, MLP) have been introduced to handle visual input processing for the InternS1 model.
- **Custom Multimodal Tokenizer**: An `InternS1Tokenizer` has been implemented to correctly process and embed image tokens within the prompt, facilitating multimodal input handling.
- **Weight Loading Adaptation**: Logic has been added to adapt model weight keys during loading, specifically renaming `model.language_model.xxx` to `model.xxx` for InternS1 compatibility.
- **API and Configuration Updates**: The multimodal API and configuration utilities have been updated to recognize and properly initialize the InternS1 model and its visual components.
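The weight-key adaptation described above can be sketched as a simple rename pass over the checkpoint's state dict (a minimal illustration under the assumption that the checkpoint is dict-like; the function name is hypothetical, not LightLLM's actual loader code):

```python
def rename_interns1_keys(state_dict):
    """Rename `model.language_model.xxx` entries to `model.xxx` so the
    InternS1 checkpoint lines up with the expected key layout.
    Keys without the prefix are kept unchanged."""
    prefix = "model.language_model."
    renamed = {}
    for key, value in state_dict.items():
        if key.startswith(prefix):
            renamed["model." + key[len(prefix):]] = value
        else:
            renamed[key] = value
    return renamed
```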
Code Review
This pull request adds support for the interns1 multimodal model. The changes are comprehensive, including the model definition, visual processing, tokenizer, and integration into the server infrastructure. My review has identified a few critical and high-severity issues that need to be addressed. These include a critical bug where model weights are not being loaded, leading to an uninitialized model, and another that would cause a runtime error due to an undefined variable. I've also pointed out some areas for improvement in terms of code correctness, maintainability, and best practices, such as fixing incorrect logic, removing hardcoded values, and cleaning up debug code.
```python
self.vision_tower = InternS1VisionModel._from_config(config=cfg.vision_config, torch_dtype=self.dtype)
self.multi_modal_projector = InternS1MultiModalProjector(cfg)
```
The vision_tower and multi_modal_projector are instantiated from config, but their weights are never loaded from the checkpoint. InternS1VisionModel._from_config creates a model with uninitialized weights. This will result in the model producing random outputs. You should use a method like from_pretrained to load the model with its weights, or manually load the state dictionary.
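One way to load the submodule's weights manually, as suggested, is to slice the vision tower's entries out of the full checkpoint and hand them to the submodule's `load_state_dict`. The helper below is an illustrative sketch (the function name and the `"model.vision_tower."` prefix are assumptions, not the PR's actual code):

```python
def extract_submodule_state(full_state_dict, prefix):
    """Collect the entries under `prefix`, strip the prefix, and return a
    state dict suitable for submodule.load_state_dict(...)."""
    out = {}
    for key, value in full_state_dict.items():
        if key.startswith(prefix):
            out[key[len(prefix):]] = value
    return out

# e.g. vision_state = extract_submodule_state(ckpt, "model.vision_tower.")
#      self.vision_tower.load_state_dict(vision_state, strict=True)
```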
```python
if vision_feature_layer == -1:
    vision_features = self.vision_tower(pixel_values=pixel_values).last_hidden_state
else:
    vision_features = self.vision_model(pixel_values=pixel_values).hidden_states[vision_feature_layer]
```
The variable self.vision_model is used here, but it is not defined in the class. The vision model is stored in self.vision_tower. This will cause a runtime AttributeError.
```diff
- vision_features = self.vision_model(pixel_values=pixel_values).hidden_states[vision_feature_layer]
+ vision_features = self.vision_tower(pixel_values=pixel_values).hidden_states[vision_feature_layer]
```
```python
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    output_attentions: Optional[torch.Tensor] = None,
```
```python
if height % scale_factor != 0 or width % scale_factor != 0:
    raise ValueError("Height and width must be divisible by scale_factor for proper downsampling.")
```
The check height % scale_factor != 0 is incorrect when scale_factor is a float (e.g., 0.5). For any integer height, height % 0.5 will evaluate to 0.0, so this check will never trigger, potentially leading to errors or unexpected behavior during downsampling if the dimensions are not divisible. The intention is likely to ensure that height is divisible by 1 / scale_factor. Please adjust the logic to correctly perform this check.
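A corrected version of this check could test divisibility by the inverse factor instead. The sketch below is illustrative only (the helper name is hypothetical, and it assumes `scale_factor` is a fraction like 0.5 whose inverse is an integer):

```python
def check_downsample_dims(height, width, scale_factor):
    """Downsampling by a fractional scale_factor (e.g. 0.5) shrinks the
    spatial dims by 1/scale_factor, so height and width must be divisible
    by that inverse factor, not by scale_factor itself."""
    inverse = round(1 / scale_factor)
    if height % inverse != 0 or width % inverse != 0:
        raise ValueError(
            f"Height and width must be divisible by {inverse} "
            f"(i.e. 1/scale_factor) for proper downsampling."
        )

check_downsample_dims(32, 32, 0.5)  # 32 is divisible by 2: no error
```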
```python
if isinstance(text_config, dict):
    text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "qwen2"  # todo
    text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
elif text_config is None:
    text_config = CONFIG_MAPPING["qwen2"]()  # todo
```
There are todo comments on lines 118 and 121 indicating that the text_config handling is not finalized. Hardcoding the default model_type to "qwen2" might lead to unexpected behavior if another text model is used. It would be better to either make this configurable or raise an error if model_type is not specified in the text_config.
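One option, sketched below, is to fail fast rather than silently defaulting (the helper name is hypothetical and the dict-based interface is an assumption for illustration):

```python
def resolve_text_model_type(text_config):
    """Refuse to guess a default text model: require that the checkpoint's
    text_config explicitly declares its model_type."""
    if not isinstance(text_config, dict) or "model_type" not in text_config:
        raise ValueError(
            "text_config must specify 'model_type'; refusing to default to 'qwen2'."
        )
    return text_config["model_type"]
```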
```python
        img_tensors.append(image_pixel_values)
    else:
        raise Exception("Unsupport input types: {} for {}".format(type(img), img))
```
It's better to raise a more specific exception than the generic Exception. For example, TypeError or ValueError would be more informative about the nature of the error.
```diff
- raise Exception("Unsupport input types: {} for {}".format(type(img), img))
+ raise TypeError("Unsupport input types: {} for {}".format(type(img), img))
```
```python
def init_imageitem_extral_params(
    self, img: ImageItem, multi_params: MultimodalParams, sampling_params: SamplingParams
):
    img.extra_params["image_patch_max_num"] = 12  # 好丑的写法,后面改动
```
The value 12 for image_patch_max_num is hardcoded. The comment # 好丑的写法,后面改动 (ugly way of writing, change later) indicates this is a temporary solution. This value should be made configurable, for example by reading it from the model configuration in the __init__ method and storing it as an instance attribute.
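A configurable variant might look like the following sketch (class and config-key names are illustrative; it assumes a dict-like model config, not the PR's actual interfaces):

```python
class InternS1TokenizerSketch:
    """Illustrative only: read image_patch_max_num from the model config
    once in __init__ instead of hardcoding 12 per image item."""

    def __init__(self, model_cfg):
        # Fall back to 12 (the currently hardcoded value) when the config
        # does not provide one.
        self.image_patch_max_num = model_cfg.get("image_patch_max_num", 12)

    def init_imageitem_extral_params(self, img_extra_params):
        img_extra_params["image_patch_max_num"] = self.image_patch_max_num
        return img_extra_params
```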
```python
# print("[debug] prompt: ", prompt)
# print("[debug] origin_ids: ", origin_ids)
# import copy
# origin_ids_ = copy.deepcopy(origin_ids)
```
```python
# print("[debug] input_ids: ", input_ids)
# data = {
#     "origin_ids": origin_ids_,
#     "input_ids": input_ids
# }
# with open("input_ids_lightllm.json", "w") as f:
#     json.dump(data, f)
```
No description provided.