GLM-4-0414 and GLM-4.1V Code Refactor#12117
Conversation
Summary of Changes: This pull request introduces a significant refactoring of the GLM-4-0414 and GLM-4.1V model implementations. The primary goal is to improve the modularity and flexibility of the model components, particularly by enabling robust support for pipeline parallelism. This involves redesigning how model layers are initialized and how data flows through them, using new interfaces such as PPMissingLayer for distributed operation. The changes also streamline multimodal input processing for the GLM-4.1V model and improve overall weight management, leading to a cleaner and more scalable architecture.
Code Review
This pull request introduces a significant refactoring for the GLM-4 and GLM-4.1V models, primarily to add support for Pipeline Parallelism (PP) and improve modularity. The changes decouple model components from a monolithic configuration object, which is a positive step towards cleaner code. However, I've identified some critical issues in the new PP implementation that need to be addressed. Specifically, the weight tying logic for the language model head is incorrect in glm4.py and completely missing in glm4v.py, which will cause runtime errors in a multi-stage pipeline. Additionally, I've noted a change in error handling during weight loading that could mask important configuration problems. Please see my detailed comments for suggestions on how to fix these issues.
```python
else:
    emb_token_weight = self.pp_group.recv(
        size=(config.vocab_size, config.hidden_size),
        dtype=next(self.model.parameters()).dtype,
        src=self.pp_group.first_rank,
    )
    self.lm_head.weight.copy_(emb_token_weight)
```
The weight tying logic for pipeline parallelism appears to be incorrect. The else block at line 451 will be executed by all non-first ranks, including intermediate pipeline stages. However, only the last rank initializes self.lm_head as a ParallelLMHead with a weight attribute. Intermediate ranks use PPMissingLayer, which does not have a weight attribute. This will cause an AttributeError on intermediate ranks when self.lm_head.weight.copy_ is called. This block should likely be elif self.pp_group.is_last_rank: to ensure only the last rank attempts to receive and copy the weights.
Suggested change:

```diff
-        else:
+        elif self.pp_group.is_last_rank:
             emb_token_weight = self.pp_group.recv(
                 size=(config.vocab_size, config.hidden_size),
                 dtype=next(self.model.parameters()).dtype,
                 src=self.pp_group.first_rank,
             )
             self.lm_head.weight.copy_(emb_token_weight)
```
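To make the rank-gating concrete, here is a minimal sketch with a toy communicator (not vLLM's actual `pp_group` API) of the behavior the suggestion implements: the first rank sends the embedding weight, only the last rank (which owns `lm_head`) receives it, and intermediate ranks take no part.

```python
# Toy sketch of rank-gated embedding/lm_head weight tying under pipeline
# parallelism. ToyGroup and tie_embeddings are illustrative stand-ins, not
# vLLM classes; the "channel" dict fakes point-to-point communication.

class ToyGroup:
    def __init__(self, rank, world_size, channel):
        self.rank, self.world_size, self.channel = rank, world_size, channel
        self.is_first_rank = rank == 0
        self.is_last_rank = rank == world_size - 1
        self.first_rank = 0

    def send(self, tensor, dst):
        self.channel[dst] = list(tensor)

    def recv(self, src):
        return self.channel[self.rank]


def tie_embeddings(group, embed_weight=None):
    if group.is_first_rank:
        # first rank owns the input embedding and broadcasts it to the tail
        group.send(embed_weight, dst=group.world_size - 1)
        return embed_weight
    elif group.is_last_rank:
        # only the last rank owns lm_head, so only it receives and copies
        return group.recv(src=group.first_rank)
    # intermediate ranks hold PPMissingLayer placeholders and do nothing;
    # a bare `else:` here would try to copy into a layer with no weight
    return None


channel = {}
w = [1.0, 2.0]
groups = [ToyGroup(r, 3, channel) for r in range(3)]
out = [tie_embeddings(groups[0], w), tie_embeddings(groups[1]), tie_embeddings(groups[2])]
print(out)  # rank 0 keeps w, rank 1 gets None, rank 2 receives w
```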
```diff
         else:
-            raise KeyError(f"Parameter '{name}' not found in model.")
+            logger.warning(f"Parameter {name} not found in params_dict")
```
Changing the error handling for missing parameters from raising a KeyError to logging a warning could mask significant issues during model loading. If a weight from the checkpoint is not found in the model's parameters, it often indicates a mismatch between the model architecture and the checkpoint, which can lead to a partially uninitialized model. This can cause subtle and hard-to-debug errors. It is generally safer to fail fast in such situations. Please consider reverting this to raise an exception, or at least make this lenient behavior configurable and disabled by default.
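A minimal sketch of the fail-fast behavior the review recommends; the `strict` flag and `load_weights` signature here are hypothetical illustrations, not an existing vLLM option.

```python
# Sketch: fail fast on checkpoint/model mismatches by default, with lenient
# behavior available but opt-in. `strict` is a hypothetical flag.
import logging

logger = logging.getLogger(__name__)


def load_weights(params_dict, checkpoint_weights, strict=True):
    """Copy checkpoint tensors into model params, failing fast on mismatch."""
    for name, weight in checkpoint_weights.items():
        if name not in params_dict:
            if strict:
                # Surfacing the mismatch immediately avoids shipping a
                # partially uninitialized model that fails much later.
                raise KeyError(f"Parameter '{name}' not found in model.")
            logger.warning("Parameter %s not found in params_dict", name)
            continue
        params_dict[name] = weight
    return params_dict
```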
In #12117, the `is_neox_style` setting was not taken into account, leading to an incorrect implementation of GLM-4V.
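For context on why this matters: `is_neox_style` selects how rotary position embeddings pair up dimensions. NeoX-style rotates dimension `i` against `i + d/2`, while GPT-J-style rotates adjacent pairs `(2i, 2i+1)`; picking the wrong style silently scrambles positional information, which matches the symptom described. Below is a pure-Python illustration of the difference, not vLLM's actual kernel.

```python
# Illustrative rotary embedding showing the NeoX vs GPT-J pairing choice.
import math


def apply_rope(x, pos, neox_style=True):
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        theta = pos * (10000.0 ** (-2.0 * i / d))
        c, s = math.cos(theta), math.sin(theta)
        # NeoX pairs (i, i + d/2); GPT-J pairs adjacent dims (2i, 2i + 1)
        a, b = (i, i + d // 2) if neox_style else (2 * i, 2 * i + 1)
        out[a] = x[a] * c - x[b] * s
        out[b] = x[a] * s + x[b] * c
    return out


x = [1.0, 2.0, 3.0, 4.0]
print(apply_rope(x, pos=1, neox_style=True))   # rotates (x0,x2) and (x1,x3)
print(apply_rope(x, pos=1, neox_style=False))  # rotates (x0,x1) and (x2,x3)
```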
Benchmark results look reasonable to me for GLM-4.6 and GLM-4.5V.
Used new interfaces, including the newly added PPMissingLayer, and removed some unused legacy code.
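For readers unfamiliar with the pattern: PPMissingLayer is a placeholder standing in for layers owned by other pipeline stages, so each rank only allocates the weights of its own slice. A minimal sketch of the idea (toy classes, not vLLM's actual implementation):

```python
# Toy sketch of the PPMissingLayer pattern: layers owned by other pipeline
# stages are replaced with inert placeholders holding no parameters.

class PPMissingLayer:
    """Placeholder for a layer on another pipeline stage.

    Deliberately has no `weight` attribute, which is why code that
    unconditionally touches `.weight` fails on intermediate ranks.
    """
    def __call__(self, x):
        return x  # identity; never expected to run on this stage


class DummyLayer:
    def __init__(self, idx):
        self.idx = idx
        self.weight = [0.0]  # stands in for a real parameter tensor


def build_stage_layers(num_layers, rank, world_size):
    """Assign a contiguous slice of layers to this rank; the rest are missing."""
    per_stage = num_layers // world_size
    start, end = rank * per_stage, (rank + 1) * per_stage
    return [
        DummyLayer(i) if start <= i < end else PPMissingLayer()
        for i in range(num_layers)
    ]


# e.g. 8 layers across 4 pipeline stages, viewed from rank 1
layers = build_stage_layers(8, rank=1, world_size=4)
owned = [i for i, l in enumerate(layers) if isinstance(l, DummyLayer)]
print(owned)  # rank 1 owns layers 2 and 3
```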