
GLM-4-0414 and GLM-4.1V Code Refactor#12117

Merged
hnyls2002 merged 12 commits into sgl-project:main from zRzRzRzRzRzRzR:glm
Oct 27, 2025

Conversation

@zRzRzRzRzRzRzR
Contributor

This refactor adopts the new interfaces, including the addition of PPMissingLayer, and removes obsolete code.

@gemini-code-assist
Contributor

Summary of Changes

Hello @zRzRzRzRzRzRzR, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring of the GLM-4-0414 and GLM-4.1V model implementations. The primary goal is to improve the modularity and flexibility of the model components, particularly by enabling robust support for pipeline parallelism. This involves redesigning how model layers are initialized and how data flows through them, utilizing new interfaces like PPMissingLayer for distributed operations. The changes also streamline multimodal input processing for the GLM-4.1V model and enhance overall weight management, leading to a cleaner and more scalable architecture.

Highlights

  • GLM-4-0414 Model Refactoring: The core GLM-4-0414 model architecture has been refactored to enhance modularity and support pipeline parallelism.
  • Pipeline Parallelism Integration: Introduced PPMissingLayer and PPProxyTensors to enable efficient pipeline parallelism across model components, including embedding, decoder layers, and normalization.
  • Modular Component Design: Glm4MLP and Glm4Attention classes were rewritten to accept direct configuration parameters, moving away from a monolithic config object for greater flexibility.
  • GLM-4.1V Vision Model Updates: The Glm4vVisionBlock was refactored to be a standalone nn.Module and now utilizes VisionAttention, improving its integration and flexibility.
  • Enhanced Weight Management: New methods like load_kv_cache_scales, get_embed_and_head, and set_embed_and_head were added for more granular control over model weights and KV cache scaling.
  • Multimodal Input Handling: Glm4vForConditionalGeneration now includes updated routines for handling multimodal inputs and mrope-enabled positions, streamlining processing for vision-language models.
  • Code Cleanup: Useless old code has been discarded, and configuration parameters are now passed directly to components, simplifying the codebase.
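The PPMissingLayer pattern highlighted above keeps layer indices consistent across pipeline stages while only materializing the layers a given rank owns. A minimal single-process sketch of the idea (the class name matches the PR; `build_layers` and the even layer split are illustrative, not the PR's actual code):

```python
import torch
import torch.nn as nn

class PPMissingLayer(nn.Module):
    """Placeholder for a layer that lives on another pipeline stage.

    It holds no parameters; calling it returns its input unchanged, so
    the module list keeps consistent indices on every rank.
    """
    def forward(self, *args, **kwargs):
        return args[0] if args else None

def build_layers(num_layers: int, pp_rank: int, pp_size: int) -> nn.ModuleList:
    """Materialize only this stage's slice of the decoder stack."""
    per_stage = num_layers // pp_size
    start, end = pp_rank * per_stage, (pp_rank + 1) * per_stage
    return nn.ModuleList(
        nn.Linear(8, 8) if start <= i < end else PPMissingLayer()
        for i in range(num_layers)
    )

layers = build_layers(num_layers=4, pp_rank=0, pp_size=2)
print(sum(isinstance(l, nn.Linear) for l in layers))  # 2: rank 0 owns layers 0-1
```

Because the placeholder is a no-op forward, weight loaders and layer loops can iterate the full-length module list without rank-specific index arithmetic.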


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant refactoring for the GLM-4 and GLM-4.1V models, primarily to add support for Pipeline Parallelism (PP) and improve modularity. The changes decouple model components from a monolithic configuration object, which is a positive step towards cleaner code. However, I've identified some critical issues in the new PP implementation that need to be addressed. Specifically, the weight tying logic for the language model head is incorrect in glm4.py and completely missing in glm4v.py, which will cause runtime errors in a multi-stage pipeline. Additionally, I've noted a change in error handling during weight loading that could mask important configuration problems. Please see my detailed comments for suggestions on how to fix these issues.

Comment on lines +451 to +457
else:
    emb_token_weight = self.pp_group.recv(
        size=(config.vocab_size, config.hidden_size),
        dtype=next(self.model.parameters()).dtype,
        src=self.pp_group.first_rank,
    )
    self.lm_head.weight.copy_(emb_token_weight)


critical

The weight tying logic for pipeline parallelism appears to be incorrect. The else block at line 451 will be executed by all non-first ranks, including intermediate pipeline stages. However, only the last rank initializes self.lm_head as a ParallelLMHead with a weight attribute. Intermediate ranks use PPMissingLayer, which does not have a weight attribute. This will cause an AttributeError on intermediate ranks when self.lm_head.weight.copy_ is called. This block should likely be elif self.pp_group.is_last_rank: to ensure only the last rank attempts to receive and copy the weights.

Suggested change
-else:
+elif self.pp_group.is_last_rank:
     emb_token_weight = self.pp_group.recv(
         size=(config.vocab_size, config.hidden_size),
         dtype=next(self.model.parameters()).dtype,
         src=self.pp_group.first_rank,
     )
     self.lm_head.weight.copy_(emb_token_weight)
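The tying flow the review describes can be exercised in a single process with a fake communicator. In this sketch, `FakePPGroup`, `tie_embed_to_lm_head`, and the boolean rank flags are all illustrative stand-ins for the PR's `pp_group` API; real stages would communicate over NCCL/Gloo:

```python
import torch
import torch.nn as nn

class FakePPGroup:
    """Single-process stand-in for the pp_group send/recv API."""
    def __init__(self):
        self._buf = None
        self.first_rank, self.last_rank = 0, 2

    def send(self, tensor, dst):
        self._buf = tensor.clone()

    def recv(self, size, dtype, src):
        assert self._buf.shape == size and self._buf.dtype == dtype
        return self._buf

def tie_embed_to_lm_head(pp_group, is_first, is_last, embed, lm_head,
                         vocab_size, hidden_size, dtype):
    """Tie input-embedding weights to lm_head across pipeline stages.

    The first rank sends; ONLY the last rank receives and copies.
    Intermediate ranks fall through, because their lm_head is a
    parameterless placeholder -- the `elif` is exactly the fix the
    review suggests over a bare `else`.
    """
    if is_first:
        pp_group.send(embed.weight.data, dst=pp_group.last_rank)
    elif is_last:
        w = pp_group.recv(size=(vocab_size, hidden_size),
                          dtype=dtype, src=pp_group.first_rank)
        lm_head.weight.data.copy_(w)

# Simulate the first and last stage in one process.
group = FakePPGroup()
embed = nn.Embedding(10, 4)
lm_head = nn.Linear(4, 10, bias=False)
tie_embed_to_lm_head(group, True, False, embed, None, 10, 4, embed.weight.dtype)
tie_embed_to_lm_head(group, False, True, None, lm_head, 10, 4, embed.weight.dtype)
print(torch.equal(lm_head.weight.data, embed.weight.data))  # True
```

An intermediate rank (both flags False) performs no communication and never touches `lm_head.weight`, which is what makes the bare `else` in the original code unsafe.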

Comment on lines 314 to +615
else:
-    raise KeyError(f"Parameter '{name}' not found in model.")
+    logger.warning(f"Parameter {name} not found in params_dict")


high

Changing the error handling for missing parameters from raising a KeyError to logging a warning could mask significant issues during model loading. If a weight from the checkpoint is not found in the model's parameters, it often indicates a mismatch between the model architecture and the checkpoint, which can lead to a partially uninitialized model. This can cause subtle and hard-to-debug errors. It is generally safer to fail fast in such situations. Please consider reverting this to raise an exception, or at least make this lenient behavior configurable and disabled by default.
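The reviewer's suggestion of making leniency opt-in could look roughly like this (`load_weight` and the `strict` flag are hypothetical sketch names, not the PR's actual API):

```python
import logging

logger = logging.getLogger(__name__)

def load_weight(params_dict: dict, name: str, weight, strict: bool = True) -> bool:
    """Look up `name` in the model's parameter dict and store `weight`.

    With strict=True (the safer default) a missing parameter raises
    immediately, so an architecture/checkpoint mismatch fails fast.
    With strict=False it is only logged, matching the lenient behavior
    this PR introduced.
    """
    if name not in params_dict:
        if strict:
            raise KeyError(f"Parameter '{name}' not found in model.")
        logger.warning("Parameter %s not found in params_dict", name)
        return False
    params_dict[name] = weight  # stand-in for param.data.copy_(weight)
    return True
```

Defaulting to strict keeps silent partial initialization from slipping through, while still allowing the lenient path for checkpoints known to carry extra tensors.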

@zRzRzRzRzRzRzR
Contributor Author

In #12117, the implementation of is_neox_style was not considered, leading to incorrect implementation of GLM-4V.
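For context, is_neox_style selects how rotary position embeddings pair the head dimensions: NeoX style rotates the first half of the dimension against the second half, while GPT-J style rotates interleaved even/odd pairs. A minimal sketch of the two conventions (function names are illustrative):

```python
import torch

def rotate_half_neox(x: torch.Tensor) -> torch.Tensor:
    """NeoX style: pair dim i with dim i + d/2 (split-half pairing)."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_half_gptj(x: torch.Tensor) -> torch.Tensor:
    """GPT-J style: pair adjacent dims (0,1), (2,3), ... (interleaved)."""
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

x = torch.arange(4.0)        # [0., 1., 2., 3.]
print(rotate_half_neox(x))   # tensor([-2., -3.,  0.,  1.])
print(rotate_half_gptj(x))   # tensor([-1.,  0., -3.,  2.])
```

The two rotations permute dimensions differently, so applying the wrong convention to a checkpoint trained with the other one silently corrupts attention scores, which is the class of bug being described here.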

@JustinTong0323 JustinTong0323 self-assigned this Oct 26, 2025
@JustinTong0323
Collaborator

Benchmark results look reasonable to me for GLM-4.6 and GLM-4.5V.

@yuan-luo yuan-luo self-assigned this Oct 27, 2025
@yuan-luo
Collaborator

In #12117, the implementation of is_neox_style was not considered, leading to incorrect implementation of GLM-4V.

I guess you are mentioning: #11722

@hnyls2002 hnyls2002 enabled auto-merge (squash) October 27, 2025 08:56
@hnyls2002 hnyls2002 disabled auto-merge October 27, 2025 08:57
@hnyls2002 hnyls2002 merged commit a88b006 into sgl-project:main Oct 27, 2025
68 of 127 checks passed
@zRzRzRzRzRzRzR zRzRzRzRzRzRzR deleted the glm branch October 27, 2025 12:37