add glm support + patch#3329
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the
📝 WalkthroughWalkthroughIntroduces support for GLM-4.6V multimodal models across workflows, documentation, loaders, and monkeypatches. Adds GitHub workflow cleanup steps, GLM-4.6V documentation and training example, processor loading for GLM4V variants, rope scaling attention patches, and processing strategy for image/video token masking. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~35 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
.github/workflows/preview-docs.yml (1)
31-33: Cleanup step looks good, consider adding error handling.The disk space cleanup is a common pattern for GitHub Actions runners. The removed directories (dotnet, android, ghc, CodeQL) are typically safe to delete and can free up significant space.
💡 Optional: Add error handling to prevent workflow failure
If any directory doesn't exist or can't be removed, the workflow will fail. Consider adding error handling:
- - name: cleanup node - run: | - sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc /opt/hostedtoolcache/CodeQL + - name: cleanup node + run: | + sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc /opt/hostedtoolcache/CodeQL || true.github/workflows/docs.yml (1)
15-17: Cleanup step looks good, consider adding error handling.The disk space cleanup is consistent with the preview workflow and follows a common pattern for GitHub Actions runners. The removed directories are typically safe to delete and can free up significant space for the documentation build.
💡 Optional: Add error handling to prevent workflow failure
If any directory doesn't exist or can't be removed, the workflow will fail. Consider adding error handling:
- - name: cleanup node - run: | - sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc /opt/hostedtoolcache/CodeQL + - name: cleanup node + run: | + sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc /opt/hostedtoolcache/CodeQL || truesrc/axolotl/loaders/processor.py (1)
16-38: Consider adding error handling for missingprocessor_config.If
cfg.processor_configisNoneor not set,from_pretrainedcalls on lines 30-31 will fail with an unclear error. Consider adding a guard or falling back tocfg.base_model.🔎 Proposed fix
def _load_glm4v_processor(cfg: DictDefault, tokenizer: PreTrainedTokenizerBase): """ Load GLM4V/GLM4V_MoE processor manually. The model's preprocessor_config.json has an incorrect image_processor_type (Glm46VImageProcessor instead of Glm4vImageProcessor), causing AutoProcessor to fail. This function manually constructs the processor with correct components. See: https://github.com/axolotl-ai-cloud/axolotl/issues/3312 """ from transformers.models.glm4v.image_processing_glm4v import Glm4vImageProcessor from transformers.models.glm4v.processing_glm4v import Glm4vProcessor from transformers.models.glm4v.video_processing_glm4v import Glm4vVideoProcessor + processor_path = cfg.processor_config or cfg.base_model - image_processor = Glm4vImageProcessor.from_pretrained(cfg.processor_config) - video_processor = Glm4vVideoProcessor.from_pretrained(cfg.processor_config) + image_processor = Glm4vImageProcessor.from_pretrained(processor_path) + video_processor = Glm4vVideoProcessor.from_pretrained(processor_path) processor = Glm4vProcessor( image_processor=image_processor, tokenizer=tokenizer, video_processor=video_processor, ) return processordocs/multimodal.qmd (1)
186-194: Consider addingchat_templateto the documentation.Most other model sections specify the
chat_templatevalue. The example config useschat_template: tokenizer_default. Consider adding this for completeness:🔎 Proposed addition
```yaml base_model: zai-org/GLM-4.6V + +chat_template: tokenizer_default</details> </blockquote></details> </blockquote></details> <details> <summary>📜 Review details</summary> **Configuration used**: Path: .coderabbit.yaml **Review profile**: CHILL **Plan**: Pro <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 92ee4256f73159dc20a204ba5186ebff002658ae and 0b06bc3f07a4646f3b985246affee9d3612f7217. </details> <details> <summary>📒 Files selected for processing (8)</summary> * `.github/workflows/docs.yml` * `.github/workflows/preview-docs.yml` * `docs/multimodal.qmd` * `examples/glm4/glm-4-6v-flash-qlora.yaml` * `src/axolotl/loaders/patch_manager.py` * `src/axolotl/loaders/processor.py` * `src/axolotl/monkeypatch/models/glm4v/modeling.py` * `src/axolotl/processing_strategies.py` </details> <details> <summary>🧰 Additional context used</summary> <details> <summary>🧠 Learnings (2)</summary> <details> <summary>📚 Learning: 2025-07-31T11:48:46.313Z</summary>Learnt from: SalmanMohammadi
Repo: axolotl-ai-cloud/axolotl PR: 2994
File: index.qmd:7-8
Timestamp: 2025-07-31T11:48:46.313Z
Learning: In documentation workflow testing PRs, temporary test content like "test test" may be intentionally added to trigger documentation preview actions and validate workflow fixes, rather than being accidental debug text.**Applied to files:** - `.github/workflows/preview-docs.yml` </details> <details> <summary>📚 Learning: 2025-08-22T13:23:41.455Z</summary>Learnt from: winglian
Repo: axolotl-ai-cloud/axolotl PR: 3095
File: src/axolotl/cli/merge_lora.py:65-81
Timestamp: 2025-08-22T13:23:41.455Z
Learning: Thelora_on_cpuconfiguration in Axolotl is only relevant when loading the full model into memory (standard LoRA merge approach), not when processing individual shards in the memory-efficient approach.**Applied to files:** - `examples/glm4/glm-4-6v-flash-qlora.yaml` </details> </details><details> <summary>🧬 Code graph analysis (3)</summary> <details> <summary>src/axolotl/loaders/patch_manager.py (1)</summary><blockquote> <details> <summary>src/axolotl/monkeypatch/models/glm4v/modeling.py (1)</summary> * `patch_glm4v_attention_rope_scaling` (23-131) </details> </blockquote></details> <details> <summary>src/axolotl/loaders/processor.py (2)</summary><blockquote> <details> <summary>tests/test_exact_deduplication.py (1)</summary> * `cfg` (201-216) </details> <details> <summary>src/axolotl/utils/dict.py (1)</summary> * `DictDefault` (6-38) </details> </blockquote></details> <details> <summary>src/axolotl/processing_strategies.py (2)</summary><blockquote> <details> <summary>src/axolotl/utils/mistral/mistral_tokenizer.py (1)</summary> * `chat_template` (41-43) </details> <details> <summary>src/axolotl/utils/mistral/mistral3_processor.py (1)</summary> * `chat_template` (42-44) </details> </blockquote></details> </details><details> <summary>🪛 GitHub Actions: lint</summary> <details> <summary>src/axolotl/processing_strategies.py</summary> [low] 472-472: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|image|>' --- [low] 473-473: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|begin_of_image|>' --- [low] 474-474: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|end_of_image|>' --- [low] 475-475: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|video|>' --- [low] 476-476: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|begin_of_video|>' --- [low] 477-477: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|end_of_video|>' </details> </details> <details> <summary>🪛 Ruff (0.14.10)</summary> <details> <summary>src/axolotl/monkeypatch/models/glm4v/modeling.py</summary> 30-30: Unused function argument: `device` (ARG001) --- 71-71: Unused function argument: `position_ids` (ARG001) </details> <details> <summary>src/axolotl/processing_strategies.py</summary> 472-472: Possible hardcoded password assigned to: "image_token" (S105) --- 473-473: Possible hardcoded password assigned to: "begin_image_token" (S105) --- 474-474: Possible hardcoded password assigned to: "end_image_token" (S105) --- 475-475: Possible hardcoded password assigned to: "video_token" (S105) --- 476-476: Possible hardcoded password assigned to: "begin_video_token" (S105) --- 477-477: Possible hardcoded password assigned to: "end_video_token" (S105) </details> </details> </details> <details> <summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)</summary> * GitHub Check: PyTest from Source Dist (3.11, 2.8.0) * GitHub Check: PyTest from Source Dist (3.11, 2.9.0) * GitHub Check: PyTest (3.11, 2.9.0) * GitHub Check: PyTest (3.11, 2.8.0) * GitHub Check: PyTest (3.11, 2.7.1) * GitHub Check: PyTest from Source Dist (3.11, 2.7.1) * GitHub Check: preview </details> <details> <summary>🔇 Additional comments (10)</summary><blockquote> <details> <summary>.github/workflows/preview-docs.yml (1)</summary><blockquote> `14-14`: **LGTM! Good practice to include the workflow file itself.** Adding the workflow file to the paths trigger ensures that changes to the workflow configuration will also trigger a preview, which is helpful for validating workflow modifications. </blockquote></details> <details> <summary>src/axolotl/loaders/processor.py (1)</summary><blockquote> `47-49`: **LGTM!** The conditional branch correctly detects GLM4V model types and delegates to the manual loader with an appropriate log message. </blockquote></details> <details> <summary>src/axolotl/processing_strategies.py (2)</summary><blockquote> `457-511`: **Well-structured processing strategy for GLM4V.** The implementation correctly follows the established pattern from other strategies (e.g., `VoxtralProcessingStrategy`, `Mistral3ProcessingStrategy`) and properly masks all image/video special tokens in the labels. --- `560-568`: **LGTM!** Good use of try/except to handle cases where `Glm4vProcessor` may not be available in older transformers versions, maintaining backward compatibility. </blockquote></details> <details> <summary>src/axolotl/monkeypatch/models/glm4v/modeling.py (3)</summary><blockquote> `30-30`: **Unused parameters are intentional for API compatibility.** The `device` parameter in `patched_rotary_init` and `position_ids` in `patched_forward` are unused but must remain to match the original method signatures. The static analysis warnings (ARG001) are false positives in this context. Also applies to: 71-71 --- `23-27`: **LGTM on the patch structure.** The monkeypatch correctly replaces both `Glm4vTextRotaryEmbedding.__init__` and `Glm4vTextAttention.forward` to handle `rope_parameters` and fix the `mrope_section` issue described in the PR. Also applies to: 63-63, 131-131 --- `84-86`: **Verify fallback `mrope_section` calculation.** The fallback `[self.head_dim // 3] * 3` may not sum to `head_dim`. For example, if `head_dim=128`, this produces `[42, 42, 42]` which sums to `126`, not `128`. This could trigger the exact issue this patch aims to fix. Consider calculating the remainder and distributing it: <details> <summary>🔎 Proposed fix</summary> ```diff if mrope_section is None: - # Fallback: assume 3 equal parts - mrope_section = [self.head_dim // 3] * 3 + # Fallback: split into 3 parts, distributing remainder + base = self.head_dim // 3 + remainder = self.head_dim % 3 + mrope_section = [base + (1 if i < remainder else 0) for i in range(3)]#!/bin/bash # Check how other models handle mrope_section fallback rg -n "mrope_section" --type py -C 3src/axolotl/loaders/patch_manager.py (1)
193-198: LGTM!The GLM4V patch integration follows the established pattern for model-specific patches in this file (similar to mistral3, qwen3_next, llama4). Clean and consistent.
docs/multimodal.qmd (1)
22-22: LGTM!GLM-4.6V correctly added to the supported models list with proper anchor link.
examples/glm4/glm-4-6v-flash-qlora.yaml (1)
1-51: No action required. The config correctly relies on auto-detection ofmodel_config_typefrom the HuggingFace model config, and the GLM4V patches inpatch_manager.pywill be applied automatically when the model reports its type as "glm4v" or "glm4v_moe".
| self.image_token = "<|image|>" | ||
| self.begin_image_token = "<|begin_of_image|>" | ||
| self.end_image_token = "<|end_of_image|>" | ||
| self.video_token = "<|video|>" | ||
| self.begin_video_token = "<|begin_of_video|>" | ||
| self.end_video_token = "<|end_of_video|>" |
There was a problem hiding this comment.
Add # nosec comments to suppress Bandit false positives.
The pipeline is failing due to Bandit flagging these token strings as potential hardcoded passwords. These are special tokens, not secrets. Add # nosec comments as done elsewhere in this file (line 255, 420).
🔎 Proposed fix
- self.image_token = "<|image|>"
- self.begin_image_token = "<|begin_of_image|>"
- self.end_image_token = "<|end_of_image|>"
- self.video_token = "<|video|>"
- self.begin_video_token = "<|begin_of_video|>"
- self.end_video_token = "<|end_of_video|>"
+ self.image_token = "<|image|>" # nosec
+ self.begin_image_token = "<|begin_of_image|>" # nosec
+ self.end_image_token = "<|end_of_image|>" # nosec
+ self.video_token = "<|video|>" # nosec
+ self.begin_video_token = "<|begin_of_video|>" # nosec
+ self.end_video_token = "<|end_of_video|>" # nosec📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| self.image_token = "<|image|>" | |
| self.begin_image_token = "<|begin_of_image|>" | |
| self.end_image_token = "<|end_of_image|>" | |
| self.video_token = "<|video|>" | |
| self.begin_video_token = "<|begin_of_video|>" | |
| self.end_video_token = "<|end_of_video|>" | |
| self.image_token = "<|image|>" # nosec | |
| self.begin_image_token = "<|begin_of_image|>" # nosec | |
| self.end_image_token = "<|end_of_image|>" # nosec | |
| self.video_token = "<|video|>" # nosec | |
| self.begin_video_token = "<|begin_of_video|>" # nosec | |
| self.end_video_token = "<|end_of_video|>" # nosec |
🧰 Tools
🪛 GitHub Actions: lint
[low] 472-472: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|image|>'
[low] 473-473: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|begin_of_image|>'
[low] 474-474: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|end_of_image|>'
[low] 475-475: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|video|>'
[low] 476-476: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|begin_of_video|>'
[low] 477-477: Bandit [B105:hardcoded_password_string] Possible hardcoded password: '<|end_of_video|>'
🪛 Ruff (0.14.10)
472-472: Possible hardcoded password assigned to: "image_token"
(S105)
473-473: Possible hardcoded password assigned to: "begin_image_token"
(S105)
474-474: Possible hardcoded password assigned to: "end_image_token"
(S105)
475-475: Possible hardcoded password assigned to: "video_token"
(S105)
476-476: Possible hardcoded password assigned to: "begin_video_token"
(S105)
477-477: Possible hardcoded password assigned to: "end_video_token"
(S105)
🤖 Prompt for AI Agents
In src/axolotl/processing_strategies.py around lines 472 to 477, Bandit flags
the token string constants as hardcoded secrets; add trailing "# nosec" comments
to each token assignment (self.image_token, self.begin_image_token,
self.end_image_token, self.video_token, self.begin_video_token,
self.end_video_token) exactly like the existing instances at lines ~255 and ~420
to suppress the false positives, ensuring no other code or functionality is
changed.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| evals_per_epoch: 0 | ||
| saves_per_epoch: 1 | ||
| weight_decay: 0.0 | ||
| output_dir: ./outputs/glm-4-6v-flash-qlora |
There was a problem hiding this comment.
Let's move this higher up as it's a config that we change often
|
|
||
| apply_mistral_tokenizer_image_patch() | ||
|
|
||
| if self.cfg.model_config_type in ("glm4v", "glm4v_moe"): |
There was a problem hiding this comment.
which glm4v_moe is affected by this?
| """ | ||
| Load GLM4V/GLM4V_MoE processor manually. | ||
|
|
||
| The model's preprocessor_config.json has an incorrect image_processor_type |
There was a problem hiding this comment.
I'm wondering if it's even wrong https://github.com/huggingface/transformers/blob/main/src/transformers/models/glm46v/modeling_glm46v.py
This is a new addition last month
| def patch_glm4v_attention_rope_scaling(): | ||
| """ | ||
| Patch Glm4vTextAttention and Glm4vTextRotaryEmbedding to handle rope_parameters | ||
| and partial rotary factor (for GLM-4.6V). |
There was a problem hiding this comment.
Any source for this implementation?
|
|
||
|
|
||
| class Glm4vProcessingStrategy(ProcessingStrategy): | ||
| """Processing Strategy class for GLM4V and GLM4V-MoE vision models. |
There was a problem hiding this comment.
If this works for GLM4V-moe too, maybe add it to our docs which are supported?
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
|
NOTES: needs latest transformers from source |
| 4. Run the fine-tuning: | ||
|
|
||
| ```bash | ||
| axolotl train examples/glm46v/glm-4-6v-qlora.yaml | ||
| ``` | ||
|
|
||
| Or for the Flash variant: | ||
|
|
||
| ```bash | ||
| axolotl train examples/glm46v/glm-4-6v-flash-qlora.yaml | ||
| ``` |
There was a problem hiding this comment.
Let's put the Flash one first as it's lighter. Maybe add a comment about the size of the model and the Flash vram usage, put it there. See the gemma3n as example doc.
| - Vision datasets should follow the OpenAI Messages format with image content. See [multimodal docs](https://docs.axolotl.ai/docs/multimodal.html#dataset-format). | ||
| - You can run a full finetuning by removing the `adapter: qlora` and `load_in_4bit: true` from the config. | ||
| - Read more on how to load your own dataset at [docs](https://docs.axolotl.ai/docs/dataset_loading.html). | ||
| - The dataset format follows the OpenAI Messages format as seen [here](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template). |
There was a problem hiding this comment.
Point to the Vision data format in the multi-modal docs. See ministral3 /vision as example
Maybe you want to put a text-only version for train as well. To do so, just remove the vision config and train like a dense layer. You may need to set model_type: GLM... text class (if possible).
There was a problem hiding this comment.
This file belongs to glm4
NanoCode012
left a comment
There was a problem hiding this comment.
Could you also add short summary for what went wrong with flex attention and flash attention here to keep track of it?
| @@ -0,0 +1,50 @@ | |||
| base_model: zai-org/GLM-4.6V-Flash | |||
| trust_remote_code: true | |||
There was a problem hiding this comment.
Could we add a revision_of_model pin?
|
Does this require transformers v5? |
|
Yess |
|
spoke with ved, this requires transformers main/v5 |
NanoCode012
left a comment
There was a problem hiding this comment.
Let's rebase this and re-test as v5 PR is in
To what version should trl be updated to?
Do you need to add that config to our example configs? If you had a multi-gpu example, maybe make a separate config for that |
|
Trl LTS 0.27.0 , |
| 3. Swap to the Axolotl transformers v5 branch | ||
|
|
||
| ```bash | ||
| git fetch | ||
| git checkout transformers-v5 | ||
|
|
||
| # Install packages for transformers v5 | ||
| pip install -e . | ||
| ``` |
There was a problem hiding this comment.
Not required anymore, given we merge v5 in
| ## Text-only training (no vision) | ||
|
|
||
| - If you only want to finetune **text**: | ||
| - Start from the same config and **remove the vision-specific fields** (e.g. `is_multimodal`, `image_column`, `image_size`, and any vision processor settings). | ||
| - Train it like a standard dense LLM (similar to other text-only configs). | ||
| - Depending on the GLM checkpoints you use, you may need to set `model_type` to the appropriate **GLM text class** (e.g. the text-only GLM variant for that family), if auto-detection does not pick it up correctly. |
There was a problem hiding this comment.
I think this needs to be checked. Either provide a text config that folks can use or omit this section
| @@ -0,0 +1,53 @@ | |||
| base_model: zai-org/GLM-4.6V-Flash | |||
There was a problem hiding this comment.
let's title this file, glm-4-6v-flash-ddp.yaml to be more explicit
We're at |
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

Description
GLM 4.6 support
masking / padding
Glm4vProcessingStrategy##TEST
config
examples/glm4/glm-4-6v-flash-qlora.yamlSummary by CodeRabbit
New Features
Documentation