Skip to content

fix(gemma4): key shared KV by layer_type on transformers >=5.8#3701

Merged
winglian merged 1 commit into
axolotl-ai-cloud:mainfrom
thad0ctor:fix/gemma4-vlm-kv-shared-layer-index
Jun 4, 2026
Merged

fix(gemma4): key shared KV by layer_type on transformers >=5.8#3701
winglian merged 1 commit into
axolotl-ai-cloud:mainfrom
thad0ctor:fix/gemma4-vlm-kv-shared-layer-index

Conversation

@thad0ctor

@thad0ctor thad0ctor commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Description

Fix Gemma4 KV-sharing fine-tuning on transformers >=5.8. The fused Gemma4 attention monkeypatch keyed shared_kv_states by kv_shared_layer_index/layer_idx, but transformers 5.8 removed kv_shared_layer_index and now keys by layer_type. The key is now derived from whichever attribute the installed transformers exposes, keeping both the old (<=5.5) and new (>=5.8) APIs working.

Motivation and Context

On Axolotl v0.17.0 (pinned transformers==5.9.0), LoRA fine-tuning Gemma4 vision models (e.g. gemma-4-E2B-it, gemma-4-E4B-it) fails with AttributeError: 'Gemma4TextAttention' object has no attribute 'kv_shared_layer_index'. Only models with num_kv_shared_layers > 0 hit it (E2B: 20/35 layers, E4B: 18/42); gemma-4-31B-it and gemma-4-26B-A4B-it have 0 shared layers and were unaffected. v0.16 worked.

How has this been tested?

  • Real-weights GPU reproduce->fix (transformers 5.8.1, bf16, sdpa, gradient checkpointing): E2B reproduces the exact error with the old key, then runs a clean forward+backward (finite loss & grads); E4B reproduces and passes forward.
  • Config-level rendezvous check across E2B/E4B/31B/26B-A4B: every shared layer's read-key is written by a storing layer under both keying conventions.
  • New regression test TestFusedAttnSharedKV with num_kv_shared_layers > 0; suite: 7 passed, 1 xfailed.

AI Usage Disclaimer

Yes — Claude Code Opus 4.8 assisted with debug and fix

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced Gemma4 shared KV attention mechanism to properly support multiple Transformers library conventions.
  • Tests

    • Added regression tests validating Gemma4 shared KV forward and backward propagation with various layer configurations.

The fused Gemma4 attention monkeypatch read and stored shared KV states
by `kv_shared_layer_index`/`layer_idx`, but transformers 5.8 dropped the
`kv_shared_layer_index` attribute and switched to keying `shared_kv_states`
by `layer_type`. On the pinned transformers 5.9, any Gemma4 model with
`num_kv_shared_layers > 0` (e.g. gemma-4-E2B vision) raised
`AttributeError: 'Gemma4TextAttention' object has no attribute
'kv_shared_layer_index'` once execution reached a shared layer.

Derive the read/store key from whichever attribute the installed
transformers exposes, keeping compatibility with both the old and new
APIs. Add a fused-attn regression with `num_kv_shared_layers > 0` so the
shared-KV branch is actually exercised (existing tests defaulted to 0).
@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds helper functions to the Gemma4 fused attention monkeypatch that abstract away shared-KV dictionary key derivation, enabling support for multiple Transformers version conventions. The fused forward pass is updated to use these helpers for both KV retrieval and storage, and regression tests exercise the shared-KV code path end-to-end.

Changes

Gemma4 Shared-KV Key Mapping

Layer / File(s) Summary
Shared-KV key mapping helpers and integration
src/axolotl/monkeypatch/models/gemma4/fused_attn.py
_shared_kv_read_key() and _shared_kv_store_key() helper functions choose the correct shared-KV dictionary keys based on which Gemma4 attention attributes exist, abstracting version differences. Fused forward pass updated to retrieve and store shared KV tensors using these helpers instead of direct attribute access.
Regression tests for shared-KV fused forward
tests/monkeypatch/test_gemma4_fused_attn.py
_build_kv_shared_config() creates a hybrid Gemma4 model configuration with shared-KV layers, and TestFusedAttnSharedKV.test_shared_kv_forward_backward() validates fused output shape, finiteness, cosine similarity to stock implementation, and backward pass correctness.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested labels

ready to merge

Suggested reviewers

  • winglian
  • ved1beta
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the core fix: addressing how shared KV is keyed in Gemma4 for transformers >=5.8, which directly aligns with the main change of introducing key-mapping helpers to support multiple keying conventions.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/axolotl/monkeypatch/models/gemma4/fused_attn.py`:
- Around line 34-44: The shared-KV keying is inconsistent and unsafe: change
_shared_kv_read_key and _shared_kv_store_key to consistently prefer
attn.kv_shared_layer_index (return attn.kv_shared_layer_index when present)
rather than falling back to layer_type/store layer_idx in only one function;
also make the selection safe by checking/getting kv_shared_layer_index on both
producer and consumer (e.g., use getattr(attn, "kv_shared_layer_index", None)
and only use it if not None on both sides), otherwise fall back to the same
consistent legacy key (e.g., attn.layer_idx or attn.layer_type) in both
_shared_kv_read_key and _shared_kv_store_key so read/store use the identical
cache keying.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 51b26822-2991-4e3b-b0c6-6a409acc6f00

📥 Commits

Reviewing files that changed from the base of the PR and between 09d325b and cc09537.

📒 Files selected for processing (2)
  • src/axolotl/monkeypatch/models/gemma4/fused_attn.py
  • tests/monkeypatch/test_gemma4_fused_attn.py

Comment thread src/axolotl/monkeypatch/models/gemma4/fused_attn.py
@codecov

codecov Bot commented Jun 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 20.00000% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...rc/axolotl/monkeypatch/models/gemma4/fused_attn.py 20.00% 8 Missing ⚠️

📢 Thoughts on this report? Let us know!

@winglian winglian merged commit 41ef48f into axolotl-ai-cloud:main Jun 4, 2026
50 of 64 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Jun 5, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants